选择转录模型

本页面介绍了如何将特定的机器学习模型用于发送到 Cloud Speech-to-Text 的音频转写请求。

转录模型

Cloud Speech-to-Text 会将输入与多个机器学习模型中的一个进行比较,以检测音频剪辑中的字词。每个模型都通过分析数百万个示例(在此是指大量实际的人物说话录音)进行过训练。

Cloud STT 拥有使用特定来源(例如电话通话或视频)音频训练的专用模型。由于经历过这种训练过程,因此当这些专业模型被应用于类型相类似的音频数据时,可以提供更好的结果。

例如,Cloud STT 有一个经过训练的转录模型,用于识别在电话中录制的语音。当 Cloud STT 使用 telephonytelephony_short 模型转写电话音频时,所生成的转写结果会比使用 latest_shortlatest_long 模型时的电话音频转写结果更准确。

下表显示了可用于 Cloud STT 的转录模型。

模型名称 说明
latest_long 此模型适用于任何类型的长篇内容,例如媒体或自然言语和对话。考虑使用此模型来代替视频模型,尤其是在视频模型不支持您的目标语言的情况下。 您还可以用此模型来代替默认模型。
latest_short 该模型适合用于几秒钟的短语音。它有助于尝试捕获命令或其他单发定向语音用例。请考虑使用此模型而不是命令和搜索模型。
telephony 改进的“phone_call”模型版本,最适合源自电话的音频,通常以 8kHz 的采样率录制。
telephony_short 这是现代“电话”模型的专用版本,适用于源自电话通话的音频(通常以 8 kHz 采样率录制)中的简短甚至单个字词话语。
medical_dictation 使用此模型来转录医疗专家口述的备注。

这是一个高于标准价格的高级模型。如需了解详情,请参阅价格页面

medical_conversation 使用此模型转录医疗专业人员和患者之间的对话。

这是一个高于标准价格的高级模型。如需了解详情,请参阅价格页面

以下模型主要基于传统的非合规架构,主要是为了保持旧版和向后兼容性而保留的。
command_and_search 最适合简短话语或单字话语,例如语音指令或语音搜索。
default 最适合那些不适合其他音频模型的音频,例如时间较长的音频或口述。默认模型将为任何类型的音频(包括具有专为其量身定制的单独模型的视频剪辑之类的音频)生成转录结果。但是,使用默认模型识别视频剪辑音频可能会产生质量低于使用视频模型的结果。理想情况下,音频为高保真度格式,以 16kHz 或更高的采样率录制。
phone_call 最适合来自电话通话的音频(通常以 8kHz 的采样率录制)。
video

最适合来自有多个讲话人的视频剪辑或其他来源(例如播客)的音频。对于使用高品质麦克风录制的音频或具有大量背景噪声的音频,此模型通常也是最佳选择。为获得最佳效果,请提供以 16000Hz 或更高采样率录制的音频。

选择用于音频转录的模型

如需指定用于音频转写的特定模型,您必须将请求的 RecognitionConfig 参数的 model 字段设置为允许值之一:例如 latest_longlatest_shorttelephonytelephony_short。Cloud STT 的模型选择功能支持所有语音识别方法:speech:recognizespeech:longrunningrecognize流式传输

对本地音频文件执行转录

协议

如需了解完整的详细信息,请参阅 speech:recognize API 端点。

如需执行同步语音识别,请发出 POST 请求并提供相应的请求正文。以下示例展示了一个使用 curl 发出的 POST 请求。该示例使用 Google Cloud CLI 生成访问令牌。如需了解如何安装 gcloud CLI,请参阅快速入门

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v2/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'

如需详细了解如何配置请求正文,请参阅 RecognitionConfig 参考文档。

如果请求成功,服务器将返回一个 200 OK HTTP 状态代码以及 JSON 格式的响应:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from
            Netflix to my TV okay stranger things from
            Netflix playing on TV from the people that brought you
            Google home comes the next evolution of the smart home
            and it's just outside your window me Google know hi
            how can I help okay no what's the weather like outside
            the weather outside is sunny and 76 degrees he's right
            okay no turn on the hose I'm holding sure okay no I'm can
            I eat this lemon tree leaf yes what about this Daisy yes
            but I wouldn't recommend it but I could eat it okay
            Nomad milk to my shopping list I'm sorry that sounds like
            an indoor request I keep doing that sorry you do keep
            doing that okay no is this compost really we're all
            compost if you think about it pretty much everything is
            made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}

Go

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Go API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证


func modelSelection(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	data, err := os.ReadFile("../testdata/Google_Gnome.wav")
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	req := &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
			Model:           "video",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	}

	resp, err := client.Recognize(ctx, req)
	if err != nil {
		return fmt.Errorf("recognize: %w", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Java API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证

/**
 * Performs transcription of the given audio file synchronously with the selected model.
 *
 * @param fileName the path to a audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with video media type
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may be either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Node.js API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file
const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log('Transcription: ', transcription);

Python

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Python API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证

from google.cloud import speech

# Instantiates a client
client = speech.SpeechClient()
# Reads a file as bytes
with open("resources/Google_Gnome.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # Chosen model
)

response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(f"First alternative of result {i}")
    print(f"Transcript: {alternative.transcript}")

其他语言

C#: 请按照客户端库页面上的 C# 设置说明操作,然后访问 .NET 版 Cloud STT 参考文档。

PHP: 请按照客户端库页面上的 PHP 设置说明操作,然后访问 PHP 版 Cloud STT 参考文档。

Ruby:请按照客户端库页面上的 Ruby 设置说明操作,然后访问 Ruby 版 Cloud STT 参考文档

对 Cloud Storage 音频文件执行转录

Go

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Go API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证


import (
	"context"
	"fmt"
	"io"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribe_model_selection_gcs Transcribes the given audio file asynchronously with
// the selected model.
func transcribe_model_selection_gcs(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Uri{Uri: "gs://cloud-samples-tests/speech/Google_Gnome.wav"},
	}

	// The speech recognition model to use
	// See, https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model
	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:        speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz: 16000,
		LanguageCode:    "en-US",
		Model:           "video",
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}
	for i, result := range response.Results {
		alternative := result.Alternatives[0]
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "First alternative of result %d", i)
		fmt.Fprintf(w, "Transcript: %s", alternative.Transcript)
	}
	return nil
}

Java

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Java API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证

/**
 * Performs transcription of the remote audio file asynchronously with the selected model.
 *
 * @param gcsUri the path to the remote audio file to transcribe.
 */
public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure request with video media type
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may be either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    // Just print the first result here.
    SpeechRecognitionResult result = results.get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

如需了解如何安装和使用 Cloud STT 客户端库,请参阅 Cloud STT 客户端库。如需了解详情,请参阅 Cloud STT Node.js API 参考文档

如需向 Cloud STT 进行身份验证,请设置应用默认凭证。如需了解详情,请参阅为本地开发环境设置身份验证

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

// Detects speech in the audio file.
const [response] = await client.recognize(request);
const transcription = response.results
  .map(result => result.alternatives[0].transcript)
  .join('\n');
console.log('Transcription: ', transcription);

其他语言

C#: 请按照客户端库页面上的 C# 设置说明操作,然后访问 .NET 版 Cloud STT 参考文档。

PHP: 请按照客户端库页面上的 PHP 设置说明操作,然后访问 PHP 版 Cloud STT 参考文档。

Ruby:请按照客户端库页面上的 Ruby 设置说明操作,然后访问 Ruby 版 Cloud STT 参考文档