Transcribing audio from video files using Speech-to-Text

This tutorial shows you how to transcribe the audio track from a video file using Speech-to-Text.

Audio files can come from many different sources. Audio data can come from a phone (like voicemail) or from the soundtrack included in a video file.

Speech-to-Text can use one of several machine learning models to transcribe your audio file, to best match the original source of the audio. You can get better speech transcription results by specifying the source of the original audio. This allows Speech-to-Text to process your audio files using a machine learning model trained on data similar to your audio file.
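For example, the requests later in this tutorial select the model trained on audio from video by setting the model field of the recognition config. The values shown here are taken from the curl example below:

"config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "en-US",
    "model": "video"
}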

Prepare the audio data

Before you can transcribe audio from a video, you must extract the data from the video file. After you have extracted the audio data, you must store it in a Cloud Storage bucket or convert it to base64 encoding.

Extract the audio data

You can use any file conversion tool that handles audio and video files, such as FFmpeg.

Use the snippet below to convert a video file into an audio file with ffmpeg.

ffmpeg -i video-input-file audio-output-file
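For example, the following invocation (using hypothetical file names) extracts the audio track as 16 kHz, single-channel, 16-bit linear PCM, matching the LINEAR16 encoding and sampleRateHertz values used in the requests later in this tutorial:

ffmpeg -i video-input-file.mp4 -vn -ac 1 -ar 16000 -acodec pcm_s16le audio-output-file.wav

Here -vn drops the video stream, -ac 1 downmixes to a single channel, and -ar 16000 resamples the audio to 16,000 Hz.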

Store or convert the audio data

You can transcribe audio files stored on your local machine or in a Cloud Storage bucket.

Use the following command to upload your audio file to an existing Cloud Storage bucket using the Google Cloud CLI.

gcloud storage cp audio-output-file storage-bucket-uri
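For example, assuming a hypothetical bucket named my-bucket:

gcloud storage cp audio-output-file.wav gs://my-bucket/audio-output-file.wav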

If you use a local file and plan to send requests from the command line using the curl tool, you must first convert the audio file into base64-encoded data.

Use the following command to convert the audio file into a text file.

base64 audio-output-file -w 0 > audio-data-text
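In the request body, the base64-encoded data goes in the content field of the audio object, in place of the uri field used for files stored in Cloud Storage. For example:

"audio": {
    "content": "base64-encoded-audio-data"
}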

Send a transcription request

Use the following code to send a transcription request to Speech-to-Text.

Local file request

Protocol

Refer to the speech:recognize API endpoint for complete details.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following shows an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'

For more information on configuring the request body, see the RecognitionConfig reference documentation.

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from
            Netflix to my TV okay stranger things from
            Netflix playing on TV from the people that brought you
            Google home comes the next evolution of the smart home
            and it's just outside your window me Google know hi
            how can I help okay no what's the weather like outside
            the weather outside is sunny and 76 degrees he's right
            okay no turn on the hose I'm holding sure okay no I'm can
            I eat this lemon tree leaf yes what about this Daisy yes
            but I wouldn't recommend it but I could eat it okay
            Nomad milk to my shopping list I'm sorry that sounds like
            an indoor request I keep doing that sorry you do keep
            doing that okay no is this compost really we're all
            compost if you think about it pretty much everything is
            made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}

Go

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// modelSelection transcribes a local audio file synchronously with the
// selected model.
func modelSelection(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	data, err := os.ReadFile("../testdata/Google_Gnome.wav")
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	req := &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
			Model:           "video",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	}

	resp, err := client.Recognize(ctx, req)
	if err != nil {
		return fmt.Errorf("recognize: %w", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.RecognizeResponse;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import com.google.protobuf.ByteString;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Performs transcription of the given audio file synchronously with the selected model.
 *
 * @param fileName the path to an audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with video media type
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

async function transcribeModelSelection() {
  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
    model: model,
  };
  const audio = {
    content: fs.readFileSync(filename).toString('base64'),
  };

  const request = {
    config: config,
    audio: audio,
  };

  // Detects speech in the audio file
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribeModelSelection();

Python

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import speech

# Instantiates a client
client = speech.SpeechClient()
# Reads a file as bytes
with open("resources/Google_Gnome.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # Chosen model
)

response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(f"First alternative of result {i}")
    print(f"Transcript: {alternative.transcript}")

Additional languages

C#: Follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.

Remote file request

Go

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


import (
	"context"
	"fmt"
	"io"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribeModelSelectionGCS transcribes the given audio file asynchronously
// with the selected model.
func transcribeModelSelectionGCS(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Uri{Uri: "gs://cloud-samples-tests/speech/Google_Gnome.wav"},
	}

	// The speech recognition model to use.
	// See https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model
	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:        speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz: 16000,
		LanguageCode:    "en-US",
		Model:           "video",
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}
	for i, result := range response.Results {
		alternative := result.Alternatives[0]
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "First alternative of result %d", i)
		fmt.Fprintf(w, "Transcript: %s", alternative.Transcript)
	}
	return nil
}

Java

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.speech.v1.LongRunningRecognizeMetadata;
import com.google.cloud.speech.v1.LongRunningRecognizeResponse;
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import java.util.List;

/**
 * Performs transcription of the remote audio file asynchronously with the selected model.
 *
 * @param gcsUri the path to the remote audio file to transcribe.
 */
public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure request with video media type
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    // Just print the first result here.
    SpeechRecognitionResult result = results.get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

async function transcribeModelSelectionGCS() {
  const config = {
    encoding: encoding,
    sampleRateHertz: sampleRateHertz,
    languageCode: languageCode,
    model: model,
  };
  const audio = {
    uri: gcsUri,
  };

  const request = {
    config: config,
    audio: audio,
  };

  // Detects speech in the audio file.
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribeModelSelectionGCS();

Python

To learn how to install and use the client library for Speech-to-Text, see Speech-to-Text client libraries. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import speech

client = speech.SpeechClient()

audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-tests/speech/Google_Gnome.wav"
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # Chosen model
)

operation = client.long_running_recognize(config=config, audio=audio)

print("Waiting for operation to complete...")
response = operation.result(timeout=90)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(f"First alternative of result {i}")
    print(f"Transcript: {alternative.transcript}")

Additional languages

C#: Follow the C# setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page and then visit the Speech-to-Text reference documentation for Ruby.