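Transcribe audio from a video file using Speech-to-Text

This tutorial shows how to transcribe the audio track of a video file using Speech-to-Text.

Audio files can come from many different sources. Audio data might come from a phone (such as a voicemail) or from the audio track of a video file.

Speech-to-Text can transcribe audio files using one of several machine learning models, chosen to match the original source of the audio. Specifying the source of the original audio can yield better transcription results, because Speech-to-Text then processes your audio with a machine learning model trained on data similar to your audio file.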

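Prepare the audio data

Before you can transcribe audio from a video, you must extract the audio data from the video file. After extracting the audio data, you must either store it in a Cloud Storage bucket or convert it to base64 encoding.

Extract the audio data

You can use any file conversion tool that handles audio and video files, such as FFmpeg.

Use the following code snippet to convert a video file to an audio file with ffmpeg.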

ffmpeg -i video-input-file audio-output-file
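
The bare command lets ffmpeg pick the output format from the file extension. The request examples on this page declare LINEAR16 audio at 16000 Hz, so it may help to ask ffmpeg for that format explicitly. The following Python sketch builds such a command. The helper names, the file names, and the choice of mono output are illustrative assumptions, not part of the original tutorial.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate_hz=16000):
    """Return an ffmpeg argument list that extracts a 16-bit PCM WAV track."""
    return [
        "ffmpeg",
        "-i", video_path,            # input video file
        "-vn",                       # drop the video stream
        "-acodec", "pcm_s16le",      # 16-bit little-endian PCM (LINEAR16)
        "-ac", "1",                  # downmix to one channel (assumption)
        "-ar", str(sample_rate_hz),  # resample to the requested rate
        audio_path,
    ]


def extract_audio(video_path, audio_path, sample_rate_hz=16000):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(
        build_ffmpeg_command(video_path, audio_path, sample_rate_hz),
        check=True,
    )
```

Calling extract_audio("video-input-file.mp4", "audio-output-file.wav") requires ffmpeg to be installed and on the PATH.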

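Store or convert the audio data

You can transcribe an audio file stored on your local machine or in a Cloud Storage bucket.

Use the following command to upload your audio file to an existing Cloud Storage bucket with the Google Cloud CLI.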

gcloud storage cp audio-output-file storage-bucket-uri
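
The same upload can be done from Python. This is a minimal sketch, assuming the google-cloud-storage package is installed and Application Default Credentials are configured. The helper names are hypothetical, and unlike the command above, this sketch expects the destination URI to name the full object (for example gs://my-bucket/audio-output-file.wav), not just the bucket.

```python
def split_gs_uri(gs_uri):
    """Split 'gs://bucket/path/to/object' into (bucket, object_name)."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {gs_uri!r}")
    bucket, _, object_name = gs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError(f"URI must name a bucket and an object: {gs_uri!r}")
    return bucket, object_name


def upload_audio(local_path, gs_uri):
    """Upload a local audio file to the object named by gs_uri."""
    # Imported here so the URI helper works without the package installed.
    from google.cloud import storage

    bucket_name, object_name = split_gs_uri(gs_uri)
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
```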

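If you use a local file and plan to send the request with the curl tool from the command line, you must first convert the audio file to base64-encoded data.

Use the following command to convert the audio file to a text file.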

base64 audio-output-file -w 0 > audio-data-text
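
Some platforms ship a base64 utility that may not accept the -w option. In that case the same conversion can be sketched in Python using only the standard library. The function name and file names are illustrative.

```python
import base64


def encode_audio_file(audio_path, text_path):
    """Write the base64 encoding of audio_path to text_path as one line."""
    with open(audio_path, "rb") as audio_file:
        encoded = base64.b64encode(audio_file.read()).decode("ascii")
    with open(text_path, "w", encoding="ascii") as text_file:
        text_file.write(encoded)  # no line wrapping, like `base64 -w 0`
    return encoded
```

For example, encode_audio_file("audio-output-file", "audio-data-text") produces the same text file as the command above.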

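Send a transcription request

Use the following code to send a transcription request to Speech-to-Text.

Local file request

Protocol

For complete details, see the speech:recognize API endpoint.

To perform synchronous speech recognition, make a POST request and provide the appropriate request body. The following is an example of a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.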

curl -s -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
    https://speech.googleapis.com/v1/speech:recognize \
    --data '{
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
        "model": "video"
    },
    "audio": {
        "uri": "gs://cloud-samples-tests/speech/Google_Gnome.wav"
    }
}'

For more information about configuring the request body, see the RecognitionConfig reference documentation.
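
The curl example points at a file in Cloud Storage through audio.uri. For a file on your machine, the request body would instead carry the base64 text in audio.content. The following sketch builds such a body with the Python standard library. The helper name is hypothetical, and the config values simply mirror the curl example.

```python
import base64
import json


def build_recognize_body(audio_bytes, model="video"):
    """Return the JSON body for a speech:recognize request with inline audio."""
    body = {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            "model": model,
        },
        "audio": {
            # Inline audio must be base64-encoded text in a JSON request.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }
    return json.dumps(body)
```

One way to use the result is to write it to a file such as request.json and pass it to curl with --data @request.json in place of the inline JSON.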

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "OK Google stream stranger things from
            Netflix to my TV okay stranger things from
            Netflix playing on TV from the people that brought you
            Google home comes the next evolution of the smart home
            and it's just outside your window me Google know hi
            how can I help okay no what's the weather like outside
            the weather outside is sunny and 76 degrees he's right
            okay no turn on the hose I'm holding sure okay no I'm can
            I eat this lemon tree leaf yes what about this Daisy yes
            but I wouldn't recommend it but I could eat it okay
            Nomad milk to my shopping list I'm sorry that sounds like
            an indoor request I keep doing that sorry you do keep
            doing that okay no is this compost really we're all
            compost if you think about it pretty much everything is
            made up of organic matter and will return",
          "confidence": 0.9251011
        }
      ]
    }
  ]
}
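
The response nests each transcript under results, then alternatives. As a rough sketch of reading it without a client library, the following Python function pulls the first alternative of each result from a response body. The function name is illustrative.

```python
import json


def first_transcripts(response_text):
    """Return (transcript, confidence) for the first alternative of each result."""
    response = json.loads(response_text)
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue  # skip results that carry no alternatives
        best = alternatives[0]
        pairs.append((best.get("transcript", ""), best.get("confidence")))
    return pairs
```

Using .get with defaults keeps the function from failing on an empty response, which is what the service returns when no speech is recognized.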

Go

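To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".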


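import (
	"context"
	"fmt"
	"io"
	"os"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// modelSelection transcribes a local audio file synchronously with the
// selected model.
func modelSelection(w io.Writer) error {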
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	data, err := os.ReadFile("../testdata/Google_Gnome.wav")
	if err != nil {
		return fmt.Errorf("ReadFile: %w", err)
	}

	req := &speechpb.RecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
			Model:           "video",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Content{Content: data},
		},
	}

	resp, err := client.Recognize(ctx, req)
	if err != nil {
		return fmt.Errorf("recognize: %w", err)
	}

	for i, result := range resp.Results {
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "Result %d\n", i+1)
		for j, alternative := range result.Alternatives {
			fmt.Fprintf(w, "Alternative %d: %s\n", j+1, alternative.Transcript)
		}
	}
	return nil
}

Java

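To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".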

/**
 * Performs transcription of the given audio file synchronously with the selected model.
 *
 * @param fileName the path to an audio file to transcribe
 */
public static void transcribeModelSelection(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] content = Files.readAllBytes(path);

  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with video media type
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio recognitionAudio =
        RecognitionAudio.newBuilder().setContent(ByteString.copyFrom(content)).build();

    RecognizeResponse recognizeResponse = speech.recognize(recConfig, recognitionAudio);
    // Just print the first result here.
    SpeechRecognitionResult result = recognizeResponse.getResultsList().get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

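To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".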

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;
const fs = require('fs');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const filename = 'Local path to audio file, e.g. /path/to/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  content: fs.readFileSync(filename).toString('base64'),
};

const request = {
  config: config,
  audio: audio,
};

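// Detects speech in the audio file. CommonJS modules do not allow top-level
// await, so the call is wrapped in an async function.
async function transcribe() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribe().catch(console.error);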

Python

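To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".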

from google.cloud import speech

# Instantiates a client
client = speech.SpeechClient()
# Reads a file as bytes
with open("resources/Google_Gnome.wav", "rb") as f:
    audio_content = f.read()

audio = speech.RecognitionAudio(content=audio_content)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # Chosen model
)

response = client.recognize(config=config, audio=audio)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(f"First alternative of result {i}")
    print(f"Transcript: {alternative.transcript}")

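Other languages

C#: Follow the C# setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for Ruby.

Remote file request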

Go

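To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Go API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".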


import (
	"context"
	"fmt"
	"io"
	"strings"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// transcribeModelSelectionGCS transcribes the given audio file asynchronously
// with the selected model.
func transcribeModelSelectionGCS(w io.Writer) error {
	ctx := context.Background()

	client, err := speech.NewClient(ctx)
	if err != nil {
		return fmt.Errorf("NewClient: %w", err)
	}
	defer client.Close()

	audio := &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Uri{Uri: "gs://cloud-samples-tests/speech/Google_Gnome.wav"},
	}

	// The speech recognition model to use
	// See, https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model
	recognitionConfig := &speechpb.RecognitionConfig{
		Encoding:        speechpb.RecognitionConfig_LINEAR16,
		SampleRateHertz: 16000,
		LanguageCode:    "en-US",
		Model:           "video",
	}

	longRunningRecognizeRequest := &speechpb.LongRunningRecognizeRequest{
		Config: recognitionConfig,
		Audio:  audio,
	}

	operation, err := client.LongRunningRecognize(ctx, longRunningRecognizeRequest)
	if err != nil {
		return fmt.Errorf("error running recognize %w", err)
	}

	response, err := operation.Wait(ctx)
	if err != nil {
		return err
	}
	for i, result := range response.Results {
		alternative := result.Alternatives[0]
		fmt.Fprintf(w, "%s\n", strings.Repeat("-", 20))
		fmt.Fprintf(w, "First alternative of result %d\n", i)
		fmt.Fprintf(w, "Transcript: %s\n", alternative.Transcript)
	}
	return nil
}

Java

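To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Java API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".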

/**
 * Performs transcription of the remote audio file asynchronously with the selected model.
 *
 * @param gcsUri the path to the remote audio file to transcribe.
 */
public static void transcribeModelSelectionGcs(String gcsUri) throws Exception {
  try (SpeechClient speech = SpeechClient.create()) {

    // Configure request with video media type
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            // encoding may either be omitted or must match the value in the file header
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            // sample rate hertz may either be omitted or must match the value in the file
            // header
            .setSampleRateHertz(16000)
            .setModel("video")
            .build();

    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);

    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    // Just print the first result here.
    SpeechRecognitionResult result = results.get(0);
    // There can be several alternative transcripts for a given chunk of speech. Just use the
    // first (most likely) one here.
    SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
    System.out.printf("Transcript : %s\n", alternative.getTranscript());
  }
}

Node.js

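To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Node.js API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".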

// Imports the Google Cloud client library for Beta API
/**
 * TODO(developer): Update client library import to use new
 * version of API when desired features become available
 */
const speech = require('@google-cloud/speech').v1p1beta1;

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const model = 'Model to use, e.g. phone_call, video, default';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
  model: model,
};
const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

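// Detects speech in the audio file. CommonJS modules do not allow top-level
// await, so the call is wrapped in an async function.
async function transcribe() {
  const [response] = await client.recognize(request);
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log('Transcription: ', transcription);
}

transcribe().catch(console.error);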

Python

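To learn how to install and use the client library for Speech-to-Text, see the client libraries page. For more information, see the Speech-to-Text Python API reference documentation.

To authenticate to Speech-to-Text, set up Application Default Credentials. For more information, see "Set up authentication for a local development environment".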

from google.cloud import speech

client = speech.SpeechClient()

audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-tests/speech/Google_Gnome.wav"
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="video",  # Chosen model
)

operation = client.long_running_recognize(config=config, audio=audio)

print("Waiting for operation to complete...")
response = operation.result(timeout=90)

for i, result in enumerate(response.results):
    alternative = result.alternatives[0]
    print("-" * 20)
    print(f"First alternative of result {i}")
    print(f"Transcript: {alternative.transcript}")

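Other languages

C#: Follow the C# setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then visit the Speech-to-Text reference documentation for Ruby.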
