New Cloud Speech-to-Text users should use the V2 API. See the migration guide to learn how to move existing projects to the latest version.

Transcribe long audio files to text

This page explains how to transcribe long audio files (longer than one minute) to text using the Cloud Speech-to-Text API and asynchronous speech recognition.

About asynchronous speech recognition

Asynchronous speech recognition starts a long-running audio processing operation. Use it to transcribe audio that is longer than 60 seconds; for shorter audio, synchronous speech recognition is faster and simpler. The upper limit for asynchronous speech recognition is 480 minutes.
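The duration rules above can be captured in a small helper. The following is a minimal sketch, not part of any client library: the function name and the constants are hypothetical, and treating audio of exactly 60 seconds as eligible for synchronous recognition is an assumption.

```python
# Limits taken from the text above.
SYNC_LIMIT_SECONDS = 60          # longer audio needs asynchronous recognition
ASYNC_LIMIT_SECONDS = 480 * 60   # 480-minute upper limit for asynchronous recognition


def choose_recognition_method(duration_seconds: float) -> str:
    """Return "sync" or "async" for audio of the given length (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds > ASYNC_LIMIT_SECONDS:
        raise ValueError("audio exceeds the 480-minute asynchronous limit")
    # Assumption: audio of exactly 60 seconds can still be handled synchronously.
    return "sync" if duration_seconds <= SYNC_LIMIT_SECONDS else "async"
```

A caller could run this check before deciding which request to build.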

Cloud Speech-to-Text and asynchronous processing

You can send audio content directly from a local file to Cloud Speech-to-Text for asynchronous processing. However, local audio files are limited to 60 seconds, and attempting to transcribe a local audio file longer than 60 seconds results in an error. To use asynchronous speech recognition on audio longer than 60 seconds, store the data in a Cloud Storage bucket.
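As an illustration of the two audio sources, the sketch below builds the `audio` part of a REST request body: a `uri` for Cloud Storage objects, or base64-encoded `content` for short local audio. The helper name and its parameters are hypothetical, and the 60-second check only mirrors the limit described above on the client side.

```python
import base64


def build_recognition_audio(source: str, duration_seconds: float, data: bytes = b"") -> dict:
    """Build the `audio` object of a recognition request (hypothetical helper).

    `source` is either a gs:// URI or a local path; `data` holds the raw bytes
    of a local file that the caller has already read.
    """
    if source.startswith("gs://"):
        # Audio in Cloud Storage is referenced by URI, whatever its length.
        return {"uri": source}
    if duration_seconds > 60:
        # Mirrors the limit in the text: longer local audio is rejected.
        raise ValueError("local audio longer than 60 seconds must be stored in Cloud Storage")
    # Inline audio is sent as a base64-encoded string.
    return {"content": base64.b64encode(data).decode("ascii")}
```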

You can retrieve the results of the operation using the google.longrunning.Operations method. Results remain available for retrieval for 5 days (120 hours). You can also upload the results directly to a Cloud Storage bucket.
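If you need to track when stored results stop being retrievable, the 120-hour window can be computed directly. This is a minimal sketch: the function name is hypothetical, and counting the window from the moment the operation finished is an assumption, because the text does not say when the window starts.

```python
from datetime import datetime, timedelta

# From the text: results stay retrievable for 5 days (120 hours).
RESULT_RETENTION = timedelta(hours=120)


def results_available_until(finished_at: datetime) -> datetime:
    """Return the latest time at which results should still be retrievable.

    Assumption: the retention window is counted from `finished_at`.
    """
    return finished_at + RESULT_RETENTION
```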

Transcribe long audio files using a Cloud Storage bucket

These samples use a Cloud Storage bucket to store the raw audio input for the long-running transcription process. For an example of a typical longrunningrecognize operation response, see the reference documentation.

Protocol

For complete details, refer to the speech:longrunningrecognize API endpoint.

To perform asynchronous speech recognition, make a POST request and provide the appropriate request body. The following example shows a POST request using curl. The example uses the Google Cloud CLI to generate an access token. For instructions on installing the gcloud CLI, see the quickstart.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data '{
  "config": {
    "language_code": "en-US"
  },
  "audio": {
    "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}' "https://speech.googleapis.com/v1/speech:longrunningrecognize"

For more information on configuring the request body, see the RecognitionConfig and RecognitionAudio reference documentation.
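The same POST request can be assembled with the Python standard library. The sketch below only builds the request object; the helper name is hypothetical, and the access token is assumed to come from `gcloud auth application-default print-access-token`, as in the curl example.

```python
import json
import urllib.request

ENDPOINT = "https://speech.googleapis.com/v1/speech:longrunningrecognize"


def build_longrunning_request(
    gcs_uri: str, language_code: str, access_token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = {
        "config": {"language_code": language_code},
        "audio": {"uri": gcs_uri},
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the JSON response body then carries the operation name.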

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924"
}

Here, name is the name of the long-running operation created for the request.

Wait for processing to complete. Processing time depends on the source audio; in most cases, you get results in about half the length of the source audio. To get the status of the long-running operation, send a GET request to the https://speech.googleapis.com/v1/operations/your-operation-name endpoint, replacing your-operation-name with the name value returned by your longrunningrecognize request.

curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     "https://speech.googleapis.com/v1/operations/your-operation-name"

If the request is successful, the server returns a 200 OK HTTP status code and the response in JSON format:

{
  "name": "7612202767953098924",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2017-07-20T16:36:55.033650Z",
    "lastUpdateTime": "2017-07-20T16:37:17.158630Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "how old is the Brooklyn Bridge",
            "confidence": 0.96096134
          }
        ]
      },
      {
        "alternatives": [
          {
            ...
          }
        ]
      }
    ]
  }
}

If the operation has not completed, you can poll the endpoint by repeating the GET request until the done property of the response is true.
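The polling step can be sketched as a loop that repeats the GET request until `done` is true. In this minimal example, `fetch` is a hypothetical callable that performs the GET request and returns the parsed JSON; the interval and attempt limit are illustrative defaults, not values from the API.

```python
import time
from typing import Callable


def wait_for_operation(
    fetch: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_attempts: int = 120,
) -> dict:
    """Call `fetch` repeatedly until the returned operation has done == true."""
    for attempt in range(max_attempts):
        operation = fetch()
        if operation.get("done"):
            return operation
        # Sleep between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError("operation did not finish within the polling budget")
```

Injecting `fetch` keeps the loop independent of the HTTP client, so the same function works with any code that returns the operation as a dictionary.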

gcloud

For complete details, refer to the recognize-long-running command.

To perform asynchronous speech recognition, use the Google Cloud CLI and provide either the path of a local file or a Cloud Storage URL.

gcloud ml speech recognize-long-running \
    'gs://cloud-samples-tests/speech/brooklyn.flac' \
     --language-code='en-US' --async

If the request is successful, the server returns the ID of the long-running operation in JSON format.

{
  "name": "OPERATION_ID"
}

Then run the following command to get information about the operation.

gcloud ml speech operations describe OPERATION_ID

You can also poll the operation until it completes by running the following command.

gcloud ml speech operations wait OPERATION_ID

After the operation completes, it returns a transcript of the audio in JSON format.

{
  "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9840146,
          "transcript": "how old is the Brooklyn Bridge"
        }
      ]
    }
  ]
}
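To turn this JSON output into plain text, you can take the first alternative of each result. The following is a minimal sketch under that assumption; the function name is hypothetical, and the field names are the ones shown in the response above.

```python
import json


def extract_transcript(response_json: str) -> str:
    """Join the first alternative of every result into one transcript."""
    response = json.loads(response_json)
    lines = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Assumption: the first alternative is the one you want to keep.
            lines.append(alternatives[0].get("transcript", ""))
    return "\n".join(lines)
```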

Go

To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Go API reference documentation.

To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


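import (
	"context"
	"fmt"
	"io"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/speech/apiv1/speechpb"
)

// sendGCS transcribes the audio stored at gcsURI and writes the results to w.
func sendGCS(w io.Writer, client *speech.Client, gcsURI string) error {
	ctx := context.Background()

	// Reference the audio file in Cloud Storage and describe its encoding
	// and sample rate so that it can be transcribed.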
	req := &speechpb.LongRunningRecognizeRequest{
		Config: &speechpb.RecognitionConfig{
			Encoding:        speechpb.RecognitionConfig_LINEAR16,
			SampleRateHertz: 16000,
			LanguageCode:    "en-US",
		},
		Audio: &speechpb.RecognitionAudio{
			AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
		},
	}

	op, err := client.LongRunningRecognize(ctx, req)
	if err != nil {
		return err
	}
	resp, err := op.Wait(ctx)
	if err != nil {
		return err
	}

	// Print the results.
	for _, result := range resp.Results {
		for _, alt := range result.Alternatives {
			fmt.Fprintf(w, "\"%v\" (confidence=%.3f)\n", alt.Transcript, alt.Confidence)
		}
	}
	return nil
}

Java

To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Java API reference documentation.

To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

/**
 * Performs non-blocking speech recognition on remote FLAC file and prints the transcription.
 *
 * @param gcsUri the path to the remote FLAC audio file to transcribe.
 */
public static void asyncRecognizeGcs(String gcsUri) throws Exception {
  // Configure polling algorithm
  SpeechSettings.Builder speechSettings = SpeechSettings.newBuilder();
  TimedRetryAlgorithm timedRetryAlgorithm =
      OperationTimedPollAlgorithm.create(
          RetrySettings.newBuilder()
              .setInitialRetryDelay(Duration.ofMillis(500L))
              .setRetryDelayMultiplier(1.5)
              .setMaxRetryDelay(Duration.ofMillis(5000L))
              .setInitialRpcTimeout(Duration.ZERO) // ignored
              .setRpcTimeoutMultiplier(1.0) // ignored
              .setMaxRpcTimeout(Duration.ZERO) // ignored
              .setTotalTimeout(Duration.ofHours(24L)) // set polling timeout to 24 hours
              .build());
  speechSettings.longRunningRecognizeOperationSettings().setPollingAlgorithm(timedRetryAlgorithm);

  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create(speechSettings.build())) {

    // Configure remote file request for FLAC
    RecognitionConfig config =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.FLAC)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .build();
    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    // Use non-blocking call for getting file transcription
    OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
        speech.longRunningRecognizeAsync(config, audio);
    while (!response.isDone()) {
      System.out.println("Waiting for response...");
      Thread.sleep(10000);
    }

    List<SpeechRecognitionResult> results = response.get().getResultsList();

    for (SpeechRecognitionResult result : results) {
      // There can be several alternative transcripts for a given chunk of speech. Just use the
      // first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcription: %s\n", alternative.getTranscript());
    }
  }
}

Node.js

To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Node.js API reference documentation.

To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

// Imports the Google Cloud client library
const speech = require('@google-cloud/speech');

// Creates a client
const client = new speech.SpeechClient();

/**
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const gcsUri = 'gs://my-bucket/audio.raw';
// const encoding = 'Encoding of the audio file, e.g. LINEAR16';
// const sampleRateHertz = 16000;
// const languageCode = 'BCP-47 language code, e.g. en-US';

const config = {
  encoding: encoding,
  sampleRateHertz: sampleRateHertz,
  languageCode: languageCode,
};

const audio = {
  uri: gcsUri,
};

const request = {
  config: config,
  audio: audio,
};

async function transcribeGcs() {
  // Detects speech in the audio file. This creates a recognition job that you
  // can wait for now, or get its result later.
  const [operation] = await client.longRunningRecognize(request);
  // Get a Promise representation of the final result of the job
  const [response] = await operation.promise();
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log(`Transcription: ${transcription}`);
}

// `await` is not allowed at the top level of a CommonJS module, so the
// calls are wrapped in an async function.
transcribeGcs().catch(console.error);

Python

To learn how to install and use the client library for Cloud STT, see Cloud STT client libraries. For more information, see the Cloud STT Python API reference documentation.

To authenticate to Cloud STT, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

from google.cloud import speech


def transcribe_gcs(gcs_uri: str) -> str:
    """Asynchronously transcribes the audio file from Cloud Storage
    Args:
        gcs_uri: The Google Cloud Storage path to an audio file.
            E.g., "gs://storage-bucket/file.flac".
    Returns:
        The generated transcript from the audio file provided.
    """
    client = speech.SpeechClient()

    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=44100,
        language_code="en-US",
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    transcript_builder = []
    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        transcript_builder.append(f"\nTranscript: {result.alternatives[0].transcript}")
        transcript_builder.append(f"\nConfidence: {result.alternatives[0].confidence}")

    transcript = "".join(transcript_builder)
    print(transcript)

    return transcript

Additional languages

C#: Follow the C# setup instructions on the client libraries page, and then visit the Cloud STT reference documentation for .NET.

PHP: Follow the PHP setup instructions on the client libraries page, and then visit the Cloud STT reference documentation for PHP.

Ruby: Follow the Ruby setup instructions on the client libraries page, and then visit the Cloud STT reference documentation for Ruby.

Upload your transcription results to a Cloud Storage bucket

Cloud Speech-to-Text supports uploading your long-running recognition results directly to a Cloud Storage bucket. If you implement this feature with Cloud Storage triggers, Cloud Storage uploads can trigger notifications that call Cloud Functions, which removes the need to poll Cloud Speech-to-Text for recognition results.
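A function triggered by such an upload could look roughly like the sketch below. The entry-point name is hypothetical, and the example assumes that the event payload exposes the bucket and object name under the `bucket` and `name` keys; check the Cloud Functions documentation for the exact event shape in your runtime.

```python
def on_transcript_uploaded(event: dict, context=None) -> str:
    """Hypothetical entry point that runs when a transcript file is uploaded.

    Assumption: `event` carries the keys "bucket" and "name".
    """
    uri = f"gs://{event['bucket']}/{event['name']}"
    print(f"Transcript ready at {uri}")
    return uri
```

From here, the function could download the JSON file and process the recognition results without any polling.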

To have your results uploaded to a Cloud Storage bucket, provide the optional TranscriptOutputConfig output configuration in your long-running recognition request.

      message TranscriptOutputConfig {

        oneof output_type {
          // Specifies a Cloud Storage URI for the recognition results. Must be
          // specified in the format: `gs://bucket_name/object_name`
          string gcs_uri = 1;
        }
      }
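In a REST request body, this message corresponds to an `output_config` object with a `gcs_uri` field. The sketch below adds it to an existing request body and checks the `gs://bucket_name/object_name` format on the client side; the helper name is hypothetical.

```python
def with_output_config(request_body: dict, output_gcs_uri: str) -> dict:
    """Return a copy of `request_body` with an output_config section added."""
    prefix = "gs://"
    if not output_gcs_uri.startswith(prefix):
        raise ValueError("expected a URI of the form gs://bucket_name/object_name")
    # Split the remainder into the bucket name and the object name.
    bucket, _, object_name = output_gcs_uri[len(prefix):].partition("/")
    if not bucket or not object_name:
        raise ValueError("expected a URI of the form gs://bucket_name/object_name")
    return {**request_body, "output_config": {"gcs_uri": output_gcs_uri}}
```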

Protocol

For complete details, refer to the longrunningrecognize API endpoint.

The following example shows a POST request using curl, in which the request body specifies the path to a Cloud Storage bucket. The results are uploaded to this location as a JSON file that stores SpeechRecognitionResult data.

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'config': {...},
  'output_config': {
     'gcs_uri':'gs://bucket/result-output-path.json'
  },
  'audio': {
    'uri': 'gs://bucket/audio-path'
  }
}" "https://speech.googleapis.com/v1/speech:longrunningrecognize"

The LongRunningRecognizeResponse includes the path to the Cloud Storage bucket where the upload was attempted. If the upload failed, an output error is returned. If a file with the same name already exists, the results are uploaded to a new file with a timestamp as its suffix.

{
  ...
  "metadata": {
    ...
    "outputConfig": {...}
  },
  ...
  "response": {
    ...
    "results": [...],
    "outputConfig": {
      "gcs_uri":"gs://bucket/result-output-path"
    },
    "outputError": {...}
  }
}
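When handling the finished operation, it is worth checking for an output error before relying on the uploaded file. The following is a minimal sketch with a hypothetical helper name; the key names `outputConfig`, `gcs_uri`, and `outputError` are copied from the sample response above and are assumed to match what the server returns.

```python
def uploaded_result_uri(operation: dict) -> str:
    """Return the Cloud Storage URI of the uploaded results, or raise on failure."""
    response = operation.get("response", {})
    error = response.get("outputError")
    if error:
        # The upload failed; the transcript may still be present under "results".
        raise RuntimeError(f"uploading the transcript failed: {error}")
    return response["outputConfig"]["gcs_uri"]
```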

Try it for yourself

If you're new to Google Cloud, create an account to evaluate how Cloud STT performs in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
