Use transcription with Speech-to-Text Chirp 3

Chirp 3, the latest generation of Google's multilingual Automatic Speech Recognition (ASR) generative models, is offered through Google Cloud's Speech-to-Text (STT) API v2 and is available for voice transcription.

Set up

Follow these steps to enable transcription with Speech-to-Text Chirp 3.

Console

When you create or update a conversation profile using the Agent Assist console, follow these steps to configure Speech-to-Text settings to use the Chirp 3 model.

  1. Click Conversation profiles.
  2. Click the name of your profile.
  3. Navigate to the Speech to Text Config section.
  4. Choose Chirp 3 for the model.
  5. (Optional) Select Use Long Form Model for AA Telephony SipRec Integration if the audio is transmitted through Telephony Integration.
  6. (Optional) Configure Language Code and up to one Alternative Language Code.
  7. (Optional) Set the language code to auto for language-agnostic transcription.
  8. (Optional) Configure Phrases for speech adaptation to improve accuracy with model adaptation.

REST API

You can call the API directly to create or update a conversation profile. Enable STT V2 with the ConversationProfile.sttConfig.useSttV2 field, as shown in the following example configuration:

{
  "name": "projects/PROJECT_ID/locations/global/conversationProfiles/CONVERSATION_PROFILE_ID",
  "displayName": "CONVERSATION_PROFILE_NAME",
  "automatedAgentConfig": {},
  "humanAgentAssistantConfig": {
    "notificationConfig": {
      "topic": "projects/PROJECT_ID/topics/FEATURE_SUGGESTION_TOPIC_ID",
      "messageFormat": "JSON"
    },
    "humanAgentSuggestionConfig": {
      "featureConfigs": [{
        "enableEventBasedSuggestion": true,
        "suggestionFeature": {
          "type": "ARTICLE_SUGGESTION"
        },
        "conversationModelConfig": {}
      }]
    },
    "messageAnalysisConfig": {}
  },
  "sttConfig": {
    "model": "chirp_3",
    "useSttV2": true
  },
  "languageCode": "en-US"
}
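
As a sketch of how you might assemble that request body in code, the following builds the minimal sttConfig portion of the profile. The helper name, placeholder IDs, and the PATCH endpoint mentioned in the comment are illustrative assumptions, not part of the documented API surface; adapt them to your project:

```python
import json

# Hypothetical helper: build the conversation-profile JSON body shown above.
# "my-project" and "my-profile" are placeholder values, not real resources.
def build_profile_payload(project_id, profile_id, display_name):
    """Return a request body that enables STT V2 with the Chirp 3 model."""
    return {
        "name": (
            f"projects/{project_id}/locations/global/"
            f"conversationProfiles/{profile_id}"
        ),
        "displayName": display_name,
        "sttConfig": {
            "model": "chirp_3",
            "useSttV2": True,
        },
        "languageCode": "en-US",
    }

payload = build_profile_payload("my-project", "my-profile", "My profile")
# Send this body with an authenticated PATCH to the conversationProfiles
# endpoint (with an updateMask covering sttConfig) to update the profile.
print(json.dumps(payload, indent=2))
```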

Best practices

Follow these suggestions to get the most from voice transcription with the Chirp 3 model.

Audio streaming

To maximize Chirp 3 performance, send audio in near real time. This means if you have X seconds of audio, stream it in roughly X seconds. Break your audio into small chunks, each with a frame size of 100 ms. For more audio streaming best practices, see the Speech-to-Text documentation.
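
The pacing described above can be sketched as follows. This example assumes 16 kHz, 16-bit mono LINEAR16 audio; adjust the constants for your encoding:

```python
import time

# Assumed audio format: 16 kHz sample rate, 16-bit (2-byte) mono samples.
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2
FRAME_MS = 100  # 100 ms frames, per the guidance above
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 3200 bytes

def stream_chunks(audio: bytes):
    """Yield 100 ms chunks in near real time.

    X seconds of pre-recorded audio are emitted over roughly X seconds,
    mimicking a live stream.
    """
    for offset in range(0, len(audio), CHUNK_BYTES):
        yield audio[offset:offset + CHUNK_BYTES]
        time.sleep(FRAME_MS / 1000)  # pace chunks to real time
```

Each yielded chunk would then be sent as one streaming request message.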

Use speech adaptation

Chirp 3 supports speech adaptation only through inline phrases configured in the conversation profile.
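
For reference, inline phrase adaptation in the Speech-to-Text v2 API takes the following shape. The field names come from the STT v2 RecognitionConfig.adaptation message; how the Agent Assist conversation profile surfaces them may differ, so treat this as an illustrative fragment rather than a definitive profile configuration:

```json
{
  "adaptation": {
    "phraseSets": [
      {
        "inlinePhraseSet": {
          "phrases": [
            { "value": "Chirp 3", "boost": 10 },
            { "value": "Agent Assist", "boost": 10 }
          ]
        }
      }
    ]
  }
}
```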

Regional and language support

Chirp 3 is available for all Speech-to-Text languages with different launch readiness, and in all Agent Assist regions except northamerica-northeast1, northamerica-northeast2, and asia-south1.

Quotas

The number of transcription requests using the Chirp 3 model is limited by the SttV2StreamingRequestsPerMinutePerResourceTypePerRegion quota with chirp_3 labeled as the resource type. See the Google Cloud quotas guide for information on quota usage and how to request a quota increase.

For quota purposes, transcription requests sent to the global Dialogflow endpoint count against the us-central1 region.