Chirp 3: Instant custom voice

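The Chirp 3 instant custom voice feature lets you create a personalized voice model by training a model on your own high-quality audio recordings. Instant custom voice generates a personal voice quickly. You can then use that voice to synthesize speech through the Cloud TTS API, which supports both streaming and long-form text.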

Technical details

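Supported languages	See Supported languages.
Available regions	`global`, `us`, `eu`, `asia-southeast1`, `asia-northeast1`, `europe-west2`
Supported output formats	`streaming`: `LINEAR16` (default), `ALAW`, `MULAW`, `OGG_OPUS`, `PCM`; `batch`: `LINEAR16` (default), `ALAW`, `MULAW`, `OGG_OPUS`, `PCM`
Supported encoding formats (reference and consent audio)	`LINEAR16`, `PCM`, `MP3`, `M4A`
Supported features	Text-based prompting: use punctuation, pauses, and disfluencies to add natural flow and pacing. Pause tags (experimental): insert pauses into the synthesized audio where needed. Pace control: adjust the speed of the synthesized audio from 0.25x to 2x. Pronunciation control (experimental): custom pronunciation of words and phrases using IPA or X-SAMPA phonetic encoding. Language transfer: a voice cloning key with locale `en-US` can synthesize output in the following locales: `de-DE`, `es-US`, `es-ES`, `fr-CA`, `fr-FR`, `pt-BR`.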

Supported languages

Instant custom voice is supported in the following languages.

Language	BCP-47 code	Consent statement
Arabic (XA)	ar-XA	.أنا مالك هذا الصوت وأوافق على أن تستخدم Google هذا الصوت لإنشاء نموذج صوتي اصطناعي
Bengali (India)	bn-IN	আমি এই ভয়েসের মালিক এবং আমি একটি সিন্থেটিক ভয়েস মডেল তৈরি করতে এই ভয়েস ব্যবহার করে Google-এর সাথে সম্মতি দিচ্ছি।
Chinese (China)	cmn-CN	我是此声音的拥有者并授权谷歌使用此声音创建语音合成模型
English (Australia)	en-AU	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (India)	en-IN	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (UK)	en-GB	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
English (US)	en-US	I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.
French (Canada)	fr-CA	Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique.
French (France)	fr-FR	Je suis le propriétaire de cette voix et j'autorise Google à utiliser cette voix pour créer un modèle de voix synthétique.
German (Germany)	de-DE	Ich bin der Eigentümer dieser Stimme und bin damit einverstanden, dass Google diese Stimme zur Erstellung eines synthetischen Stimmmodells verwendet.
Gujarati (India)	gu-IN	હું આ વોઈસનો માલિક છું અને સિન્થેટિક વોઈસ મોડલ બનાવવા માટે આ વોઈસનો ઉપયોગ કરીને google ને હું સંમતિ આપું છું
Hindi (India)	hi-IN	मैं इस आवाज का मालिक हूं और मैं सिंथेटिक आवाज मॉडल बनाने के लिए Google को इस आवाज का उपयोग करने की सहमति देता हूं
Indonesian (Indonesia)	id-ID	Saya pemilik suara ini dan saya menyetujui Google menggunakan suara ini untuk membuat model suara sintetis.
Italian (Italy)	it-IT	Sono il proprietario di questa voce e acconsento che Google la utilizzi per creare un modello di voce sintetica.
Japanese (Japan)	ja-JP	私はこの音声の所有者であり、Google がこの音声を使用して音声合成モデルを作成することを承認します。
Kannada (India)	kn-IN	ನಾನು ಈ ಧ್ವನಿಯ ಮಾಲಿಕ ಮತ್ತು ಸಂಶ್ಲೇಷಿತ ಧ್ವನಿ ಮಾದರಿಯನ್ನು ರಚಿಸಲು ಈ ಧ್ವನಿಯನ್ನು ಬಳಸಿಕೊಂಡುಗೂಗಲ್ ಗೆ ನಾನು ಸಮ್ಮತಿಸುತ್ತೇನೆ.
Korean (South Korea)	ko-KR	나는 이 음성의 소유자이며 구글이 이 음성을 사용하여 음성 합성 모델을 생성할 것을 허용합니다.
Malayalam (India)	ml-IN	ഈ ശബ്ദത്തിന്റെ ഉടമ ഞാനാണ്, ഒരു സിന്തറ്റിക് വോയ്‌ಸ್ മോഡൽ സൃഷ്ടിക്കാൻ ഈ ശബ്‌ദം ഉപയോഗിക്കുന്നതിന് ഞാൻ Google-ന് സമ്മതം നൽകുന്നു.
Marathi (India)	mr-IN	मी या आवाजाचा मालक आहे आणि सिंथेटिक व्हॉइस मॉडेल तयार करण्यासाठी हा आवाज वापरण्यासाठी मी Google ला संमती देतो
Dutch (Netherlands)	nl-NL	Ik ben de eigenaar van deze stem en ik geef Google toestemming om deze stem te gebruiken om een synthetisch stemmodel te maken.
Polish (Poland)	pl-PL	Jestem właścicielem tego głosu i wyrażam zgodę na wykorzystanie go przez Google w celu utworzenia syntetycznego modelu głosu.
Portuguese (Brazil)	pt-BR	Eu sou o proprietário desta voz e autorizo o Google a usá-la para criar um modelo de voz sintética.
Russian (Russia)	ru-RU	Я являюсь владельцем этого голоса и даю согласие Google на использование этого голоса для создания модели синтетического голоса.
Tamil (India)	ta-IN	நான் இந்த குரலின் உரிமையாளர் மற்றும் செயற்கை குரல் மாதிரியை உருவாக்க இந்த குரலை பயன்படுத்த குகல்க்கு நான் ஒப்புக்கொள்கிறேன்.
Telugu (India)	te-IN	నేను ఈ వాయిస్ యజమానిని మరియు సింతటిక్ వాయిస్ మోడల్ ని రూపొందించడానికి ఈ వాయిస్ ని ఉపయోగించడానికి googleకి నేను సమ్మతిస్తున్నాను.
Thai (Thailand)	th-TH	ฉันเป็นเจ้าของเสียงนี้ และฉันยินยอมให้ Google ใช้เสียงนี้เพื่อสร้างแบบจำลองเสียงสังเคราะห์
Turkish (Turkey)	tr-TR	Bu sesin sahibi benim ve Google'ın bu sesi kullanarak sentetik bir ses modeli oluşturmasına izin veriyorum.
Vietnamese (Vietnam)	vi-VN	Tôi là chủ sở hữu giọng nói này và tôi đồng ý cho Google sử dụng giọng nói này để tạo mô hình giọng nói tổng hợp.
Spanish (Spain)	es-ES	Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética.
Spanish (US)	es-US	Soy el propietario de esta voz y doy mi consentimiento para que Google la utilice para crear un modelo de voz sintética.
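
Because the consent script has to match the language of the request, it can help to keep the two together in code. The following is a minimal sketch, not part of the API: `CONSENT_SCRIPTS` and `consent_fields` are hypothetical names, and the dictionary holds only three of the rows from the table above.

```python
# Hypothetical helper: a partial copy of the consent statement table,
# keyed by BCP-47 code. Extend it with the rows you need.
CONSENT_SCRIPTS = {
    "en-US": (
        "I am the owner of this voice and I consent to Google using this "
        "voice to create a synthetic voice model."
    ),
    "fr-FR": (
        "Je suis le propriétaire de cette voix et j'autorise Google à "
        "utiliser cette voix pour créer un modèle de voix synthétique."
    ),
    "ja-JP": "私はこの音声の所有者であり、Google がこの音声を使用して音声合成モデルを作成することを承認します。",
}


def consent_fields(language_code: str) -> dict:
    """Return the consent script and language code as a matching pair."""
    try:
        script = CONSENT_SCRIPTS[language_code]
    except KeyError:
        raise ValueError(
            f"No consent script stored for {language_code!r}"
        ) from None
    return {"consent_script": script, "language_code": language_code}
```

The returned dictionary uses the same two field names as the key-creation request body shown later on this page, so it could be merged into that body.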

Use instant custom voice

The following sections describe how to use the Chirp 3 instant custom voice feature with the Text-to-Speech API.

Record the consent statement and reference audio

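Record the consent statement: To comply with the legal and ethical guidelines for instant custom voice, record the required consent statement in the appropriate language as a mono audio file of at most 10 seconds, in a supported audio encoding format. For example, the English statement is: "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."
Record the reference audio: Using your computer's microphone, record a mono audio file of at most 10 seconds in a supported audio encoding format. Avoid background noise while recording, and record the consent statement and the reference audio in the same environment.
Store the audio files: Save the recorded audio files to your designated Cloud Storage location.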
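
The two recording constraints above (mono, at most 10 seconds) can be checked locally before uploading. This is a minimal sketch that assumes the recording is a WAV file, which Python's standard `wave` module can read; `check_recording` and `MAX_SECONDS` are illustrative names, and MP3 or M4A files would need a different tool.

```python
import wave

MAX_SECONDS = 10.0  # Upper bound stated in the recording steps above.


def check_recording(path: str) -> float:
    """Return the duration in seconds, or raise ValueError if unsuitable."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / wav_file.getframerate()
    if channels != 1:
        raise ValueError(f"Expected mono audio, found {channels} channels")
    if duration > MAX_SECONDS:
        raise ValueError(f"Recording is {duration:.1f} s; limit is 10 s")
    return duration
```

Running the check on both the consent file and the reference file catches the most common recording mistakes before any API call is made.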

Guidelines for producing high-quality reference and consent audio

Follow these guidelines to produce high-quality reference and consent audio:

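Make the audio as close to 10 seconds long as possible.
The audio should have natural pauses and pacing.
Keep background noise in the audio to a minimum.
For details, see the supported audio encodings. Any sample rate can be used.
The model reproduces the microphone quality as is, so if the recording is muffled, the generated audio will be muffled too.
Speak dynamically, and be slightly more expressive than you want the final output to be. Your voice should carry the intonation and rhythm you want the cloned voice to have. For example, if the reference audio has no natural pauses or breaks, the cloned voice won't handle pauses well either.
Make the prompt excited and energetic rather than monotone and bored, so that the model can pick up on that energy and reproduce it.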

Create an instant custom voice using the REST API

An instant custom voice takes the form of a voice cloning key, which is a text string representation of your voice data.

Important considerations

Keep the following points in mind when creating a custom voice:

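There is no limit on the number of voice cloning keys you can create, because the keys are stored client-side and provided with each request.
The same voice cloning key can be used by multiple clients or devices at the same time.
You can create 10 voice cloning keys per minute per project. For details, see Request limits.
You can't use a custom consent script in place of the default one. You must use the consent script provided for the selected language.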
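
If a client creates keys in bulk, it may be worth guarding the limit of 10 keys per minute on the client side. The following is a sketch of a simple sliding-window limiter under that assumption; `KeyCreationLimiter` is a hypothetical helper, not part of any Google client library, and the service still enforces its own quota regardless.

```python
import time
from collections import deque


class KeyCreationLimiter:
    """Allow at most max_calls calls within any window of period_s seconds."""

    def __init__(self, max_calls=10, period_s=60.0, clock=time.monotonic):
        self._max_calls = max_calls
        self._period_s = period_s
        self._clock = clock
        self._calls = deque()  # Timestamps of recent accepted calls.

    def try_acquire(self) -> bool:
        """Record a call and return True if it fits in the current window."""
        now = self._clock()
        # Forget calls that have left the sliding window.
        while self._calls and now - self._calls[0] >= self._period_s:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            self._calls.append(now)
            return True
        return False
```

A caller would check `try_acquire()` before each key-creation request and wait or queue the request when it returns `False`.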

import json

import requests

def create_instant_custom_voice_key(
    access_token, project_id, reference_audio_bytes, consent_audio_bytes
):
    """Request a voice cloning key. Returns the key string, or None on error.

    Both audio arguments are sent as JSON "content" values, so pass them as
    base64-encoded strings rather than raw bytes.
    """
    url = "https://texttospeech.googleapis.com/v1beta1/voices:generateVoiceCloningKey"

    request_body = {
        "reference_audio": {
            # Supported audio_encoding values are LINEAR16, PCM, MP3, and M4A.
            "audio_config": {"audio_encoding": "LINEAR16"},
            "content": reference_audio_bytes,
        },
        "voice_talent_consent": {
            # Supported audio_encoding values are LINEAR16, PCM, MP3, and M4A.
            "audio_config": {"audio_encoding": "LINEAR16"},
            "content": consent_audio_bytes,
        },
        # The consent script must be the provided one for language_code.
        "consent_script": "I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model.",
        "language_code": "en-US",
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8",
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        return response_json.get("voiceCloningKey")

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
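
The function above places the audio in JSON `content` fields, and JSON carries binary data as base64 text. The following sketch shows one way to prepare the inputs and to store the resulting key. `load_audio_as_base64` and `save_voice_cloning_key` are hypothetical helpers, and the file names in the usage comment are placeholders.

```python
import base64
from pathlib import Path


def load_audio_as_base64(path: str) -> str:
    """Read an audio file and return its bytes as a base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def save_voice_cloning_key(key: str, path: str = "voice_cloning_key.txt") -> None:
    """Store the key client-side so later synthesis requests can reuse it."""
    Path(path).write_text(key, encoding="utf-8")


# Usage sketch (placeholders; access_token and project_id come from your
# own environment, for example from `gcloud auth print-access-token`):
#
# key = create_instant_custom_voice_key(
#     access_token,
#     project_id,
#     load_audio_as_base64("reference.wav"),
#     load_audio_as_base64("consent.wav"),
# )
# if key:
#     save_voice_cloning_key(key)
```

Saving the key to `voice_cloning_key.txt` matches the file name that the Python client library examples on this page read from.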

Synthesize with an instant custom voice using the REST API

Use the voice cloning key to synthesize audio with the REST API.

import base64
import json

import requests
from IPython.display import Audio, display

def synthesize_text_with_cloned_voice(access_token, project_id, voice_key, text):
    url = "https://texttospeech.googleapis.com/v1beta1/text:synthesize"

    request_body = {
        "input": {
            "text": text
        },
        "voice": {
            "language_code": "en-US",
            "voice_clone": {
                "voice_cloning_key": voice_key,
            }
        },
        "audioConfig": {
            # Output formats listed under Technical details: LINEAR16 (default),
            # ALAW, MULAW, OGG_OPUS, and PCM.
            "audioEncoding": "LINEAR16",
        }
    }

    try:
        headers = {
            "Authorization": f"Bearer {access_token}",
            "x-goog-user-project": project_id,
            "Content-Type": "application/json; charset=utf-8"
        }

        response = requests.post(url, headers=headers, json=request_body)
        response.raise_for_status()

        response_json = response.json()
        audio_content = response_json.get("audioContent")

        if audio_content:
            display(Audio(base64.b64decode(audio_content), rate=24000))
        else:
            print("Error: Audio content not found in the response.")
            print(response_json)

    except requests.exceptions.RequestException as e:
        print(f"Error making API request: {e}")
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON response: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
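
The function above plays the result inline with IPython, which only works in a notebook. Outside a notebook, the decoded audio can be written to a file instead. This is a sketch under the assumption that the request used `LINEAR16`, for which the API's `AudioEncoding` reference states that the returned audio includes a WAV header; `save_audio_content` is a hypothetical helper name.

```python
import base64


def save_audio_content(response_json: dict, output_path: str) -> int:
    """Decode the base64 audioContent field and write it to output_path.

    Returns the number of bytes written.
    """
    audio_content = response_json.get("audioContent")
    if not audio_content:
        raise ValueError("Response does not contain audioContent")
    audio_bytes = base64.b64decode(audio_content)
    with open(output_path, "wb") as out:
        out.write(audio_bytes)
    return len(audio_bytes)
```

For example, `save_audio_content(response.json(), "output.wav")` could replace the `display(Audio(...))` call when running as a script.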

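Synthesize with an instant custom voice using the Python client library

This example uses the Python client library to synthesize speech with an instant custom voice, using a voice cloning key saved in the file voice_cloning_key.txt. To generate a voice cloning key, see Create an instant custom voice using the REST API.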

from google.cloud import texttospeech
from google.cloud.texttospeech_v1beta1.services.text_to_speech import client


def perform_voice_cloning(
    voice_cloning_key: str,
    transcript: str,
    language_code: str,
    synthesis_output_path: str,
    tts_client: client.TextToSpeechClient,
) -> None:
  """Perform voice cloning and write output to a file.

  Args:
    voice_cloning_key: The voice cloning key.
    transcript: The transcript to synthesize.
    language_code: The language code.
    synthesis_output_path: The synthesis audio output path.
    tts_client: The TTS client to use.
  """
  voice_clone_params = texttospeech.VoiceCloneParams(
      voice_cloning_key=voice_cloning_key
  )
  voice = texttospeech.VoiceSelectionParams(
      language_code=language_code, voice_clone=voice_clone_params
  )
  request = texttospeech.SynthesizeSpeechRequest(
      input=texttospeech.SynthesisInput(text=transcript),
      voice=voice,
      audio_config=texttospeech.AudioConfig(
          audio_encoding=texttospeech.AudioEncoding.LINEAR16,
          sample_rate_hertz=24000,
      ),
  )
  response = tts_client.synthesize_speech(request)
  with open(synthesis_output_path, 'wb') as out:
    out.write(response.audio_content)
    print(f'Audio content written to file {synthesis_output_path}.')


if __name__ == '__main__':
  client = texttospeech.TextToSpeechClient()
  with open('voice_cloning_key.txt', 'r') as f:
    # Strip surrounding whitespace, such as a trailing newline in the file.
    key = f.read().strip()
  perform_voice_cloning(
      voice_cloning_key=key,
      transcript='Hello world!',
      language_code='en-US',
      synthesis_output_path='/tmp/output.wav',
      tts_client=client,
  )

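Stream synthesis with an instant custom voice using the Python client library

This example uses the Python client library to perform streaming synthesis with an instant custom voice, using a voice cloning key saved in voice_cloning_key.txt. To generate a voice cloning key, see Create an instant custom voice using the REST API.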

import io
import wave
from google.cloud import texttospeech
from google.cloud.texttospeech_v1beta1.services.text_to_speech import client


def perform_voice_cloning_with_simulated_streaming(
    voice_cloning_key: str,
    simulated_streamed_text: list[str],
    language_code: str,
    synthesis_output_path: str,
    tts_client: client.TextToSpeechClient,
) -> None:
  """Stream text chunks through a cloned voice and write the audio to a WAV file.

  Args:
    voice_cloning_key: The voice cloning key.
    simulated_streamed_text: The list of transcripts to synthesize, where each
      item represents a chunk of streamed text. This is used to simulate
      streamed text input and is not meant to be representative of real-world
      streaming usage.
    language_code: The language code.
    synthesis_output_path: The path to write the synthesis audio output to.
    tts_client: The TTS client to use.
  """
  voice_clone_params = texttospeech.VoiceCloneParams(
      voice_cloning_key=voice_cloning_key
  )
  streaming_config = texttospeech.StreamingSynthesizeConfig(
      voice=texttospeech.VoiceSelectionParams(
          language_code=language_code, voice_clone=voice_clone_params
      ),
      streaming_audio_config=texttospeech.StreamingAudioConfig(
          audio_encoding=texttospeech.AudioEncoding.PCM,
          sample_rate_hertz=24000,
      ),
  )
  config_request = texttospeech.StreamingSynthesizeRequest(
      streaming_config=streaming_config
  )

  # Request generator. Consider using Gemini or another LLM with output
  # streaming as a generator.
  def request_generator():
    yield config_request
    for text in simulated_streamed_text:
      yield texttospeech.StreamingSynthesizeRequest(
          input=texttospeech.StreamingSynthesisInput(text=text)
      )

  streaming_responses = tts_client.streaming_synthesize(request_generator())
  audio_buffer = io.BytesIO()
  for response in streaming_responses:
    print(f'Audio content size in bytes is: {len(response.audio_content)}')
    audio_buffer.write(response.audio_content)

  # Write collected audio outputs to a WAV file.
  with wave.open(synthesis_output_path, 'wb') as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(24000)
    wav_file.writeframes(audio_buffer.getvalue())
    print(f'Audio content written to file {synthesis_output_path}.')


if __name__ == '__main__':
  client = texttospeech.TextToSpeechClient()
  with open('voice_cloning_key.txt', 'r') as f:
    # Strip surrounding whitespace, such as a trailing newline in the file.
    key = f.read().strip()
  perform_voice_cloning_with_simulated_streaming(
      voice_cloning_key=key,
      simulated_streamed_text=[
          'Hello world!',
          'This is the second text chunk.',
          'This simulates streaming text for synthesis.',
      ],
      language_code='en-US',
      synthesis_output_path='streaming_output.wav',
      tts_client=client,
  )
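
The example above feeds a hard-coded list of chunks to the request generator. When the input is one long string, it has to be split into chunks first. The following is a naive sketch that splits on sentence-ending punctuation; `sentence_chunks` is a hypothetical helper, and the regular expression does not handle abbreviations or non-Latin punctuation.

```python
import re
from typing import Iterator


def sentence_chunks(text: str) -> Iterator[str]:
    """Yield sentences split after '.', '!' or '?' followed by whitespace."""
    for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
        if chunk:
            yield chunk
```

The result can be passed to the function above as `simulated_streamed_text=list(sentence_chunks(long_text))`.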

Use Chirp 3: HD voice controls

Instant custom voice supports the same pace control, pause control, and custom pronunciation features that Chirp 3: HD voices support. For more information, see Chirp 3: HD voice controls.

You can enable all three features for instant custom voice by adjusting the SynthesizeSpeechRequest or StreamingSynthesizeConfig in the same way as you would for Chirp 3: HD voices.

Languages that support voice controls

Pace control is available in all locales.
Pause control is available in all locales.
Custom pronunciation is available in all locales except bn-IN, gu-IN, th-TH, and vi-VN.
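
The limits stated on this page (a pace between 0.25x and 2x, and four locales without custom pronunciation) can be checked before a request is built. This is a small sketch with hypothetical helper names; it validates values only and makes no API calls. The validated pace would then be passed wherever the Chirp 3: HD voice controls documentation specifies, for instance as the `speaking_rate` field of `AudioConfig`.

```python
PACE_MIN, PACE_MAX = 0.25, 2.0  # Pace range from the Technical details table.

# Locales listed above as not supporting custom pronunciation.
NO_CUSTOM_PRONUNCIATION = frozenset({"bn-IN", "gu-IN", "th-TH", "vi-VN"})


def validated_pace(speaking_rate: float) -> float:
    """Return speaking_rate unchanged, or raise ValueError if out of range."""
    if not PACE_MIN <= speaking_rate <= PACE_MAX:
        raise ValueError(
            f"Pace {speaking_rate} is outside {PACE_MIN} to {PACE_MAX}"
        )
    return speaking_rate


def supports_custom_pronunciation(language_code: str) -> bool:
    """Return False for the locales excluded from custom pronunciation."""
    return language_code not in NO_CUSTOM_PRONUNCIATION
```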

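Enable multilingual transfer

Instant custom voice supports multilingual transfer for specific locale pairs. This means that a voice cloning key generated with one language code, such as en-US, can be used to synthesize speech in a different language, such as es-ES. A voice cloning key with locale en-US can synthesize output in the following locales: de-DE, es-US, es-ES, fr-CA, fr-FR, and pt-BR.

This code sample shows how to configure a SynthesizeSpeechRequest to synthesize es-ES speech using an en-US voice cloning key.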

voice_clone_params = texttospeech.VoiceCloneParams(
    voice_cloning_key=en_us_voice_cloning_key
)
request = texttospeech.SynthesizeSpeechRequest(
  input=texttospeech.SynthesisInput(text=transcript),
  voice=texttospeech.VoiceSelectionParams(
      language_code='es-ES', voice_clone=voice_clone_params
  ),
  audio_config=texttospeech.AudioConfig(
      audio_encoding=texttospeech.AudioEncoding.LINEAR16,
      sample_rate_hertz=24000,
  ),
)

Example StreamingSynthesizeConfig for synthesizing es-ES speech using an en-US voice cloning key:

voice_clone_params = texttospeech.VoiceCloneParams(
    voice_cloning_key=en_us_voice_cloning_key
)
streaming_config = texttospeech.StreamingSynthesizeConfig(
    voice=texttospeech.VoiceSelectionParams(
        language_code='es-ES', voice_clone=voice_clone_params
    ),
    streaming_audio_config=texttospeech.StreamingAudioConfig(
        audio_encoding=texttospeech.AudioEncoding.PCM,
        sample_rate_hertz=24000,
    ),
)
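
Because transfer is only available for specific locale pairs, a client may want to check the pair before it sends a request. The following sketch encodes the en-US transfer targets listed in this section; `TRANSFER_TARGETS` and `can_synthesize` are hypothetical names, and treating a matching locale as always allowed is an assumption that the key's own locale is a supported language.

```python
# Transfer targets stated in this section, keyed by the key's locale.
TRANSFER_TARGETS = {
    "en-US": frozenset({"de-DE", "es-US", "es-ES", "fr-CA", "fr-FR", "pt-BR"}),
}


def can_synthesize(key_locale: str, target_locale: str) -> bool:
    """Return True if a key created for key_locale can target target_locale."""
    if key_locale == target_locale:
        return True  # Assumption: no transfer is needed for the same locale.
    return target_locale in TRANSFER_TARGETS.get(key_locale, frozenset())
```

For example, `can_synthesize("en-US", "es-ES")` is true, while a pair such as `("en-US", "ja-JP")` is not listed on this page and is rejected.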