This document provides a high-level architecture for an application that uses AI to generate podcasts based on audio input.
The intended audience for this document includes architects, developers, and administrators who build and manage generative AI applications in the cloud for the media and marketing industries. The document assumes that you have a foundational understanding of generative AI.
The Deployment section of this document provides code samples for generative AI workloads that involve multi-modal input and output formats.
Architecture
The following diagram shows an architecture for a podcast producer application in Google Cloud. The application uses AI to generate podcasts from audio files, such as live commentary for a sports event.
The architecture shows the following flow:
- A user uploads audio files to a Cloud Storage bucket.
- Eventarc triggers a Cloud Run service.
- The Cloud Run service sends the audio files to Speech-to-Text.
- Speech-to-Text produces time-stamped transcripts of the audio files.
- The Cloud Run service sends the transcripts to the Gemini API in Vertex AI, with a prompt to generate a script for a podcast.
  For example, the prompt could be to generate a script for a 15-minute podcast about the highlights of a sports event based on certain keywords in the commentary.
- Gemini generates a draft of a podcast script.
- The Cloud Run service sends the draft script to the user.
- The user reviews and edits the draft script and then sends the final script to Text-to-Speech.
- Text-to-Speech produces a podcast audio file.
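The transcription and script-drafting steps of this flow can be sketched in code. The following Python example is a minimal illustration, not the implementation behind this architecture. The names `Segment`, `format_transcript`, `build_prompt`, and `draft_script` are hypothetical, and the `transcribe` and `generate` parameters are placeholder callables standing in for calls to Speech-to-Text and the Gemini API, so that the sketch runs without any Google Cloud client libraries.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Segment:
    """One time-stamped piece of a transcript (hypothetical shape)."""
    start_seconds: float
    text: str


def format_transcript(segments: Sequence[Segment]) -> str:
    """Render segments as '[MM:SS] text' lines, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start_seconds):
        minutes, seconds = divmod(int(seg.start_seconds), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg.text}")
    return "\n".join(lines)


def build_prompt(
    segments: Sequence[Segment],
    podcast_minutes: int,
    keywords: Sequence[str] = (),
) -> str:
    """Combine the instructions and the transcript into one prompt string."""
    parts = [
        f"Write a script for a {podcast_minutes}-minute podcast "
        "based on the time-stamped transcript below."
    ]
    if keywords:
        # Assumption: keywords steer the model toward the highlights.
        parts.append("Focus on moments related to: " + ", ".join(keywords) + ".")
    parts.append("Transcript:")
    parts.append(format_transcript(segments))
    return "\n\n".join(parts)


def draft_script(
    audio_uri: str,
    transcribe: Callable[[str], Sequence[Segment]],
    generate: Callable[[str], str],
    podcast_minutes: int = 15,
    keywords: Sequence[str] = (),
) -> str:
    """Transcribe the audio, build a prompt, and return the draft script.

    `transcribe` and `generate` are injected placeholders for the
    Speech-to-Text and Gemini calls; they are not real client APIs.
    """
    segments = transcribe(audio_uri)
    prompt = build_prompt(segments, podcast_minutes, keywords)
    return generate(prompt)
```

Passing the service calls in as parameters keeps the orchestration logic testable with simple stand-in functions. The human review step and the final call to Text-to-Speech are intentionally left out of the sketch, because the flow places them after the draft is returned to the user.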
Products used
This example architecture uses the following Google Cloud products:
- Speech-to-Text: An API that uses Google's speech recognition technologies to transcribe audio to text.
- Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
- Text-to-Speech: An API to create natural-sounding, synthetic human speech from text.
- Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
- Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
- Eventarc: A serverless solution to asynchronously route messages triggered by events.
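To make the interaction between Cloud Storage, Eventarc, and Cloud Run more concrete, the following Python sketch shows one way a service might turn the data of a Cloud Storage upload event into an object URI to pass to Speech-to-Text. It assumes that the event data has already been decoded into a dictionary with `bucket`, `name`, and `contentType` fields. The function name `audio_uri_from_event` and the rule of skipping non-audio uploads are illustrative assumptions, not part of this architecture.

```python
from typing import Mapping, Optional


def audio_uri_from_event(event_data: Mapping[str, object]) -> Optional[str]:
    """Return a gs:// URI for an uploaded audio object, or None to skip it.

    Assumes `event_data` is the decoded data of a Cloud Storage
    object event with `bucket`, `name`, and `contentType` fields.
    """
    bucket = event_data.get("bucket")
    name = event_data.get("name")
    if not bucket or not name:
        raise ValueError("event data is missing 'bucket' or 'name'")

    # Assumption: only objects with an audio content type are processed.
    content_type = str(event_data.get("contentType", ""))
    if not content_type.startswith("audio/"):
        return None

    return f"gs://{bucket}/{name}"
```

For example, event data for an object named `games/final.wav` with content type `audio/wav` in a bucket named `commentary` would yield `gs://commentary/games/final.wav`, while a text file uploaded to the same bucket would be skipped.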
Deployment
To experiment with using Google Cloud products for workloads that involve multi-modal input and output formats such as audio and text, try the following code samples:
- Generate a transcript of an audio interview.
- Generate a multi-speaker podcast by using Gemini and the Text-to-Speech API.
- Record audio and generate a translation.
What's next
- Explore more generative AI architecture guides.
- For an overview of architectural principles and recommendations that are specific to AI and ML workloads in Google Cloud, see the AI and ML perspective in the Well-Architected Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.
Contributors
Author: Kumar Dhanagopal | Cross-Product Solution Developer
Other contributors:
- Amina Mansour | Head of Cloud Platform Evaluations Team
- Megan O'Keefe | Developer Advocate
- Samantha He | Technical Writer
- Shir Meir Lador | Developer Relations Engineering Manager