Generative AI use case: Generate podcasts from audio files

Last reviewed 2025-12-12 UTC

This document provides a high-level architecture for an application that uses AI to generate podcasts based on audio input.

The intended audience for this document includes architects, developers, and administrators who build and manage generative AI applications in the cloud for the media and marketing industries. The document assumes that you have a foundational understanding of generative AI.

The Deployment section of this document provides code samples for generative AI workloads that involve multi-modal input and output formats.

Architecture

The following diagram shows an architecture for a podcast producer application in Google Cloud. The application uses AI to generate podcasts from audio files, such as live commentary for a sports event.

Architecture for a generative AI application that generates podcasts from audio files.

The architecture shows the following flow:

A user uploads audio files to a Cloud Storage bucket.
Eventarc triggers a Cloud Run service.
The Cloud Run service sends the audio files to Speech-to-Text.
Speech-to-Text produces time-stamped transcripts of the audio files.
The Cloud Run service sends the transcripts to the Gemini API in Vertex AI, along with a prompt to generate a podcast script.

For example, the prompt could be to generate a script for a 15-minute podcast about the highlights of a sports event based on certain keywords in the commentary.
Gemini generates a draft of a podcast script.
The Cloud Run service sends the draft script to the user.
The user reviews and edits the draft script and then sends the final script to Text-to-Speech.
Text-to-Speech produces a podcast audio file.
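The script-drafting part of the preceding flow (transcription, prompt construction, and draft generation) can be sketched as follows. This is a minimal illustration, not the implementation behind this architecture. The Segment type, the function names, and the prompt wording are all hypothetical. The calls to Speech-to-Text and to the Gemini API are passed in as plain callables, so the sketch doesn't assume any particular client library signature.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    """One time-stamped piece of a transcript (hypothetical shape)."""
    start_seconds: float
    text: str


def format_transcript(segments: Sequence[Segment]) -> str:
    """Render each segment as a '[MM:SS] text' line."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg.start_seconds), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg.text}")
    return "\n".join(lines)


def build_prompt(segments: Sequence[Segment],
                 keywords: Sequence[str],
                 target_minutes: int = 15) -> str:
    """Combine the instructions and the transcript into one prompt string."""
    keyword_list = ", ".join(keywords)
    return (
        f"Write a script for a {target_minutes}-minute podcast that covers "
        f"the highlights of the event below. Focus on moments related to "
        f"these keywords: {keyword_list}.\n\n"
        f"Time-stamped transcript:\n{format_transcript(segments)}"
    )


def draft_podcast_script(audio_uri: str,
                         transcribe: Callable[[str], List[Segment]],
                         generate: Callable[[str], str],
                         keywords: Sequence[str],
                         target_minutes: int = 15) -> str:
    """Transcribe the audio, build the prompt, and return the draft script.

    `transcribe` stands in for a Speech-to-Text call and `generate` stands in
    for a Gemini API call; both are assumptions supplied by the caller.
    """
    segments = transcribe(audio_uri)
    prompt = build_prompt(segments, keywords, target_minutes)
    return generate(prompt)
```

Keeping the service calls behind callables lets the orchestration logic be tested with stubs. The draft that this function returns is what the user reviews and edits before sending the final script to Text-to-Speech.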

Products used

This example architecture uses the following Google Cloud products:

Speech-to-Text: An API that uses Google's speech recognition technologies to transcribe audio to text.
Vertex AI: An ML platform that lets you train and deploy ML models and AI applications, and customize LLMs for use in AI-powered applications.
Text-to-Speech: An API to create natural-sounding, synthetic human speech from text.
Cloud Storage: A low-cost, no-limit object store for diverse data types. Data can be accessed from within and outside Google Cloud, and it's replicated across locations for redundancy.
Cloud Run: A serverless compute platform that lets you run containers directly on top of Google's scalable infrastructure.
Eventarc: A serverless solution to asynchronously route messages triggered by events.
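As an illustration of how Cloud Storage, Eventarc, and Cloud Run might connect, the following sketch turns the data of an upload event into a Cloud Storage URI that the Cloud Run service could pass to Speech-to-Text. It assumes that the event data is a dictionary with bucket, name, and contentType fields, as in the Cloud Storage object resource. The function name and the audio-only filter are hypothetical, so verify the payload shape against the Eventarc documentation before relying on it.

```python
from typing import Mapping, Optional


def audio_uri_from_event(event_data: Mapping[str, str]) -> Optional[str]:
    """Return a gs:// URI for an uploaded audio object, or None to skip it.

    Assumes `event_data` carries `bucket`, `name`, and `contentType` keys.
    """
    bucket = event_data.get("bucket")
    name = event_data.get("name")
    content_type = event_data.get("contentType", "")
    # Skip events that don't identify an object.
    if not bucket or not name:
        return None
    # Skip uploads that aren't audio, so that only audio files are transcribed.
    if not content_type.startswith("audio/"):
        return None
    return f"gs://{bucket}/{name}"
```

For example, an event for an object named game/commentary.wav with the content type audio/wav in a bucket named b yields gs://b/game/commentary.wav, and an event for a text file yields None.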

Deployment

To experiment with using Google Cloud products for workloads that involve multi-modal input and output formats such as audio and text, try the following code samples:

What's next

Explore more generative AI architecture guides.
For an overview of architectural principles and recommendations that are specific to AI and ML workloads in Google Cloud, see the AI and ML perspective in the Well-Architected Framework.
For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.

Contributors

Author: Kumar Dhanagopal | Cross-Product Solution Developer

Other contributors:

Amina Mansour | Head of Cloud Platform Evaluations Team
Megan O'Keefe | Developer Advocate
Samantha He | Technical Writer
Shir Meir Lador | Developer Relations Engineering Manager
