Generate text from a multimodal prompt

This example shows how to generate text from a multimodal prompt with the Gemini model. The prompt consists of two local images and one text instruction. The model returns a text response that lists the objects found in the images.

Code sample

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Go API reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import (
	"context"
	"fmt"
	"io"
	"os"

	genai "google.golang.org/genai"
)

// generateWithMultiLocalImages shows how to generate text using multiple local image inputs.
func generateWithMultiLocalImages(w io.Writer) error {
	ctx := context.Background()

	client, err := genai.NewClient(ctx, &genai.ClientConfig{
		HTTPOptions: genai.HTTPOptions{APIVersion: "v1"},
	})
	if err != nil {
		return fmt.Errorf("failed to create genai client: %w", err)
	}

	// Read local image files
	image1, err := os.ReadFile("latte.jpg")
	if err != nil {
		return fmt.Errorf("failed to read image1: %w", err)
	}
	image2, err := os.ReadFile("scones.jpg")
	if err != nil {
		return fmt.Errorf("failed to read image2: %w", err)
	}

	modelName := "gemini-2.5-flash"
	contents := []*genai.Content{
		{
			Role: "user",
			Parts: []*genai.Part{
				{Text: "Generate a list of all the objects contained in both images."},
				{InlineData: &genai.Blob{
					MIMEType: "image/jpeg",
					Data:     image1,
				}},
				{InlineData: &genai.Blob{
					MIMEType: "image/jpeg",
					Data:     image2,
				}},
			},
		},
	}

	// Call the model
	resp, err := client.Models.GenerateContent(ctx, modelName, contents, nil)
	if err != nil {
		return fmt.Errorf("failed to generate content: %w", err)
	}

	fmt.Fprintln(w, resp.Text())

	// Example response:
	// Here is a list of all the distinct objects found in both images:
	// 1.  **Coffee** (in mugs/cups; one is clearly a latte with heart art, others are also coffee/latte)
	// 2.  **Mug(s)/Cup(s)** (yellow in the top image, white in the bottom image)
	// 3.  **Cake** (sliced, in the top image)
	// 4.  **Plate** (white, under the cake slice in the top image)
	// 5.  **Fork** (partially visible on the plate in the top image)
	// 6.  **Scones/Biscuits** (blueberry, in the bottom image)
	// 7.  **Blueberries** (scattered and in a bowl in the bottom image)
	// 8.  **Bowl** (small, dark, holding blueberries in the bottom image)
	// 9.  **Spoon** (silver, with "LET'S JAM" inscription, in the bottom image)
	// 10. **Flowers** (peonies, in the bottom image)
	// 11. **Leaves** (green, possibly mint, in the bottom image)
	// 12. **Paper** (parchment or wax paper, in the bottom image)
	// 13. **Table/Surface** (wooden in the top image, textured/painted in the bottom image)
	// ...

	return nil
}

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.


import com.google.genai.Client;
import com.google.genai.types.Content;
import com.google.genai.types.GenerateContentResponse;
import com.google.genai.types.HttpOptions;
import com.google.genai.types.Part;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TextGenerationWithMultiLocalImage {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String modelId = "gemini-2.5-flash";
    String localImageFilePath1 = "your/local/img1.jpg";
    String localImageFilePath2 = "your/local/img2.jpg";
    generateContent(modelId, localImageFilePath1, localImageFilePath2);
  }

  // Generates text using multiple local images
  public static String generateContent(
      String modelId, String localImageFilePath1, String localImageFilePath2) throws IOException {
    // Initialize client that will be used to send requests. This client only needs to be created
    // once, and can be reused for multiple requests.
    try (Client client =
        Client.builder()
            .location("global")
            .vertexAI(true)
            .httpOptions(HttpOptions.builder().apiVersion("v1").build())
            .build()) {

      // Read content from local files.
      byte[] localFileImg1Bytes = Files.readAllBytes(Paths.get(localImageFilePath1));
      byte[] localFileImg2Bytes = Files.readAllBytes(Paths.get(localImageFilePath2));

      GenerateContentResponse response =
          client.models.generateContent(
              modelId,
              Content.fromParts(
                  Part.fromBytes(localFileImg1Bytes, "image/jpeg"),
                  Part.fromBytes(localFileImg2Bytes, "image/jpeg"),
                  Part.fromText("Generate a list of all the objects contained in both images")),
              null);

      System.out.print(response.text());
      // Example response:
      // Based on both images, here are the objects contained in both:
      //
      // 1.  **Coffee cups (or mugs)**: Both images feature one or more cups containing a beverage.
      // 2.  **Coffee (or a similar beverage)**: Both images contain a liquid beverage in the cups,
      // appearing to be coffee or a coffee-like drink.
      // 3.  **Table (or a flat surface)**: Both compositions are set on a flat surface, likely a
      // table or countertop.
      return response.text();
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.

const {GoogleGenAI} = require('@google/genai');
const fs = require('fs');

const GOOGLE_CLOUD_PROJECT = process.env.GOOGLE_CLOUD_PROJECT;
const GOOGLE_CLOUD_LOCATION = process.env.GOOGLE_CLOUD_LOCATION || 'global';

function loadImageAsBase64(path) {
  const bytes = fs.readFileSync(path);
  return bytes.toString('base64');
}

async function generateContent(
  projectId = GOOGLE_CLOUD_PROJECT,
  location = GOOGLE_CLOUD_LOCATION,
  imagePath1,
  imagePath2
) {
  const client = new GoogleGenAI({
    vertexai: true,
    project: projectId,
    location: location,
  });

  // TODO(Developer): Pass the paths of your local images as imagePath1 and imagePath2
  const image1 = loadImageAsBase64(imagePath1);
  const image2 = loadImageAsBase64(imagePath2);

  const response = await client.models.generateContent({
    model: 'gemini-2.5-flash',
    contents: [
      {
        role: 'user',
        parts: [
          {
            text: 'Generate a list of all the objects contained in both images.',
          },
          {
            inlineData: {
              data: image1,
              mimeType: 'image/jpeg',
            },
          },
          {
            inlineData: {
              data: image2,
              mimeType: 'image/jpeg',
            },
          },
        ],
      },
    ],
  });

  console.log(response.text);

  // Example response:
  //  Okay, here's a jingle combining the elements of both sets of images, focusing on ...
  //  ...

  return response.text;
}

Python

Before trying this sample, follow the Python setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Python API reference documentation.
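The Python sample creates its client without naming a project or location, so it relies on settings from the environment. According to the Google Gen AI SDK documentation, those settings are the `GOOGLE_CLOUD_PROJECT`, `GOOGLE_CLOUD_LOCATION`, and `GOOGLE_GENAI_USE_VERTEXAI` environment variables. The following is a minimal sketch of a pre-flight check, not part of the official sample. The helper name `missing_vertex_env` is hypothetical, and treating all three variables as required is an assumption.

```python
import os

# Environment variables the Google Gen AI SDK documents for Vertex AI use.
# Treating all three as mandatory is an assumption made for this sketch.
REQUIRED_VARS = (
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
    "GOOGLE_GENAI_USE_VERTEXAI",
)


def missing_vertex_env(env=None):
    """Return the names of required variables that are unset or empty.

    `missing_vertex_env` is a hypothetical helper, not part of the SDK.
    Pass a dict as `env` to check something other than the real environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vertex_env()
    if missing:
        print("Set these variables before running the sample:", ", ".join(missing))
    else:
        print("Vertex AI environment variables are set.")
```

Running this check first reports a missing setting by name, which is usually easier to act on than an error raised during client creation.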

from google import genai
from google.genai.types import HttpOptions, Part

client = genai.Client(http_options=HttpOptions(api_version="v1"))
# TODO(Developer): Update the below file paths to your images
image_path_1 = "path/to/your/image1.jpg"
image_path_2 = "path/to/your/image2.jpg"
with open(image_path_1, "rb") as f:
    image_1_bytes = f.read()
with open(image_path_2, "rb") as f:
    image_2_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        "Generate a list of all the objects contained in both images.",
        Part.from_bytes(data=image_1_bytes, mime_type="image/jpeg"),
        Part.from_bytes(data=image_2_bytes, mime_type="image/jpeg"),
    ],
)
print(response.text)
# Example response:
# Okay, here's a jingle combining the elements of both sets of images, focusing on ...
# ...
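
The samples above hard-code `image/jpeg` as the MIME type for both images. If your local files might be PNG or another format, one possible approach is to derive the MIME type from the file extension with Python's standard `mimetypes` module. This is a sketch under that assumption. `guess_image_mime_type` is a hypothetical helper, not part of the SDK, and it inspects only the file name, not the file contents.

```python
import mimetypes
from pathlib import Path


def guess_image_mime_type(path):
    """Guess an image MIME type from the file extension.

    Hypothetical helper for illustration. It raises ValueError when the
    extension is unknown or does not map to an image type.
    """
    mime_type, _ = mimetypes.guess_type(Path(path).name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    return mime_type


if __name__ == "__main__":
    print(guess_image_mime_type("latte.jpg"))   # image/jpeg
    print(guess_image_mime_type("scones.png"))  # image/png
```

The returned string can then be passed as the `mime_type` argument of `Part.from_bytes` in place of the literal `"image/jpeg"`.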

What's next

To search and filter code samples for other Google Cloud products, see the Google Cloud sample browser.
