Build a multimodal ordering experience using the streaming API

This guide provides instructions and best practices for engineers building food ordering experiences with the FoodOrderingService.BidiProcessOrder RPC method. This real-time, bidirectional streaming API is the core of the Food Ordering AI Agent, enabling dynamic, conversational order-taking in various applications such as mobile apps, voice assistants, drive-thrus, and kiosks.

Overview of BidiProcessOrder

The BidiProcessOrder method establishes a persistent, two-way communication channel between your client application and the Food Ordering AI Agent. Unlike standard unary request-response RPCs, this streaming approach allows for:

  • Low-latency interaction: Continuous exchange of information without the overhead of repeated HTTP requests.
  • Multimodal input: Handling of audio streams (for voice ordering), text inputs, and client-side events.
  • Real-time responses: The agent can send back audio, text, order updates, and other signals as the conversation unfolds.

BidiProcessOrder cannot be invoked using REST. Integrations must use a connection-oriented protocol:

  • gRPC (Recommended): Provides a robust and efficient framework for bidirectional streaming.
  • WebSocket: Suitable for clients or environments where gRPC isn't a fit due to programming language or network constraints.

Refer to the BidiProcessOrder API Reference for detailed type definitions. WebSocket integrations use JSON representations of these types, as described in the WebSocket section.

Prerequisites

Before integrating with BidiProcessOrder:

  1. Enable the API: Ensure the Food Ordering AI Agent API is enabled in your Google Cloud project:

       gcloud services enable foodorderingaiagent.googleapis.com --project=PROJECT_ID

  2. Authentication: Decide on your authentication approach and set up any necessary service accounts and IAM roles, as described in Authentication.
  3. Menu Ingestion: A valid Menu must be ingested and associated with a Store. See Integrating Menu Data for details.

Authentication

To securely connect to the BidiProcessOrder RPC, your application must authenticate using a Google Cloud Service Account.

1. Configure a Service Account

  • Create a Service Account: In your Google Cloud project, create a Service Account that your application will use to authenticate to the Food Ordering AI Agent API. See Creating and managing service accounts.
  • Grant IAM Roles: Grant the necessary IAM roles to this service account. The primary role required to call BidiProcessOrder is:

    • Food Ordering Agent User (roles/foodorderingaiagent.agentUser): Allows the service account to connect to the ordering service and process sessions.

    You can grant this role using the Google Cloud console or gcloud:

        gcloud projects add-iam-policy-binding PROJECT_ID \
          --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
          --role="roles/foodorderingaiagent.agentUser"

2. Application Authentication Flow

The exact authentication flow depends on your application architecture, especially whether the client application (e.g., mobile app, kiosk software) connects directly or through your own backend.

Common Scenario: Authenticating a consumer-facing client application

This is a typical pattern for mobile or web applications:

  1. Client-to-YourAuth: The end-user client app (mobile, web) authenticates with your existing user authentication system (this could be Firebase Authentication, your own OAuth server, etc.).
  2. Token Exchange: The client app, after authenticating the user, requests a short-lived token from a secure backend service you control (e.g., an "API Token Service").
  3. Access Token Generation: Your backend service, using the credentials of the Google Cloud Service Account principal configured in Step 1, generates a standard OAuth 2.0 access token for the https://www.googleapis.com/auth/cloud-platform scope. This can be done using the Google Cloud Authentication client libraries.

    • Security: Service account keys or credentials used to generate these tokens must be securely stored and managed on your backend. Never expose service account private keys directly to end-user client applications. See Best practices for managing service account keys.
  4. Token to Client: Your backend service returns the generated Google access token to the client app.

  5. API Call: The client app uses this Google access token to authenticate its gRPC or WebSocket connection to the BidiProcessOrder RPC.
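One possible shape for the backend token service in steps 2 to 4 is a small cache that mints a new access token only when the current one is close to expiry. This is a minimal sketch, not the service's prescribed design: `AccessTokenCache`, `mint`, and `refresh_margin_s` are hypothetical names, and `mint` stands in for whatever call your backend makes through the Google Cloud Authentication client libraries to obtain a token and its expiry time.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical signature: returns (access_token, expiry as Unix seconds).
MintFn = Callable[[], Tuple[str, float]]


class AccessTokenCache:
    """Caches a short-lived token and re-mints it shortly before expiry."""

    def __init__(self, mint: MintFn, refresh_margin_s: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._mint = mint
        self._refresh_margin_s = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh when no token is cached or the cached one expires soon.
        expiring = self._clock() >= self._expires_at - self._refresh_margin_s
        if self._token is None or expiring:
            self._token, self._expires_at = self._mint()
        return self._token
```

The service account credentials stay inside `mint` on the backend; only the string returned by `get()` is handed to the client app.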

3. Using the Token

  • gRPC: The Google gRPC client libraries typically handle token refreshing and inclusion in the call metadata when provided with service account credentials.
  • WebSocket (Non-Browser): Include the token in the Authorization: Bearer TOKEN header.
  • WebSocket (Browser): As noted in the WebSocket section, direct browser WebSocket connections cannot use Authorization headers. A server-side streaming proxy is needed to authenticate your client's connection to Google Cloud.

Connecting to the API

You can establish a stream using gRPC client libraries or a WebSocket connection.

gRPC

Using gRPC is the recommended approach. Use the client library for your language of choice (e.g., Java, Go, Python, Node.js); these libraries are based on the BidiProcessOrder API Reference.

The basic steps involve:

  1. Create a gRPC channel to the Food Ordering AI Agent API endpoint (e.g., foodorderingaiagent.googleapis.com).
  2. Obtain a client stub for FoodOrderingService.
  3. Invoke the BidiProcessOrder method, which returns a stream object for both sending requests and receiving responses.
  4. Implement business logic according to your use case which concurrently:
    • Sends audio, text, and event input from the end user.
    • Handles messages from the agent including audio, text, and events.
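The concurrency in step 4 can be sketched without any network code. In the example below, two in-memory asyncio queues stand in for the two directions of the stream, and `fake_agent` is a placeholder for the real service; all function names and the `agentText` and `endSession` keys are illustrative assumptions, not part of the client library. The point is only that sending and receiving run as separate tasks so neither blocks the other.

```python
import asyncio
from typing import Any, Dict, List


async def send_inputs(outbound: asyncio.Queue, inputs: List[Any]) -> None:
    """Pushes user inputs onto the stream, then signals that sending is done."""
    for message in inputs:
        await outbound.put(message)
    await outbound.put(None)  # sentinel: no more client messages


async def fake_agent(outbound: asyncio.Queue, inbound: asyncio.Queue) -> None:
    """Placeholder for the real service: acknowledges each input, then ends."""
    while True:
        message = await outbound.get()
        if message is None:
            break
        await inbound.put({"agentText": "ack: %s" % message})
    await inbound.put({"endSession": {}})


async def handle_responses(inbound: asyncio.Queue,
                           handled: List[Dict[str, Any]]) -> None:
    """Consumes agent messages until the session ends."""
    while True:
        message = await inbound.get()
        handled.append(message)
        if "endSession" in message:
            break


async def run_session(inputs: List[Any]) -> List[Dict[str, Any]]:
    outbound: asyncio.Queue = asyncio.Queue()
    inbound: asyncio.Queue = asyncio.Queue()
    handled: List[Dict[str, Any]] = []
    # Sender and receiver run concurrently over the same session.
    await asyncio.gather(
        send_inputs(outbound, inputs),
        fake_agent(outbound, inbound),
        handle_responses(inbound, handled),
    )
    return handled
```

In a real integration, the queues and `fake_agent` would be replaced by the stream object returned by your gRPC client library.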

WebSocket

For WebSocket connections, the URL path is:

wss://foodorderingaiagent.googleapis.com/ws/google.cloud.foodorderingaiagent.v1beta.FoodOrderingService/BidiProcessOrder/locations/LOCATION

  • LOCATION: e.g., us

Required Headers:

  • Authorization: Bearer TOKEN - Where TOKEN is an OAuth 2.0 access token obtained for your service account.
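As a small illustration, the URL and header above can be assembled with two helpers. The helper names are hypothetical; the host, path, and header format are copied from this section.

```python
from typing import Dict

WS_HOST = "foodorderingaiagent.googleapis.com"
WS_METHOD = ("google.cloud.foodorderingaiagent.v1beta."
             "FoodOrderingService/BidiProcessOrder")


def build_ws_url(location: str) -> str:
    """Returns the WebSocket URL for the given location, e.g. 'us'."""
    if not location:
        raise ValueError("location must be a non-empty string")
    return "wss://%s/ws/%s/locations/%s" % (WS_HOST, WS_METHOD, location)


def build_auth_headers(token: str) -> Dict[str, str]:
    """Returns the header carrying the OAuth 2.0 access token."""
    return {"Authorization": "Bearer %s" % token}
```

Pass the resulting URL and headers to whichever WebSocket client library your non-browser application uses.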

Message Format:

  • Client to Server: Messages sent to the API (e.g., Config, AudioInput, TextInput, EventInput) must be JSON representations of the BidiProcessOrderRequest proto, sent as websocket.TextMessage.
  • Server to Client: Messages received from the API (BidiProcessOrderResponse) will be sent as websocket.BinaryMessage, but the content of these binary messages is a JSON payload.
  • Binary Data: Binary data within the JSON payloads (e.g., customerAudio in AudioInput, agentAudio in AgentAudio) must be base64 encoded.
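The encoding rules above might look like the following in practice. Only `customerAudio` and `agentAudio` are named in this guide; the outer keys `audioInput` and `agentAudio` used to select the message type are assumptions based on the usual lowerCamelCase JSON form of proto field names, so check them against the API reference before relying on them.

```python
import base64
import json
from typing import Any, Dict, Optional


def encode_audio_input(pcm_chunk: bytes) -> str:
    """Builds the JSON text message for one chunk of customer audio."""
    payload = {
        # Assumed wrapper key for the AudioInput message.
        "audioInput": {
            "customerAudio": base64.b64encode(pcm_chunk).decode("ascii"),
        }
    }
    return json.dumps(payload)


def decode_server_message(frame: bytes) -> Dict[str, Any]:
    """Parses a binary frame from the server, whose content is JSON."""
    return json.loads(frame.decode("utf-8"))


def extract_agent_audio(message: Dict[str, Any]) -> Optional[bytes]:
    """Returns decoded agent audio bytes, or None for other message types."""
    agent_audio = message.get("agentAudio")  # assumed wrapper key
    if agent_audio is None:
        return None
    return base64.b64decode(agent_audio["agentAudio"])
```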

Session Lifecycle

Each call to BidiProcessOrder initiates a session. The session remains active as long as the stream is open.

  1. Initiation (Config Message):

    • Upon establishing the connection, the first message sent by the client must be a BidiProcessOrderRequest containing the Config message.
    • Required Fields in Config:
      • session: A unique client-generated session identifier. Format: projects/PROJECT/locations/LOCATION/sessions/SESSION_ID.
      • store: The resource name of the Store. Format: projects/PROJECT/locations/LOCATION/brands/BRAND/stores/STORE.
    • The agent uses the store to load the appropriate menu and configuration.
  2. Sending Inputs:

    • After the initial Config, the client can send a stream of BidiProcessOrderRequest messages containing one of the following inputs:
      • AudioInput: Raw audio data (typically 16-bit linear PCM at 16000 Hz, no headers). Used for voice interactions.
      • TextInput: Text messages from the user.
      • EventInput: Signals for events such as DriveOffEvent (for drive-thru use cases when the vehicle departs), CrewInterjectionEvent (for any situation wherein a human takes over the order taking role mid-conversation), or OrderStateUpdateEvent (if the order is modified on the client-side, e.g., using a touch interface).
  3. Receiving Responses:

    • Concurrently, the agent sends back a stream of BidiProcessOrderResponse messages. Your client must be prepared to handle various response types within the oneof response field:
      • AgentAudio: Synthesized audio bytes to be played to the user, used for voice interactions.
      • AgentText: Text version of the agent's response.
      • SpeechRecognition: Transcript of the recognized user speech.
      • UpdatedOrderState: Contains the complete current state of the customer's Order whenever it's updated by the agent. Use this to update your application's order representation. This should typically result in an update to a user interface or a system of record for order state information, such as a point of sale system.
      • InterruptionSignal: Indicates the user interrupted the agent's speech. The client should immediately stop playing any outgoing AgentAudio.
      • AgentEvent: Special events, such as RestartOrder, requiring client action.
      • SuggestedOptions: Provides contextually relevant options a user might select next, useful for display on a screen.
      • EndSession: Signals the session has been terminated by the agent (e.g., order complete, user drive-off, or agent escalation).
  4. Closing the Stream:

    • The stream can be closed by the client or the server. Typically, the server signals the end of a conversation using an EndSession message. The client should close the stream when this message is received.
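The initial Config message from step 1 can be sketched as follows, using the JSON form that WebSocket clients send. The resource name formats come from this section; the `config` wrapper key, the helper names, and the use of a random UUID as the session ID are assumptions made for illustration.

```python
import uuid
from typing import Any, Dict, Optional


def session_name(project: str, location: str,
                 session_id: Optional[str] = None) -> str:
    """Builds a session resource name, generating an ID if none is given."""
    session_id = session_id or uuid.uuid4().hex
    return "projects/%s/locations/%s/sessions/%s" % (
        project, location, session_id)


def store_name(project: str, location: str, brand: str, store: str) -> str:
    """Builds the resource name of the Store whose menu should be loaded."""
    return "projects/%s/locations/%s/brands/%s/stores/%s" % (
        project, location, brand, store)


def build_config_request(session: str, store: str) -> Dict[str, Any]:
    """Builds the first message that must be sent on a new stream."""
    return {"config": {"session": session, "store": store}}
```

Keep the generated session name for the lifetime of the conversation, since the same value is needed again if the client has to reconnect.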

Handling Specific Message Types

The following sections describe how to handle specific message types that your client sends or receives when calling BidiProcessOrder.

AudioInput

  • Stream audio in chunks as it becomes available.
  • Format: 16-bit linear PCM, 16000 Hz sample rate.
  • Audio chunks do not include the audio headers that typically prefix a WAV file.
  • For drive-thru scenarios with echo cancellation enabled (enable_echo_cancellation in Config), provide both customer_audio and crew_audio.
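To illustrate the format requirements, the sketch below strips the header from a WAV file and splits the raw samples into fixed-size chunks. The 100 ms chunk duration and the single-channel check are assumptions chosen for the example; this guide specifies only the sample width and sample rate.

```python
import io
import wave
from typing import Iterator

SAMPLE_RATE_HZ = 16000
SAMPLE_WIDTH_BYTES = 2  # 16-bit linear PCM


def pcm_from_wav(wav_bytes: bytes) -> bytes:
    """Returns the raw PCM samples of a WAV file, without its header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != SAMPLE_RATE_HZ:
            raise ValueError("expected 16000 Hz audio")
        if wav.getsampwidth() != SAMPLE_WIDTH_BYTES:
            raise ValueError("expected 16-bit samples")
        if wav.getnchannels() != 1:  # assumption: mono input
            raise ValueError("expected single-channel audio")
        return wav.readframes(wav.getnframes())


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yields consecutive chunks of roughly chunk_ms milliseconds each."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With live microphone capture there is no header to strip, and each captured buffer can be sent as its own AudioInput message as soon as it is available.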

UpdatedOrderState

  • This message provides the full state of the order each time it's sent. Replace any local cache of the order with the contents of the Order message received.
  • Use the custom_integration_attributes within the Order items and modifiers to map the Order content into equivalent entities within your application's system of record.

InterruptionSignal

  • Upon receiving, immediately halt playback of any AgentAudio and clear any buffered agent audio. This ensures a natural conversational flow when the user interrupts the agent's speech.
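A minimal sketch of the buffer handling described above, assuming your playback code pulls chunks from a shared queue. `AgentAudioBuffer` and its methods are hypothetical names; stopping the audio device itself is platform-specific and not shown.

```python
import threading
from collections import deque
from typing import Deque, Optional


class AgentAudioBuffer:
    """Thread-safe queue of agent audio chunks awaiting playback."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()
        self._lock = threading.Lock()

    def enqueue(self, chunk: bytes) -> None:
        """Called by the receiving thread for each AgentAudio message."""
        with self._lock:
            self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Called by the playback thread; returns None when empty."""
        with self._lock:
            return self._chunks.popleft() if self._chunks else None

    def on_interruption(self) -> int:
        """Drops all buffered audio and returns how many chunks were dropped."""
        with self._lock:
            dropped = len(self._chunks)
            self._chunks.clear()
            return dropped
```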

EndSession

  • Check the EndType (e.g., DRIVE_OFF, AGENT_ESCALATION).
  • Your application should gracefully close the connection and transition the user appropriately (e.g., notify a human supervisor in the case of AGENT_ESCALATION, or transition to an order confirmation state).
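One way to structure this decision is a function that maps the EndType to a list of follow-up actions. The action names are hypothetical, and the handling of DRIVE_OFF and of any other value is an assumption made for illustration; this guide prescribes only the supervisor notification for AGENT_ESCALATION and an order confirmation state as examples.

```python
from typing import List


def plan_after_end_session(end_type: str) -> List[str]:
    """Returns the client actions to take once EndSession is received."""
    actions = ["close_stream"]  # always close the connection gracefully
    if end_type == "AGENT_ESCALATION":
        actions.append("notify_supervisor")
    elif end_type == "DRIVE_OFF":
        # Assumption: a departed vehicle means the order is not finalized.
        actions.append("discard_pending_order")
    else:
        # Assumption: other end types represent a normally completed order.
        actions.append("show_order_confirmation")
    return actions
```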

Best Practices

  • Handle Messages Asynchronously: Minimize latency by using threads or non-blocking I/O to concurrently send requests and process incoming responses.
  • Reconnection Logic: Implement robust reconnection logic in case of network issues, remembering to send the initial Config message with the same session ID to attempt resumption.
  • Error Handling: Monitor the stream for errors. gRPC and WebSocket libraries provide mechanisms to detect stream closure or transport errors. Log these events and handle them gracefully.
  • Audio Buffering: Manage audio buffers carefully, implementing buffering if necessary, to ensure smooth playback of AgentAudio and timely delivery of AudioInput. Carefully consider the tradeoff between latency and playback quality when deciding your buffering scheme.
  • Session ID Management: Ensure session IDs are unique for each distinct order/conversation.
  • Resource Management: Close streams and release resources when the session is complete or if unrecoverable errors occur.
  • Timeouts: While the stream itself can be long-lived (up to 15 minutes by default), consider application-level timeouts for specific states if needed.
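The reconnection practice above could be sketched as a retry loop with exponential backoff that resends the same Config, and therefore the same session ID, on every attempt. `connect` is a hypothetical callable that opens the stream and sends the Config; the attempt count, the delays, and the use of ConnectionError as the failure signal are illustrative assumptions.

```python
import time
from typing import Any, Callable, Dict


def connect_with_retry(connect: Callable[[Dict[str, Any]], Any],
                       config: Dict[str, Any],
                       max_attempts: int = 5,
                       base_delay_s: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> Any:
    """Opens a stream, retrying with backoff and an unchanged Config."""
    for attempt in range(max_attempts):
        try:
            # The same config (and session ID) is reused on every attempt.
            return connect(config)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # unrecoverable: let the caller release resources
            sleep(base_delay_s * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```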

Example Integration Flow (Conceptual)

  1. Client App (e.g., Mobile App) initiates an order.
  2. Establish gRPC/WebSocket connection to BidiProcessOrder.
  3. Send BidiProcessOrderRequest with Config (session ID, store ID).
  4. Receive initial AgentAudio (e.g., welcome message) and play it.
  5. User speaks: Capture audio, stream it in AudioInput messages.
  6. Receive SpeechRecognition (display transcript), AgentAudio (play response), and potentially UpdatedOrderState (update UI cart).
  7. If user interrupts, receive InterruptionSignal, stop playback.
  8. Continue exchange of audio or text inputs and agent responses.
  9. User confirms order: Agent sends final UpdatedOrderState.
  10. Agent sends EndSession: Client closes the stream and finalizes the order in the POS system using data from the last UpdatedOrderState.
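Steps 6 to 10 can be condensed into a small response dispatcher. This is a conceptual sketch only: the message keys are assumed to be the lowerCamelCase forms of the response type names, the payloads are simplified placeholders, and `apply_response` and `run_flow` are hypothetical names.

```python
from typing import Any, Dict, List


def apply_response(message: Dict[str, Any], state: Dict[str, Any]) -> None:
    """Updates client state for a single agent message."""
    if "agentAudio" in message:
        state["playback"].append(message["agentAudio"])
    elif "speechRecognition" in message:
        state["transcript"].append(message["speechRecognition"])
    elif "updatedOrderState" in message:
        # Replace, never merge: each update carries the full order.
        state["order"] = message["updatedOrderState"]
    elif "interruptionSignal" in message:
        state["playback"].clear()
    elif "endSession" in message:
        state["finished"] = True


def run_flow(messages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Applies messages in order and stops once the session has ended."""
    state: Dict[str, Any] = {
        "playback": [], "transcript": [], "order": None, "finished": False,
    }
    for message in messages:
        apply_response(message, state)
        if state["finished"]:
            break
    return state
```

When `run_flow` returns with `finished` set, `state["order"]` holds the last order received, which is the data to submit to the POS system.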