Build a multimodal ordering experience using the streaming API

This guide provides instructions and best practices for engineers building food ordering experiences with the FoodOrderingService.BidiProcessOrder RPC method. This real-time, bidirectional streaming API is the core of the Food Ordering AI Agent, enabling dynamic, conversational order-taking in various applications such as mobile apps, voice assistants, drive-thrus, and kiosks.

Overview of BidiProcessOrder

The BidiProcessOrder method establishes a persistent, two-way communication channel between your client application and the Food Ordering AI Agent. Unlike standard unary request and response RPCs, this streaming approach allows for:

Low-latency interaction: Continuous exchange of information without the overhead of repeated HTTP requests.
Multimodal input: Handling of audio streams (for voice ordering), text inputs, and client-side events.
Real-time responses: The agent can send back audio, text, order updates, and other signals as the conversation unfolds.

BidiProcessOrder cannot be invoked using REST. Integrations must use a connection-oriented protocol:

gRPC (Recommended): Provides a robust and efficient framework for bidirectional streaming.
WebSocket: Suitable for clients or environments where gRPC isn't a fit due to programming language or network constraints.

Refer to the BidiProcessOrder API Reference for detailed type definitions. WebSocket integrations use JSON representations of these types, as described in the WebSocket section.

Prerequisites

Before integrating with BidiProcessOrder:

Enable the API: Ensure the Food Ordering AI Agent API is enabled in your Google Cloud project:
  gcloud services enable foodorderingaiagent.googleapis.com --project=PROJECT_ID
Authentication: Decide on your authentication approach and set up any necessary service accounts and IAM roles, as described in Authentication.
Menu Ingestion: A valid Menu must be ingested and associated with a Store. See Integrating Menu Data for details.

Authentication

To securely connect to the BidiProcessOrder RPC, your application must authenticate using a Google Cloud Service Account.

1. Configure a Service Account

Create a Service Account: In your Google Cloud project, create a Service Account that your application will use to authenticate to the Food Ordering AI Agent API. See Creating and managing service accounts.
Grant IAM Roles: Grant the necessary IAM roles to this service account. The primary role required to call BidiProcessOrder is:
- Food Ordering Agent User (roles/foodorderingaiagent.agentUser): Allows the service account to connect to the ordering service and process sessions.
You can grant this role using the Google Cloud console or gcloud:
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
    --role="roles/foodorderingaiagent.agentUser"

2. Application Authentication Flow

The exact authentication flow depends on your application architecture, especially whether the client application (e.g., mobile app, kiosk software) connects directly or through your own backend.

Common Scenario: Authenticating a consumer-facing client application

This is a typical pattern for mobile or web applications:

Client-to-YourAuth: The end-user client app (mobile, web) authenticates with your existing user authentication system (this could be Firebase Authentication, your own OAuth server, etc.).
Token Exchange: The client app, after authenticating the user, requests a short-lived token from a secure backend service you control (e.g., an "API Token Service").
Access Token Generation: Your backend service, using the credentials of the Google Cloud Service Account principal configured in Step 1, generates a standard OAuth 2.0 access token for the https://www.googleapis.com/auth/cloud-platform scope. This can be done using the Google Cloud Authentication client libraries.
- Security: Service account keys or credentials used to generate these tokens must be securely stored and managed on your backend. Never expose service account private keys directly to end-user client applications. See Best practices for managing service account keys.
Note: The access token provided to clients must be associated with a principal with minimal IAM permissions (i.e. only the Food Ordering AI Agent User role). Public users can trivially abuse the access token to invoke arbitrary Google Cloud APIs on behalf of the service account, so it's critical that IAM is configured correctly to reject such calls.
Token to Client: Your backend service returns the generated Google access token to the client app.
API Call: The client app uses this Google access token to authenticate its gRPC or WebSocket connection to the BidiProcessOrder RPC.
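
On the client side, the short-lived token from this flow has to be re-requested before it expires. The following is a minimal sketch of one way to track that, not part of the API: the `TokenCache` class, the `refreshMarginMs` parameter, and the reply shape (`accessToken`, `expiresAtMs`) are hypothetical names chosen for illustration, and your backend token service may return something different.

```javascript
// Hypothetical client-side holder for the short-lived access token returned
// by your backend token service. All names and the reply shape are assumptions.
class TokenCache {
  constructor(refreshMarginMs = 60000) {
    this.refreshMarginMs = refreshMarginMs;
    this.accessToken = null;
    this.expiresAtMs = 0;
  }

  // Store a token reply received from your backend.
  store({accessToken, expiresAtMs}) {
    this.accessToken = accessToken;
    this.expiresAtMs = expiresAtMs;
  }

  // Return the cached token, or null when it is missing or about to expire,
  // in which case the client should request a new one from the backend.
  get(nowMs = Date.now()) {
    if (!this.accessToken) return null;
    if (nowMs >= this.expiresAtMs - this.refreshMarginMs) return null;
    return this.accessToken;
  }
}
```

A client would call `get()` before opening each stream and fall back to its backend when the result is `null`.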

3. Using the Token

gRPC: The Google gRPC client libraries typically handle token refreshing and inclusion in the call metadata when provided with service account credentials.
WebSocket (Non-Browser): Include the token in the Authorization: Bearer TOKEN header.
WebSocket (Browser): As noted in the WebSocket section, direct browser WebSocket connections cannot use Authorization headers. A server-side streaming proxy is needed to authenticate your client's connection to Google Cloud.

Connecting to the API

You can establish a stream using gRPC client libraries or a WebSocket connection.

gRPC

Using gRPC is the recommended approach. Use the client library for your language of choice (for example, Node.js), which is based on the BidiProcessOrder API Reference.

The basic steps involve:

Create a gRPC channel to the Food Ordering AI Agent API endpoint (e.g., foodorderingaiagent.googleapis.com).
Obtain a client stub for FoodOrderingService.
Invoke the BidiProcessOrder method, which returns a stream object for both sending requests and receiving responses.
Implement business logic according to your use case which concurrently:
- Sends audio, text, and event input from the end user.
- Handles messages from the agent including audio, text, and events.

Node.js


const {FoodOrderingServiceClient} = require('@google-cloud/foodorderingaiagent');

const client = new FoodOrderingServiceClient();

// The stream is initialized immediately. You can now write commands and attach listeners.

const stream = client.bidiProcessOrder();

WebSocket

For WebSocket connections, the URL path is:

wss://foodorderingaiagent.googleapis.com/ws/google.cloud.foodorderingaiagent.v1beta.FoodOrderingService/BidiProcessOrder/locations/LOCATION

LOCATION: e.g., us

Required Headers:

Authorization: Bearer TOKEN - Where TOKEN is an OAuth 2.0 access token obtained for your service account.

Message Format:

Client to Server: Messages sent to the API (e.g., Config, AudioInput, TextInput, EventInput) must be JSON representations of the BidiProcessOrderRequest proto, sent as websocket.TextMessage.
Server to Client: Messages received from the API (BidiProcessOrderResponse) will be sent as websocket.BinaryMessage, but the content of these binary messages is a JSON payload.
Binary Data: Binary data within the JSON payloads (e.g., customerAudio in AudioInput, agentAudio in AgentAudio) must be base64 encoded.
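
The framing rules above can be sketched as a pair of helpers. This is an illustration under assumptions rather than a reference implementation: the function names are hypothetical, and the top-level request field `audioInput` is assumed by analogy with the `config` and `textInput` fields used elsewhere in this guide. Check the API Reference for the exact JSON field names.

```javascript
// Build the JSON text frame for one chunk of customer audio.
// Assumption: the request field is named `audioInput`; `customerAudio`
// is the field named in the Message Format notes above.
function encodeAudioInput(pcmChunk) {
  return JSON.stringify({
    audioInput: {customerAudio: pcmChunk.toString('base64')},
  });
}

// Parse a server frame (a binary frame carrying UTF-8 JSON) and decode any
// agent audio from base64 back into raw bytes.
function decodeResponse(frame) {
  const response = JSON.parse(frame.toString('utf8'));
  const audio = response.agentAudio;
  if (audio && typeof audio.agentAudio === 'string') {
    audio.agentAudio = Buffer.from(audio.agentAudio, 'base64');
  }
  return response;
}
```

With the ws library, the string returned by `encodeAudioInput` can be passed to `ws.send`, which sends strings as text frames.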

Node.js WebSocket Example

Here is an example of how to connect and interact with the API using WebSockets in Node.js with the ws library:

const WebSocket = require('ws');

// Replace with your actual values
const location = 'LOCATION';
const projectId = 'PROJECT_ID';
const sessionId = 'SESSION_ID';
const brandId = 'BRAND_ID';
const storeId = 'STORE_ID';
const token = 'OAUTH_TOKEN';

const wsUrl = `wss://foodorderingaiagent.googleapis.com/ws/google.cloud.foodorderingaiagent.v1beta.FoodOrderingService/BidiProcessOrder/locations/${location}`;

const ws = new WebSocket(wsUrl, {
  headers: {
    'Authorization': `Bearer ${token}`
  }
});

ws.on('open', () => {
  console.log('Connected to WebSocket');

  // 1. Send the required initial Config message
  const configRequest = {
    config: {
      session: `projects/${projectId}/locations/${location}/sessions/${sessionId}`,
      store: `projects/${projectId}/locations/${location}/brands/${brandId}/stores/${storeId}`
    }
  };

  // Client-to-server messages are sent as TextMessage
  ws.send(JSON.stringify(configRequest));
  console.log('Sent Config message');
});

ws.on('message', (data, isBinary) => {
  // The documentation specifies that server-to-client messages
  // are sent as BinaryMessage containing a JSON payload.
  if (isBinary) {
    try {
      const response = JSON.parse(data.toString('utf8'));
      console.log('Received response:', response);

      if (response.agentText) {
        console.log(`Agent: ${response.agentText.text}`);
      }

      if (response.agentAudio) {
        const audioBytes = Buffer.from(response.agentAudio.agentAudio, 'base64');
        console.log(`Received ${audioBytes.length} bytes of agent audio.`);
        // Play or process the audio bytes here
      }

      if (response.endSession) {
        console.log('Session ended by agent.');
        ws.close();
      }
    } catch (e) {
      console.error('Failed to parse JSON response:', e);
    }
  }
});

ws.on('close', () => {
  console.log('Connection closed');
});

Session Lifecycle

Each call to BidiProcessOrder initiates a session. The session remains active as long as the stream is open.

1. Initiation (Config Message)

Upon establishing the connection, the first message sent by the client must be a BidiProcessOrderRequest containing the Config message.
Required Fields in Config:
- session: A unique client-generated session identifier. Format: projects/PROJECT/locations/LOCATION/sessions/SESSION_ID.
- store: The resource name of the Store. Format: projects/PROJECT/locations/LOCATION/brands/BRAND/stores/STORE.
  - The agent uses the store to load the appropriate menu and configuration.
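
Clients that do not use the gRPC client library (for example, WebSocket clients) need to build these resource names themselves. The sketch below follows the two formats listed above. The helper names are hypothetical, and using a random UUID as the session ID is an assumption: confirm any character or length limits on session IDs in the API Reference.

```javascript
const {randomUUID} = require('crypto');

// Format: projects/PROJECT/locations/LOCATION/sessions/SESSION_ID
// A random UUID is used when no session ID is supplied (assumption).
function sessionName(project, location, sessionId = randomUUID()) {
  return `projects/${project}/locations/${location}/sessions/${sessionId}`;
}

// Format: projects/PROJECT/locations/LOCATION/brands/BRAND/stores/STORE
function storeName(project, location, brand, store) {
  return `projects/${project}/locations/${location}/brands/${brand}/stores/${store}`;
}

// Build the first message of the stream, containing only Config.
function buildConfigRequest({project, location, brand, store, sessionId}) {
  return {
    config: {
      session: sessionName(project, location, sessionId),
      store: storeName(project, location, brand, store),
    },
  };
}
```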

Node.js

// Send the first message containing Config
stream.write({
  config: {
    session: client.sessionPath(projectId, location, sessionId),
    store: client.storePath(projectId, location, brandId, storeId),
  }
});

2. Sending Inputs

After the initial Config, the client can send a stream of BidiProcessOrderRequest messages containing one of the following inputs:
- AudioInput: Raw audio data (typically 16-bit linear PCM at 16000 Hz, no headers). Used for voice interactions.
- TextInput: Text messages from the user.
- EventInput: Signals for events such as DriveOffEvent (for drive-thru use cases when the vehicle departs), CrewInterjectionEvent (for any situation wherein a human takes over the order taking role mid-conversation), or OrderStateUpdateEvent (if the order is modified on the client-side, e.g., using a touch interface).

Node.js

// Stream user inputs over the active connection
stream.write({textInput: {text: 'Hi, I\'d like to order a cheeseburger.'}});

3. Receiving Responses

Concurrently, the agent sends back a stream of BidiProcessOrderResponse messages. Your client must be prepared to handle various response types within the oneof response field:
- AgentAudio: Synthesized audio bytes to be played to the user, used for voice interactions.
- AgentText: Text version of the agent's response.
- SpeechRecognition: Transcript of the recognized user speech.
- UpdatedOrderState: Contains the complete current state of the customer's Order whenever it's updated by the agent. Use this to update your application's order representation. This should typically result in an update to a user interface or a system of record for order state information, such as a point of sale system.
- InterruptionSignal: Indicates the user interrupted the agent's speech. The client should immediately stop playing any outgoing AgentAudio.
- AgentEvent: Special events, such as RestartOrder, requiring client action.
- SuggestedOptions: Provides contextually relevant options a user might select next, useful for display on a screen.
- EndSession: Signals the session has been terminated by the agent (e.g., order complete, user drive-off, or agent escalation).

Node.js

// Attach event listeners to handle responses sequentially
stream.on('data', (response) => {
  if (response.agentAudio) {
    console.log(`Received ${response.agentAudio.agentAudio.length} bytes of agent audio.`);
  } else if (response.agentText) {
    console.log(`Agent: ${response.agentText.text}`);
  } else if (response.speechRecognition) {
    console.log(`Recognized User Speech: ${response.speechRecognition.transcript}`);
  } else if (response.updatedOrderState) {
    console.log('Order updated.');
  } else if (response.interruptionSignal) {
    console.log('User interrupted the agent. Stop playing audio!');
  } else if (response.endSession) {
    console.log(`Session ended. Type: ${response.endSession.type}, Reason: ${response.endSession.reason}`);
    stream.end();
  }
});

stream.on('error', (err) => {
  console.error('Stream error:', err);
});

4. Closing the Stream

The stream can be closed by the client or the server. Typically, the server signals the end of a conversation using an EndSession message. The client should close the stream when this message is received.

Handling Specific Message Types

The following sections describe how to handle specific message types that your client sends or receives when calling BidiProcessOrder.

`AudioInput`

Stream audio in chunks as it becomes available.
Format: 16-bit linear PCM, 16000 Hz sample rate.
Audio chunks do not include the audio headers that typically prefix a WAV file.
For drive-thru scenarios with echo cancellation enabled (enable_echo_cancellation in Config), provide both customer_audio and crew_audio.
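
As a rough illustration of chunked streaming, the sketch below splits headerless PCM into fixed-duration chunks. It assumes mono audio and a 100 ms chunk size, neither of which is specified in this guide, and `chunkPcm` is a hypothetical helper name. At 16000 Hz and 2 bytes per sample, 100 ms of mono audio is 3200 bytes.

```javascript
const SAMPLE_RATE_HZ = 16000;
const BYTES_PER_SAMPLE = 2; // 16-bit linear PCM; mono is an assumption

// Split headerless PCM into chunks of roughly `chunkMs` milliseconds each.
// The final chunk may be shorter than the others.
function chunkPcm(pcm, chunkMs = 100) {
  const samplesPerChunk = Math.floor((SAMPLE_RATE_HZ * chunkMs) / 1000);
  const chunkBytes = samplesPerChunk * BYTES_PER_SAMPLE;
  if (chunkBytes <= 0) throw new RangeError('chunkMs is too small');
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += chunkBytes) {
    // subarray creates a view on the same memory, so no bytes are copied.
    chunks.push(pcm.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}
```

In a live integration the audio arrives incrementally from a microphone, so the same size calculation would typically be applied to a capture buffer rather than to a complete recording.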

`UpdatedOrderState`

This message provides the full state of the order each time it's sent. Replace any local cache of the order with the contents of the Order message received.
Use the custom_integration_attributes within the Order items and modifiers to map the Order content into equivalent entities within your application's system of record.
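
The replace-rather-than-merge behavior might look like the sketch below. `OrderCache` is a hypothetical class, and reading the Order message from `updatedOrderState.order` is an assumption about the field name: consult the API Reference for the actual structure.

```javascript
// Keeps the most recent full order snapshot sent by the agent.
// Assumption: the Order message is exposed as `updatedOrderState.order`.
class OrderCache {
  constructor() {
    this.order = null;
    this.updates = 0;
  }

  // Replace the cached order wholesale. Merging into the previous snapshot
  // would keep items that the agent has since removed.
  apply(updatedOrderState) {
    this.order = updatedOrderState.order ?? null;
    this.updates += 1;
    return this.order;
  }
}
```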

`InterruptionSignal`

Upon receiving, immediately halt playback of any AgentAudio and clear any buffered agent audio. This ensures a natural conversational flow when the user interrupts the agent's speech.
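
One way to structure this is a small playback queue that is emptied when the signal arrives. The sketch below is illustrative only: `AgentAudioQueue`, `handleResponse`, and the `stopPlayback` callback are hypothetical names, and the callback stands in for whatever your audio output layer provides.

```javascript
// Holds agent audio chunks that have been received but not yet played.
class AgentAudioQueue {
  constructor() {
    this.chunks = [];
  }

  enqueue(audioBytes) {
    this.chunks.push(audioBytes);
  }

  // Return the next chunk to play, or null when nothing is buffered.
  next() {
    return this.chunks.shift() ?? null;
  }

  // Drop everything not yet played and report how many chunks were dropped.
  clear() {
    const dropped = this.chunks.length;
    this.chunks.length = 0;
    return dropped;
  }

  get size() {
    return this.chunks.length;
  }
}

// Route one response: buffer agent audio, or halt and flush on interruption.
// `stopPlayback` is a hypothetical callback that silences the audio device.
function handleResponse(response, queue, stopPlayback) {
  if (response.agentAudio) {
    queue.enqueue(response.agentAudio.agentAudio);
  }
  if (response.interruptionSignal) {
    stopPlayback();
    queue.clear();
  }
}
```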

`EndSession`

Check the EndType (e.g., DRIVE_OFF, AGENT_ESCALATION).
Your application should gracefully close the connection and transition the user appropriately (e.g., notify a human supervisor in the case of AGENT_ESCALATION, or transition to an order confirmation state).
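
A hedged sketch of this branching follows. It assumes the end type arrives as a string enum name in `endSession.type`, as it would in a JSON payload. The returned step labels are hypothetical application states, not API values. The default branch treats every other end type as a normal completion, which a real integration should verify against the full list of EndType values in the API Reference.

```javascript
// Map an EndSession message to the next application step.
// Assumptions: `type` holds the enum name as a string, and the returned
// labels are placeholders for your own application states.
function nextStepForEndSession(endSession) {
  switch (endSession.type) {
    case 'AGENT_ESCALATION':
      return 'NOTIFY_SUPERVISOR';
    case 'DRIVE_OFF':
      return 'RESET_FOR_NEXT_CUSTOMER';
    default:
      return 'SHOW_ORDER_CONFIRMATION';
  }
}
```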

Best Practices

Handle Messages Asynchronously: Minimize latency by using threads or non-blocking I/O to concurrently send requests and process incoming responses.
Reconnection Logic: Implement robust reconnection logic in case of network issues, remembering to send the initial Config message with the same session ID to attempt resumption.
Error Handling: Monitor the stream for errors. gRPC and WebSocket libraries provide mechanisms to detect stream closure or transport errors. Log these events and handle them gracefully.
Audio Buffering: Manage audio buffers carefully, implementing buffering if necessary, to ensure smooth playback of AgentAudio and timely delivery of AudioInput. Carefully consider the tradeoff between latency and playback quality when deciding your buffering scheme.
Session ID Management: Ensure session IDs are unique for each distinct order/conversation.
Resource Management: Close streams and release resources when the session is complete or if unrecoverable errors occur.
Timeouts: While the stream itself can be long-lived (up to 15 minutes by default), consider application-level timeouts for specific states if needed.
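
For the reconnection logic mentioned above, a common general-purpose technique is exponential backoff with jitter. The sketch below is one possible delay calculation, not a documented requirement of the API. The function name and the default values (500 ms base, 30 s cap) are assumptions to tune for your deployment.

```javascript
// Compute how long to wait before reconnect attempt number `attempt`
// (0 for the first retry). Half of the delay is fixed and half is random,
// so many clients do not all retry at the same instant.
function reconnectDelayMs(attempt, {baseMs = 500, maxMs = 30000, random = Math.random} = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(ceiling / 2 + random() * (ceiling / 2));
}
```

After the delay, the client opens a new stream and sends the initial Config message with the same session ID, as described in the Reconnection Logic item.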

Example Integration Flow (Conceptual)

Client App (e.g., Mobile App) initiates an order.
Establish gRPC/WebSocket connection to BidiProcessOrder.
Send BidiProcessOrderRequest with Config (session ID, store ID).
Receive initial AgentAudio (e.g., welcome message) and play it.
User speaks: Capture audio, stream it in AudioInput messages.
Receive SpeechRecognition (display transcript), AgentAudio (play response), and potentially UpdatedOrderState (update UI cart).
If user interrupts, receive InterruptionSignal, stop playback.
Continue exchange of audio or text inputs and agent responses.
User confirms order: Agent sends final UpdatedOrderState.
Agent sends EndSession: Client closes the stream and finalizes the order in the POS system using data from the last UpdatedOrderState.

End-to-end Example

The sections above break the streaming concepts down piece by piece. The following sample shows what a complete end-to-end integration flow looks like.

Node.js

Before trying this sample, follow the Node.js setup instructions in the Food Ordering AI Agent quickstart using client libraries.

To authenticate to Food Ordering AI Agent, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

const {FoodOrderingServiceClient} = require('@google-cloud/foodorderingaiagent');

async function bidiProcessOrderSample(projectId, location, brand, store, sessionId) {
  const client = new FoodOrderingServiceClient();

  // Create the resource names
  const sessionPath = client.sessionPath(projectId, location, sessionId);
  const storePath = client.storePath(projectId, location, brand, store);

  // Initialize the stream using gRPC. See the WebSocket section for the equivalent WebSocket implementation.
  const stream = client.bidiProcessOrder();

  // Attach event listeners to handle responses sequentially
  stream.on('data', (response) => {
    if (response.agentAudio) {
      console.log(`Received ${response.agentAudio.agentAudio.length} bytes of agent audio.`);
    } else if (response.agentText) {
      console.log(`Agent: ${response.agentText.text}`);
    } else if (response.speechRecognition) {
      console.log(`Recognized User Speech: ${response.speechRecognition.transcript}`);
    } else if (response.updatedOrderState) {
      console.log('Order updated.');
    } else if (response.interruptionSignal) {
      console.log('User interrupted the agent. Stop playing audio!');
    } else if (response.endSession) {
      console.log(`Session ended. Type: ${response.endSession.type}, Reason: ${response.endSession.reason}`);
      stream.end();
    }
  });

  stream.on('error', (err) => {
    console.error('Stream error:', err);
  });

  // 1. Send the first message containing Config
  stream.write({
    config: {
      session: sessionPath,
      store: storePath,
    }
  });

  // 2. Stream user inputs over the active connection
  stream.write({textInput: {text: 'Hi, I\'d like to order a cheeseburger.'}});
}
