Analyze data on GKE using BigQuery, Cloud Run, and Gemma

This tutorial demonstrates how to derive insights from big datasets by using BigQuery, Cloud Run, and the Gemma LLM. In this tutorial, you deploy a sample application to Google Kubernetes Engine (GKE). The sample app leverages BigQuery for data storage and processing, Cloud Run for request handling, and the Gemma LLM for analyzing data and generating predictions based on incoming prompts.

This tutorial is intended for cloud platform administrators and architects, Data and AI specialists, ML engineers, and MLOps (DevOps) professionals. Before reading this page, ensure that you're familiar with Kubernetes and a notebook environment like Jupyter.

As a prerequisite to this tutorial, you must complete the tutorial Serve Gemma open models using GPUs on GKE with Hugging Face TGI. The TGI framework facilitates the model serving process.

Why GKE and BigQuery

BigQuery is a Platform as a Service (PaaS), fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. BigQuery lets you focus on analyzing data to find meaningful insights while using familiar SQL and built-in machine learning.

Using GPUs on GKE with TGI, you can deploy a Gemma language model to analyze and summarize user interactions in natural language. Subsequently, by integrating BigQuery with GKE, you can use BigQuery to efficiently handle massive datasets (like Google Analytics) and the model's natural language understanding capabilities to generate meaningful insights.

For example, as a data scientist or analyst, or a business decision-maker in an ecommerce company, you might want to understand user behavior on your website or app. This insight can help you optimize and personalize user journeys and make informed business decisions to boost sales.

In this scenario, you could take raw Google Analytics data from BigQuery, feed it to the Gemma model, and receive page visit summaries and insights in natural language. The Gemma model, which runs on scalable infrastructure with GPU acceleration from GKE, rapidly processes user journey data, identifying patterns and trends. You could gain insights to pinpoint popular product combinations, reveal common drop-off points in the checkout process, and highlight successful marketing campaigns driving traffic to specific landing pages.

Benefits

This solution offers a streamlined workflow with the following advantages:

  • BigQuery integration: use BigQuery to store and process large datasets (like the Google Analytics data in this tutorial). This lets you query and aggregate the data needed for the model's analysis.
  • GPU acceleration: run the Gemma model on a GKE cluster with GPU support to speed up the inference process, generating predictions much faster than with the CPU-based processors.
  • Reduced cost and time: save time and resources by using the open source, pre-trained Gemma language model, eliminating the need to build a custom model from scratch.

Serve a Gemma model

Go to the tutorial Serve Gemma open models using GPUs on GKE with Hugging Face TGI and follow the instructions starting from Before you begin through Interact with the model using curl to ensure that your Gemma model is deployed successfully and you can interact with it.

For the purpose of this tutorial, deploy the Gemma 2B-it model.

Set up VPC network

Create or use the VPC network in the us-central1 region so that your remote function can connect to the GKE cluster. In this tutorial, use the Default VPC.

To ensure that your BigQuery dataset, remote function, and the underlying Cloud Run functions are deployed in compatible locations, the VPC network must be in the same region as your BigQuery remote function. In this tutorial, when you set BigQuery DataFrames options while creating a remote function, you specify US as a location for your dataset, which defaults to the us-central1 region for your Cloud Run functions. Therefore, create or use the VPC in the us-central1 region.

Create a load balancer

Follow these instructions to create an internal load balancer in your GKE cluster:

  1. Create the following tgi-2b-lb-service.yaml manifest:

    apiVersion: v1
    kind: Service
    metadata:
    name: llm-lb-service
    annotations:
        networking.gke.io/load-balancer-type: "Internal"
    spec:
    selector:
      app: gemma-server
    type: LoadBalancer
    ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
    
  2. Open a new Cloud Shell terminal and run the following command to apply the manifest:

    kubectl apply -f tgi-2b-lb-service.yaml
    
  3. Get the load balancer IP address. You might need to wait for 1-2 minutes before this IP address can be fetched:

    kubectl get service llm-lb-service --output yaml | grep ip:
    

You'll use this IP address to communicate with your gemma-server application that's running behind the load balancer.

Create a connector

You use a Serverless VPC Access connector to send and receive requests through your VPC network without using the public internet. For more information, see Serverless VPC Access.

In this tutorial you create a connector with a new, dedicated subnet to avoid any IP address conflicts with existing resources in the VPC. For instructions, see the Create a connector section and follow the gcloud instructions for the Create a connector and a new subnet section.

Alternatively, if you want to use an existing subnet, follow the instructions for the Create a connector using an existing subnet section.

For more information, see Connector subnet requirements.

Create a notebook

In this tutorial, you use a Colab Enterprise notebook to run all your code for defining the BigQuery remote function and performing the analysis.

To create a Colab Enterprise notebook by using Google Cloud console:

  1. In Google Cloud console, go to the Colab Enterprise Notebooks page:

    Go to Notebooks

  2. In the Region menu, select us-central1. This is the same region where you create all your services in this tutorial.

  3. Next to Files, click Create a notebook.

Your new notebook appears on the My notebooks tab.

To run code in your new notebook, insert a new code cell in your notebook for every command or a code snippet you want to run.

Create a BigQuery remote function

One of the ways you can define a BigQuery remote function is by using the bigframes library. In this section, use bigframes to create a remote function called process_incoming. This remote function takes Google Analytics data as an input, constructs a prompt, and sends it to your Gemma model for analysis.

In the Colab Enterprise notebook you created:

  1. Click + Code to insert a new code cell.
  2. Copy the following code in the new code cell:

    # Install the necessary packages on the notebook runtime
    %pip install --upgrade bigframes --quiet
    
    import bigframes.pandas as bpd
    import os
    import ast
    import requests
    
    # Replace the following  variables
    # Use the format ip:port
    # For example, "10.128.05:8000"
    lb_url = "LOADBALANCER_IP_ADDRESS:8000"
    
    # Set BigQuery DataFrames options
    bpd.options.bigquery.project = "PROJECT_ID"
    bpd.options.bigquery.location = "US"
    # Update the VPC connector name with the one you created
    vpc_connector_name = "VPC_CONNECTOR_NAME"
    
    # Create a remote function using bigframes
    # https://cloud.google.com/bigquery/docs/remote-functions#bigquery-dataframes
    
    @bpd.remote_function(
      dataset="ga_demo",
      name="ga_explain_example",
      bigquery_connection="bigframes-rf-conn",
      reuse=True,
      packages=["requests"],
      cloud_function_vpc_connector=VPC_CONNECTOR_NAME,
      cloud_function_service_account="default",
    )
    def process_incoming(data: str) -> str:
      ga_data = ast.literal_eval(data)
      USER_PROMPT = """
          'The following are the results from Google Analytics.
          They are reverse ranked.
          reverse_event_number 1 is the last page visited.
          reverse_event_number 2 is the second last page visited.
          You are given the following data.
          {}
          Can you summarize what was the most popular page people landed on and what page they came from?
      """.format(ga_data)
    
      url = 'http://{}/generate'.format(lb_url)
    
      myobj = {
          "inputs": USER_PROMPT,
          "temperature": 0.90,
          "top_p": 0.95,
          "max_tokens": 2048
      }
      x = requests.post(url, json=myobj)
      result = x.text
      return (result)
    
    function_name = process_incoming.bigframes_remote_function
    print (f"The function name is: {function_name}")
    
    

    Replace the following:

    In this tutorial, the location of your BigQuery dataset is set to US, which defaults to us-central1 region.

  3. Click Run cell.

The output displays the name of the function similar to the following:

The function name is: PROJECT_ID.ga_demo.ga_explain_example

Analyze user behavior

In this section, you analyze user behavior on your website using the process_incoming remote function in either of the following two ways:

  • using BigQuery DataFrames
  • using the bq command-line tool to run a query directly in SQL.

Use BigQuery DataFrames

To run the remote function using BigQuery DataFrames in the Colab Enterprise notebook you created:

  1. Click + Code to insert a new code cell.
  2. Copy the following code in the new code cell, and click Run cell.
# Generate a list of all matchups and their histories as a JSON

grouping_sql = """
with
data_table as (
 SELECT
 distinct
   user_pseudo_id,
   events.value.string_value,
   event_timestamp,
   rank() over (partition by user_pseudo_id order by event_timestamp desc) as reverse_event_number
 FROM
   `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131` as events20210131,
   unnest (events20210131.event_params) as events
 where events.key = 'page_location'
 qualify reverse_event_number < 3
)
select
*,TO_JSON_STRING (data_table) as ga_history
from data_table
limit 10;

"""

ga_df = bpd.read_gbq(grouping_sql)
post_processed = ga_df.assign(results=ga_df['ga_history'].apply(process_incoming),axis=1)
post_processed.head(10)

The following output shows the sample results of the query:

user_pseudo_id string_value event_timestamp reverse_event_number ga_history results axis
0 2342103247.0307162928 https://shop.googlemerchandisestore.com/Google... 1612096237169825 2 {"user_pseudo_id":"2342103247.0307162928","str... {"generated_text":"\n 'The following are...
1 48976087.6959390698 https://www.googlemerchandisestore.com/ 1612056537823270 2 {"user_pseudo_id":"48976087.6959390698","strin... {"generated_text":"\n \n ```python\n imp...

Use bq command-line tool

Alternatively, you can use the bq command-line tool to perform analysis directly using SQL.

To run the remote function using bq command-line tool in the Colab Enterprise notebook you created:

  1. Click + Code to insert a new code cell.
  2. Copy the following code in the new code cell, and replace PROJECT_ID with your project ID.

    # Update with your PROJECT_ID
    
    function_name = 'PROJECT_ID.ga_demo.ga_explain_example'
    
    new_sql = """'with \
    data_table as ( \
    SELECT \
    distinct \
      user_pseudo_id, \
      events.value.string_value, \
      event_timestamp, \
      rank() over (partition by user_pseudo_id order by event_timestamp desc) as reverse_event_number \
    FROM \
      `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131` as events20210131, \
      unnest (events20210131.event_params) as events \
    where events.key = "page_location" \
    qualify reverse_event_number < 3 \
    ) \
    select \
    *, `{}`(TO_JSON_STRING (data_table)) as result \
    from data_table \
    limit 10;' \
    """.format(function_name)
    
    # Run query using bq cli directly in a notebook cell
    
    !bq query --use_legacy_sql=false \
    {new_sql}
    
  3. Click Run cell.

The following output shows the sample results of the query:

user_pseudo_id string_value event_timestamp reverse_event_number result
86037838.0267811614 https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Google+Dino+Game+Tee 1612128627715585 1 {"generated_text":"Answer:\n The most popular page was https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Google+Dino+Game+Tee\n The next most popular page was the page they came from.\n\n Explanation:\n\nThe provided data shows that the current user visited Google's merchandise store specifically for the product "Google Dino Game Tee." \n \nImportant Considerations:\n\n* Data Interpretation: You can't definitively say the"}
4024190.3037653934 https://shop.googlemerchandisestore.com/Google+Redesign/Apparel/Google+Black+Cloud+Zip+Hoodie 1612085948486438 1 {"generated_text":"\n ```python\n import pandas as pd\n\n data = {'user_pseudo_id': ['4024190.3037653934', '4024190.3037653934', '4024190.3037653934'],\n 'string_value': ['https://shop.googlemerchandisestore.com"}