Use natural language to analyze data with the Data Science Agent

This tutorial shows how to build a machine learning (ML) model by using natural language prompts with the Colab Enterprise Data Science Agent.

In the tutorial, you build an ML model to forecast liquor sales by using the Iowa liquor retail sales public dataset. The AI-powered agent lets you use natural language prompts to write, explain, and troubleshoot code directly within a notebook to accelerate your data science workflows.

This tutorial is intended for data practitioners.

Objectives

In this tutorial, you learn how to use the Data Science Agent to perform the following tasks:

  • Perform exploratory data analysis (EDA) of the Iowa liquor retail sales public dataset to understand data distributions, check for missing values, and verify overall data quality.
  • Find the stores that sold the most gallons of alcohol across all products.
  • Build, train, and evaluate a model that forecasts liquor sales by using BigQuery ML.
  • Generate and summarize key insights and model performance.

Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  5. Verify that billing is enabled for your Google Cloud project.

  6. Enable the BigQuery, Gemini for Google Cloud, Dataform, and Compute Engine APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs

    For new projects, the BigQuery API is automatically enabled.

Required roles

If you created a new project, you have all required permissions to complete this tutorial. If you use an existing project, ask your administrator to grant you the following roles.

Permissions to create and run notebooks

To get the permissions that you need to create and run notebooks, ask your administrator to grant you the BigQuery Studio User (roles/bigquery.studioUser) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

To view the permissions that are required to create and run notebooks, see the setup steps on the Create notebooks page.

For more information about BigQuery Identity and Access Management (IAM), see Access control with IAM.

Create a Colab Enterprise notebook and connect to a runtime

Colab Enterprise notebooks are BigQuery Studio code assets powered by Dataform. You can use notebooks to complete analysis and ML workflows by using SQL, Python, and other common packages and APIs.

To create a new notebook and connect it to the default runtime, follow these steps:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, expand your project, and then click Notebooks.

  3. Click New notebook > Empty notebook.

  4. Click Save.

  5. To see the new notebook, click the Notebooks tab. You might need to click refresh Refresh .

  6. For your untitled notebook, click more_vert Open actions, and then choose Rename.

  7. For Notebook name, enter predict_liquor_sales, and then click Rename.

  8. Click the predict_liquor_sales tab.

  9. On the notebook toolbar, click Connect to connect your notebook to the default runtime environment.

Use Data Science Agent to analyze the data

The Data Science Agent is a Gemini-powered assistant that can write, explain, and troubleshoot code directly within your notebook. It can help you with tasks ranging from exploratory data analysis to generating machine learning predictions and forecasts, including the following:

  • Generating plans. Create a step-by-step plan to solve a data science problem.
  • Data exploration and cleaning. Identify missing values, outliers, and visualize distributions.
  • Feature engineering. Convert categorical features and create new ones.
  • Model training and evaluation. Train models like Linear Regression or Random Forest and compare their performance.

In this tutorial, you use the Data Science Agent to analyze data in the Iowa liquor retail sales public dataset, and to create a model that forecasts liquor sales.

To use Data Science Agent to analyze the data, follow these steps:

  1. In your predict_liquor_sales notebook, if the Gemini chat window isn't open, click spark Toggle Gemini in Colab to open the chat dialog.

  2. In the chat interface, enter the following prompt:

    Analyze the liquor sales data from
    bigquery-public-data.iowa_liquor_sales.sales by performing exploratory data
    analysis. Find the stores that sold the most gallons of alcohol across
    all products. Then, create a model that predicts liquor sales by store by
    using BigQuery ML. Visualize the results, evaluate the model,
    summarize the model's performance, and provide any key insights.
    
  3. Verify that the model is set to Gemini 3.0 Flash.

  4. Click Send.

  5. Review the plan generated by the agent. The plan looks similar to the following.

    "I've analyzed the request for Iowa liquor sales data and created a strategic plan. Since the data is stored in BigQuery and you've specified using BigQuery ML, we'll start with data exploration, identify top-performing stores, and then build a regression model to predict sales. Here is the step-by-step plan:

    • Data Exploration and Profiling: Perform an initial exploration of the 'bigquery-public-data.iowa_liquor_sales.sales' table to understand the data distribution, check for missing values, and verify data quality for key columns like store information, volume sold, and sales dollars.
    • Identify Top Stores by Volume: Aggregate the total gallons of alcohol sold by each store across all products and identify the top-performing stores.
    • Data Preparation: Prepare the training data by aggregating liquor sales by store and selecting relevant features for the prediction model.
    • Model Development: Train a regression model using BigQuery ML to predict liquor sales for each store.
    • Model Evaluation: Evaluate the trained model's performance using standard regression metrics.
    • Model Prediction: Generate predictions for liquor sales using the trained model.
    • Visualization: Create a scatter plot comparing the predicted liquor sales versus the actual values to visually assess the model's accuracy.
    • Finish task: Summarize the model's performance, provide key insights into store sales, and conclude the analysis."
  6. After you've reviewed the plan, click Accept and run. The agent generates the first subtask: exploring and profiling the data. The agent pauses until you review and accept the code. If the agent is waiting for user input, the generated cells are rendered with a green background.

  7. Review the generated code cell and the reasoning provided by the agent.

    The code and reasoning provided by DSA after it examined the public dataset.

  8. Click Accept and run. If the agent encounters issues with its approach, it provides reasoning about how to correct the issue and prompts you to accept the altered code.

  9. Review the output in the code cell.

    The results produced by the data exploration subtask.

    Below the results, the agent creates a new cell to complete the next subtask: finding the stores with the highest liquor sales.

    The SQL query generated by the agent that finds the top liquor stores by gallons sold.

  10. Review the generated SQL code that queries the data for the top stores by gallons of alcohol sold. You can examine the agent's reasoning by viewing the Reasoning text cell above the code. If you're satisfied that the code is correct, click Accept and run.

  11. Examine the query results in the cell's output. The results are similar to the following:

    The query results that show top liquor stores by sales.

  12. Review the code and reasoning generated by the agent for the next subtask: preparing the data for model training.

    The code and reasoning generated to prepare the data for model training.

  13. After you verify that the SQL code is correct, click Accept and run.

  14. Review the output in the code cell. You see a message similar to JOB ID 123456 successfully executed.

  15. Review the code and reasoning generated by the agent for the next subtask: training the regression model.

    The code and reasoning generated to train the regression model.

  16. After you view the code and reasoning, click Accept and run.

  17. Review the output in the code cell. You see a message similar to JOB ID 123456 successfully executed.

  18. Review the code and reasoning generated by the agent for the next subtask: model evaluation.

    The code and reasoning generated to evaluate the model.

  19. After you view the code and reasoning, click Accept and run.

  20. Review the output in the code cell.

    The output of the model evaluation subtask.

  21. Review the code and reasoning generated by the agent for the next subtask: generating predictions.

    The code and reasoning to generate predictions.

  22. After you view the code and reasoning, click Accept and run.

  23. Review the output in the code cell. You see a message similar to JOB ID 123456 successfully executed.

    After the query runs, the agent creates a code cell for the next subtask: visualizing the data.

    The code and reasoning to visualize the results.

  24. After you view the code and reasoning, click Accept and run.

  25. Review the output in the code cell. You see a chart that plots actual versus predicted liquor sales. The chart looks similar to the following:

    The actual versus predicted sales chart.

    After the chart is generated, the agent generates a summary of the results with key findings and insights.

    The summary generated by the agent.

  26. After you review the summary, click Accept to complete the plan.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the notebook you created. To delete your notebook, follow these steps:

  1. Go to the BigQuery page.

    Go to BigQuery

  2. In the left pane, expand your project, and then click Notebooks.

  3. For your predict_liquor_sales notebook, click more_vert Open actions, and then choose Delete.

  4. Click Delete to remove the notebook.

What's next