Learn how to train AI and ML models in the Google Cloud Data Agent Kit extension for Visual Studio Code.
In this quickstart, you use a session template and a sample Jupyter notebook to predict New York City taxi tip amounts. Using a remote Jupyter kernel with PySpark, you try out several models, including linear regression, random forest, and XGBoost. This workflow performs distributed training and inference, demonstrating how Spark ML and the XGBoost library scale across multiple machines.
Beyond the approach shown in this quickstart, there are multiple ways to train AI and ML models using the Google Cloud Data Agent Kit extension for Visual Studio Code:
- If your training dataset is large or you want the distributed training capabilities that Apache Spark offers, you can use Spark notebooks with remote kernels.
- If your dataset is in BigQuery and BigQuery ML supports your use case, you can use a BigQuery DataFrames notebook.
- If your dataset is small and you want to train your model locally, you can use a Python notebook.
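To make the modeling task concrete, here is a minimal sketch of the idea behind the tip-prediction exercise: an ordinary-least-squares fit of tip amount against fare, using only the Python standard library. The data points and variable names are invented for illustration; the sample notebook itself trains on the full dataset with Spark ML and distributed XGBoost.

```python
import statistics

# Toy stand-in for the NYC taxi data: (fare_amount, tip_amount) pairs.
# These values are made up for illustration.
rides = [(5.0, 1.0), (10.0, 2.1), (15.0, 2.9), (20.0, 4.2), (25.0, 5.0)]

fares = [f for f, _ in rides]
tips = [t for _, t in rides]
mean_f, mean_t = statistics.fmean(fares), statistics.fmean(tips)

# Closed-form least squares: slope = cov(fare, tip) / var(fare);
# the intercept makes the line pass through the mean point.
slope = sum((f - mean_f) * (t - mean_t) for f, t in rides) / sum(
    (f - mean_f) ** 2 for f in fares
)
intercept = mean_t - slope * mean_f

def predict_tip(fare: float) -> float:
    """Predict the tip for a given fare using the fitted line."""
    return intercept + slope * fare

print(f"predicted tip for a $30 fare: ${predict_tip(30.0):.2f}")
```

The notebook's distributed version follows the same train-then-predict shape, but Spark ML partitions the data and the computation across executors instead of fitting on a single machine.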
Create a Spark runtime template
Serverless Spark Runtime templates let you start an Apache Spark session with a given set of configurations. To create a new Serverless Runtime template, complete the following steps:
- In the IDE activity bar, click the Google Cloud Data Agent Kit icon.
- In the Google Cloud Data Agent Kit menu, expand Apache Spark.
- Expand Serverless and then click + Create serverless runtimes. A Serverless Runtime creation form appears.
- In the Display Name field, enter ai-ml-tutorial.
- Go to the Auto Scaling section.
- Set spark.dynamicAllocation.enabled to false in the drop-down list. This setting is needed for XGBoost to work with Apache Spark.
- Leave all other fields set to the default.
- Click Submit.
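For context, the drop-down setting above corresponds to the Spark property spark.dynamicAllocation.enabled. As an illustrative sketch only — the Serverless Runtime template applies this property for you, and the app name here is an assumption — the equivalent configuration when building a PySpark session by hand would look like this:

```python
from pyspark.sql import SparkSession

# Illustrative only: the runtime template sets this property for you.
# Disabling dynamic allocation keeps a fixed set of executors, which
# distributed XGBoost training requires.
spark = (
    SparkSession.builder
    .appName("ai-ml-tutorial")  # hypothetical app name
    .config("spark.dynamicAllocation.enabled", "false")
    .getOrCreate()
)
```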
Create a new notebook
Next, create a new Spark notebook:
- Under Apache Spark in the Google Cloud Data Agent Kit tab, click + New Spark Notebook.
- Choose Remote Kernel for the kernel type.
- Click Start with a sample notebook.
- In the list of samples, select Data Science with PySpark and Distributed XGBoost. An untitled Jupyter notebook appears.
Train your model
- In the notebook tab, click Run All. The kernel picker prompts you to select a kernel to run the notebook.
- Click Select Another Kernel.
- Click Remote Spark Kernels.
- Select ai-ml-tutorial on Serverless Spark, the Runtime template that you created earlier.
You see the following notification while the system creates your Serverless
Spark session: Connecting to kernel: ai-ml-tutorial on Serverless Spark. When
the notebook connects to the remote PySpark kernel, execution starts at the
first cell. This process takes approximately two to three minutes.
Inspect your Spark session
- In the Google Cloud Data Agent Kit tab, under Apache Spark, expand the ai-ml-tutorial Runtime template. The IDE displays the list of interactive sessions that you have created with this runtime template.
- At the top of the list, locate the session that the system created when you executed the notebook. Click the session to see its details, including the session configuration and the resources that the system consumed to execute your notebook.
Clean up
After successfully executing the notebook, perform the following cleanup steps.
- In the Google Cloud Data Agent Kit tab, under Apache Spark, right-click Serverless and select List Serverless Runtimes. The list of Serverless Runtimes appears.
- Click the Action menu for ai-ml-tutorial to list all the Interactive Sessions that the system created from your template.
- Under Actions, click Delete.
- Go back to the Serverless Runtimes window.
- Under Actions for ai-ml-tutorial, click Delete.
- Click Confirm to delete the template that you created for this tutorial.