Transform data

The Google Cloud Data Agent Kit extension for Visual Studio Code lets you use notebooks for data cleaning, feature engineering, and deep analysis. There are three types of notebooks to choose from.

  • BigQuery DataFrames notebooks. These are Python notebooks that let you process massive datasets in BigQuery using familiar pandas and scikit-learn APIs. They support writing code in GoogleSQL for BigQuery in addition to Python.

  • Managed Apache Spark notebooks with local kernel. These are Python notebooks that let you create and execute jobs on Managed Service for Apache Spark using its Spark Connect library.

  • Managed Apache Spark notebooks with remote kernel. These notebooks enable you to execute your notebook on a remote kernel that runs entirely on Managed Service for Apache Spark. No part of your code is executed locally on your computer. In addition to PySpark, you can write your code in Spark SQL with the help of %%sparksql cell magic.

Before you begin

For BigQuery notebooks, the bigframes library must be installed in the same Python virtual environment in which you execute your notebook. When you create a new notebook, the initialization cell contains the following line, commented out:

#%pip install --upgrade bigframes
  1. Optional: If you don't have the bigframes library installed in your Python virtual environment, uncomment the line.

  2. Optional: If you plan to write SQL code in your notebook, install the bigquery-magics package:

pip install --upgrade bigquery-magics
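If you're not sure whether a library is already available in your environment, you can check from a notebook cell before uncommenting the install line. A minimal sketch using only the standard library:

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# If this prints False, uncomment the %pip install line in the notebook.
print(has_package("bigframes"))
print(has_package("json"))  # stdlib module, always available
```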

Required roles

To get the permissions that you need to execute BigQuery notebooks, ask your administrator to grant you the BigQuery Studio User (roles/bigquery.studioUser) role on the project that you selected in the extension.

To get the permissions that you need for Managed Service for Apache Spark notebooks, ask your administrator to grant you the following roles on the project:

For more information about granting roles, see Manage access to projects, folders, and organizations. You might also be able to get the required permissions through custom roles or other predefined roles.

Transform your data

For data in a BigLake or BigQuery table, the extension provides notebook templates to get you started.

Navigate to a BigQuery or BigLake table:

  1. Open the command palette by pressing Ctrl/Cmd-Shift-P.
  2. Expand the Catalog explorer and find your BigQuery or BigLake table.
  3. Right-click the table ID.
  4. From the floating menu, choose Load in Spark DataFrame or Load in BigQuery DataFrame. A new editor displays information about the table.

    You can also find the table using Universal Search. Click the table ID to open a new editor, click the Data tab, and then choose Load in Spark DataFrame or Load in BigQuery DataFrame.

Initialize a notebook

After you load the table, a new Jupyter notebook opens in an editor tab. The notebook contains the code needed to load the table into the DataFrame type that you chose.

  1. If you don't have the required library installed in your Python virtual environment, uncomment the pip install line.

  2. Click Select Kernel and choose a Python kernel.

    For Managed Spark notebooks with remote kernels, you must choose a remote Spark kernel.

  3. Run the cell by clicking ▷ Run All, or place the cursor in the cell and press Shift+Enter.

  4. If you are prompted to install missing software, click Install.

The cell creates a DataFrame that contains the data from the selected table.
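For a BigQuery DataFrames notebook, the generated cell loads the table with a call along the following lines. A sketch with hypothetical project, dataset, and table names; the bigframes call is shown in comments because it requires GCP credentials, and the exact generated code depends on the template version:

```python
# Hypothetical project, dataset, and table names for illustration.
project_id = "my-project"
table_id = f"{project_id}.my_dataset.sales"

# The generated cell loads the table with a call like the following:
# import bigframes.pandas as bpd
# df = bpd.read_gbq(table_id)

print(table_id)
```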

Apply data transformations to the DataFrame

Add cells to the notebook and write code to transform your data. For BigQuery DataFrames, you can transform the DataFrame using the pandas-compatible API provided by BigQuery DataFrames.
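For example, a cleaning and feature-engineering cell might chain `dropna`, `fillna`, `assign`, and `groupby`. The sketch below uses plain pandas so that it runs anywhere; with BigQuery DataFrames, you would build the DataFrame with `bpd.read_gbq(...)` instead and the same pandas-style calls would run as BigQuery jobs. The column names are hypothetical:

```python
import pandas as pd

# Stand-in for a DataFrame loaded from a BigQuery table;
# the column names are hypothetical.
df = pd.DataFrame(
    {"region": ["us", "us", "eu", None], "amount": [10.0, None, 5.0, 3.0]}
)

cleaned = (
    df.dropna(subset=["region"])   # drop rows with no region
      .fillna({"amount": 0.0})     # impute missing amounts
      .assign(amount_cents=lambda d: d["amount"] * 100)  # derived feature
)

totals = cleaned.groupby("region", as_index=False)["amount"].sum()
print(totals)
```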

Alternatively, BigQuery DataFrames provides a magic command that lets you transform a DataFrame by using SQL in a Jupyter notebook. To transform your data using SQL, complete the following steps:

  1. Create and run a cell to load the Jupyter magic.

    %load_ext bigframes

  2. Create a SQL cell by using the %%bqsql magic.
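After the extension is loaded, a SQL cell might look like the following. This is a notebook-cell sketch; the project, dataset, table, and column names are hypothetical:

```
%%bqsql
SELECT
  region,
  SUM(amount) AS total_amount
FROM `my-project.my_dataset.sales`
GROUP BY region
```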

Save the results

Use one of the many output methods provided by your DataFrame type to save the transformed data to BigQuery or Cloud Storage. For BigQuery DataFrames, output methods include the following:

For small data, you can export to Arrow or pandas for further local manipulation and visualization.
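For instance, bigframes DataFrames provide `to_gbq` to write results back to a BigQuery table and `to_pandas` to materialize a small result locally. A hedged sketch: the cloud calls appear only in comments because they require GCP credentials, the table name is hypothetical, and the runnable part uses a plain pandas DataFrame as a stand-in for a small materialized result:

```python
import io

import pandas as pd

# Stand-in for a small transformed result. With bigframes you would call
# df.to_pandas() to bring a small result into local memory, or
# df.to_gbq("my_dataset.clean_sales") to save it back to BigQuery
# (the destination table name is hypothetical).
df = pd.DataFrame({"name": ["a", "b"], "total": [10, 20]})

# Local export for further manipulation and visualization.
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])
```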

Clean up

To avoid incurring charges to your Google Cloud account, delete any resources that you created but no longer need.

What's next