The Google Cloud Data Agent Kit extension for Visual Studio Code lets you use notebooks for data cleaning, feature engineering, and deep analysis. There are three types of notebooks to choose from.
- BigQuery DataFrames notebooks. These are Python notebooks that let you process massive datasets in BigQuery using familiar pandas and scikit-learn APIs. In addition to Python, they support writing code in GoogleSQL for BigQuery.
- Managed Apache Spark notebooks with a local kernel. These are Python notebooks that let you create and execute jobs on Managed Service for Apache Spark using its Spark Connect library.
- Managed Apache Spark notebooks with a remote kernel. These notebooks execute on a remote kernel that runs entirely on Managed Service for Apache Spark; no part of your code is executed locally on your computer. In addition to PySpark, you can write your code in Spark SQL with the help of the %%sparksql cell magic.
Before you begin
For BigQuery notebooks, the bigframes library must be installed
in the same Python virtual environment in which you execute your notebook. When
you create a new notebook, the initialization cell contains the following line,
which is commented:
#%pip install --upgrade bigframes
Optional: if you don't have the bigframes library installed in your Python virtual environment, remove the comment.
Optional: if you plan to write SQL code in your notebook, install bigquery-magics:
pip install --upgrade bigquery-magics
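If you are unsure whether a library is already present in the environment your notebook uses, a small helper like the following can check. This is an illustrative sketch, not part of the extension; note that a package's import name can differ from its pip name (for example, bigquery-magics is imported as bigquery_magics).

```python
import importlib.util

def has_package(name: str) -> bool:
    """Return True if the named package can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

# Use the result to decide whether to uncomment the %pip install line,
# for example: has_package("bigframes") or has_package("bigquery_magics").
print(has_package("sys"))
```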
Required roles
To get the permissions that you need to execute BigQuery notebooks, ask your administrator to grant you the BigQuery Studio User (roles/bigquery.studioUser) role on the project you selected in the extension.
To get the permissions that you need for Managed Service for Apache Spark notebooks, ask your administrator to grant you the following roles on the project:
- Dataproc Editor (roles/dataproc.editor)
- Dataproc Serverless Editor (roles/dataproc.serverlessEditor)
For more information about granting roles, see Manage access to projects, folders, and organizations. You might also be able to get the required permissions through custom roles or other predefined roles.
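If you administer the project yourself, a role can also be granted with the gcloud CLI. A sketch, with PROJECT_ID and USER_EMAIL as placeholders:

```shell
# Grant the BigQuery Studio User role (placeholder values).
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="user:USER_EMAIL" \
    --role="roles/bigquery.studioUser"
```

Repeat with roles/dataproc.editor and roles/dataproc.serverlessEditor for the Managed Service for Apache Spark roles.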
Transform your data
For data in a BigLake or BigQuery table, the extension provides notebook templates to get you started.
Navigate to the table
Navigate to a BigQuery or BigLake table:
- Open the command palette by pressing Ctrl/Cmd-Shift-P.
- Expand the Catalog explorer and find your BigQuery or BigLake table.
- Right-click the table ID, and from the floating menu choose Load in Spark DataFrame or Load in BigQuery DataFrame. A new editor displays information about the table.
You can also find the table using Universal Search. Click the table ID to open a new editor, click the Data tab, and then choose Load in Spark DataFrame or Load in BigQuery DataFrame.
Initialize a notebook
After you load the table, a new Jupyter notebook opens in an editor tab containing the necessary code to load your table in your chosen dataframe type.
If you don't have the required library installed in your Python virtual environment, uncomment the pip install line.
Click Select Kernel and choose a Python kernel.
For Managed Spark notebooks with remote kernels, you must choose a remote Spark kernel.
Run the cell by clicking ▷ Run All or pressing Shift+Enter at the bottom of the cell. If you are prompted to install missing software, click Install.
The cell creates a dataframe containing the data in the selected table.
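For the BigQuery DataFrames case, the generated cell amounts to reading the table by its fully qualified ID. A hedged sketch of that step, assuming the bigframes package is installed and you have access to the table (the table ID is a placeholder):

```python
def load_table(table_id: str):
    """Load a BigQuery table into a BigQuery DataFrame.

    table_id is a fully qualified ID such as
    "my-project.my_dataset.my_table" (placeholder).
    """
    import bigframes.pandas as bpd  # deferred import so the sketch loads cleanly
    return bpd.read_gbq(table_id)
```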
Apply data transformations to the DataFrame
Add additional cells to the notebook and write the code to transform your data. For BigQuery DataFrames, you can transform the DataFrame using the pandas-compatible API provided by BigQuery DataFrames.
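Because the API is pandas-compatible, transformations look like ordinary pandas code. The following local pandas sketch (with illustrative column names) shows the style; with BigQuery DataFrames, the same calls run against BigQuery rather than local memory:

```python
import pandas as pd

# Stand-in data; in a notebook, df would be the DataFrame loaded from your table.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "revenue": [100.0, 250.0, None, 400.0],
})

# Typical cleaning and feature-engineering steps.
df = df.dropna(subset=["revenue"])        # drop rows with missing revenue
df["revenue_k"] = df["revenue"] / 1000    # derive a feature
summary = df.groupby("region")["revenue_k"].sum()
print(summary.to_dict())
```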
Alternatively, BigQuery DataFrames provides a magics command that you can use to transform a DataFrame by using SQL in a Jupyter notebook. To transform your data using SQL, complete the following steps:
Create and run a cell to activate the Jupyter magics:

%load_ext bigframes

Create a SQL cell using the %%bqsql magic.
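Put together, the two cells might look like the following in a notebook. The table and column names are placeholders, and cell magics only run inside a Jupyter/IPython session:

```
%load_ext bigframes
```

```
%%bqsql
SELECT region, SUM(revenue) AS total_revenue
FROM `my-project.my_dataset.my_table`
GROUP BY region
```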
Save the results
Use one of the many output methods provided by your DataFrame type to save the transformed data to BigQuery or Cloud Storage. For small data, you can export the results to Arrow or pandas for further local manipulation and visualization.
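A hedged sketch of the save step for BigQuery DataFrames, assuming df is a transformed BigQuery DataFrame and the destination table ID is a placeholder:

```python
def save_results(df, destination_table: str):
    """Write a transformed BigQuery DataFrame back to BigQuery and
    return a local pandas copy for inspection or plotting.

    destination_table is a placeholder, e.g. "my-project.my_dataset.clean".
    """
    df.to_gbq(destination_table, if_exists="replace")  # persist to BigQuery
    return df.to_pandas()  # materialize locally (small results only)
```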
Clean up
To avoid incurring charges to your Google Cloud account, delete any resources that you created but no longer need.