Use the Data Engineering Agent in Visual Studio Code

The Data Engineering Agent in the Google Cloud Data Agent Kit extension helps you create and build orchestration pipelines in your integrated development environment (IDE). Using Gemini for Google Cloud, the Data Engineering Agent provides a natural language interface that automates the generation, modification, and management of complex orchestration workflows.

The Data Agent Kit extension is supported in VS Code.

The Data Engineering Agent supports the following common data engineering tasks:

  • Create orchestration pipelines: Generate a new pipeline in an empty workspace or add additional pipelines to existing projects.
  • Modify pipeline structure: Use natural language to add, remove, or update individual actions within an orchestration pipeline.
  • Manage execution metadata: Change pipeline names and update execution schedules, for example, from manual to daily runs.
  • Troubleshoot pipeline runs: Proactively identify the root cause of failed pipeline runs and apply agent-suggested fixes.

Before you begin

Before you use the Data Engineering Agent in your IDE, perform the steps in this section.

  1. Install the Data Agent Kit extension for Visual Studio Code. The Data Engineering Agent is included in the Data Agent Kit extension.
  2. Enable the Gemini Data Analytics API and Dataform API.

  3. Install version 563.0.0 or later of the Google Cloud SDK.

  4. Install the gcloud beta commands.

  5. Configure an environment in Managed Service for Apache Airflow. Use the default Managed Service for Apache Airflow environment configuration. Then, in the Scheduler settings of the Data Agent Kit extension, enter the name of your Managed Service for Apache Airflow environment, the ID of the Google Cloud project that hosts the environment, and the region where the environment is located.
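The API and gcloud setup in steps 2 through 4 can also be done from the command line. The following is a minimal sketch, assuming that you have the gcloud CLI installed and authenticated against the correct project, and assuming that geminidataanalytics.googleapis.com and dataform.googleapis.com are the service names for the Gemini Data Analytics API and the Dataform API:

```shell
# Update the gcloud CLI; version 563.0.0 or later is required.
# Verify the installed version with: gcloud version
gcloud components update

# Install the gcloud beta command group.
gcloud components install beta

# Enable the Gemini Data Analytics API and the Dataform API in the
# current project. The service names here are assumptions; confirm
# them in the API Library before running.
gcloud services enable geminidataanalytics.googleapis.com dataform.googleapis.com
```

If you manage multiple projects, pass --project=PROJECT_ID to the gcloud services enable command so that the APIs are enabled on the project that hosts your Managed Service for Apache Airflow environment.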

Required roles

To get the permissions that you need to interact with the Data Engineering Agent and its underlying services, ask your administrator to grant you the required IAM roles on the project.

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Best practices

  • Understand that the agent follows a multi-step loop. The agent first generates a plan for your approval. Next, the agent acts on the plan (for example, by writing code). Finally, the agent verifies the results using dry runs or tests.
  • The agent's performance depends on the files open in your workspace. Use the @file syntax or open relevant SQLX files to give the agent the necessary context to build your orchestration logic.

Create an orchestration pipeline

To create an orchestration pipeline in an empty workspace or add another orchestration pipeline to an existing workspace, do the following:

  1. Open your IDE with the Data Agent Kit extension installed.
  2. Open the Ask Agent panel.
  3. Enter a natural language prompt to generate an orchestration pipeline. For example:

     Create an orchestration pipeline that unifies my Google Ads and YouTube Ads
     data into a single marketing table.
    

    After you enter the prompt, click Send.

  4. Review the generated pipeline structure and apply the changes.

Update a pipeline schedule

To change the orchestration pipeline name or update the execution schedule (for example, from manual to daily), do the following:

  1. Open your IDE with the Data Agent Kit extension installed.
  2. Navigate to your existing orchestration pipeline configuration.
  3. Open the Ask Agent panel.
  4. Enter a natural language prompt to update the pipeline schedule. For example:

    Update the execution schedule for this pipeline to run daily at 2 AM.
    

    The agent updates the underlying configuration, for example, the Apache Airflow DAG settings.

  5. Review and save the updated pipeline schedule.

Modify pipeline actions

To add or delete individual actions in your orchestration pipeline, do the following:

  1. Open your IDE with the Data Agent Kit extension installed.
  2. Identify the pipeline action you want to add or delete.
  3. Open the Ask Agent panel.
  4. Enter a natural language prompt to modify the pipeline actions. For example:

    Add a new action to the pipeline that runs the daily_sales_aggregation table
    task.
    
  5. Review and save the updated pipeline definition.

Troubleshoot

If you encounter any errors during orchestration pipeline generation, ensure that you have completed all the prerequisites required to run the Data Engineering Agent. For more information, see Before you begin.
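As a first check, you can verify from the command line that the required APIs are enabled on your project. The following is a hedged sketch, assuming the service names geminidataanalytics.googleapis.com and dataform.googleapis.com for the two APIs:

```shell
# List enabled services and filter for the APIs the agent depends on.
# If either service is missing from the output, re-run the setup steps
# in "Before you begin".
gcloud services list --enabled \
  --filter="name:geminidataanalytics.googleapis.com OR name:dataform.googleapis.com"
```

An empty result means neither API is enabled; a single row means only one of the two prerequisites is in place.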

To troubleshoot a failed orchestration or data pipeline run, do the following:

  1. Open your IDE with the Data Agent Kit extension installed.
  2. In your pipeline or development workspace, click the Executions tab.
  3. From the executions list, find the failed data pipeline run. Failed runs are indicated in the Status column.
  4. Hover over the failure icon, then click Investigate. The Data Engineering Agent analyzes the logs and identifies root causes, such as schema drift or data type mismatches.
  5. In the Ask Agent panel, review the suggested fix.
  6. To resolve the issue, enter a prompt such as Apply the suggested fix to the pipeline. Alternatively, you can manually update the SQLX code based on the agent's analysis.

What's next