This page explains how to provision resources for your pipelines.
About resource provisioning in Orchestration Pipelines
Orchestration Pipelines uses an Infrastructure-as-Code (IaC) approach to manage Google Cloud resources used by your data pipelines, which brings the following benefits:
- Version Control. Infrastructure changes are tracked in Git.
- Repeatability. Environments can be recreated reliably.
- Collaboration. Team members can review and contribute to infrastructure definitions.
- Automation. Integrates into CI/CD pipelines.
- Optionality and Coexistence. The resource provisioning framework is optional. If you already use Terraform or other established IaC practices for resource provisioning, you can continue to do so. You can manage a subset of resources relevant to a specific pipeline or application, potentially coexisting with resources managed by other tools.
In Orchestration Pipelines, your project can have one or more deployment environments. Each deployment environment's configuration defines how the pipelines and resources that belong to that environment are deployed. For example, you can have one deployment environment for development and another for production.
The following example shows a deployment environment configuration. Provisioned resources are defined in the resources list.
environments:
  dev:
    project: example-dev-project
    region: us-central1
    variables:
      dataset_name: marketing_analytics_dev
    secrets:
      dts_api_key: "projects/example-dev-project/secrets/dev-dts-key/versions/latest"
    resources:
      - type: dataproc.cluster
        name: example-static-cluster-resource
        definition:
          clusterName: example-static-cluster
When you deploy your pipeline, Orchestration Pipelines uses a stateless "create or update" model to provision resources that you've defined:
- If a resource is defined but doesn't exist, Orchestration Pipelines creates it.
- If a resource exists:
  - (Default) Orchestration Pipelines updates the resource's configuration to match the definition.
  - If you define this behavior in the resource's configuration, Orchestration Pipelines can ignore changes or re-create the resource.
- If you delete a resource's definition from the configuration, the resource itself isn't deleted. This approach prioritizes safety and prevents accidental data loss.
- If you rename an existing resource, Orchestration Pipelines creates a new resource with the new name and keeps the original resource.
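As a rough illustration of how these behaviors can be expressed in configuration, the following fragment sets the updateAction key (described in "Add a new resource") on two clusters. The resource and cluster names are made up for this sketch, and the definitions are reduced to a single field, so treat it as an outline rather than a complete configuration.

```yaml
environments:
  dev:
    project: example-dev-project
    region: us-central1
    resources:
      # Hypothetical resource: created if it doesn't exist, but later
      # changes to this definition are ignored.
      - type: dataproc.cluster
        name: example-manually-tuned-cluster
        updateAction: skip
        definition:
          clusterName: example-manually-tuned-cluster
      # Hypothetical resource: deleted and created again whenever a
      # change to this definition is detected.
      - type: dataproc.cluster
        name: example-recreated-cluster
        updateAction: recreate
        definition:
          clusterName: example-recreated-cluster
```

Removing either entry from the list later would leave the corresponding cluster in place, in line with the deletion behavior described above.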
Provisioned resources compared to resource profiles
Resource profiles are template files containing the definition of one or more Google Cloud resources. They are different from provisioned resources and can be used together with them:
- With provisioned resources: Instead of defining the same resource configuration inline within each deployment environment in your deployment.yaml, you can define it once in a profile and then reference it. Provisioned resources support a wide range of resource types that can be defined in resource profiles.
- With pipeline actions: You can use resource profiles in actions where resources are provisioned for the duration of an action. By using a resource profile instead of specifying the resource's configuration inline, you can keep the resource configuration separate from the pipeline action and reuse one configuration for several pipeline actions. Pipeline actions support resource profiles only for Managed Service for Apache Spark resources, for example when executing pipeline actions in an ephemeral cluster.
View available resource types
See Resource types.
You can also list all available resource types with the following gcloud CLI command:
gcloud beta orchestration-pipelines resource-types list
Add a new resource
To add a new provisioned resource to the deployment environment's configuration, add its definition as follows:
In your deployment environment's configuration, add a new item to the environments.DEVELOPMENT_ENVIRONMENT.resources list.
Specify the following keys:
- type: The type of the Google Cloud resource to provision. Examples: bigquery.dataset, dataform.repository. You can view available resource types using a gcloud CLI command.
- name: A logical name for the resource within the deployment.yaml file.
- parent: Specifies a parent resource for resource types that require one. Put the parent resource's name as the value.
- updateAction: Specifies which action must be taken when a change is detected in the resource's configuration:
  - (Default) patch: Update the resource's properties that have changed. The update operation only modifies the properties specified in the resource's configuration (YAML), leaving other existing properties unchanged. You can use this behavior, for example, to manage only properties important to pipeline execution, and keep other properties configured manually. If changes affect immutable fields or the resource is immutable, the deployment fails. In such cases, we recommend adjusting the definition to only modify mutable fields. If this isn't possible, you can change the update action to recreate.
  - skip: Ignore changes and don't update the resource's configuration. We recommend using this option when you want to manage the resource's existence but make configuration changes and updates by other means, for example, manually.
  - recreate: If any changes are detected, delete the existing resource and then create a new one based on the resource's current definition. We recommend using this option for resources that are entirely immutable or when changes are made to fields that can't be updated in place.
- definition: The resource's specification, as a mapping that reflects the structure of the resource's configuration in the resource's API.
- (Optional) metadata: Orchestration Pipelines-specific metadata. Some resource types use metadata fields to configure the resource in Google Cloud. For example, the metadata.location field can be used to create zonal resources.
Validate and deploy your pipeline. Orchestration Pipelines provisions the new resource when the pipeline is deployed.
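As a minimal sketch of the keys described above, the following fragment adds a BigQuery dataset that is updated with patch. The bigquery.dataset type is one of the examples listed above. The resource name, the description, and the label are made up for illustration, and the definition fields assume the dataset resource of the BigQuery API (description, labels). Check the resource type's reference before relying on them.

```yaml
environments:
  dev:
    project: example-dev-project
    region: us-central1
    resources:
      - type: bigquery.dataset
        # Hypothetical logical name. How it maps to the dataset ID isn't
        # stated on this page, so treat that mapping as an assumption.
        name: marketing_analytics_dev
        # patch is the default; it is spelled out here for clarity.
        updateAction: patch
        definition:
          # Only the properties listed here are managed. Properties that
          # aren't listed, such as ones configured manually, stay unchanged.
          description: "Marketing analytics tables for development"
          labels:
            team: marketing
```

With patch, changing the description or the label in this definition and deploying again would update only those properties on the existing dataset.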
Example: Long-running resources
This example demonstrates adding a long-running resource, in this case a static Managed Service for Apache Spark cluster. After the cluster is provisioned, it can be used in pipeline actions. We recommend using this general approach when your pipelines use persistent resources.
The following example adds a static Managed Service for Apache Spark cluster named
example-static-cluster to the dev deployment environment. The resource's
definition is provided based on the Dataproc API.
environments:
  dev:
    project: "example-project"
    region: "us-central1"
    # A runner environment for executing pipeline actions
    composer_environment: "example-runner-environment"
    resources:
      - type: dataproc.cluster
        name: example-static-cluster
        updateAction: patch
        definition:
          config:
            masterConfig:
              numInstances: 1
              machineTypeUri: n1-standard-4
            workerConfig:
              numInstances: 3
              machineTypeUri: n1-standard-4
This cluster can be used in pipeline actions as usual. There's no difference in usage compared to, for example, a manually created cluster.
modelVersion: "1.0"
pipelineId: "example-dataproc-pipeline"
...
actions:
  - pyspark:
      name: "run-pyspark-with-pyfiles-on-existing-cluster"
      engine:
        dataprocOnGce:
          existingCluster:
            clusterName: "example-static-cluster"
          location: {{ region }}
          projectId: {{ project }}
      mainFilePath: "scripts/my_spark_job_with_pyfiles.py"
      pyFiles:
        - "scripts/lib1.py"
Example: Automated build and release process for Dataform
This example demonstrates an automated build and release process for Dataform. In the example scenario:
A Dataform repository is connected to a GitHub repository through Developer Connect.
You push changes to the GitHub repository.
After the changes are pushed, you deploy the pipeline.
Dataform pulls the code from the GitHub repository, creates a compilation result, and makes it ready for execution.
A workflow configuration is then used to run workflows with this automatically compiled release.
The following code example defines a Dataform repository, a
release configuration, and a workflow configuration for it in the dev
deployment environment:
- The gitCommitish: "{{ COMMIT_SHA }}" line ties the release configuration to the specific Git commit being deployed. COMMIT_SHA is a variable that resolves to the deployed pipeline bundle's commit SHA.
- The codeCompilationConfig.pipelineConfig.path key points to a subfolder that contains pipeline assets. In this way, you can keep multiple Dataform pipelines in a single repository.
- Setting releaseCompilationResult to auto in the definition of the releaseConfig instructs Orchestration Pipelines to trigger a Dataform compilation after the releaseConfig resource is created or updated with the new gitCommitish:
  - The framework first upserts (updates or creates) the releaseConfig resource to point to the specified commit.
  - Then, because of the auto setting, it calls the Dataform API to create a new Compilation Result based on the code at that commit.
  - The releaseConfig is updated again to point to the newly created Compilation Result ID, making that version the "live" one.
environments:
  dev:
    project: example-project
    region: us-central1
    composer_environment: example-runner-environment
    artifact_storage:
      bucket: example-bucket
      path_prefix: initialized-artifact-bucket
    pipelines:
      - source: initialized-pipeline.yaml
      - source: dataform_local_pipeline.yaml
      - source: dataform_service_pipeline.yaml
    resources:
      - name: {{ repository_name }}
        type: dataform.repository
        definition:
          labels:
            bigquery-deployment: preview
      - type: dataform.repository.releaseConfig
        name: subfolder-release
        parent: {{ repository_name }}
        definition:
          gitCommitish: {{ COMMIT_SHA }}
          releaseCompilationResult: auto
          codeCompilationConfig:
            pipelineConfig:
              pipelineType: DATAFORM
              path: weather_dataform
      - type: dataform.repository.workflowConfig
        name: {{ workflow_config_name }}
        parent: {{ repository_name }}
        definition:
          releaseConfig: subfolder-release
          invocationConfig:
            serviceAccount: {{ service_account }}
    variables:
      service_account: example-account@example-project.iam.gserviceaccount.com
      network_uri: projects/example-project/global/networks/default
      subnetwork_uri: projects/example-project/regions/us-central1/subnetworks/default
      region: us-central1
      repository_name: weather-aggregation-repo
      workflow_config_name: updated-subfolder-workflow
An example pipeline action that runs a workflow with the created workflow configuration:
modelVersion: '1.0'
pipelineId: dataform_service_pipeline
description: Updated run Dataform pipeline via Dataform Service
runner: airflow
owner: data-eng-team
defaults:
  projectId: {{ project }}
  location: {{ region }}
  executionConfig:
    retries: 1
actions:
  - pipeline:
      name: run_dataform_service
      framework:
        dataform:
          dataformService:
            location: {{ region }}
            projectId: {{ project }}
            repositoryId: {{ repository_name }}
            workflowInvocation:
              workflowConfig: projects/{{ project }}/locations/{{ region }}/repositories/{{ repository_name }}/workflowConfigs/{{ workflow_config_name }}
  - python:
      name: fibonacci_python
      mainFilePath: scripts/fibonacci.py
      pythonCallable: fibonacciTen
      engine:
        local: {}
  - sql:
      name: dummy_bq_query
      engine:
        bigquery:
          location: {{ region }}
      query:
        inline: 'SELECT COUNT(*) FROM `{{ project }}.weather_data.sensor_readings`'
tags:
  - job:datacloud:vscode