The Spanner to Cloud Storage Text template is a batch pipeline that reads data from a Spanner table and writes it to Cloud Storage as CSV text files.
Pipeline requirements
- The input Spanner table must exist before running the pipeline.
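Before you run the pipeline, you can confirm that the table exists by running a trivial query with the gcloud CLI. This is a minimal sketch; the database name example-db, instance name test-instance, and table name Singers are placeholders for your own resources:
gcloud spanner databases execute-sql example-db \
    --instance=test-instance \
    --sql='SELECT 1 FROM Singers LIMIT 1'
If the table doesn't exist, the query fails with a table-not-found error, and the template job would fail for the same reason.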
Template parameters
Required parameters
- spannerTable: The Spanner table to read the data from.
- spannerProjectId: The ID of the Google Cloud project that contains the Spanner database to read data from.
- spannerInstanceId: The instance ID of the requested table.
- spannerDatabaseId: The database ID of the requested table.
- textWritePrefix: The Cloud Storage path prefix that specifies where the data is written. For example, gs://mybucket/somefolder/.
Optional parameters
- csvTempDirectory: The Cloud Storage path where temporary CSV files are written. For example, gs://your-bucket/your-path.
- spannerPriority: The request priority (https://cloud.google.com/spanner/docs/reference/rest/v1/RequestOptions) for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is MEDIUM.
- spannerHost: The Cloud Spanner endpoint to call in the template. Only used for testing. For example, https://batch-spanner.googleapis.com. Defaults to: https://batch-spanner.googleapis.com.
- spannerSnapshotTime: The timestamp that corresponds to the version of the Spanner database that you want to read from. The timestamp must be specified in the RFC 3339 (https://tools.ietf.org/html/rfc3339) UTC Zulu Time format. The timestamp must be in the past, and maximum timestamp staleness (https://cloud.google.com/spanner/docs/timestamp-bounds#maximum_timestamp_staleness) applies. For example, 1990-12-31T23:59:60Z. Defaults to empty.
- dataBoostEnabled: Set to true to use the compute resources of Spanner Data Boost to run the job with near-zero impact on Spanner OLTP workflows. When true, requires the spanner.databases.useDataBoost Identity and Access Management (IAM) permission. For more information, see Data Boost overview (https://cloud.google.com/spanner/docs/databoost/databoost-overview). Defaults to: false.
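These optional parameters are appended to the same comma-separated --parameters value as the required parameters (the gcloud section below shows the full command). As an illustrative sketch only, extra key-value pairs that lower the request priority and enable Data Boost might look like this:
spannerPriority=LOW,\
dataBoostEnabled=true
Both keys are documented above; the values shown are examples, not the defaults.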
Run the template
Console
- Go to the Dataflow Create job from template page.
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Cloud Spanner to Text Files on Cloud Storage template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow jobs run JOB_NAME \
    --gcs-location gs://dataflow-templates-REGION_NAME/VERSION/Spanner_to_GCS_Text \
    --region REGION_NAME \
    --parameters \
spannerProjectId=SPANNER_PROJECT_ID,\
spannerDatabaseId=DATABASE_ID,\
spannerInstanceId=INSTANCE_ID,\
spannerTable=TABLE_ID,\
textWritePrefix=gs://BUCKET_NAME/output/
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example, us-central1
- SPANNER_PROJECT_ID: the Google Cloud project ID of the Spanner database from which you want to read data
- DATABASE_ID: the Spanner database ID
- BUCKET_NAME: the name of your Cloud Storage bucket
- INSTANCE_ID: the Spanner instance ID
- TABLE_ID: the Spanner table ID
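For example, a fully substituted command might look like the following. The job name and the project, instance, database, table, and bucket names are hypothetical placeholders:
gcloud dataflow jobs run spanner-to-gcs-text \
    --gcs-location gs://dataflow-templates-us-central1/latest/Spanner_to_GCS_Text \
    --region us-central1 \
    --parameters \
spannerProjectId=my-project,\
spannerDatabaseId=example-db,\
spannerInstanceId=test-instance,\
spannerTable=Singers,\
textWritePrefix=gs://my-bucket/output/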
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/Spanner_to_GCS_Text
{
  "jobName": "JOB_NAME",
  "parameters": {
    "spannerProjectId": "SPANNER_PROJECT_ID",
    "spannerDatabaseId": "DATABASE_ID",
    "spannerInstanceId": "INSTANCE_ID",
    "spannerTable": "TABLE_ID",
    "textWritePrefix": "gs://BUCKET_NAME/output/"
  },
  "environment": {
    "zone": "us-central1-f"
  }
}
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
- LOCATION: the region where you want to deploy your Dataflow job, for example, us-central1
- SPANNER_PROJECT_ID: the Google Cloud project ID of the Spanner database from which you want to read data
- DATABASE_ID: the Spanner database ID
- BUCKET_NAME: the name of your Cloud Storage bucket
- INSTANCE_ID: the Spanner instance ID
- TABLE_ID: the Spanner table ID
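As a sketch, you can send this request with curl, authenticating with an access token from the gcloud CLI; substitute the same placeholder values described above:
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
          "jobName": "JOB_NAME",
          "parameters": {
            "spannerProjectId": "SPANNER_PROJECT_ID",
            "spannerDatabaseId": "DATABASE_ID",
            "spannerInstanceId": "INSTANCE_ID",
            "spannerTable": "TABLE_ID",
            "textWritePrefix": "gs://BUCKET_NAME/output/"
          },
          "environment": { "zone": "us-central1-f" }
        }' \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/templates:launch?gcsPath=gs://dataflow-templates-LOCATION/VERSION/Spanner_to_GCS_Text"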
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.