The Spanner to BigQuery template is a batch pipeline that reads data from a Spanner table and writes the data to BigQuery.
Pipeline requirements
- The source Spanner table must exist prior to running the pipeline.
- The BigQuery dataset must exist prior to running the pipeline.
- A JSON file that describes your BigQuery schema. The file must contain a top-level JSON array titled fields. The contents of the fields array must use the following pattern: {"name": "COLUMN_NAME", "type": "DATA_TYPE"}.

The following JSON describes an example BigQuery schema:
{ "fields": [ { "name": "location", "type": "STRING" }, { "name": "name", "type": "STRING" }, { "name": "age", "type": "STRING" }, { "name": "color", "type": "STRING" }, { "name": "coffee", "type": "STRING" } ] }
The Spanner to BigQuery batch template doesn't support importing data into STRUCT (Record) fields in the target BigQuery table.
Template parameters
Required parameters
- spannerInstanceId: The instance ID of the Spanner database to read from.
- spannerDatabaseId: The database ID of the Spanner database to export.
- outputTableSpec: The BigQuery output table location to write the output to. For example, <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME>. Depending on the createDisposition specified, the output table might be created automatically using the user-provided Avro schema.
Optional parameters
- spannerProjectId: The ID of the project that the Spanner database resides in. The default value for this parameter is the project where the Dataflow pipeline is running.
- spannerTableId: The table name of the Spanner database to export. Ignored if sqlQuery is set.
- spannerRpcPriority: The request priority (https://cloud.google.com/spanner/docs/reference/rest/v1/RequestOptions) for Spanner calls. Possible values are HIGH, MEDIUM, and LOW. The default value is HIGH.
- sqlQuery: The SQL query to use to read data from the Spanner database. Required if spannerTableId is empty.
- bigQuerySchemaPath: The Cloud Storage path (gs://) to the JSON file that defines your BigQuery schema. This is required if the createDisposition is not CREATE_NEVER. For example, gs://your-bucket/your-schema.json.
- writeDisposition: The BigQuery WriteDisposition (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload) value. For example, WRITE_APPEND, WRITE_EMPTY, or WRITE_TRUNCATE. Defaults to WRITE_APPEND.
- createDisposition: The BigQuery CreateDisposition (https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload) value. For example, CREATE_IF_NEEDED or CREATE_NEVER. Defaults to CREATE_IF_NEEDED.
- useStorageWriteApi: If true, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is false. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
- useStorageWriteApiAtLeastOnce: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to true. To use exactly-once semantics, set the parameter to false. This parameter applies only when useStorageWriteApi is true. The default value is false. A combined example appears after this list.
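For example, to opt in to the Storage Write API with at-least-once semantics while appending to an existing table, these flags can be combined in the --parameters list of the gcloud command shown later on this page. A sketch of just that fragment, not a complete command:

# Fragment only; append to the full gcloud invocation in the gcloud section below.
--parameters \
   useStorageWriteApi=true,\
   useStorageWriteApiAtLeastOnce=true,\
   writeDisposition=WRITE_APPEND,\
   createDisposition=CREATE_NEVER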
Run the template
Console
- Go to the Dataflow Create job from template page. Go to Create job from template
- In the Job name field, enter a unique job name.
- Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1. For a list of regions where you can run a Dataflow job, see Dataflow locations.
- From the Dataflow template drop-down menu, select the Spanner to BigQuery template.
- In the provided parameter fields, enter your parameter values.
- Click Run job.
gcloud
In your shell or terminal, run the template:
gcloud dataflow flex-template run JOB_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --parameters \
       spannerInstanceId=SPANNER_INSTANCE_ID,\
       spannerDatabaseId=SPANNER_DATABASE_ID,\
       spannerTableId=SPANNER_TABLE_ID,\
       sqlQuery=SQL_QUERY,\
       outputTableSpec=OUTPUT_TABLE_SPEC
Replace the following:
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- REGION_NAME: the region where you want to deploy your Dataflow job, for example, us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
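For illustration, a filled-in invocation might look like the following sketch. All of the concrete values (my-project, my-instance, my-database, Singers, my_dataset.singers) are hypothetical placeholders, and because spannerTableId is set, sqlQuery is omitted:

# All resource names below are hypothetical placeholders.
gcloud dataflow flex-template run spanner-to-bq-example \
    --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Cloud_Spanner_to_BigQuery_Flex \
    --project=my-project \
    --region=us-central1 \
    --parameters \
       spannerInstanceId=my-instance,\
       spannerDatabaseId=my-database,\
       spannerTableId=Singers,\
       outputTableSpec=my-project:my_dataset.singers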
API
To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
  "launchParameter": {
    "jobName": "JOB_NAME",
    "parameters": {
      "spannerInstanceId": "SPANNER_INSTANCE_ID",
      "spannerDatabaseId": "SPANNER_DATABASE_ID",
      "spannerTableId": "SPANNER_TABLE_ID",
      "sqlQuery": "SQL_QUERY",
      "outputTableSpec": "OUTPUT_TABLE_SPEC"
    },
    "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex",
    "environment": { "maxWorkers": "10" }
  }
}
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- JOB_NAME: a unique job name of your choice
- VERSION: the version of the template that you want to use. You can use the following values:
  - latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
  - the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
- LOCATION: the region where you want to deploy your Dataflow job, for example, us-central1
- SPANNER_INSTANCE_ID: the Spanner instance ID
- SPANNER_DATABASE_ID: the Spanner database ID
- SPANNER_TABLE_ID: the Spanner table name
- SQL_QUERY: the SQL query
- OUTPUT_TABLE_SPEC: the BigQuery table location
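As a sketch of how to send this request from the command line, assuming the gcloud CLI is installed and authenticated, an access token can be minted with gcloud auth print-access-token:

# Sketch only; replace the capitalized placeholders as described in the list above.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "launchParameter": {
          "jobName": "JOB_NAME",
          "parameters": {
            "spannerInstanceId": "SPANNER_INSTANCE_ID",
            "spannerDatabaseId": "SPANNER_DATABASE_ID",
            "spannerTableId": "SPANNER_TABLE_ID",
            "outputTableSpec": "OUTPUT_TABLE_SPEC"
          },
          "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Spanner_to_BigQuery_Flex"
        }
      }' \
  "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch"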
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.