By configuring partitioning and clustering for your BigQuery destination, you can optimize query performance and control costs. Partitioning and clustering lets you reduce the amount of data that BigQuery needs to scan during query execution.
Table partitioning
Table partitioning divides your table into segments called partitions, which make it easier to manage and query your data. You partition tables by specifying a partition column, which is used to segment the table.
Datastream supports three types of partitioning for BigQuery tables:
Partitioning by ingestion time: you can partition a table based on the time that Datastream ingests the data into BigQuery. You can select hourly, daily, monthly, or yearly granularity for ingestion time partitioning. Daily is the default granularity.
Partitioning by time-unit: you can partition a table based on a source column of a data type that Datastream maps to one of the following BigQuery data types:
DATE,DATETIME, orTIMESTAMP. For information about how Datastream maps source data types to BigQuery data types, see Data type mappings in BigQuery.Depending on the BigQuery data type to which the source column maps, you can select one of the following granularity types:
DAY,MONTH,YEAR: for theDATEdata type.HOUR,DAY,MONTH,YEAR: for theDATETIMEandTIMESTAMPdata types.
Partitioning by integer range: you can partition a table by a source column of a data type that Datastream then maps to the BigQuery
INTEGERdata type. You need to provide the following information:- Start: the start value of the first partition range (inclusive).
- End: The end value of the last partition range (exclusive).
- Interval: The width of each partition range.
For more information, see Integer range partitioning.
For more information about partitioning in BigQuery, see Introduction to partitioned tables and Managing partitioned tables.
Table clustering
Clustering sorts the data in a table based on the values in the clustering columns and colocates data with similar values. Queries that filter by the clustered columns only scan the relevant data blocks instead of the entire table or table partition. This can improve query performance and reduce query costs.
You can cluster your BigQuery tables by up to four columns. The order in which you specify the columns determines the sort order of the data.
You can cluster a table based on a source column of a data type that Datastream maps to one of the following BigQuery data types:
BIGNUMERICBOOLEANDATEDATETIMEGEOGRAPHYINT64NUMERICRANGESTRINGTIMESTAMP
For information about how Datastream maps source data types to BigQuery data types, see Data type mappings in BigQuery.
For more information about clustering in BigQuery, see Introduction to clustered tables and Manage clustered tables.
Limitations
For partitioning limitations in BigQuery, see Introduction to partitioned tables.
For clustering limitations in BigQuery, see Introduction to clustered tables.
For quota limits that apply to BigQuery partitioned tables, see Quotas and limits.
Additionally, the following limitations apply when using Datastream:
- Partitioning and clustering is only applied when Datastream creates a new BigQuery table. If you want to configure or modify partitioning and clustering settings for a table that already exists, first delete it from BigQuery, and then configure or modify its settings in Datastream.
- For MongoDB sources, you can partition your tables only by ingestion time.
- Clustering isn't supported for MongoDB databases.
What's next
- To learn more about streams, see Stream lifecycle.
- To learn how to create a stream, see Create a stream.
- To learn more about BigQuery, see BigQuery destination.