Performance characteristics of Pub/Sub to BigQuery pipelines

This page describes performance characteristics for Dataflow streaming jobs that read from Pub/Sub and write to BigQuery. It gives benchmark test results for two types of streaming pipeline:

  • Map-only (per-message transformation): Pipelines that perform per-message transformations, without tracking state or grouping elements across the stream. Examples include ETL, field validation, and schema mapping.

  • Windowed aggregation (GroupByKey): Pipelines that perform stateful operations and group data based on a key and a time window. Examples include counting events, calculating sums, and collecting records for a user session.

Most workloads for streaming data integration fall into these two categories. If your pipeline follows a similar pattern, you can use these benchmarks to assess your Dataflow job against a well-performing reference configuration.

Test methodology

Benchmarks were conducted using the following resources:

  • A pre-provisioned Pub/Sub topic with a steady input load. Messages were generated by using the Streaming Data Generator template.

    • Message rate: Roughly 1,000,000 messages per second
    • Input Load: 1 GiB/s
    • Message format: Randomly generated JSON text with a fixed schema
    • Message size: Approximately 1 KiB per message
  • A standard BigQuery table.

  • Dataflow streaming pipelines based on the Pub/Sub to BigQuery template. These pipelines perform the minimal required parsing and schema mapping. No custom user-defined function (UDF) was used.

After horizontal scaling stabilized and the pipelines reached a steady state, they were allowed to run for roughly one day, after which the results were collected and analyzed.

Dataflow pipelines

Two pipeline variants were tested:

Map-only pipeline. This pipeline performs a simple mapping and conversion of JSON messages. For this test, the Pub/Sub to BigQuery template was used without modification; a minimal sketch of the same pipeline shape appears after the following semantics note.

  • Semantics: The pipeline was tested using both exactly-once mode and at-least-once mode. At-least-once processing provides better throughput. However, it should only be used when duplicate records are acceptable or the downstream sink handles deduplication.
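
The template's source is not reproduced here, but the map-only shape is straightforward to sketch. The following hypothetical Apache Beam (Java) pipeline is a minimal equivalent: it reads JSON strings from Pub/Sub, converts each message to a TableRow, and appends the rows to a pre-provisioned BigQuery table. The subscription and table names are placeholders, and the parsing helper follows a common pattern for BigQueryIO rather than the template's exact code.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class MapOnlyPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        .apply("ReadFromPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME"))
        // Per-message transformation only: no state, no grouping across the stream.
        .apply("JsonToTableRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via(MapOnlyPipeline::parseJson))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("PROJECT_ID:DATASET.TABLE_NAME")
            // STORAGE_API_AT_LEAST_ONCE corresponds to at-least-once mode.
            // Exactly-once mode uses STORAGE_WRITE_API, with the stream and
            // triggering settings shown later on this page.
            .withMethod(BigQueryIO.Write.Method.STORAGE_API_AT_LEAST_ONCE)
            // The benchmark wrote to an existing table, so no schema is needed.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }

  // Decode one JSON message into a TableRow that matches the table schema.
  private static TableRow parseJson(String json) {
    try {
      return TableRowJsonCoder.of().decode(
          new ByteArrayInputStream(json.getBytes(StandardCharsets.UTF_8)), Coder.Context.OUTER);
    } catch (IOException e) {
      throw new RuntimeException("Failed to parse message: " + json, e);
    }
  }
}

Run the pipeline with --runner=DataflowRunner and the worker, region, and Streaming Engine settings described on this page.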

Windowed aggregation pipeline. This pipeline groups messages by a specific key in fixed-size windows and writes the aggregated records to BigQuery. For this test, a custom Apache Beam pipeline based on the Pub/Sub to BigQuery template was used; a sketch of the core transforms appears after the following list.

  • Aggregation logic: For each fixed, non-overlapping 1-minute window, messages with the same key were collected and written as a single aggregated record to BigQuery. This type of aggregation is commonly used in log processing to combine related events, such as a user's activity, into a single record for downstream analysis.

  • Key parallelism: The benchmark used 1,000,000 uniformly distributed keys.

  • Semantics: The pipeline was tested using exactly-once mode. Aggregations require exactly-once semantics to ensure correctness and to prevent double-counting within a group and window.
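
The custom pipeline used for this test is not published with the benchmarks, so the following Apache Beam (Java) sketch shows one plausible implementation of the same shape: messages are keyed by the logStreamId field from the test schema shown later on this page, grouped in fixed 1-minute windows, and merged into a single record per key and window. The key-extraction helper and output table are hypothetical.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;

public class WindowedAggregationPipeline {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        .apply("ReadFromPubSub", PubsubIO.readStrings()
            .fromSubscription("projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME"))
        // Key each message; the benchmark used ~1,000,000 uniformly distributed keys.
        .apply("KeyByLogStream", WithKeys.of(WindowedAggregationPipeline::extractKey)
            .withKeyType(TypeDescriptors.strings()))
        // Fixed, non-overlapping 1-minute windows.
        .apply("Window1m", Window.into(FixedWindows.of(Duration.standardMinutes(1))))
        // Stateful step: messages are shuffled and buffered until the window closes.
        .apply("GroupByKey", GroupByKey.<String, String>create())
        // Collapse each (key, window) group into one aggregated record.
        .apply("MergeGroup", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via((KV<String, Iterable<String>> kv) -> {
              List<String> messages = new ArrayList<>();
              kv.getValue().forEach(messages::add);
              return new TableRow().set("logStreamId", kv.getKey()).set("messages", messages);
            }))
        .setCoder(TableRowJsonCoder.of())
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("PROJECT_ID:DATASET.TABLE_NAME")
            .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API) // exactly-once
            .withNumStorageWriteApiStreams(500)
            .withTriggeringFrequency(Duration.standardSeconds(5))
            // The benchmark wrote to an existing table, so no schema is needed.
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }

  // Hypothetical helper: pull the grouping key out of the JSON payload.
  private static String extractKey(String json) {
    try {
      return MAPPER.readTree(json).get("logStreamId").asText();
    } catch (IOException e) {
      throw new RuntimeException("Failed to extract key from: " + json, e);
    }
  }
}

For highly reducing aggregations such as counts or sums, a Combine transform (for example, Count.perKey()) is usually cheaper than collecting full groups with GroupByKey.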

Job configuration

The following table shows how the Dataflow jobs were configured.

| Setting | Map only, exactly-once | Map only, at-least-once | Windowed aggregation, exactly-once |
|---|---|---|---|
| Worker machine type | n1-standard-2 | n1-standard-2 | n1-standard-2 |
| Worker machine vCPUs | 2 | 2 | 2 |
| Worker machine RAM | 7.5 GiB | 7.5 GiB | 7.5 GiB |
| Worker machine Persistent Disk | Standard Persistent Disk (HDD), 30 GB | Standard Persistent Disk (HDD), 30 GB | Standard Persistent Disk (HDD), 30 GB |
| Initial workers | 70 | 30 | 180 |
| Maximum workers | 100 | 100 | 250 |
| Streaming Engine | Yes | Yes | Yes |
| Horizontal autoscaling | Yes | Yes | Yes |
| Billing model | Resource-based billing | Resource-based billing | Resource-based billing |
| Storage Write API enabled? | Yes | Yes | Yes |
| Storage Write API streams | 200 | Not applicable | 500 |
| Storage Write API triggering frequency | 5 seconds | Not applicable | 5 seconds |

The BigQuery Storage Write API is recommended for streaming pipelines. When using exactly-once mode with the Storage Write API, you can adjust the following settings, which are illustrated in the code fragment after this list:

  • Number of write streams. To ensure sufficient key parallelism in the write stage, set the number of Storage Write API streams to a value greater than the total number of worker vCPUs, while keeping the throughput of each BigQuery write stream at a reasonable level.

  • Triggering frequency. A single-digit second value is suitable for high-throughput pipelines.
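
In Apache Beam's Java SDK, both settings map directly onto the BigQueryIO write transform. The following fragment is a minimal sketch reflecting the map-only, exactly-once benchmark configuration; the table name is a placeholder, and the surrounding pipeline code is omitted.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.joda.time.Duration;

BigQueryIO.Write<TableRow> write = BigQueryIO.writeTableRows()
    .to("PROJECT_ID:DATASET.TABLE_NAME")
    .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)  // exactly-once mode
    // More streams than the total worker vCPUs (100 workers x 2 vCPUs = 200).
    .withNumStorageWriteApiStreams(200)
    // A single-digit number of seconds suits high-throughput pipelines.
    .withTriggeringFrequency(Duration.standardSeconds(5));

The corresponding template parameters, numStorageWriteApiStreams and storageWriteApiTriggeringFrequencySec, appear in the gcloud commands later on this page.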

For more information, see Write from Dataflow to BigQuery.

Benchmark results

This section describes the results of the benchmark tests.

Throughput and resource usage

The following table shows the test results for pipeline throughput and resource usage.

| Result | Map only, exactly-once | Map only, at-least-once | Windowed aggregation, exactly-once |
|---|---|---|---|
| Input throughput per worker | Mean: 17 MBps, n=3 | Mean: 21 MBps, n=3 | Mean: 6 MBps, n=3 |
| Average CPU utilization across all workers | Mean: 65%, n=3 | Mean: 69%, n=3 | Mean: 80%, n=3 |
| Number of worker nodes | Mean: 57, n=3 | Mean: 48, n=3 | Mean: 169, n=3 |
| Streaming Engine Compute Units per hour | Mean: 125, n=3 | Mean: 46, n=3 | Mean: 354, n=3 |

The autoscaling algorithm can affect the target CPU utilization level. To achieve higher or lower target CPU utilization, you can set the autoscaling range or the worker utilization hint. Higher utilization targets can lead to lower costs but also worse tail latency, especially for varying loads.
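
For example, with the Beam Java SDK, you can set the autoscaling range through pipeline options and pass the utilization hint as a Dataflow service option. This is a minimal sketch, assuming the worker_utilization_hint service option and the setDataflowServiceOptions setter available in recent SDK versions; the values are illustrative, not a recommendation.

import java.util.Arrays;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AutoscalingOptionsExample {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

    // Autoscaling range: the floor and ceiling for horizontal autoscaling.
    options.setNumWorkers(70);       // initial workers
    options.setMaxNumWorkers(100);   // maximum workers

    // Worker utilization hint: target average CPU utilization (0.1 to 0.9).
    // Higher values trade worse tail latency for lower cost.
    options.setDataflowServiceOptions(Arrays.asList("worker_utilization_hint=0.8"));

    // Pass these options to Pipeline.create(options) when building the pipeline.
  }
}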

For a windowed aggregation pipeline, the type of aggregation, the window size, and key parallelism can have a large impact on resource usage.

Latency

The following table shows the benchmark results for pipeline latency.

| Total stage end-to-end latency | Map only, exactly-once | Map only, at-least-once | Windowed aggregation, exactly-once |
|---|---|---|---|
| P50 | Mean: 800 ms, n=3 | Mean: 160 ms, n=3 | Mean: 3,400 ms, n=3 |
| P95 | Mean: 2,000 ms, n=3 | Mean: 250 ms, n=3 | Mean: 13,000 ms, n=3 |
| P99 | Mean: 2,800 ms, n=3 | Mean: 410 ms, n=3 | Mean: 25,000 ms, n=3 |

The tests measured per-stage end-to-end latency (the job/streaming_engine/stage_end_to_end_latencies metric) across three long-running test executions. This metric measures how long the Streaming Engine spends in each pipeline stage. It encompasses all internal steps of the pipeline, such as the following (a query sketch for this metric appears after the list):

  • Shuffling and queueing messages for processing
  • The actual processing time; for example, converting messages to row objects
  • Writing persistent state, as well as time spent queueing to write persistent state
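
To inspect this metric for your own job, you can query Cloud Monitoring. The following Java sketch is illustrative: it assumes the metric is exported under the metric type dataflow.googleapis.com/job/streaming_engine/stage_end_to_end_latencies (derived from the metric name above) and that the monitored resource carries a job_id label; the project and job IDs are placeholders.

import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class StageLatencyQuery {
  public static void main(String[] args) throws Exception {
    try (MetricServiceClient client = MetricServiceClient.create()) {
      long nowMillis = System.currentTimeMillis();
      ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
          .setName(ProjectName.of("PROJECT_ID").toString())
          // Assumed metric type, derived from the metric name above.
          .setFilter("metric.type=\"dataflow.googleapis.com/job/streaming_engine/stage_end_to_end_latencies\""
              + " AND resource.labels.job_id=\"JOB_ID\"")
          // Look back over the last hour of the run.
          .setInterval(TimeInterval.newBuilder()
              .setStartTime(Timestamps.fromMillis(nowMillis - 3_600_000L))
              .setEndTime(Timestamps.fromMillis(nowMillis)))
          .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
          .build();

      // Print one line per time series (typically one per pipeline stage).
      for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
        System.out.println(series.getMetric().getLabelsMap()
            + " points=" + series.getPointsCount());
      }
    }
  }
}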

Another latency metric is data freshness. However, data freshness is affected by factors such as user-defined windowing and upstream delays in the source. Per-stage system latency provides a more objective baseline for a pipeline's internal processing efficiency and health under load.

The data was measured over approximately one day per run, with the initial startup periods discarded in order to reflect stable, steady-state performance. The results show two factors that introduce additional latency:

  • Exactly-once mode. To achieve exactly-once semantics, deterministic shuffling and persistent state lookups are required for de-duplication. At-least-once mode performs significantly faster, because it bypasses these steps.

  • Windowed aggregation. Messages must be fully shuffled, buffered, and written to persistent state before window closure, adding to the end-to-end latency.

The benchmarks shown here represent a baseline. Latency is highly sensitive to pipeline complexity. Custom UDFs, additional transforms, and complex windowing logic can all increase latency. Simple, highly reducing aggregations, such as sum and count, tend to result in lower latency than state-heavy operations, such as collecting elements into a list.

Estimate costs

You can estimate the baseline cost of your own comparable pipeline with resource-based billing by using the Google Cloud pricing calculator, as follows:

  1. Open the pricing calculator.
  2. Click Add to estimate.
  3. Select Dataflow.
  4. For Service type, select "Dataflow Classic".
  5. Select Advanced settings to show the full set of options.
  6. Choose the location where the job runs.
  7. For Job type, select "Streaming".
  8. Select Enable Streaming Engine.
  9. Enter information for the job run hours, worker nodes, worker machines, and Persistent Disk storage.
  10. Enter the estimated number of Streaming Engine Compute Units.

Resource usage and cost scale roughly linearly with the input throughput, although for small jobs with only a few workers, the total cost is dominated by fixed costs. As a starting point, you can extrapolate the number of worker nodes and the resource consumption from the benchmark results.

For example, suppose that you run a map-only pipeline in exactly-once mode, with an input data rate of 100 MiB/s. Based on the benchmark results for a 1 GiB/s pipeline, you can estimate the resource requirements as follows (a small helper that performs this extrapolation appears after the list):

  • Scaling factor: (100 MiB/s) / (1 GiB/s) ≈ 0.1
  • Projected worker nodes: 57 workers × 0.1 = 5.7 workers
  • Projected number of Streaming Engine Compute Units per hour: 125 × 0.1 = 12.5 units per hour
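
The following minimal Java sketch performs the same extrapolation from the map-only, exactly-once benchmark figures. Note that exact division gives a scaling factor of about 0.098, whereas the example above rounds it to 0.1.

public class CostExtrapolation {
  // Benchmark reference: map-only, exactly-once pipeline at 1 GiB/s input.
  private static final double BENCHMARK_INPUT_MIBPS = 1024.0;
  private static final double BENCHMARK_WORKERS = 57.0;
  private static final double BENCHMARK_SCU_PER_HOUR = 125.0;

  public static void main(String[] args) {
    double inputMiBps = 100.0;  // your pipeline's input rate
    double scalingFactor = inputMiBps / BENCHMARK_INPUT_MIBPS;
    System.out.printf("Scaling factor: %.3f%n", scalingFactor);
    System.out.printf("Projected workers: %.1f%n", BENCHMARK_WORKERS * scalingFactor);
    System.out.printf("Projected SCUs per hour: %.1f%n", BENCHMARK_SCU_PER_HOUR * scalingFactor);
  }
}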

This value should only be used as an initial estimate. The actual throughput and cost can vary significantly, based on factors such as machine type, message size distribution, user code, aggregation type, key parallelism, and window size. For more information, see Best practices for Dataflow cost optimization.

Run a test pipeline

This section shows the gcloud dataflow flex-template run commands that were used to run the map-only pipeline.

Exactly-once mode

gcloud dataflow flex-template run JOB_ID \
  --template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/PubSub_to_BigQuery_Flex \
  --enable-streaming-engine \
  --num-workers 70 \
  --max-workers 100 \
  --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
useStorageWriteApi=true,\
numStorageWriteApiStreams=200,\
storageWriteApiTriggeringFrequencySec=5

At-least-once mode

gcloud dataflow flex-template run JOB_ID \
  --template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/PubSub_to_BigQuery_Flex \
  --enable-streaming-engine \
  --num-workers 30 \
  --max-workers 100 \
  --parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
useStorageWriteApi=true \
  --additional-experiments streaming_mode_at_least_once

Replace the following:

  • JOB_ID: the Dataflow job ID
  • PROJECT_ID: the project ID
  • SUBSCRIPTION_NAME: the name of the Pub/Sub subscription
  • DATASET: the name of the BigQuery dataset
  • TABLE_NAME: the name of the BigQuery table

Generate test data

To generate test data, use the following command to run the Streaming Data Generator template:

gcloud dataflow flex-template run JOB_ID \
  --template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/Streaming_Data_Generator \
  --num-workers 70 \
  --max-workers 100 \
  --parameters \
topic=projects/PROJECT_ID/topics/TOPIC_NAME,\
qps=1000000,\
maxNumWorkers=100,\
schemaLocation=SCHEMA_LOCATION

Replace the following:

  • JOB_ID: the Dataflow job ID
  • PROJECT_ID: the project ID
  • TOPIC_NAME: the name of the Pub/Sub topic
  • SCHEMA_LOCATION: the path to a schema file in Cloud Storage

The Streaming Data Generator template uses a JSON Data Generator file to define the message schema. The benchmark tests used a message schema similar to the following:

{
  "logStreamId": "{{integer(1000001,2000000)}}",
  "message": "{{alphaNumeric(962)}}"
}
