This page describes performance characteristics for Dataflow streaming jobs that read from Pub/Sub and write to BigQuery. It gives benchmark test results for two types of streaming pipeline:
Map-only (per-message transformation): Pipelines that perform per-message transformations, without tracking state or grouping elements across the stream. Examples include ETL, field validation, and schema mapping.
Windowed aggregation (GroupByKey): Pipelines that perform stateful operations and group data based on a key and a time window. Examples include counting events, calculating sums, and collecting records for a user session.
Most workloads for streaming data integration fall into these two categories. If your pipeline follows a similar pattern, you can use these benchmarks to assess your Dataflow job against a well-performing reference configuration.
Test methodology
Benchmarks were conducted using the following resources:
A pre-provisioned Pub/Sub topic with a steady input load. Messages were generated by using the Streaming Data Generator template.
- Message rate: Roughly 1,000,000 messages per second
- Input load: 1 GiB/s
- Message format: Randomly-generated JSON text with a fixed schema
- Message size: Approximately 1 KiB per message
A standard BigQuery table.
Dataflow streaming pipelines based on the Pub/Sub to BigQuery template. These pipelines perform the minimal required parsing and schema mapping. No custom user-defined function (UDF) was used.
After horizontal scaling stabilized and the pipelines reached a steady state, they ran for roughly one day, after which the results were collected and analyzed.
Dataflow pipelines
Two pipeline variants were tested:
Map-only pipeline. This pipeline performs a simple mapping and conversion of JSON messages. For this test, the Pub/Sub to BigQuery template was used without modification.
- Semantics: The pipeline was tested using both exactly-once mode and at-least-once mode. At-least-once processing provides better throughput. However, it should only be used when duplicate records are acceptable or the downstream sink handles deduplication.
Windowed aggregation pipeline. This pipeline groups messages by a specific key in fixed-size windows and writes the aggregated records to BigQuery. For this test, a custom Apache Beam pipeline based on the Pub/Sub to BigQuery template was used.
- Aggregation logic: For each fixed, non-overlapping 1-minute window, messages with the same key were collected and written as a single aggregated record to BigQuery (see the sketch after this list). This type of aggregation is commonly used in log processing to combine related events, such as a user's activity, into a single record for downstream analysis.
- Key parallelism: The benchmark used 1,000,000 uniformly distributed keys.
- Semantics: The pipeline was tested using exactly-once mode. Aggregations require exactly-once semantics to ensure correctness and to prevent double-counting within a group and window.
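The custom aggregation pipeline used in the benchmark is not published as a template. The following Apache Beam Python sketch only illustrates the same shape of pipeline: parse JSON messages, key them by logStreamId, group them in 1-minute fixed windows, and write one aggregated row per key and window. The subscription, table, output schema, and the choice to join the grouped messages into a single string are illustrative assumptions, not the exact benchmark code.

```python
# Sketch of a windowed-aggregation pipeline with the benchmark's shape.
# Placeholders (PROJECT_ID, SUBSCRIPTION_NAME, DATASET, TABLE_NAME) and the
# output schema are assumptions for illustration.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows


def to_row(element):
    """Collapses all messages for one key and window into a single row."""
    key, messages = element
    return {"logStreamId": key, "messages": "\n".join(messages)}


def run():
    options = PipelineOptions(streaming=True, save_main_session=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME")
            | "ParseJson" >> beam.Map(json.loads)
            | "KeyByLogStream" >> beam.Map(
                lambda msg: (msg["logStreamId"], msg["message"]))
            | "FixedWindow1Min" >> beam.WindowInto(FixedWindows(60))
            | "GroupByKey" >> beam.GroupByKey()
            | "ToAggregatedRow" >> beam.Map(to_row)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="PROJECT_ID:DATASET.TABLE_NAME",
                schema="logStreamId:STRING,messages:STRING",
                method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API)
        )


if __name__ == "__main__":
    run()
```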
Job configuration
The following table shows how the Dataflow jobs were configured.
| Setting | Map only, exactly-once | Map only, at-least-once | Windowed aggregation, exactly-once |
|---|---|---|---|
| Worker machine type | n1-standard-2 | n1-standard-2 | n1-standard-2 |
| Worker machine vCPUs | 2 | 2 | 2 |
| Worker machine RAM | 7.5 GiB | 7.5 GiB | 7.5 GiB |
| Worker machine Persistent Disk | Standard Persistent Disk (HDD), 30 GB | Standard Persistent Disk (HDD), 30 GB | Standard Persistent Disk (HDD), 30 GB |
| Initial workers | 70 | 30 | 180 |
| Maximum workers | 100 | 100 | 250 |
| Streaming Engine | Yes | Yes | Yes |
| Horizontal autoscaling | Yes | Yes | Yes |
| Billing model | Resource-based billing | Resource-based billing | Resource-based billing |
| Storage Write API enabled? | Yes | Yes | Yes |
| Storage Write API streams | 200 | Not applicable | 500 |
| Storage Write API triggering frequency | 5 seconds | Not applicable | 5 seconds |
The BigQuery Storage Write API is recommended for streaming pipelines. When using exactly-once mode with the Storage Write API, you can adjust the following settings:
- Number of write streams: To ensure sufficient key parallelism in the write stage, set the number of Storage Write API streams to a value greater than the number of worker CPUs, while maintaining a reasonable level of throughput per BigQuery write stream.
- Triggering frequency: A single-digit number of seconds is suitable for high-throughput pipelines.
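If you build a pipeline directly with the Apache Beam SDK rather than using the template, these settings correspond to options on the BigQuery sink. The following Python fragment is a minimal sketch, assuming your SDK version exposes the Storage Write API method and a triggering frequency on WriteToBigQuery; in the template, the equivalent knobs are the numStorageWriteApiStreams and storageWriteApiTriggeringFrequencySec parameters shown later on this page.

```python
# Minimal sketch: a BigQuery sink that uses the Storage Write API with a
# short triggering frequency. The table name and schema are placeholders.
import apache_beam as beam

write_to_bigquery = beam.io.WriteToBigQuery(
    table="PROJECT_ID:DATASET.TABLE_NAME",
    schema="logStreamId:STRING,message:STRING",
    method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
    triggering_frequency=5,  # seconds; single-digit values suit high throughput
)
```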
For more information, see Write from Dataflow to BigQuery.
Benchmark results
This section describes the results of the benchmark tests.
Throughput and resource usage
The following table shows the test results for pipeline throughput and resource usage.
| Result | Map only, exactly-once | Map only, at-least-once | Windowed aggregation, exactly-once |
|---|---|---|---|
| Input throughput per worker | Mean: 17 MBps, n=3 | Mean: 21 MBps, n=3 | Mean: 6 MBps, n=3 |
| Average CPU utilization across all workers | Mean: 65%, n=3 | Mean: 69%, n=3 | Mean: 80%, n=3 |
| Number of worker nodes | Mean: 57, n=3 | Mean: 48, n=3 | Mean: 169, n=3 |
| Streaming Engine Compute Units per hour | Mean: 125, n=3 | Mean: 46, n=3 | Mean: 354, n=3 |
The autoscaling algorithm can affect the target CPU utilization level. To achieve higher or lower target CPU utilization, you can set the autoscaling range or the worker utilization hint. Higher utilization targets can lead to lower costs but also worse tail latency, especially for varying loads.
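For pipelines launched with the Apache Beam SDK, the autoscaling range and the utilization hint can be passed as pipeline options. The following Python fragment is a sketch; the worker_utilization_hint service option and the 0.8 target shown here are assumptions used to illustrate the mechanism, so confirm the option names against the Dataflow autoscaling documentation.

```python
# Sketch: setting the autoscaling range and a worker utilization hint
# through pipeline options. The 0.8 target is an illustrative value only.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    num_workers=70,        # initial workers
    max_num_workers=100,   # upper bound of the autoscaling range
    dataflow_service_options=["worker_utilization_hint=0.8"],
)
```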
For a window aggregation pipeline, the type of aggregation, the window size, and key parallelism can have a large impact on resource usage.
Latency
The following table shows the benchmark results for pipeline latency.
| Total stage end-to-end latency | Map only, exactly-once | Map only, at-least-once | Windowed aggregation, exactly-once |
|---|---|---|---|
| P50 | Mean: 800 ms, n=3 | Mean: 160 ms, n=3 | Mean: 3,400 ms, n=3 |
| P95 | Mean: 2,000 ms, n=3 | Mean: 250 ms, n=3 | Mean: 13,000 ms, n=3 |
| P99 | Mean: 2,800 ms, n=3 | Mean: 410 ms, n=3 | Mean: 25,000 ms, n=3 |
The tests measured per-stage end-to-end latency
(the job/streaming_engine/stage_end_to_end_latencies
metric) across three long-running test executions. This metric measures how long
the Streaming Engine spends in each pipeline stage. It encompasses all internal
steps of the pipeline, such as:
- Shuffling and queueing messages for processing
- The actual processing time; for example, converting messages to row objects
- Writing persistent state, as well as time spent queueing to write persistent state
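To look at the same metric for your own job, you can query Cloud Monitoring. The following Python sketch assumes that the metric is exported under the dataflow.googleapis.com prefix and that the job is identified by a job_id resource label; verify the exact metric type and labels in Metrics Explorer before relying on this query.

```python
# Sketch: pulling the stage end-to-end latency metric for one job from
# Cloud Monitoring over the last hour. The metric type string and the
# resource label names are assumptions; PROJECT_ID and JOB_ID are placeholders.
import time

from google.cloud import monitoring_v3

project_id = "PROJECT_ID"
job_id = "JOB_ID"

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type = '
            '"dataflow.googleapis.com/job/streaming_engine/stage_end_to_end_latencies" '
            f'AND resource.labels.job_id = "{job_id}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    print(series.metric.labels, len(series.points), "points")
```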
Another latency metric is data freshness. However, data freshness is affected by factors such as user-defined windowing and upstream delays in the source. The stage end-to-end latency metric provides a more objective baseline for a pipeline's internal processing efficiency and health under load.
The data was measured over approximately one day per run, with the initial startup periods discarded in order to reflect stable, steady-state performance. The results show two factors that introduce additional latency:
- Exactly-once mode: To achieve exactly-once semantics, deterministic shuffling and persistent state lookups are required for deduplication. At-least-once mode is significantly faster because it bypasses these steps.
- Windowed aggregation: Messages must be fully shuffled, buffered, and written to persistent state before the window closes, which adds to the end-to-end latency.
The benchmarks shown here represent a baseline. Latency is highly sensitive to pipeline complexity. Custom UDFs, additional transforms, and complex windowing logic can all increase latency. Simple, highly-reducing aggregations, such as sum and count, tend to result in lower latency than state-heavy operations, such as collecting elements into a list.
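To illustrate the last point, in Apache Beam a highly-reducing aggregation can be expressed with a combiner, which lets the runner pre-combine values before the shuffle, whereas collecting elements requires buffering every element until the window closes. The keyed_values collection in this sketch is a hypothetical windowed PCollection of (key, value) pairs, not part of the benchmark code.

```python
# Illustrative contrast, not benchmark code. keyed_values is assumed to be a
# windowed PCollection of (key, value) pairs.
import apache_beam as beam

# Highly-reducing aggregation: only a running sum per key and window is
# shuffled and kept in persistent state.
sums = keyed_values | "SumPerKey" >> beam.CombinePerKey(sum)

# State-heavy aggregation: every element is shuffled and buffered until the
# window closes, which adds end-to-end latency.
collected = keyed_values | "CollectPerKey" >> beam.GroupByKey()
```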
Estimate costs
You can estimate the baseline cost of your own comparable pipeline that uses resource-based billing by using the Google Cloud pricing calculator, as follows:
- Open the pricing calculator.
- Click Add to estimate.
- Select Dataflow.
- For Service type, select "Dataflow Classic".
- Select Advanced settings to show the full set of options.
- Choose the location where the job runs.
- For Job type, select "Streaming".
- Select Enable Streaming Engine.
- Enter information for the job run hours, worker nodes, worker machines, and Persistent Disk storage.
- Enter the estimated number of Streaming Engine Compute Units.
Resource usage and cost scale roughly linearly with the input throughput, although for small jobs with only a few workers, the total cost is dominated by fixed costs. As a starting point, you can extrapolate the number of worker nodes and the resource consumption from the benchmark results.
For example, suppose that you run a map-only pipeline in exactly-once mode, with an input data rate of 100 MiB/s. Based on the benchmark results for a 1 GiB/s pipeline, you can estimate the resource requirements as follows:
- Scaling factor: (100 MiB/s) / (1 GiB/s) ≈ 0.1
- Projected worker nodes: 57 workers × 0.1 = 5.7 workers
- Projected number of Streaming Engine Compute Units per hour: 125 × 0.1 = 12.5 units per hour
This value should only be used as an initial estimate. The actual throughput and cost can vary significantly, based on factors such as machine type, message size distribution, user code, aggregation type, key parallelism, and window size. For more information, see Best practices for Dataflow cost optimization.
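The same extrapolation can be written as a short calculation. The reference figures come from the map-only, exactly-once results tables on this page; only the input rate is a variable.

```python
# Rough extrapolation from the map-only, exactly-once benchmark
# (~1 GiB/s input, mean of 57 workers, mean of 125 Streaming Engine
# Compute Units per hour). Results are starting points, not guarantees.
BENCHMARK_INPUT_MIB_PER_S = 1024.0  # 1 GiB/s expressed in MiB/s
BENCHMARK_WORKERS = 57
BENCHMARK_COMPUTE_UNITS_PER_HOUR = 125


def estimate(input_mib_per_s: float) -> tuple[float, float]:
    """Returns (projected worker nodes, projected Compute Units per hour)."""
    scaling_factor = input_mib_per_s / BENCHMARK_INPUT_MIB_PER_S
    return (BENCHMARK_WORKERS * scaling_factor,
            BENCHMARK_COMPUTE_UNITS_PER_HOUR * scaling_factor)


# Example from the text: a 100 MiB/s pipeline. The text rounds the scaling
# factor to 0.1, giving about 5.7 workers and 12.5 Compute Units per hour.
workers, compute_units = estimate(100)
print(f"~{workers:.1f} workers, ~{compute_units:.1f} Compute Units per hour")
```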
Run a test pipeline
This section shows the
gcloud dataflow flex-template run
commands that were used to run the map-only pipeline.
Exactly-once mode
gcloud dataflow flex-template run JOB_ID \
--template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/PubSub_to_BigQuery_Flex \
--enable-streaming-engine \
--num-workers 70 \
--max-workers 100 \
--parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
useStorageWriteApi=true,\
numStorageWriteApiStreams=200,\
storageWriteApiTriggeringFrequencySec=5
At-least-once mode
gcloud dataflow flex-template run JOB_ID \
--template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/PubSub_to_BigQuery_Flex \
--enable-streaming-engine \
--num-workers 30 \
--max-workers 100 \
--parameters \
inputSubscription=projects/PROJECT_ID/subscriptions/SUBSCRIPTION_NAME,\
outputTableSpec=PROJECT_ID:DATASET.TABLE_NAME,\
useStorageWriteApi=true \
--additional-experiments streaming_mode_at_least_once
Replace the following:
- JOB_ID: the Dataflow job ID
- PROJECT_ID: the project ID
- SUBSCRIPTION_NAME: the name of the Pub/Sub subscription
- DATASET: the name of the BigQuery dataset
- TABLE_NAME: the name of the BigQuery table
Generate test data
To generate test data, use the following command to run the Streaming Data Generator template:
gcloud dataflow flex-template run JOB_ID \
--template-file-gcs-location gs://dataflow-templates-us-central1/latest/flex/Streaming_Data_Generator \
--num-workers 70 \
--max-workers 100 \
--parameters \
topic=projects/PROJECT_ID/topics/TOPIC_NAME,\
qps=1000000,\
maxNumWorkers=100,\
schemaLocation=SCHEMA_LOCATION
Replace the following:
- JOB_ID: the Dataflow job ID
- PROJECT_ID: the project ID
- TOPIC_NAME: the name of the Pub/Sub topic
- SCHEMA_LOCATION: the path to a schema file in Cloud Storage
The Streaming Data Generator template uses a JSON Data Generator file to define the message schema. The benchmark tests used a message schema similar to the following:
{ "logStreamId": "{{integer(1000001,2000000)}}", "message": "{{alphaNumeric(962)}}" }
Next steps
- Use the Dataflow job monitoring interface
- Best practices for Dataflow cost optimization
- Troubleshoot slow or stuck streaming jobs
- Read from Pub/Sub to Dataflow
- Write from Dataflow to BigQuery