Performance characteristics of Kafka to BigQuery pipelines

This page describes performance characteristics for Dataflow streaming jobs that read from Apache Kafka and write to BigQuery. It gives benchmark test results for map-only pipelines, which perform per-message transformations without tracking state or grouping elements across the stream.

Many data integration workloads, including ETL, field validation, and schema mapping, fall into the map-only category. If your pipeline follows this pattern, you can use these benchmarks to assess your Dataflow job against a well-performing reference configuration.

Test methodology

Benchmarks were conducted using the following resources:

  • A Managed Service for Apache Kafka cluster. Messages were generated by using the Streaming Data Generator template.

    • Message rate: Roughly 1,000,000 messages per second
    • Input load: 1 GiB/s
    • Message format: Randomly generated JSON text with a fixed schema
    • Message size: Approximately 1 KiB per message
    • Kafka partitions: 1,000
  • A standard BigQuery table.

  • A Dataflow streaming pipeline that used the Apache Kafka to BigQuery template. This pipeline performs the minimal required parsing and schema mapping. No custom user-defined function (UDF) was used.

After horizontal scaling stabilized and the pipeline reached a steady state, each pipeline ran for roughly one day before the results were collected and analyzed.

Dataflow pipeline

This benchmark uses a map-only pipeline that performs a simple mapping and conversion of JSON messages. The pipeline was tested in both exactly-once mode and at-least-once mode. At-least-once processing provides better throughput, but use it only when duplicate records are acceptable or when the downstream sink handles deduplication.

Job configuration

The following table shows how the Dataflow jobs were configured.

Setting Value
Worker machine type e2-standard-2
Worker machine vCPUs 2
Worker machine RAM 8 GB
Worker machine Persistent Disk Standard Persistent Disk (HDD), 30 GB
Maximum workers 120
Streaming Engine Yes
Horizontal autoscaling Yes
Billing model Resource-based billing
Storage Write API enabled? Yes
Storage Write API streams 400
Storage Write API triggering frequency 5 seconds
Message format JSON
Kafka authentication mode Application Default Credentials (ADC)

For more information, see Authentication types for Kafka brokers.

The BigQuery Storage Write API is recommended for streaming pipelines. When you use exactly-once mode with the Storage Write API, you can tune the following settings:

  • Number of write streams. To ensure sufficient key parallelism in the write stage, set the number of Storage Write API streams to a value greater than the number of worker CPUs, while following the per-stream throughput recommendations.

  • Triggering frequency. A single-digit second value is suitable for high-throughput pipelines.

For more information, see Write from Dataflow to BigQuery.

Special consideration should also be given to the number of Apache Kafka partitions. To ensure sufficient key parallelism in the read stage, the number of partitions should at least equal the total number of worker vCPUs. For more information, see Read from Apache Kafka to Dataflow.
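As a concrete illustration, both parallelism guidelines can be checked with quick arithmetic, using the values from this benchmark's job configuration:

```shell
# Parallelism sanity checks, using the benchmark's configuration values.
MAX_WORKERS=120        # maximum workers
VCPUS_PER_WORKER=2     # e2-standard-2
KAFKA_PARTITIONS=1000
WRITE_STREAMS=400      # numStorageWriteApiStreams

TOTAL_VCPUS=$((MAX_WORKERS * VCPUS_PER_WORKER))
echo "Total worker vCPUs at maximum scale: ${TOTAL_VCPUS}"

# Read stage: the partition count should be at least the total vCPU count.
if [ "${KAFKA_PARTITIONS}" -ge "${TOTAL_VCPUS}" ]; then
  echo "Read-stage parallelism OK (${KAFKA_PARTITIONS} partitions >= ${TOTAL_VCPUS} vCPUs)"
fi

# Write stage: the stream count should be greater than the total vCPU count.
if [ "${WRITE_STREAMS}" -gt "${TOTAL_VCPUS}" ]; then
  echo "Write-stage parallelism OK (${WRITE_STREAMS} streams > ${TOTAL_VCPUS} vCPUs)"
fi
```

With 120 workers of 2 vCPUs each, the benchmark's 1,000 partitions and 400 write streams both clear the 240-vCPU threshold.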

Benchmark results

This section describes the results of the benchmark tests.

Throughput and resource usage

The following table shows the test results for pipeline throughput and resource usage.

Result Exactly-once At-least-once
Input throughput per worker Mean: 15 MBps, n=3 Mean: 18 MBps, n=3
Average CPU utilization across all workers Mean: 70%, n=3 Mean: 75%, n=3
Number of worker nodes Mean: 63, n=3 Mean: 53, n=3
Streaming Engine Compute Units per hour Mean: 58, n=3 Mean: 0, n=3

The autoscaling algorithm can affect the target CPU utilization level. To achieve higher or lower target CPU utilization, you can set the autoscaling range or the worker utilization hint. Higher utilization targets can lead to lower costs but also worse tail latency, especially for varying loads.

Latency

The following table shows the benchmark results for pipeline latency for exactly-once mode, excluding the input stage.

Total stage end-to-end latency, excluding input stage Exactly-once
P50 Mean: 1,200 ms, n=3
P95 Mean: 3,000 ms, n=3
P99 Mean: 5,400 ms, n=3

The tests measured per-stage end-to-end latency (the job/streaming_engine/stage_end_to_end_latencies metric) across three long-running test executions. This metric measures how long the Streaming Engine spends in each pipeline stage. It encompasses all internal steps of the pipeline, such as:

  • Shuffling and queueing messages for processing
  • The actual processing time; for example, converting messages to row objects
  • Writing persistent state, as well as time spent queueing to write persistent state

Due to a limitation of the metric, the input stage latency isn't reported. Therefore, it's not included in the total.

The benchmarks shown here represent a baseline. Latency is highly sensitive to pipeline complexity. Custom UDFs, additional transforms, and complex windowing logic can all increase latency.

Estimate costs

You can estimate the baseline cost of your own comparable pipeline with resource-based billing by using the Google Cloud pricing calculator, as follows:

  1. Open the pricing calculator.
  2. Click Add to estimate.
  3. Select Dataflow.
  4. For Service type, select "Dataflow Classic".
  5. Select Advanced settings to show the full set of options.
  6. Choose the location where the job runs.
  7. For Job type, select "Streaming".
  8. Select Enable Streaming Engine.
  9. Enter information for the job run hours, worker nodes, worker machines, and Persistent Disk storage.
  10. Enter the estimated number of Streaming Engine Compute Units.

Resource usage and cost scale roughly linearly with input throughput, although for small jobs with only a few workers, fixed costs dominate the total cost. As a starting point, you can extrapolate the number of worker nodes and the resource consumption from the benchmark results.

For example, suppose that you run a map-only pipeline in exactly-once mode, with an input data rate of 100 MiB/s. Based on the benchmark results for a 1 GiB/s pipeline, you can estimate the resource requirements as follows:

  • Scaling factor: (100 MiB/s) / (1 GiB/s) ≈ 0.1
  • Projected worker nodes: 63 workers × 0.1 ≈ 6.3 workers
  • Projected Streaming Engine Compute Units per hour: 58 × 0.1 ≈ 5.8 units per hour
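The extrapolation above can be reproduced with a short calculation. Here 1 GiB/s is treated as roughly 1,000 MiB/s so the factor comes out to a round 0.1:

```shell
# Extrapolate benchmark resource usage to a 100 MiB/s pipeline.
awk 'BEGIN {
  scale = 100 / 1000                          # target rate / benchmark rate
  printf "Scaling factor: %.1f\n", scale
  printf "Projected workers: %.1f\n", 63 * scale            # exactly-once mean
  printf "Projected Compute Units/hour: %.1f\n", 58 * scale # exactly-once mean
}'
```

The same arithmetic applies to any other input rate: replace 100 with your pipeline's expected throughput in MiB/s.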

Use these values only as an initial estimate. Actual throughput and cost can vary significantly based on factors such as machine type, message size distribution, user code, aggregation type, key parallelism, and window size. For more information, see Best practices for Dataflow cost optimization.

Run a test pipeline

This section shows the gcloud dataflow flex-template run commands that were used to run the map-only pipeline.

Exactly-once mode

gcloud dataflow flex-template run JOB_NAME \
  --project=PROJECT_ID \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Kafka_to_BigQuery_Flex \
  --enable-streaming-engine \
  --parameters \
readBootstrapServerAndTopic="KAFKA_BOOTSTRAP_ADDRESS;KAFKA_TOPIC",\
kafkaReadAuthenticationMode=APPLICATION_DEFAULT_CREDENTIALS,\
messageFormat=JSON,\
writeMode=SINGLE_TABLE_NAME,\
outputTableSpec="PROJECT_ID:BQ_DATASET.BQ_TABLE_NAME",\
useBigQueryDLQ=true,\
outputDeadletterTable="PROJECT_ID:BQ_DATASET.BQ_TABLE_NAME_dlq",\
numStorageWriteApiStreams=400

At-least-once mode

gcloud dataflow flex-template run JOB_NAME \
  --project=PROJECT_ID \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Kafka_to_BigQuery_Flex \
  --enable-streaming-engine \
  --additional-experiments=streaming_mode_at_least_once \
  --parameters \
readBootstrapServerAndTopic="KAFKA_BOOTSTRAP_ADDRESS;KAFKA_TOPIC",\
kafkaReadAuthenticationMode=APPLICATION_DEFAULT_CREDENTIALS,\
messageFormat=JSON,\
writeMode=SINGLE_TABLE_NAME,\
outputTableSpec="PROJECT_ID:BQ_DATASET.BQ_TABLE_NAME",\
useBigQueryDLQ=true,\
outputDeadletterTable="PROJECT_ID:BQ_DATASET.BQ_TABLE_NAME_dlq",\
numStorageWriteApiStreams=400,\
useStorageWriteApiAtLeastOnce=true

Replace the following:

  • JOB_NAME: the Dataflow job name
  • PROJECT_ID: the project ID
  • KAFKA_BOOTSTRAP_ADDRESS: the bootstrap address of the Apache Kafka cluster
  • KAFKA_TOPIC: the name of the Kafka topic
  • BQ_DATASET: the name of the BigQuery dataset
  • BQ_TABLE_NAME: the name of the BigQuery table

Generate test data

To generate test data, use the following command to run the Streaming Data Generator template:

gcloud dataflow flex-template run JOB_NAME \
  --project=PROJECT_ID \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Streaming_Data_Generator \
  --max-workers=140 \
  --parameters \
schemaLocation=SCHEMA_LOCATION,\
qps=1000000,\
sinkType=KAFKA,\
bootstrapServer=KAFKA_BOOTSTRAP_ADDRESS,\
kafkaTopic=KAFKA_TOPIC,\
outputType=JSON

Replace the following:

  • JOB_NAME: the Dataflow job name
  • PROJECT_ID: the project ID
  • SCHEMA_LOCATION: the path to a schema file in Cloud Storage
  • KAFKA_BOOTSTRAP_ADDRESS: the bootstrap address of the Apache Kafka cluster
  • KAFKA_TOPIC: the name of the Kafka topic

The Streaming Data Generator template uses a JSON Data Generator file to define the message schema. The benchmark tests used a message schema similar to the following:

{
  "logStreamId": "{{integer(1000001,2000000)}}",
  "message": "{{alphaNumeric(962)}}"
}
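The 962-character message field appears chosen so that each record lands near the stated size of approximately 1 KiB. A rough back-of-the-envelope check, assuming compact JSON serialization and a seven-digit logStreamId (both are assumptions about the generator's output, not guarantees):

```shell
# Approximate serialized size of one generated message, assuming compact JSON.
awk 'BEGIN {
  framing = length("{\"logStreamId\":\"\",\"message\":\"\"}")  # braces, keys, quotes
  id_len  = 7     # integer(1000001,2000000) always yields seven digits
  msg_len = 962   # alphaNumeric(962)
  printf "Approximate message size: %d bytes\n", framing + id_len + msg_len
}'
```

Under these assumptions the estimate comes to about 1,000 bytes per message, consistent with the stated value of approximately 1 KiB.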

Next steps