Dataproc release notes

These release notes apply to the core Dataproc service, and include:
  • Announcements of the latest Dataproc image versions installed on the Compute Engine VMs used in Dataproc clusters. See the Dataproc version list for a list of supported Dataproc images, with links to pages that list the software components installed on current and recently released Dataproc images.
  • Announcements of new and updated Dataproc and Serverless for Apache Spark features, bug fixes, known issues, and deprecated functionality
Release schedule: The rollout of the latest Dataproc images to all regions can take up to one week. Until the rollout is complete, the latest Dataproc images may not be available in your region.

You can see the latest product updates for all of Google Cloud on the Google Cloud page, browse and filter all release notes in the Google Cloud console, or programmatically access release notes in BigQuery.

To get the latest product updates delivered to you, add the URL of this page to your feed reader, or add the feed URL directly.

October 16, 2025

Announcement
Change

Dataproc on Compute Engine: The default image version of premium tier clusters is now 2.3.

October 06, 2025

Announcement
Change

Serverless for Apache Spark: Upgraded Apache Spark to version 3.5.3 in the latest 2.3 Serverless for Apache Spark runtime versions.

September 15, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.149-debian10, 2.0.149-ubuntu18, 2.0.149-rocky8
  • 2.1.98-debian11, 2.1.98-ubuntu20, 2.1.98-ubuntu20-arm, 2.1.98-rocky8
  • 2.2.66-debian12, 2.2.66-ubuntu22, 2.2.66-ubuntu22-arm, 2.2.66-rocky9
  • 2.3.13-debian12, 2.3.13-ubuntu22, 2.3.13-ubuntu22-arm, 2.3.13-ml-ubuntu22, 2.3.13-rocky9

September 08, 2025

Announcement

Announcing the Preview release of Dataproc on Compute Engine image version 3.0.0-RC1:

  • Spark 4.0.0
  • Hadoop 3.4.1
  • Hive 4.1.0
  • Tez 0.10.5
  • Cloud Storage Connector 3.1.4
  • Conda 24.11
  • Java 17
  • Python 3.11
  • R 4.3
  • Scala 2.13

September 02, 2025

Feature

Multi-tenant clusters are now available in Preview. Many data engineers and scientists can share a multi-tenant cluster to execute their workloads in isolation from each other.

August 29, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.147-debian10, 2.0.147-ubuntu18, 2.0.147-rocky8
  • 2.1.96-debian11, 2.1.96-ubuntu20, 2.1.96-ubuntu20-arm, 2.1.96-rocky8
  • 2.2.64-debian12, 2.2.64-ubuntu22, 2.2.64-ubuntu22-arm, 2.2.64-rocky9
  • 2.3.10-debian12, 2.3.10-ubuntu22, 2.3.10-ubuntu22-arm, 2.3.10-ml-ubuntu22, 2.3.10-rocky9

August 12, 2025

Feature

Dataproc on Compute Engine: Image versions 2.2 and 2.3: The Iceberg optional component supports the BigLake Iceberg REST catalog.

July 31, 2025

Announcement

Dataproc Serverless for Spark: Subminor version 1.1.111 is the last release of runtime version 1.1, which will no longer be supported and will not receive new releases.

July 25, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

2.3.7-debian12, 2.3.7-ubuntu22, 2.3.7-ubuntu22-arm, 2.3.7-ml-ubuntu22, and 2.3.7-rocky9

The 2.3.7-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

July 15, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

2.3.6-debian12, 2.3.6-ubuntu22, 2.3.6-ml-ubuntu22, and 2.3.6-rocky9

The 2.3.6-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

Feature

Dataproc now allows dynamic updates of multi-tenant clusters.

July 07, 2025

Feature

The Cluster Scheduled Stop feature is available in preview. You can use this feature to stop clusters after a specified idle period, at a specified future time, or after a specified period from the cluster creation or update request.

July 04, 2025

Change

Serverless for Apache Spark (formerly known as Dataproc Serverless for Spark) now supports OS Login organization policy. Organizations, folders, and projects that enforce the OS Login policy can now use Serverless for Apache Spark.

July 01, 2025

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.110
  • 1.2.54
  • 2.2.54
  • 2.3.5

June 20, 2025

Change

Dataproc Serverless for Spark: Upgraded the Cloud Storage connector version to 2.2.28 in the 1.1 runtime.

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.144-debian10, 2.0.144-rocky8, 2.0.144-ubuntu18
  • 2.1.92-debian11, 2.1.92-rocky8, 2.1.92-ubuntu20, 2.1.92-ubuntu20-arm
  • 2.2.60-debian12, 2.2.60-rocky9, 2.2.60-ubuntu22
  • 2.3.4-debian12, 2.3.4-rocky9, 2.3.4-ubuntu22, and 2.3.4-ml-ubuntu22.

The 2.3.4-ml-ubuntu22 image extends the 2.3 base image with ML-specific libraries.

Change

Dataproc on Compute Engine: Dataproc now automatically configures Knox Gateway configuration properties gateway.dispatch.whitelist.services and gateway.dispatch.whitelist for component web UIs within the cluster.

Fixed

Dataproc on Compute Engine: Fixed a bug in trino-jvm cluster properties. To configure Trino JVM options with the trino-jvm prefix, follow these guidelines (see the example after this list):

  • Specify JVM options that start with -XX: without the colon, and append = to JVM flags that take no value. For example, the cluster property trino-jvm:-XX+HeapDumpOnOutOfMemoryError= is written to jvm.config as -XX:+HeapDumpOnOutOfMemoryError.
  • Specify JVM system properties that use the -D prefix the same way. For example, trino-jvm:-Dsystem.property.name=value.
  • A value that contains : cannot be provided as a cluster property.
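
For example, a cluster creation command that follows these guidelines might look like the following sketch; the cluster name, region, image version, and property values are illustrative:

# Illustrative sketch: pass Trino JVM options as Dataproc cluster properties.
# The -XX flag omits ":" and ends with "=" because it takes no value.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --image-version=2.2-debian12 \
    --optional-components=TRINO \
    --properties='trino-jvm:-XX+HeapDumpOnOutOfMemoryError=,trino-jvm:-Dsystem.property.name=value'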

June 10, 2025

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.108
  • 1.2.52
  • 2.2.52
  • 2.3.3

June 06, 2025

Fixed

Dataproc Serverless for Spark: Fixed a bug that prevented the spark.executorEnv property from correctly setting specific executor environment variables across all runtimes.

June 01, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.143-debian10, 2.0.143-rocky8, 2.0.143-ubuntu18
  • 2.1.91-debian11, 2.1.90-rocky8, 2.1.91-ubuntu20, 2.1.91-ubuntu20-arm
  • 2.2.59-debian12, 2.2.59-rocky9, 2.2.59-ubuntu22
Fixed

Dataproc on Compute Engine: Fixed the ordering of log entries generated from clusters created with 2.2+ image versions by assigning timestamps closer to the log generation time.

May 30, 2025

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.106
  • 1.2.50
  • 2.2.50
  • 2.3.1

May 28, 2025

Announcement

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime versions 2.3, which include the following components:

  • Spark 3.5.1
  • BigQuery Spark Connector 0.42.3
  • Cloud Storage Connector 3.1.2
  • Java 17
  • Python 3.11
  • R 4.3
  • Scala 2.13

May 23, 2025

Feature

Dataproc now supports the creation of zero-scale clusters, available in preview. This feature provides a cost-effective way to use Dataproc clusters, as they utilize only secondary workers that can be scaled down to zero when not in use.

Change

New Dataproc on Compute Engine subminor image versions:

  • 2.0.142-debian10, 2.0.142-rocky8, 2.0.142-ubuntu18
  • 2.1.90-debian11, 2.1.90-rocky8, 2.1.90-ubuntu20, 2.1.90-ubuntu20-arm
  • 2.2.58-debian12, 2.2.58-rocky9, 2.2.58-ubuntu22

May 02, 2025

Change

Dataproc on Compute Engine: Upgraded oauth2l to v1.3.3 to address CVEs.

Fixed

Dataproc on Compute Engine: Fixed an issue with Apache Hudi that caused failures in the Hudi CLI.

May 01, 2025

Change

Native Query Execution now supports reading Apache ORC complex types.

April 29, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.138-debian10, 2.0.138-rocky8, 2.0.138-ubuntu18
  • 2.1.86-debian11, 2.1.86-rocky8, 2.1.86-ubuntu20, 2.1.86-ubuntu20-arm
  • 2.2.54-debian12, 2.2.54-rocky9, 2.2.54-ubuntu22

Fixed

Dataproc on Compute Engine: Added a temporary object hold on the spark-job-history folder in Cloud Storage to prevent deletion by Cloud Storage lifecycle management.

April 17, 2025

Change

Dataproc on Compute Engine: The Spark BigQuery connector has been upgraded to version 0.34.1 in the latest 2.2 image version.

Fixed

Fixed a bug in which Jupyter fails to restart upon cluster restart on Personal Authentication clusters.

April 08, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.2.52-debian12, 2.2.52-rocky9, 2.2.52-ubuntu22

April 01, 2025

Fixed

Dataproc on Compute Engine: Fixed incorrectly attributed Dataproc job logs in Cloud Logging for clusters created with 2.2+ image versions. This happened when multiple Dataproc jobs were running concurrently on the same cluster.

March 17, 2025

Change

New Dataproc on Compute Engine subminor image versions:

  • 2.0.136-debian10, 2.0.136-rocky8, 2.0.136-ubuntu18
  • 2.1.84-debian11, 2.1.84-rocky8, 2.1.84-ubuntu20, 2.1.84-ubuntu20-arm
  • 2.2.50-debian12, 2.2.50-rocky9, 2.2.50-ubuntu22

March 10, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.135-debian10, 2.0.135-rocky8, 2.0.135-ubuntu18
  • 2.1.83-debian11, 2.1.83-rocky8, 2.1.83-ubuntu20, 2.1.83-ubuntu20-arm
  • 2.2.49-debian12, 2.2.49-rocky9, 2.2.49-ubuntu22

March 01, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.134-debian10, 2.0.134-rocky8, 2.0.134-ubuntu18
  • 2.1.82-debian11, 2.1.82-rocky8, 2.1.82-ubuntu20, 2.1.82-ubuntu20-arm
  • 2.2.48-debian12, 2.2.48-rocky9, 2.2.48-ubuntu22

February 24, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.133-debian10, 2.0.133-rocky8, 2.0.133-ubuntu18
  • 2.1.81-debian11, 2.1.81-rocky8, 2.1.81-ubuntu20, 2.1.81-ubuntu20-arm
  • 2.2.47-debian12, 2.2.47-rocky9, 2.2.47-ubuntu22

February 10, 2025

Announcement

Dataproc on Compute Engine: To help diagnose Dataproc clusters, you can set the following cluster properties to true when you create a cluster:

Note: Starting May 10, 2025, these properties will be set to true by default.

January 31, 2025

Change

New Dataproc on Compute Engine subminor image versions:

  • 2.0.130-debian10, 2.0.130-rocky8, 2.0.130-ubuntu18
  • 2.1.78-debian11, 2.1.78-rocky8, 2.1.78-ubuntu20, 2.1.78-ubuntu20-arm
  • 2.2.44-debian12, 2.2.44-rocky9, 2.2.44-ubuntu22

January 30, 2025

Change

Dataproc on Compute Engine: Private Google Access is now automatically enabled in the configured subnetwork when creating clusters with internal IP addresses.

January 24, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.129-debian10, 2.0.129-rocky8, 2.0.129-ubuntu18
  • 2.1.77-debian11, 2.1.77-rocky8, 2.1.77-ubuntu20, 2.1.77-ubuntu20-arm
  • 2.2.43-debian12, 2.2.43-rocky9, 2.2.43-ubuntu22
Announcement

Dataproc cluster caching now supports ARM images.

Announcement

Zeppelin component added to 2.1-Ubuntu20-arm images.

January 17, 2025

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.128-debian10, 2.0.128-rocky8, 2.0.128-ubuntu18
  • 2.1.76-debian11, 2.1.76-rocky8, 2.1.76-ubuntu20, 2.1.76-ubuntu20-arm
  • 2.2.42-debian12, 2.2.42-rocky9, 2.2.42-ubuntu22

January 13, 2025

Announcement

Dataproc Serverless for Spark: On March 10, 2025, the Dataproc Resource Manager API will be enabled as part of General Availability (GA) for Dataproc Serverless 3.0+ versions.

User action will not be required in response to this API enablement change.

The Dataproc Resource Manager will be implemented as a stand-alone Google Cloud API, dataprocrm.googleapis.com. It will allow Dataproc distributions of open source software, particularly Apache Spark, to communicate resource requirements directly.

December 12, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.126-debian10, 2.0.126-rocky8, 2.0.126-ubuntu18
  • 2.1.74-debian11, 2.1.74-rocky8, 2.1.74-ubuntu20, 2.1.74-ubuntu20-arm
  • 2.2.40-debian12, 2.2.40-rocky9, 2.2.40-ubuntu22

November 20, 2024

Change

Dataproc Serverless for Spark: Spark Lineage is available for all supported Dataproc Serverless for Spark runtime versions.

November 18, 2024

Feature

Dataproc is now available in the northamerica-south1 region (Queretaro, Mexico).

November 11, 2024

Announcement

Announcing the General Availability (GA) of Spot and non-preemptible VM mixing for Dataproc secondary workers which allows you to mix spot and non-preemptible secondary workers when you create a Dataproc cluster.

October 31, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.125-debian10, 2.0.125-rocky8, 2.0.125-ubuntu18
  • 2.1.73-debian11, 2.1.73-rocky8, 2.1.73-ubuntu20, 2.1.73-ubuntu20-arm
  • 2.2.39-debian12, 2.2.39-rocky9, 2.2.39-ubuntu22

Note: When using Dataproc version 2.0.125 with the ranger-gcs-plugin, please create a customer support request for your project to use the enhanced version of the plugin prior to its GA release. This note does not apply to Dataproc on Compute Engine image versions 2.1 and 2.2.

Fixed

Disabled HiveServer2 Ranger policy synchronization in non-HA clusters for the latest 2.1 and later image versions. Policy synchronization was causing instability in the HiveServer2 process while it tried to connect to ZooKeeper, which is not active by default in non-HA clusters.

October 25, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.124-debian10, 2.0.124-rocky8, 2.0.124-ubuntu18
  • 2.1.72-debian11, 2.1.72-rocky8, 2.1.72-ubuntu20, 2.1.72-ubuntu20-arm
  • 2.2.38-debian12, 2.2.38-rocky9, 2.2.38-ubuntu22
Change

Dataproc Serverless for Spark: Added common AI/ML Python packages by default to Dataproc Serverless for Spark 1.2 and 2.2 runtimes.

October 21, 2024

Announcement

Announcing the General Availability (GA) release of the Spark UI for Dataproc Serverless Batches and Interactive sessions, which lets you monitor and debug your serverless Spark workloads. The Spark UI is available by default and at no cost for all Dataproc Serverless workloads.

October 14, 2024

Change

Dataproc clusters created with image versions 2.0.57+, 2.1.5+, or 2.2+: Control plane operations for secondary workers are performed by the Dataproc Service Agent service account (service-<project-number>@dataproc-accounts.iam.gserviceaccount.com). They no longer use the Google APIs Service Agent service account (<project-number>@cloudservices.gserviceaccount.com).

September 23, 2024

Change

Dataproc Serverless for Spark: Added the google-cloud-dlp Python package by default to the Dataproc Serverless for Spark runtimes.

Fixed

Dataproc Serverless for Spark: Fixed an issue that would cause some batches and sessions to fail to start when using the premium compute tier.

September 21, 2024

Announcement

Blocklisted the following Dataproc on Compute Engine subminor image versions:

  • 2.0.119-debian10, 2.0.103-rocky8, 2.0.103-ubuntu18
  • 2.1.67-debian11, 2.1.51-rocky8, 2.1.51-ubuntu20, 2.1.51-ubuntu20-arm
  • 2.2.33-debian12, 2.2.17-rocky9, 2.2.17-ubuntu22

September 06, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.117-debian10, 2.0.117-rocky8, 2.0.117-ubuntu18
  • 2.1.65-debian11, 2.1.65-rocky8, 2.1.65-ubuntu20, 2.1.65-ubuntu20-arm
  • 2.2.31-debian12, 2.2.31-rocky9, 2.2.31-ubuntu22
Change

Dataproc on Compute Engine: The latest 2.2 image versions support Hudi Trino integration natively. If both components are selected when you create a Dataproc cluster, Trino will be configured to support Hudi automatically.

September 03, 2024

Change

Dataproc on Compute Engine: Apache Spark upgraded to version 3.5.1 in image version 2.2 starting with image version 2.2.30.

August 22, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.77
  • 1.2.21
  • 2.0.85
  • 2.2.21
Announcement

Dataproc Serverless for Spark: Subminor version 2.0.85 is the last release of runtime version 2.0, which will no longer be supported and will not receive new releases.

August 19, 2024

Announcement

syslog is now available for Dataproc cluster nodes in Cloud Logging. See Dataproc logs for cluster and job log information.

August 15, 2024

Change

New Dataproc Serverless for Spark runtime versions:

  • 1.1.76
  • 1.2.20
  • 2.0.84
  • 2.2.20

August 12, 2024

Change

New Dataproc Serverless for Spark runtime versions:

  • 1.1.75
  • 1.2.19
  • 2.0.83
  • 2.2.19

July 31, 2024

Change

Dataproc Serverless for Spark: Upgraded Spark BigQuery connector to version 0.36.4 in the latest 1.2 and 2.2 Dataproc Serverless for Spark runtime versions.

July 25, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.113-debian10, 2.0.113-rocky8, 2.0.113-ubuntu18
  • 2.1.61-debian11, 2.1.61-rocky8, 2.1.61-ubuntu20, 2.1.61-ubuntu20-arm
  • 2.2.27-debian12, 2.2.27-rocky9, 2.2.27-ubuntu22
Announcement

Enabled user sync by default for clusters using Ranger.

July 22, 2024

Change

Added support for N4 and C4 machine types for Dataproc image versions 2.1 and later. The following default configurations are now applied to clusters created with N4 or C4 machine types (see the example after this list):

  • bootdisktype = "hyperdisk-balanced"
  • nictype = "gvnic"
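
For example, a cluster that uses N4 machine types and picks up these defaults could be created with a command like the following sketch; the cluster name, region, image version, and machine type are illustrative:

# Illustrative sketch: N4 machine types; the hyperdisk-balanced boot disk and
# gVNIC defaults listed above are applied automatically.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --image-version=2.2-debian12 \
    --master-machine-type=n4-standard-4 \
    --worker-machine-type=n4-standard-4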

July 19, 2024

Change

New Dataproc Serverless for Spark runtime versions:

  • 1.1.72
  • 1.2.16
  • 2.0.80
  • 2.2.16

Note: Dataproc Serverless for Spark runtime versions 1.1.71, 1.2.15, 2.0.79, and 2.2.15 were not released.

July 18, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.112-debian10, 2.0.112-rocky8, 2.0.112-ubuntu18
  • 2.1.60-debian11, 2.1.60-rocky8, 2.1.60-ubuntu20, 2.1.60-ubuntu20-arm
  • 2.2.26-debian12, 2.2.26-rocky9, 2.2.26-ubuntu22

July 17, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.70
  • 1.2.14
  • 2.0.78
  • 2.2.14

July 12, 2024

Change

New Dataproc on Compute Engine subminor image versions:

  • 2.0.111-debian10, 2.0.112-rocky8, 2.0.112-ubuntu18
  • 2.1.59-debian11, 2.1.60-rocky8, 2.1.60-ubuntu20, 2.1.60-ubuntu20-arm
  • 2.2.25-debian12, 2.2.26-rocky9, 2.2.26-ubuntu22

July 08, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.110-debian10, 2.0.110-rocky8, 2.0.110-ubuntu18
  • 2.1.58-debian11, 2.1.58-rocky8, 2.1.58-ubuntu20, 2.1.58-ubuntu20-arm
  • 2.2.24-debian12, 2.2.24-rocky9, 2.2.24-ubuntu22

July 05, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.68
  • 1.2.12
  • 2.0.76
  • 2.2.12

July 03, 2024

Change

Dataproc on Compute Engine: Apache Hadoop upgraded to version 3.2.4 in image version 2.0 starting with image version 2.0.109.

June 26, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.67
  • 1.2.11
  • 2.0.75
  • 2.2.11

June 21, 2024

Breaking

Dataproc Serverless for Spark: To fix compatibility with open table formats (Apache Iceberg, Apache Hudi and Delta Lake), the ANTLR version will be downgraded from 4.13.1 to 4.9.3 in Dataproc Serverless for Spark runtime versions 1.2 and 2.2 on June 26, 2024.

June 20, 2024

Announcement

Dataproc Serverless for Spark: Spark runtime version 2.2 will become the default Dataproc Serverless for Spark runtime version on September 6, 2024.

June 13, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.106-debian10, 2.0.106-rocky8, 2.0.106-ubuntu18
  • 2.1.54-debian11, 2.1.54-rocky8, 2.1.54-ubuntu20, 2.1.54-ubuntu20-arm
  • 2.2.20-debian12, 2.2.20-rocky9, 2.2.20-ubuntu22
Change

Dataproc Serverless for Spark: Upgraded Spark BigQuery connector to version 0.36.3 in the latest 1.2 and 2.2 Dataproc Serverless for Spark runtime versions.

June 06, 2024

Change

Dataproc on Compute Engine: When creating a cluster with the latest Dataproc on Compute Engine image versions, the secondary worker boot disk type now defaults to the primary worker boot disk type, which is pd-standard if the primary worker boot disk type is not specified.

June 05, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.64
  • 1.2.8
  • 2.0.72
  • 2.2.8

June 03, 2024

Change

Dataproc Serverless for Spark: Automatically apply goog-dataproc-session-id, goog-dataproc-session-uuid and goog-dataproc-location labels for a session resource.

May 30, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.104-debian10, 2.0.104-rocky8, 2.0.104-ubuntu18
  • 2.1.52-debian11, 2.1.52-rocky8, 2.1.52-ubuntu20, 2.1.52-ubuntu20-arm
  • 2.2.18-debian12, 2.2.18-rocky9, 2.2.18-ubuntu22
Announcement

Dataproc Serverless for Spark: Subminor version 2.1.50 is the last release of runtime version 2.1, which will no longer be supported and will not receive new releases.

Change

Dataproc Serverless for Spark: Removed Spark data lineage support for runtime version 1.2.

Change

Dataproc Serverless for Spark: Enabled Spark checkpoint (spark.checkpoint.compress) and RDD (spark.rdd.compress) compression in the latest 1.2 and 2.2 runtime versions.

May 22, 2024

Change

Upgraded Spark BigQuery connector to version 0.36.2 in the latest 1.2 and 2.2 Dataproc Serverless for Spark runtime versions.

May 16, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.102-debian10, 2.0.102-rocky8, 2.0.102-ubuntu18

  • 2.1.50-debian11, 2.1.50-rocky8, 2.1.50-ubuntu20, 2.1.50-ubuntu20-arm

  • 2.2.16-debian12, 2.2.16-rocky9, 2.2.16-ubuntu22

Breaking

Anaconda's default channel is disabled for package installations on Dataproc on Compute Engine.

May 06, 2024

Change

Dataproc on Compute Engine:

May 01, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.60
  • 1.2.4
  • 2.0.68
  • 2.1.47
  • 2.2.4

April 21, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.98-debian10, 2.0.98-rocky8, 2.0.98-ubuntu18
  • 2.1.46-debian11, 2.1.46-rocky8, 2.1.46-ubuntu20, 2.1.46-ubuntu20-arm
  • 2.2.12-debian12, 2.2.12-rocky9, 2.2.12-ubuntu22

April 18, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.58
  • 1.2.2
  • 2.0.66
  • 2.1.45
  • 2.2.2
Change

Updated the default autoscaling V2 cool-down time from 2m to 1m to reduce scaling latency.

Fixed

Fixed a bug where Dataproc Serverless sessions that live longer than 48 hours are underbilled.

April 09, 2024

Feature

Dataproc Serverless for Spark: The preview release of Advanced troubleshooting, including Gemini-assisted troubleshooting, is now available for Spark workloads submitted with the following or later-released runtime versions:

  • 1.1.55
  • 1.2.0-RC1
  • 2.0.63
  • 2.1.42
  • 2.2.0-RC15

April 04, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.57
  • 1.2.1
  • 2.0.65
  • 2.1.44
  • 2.2.1

March 28, 2024

Feature

Dataproc on Compute Engine: New Hadoop Google Secret Manager Credential Provider feature introduced in latest Dataproc on Compute Engine 2.0 image versions.

March 27, 2024

Change

Dataproc Serverless for Spark:

  • Upgraded Spark to version 3.5.1 in the latest 1.2 and 2.2 runtimes.
  • Upgraded Conda to version 24.1 in the latest 1.2 and 2.2 runtimes.
  • Upgraded Spark BigQuery connector to version 0.36.1 in the latest 1.2 and 2.2 runtimes.

March 20, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.55
  • 1.2.0-RC1
  • 2.0.63
  • 2.1.42
  • 2.2.0-RC15
Change

Dataproc Serverless for Spark:

  • Upgraded Spark RAPIDS plugin to version 24.2.0 in the latest runtimes.
  • Upgraded Spark to version 3.3.4 in the latest 1.1 and 2.0 runtimes.
  • Backported SPARK-44198 in the latest 1.2 and 2.2 runtimes.

March 14, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.54
  • 2.0.62
  • 2.1.41
  • 2.2.0-RC14

March 07, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.53
  • 2.0.61
  • 2.1.40
  • 2.2.0-RC13
Change

Dataproc Serverless for Spark: Upgraded the Cloud Storage connector to version 2.2.20 in the latest 1.1, 2.0, and 2.1 runtimes.

March 06, 2024

Libraries

Dataproc on Compute Engine: Upgraded Cloud Storage connector version to 2.2.20 for 2.0 and 2.1 images.

March 04, 2024

Change

Dataproc Serverless for Spark: Extended Spark metrics collected for a batch now include executor:resultSize, executor:shuffleBytesWritten, and executor:shuffleTotalBytesRead.

February 29, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.52
  • 2.0.60
  • 2.1.39
  • 2.2.0-RC12

February 28, 2024

Feature

Dataproc on Compute Engine: The new Secret Manager credential provider feature is available in the latest 2.1 image versions.

Change

Dataproc on Compute Engine:

  • Upgraded Zookeeper to 3.8.3 for Dataproc 2.2.
  • Upgraded ORC for Hive to 1.15.13 for Dataproc 2.1.
  • Upgraded ORC for Spark to 1.7.10 for Dataproc 2.1.
  • Extended expiry for the internal Knox Gateway certificate from one year to five years from cluster creation for Dataproc images 2.0, 2.1, and 2.2.

February 15, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.50
  • 2.0.58
  • 2.1.37
  • 2.2.0-RC10
Announcement

Dataproc Serverless for Spark: Spark Lineage is available for Dataproc Serverless for Spark 1.1 runtime.

February 08, 2024

Feature

Dataproc on Compute Engine Ranger Cloud Storage enhancement:

  • Enabled downscoping
  • Added caching of tokens in local cache

Both settings are configurable and can be enabled by customers. See Use Ranger with caching and downscoping.

Change

Dataproc Serverless for Spark: Backported patch for HADOOP-18652.

February 02, 2024

Change

Dataproc on Compute Engine: Bucket ttl validation now also runs for buckets created by Dataproc.

Change

Dataproc on Compute Engine: Added a warning during cluster creation if the cluster Cloud Storage staging bucket is using the legacy fine-grained/ACL IAM configuration instead of the recommended Uniform bucket-level access controls.

Change

Dataproc Serverless for Spark: When dynamic allocation is enabled, the initial number of executors is the maximum of spark.dynamicAllocation.initialExecutors and spark.executor.instances.
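
As a sketch of the resulting behavior, assuming an illustrative batch file, region, and property values:

# Illustrative sketch: with dynamic allocation enabled (the default), this batch
# starts with max(5, 10) = 10 executors.
gcloud dataproc batches submit pyspark my_job.py \
    --region=us-central1 \
    --properties=spark.dynamicAllocation.initialExecutors=5,spark.executor.instances=10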

February 01, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.48
  • 2.0.56
  • 2.1.35
  • 2.2.0-RC8
Change

Dataproc on Compute Engine: Backported patches for HIVE-21214, HIVE-23154, HIVE-23354 and HIVE-23614.

January 31, 2024

Feature

Dataproc is now available in the africa-south1 region (Johannesburg, South Africa).

January 25, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.47
  • 2.0.55
  • 2.1.34
  • 2.2.0-RC7

January 24, 2024

Change

Backport HIVE-19568: Active/Passive HiveServer2 HA: Disallow direct connection to passive instance.

January 19, 2024

Change

Dataproc on Compute Engine: The default yarn.nm.liveness-monitor.expiry-interval-ms Hadoop YARN setting has been changed in the latest image versions from 15000 (15 seconds) to 120000 (2 minutes).

Change

Dataproc on Compute Engine: Upgraded Cloud Storage connector version to 2.2.19 in the latest 2.0 and 2.1 images.

January 05, 2024

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.44
  • 2.0.52
  • 2.1.31
  • 2.2.0-RC4

January 04, 2024

Announcement

The following previously released sub-minor versions of Dataproc images have been rolled back and can only be used when updating existing clusters that already use them:

  • 2.0.88-debian10, 2.0.88-rocky8, 2.0.88-ubuntu18
  • 2.1.36-debian11, 2.1.36-rocky8, 2.1.36-ubuntu20, 2.1.36-ubuntu20-arm
  • 2.2.2-debian12, 2.2.2-rocky9, 2.2.2-ubuntu22

January 02, 2024

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.88-debian10, 2.0.88-rocky8, 2.0.88-ubuntu18
  • 2.1.36-debian11, 2.1.36-rocky8, 2.1.36-ubuntu20, 2.1.36-ubuntu20-arm
  • 2.2.2-debian12, 2.2.2-rocky9, 2.2.2-ubuntu22

  • Rollback Notice: See the January 4, 2024 release note rollback notice.

Change

Dataproc on Compute Engine: Changed the HiveServer2 and Metastore maximum default JVM heap size to 32 GiB. Previously, the limit was set to 1/4 of total node memory, which could be too large on large-memory machines.

December 21, 2023

Announcement

New Dataproc Serverless for Spark runtime versions:

  • 1.1.43
  • 2.0.51
  • 2.1.30
  • 2.2.0-RC3

December 18, 2023

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.87-debian10, 2.0.87-rocky8, 2.0.87-ubuntu18
  • 2.1.35-debian11, 2.1.35-rocky8, 2.1.35-ubuntu20, 2.1.35-ubuntu20-arm
  • 2.2.1-debian12, 2.2.1-rocky9, 2.2.1-ubuntu22

December 06, 2023

Announcement

Announcing the Preview release of Dataproc Serverless for Spark 2.2 runtime:

  • Spark 3.5.0
  • BigQuery Spark Connector 0.34.0
  • Cloud Storage Connector 3.0.0-RC1
  • Conda 23.10
  • Java 17
  • Python 3.12
  • R 4.3
  • Scala 2.13

December 04, 2023

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.85-debian10, 2.0.85-rocky8, 2.0.85-ubuntu18
  • 2.1.33-debian11, 2.1.33-rocky8, 2.1.33-ubuntu20, 2.1.33-ubuntu20-arm
Feature

Added the Confidential Computing option on the "Manage Security" panel on the "Create a Dataproc cluster on Compute Engine" page in the Google Cloud console.

Fixed

Fixed Dataproc Hub issue in latest Dataproc on Compute Engine 2.1 image.

December 01, 2023

Change

The Cloud Storage connector has been upgraded to version 2.2.18 in all Dataproc Serverless for Spark runtimes.

November 17, 2023

Change

Upgraded the Cloud Storage connector version to 2.2.18 in the latest 2.0 and 2.1 Dataproc on Compute Engine image versions.

Fixed

Fixed a regression in the Zeppelin websocket rules that caused a websocket error in Zeppelin notebooks.

Issue

The Python kernel does not work in Zeppelin on the Dataproc on Compute Engine 2.1 image version. Other kernels are not impacted.

November 15, 2023

Announcement

You can use CMEK (customer-managed encryption keys) to encrypt Dataproc cluster data, including persistent disk data, job arguments and queries submitted with Dataproc jobs, and cluster data saved in the cluster's Dataproc staging bucket. See Use CMEK with cluster data for more information.
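
For the persistent disk portion, a minimal sketch of cluster creation with a customer-managed key follows; the cluster name, region, key ring, and key names are placeholders:

# Illustrative sketch: encrypt cluster persistent disk data with a customer-managed key.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --gce-pd-kms-key=projects/example-project/locations/us-central1/keyRings/example-ring/cryptoKeys/example-key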

November 10, 2023

Announcement

Announcing the General Availability (GA) release of Dataproc Jupyter Plugin and its availability in Vertex AI Workbench instance notebooks.

November 07, 2023

Fixed

Set spark.shuffle.mapOutput.minSizeForBroadcast=128m to fix SPARK-38101 when Dataproc Serverless Spark dynamic allocation is enabled.

November 01, 2023

Announcement

Announcing the Preview release of Dataproc Flexible VMs. This feature lets you specify prioritized lists of secondary worker VM types that Dataproc will select from when creating your cluster. Dataproc will select the VM type with sufficient available capacity while taking quotas and reservations into account.

October 30, 2023

Feature

Added spark.dataproc.scaling.version=2 config to let customers control the Dataproc Serverless for Spark autoscaling version.
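
For example, the property can be set when a batch is submitted; this is a sketch, and the batch file and region are illustrative:

# Illustrative sketch: opt a Serverless batch into autoscaling version 2.
gcloud dataproc batches submit pyspark my_job.py \
    --region=us-central1 \
    --properties=spark.dataproc.scaling.version=2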

Fixed

Fixed Knox rewrite rules for Zeppelin URLs in some cases in the latest 2.0 and 2.1 Dataproc on Compute Engine image versions.

October 23, 2023

Announcement
Feature

Dataproc on Compute Engine: Dataproc now collects the dataproc.googleapis.com/job/yarn/vcore_seconds and dataproc.googleapis.com/job/yarn/memory_seconds job-level resource attribution metrics to track YARN application vcore and memory usage during job execution. These metrics are collected by default and are not chargeable to customers.

Feature

Dataproc on Compute Engine: Dataproc now collects a dataproc.googleapis.com/node/yarn/nodemanager/health health metric to track the health of individual YARN node managers running on VMs. This metric is written against the gce_instance monitored resource to help you find suspect nodes. It is collected by default and is not chargeable to customers.

October 06, 2023

Change

Added the gs.http.connect-timeout and gs.http.read-timeout properties in Flink to set the connection timeout and read timeout for java-storage client in the latest Dataproc on Compute Engine 2.1 image version.

September 22, 2023

Announcement
Change

In the latest Dataproc on Compute Engine 2.0 and 2.1 image versions, the CLOUDSDK_PYTHON variable is now unset to allow the gcloud command-line tool to use its bundled Python interpreter.

September 19, 2023

Feature

Dataproc is now available in the me-central2 region (Dammam, Saudi Arabia).

September 15, 2023

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.76-debian10, 2.0.76-rocky8, 2.0.76-ubuntu18
  • 2.1.24-debian11, 2.1.24-rocky8, 2.1.24-ubuntu20, 2.1.24-ubuntu20-arm
Change

Scala has been upgraded to version 2.12.18 and Apache Tez has been upgraded to version 0.10.2 in Dataproc on Compute Engine 2.1 images.

September 12, 2023

Change

The dataproc.diagnostics.enabled property is now available to enable running diagnostics on Dataproc Serverless for Spark. The existing spark.dataproc.diagnostics.enabled property will be deprecated for use with newer runtimes.

September 08, 2023

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.75-debian10, 2.0.75-rocky8, 2.0.75-ubuntu18
  • 2.1.23-debian11, 2.1.23-rocky8, 2.1.23-ubuntu20, 2.1.23-ubuntu20-arm
Change

The Apache Spark version has been upgraded from 3.3.0 to 3.3.2 in Dataproc on Compute Engine 2.1 images.

August 29, 2023

Announcement

Announcing the Preview release of Dataproc Serverless for Spark Interactive sessions and the Dataproc Jupyter Plugin.

August 23, 2023

Fixed

Fixed a Dataproc Serverless issue where Spark batches failed with unhelpful error messages.

August 11, 2023

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.73-debian10, 2.0.73-rocky8, 2.0.73-ubuntu18
  • 2.1.21-debian11, 2.1.21-rocky8, 2.1.21-ubuntu20, 2.1.21-ubuntu20-arm
Announcement

Added new Dataproc Serverless Templates for batch workload creation:

  • Cloud Spanner to Cloud Storage
  • Cloud Storage to JDBC
  • Cloud Storage to Cloud Storage
  • Hive to BigQuery
  • JDBC to Cloud Spanner
  • JDBC to JDBC
  • Pub/Sub to Cloud Storage
Feature

Improved the reliability of Dataproc Serverless compute node initialization with a Premium disk tier option.

August 07, 2023

Announcement

Added a dataproc:dataproc.cluster.caching.enabled cluster property to enable or disable Dataproc on Compute Engine cluster caching. The property is false by default. Use this feature with the latest Dataproc on Compute Engine images.
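
For example, a sketch with an illustrative cluster name and region:

# Illustrative sketch: enable cluster caching at cluster creation time.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties=dataproc:dataproc.cluster.caching.enabled=true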

August 06, 2023

Change

The following previously released sub-minor versions of Dataproc on Compute Engine images unintentionally reverted several dependency library versions. This caused a risk of backward-incompatibility for some workloads.

These sub-minor versions have been rolled back, and can only be used when updating existing clusters that already use them:

  • 2.0.71-debian10, 2.0.71-rocky8, 2.0.71-ubuntu18
  • 2.1.19-debian11, 2.1.19-rocky8, 2.1.19-ubuntu20, 2.1.19-ubuntu20-arm

August 05, 2023

Announcement

New Dataproc on Compute Engine image versions:

  • 2.0.72-debian10, 2.0.72-rocky8, 2.0.72-ubuntu18
  • 2.1.20-debian11, 2.1.20-rocky8, 2.1.20-ubuntu20, 2.1.20-ubuntu20-arm
Change

Downgraded Cloud Storage connector version to 2.2.15 in all Dataproc on Compute Engine image versions to prevent potential performance regression.

Security

Backported ZEPPELIN-5434 to image 2.1 to fix CVE-2022-2048.

July 30, 2023

Change

New Dataproc on Compute Engine image versions:

  • 2.0.71-debian10, 2.0.72-rocky8, 2.0.72-ubuntu18
  • 2.1.19-debian11, 2.1.20-rocky8, 2.1.20-ubuntu20, 2.1.20-ubuntu20-arm

Note: The above image versions were rolled back. See the August 6, 2023 release note.

July 26, 2023

Breaking

Clusters cannot be created with a driver node group if the cluster image version is older than 2.0.57 or 2.1.5, or if the permissions for the staging bucket are missing.

Change

Added recommendation details in Autoscaler Stackdriver logs for the CANCEL and DO_NOT_CANCEL recommendations.

July 21, 2023

Announcement

New Dataproc on Compute Engine image versions, which includes a 2.1.18-ubuntu20-arm image that supports ARM machine types:

  • 2.0.70-debian10, 2.0.70-rocky8, 2.0.70-ubuntu18
  • 2.1.18-debian11, 2.1.18-rocky8, 2.1.18-ubuntu20, 2.1.18-ubuntu20-arm

July 14, 2023

Announcement
Change

Upgraded the Cloud Storage connector version to 2.2.16 in Dataproc Serverless for Spark runtimes.

July 07, 2023

Feature

The goog-dataproc-batch-id, goog-dataproc-batch-uuid and goog-dataproc-location labels are now automatically applied to Dataproc Serverless batch resources.

June 28, 2023

Announcement

New Dataproc on Compute Engine subminor image versions:

  • 2.0.68-debian10, 2.0.68-rocky8, 2.0.68-ubuntu18
  • 2.1.16-debian11, 2.1.16-rocky8, 2.1.16-ubuntu20
Change

Backported ZEPPELIN-5755 to Zeppelin 0.10 in 2.1 images for Spark 3.3 support.

June 26, 2023

Announcement

Added Dataproc Serverless Templates for batch creation:

  • Cloud Storage to BigQuery
  • Cloud Storage to Cloud Spanner
  • Hive to Cloud Storage
  • JDBC to BigQuery
  • JDBC to Cloud Storage

June 02, 2023

Announcement
Change

Upgraded the Cloud Storage connector to version 2.2.14 in Dataproc Serverless for Spark runtimes.

June 01, 2023

Announcement

New sub-minor versions of Dataproc images:

  • 2.0.66-debian10, 2.0.66-rocky8, 2.0.66-ubuntu18
  • 2.1.14-debian11, 2.1.14-rocky8, 2.1.14-ubuntu20

May 26, 2023

Announcement

New sub-minor versions of Dataproc images:

  • 2.0.65-debian10, 2.0.65-rocky8, 2.0.65-ubuntu18
  • 2.1.13-debian11, 2.1.13-rocky8, 2.1.13-ubuntu20

May 24, 2023

Fixed

Fixed the Serverless history server endpoint URL when a Persistent History Server (PHS) was set up without using a wildcard.

May 19, 2023

Announcement
Change

Upgraded the Cloud Storage connector to version 2.2.13 in Dataproc Serverless for Spark runtimes.

Fixed

Fixed the NoClassDefFoundError for log4j class in Zeppelin BigQuery interpreter in 2.0 images.

Fixed

Backported HIVE-22891 to 2.0 images.

May 18, 2023

Announcement

New sub-minor versions of Dataproc images:

  • 2.0.64-debian10, 2.0.64-rocky8, 2.0.64-ubuntu18
  • 2.1.12-debian11, 2.1.12-rocky8, 2.1.12-ubuntu20
Feature

You can now use --properties=dataproc:componentgateway.ha.enabled=true to enable the Dataproc Component Gateway and Knox along with the Spark History Server (SHS) UI in HA mode.
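
For example, a sketch with an illustrative cluster name and region; the HA and Component Gateway flags shown are the standard gcloud options:

# Illustrative sketch: an HA cluster with Component Gateway UIs served in HA mode.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-masters=3 \
    --enable-component-gateway \
    --properties=dataproc:componentgateway.ha.enabled=true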

April 28, 2023

Announcement
Change

Upgraded Spark to 3.4.0, along with its dependencies, in the Dataproc Serverless for Spark 2.1 runtime:

  • Jetty to 9.4.51.v20230217
  • ORC to 1.8.3
  • Parquet to 1.13.0
  • Protobuf to 3.22.3
Announcement

New sub-minor versions of Dataproc images:

  • 1.5.89-debian10, 1.5.89-rocky8, 1.5.89-ubuntu18
  • 2.0.63-debian10, 2.0.63-rocky8, 2.0.63-ubuntu18
  • 2.1.11-debian11, 2.1.11-rocky8, 2.1.11-ubuntu20
Change

The hive principal is now used for Hive catalog queries via Presto in Kerberos clusters.

April 24, 2023

Feature

Autoscaler recommendation reasoning details are available now in Cloud Logging logs.

April 18, 2023

Change

Add Autoscaler recommendation reasoning details in Cloud Logging.

March 28, 2023

Feature

Dataproc cluster creation now supports the pd-extreme disk type.

Change

Dataproc on GKE now disallows update operations.

Change

Dataproc on GKE diagnose operation now verifies that the master agent is running.

March 27, 2023

Announcement

New sub-minor versions of Dataproc images:

  • 1.5.86-debian10, 1.5.86-rocky8, 1.5.86-ubuntu18
  • 2.0.60-debian10, 2.0.60-rocky8, 2.0.60-ubuntu18
  • 2.1.8-debian11, 2.1.8-rocky8, 2.1.8-ubuntu20

March 24, 2023

Announcement
Change

Upgraded Python to 3.11 and Conda to 23.1 in the Dataproc Serverless for Spark 2.1 runtime.

March 16, 2023

Announcement

New sub-minor versions of Dataproc images:

  • 1.5.85-debian10, 1.5.85-rocky8, 1.5.85-ubuntu18
  • 2.0.59-debian10, 2.0.59-rocky8, 2.0.59-ubuntu18
  • 2.1.7-debian11, 2.1.7-rocky8, 2.1.7-ubuntu20

March 06, 2023

Change

Added stronger validations to disallow upper-case characters in template IDs per Resource Names guidance, which allows Workflow template creation to fail fast instead of failing at workflow template instantiation.

Change

Added decision metric field in Stackdriver autoscaler logs.

March 02, 2023

Announcement
Change

Release Dataproc Serverless for Spark runtime 2.1 preview:

  • Spark 3.4.0-rc1
  • BigQuery Spark Connector 0.28.0
  • Cloud Storage Connector 2.2.11
  • Conda 22.11
  • Java 17
  • Python 3.10
  • R 4.2
  • Scala 2.13

February 28, 2023

Feature

--properties=dataproc:agent.ha.enabled=true can now be used to enable the Dataproc Agent in high availability mode. This property is supported by Dataproc image versions 2.0 and later.
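
For example, a sketch with an illustrative cluster name, region, and image version:

# Illustrative sketch: enable the Dataproc Agent in high availability mode.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --image-version=2.0-debian10 \
    --properties=dataproc:agent.ha.enabled=true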

February 10, 2023

Announcement
Feature

Dataproc Serverless for Spark now supports applying an unconditional TTL to batches. The workload is terminated after the TTL expires, without waiting for work to complete.

January 27, 2023

Announcement

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime version 1.1, which includes the following components:

  • Spark 3.3.1
  • BigQuery Spark Connector 0.28.0
  • Cloud Storage Connector 2.2.9
  • Conda 22.11
  • Java 11
  • Python 3.10
  • R 4.2
  • Scala 2.12

January 23, 2023

Feature

Added support for enabling Hive Metastore OSS metrics by passing hivemetastore to the --metric-sources flag during cluster creation.
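
For example, a sketch with an illustrative cluster name and region:

# Illustrative sketch: enable Hive Metastore OSS metric collection at cluster creation.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --metric-sources=hivemetastore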

Change

Upgraded Parquet to 1.12.2 for 2.1 images.

December 19, 2022

Change

Backported Spark patch in Dataproc Serverless for Spark runtime 1.0 and 2.0:

  • SPARK-40481: Ignore stage fetch failure caused by decommissioned executor.

December 12, 2022

Announcement

New sub-minor versions of Dataproc images:

  • 1.5.78-debian10, 1.5.78-rocky8, 1.5.78-ubuntu18
  • 2.0.52-debian10, 2.0.52-rocky8, 2.0.52-ubuntu18
  • 2.1.0-debian11, 2.1.0-rocky8, 2.1.0-ubuntu20
Change

Upgrade Cloud Storage connector version to 2.1.9 for 1.5 images.

Change

Dataproc Serverless for Spark runtime 2.0:

  • Upgrade Cloud Storage connector to 2.2.9
  • Upgrade Spark dependencies:
    • Protobuf to 3.21.9
    • RoaringBitmap to 0.9.35
Change

Backport Spark patches in Dataproc Serverless for Spark runtime 1.0 and 2.0:

  • SPARK-39324: Log ExecutorDecommission as INFO level in TaskSchedulerImpl
  • SPARK-40168: Handle SparkException during shuffle block migration
  • SPARK-40269: Randomize the orders of peer in BlockManagerDecommissioner
  • SPARK-40778: Make HeartbeatReceiver as an IsolatedRpcEndpoint

December 09, 2022

Feature

Added the dataproc.googleapis.com/job/state metric to track the status of Dataproc Jobs states (such as, RUNNING or PENDING). This metric is collected by default and is not chargeable to customers.

Dataproc job IDs are now queryable and viewable from MQL (Monitoring Query Language), and the metric can be used for long-running job monitoring and alerting.

November 17, 2022

Change

Added support for Dataproc to attach to a gRPC Dataproc Metastore in any region.

Change

Nodemanagers in DECOMMISSIONING, NEW, and SHUTDOWN state are now included in the /cluster/yarn/nodemanagers metric.

Change

Dataproc Serverless for Spark now shows the subminor runtime version used in the runtimeConfig.version field.

November 14, 2022

Fixed

Backported HIVE-17317 in the latest 2.0 and 2.1 images.

Fixed

Dataproc Serverless for Spark runtime versions 1.0.23 and 2.0.3 downgrade the google-auth-oauthlib Python package to fix the gcsfs Python package.

Security

Upgraded Apache Commons Text to 1.10.0 for Knox in 1.5 images, and for Spark, Pig, Knox in 2.0 images, addressing CVE-2022-42889.

November 11, 2022

Deprecated

Dataproc Serverless for Spark runtime versions 1.0.22 and 2.0.2 will be deprecated on 11/11/2022. New batch submissions that use these runtime versions will fail starting 11/11/2022. This is due to an update to the google-auth library that breaks PySpark batch workloads that depend on gcsfs. Upcoming runtime versions will address this issue.

Deprecated

Dataproc images 2.0.50 and preview 2.1.0-RC3 are deprecated, and cluster creations based on these images will fail starting 11/11/2022. This is due to an update to the google-auth library that breaks PySpark batch workloads that depend on gcsfs. Upcoming image versions will have a fix to address this issue.

October 31, 2022

Feature

Dataproc Serverless for Spark now outputs approximate_usage after a workload finishes, showing the approximate DCU and shuffle storage resource consumption of the workload.

Change

Removed the Auto Zone placement check for supported machine types.

October 28, 2022

Announcement

Dataproc Serverless for Spark now uses runtime versions 1.0.21 and 2.0.1.

Security

Dataproc Serverless for Spark runtime version 2.0.1 upgrades Apache Commons Text to 1.10.0, addressing CVE-2022-42889.

October 25, 2022

Announcement

Dataproc Serverless for Spark runtime version 2.0 will become the default Dataproc Serverless for Spark runtime version on December 13, 2022.

October 24, 2022

Feature

Dataproc Serverless for Spark now supports the spark.dataproc.diagnostics.enabled property, which enables auto diagnostics on batch failure. Note that enabling auto diagnostics holds compute and storage quota after the batch is complete and until diagnostics finish.
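
For example, a sketch with an illustrative batch file and region:

# Illustrative sketch: enable auto diagnostics for a Serverless batch. Compute and
# storage quota is held after the batch completes until diagnostics finish.
gcloud dataproc batches submit pyspark my_job.py \
    --region=us-central1 \
    --properties=spark.dataproc.diagnostics.enabled=true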

October 21, 2022

Announcement

Announcing the General Availability (GA) release of Dataproc Serverless for Spark runtime 2.0.

Announcement

Dataproc Serverless for Spark now uses runtime version 1.0.20 and 2.0.0.

Libraries

Upgraded the Conscrypt library to 2.5.2 in the latest 1.5 and 2.0 images.

Change

Disabled auto deletion of files under /tmp in the latest Rocky images. Previous Rocky images deleted files in the /tmp folder every 10 days due to the default OS setting in /usr/lib/tmpfiles.d/tmp.conf.

Change

Set yarn:spark.yarn.shuffle.stop_on_failure to true by default in the latest 1.5 and 2.0 images. This change causes YARN node manager startup to fail if the Spark external shuffle service startup fails. On VM boot, Dataproc will continuously restart the YARN node manager until it is able to start. This change reduces Spark executor errors, such as: org.apache.spark.SparkException: Unable to register with external shuffle server due to : Failed to connect to <worker host>:7337, particularly when starting a stopped cluster. See Spark external shuffle service documentation for details.

Security

Enabled Spark authentication and encryption for Kerberos enabled clusters created with the latest 1.5 and 2.0 images.

Fixed

Backported the patch for SPARK-36383 in the latest 2.0 images.

Fixed

Backported the patch for HIVE-19310 in the latest 1.5 images.

Fixed

Backported the patch for HIVE-20004 in the latest 2.0 images.

Fixed

Fixed an issue in which Presto queries might fail when submitted to HA clusters in the latest 1.5 and 2.0 images.

Fixed

Fixed a "gsutil not found" issue in the latest 1.5 and 2.0 Ubuntu images.

Fixed

Backported the patch for HIVE-20607 in the latest 2.0 images.

October 03, 2022

Announcement

Spot VMs can now be used as secondary workers in a Dataproc cluster. Unlike legacy preemptible VMs, which have a 24-hour maximum lifetime, Spot VMs have no maximum lifetime.
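
For example, Spot secondary workers can be requested at cluster creation; this sketch assumes the standard gcloud secondary-worker flags, and the cluster name, region, and worker counts are illustrative:

# Illustrative sketch: use Spot VMs as secondary workers.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=4 \
    --secondary-worker-type=spot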

October 01, 2022

Feature

Dataproc Serverless for Spark now supports Artifact Registry with image streaming.

September 27, 2022

Feature

Dataproc Auto Zone Placement now takes ANY reservation into account by default.

September 26, 2022

Change

Dataproc Serverless for Spark now uses runtime versions 1.0.19 and 2.0.0-RC4, which upgrade the Cloud Storage connector to 2.2.8 in both runtimes.

September 20, 2022

Change

Dataproc Serverless for Spark: Reduced the latency between batch workload completion and when a batch is marked SUCCEEDED.

Change

Dataproc Serverless for Spark: Sets a maximum limit of 500 workers per scale up or scale down operation.

Change

Dataproc on Compute Engine: Stop all master and worker VMs when cluster startup fails due to a stockout or insufficient quota.

September 19, 2022

Change

Dataproc Serverless for Spark now uses runtime version 1.0.18 and 2.0.0-RC3.

August 24, 2022

Announcement

Announcing the Preview release of Dataproc custom constraints, which can be used to allow or deny specific operations on Dataproc clusters.

August 22, 2022

Announcement

Announcing Dataproc Serverless for Spark preview runtime version 2.0.0-RC1, which includes the following components:

  • Spark 3.3.0
  • Cloud Storage Connector 2.2.7
  • Java 17
  • Conda 4.13
  • Python 3.10
  • R 4.1
  • Scala 2.13

August 13, 2022

Change

New sub-minor versions of Dataproc images:

1.5.73-debian10, 1.5.73-rocky8, 1.5.73-ubuntu18

2.0.47-debian10, 2.0.47-rocky8, 2.0.47-ubuntu18

Change

Enabled Spark authentication and encryption for Kerberos clusters in 1.5 and 2.0 images.

Change

Dataproc Serverless for Spark now uses runtime version 1.0.15, which upgrades the following Spark dependencies:

  • Jackson 2.13.3
  • Jetty 9.4.46.v20220331
  • ORC 1.7.4
  • Parquet 1.12.3
  • Protobuf 3.19.4
  • RoaringBitmap 0.9.28
Change

Dataproc on Compute Engine images now have master VM memory protection enabled by default. Jobs may be terminated to prevent the master VM from running out of memory.

Breaking

FallbackHiveAuthorizerFactory is now set by default on newly created 1.5 and 2.0 image clusters that have any of the following features enabled:

If you encounter a Cannot modify <PARAM> or similar runtime error when running a SET statement in a Hive query, it means the parameter is not in the list of allowable runtime parameters. You can allow the parameter by setting hive.security.authorization.sqlstd.confwhitelist.append as a cluster property when you create a cluster.

Example:

--properties="hive:hive.security.authorization.sqlstd.confwhitelist.append=tez.application.tags,<ADDITIONAL-1>,<ADDITIONAL-2>"

August 01, 2022

Change

Upgraded Hadoop to version 3.2.3 in 2.0 images.

Change

Upgraded Hadoop to version 2.10.2 in 1.5 images.

Change

The default MySQL instance root password has been changed to a random value in 1.5 and 2.0 images. The new password is stored in a MySQL configuration file accessible only by the OS-level root user.

Fixed

Backported the patch for KNOX-1997 in 2.0 images.

Fixed

Backported the patches for HIVE-19047 and HIVE-19048 in 1.5 images.

July 01, 2022

Change

New sub-minor versions of Dataproc images:

1.5.71-debian10, 1.5.71-rocky8, 1.5.71-ubuntu18

2.0.45-debian10, 2.0.45-rocky8, 2.0.45-ubuntu18

June 21, 2022

Change

New sub-minor versions of Dataproc images:

1.5.70-debian10, 1.5.70-rocky8, 1.5.70-ubuntu18

2.0.44-debian10, 2.0.44-rocky8, 2.0.44-ubuntu18

Feature

Dataproc Metastore: For 1.5 images, added a spark.hadoop.hive.eager.fetch.functions.enabled Spark Hive client property to control whether the client fetches all functions from the Hive Metastore during initialization. The default setting is true, which preserves the existing behavior of fetching all functions. If set to false, the client does not fetch all functions during initialization, which can reduce initialization latency, particularly when there are many functions and the Metastore is not located in the client's region.
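
For example, a sketch with an illustrative cluster name and region; the spark: file prefix is the standard way to write Spark defaults as cluster properties:

# Illustrative sketch: disable eager fetching of Hive Metastore functions on a 1.5 cluster.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --image-version=1.5-debian10 \
    --properties=spark:spark.hadoop.hive.eager.fetch.functions.enabled=false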

June 14, 2022

Change

New sub-minor versions of Dataproc images:

1.5.69-debian10, 1.5.69-rocky8, 1.5.69-ubuntu18

2.0.43-debian10, 2.0.43-rocky8, 2.0.43-ubuntu18

June 06, 2022

Announcement

Announcing the General Availability (GA) release of Dataproc Persistent History Server, which provides web interfaces to view job history for jobs run on active or deleted Dataproc clusters.

Change

Dataproc Serverless for Spark now uses runtime version 1.0.13.

Change

Dataproc on GKE Spark 3.1 images upgraded to Spark version 3.1.3.

Change

Upgraded the Cloud Storage connector to version 2.1.8 for 1.5 images only.

May 31, 2022

Feature

Dataproc is now available in the europe-southwest1 region (Madrid, Spain).

Feature

Dataproc is now available in the europe-west9 region (Paris, France).

May 30, 2022

Change

New sub-minor versions of Dataproc images:

1.5.67-debian10, 1.5.67-ubuntu18, 1.5.67-rocky8

2.0.41-debian10, 2.0.41-ubuntu18, 2.0.41-rocky8

Change

Dataproc on GKE error messages now provide additional information.

May 23, 2022

Change

New sub-minor versions of Dataproc images:

1.5.66-debian10, 1.5.66-ubuntu18, 1.5.66-rocky8

2.0.40-debian10, 2.0.40-ubuntu18, 2.0.40-rocky8

Change

Upgraded Spark to 3.1.3 in Dataproc image version 2.0.

Fixed

Fixed a bug where job was not being marked as terminated after master node reboot.

Fixed

Fixed a bug where Flink was not able to run on HA clusters.

Fixed

Fixed a bug with HDFS directories initialization when core:fs.defaultFS is set to an external HDFS.

May 09, 2022

Fixed

Fixed an issue where chronyd systemd service failed to start due to a race condition between systemd-timesyncd and chronyd.

Deprecated

Dataproc Serverless for Spark runtime version 1.0.1 is unavailable for new batch submissions.

May 03, 2022

Change

Dataproc Serverless for Spark now uses runtime version 1.0.11.

Change

If you request cancellation of a job that is in the CANCEL_PENDING, CANCEL_STARTED, or CANCELLED state, Dataproc returns the job but does not initiate cancellation, since cancellation is already in progress or complete.

Change

Added Dataproc Serverless support for updating the Cloud Storage connector using the dataproc.gcsConnector.version and dataproc.gcsConnector.uri properties.

Fixed

Hive: Upgraded to Apache ORC 1.5.13 in image version 2.0. Notable in this release are two bug fixes, ORC-598 and ORC-672, related to handling ORC files with arrays larger than 1024 elements.

April 22, 2022

Change

New sub-minor versions of Dataproc images:

1.5.63-debian10, 1.5.63-ubuntu18, 1.5.63-rocky8

2.0.37-debian10, 2.0.37-ubuntu18, 2.0.37-rocky8

Change

Dataproc Serverless for Spark now uses runtime version 1.0.10.

April 20, 2022

Feature

Dataproc is now available in the europe-west8 region (Milan, Italy).

April 11, 2022

Fixed

Fixed an issue in which the Dataproc autoscaler would sometimes try to scale down a cluster by more than one thousand secondary worker nodes at one time. The autoscaler now scales down at most one thousand nodes per operation; in cases where it previously would have scaled down more, it scales down at most one thousand nodes and writes an entry to the autoscaler log noting this occurrence.

Fixed

Fixed bugs that could cause Dataproc to delay marking a job cancelled.

April 01, 2022

Change

New sub-minor versions of Dataproc images:

1.5.61-debian10, 1.5.61-ubuntu18, and 1.5.61-rocky8

2.0.35-debian10, 2.0.35-ubuntu18, and 2.0.35-rocky8

February 18, 2022

Feature

Added support for Enhanced Flexibility Mode (EFM) with primary worker shuffle mode on Spark for image version 2.0.

Change

New sub-minor versions of Dataproc images:

1.5.57-debian10, 1.5.57-ubuntu18, and 1.5.57-rocky8

2.0.31-debian10, 2.0.31-ubuntu18, and 2.0.31-rocky8

Change

Upgraded Cloud Storage connector version to 2.1.7 in image version 1.5.

Deprecated

CentOS images have reached end of life (EOL). 1.5.56-centos8 and 2.0.30-centos8 are the final CentOS-based images. CentOS images are no longer supported and will not receive new releases.

February 17, 2022

Announcement

A script that checks if a project or organization is using an unsupported Dataproc image is available for downloading (see Unsupported Dataproc versions).

February 15, 2022

Deprecated

Dataproc images prior to 1.3.95, 1.4.77, 1.5.53, and 2.0.27 are deprecated and cluster creations based on these images will fail starting 2/28/2022.

February 01, 2022

Feature

Enabled the Resource Manager UI and HA capable UIs in HA cluster mode.

Announcement

1.4.80-debian10 and 1.4.80-ubuntu18 are the last releases for the 1.4 images. Dataproc 1.4 images will no longer be supported and will not receive new releases.

January 31, 2022

Change

Dataproc Serverless for Spark now uses runtime version 1.0.2, which updates Spark to 3.2.1 version.

January 24, 2022

Change

Dataproc Serverless for Spark now uses runtime version 1.0.1, which includes improved error messaging for network connectivity issues.

January 18, 2022

Change

Dataproc now uses the warehouse directory from the attached Dataproc Metastore service as the cluster-local warehouse directory.

January 17, 2022

Change

New sub-minor versions of Dataproc images:

1.4.79-debian10 and 1.4.79-ubuntu18

1.5.55-debian10, 1.5.55-ubuntu18, and 1.5.55-centos8

2.0.29-debian10, 2.0.29-ubuntu18, and 2.0.29-centos8

Change

Dataproc images 1.4.79, 1.5.55, and 2.0.29, listed above, were updated with log4j version 2.17.1. It is strongly recommended that your clusters use at least the previously released images 1.4.77, 1.5.53, or 2.0.27 (see Supported Dataproc versions). Although upgrading is not urgent, Dataproc advises you to create or recreate clusters with the latest sub-minor image versions when possible.

Change

The Cloud Storage connector jar is installed on the Solr server (even if dataproc:solr.gcs.path property is not set). Applies to image versions 1.4, 1.5, and 2.0.

January 09, 2022

Change

New sub-minor versions of Dataproc images:

1.4.78-debian10, and 1.4.78-ubuntu18

1.5.54-centos8, 1.5.54-debian10, and 1.5.54-ubuntu18

2.0.28-centos8, 2.0.28-debian10, and 2.0.28-ubuntu18

Change

Dataproc images 1.4.78, 1.5.54, and 2.0.28, listed above, were updated with log4j version 2.17.0. It is strongly recommended that your clusters use at least the previously released images 1.4.77, 1.5.53, or 2.0.27 (see Supported Dataproc versions). Although upgrading is not urgent, Dataproc advises you to create or recreate clusters with the latest sub-minor image versions when possible.

Change

Upgraded Cloud Storage connector version to 2.2.4 in image version 2.0.

December 21, 2021

Change

Dataproc has released 1.3.95-debian10/-ubuntu18 images with a one-time patch that addresses the Apache Log4j 2 CVE-2021-44228 and CVE-2021-45046 vulnerabilities. Note that all 1.3 images remain unsupported, and Dataproc will not provide further upgrades to 1.3 images.

December 18, 2021

Change

Dataproc has released the following sub-minor image versions to address an Apache Log4j 2 vulnerability (also see Create a cluster and Recreate and update a cluster for more information). Note: These images supersede the 1.5 and 2.0 images listed in the December 16, 2021 release note:

1.5.53-centos8, 1.5.53-debian10, 1.5.53-ubuntu18,

2.0.27-centos8, 2.0.27-debian10, 2.0.27-ubuntu18

Change

Removed the Geode interpreter from Zeppelin notebook, which is affected by https://nvd.nist.gov/vuln/detail/CVE-2021-45046.

December 16, 2021

Change

Upgraded log4j version to 2.16.0, which fixes https://nvd.nist.gov/vuln/detail/CVE-2021-44228.

December 13, 2021

Fixed

Fixed executor log links on Spark History Server Web UI for running and completed applications. Applies to 1.4 and 1.5 images.

Fixed

YARN-8990: Fixed a Fairscheduler race condition. Applies to 2.0 images.

November 17, 2021

Feature

Dataproc is now available in the southamerica-west1 region (Santiago, Chile).

November 01, 2021

Feature

Added the following new Apache Spark properties to control Cloud Storage flush behavior for event logs for 1.4 and later images:

  • spark.history.fs.gs.outputstream.type (default: BASIC)
  • spark.history.fs.gs.outputstream.sync.min.interval.ms (default: 5000ms).

Note: The default configuration of these properties enables the display of running jobs in the Spark History Server UI for clusters that use Cloud Storage to store Spark event logs.
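
For example, to lengthen the Cloud Storage flush interval for event logs, you could set the property at cluster creation with the spark: prefix; the cluster name, region, and 10-second interval below are illustrative:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties='spark:spark.history.fs.gs.outputstream.sync.min.interval.ms=10000'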

Feature

Added support in 1.5 and 2.0 images to filter Spark applications on the Spark History Server Web UI based on Cloud Storage path. Filtering uses the eventLogDirFilter parameter, which accepts a Cloud Storage path substring and returns the applications whose event log path matches it.

Fixed

Backported SPARK-23182 in 1.4 and 1.5 images. This prevents long-running Spark shuffle servers from leaking connections when they are not cleanly terminated.

October 22, 2021

Change

The dataproc:dataproc.cluster-ttl.consider-yarn-activity cluster property is now set to true by default for image versions 1.4.64+, 1.5.39+, and 2.0.13+. For clusters created with these image versions, Dataproc Cluster Scheduled Deletion by default considers YARN activity, in addition to Dataproc Jobs API activity, when determining cluster idle time. This change does not affect clusters with lower image versions: their idle time continues to be computed from Dataproc Jobs API activity only. When using image versions 1.4.64+, 1.5.39+, and 2.0.13+, you can opt out of this behavior by setting the property to false when you create the cluster.
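
For example, a minimal sketch of opting out at cluster creation; the cluster name, region, and idle timeout are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --max-idle=4h \
      --properties=dataproc:dataproc.cluster-ttl.consider-yarn-activity=false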

October 08, 2021

Announcement

On approximately October 22, 2021, Dataproc Cluster Scheduled Deletion will begin to consider YARN activity by default, in addition to Dataproc Jobs API activity, when determining cluster idle time. This change will affect image versions 1.4.64+, 1.5.39+, and 2.0.13+. To test this feature now, create a cluster with a recent image and set the dataproc:dataproc.cluster-ttl.consider-yarn-activity cluster property to true. Note: After this behavior becomes the default, you can opt out when you create a cluster by setting the property to false.

October 01, 2021

Fixed

HADOOP-15129: Fixed an issue in 2.0 images where the Datanode cached a namenode DNS lookup failure and could not start up.

September 13, 2021

Change

New sub-minor versions of Dataproc images: 1.4.71-debian10, 1.4.71-ubuntu18, 1.5.46-centos8, 1.5.46-debian10, 1.5.46-ubuntu18, 2.0.20-centos8, 2.0.20-debian10, 2.0.20-ubuntu18

Fixed

HIVE-21018: Grouping/distinct on more than 64 columns should be possible. Applies to 2.0 images.

August 23, 2021

Change

New sub-minor versions of Dataproc images: 1.4.69-debian10, 1.4.69-ubuntu18, 1.5.44-centos8, 1.5.44-debian10, 1.5.44-ubuntu18, 2.0.18-centos8, 2.0.18-debian10, and 2.0.18-ubuntu18.

August 19, 2021

Announcement

Added support for Dataproc Metastore in three recently launched regions: europe-west1, northamerica-northeast1, and asia-southeast1.

Feature

Users can now help ensure the successful creation of a cluster by automatically deleting any failed primary workers (the master(s) and at least two primary workers must be successfully provisioned for cluster creation to succeed). To delete any failed primary workers when you create a cluster (see the example after this list):

  1. Using gcloud: Set the gcloud dataproc clusters create --action-on-failed-primary-workers flag to "DELETE".

  2. Using the Dataproc clusters.create API: Set the actionOnFailedPrimaryWorkers field to "DELETE".
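
For example, a minimal gcloud sketch; the cluster name and region are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --action-on-failed-primary-workers=DELETE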

Change

Dataproc issues a warning message if the staging or test bucket name contains an underscore.

August 13, 2021

Change

New sub-minor versions of Dataproc images: 1.4.68-debian10, 1.4.68-ubuntu18, 1.5.43-centos8, 1.5.43-debian10, 1.5.43-ubuntu18, 2.0.17-centos8, 2.0.17-debian10, and 2.0.17-ubuntu18.

August 09, 2021

Change

New sub-minor versions of Dataproc images: 1.4.67-debian10, 1.4.67-ubuntu18, 1.5.42-centos8, 1.5.42-debian10, 1.5.42-ubuntu18, 2.0.16-centos8, 2.0.16-debian10, and 2.0.16-ubuntu18.

Fixed

SPARK-28290: Fixed an issue where the Spark History Server failed to serve because of a wildcard certificate in the 1.4 and 1.5 images.

August 02, 2021

Change

New sub-minor versions of Dataproc images: 1.4.66-debian10, 1.4.66-ubuntu18, 1.5.41-centos8, 1.5.41-debian10, 1.5.41-ubuntu18, 2.0.15-centos8, 2.0.15-debian10, and 2.0.15-ubuntu18.

July 27, 2021

Change

The following component versions were updated in image 2.0:

Fixed

Fixed a rare bug in which, when scaling down the number of secondary workers in a cluster, the update operation would sometimes fail with the error 'Resource is not a member of' or 'Cannot delete instance that was already deleted'.

July 12, 2021

Change

For 2.0+ image clusters, the dataproc:dataproc.master.custom.init.actions.mode cluster property can be set to RUN_AFTER_SERVICES to run initialization actions on the master after HDFS and any services that depend on HDFS are initialized. Examples of HDFS-dependent services include: HBase, Hive Server2, Ranger, Solr, and the Spark and MapReduce history servers. Default: RUN_BEFORE_SERVICES.
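
For example, a sketch of a cluster that runs an initialization action after HDFS-dependent services start; the cluster name, region, and script path are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --image-version=2.0 \
      --initialization-actions=gs://my-bucket/my-init-action.sh \
      --properties=dataproc:dataproc.master.custom.init.actions.mode=RUN_AFTER_SERVICES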

July 09, 2021

Change

Custom image limitation: New images announced in the Dataproc release notes are not available for use as the base for custom images until one week from their announcement date.

Deprecated

The Dataproc v1beta2 APIs are deprecated. Please use the Dataproc v1 APIs.

July 05, 2021

Change

New sub-minor versions of Dataproc images: 1.3.92-debian10, 1.3.92-ubuntu18, 1.4.63-debian10, 1.4.63-ubuntu18, 1.5.38-centos8, 1.5.38-debian10, 1.5.38-ubuntu18, 2.0.12-centos8, 2.0.12-debian10, and 2.0.12-ubuntu18.

Change

Upgraded Spark version to 2.4.8 in the following images:

  • Image 1.4
  • Image 1.5

June 29, 2021

Feature

Dataproc is now available in the asia-south2 region (Delhi).

Change

The following previously released sub-minor versions of Dataproc images have been rolled back and can only be used when updating existing clusters that already use them:

  • 1.3.91-debian10, 1.3.91-ubuntu18
  • 1.4.62-debian10, 1.4.62-ubuntu18
  • 1.5.37-centos8, 1.5.37-debian10, 1.5.37-ubuntu18
  • 2.0.11-centos8, 2.0.11-debian10, and 2.0.11-ubuntu18.

Change

Added support for Dataproc Metastore in three new recently turned up regions: europe-west3, us-west1, and us-east1.

Change

Introduced a new ERROR_DUE_TO_UPDATE state, which indicates a cluster has encountered an irrecoverable error while scaling. Clusters in this state cannot be scaled, but can accept jobs.

Fixed

Fixed an issue where a spurious unrecognized property warning was generated when the dataproc:jupyter.listen.all.interfaces cluster property was set.

June 21, 2021

Feature

Dataproc is now available in the australia-southeast2 region (Melbourne).

June 08, 2021

Change

Custom image limitation: Currently, the following Dataproc image versions are the latest images that can be used as the base for custom images:

  • 1.3.89-debian10, 1.3.89-ubuntu18
  • 1.4.60-debian10, 1.4.60-ubuntu18
  • 1.5.35-debian10, 1.5.35-ubuntu18, 1.5.35-centos8
  • 2.0.9-debian10, 2.0.9-ubuntu18, 2.0.11-centos8

June 01, 2021

Change

New sub-minor versions of Dataproc images: 1.3.91-debian10, 1.3.91-ubuntu18, 1.4.62-debian10, 1.4.62-ubuntu18, 1.5.37-centos8, 1.5.37-debian10, 1.5.37-ubuntu18, 2.0.11-centos8, 2.0.11-debian10, and 2.0.11-ubuntu18.

Change

Image 1.3 - 2.0

  • All jobs now share a single job thread pool.

  • The number of Job threads in the Agent is configurable with the dataproc:agent.process.threads.job.min and dataproc:agent.process.threads.job.max cluster properties, defaulting to 10 and 100, respectively. The previous behavior was to always use 10 Job threads.

Change

Image 2.0

  • Added snappy-jar dependency to Hadoop.
  • Upgraded versions of Python packages: nbdime 2.1 -> 3.0, pyarrow 2.0 -> 3.0, spyder 4.2 -> 5.0, spyder-kernels 1.10 -> 2.0, regex 2020.11 -> 2021.4.

Deprecated

Image 1.5 and 2.0

Fixed

Image 2.0

  • Fixed an issue where the PATH environment variable was not set in YARN containers.

  • SPARK-34731: ConcurrentModificationException in EventLoggingListener when redacting properties.

May 20, 2021

Change

Added validation for clusters created with a Dataproc Metastore service to determine compatibility between the Dataproc image's Hive version and the Dataproc Metastore service's Hive version.

April 23, 2021

Announcement

Announcing Dataproc Confidential Compute: Dataproc clusters now support Compute Engine Confidential VMs.

Change

New sub-minor versions of Dataproc images: 1.3.89-debian10, 1.3.89-ubuntu18, 1.4.60-debian10, 1.4.60-ubuntu18, 1.5.35-centos8, 1.5.35-debian10, 1.5.35-ubuntu18, 2.0.9-centos8, 2.0.9-debian10, and 2.0.9-ubuntu18.

Change

Image 1.5

  • CentOS only: AdoptOpenJDK is set as the default Java environment.

Change

Image 1.5 and 2.0

  • Updated Oozie version to 5.2.1
  • The Jupyter optional component now uses the "GCS" subdirectory as the initial working directory when you open the JupyterLab UI.

April 16, 2021

Change

Added the ability to stop and start high-availability clusters.

March 26, 2021

Breaking

Image 2.0:

  • Changed default private IPv6 Google APIs access for 2.0 clusters from OUTBOUND to INHERIT_FROM_SUBNETWORK.

March 23, 2021

Announcement

The default Dataproc image is now image version 2.0.

Change

New sub-minor versions of Dataproc images: 1.3.88-debian10, 1.3.88-ubuntu18, 1.4.59-debian10, 1.4.59-ubuntu18, 1.5.34-centos8, 1.5.34-debian10, 1.5.34-ubuntu18, 2.0.7-centos8, 2.0.7-debian10, and 2.0.7-ubuntu18.

Fixed

Fixed a bug that caused Hive jobs to fail on Ranger-enabled clusters.

Change

Image 2.0:

  • Updated Iceberg to version 0.11.0.
  • Updated Flink to version 1.12.2.

Fixed

CVE-2020-13957: SOLR-14663: ConfigSets CREATE does not set trusted flag.

March 16, 2021

Announcement
Change

Image 2.0: Upgraded Spark to version 3.1.1

March 05, 2021

Change

Image 2.0:

February 26, 2021

Change

New sub-minor versions of Dataproc images: 1.3.85-debian10, 1.3.85-ubuntu18, 1.4.56-debian10, 1.4.56-ubuntu18, 1.5.31-centos8, 1.5.31-debian10, 1.5.31-ubuntu18, 2.0.4-debian10, and 2.0.4-ubuntu18

February 09, 2021

Change

Image 2.0:

  • Upgraded Spark built-in Hive to version 2.3.8.
  • Upgraded Druid to version 0.20.1
  • HIVE-24436: Fixed Avro NULL_DEFAULT_VALUE compatibility issue.
  • SQOOP-3485: Fixed Avro NULL_DEFAULT_VALUE compatibility issue.
  • SQOOP-3447: Removed usage of org.codehaus.jackson and org.json packages.
Fixed

Fixed a bug for beta clusters using a Dataproc Metastore Service where using a subnetwork for the cluster resulted in an error.

January 29, 2021

Change

New sub-minor versions of Dataproc images: 1.3.83-debian10, 1.3.83-ubuntu18, 1.4.54-debian10, 1.4.54-ubuntu18, 1.5.29-centos8, 1.5.29-debian10, 1.5.29-ubuntu18, 2.0.1-debian10, and 2.0.1-ubuntu18.

January 25, 2021

Announcement

The Dataproc 2.0 image version will become the default Dataproc image version in 4 weeks, on February 22, 2021.

January 22, 2021

Breaking

2.0 image clusters:

You can no longer pass the dataproc:dataproc.worker.custom.init.actions.mode property when creating a 2.0 image cluster. For 2.0+ image clusters, dataproc:dataproc.worker.custom.init.actions.mode is set to RUN_BEFORE_SERVICES. For more information, see Important considerations and guidelines—Initialization processing.

Breaking

2.0 image clusters:

In 2.0 clusters, yarn.nm.liveness-monitor.expiry-interval-ms is set to 15000 (15 seconds). If the resource manager does not receive a heartbeat from a NodeManager during this period, it marks the NodeManager as LOST. This setting is important for clusters that use preemptible VMs. Usually, NodeManagers unregister with the resource manager when their VMs shut down, but in rare cases when they are shut down ungracefully, it is important for the resource manager to notice this quickly.

Change

New sub-minor versions of Dataproc images: 1.3.82-debian10, 1.3.82-ubuntu18, 1.4.53-debian10, 1.4.53-ubuntu18, 1.5.28-centos8, 1.5.28-debian10, 1.5.28-ubuntu18, 2.0.0-debian10, and 2.0.0-ubuntu18.

Fixed

Fixed bug affecting cluster scale-down: If Dataproc was unable to verify whether a master node exists, for example when hitting Compute Engine read quota limits, it would erroneously put the cluster into an ERROR state.

January 15, 2021

Change

New sub-minor versions of Dataproc images: 1.3.81-debian10, 1.3.81-ubuntu18, 1.4.52-debian10, 1.4.52-ubuntu18, 1.5.27-centos8, 1.5.27-debian10, 1.5.27-ubuntu18, 2.0.0-RC23-debian10, and 2.0.0-RC23-ubuntu18.

Change

Image 2.0 preview:

  • Upgraded Spark to version 3.1.0 RC1.

  • Upgraded Zeppelin to version 0.9.0.

  • Upgraded Cloud Storage Connector to version 2.2.0.

  • Upgraded JupyterLab to version 3.0.

January 12, 2021

Feature

Added support for user configuration of Compute Engine Shielded VMs in a Dataproc Cluster.

January 08, 2021

Change

New sub-minor versions of Dataproc images: 1.3.80-debian10, 1.3.80-ubuntu18, 1.4.51-debian10, 1.4.51-ubuntu18, 1.5.26-centos8, 1.5.26-debian10, 1.5.26-ubuntu18, 2.0.0-RC22-debian10, and 2.0.0-RC22-ubuntu18.

Change

Image 2.0 preview:

  • Upgraded Delta Hive connector to version 0.2.0.
  • Upgraded Flink to version 1.12.0.
  • Updated Iceberg to version 0.10.0.

Fixed

Image 2.0 preview:

HIVE-21646: Tez: Prevent TezTasks from escaping thread logging context

December 17, 2020

Change

Image 2.0 preview:

Changed the default value of Spark SQL property spark.sql.autoBroadcastJoinThreshold to 0.75% of executor memory.

Fixed SPARK-32436: Initialize numNonEmptyBlocks in HighlyCompressedMapStatus.readExternal

Fixed

Image 1.4-1.5:

Fixed a NullPointerException in primary worker shuffle when the BypassMergeSortShuffleWriter is used and some output partitions are empty.

December 15, 2020

Announcement

Announcing the Beta release of the Dataproc cluster Stop/Start.

December 08, 2020

Feature

Restartable jobs: Added the ability for users to specify the maximum number of total failures when a job is submitted.

Change

Image 1.5:

Change

New sub-minor versions of Dataproc images: 1.3.78-debian10, 1.3.78-ubuntu18, 1.4.49-debian10, 1.4.49-ubuntu18, 1.5.24-debian10, 1.5.24-ubuntu18, 2.0.0-RC20-debian10, and 2.0.0-RC20-ubuntu18.

November 16, 2020

Change

Image 2.0 preview

  • Upgraded Hue to version 4.8.0

November 09, 2020

Breaking

Clusters that use Dataproc Metastore must be created in the same region as the Dataproc Metastore service that they will use.

Change

Image 2.0 preview

Fixed

Fixed a bug where the Jupyter optional component depended on the availability of GitHub at cluster creation time.

October 30, 2020

Feature

Added a dataproc:dataproc.cooperative.multi-tenancy.user.mapping cluster property, which takes a list of comma-separated user-to-service-account mappings. If a cluster is created with this property set, when a user submits a job, the cluster attempts to impersonate the corresponding service account when accessing Cloud Storage through the Cloud Storage connector. This feature requires Cloud Storage connector version 2.1.4 or higher.
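
A hedged sketch of a cluster created with a single mapping; the user-to-service-account syntax shown (user:service-account) and all names are assumptions for illustration only:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties='dataproc:dataproc.cooperative.multi-tenancy.user.mapping=alice:alice-sa@my-project.iam.gserviceaccount.com'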

Change

New sub-minor versions of Dataproc images: 1.3.75-debian10, 1.3.75-ubuntu18, 1.4.46-debian10, 1.4.46-ubuntu18, 1.5.21-debian10, 1.5.21-ubuntu18, 2.0.0-RC17-debian10, and 2.0.0-RC17-ubuntu18.

Fixed

Fixed a bug in the HBase optional component on HA clusters in which hbase.rootdir was always configured as hdfs://${CLUSTER_NAME}-m-0:8020/hbase, which assumed that master 0 is the active namenode. It is now configured as hdfs://${CLUSTER_NAME}:8020/hbase, so that the active namenode is always used.

Fixed

Image 1.3 to 2.0 preview:

Fixed HIVE-19202: CBO failed due to NullPointerException in HiveAggregate.isBucketedInput().

October 23, 2020

Fixed

Pinned the MySQL Java connector version to prevent breakage of the /usr/share/java/mysql-connector-java.jar symlink on long-running and older clusters caused by an auto-upgrade to a new MySQL Java connector.

Fixed

Cluster create or update requests for sole-tenant node clusters that use preemptible secondary workers, or that attach autoscaling policies that create preemptible secondary workers, are now correctly rejected.

Fixed

All image versions:

  • Fixed a bug where files uploaded to Cloud Storage through the JupyterLab UI were incorrectly base64 encoded.

Fixed

1.4 and 1.5 image versions:

  • SPARK-32708: Fixed SparkSQL query optimization failure to reuse exchange with DataSourceV2.

October 22, 2020

Feature

Announcing the Alpha release of the Dataproc Persistent History Server, which provides a UI to view job history for jobs run on active and deleted Dataproc clusters.

October 13, 2020

Change

New sub-minor versions of Dataproc images: 1.3.72-debian10, 1.3.72-ubuntu18, 1.4.43-debian10, 1.4.43-ubuntu18, 1.5.18-debian10, 1.5.18-ubuntu18, 2.0.0-RC14-debian10, and 2.0.0-RC14-ubuntu18.

October 06, 2020

Change

Image 2.0

  • Updated HBase to version 2.2.6.
  • Installed google-cloud-bigquery-storage in default conda environment.
  • Changed default value of zeppelin.notebook.storage in zeppelin-site.xml to "org.apache.zeppelin.notebook.repo.GCSNotebookRepo".

September 30, 2020

Change

Cluster creation and workflow instantiation requests that succeed even though the requester lacked the ActAs permission on the service account now generate a warning field in the audit log.

Change

All supported images

Upgraded Conscrypt to version 2.5.1.

Change

Image 1.5 and Image 2.0 Preview

Change

Image 2.0 preview

September 18, 2020

Change

New sub-minor versions of Dataproc images: 1.3.69-debian10, 1.3.69-ubuntu18, 1.4.40-debian10, 1.4.40-ubuntu18, 1.5.15-debian10, 1.5.15-ubuntu18, 2.0.0-RC11-debian10, and 2.0.0-RC11-ubuntu18.

Change

Image 2.0 preview

September 11, 2020

Change

Added the PrivateIpv6GoogleAccess API field to allow configuring IPv6 access to a Dataproc cluster.

Change

1.5 and 2.0 preview images:

Upgraded the jupyter-core and jupyter-client packages in the 1.5 and 2.0 images to be compatible with the installed notebook package version.

Change

2.0 preview image:

Fixed

Fixed a regression that could cause clusters to fail to start if a user-supplied keystore/truststore was provided when enabling Kerberos.

September 04, 2020

Change

Switched 1.3 and 1.3-debian image version aliases to point to 1.3 Debian 10 images.

Change

Added support for more than 8 local SSDs on VMs. Compute Engine supports 16 and 24 local SSDs for larger machine types.

Change

Improved node memory utilization in clusters created with 2.0 preview images.

August 28, 2020

Feature

Launched Dataproc Workflow Timeout feature, which allows users to set a timeout on their graph of jobs and automatically cancel their workflow after a specified period.

August 21, 2020

Change

New sub-minor versions of Dataproc images: 1.3.67-debian10, 1.3.67-ubuntu18, 1.4.38-debian10, 1.4.38-ubuntu18, 1.5.13-debian10, 1.5.13-ubuntu18, 2.0.0-RC9-debian10, and 2.0.0-RC9-ubuntu18.

Change

Image 1.3 and 1.4: upgraded the Cloud Storage connector to version 1.9.18.

August 17, 2020

Feature

Launched new Personal Cluster Authentication feature, which allows the creation of single-user clusters that can access Cloud Storage using the user's own credentials instead of a VM service account.

August 14, 2020

Change

Enabled Spark SQL parquet metadata cache (removed spark.sql.parquet.cacheMetadata=false from Spark default configuration).

Change

Image 1.4:

  • Fixed a bug in Spark EFM HCFS shuffle where failures after partial commits were not recoverable.
  • Upgraded Spark to version 2.4.6.
Change

Image 1.5:

  • Fixed a bug in Spark EFM HCFS shuffle where failures after partial commits were not recoverable.
  • Upgraded Spark to version 2.4.6.
  • Upgraded Zeppelin to version 0.9.0-preview2.
  • Included all plugins in the Zeppelin installation by default.
Change

Image 2.0 preview:

July 31, 2020

Feature

Enabled Kerberos automatic-configuration feature. When creating a cluster, users can enable Kerberos by setting the dataproc:kerberos.beta.automatic-config.enable cluster property to true. When using this feature, users do not need to specify the Kerberos root principal password with the --kerberos-root-principal-password and --kerberos-kms-key-uri flags.
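
For example, a minimal sketch; the cluster name and region are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties=dataproc:kerberos.beta.automatic-config.enable=true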

Change

Images 1.5 - 2.0 preview:

Fixed

Fixed an issue where optional components that depend on HDFS failed on single node clusters.

July 24, 2020

Feature

Terminals started in Jupyter and JupyterLab now use login shells. The terminals behave as if you SSH'd into the cluster as root.

Fixed

Fixed a bug in which the HDFS DataNode daemon was enabled on secondary workers but not started (except on VM reboot if started automatically by systemd).

July 17, 2020

Feature

Dataproc now uses Shielded VMs for Debian 10 and Ubuntu 18.04 clusters by default.

Change

New sub-minor versions of Dataproc images: 1.3.63-debian10, 1.3.63-ubuntu18, 1.4.34-debian10, 1.4.34-ubuntu18, 1.5.9-debian10, 1.5.9-ubuntu18, 2.0.0-RC5-debian10, and 2.0.0-RC5-ubuntu18.

Change

Image 2.0 preview:

Fixed

If a project's regional Dataproc staging bucket is manually deleted, it will be recreated automatically when a cluster is subsequently created in that region.

July 10, 2020

Feature

Added --temp-bucket flag to gcloud dataproc clusters create and gcloud dataproc workflow-templates set-managed-cluster to allow users to configure a Cloud Storage bucket that stores ephemeral cluster and jobs data, such as Spark and MapReduce history files.
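
For example, a sketch that sets both the staging and temp buckets; the cluster name, region, and bucket names are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --bucket=my-staging-bucket \
      --temp-bucket=my-temp-bucket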

Feature

Extended Jupyter to support notebooks stored on VM persistent disk. This change modifies the Jupyter contents manager to create two virtual top-level directories, named GCS and Local Disk. The GCS directory points to the Cloud Storage location used by previous versions, and the Local Disk directory points to the persistent disk of the VM running Jupyter.

Fixed
  • Images 1.3 - 1.5:

    • Fixed HIVE-11920: ADD JAR failing with URL schemes other than file/ivy/hdfs.
  • Images 1.3 - 2.0 preview:

    • Fixed TEZ-4108: NullPointerException during speculative execution race condition.

July 07, 2020

Feature

Announcing the General Availability (GA) release of Dataproc Component Gateway, which provides secure access to web endpoints for Dataproc default and optional components.

June 24, 2020

Change
  • Image 2.0 preview:

    • SPARK-22404: Set the spark.yarn.unmanagedAM.enabled property to true on clusters where Kerberos is not enabled, so that the Spark Application Master runs in the driver (not managed by YARN), improving job execution time.
    • Updated R version to 3.6

    • Updated Spark to 3.0.0 version

  • Image 1.5

    • Updated R version to 3.6
Fixed

Fixed a quota validation bug where accelerator counts were squared before validation -- for example, previously if you requested 8 GPUs, Dataproc validated whether your project had quota for 8^2=64 GPUs.

June 11, 2020

Change
  • New subminor image versions: 1.2.99-debian9, 1.3.59-debian9, 1.4.30-debian9, 1.3.59-debian10, 1.4.30-debian10, 1.5.5-debian10, 1.3.59-ubuntu18, 1.4.30-ubuntu18, and 1.5.5-ubuntu18.

  • New preview image 2.0.0-RC1-debian10, 2.0.0-RC1-ubuntu18, with the following components:

    • Anaconda 2019.10
    • Atlas 2.0.0
    • Druid 0.18.1
    • Flink 1.10.1
    • Hadoop 3.2.1
    • HBase 2.2.4
    • Hive 3.1.2 (with LLAP support)
    • Hue 4.7.0
    • JupyterLab 2.1.0
    • Kafka 2.3.1
    • Miniconda3 4.8.3
    • Pig 0.18.0
    • Presto SQL 333
    • Oozie 5.2.0
    • R 3.6.0
    • Ranger 2.0.0
    • Solr 8.1.1
    • Spark 3.0.0
    • Sqoop 1.5.0
    • Zeppelin 0.9.0

Change
  • Image 1.3+

    • Patched HIVE-23496: Added a flag to disable materialized views cache warm-up.

Change

The JVM and runtime properties for the Druid Historical and Broker processes are now calculated using server resources. Previously, only the MaxHeapSize property for the Historical and MiddleManager processes was calculated using server resources. This change modifies how new values for the MaxHeapSize and MaxDirectMemorySize properties are calculated for the Broker and Historical processes. Also, the new runtime properties druid.processing.numThreads and druid.processing.numMergeBuffers are calculated using server resources.

Change

If the project-level staging bucket is manually deleted, it will be recreated when a cluster is created.

June 08, 2020

Feature

Dataproc is now available in the asia-southeast2 region (Jakarta).

May 21, 2020

Change

New sub-minor versions of Dataproc images: 1.2.98-debian9, 1.3.58-debian9, 1.4.29-debian9, 1.3.58-debian10, 1.4.29-debian10, 1.5.4-debian10, 1.3.58-ubuntu18, 1.4.29-ubuntu18, 1.5.4-ubuntu18.

Change

Image 1.3, 1.4, and 1.5

  • Restricted Jupyter, Zeppelin, and Knox to only accept connections from localhost when Component Gateway is enabled. This restriction reduces the risk of remote code execution over unsecured notebook server APIs. To override this change, when you create the cluster, set the Jupyter, Zeppelin, and Knox cluster properties, respectively, as follows: dataproc:jupyter.listen.all.interfaces=true, zeppelin:zeppelin.server.addr=0.0.0.0, and knox:gateway.host=0.0.0.0 (see the example after this list).

  • Upgrade Hive to version 2.3.7.
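
For example, a sketch of restoring the previous listen-on-all-interfaces behavior at cluster creation, using the properties listed above; the cluster name and region are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties='dataproc:jupyter.listen.all.interfaces=true,zeppelin:zeppelin.server.addr=0.0.0.0,knox:gateway.host=0.0.0.0'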

Change

Image 1.4 and 1.5

SPARK-29367: Add ARROW_PRE_0_15_IPC_FORMAT=1 in yarn-env.sh to fix the Pandas UDF issue with pyarrow 0.15.

Change

Image 1.5

Change

Set the hive.localize.resource.num.wait.attempts property to 25 to improve reliability of Hive queries.

Deprecated

The dataproc:alpha.state.shuffle.hcfs.enabled cluster property has been deprecated. To enable Enhanced Flexibility Mode (EFM) for Spark, set dataproc:efm.spark.shuffle=hcfs. To enable EFM for MapReduce, set dataproc:efm.mapreduce.shuffle=hcfs.
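
For example, a sketch of enabling EFM for both Spark and MapReduce shuffles with the new properties; the cluster name and region are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties='dataproc:efm.spark.shuffle=hcfs,dataproc:efm.mapreduce.shuffle=hcfs'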

April 24, 2020

Feature

Image 1.5

Upgraded Delta Lake to the 0.5.0 release. Delta Lake Hive Connector 0.1.0 is also added to the 1.5 image.

Feature

Customers can now adjust the amount of time the Dataproc startup script will wait for the Presto Coordinator service to start before deciding that startup has succeeded. This is set with the dataproc:startup.component.service-binding-timeout.presto-coordinator cluster property, which takes a value in seconds. The maximum respected value is 1800 seconds (30 minutes).
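
For example, a sketch that waits up to 15 minutes for the Presto Coordinator; the cluster name, region, and timeout value are placeholders:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --optional-components=PRESTO \
      --properties=dataproc:startup.component.service-binding-timeout.presto-coordinator=900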

Change

Image 1.5

Cloud Storage connector upgraded to version 2.1.2 (for more information, review the change notes in the GitHub repository)

Fixed

Image 1.5

Notebook bug fixes: fixed a bug in Zeppelin and Jupyter that resulted in a failure to render images when using Component Gateway. Also fixed a Jupyter Notebooks bug that caused notebook downloads to fail.

April 20, 2020

Change

Dataproc is now available in the us-west4 region (Las Vegas).

April 15, 2020

Feature

Image 1.5

Jupyter on Dataproc now supports exporting notebooks as PDFs.

Feature

Image 1.3, 1.4 and 1.5

Added Component Gateway support for Dataproc clusters secured with Kerberos.

Change

New sub-minor versions of Dataproc images: 1.2.95-debian9, 1.3.55-debian9, 1.4.26-debian9, 1.3.55-debian10, 1.4.26-debian10, 1.5.1-debian10, 1.3.55-ubuntu18, 1.4.26-ubuntu18, 1.5.1-ubuntu18.

Change

Image 1.3 and 1.4

Debian 10 images will become the default images for the 1.3 and 1.4 image tracks, and Debian 9 images will no longer be released for these tracks after June 30, 2020.

Fixed

Image 1.3, 1.4 and 1.5

HIVE-17275: Auto-merge fails on writes of UNION ALL output to ORC file with dynamic partitioning.

April 03, 2020

Fixed

Fixed an Auto Zone Placement bug that incorrectly returned INVALID_ARGUMENT errors as INTERNAL errors, and didn't propagate these error messages to the user.

April 01, 2020

Feature

Announcing the General Availability (GA) release of Dataproc Presto job type, which can be submitted to a cluster using the gcloud dataproc jobs submit presto command. Note: The Dataproc Presto Optional Component must be enabled when the cluster is created to submit a Presto job to the cluster.

March 25, 2020

Feature

Announcing the General Availability (GA) release of Dataproc 1.5 images.

Change

Image 1.5
Upgraded the Cloud Storage connector to version 2.1.1.

March 17, 2020

Feature

Added a mapreduce.jobhistory.always-scan-user-dir cluster property that enables the MapReduce job history server to read the history files (recommended when writing history files to Cloud Storage). The default is true.

Change

Image 1.5 preview

  • Preinstalled additional Python packages and Jupyter[Lab] extensions to align the Jupyter notebook environment with AI Platform Notebooks when the Jupyter optional component is enabled.

  • Upgraded Druid version to 0.17.0

Change

Cluster list methods now return results in lexical order.

Fixed

Image 1.3, 1.4, 1.5 preview

Fixed YARN container log links in Component Gateway

March 10, 2020

Change

Added the following flags to gcloud dataproc clusters create and gcloud dataproc workflow-templates set-managed-cluster commands:

  • --num-secondary-workers
  • --num-secondary-worker-local-ssds
  • --secondary-worker-boot-disk-size
  • --secondary-worker-boot-disk-type
  • --secondary-worker-accelerator

March 03, 2020

Feature

Added a dataproc:yarn.log-aggregation.enabled cluster property that allows turning on YARN log aggregation to a Dataproc temporary bucket (default: true for image versions 1.5+). Users can also set the location of aggregated YARN logs by overwriting the yarn.nodemanager.remote-app-log-dir YARN property.
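
A hedged sketch of setting both properties at cluster creation; the yarn: property prefix for yarn-site.xml and the bucket path are assumptions for illustration:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties='dataproc:yarn.log-aggregation.enabled=true,yarn:yarn.nodemanager.remote-app-log-dir=gs://my-bucket/yarn-logs'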

Change

New sub-minor versions of Dataproc images: 1.2.92-debian9, 1.3.52-debian9, 1.4.23-debian9, 1.5.0-RC8-debian10, 1.3.52-ubuntu18, 1.4.23-ubuntu18, 1.5.0-RC8-ubuntu18

Change

In addition to the staging bucket, Dataproc now creates a temporary bucket for storing feature-related data with a 90-day retention period per project per region. The bucket name will be of the following form: dataproc-temp-<REGION>-<PROJECT_NUMBER>-<RANDOM_STRING>.

Fixed

1.3, 1.4, and 1.5 preview images:

Fixed a bug where Component Gateway pages added an additional banner on each page load in some browsers.

February 25, 2020

Feature

Added support for attaching GPUs to preemptible workers.

Breaking

The Compute Engine API call from the Dataproc backend to check the API-specific quota limit is now enforced. The default Compute Engine API quota limit should be sufficient for most projects. If your project is experiencing Compute Engine API throttling, you can request a higher quota. If your project requires QPS that is higher than the upper quota limit, consider splitting your project into smaller projects.

Change

Added support for parameterizing numInstances field in Workflow Templates.

Change

Image 1.2, 1.3, & 1.4:

Upgrade Oozie to 4.3.1 and backport OOZIE-3112.

Fixed

Image 1.3 & 1.4:

ZEPPELIN-4168: fixed downloading of Maven dependencies in Zeppelin

Fixed

Backport YARN-9011 to all 1.2, 1.3, 1.4, and 1.5 images. This fixes a race condition in the resource manager that would sometimes cause containers to be killed while performing a graceful decommission.

February 24, 2020

Feature

Dataproc is now available in the us-west3 region (Salt Lake City).

January 24, 2020

Feature

Dataproc is now available in the asia-northeast3 region (Seoul).

January 20, 2020

Breaking

Breaking Change: Modified the gcloud beta dataproc clusters create command, replacing --reservation-label with --reservation, which accepts the name of the reservation when --reservation-affinity is set to specific, matching gcloud compute instances create.

Breaking

Breaking Change: gcloud command-line tool dataproc commands now require the --region flag on each invocation or dataproc/region config property to be set. Set your default Dataproc region via the gcloud config set dataproc/region us-central1 command.

December 20, 2019

Feature

Added additional information into Autoscaler recommendations in Stackdriver logging. In particular each recommendation now includes 1) the minimum and maximum worker counts configured in the autoscaling policy, 2) the graceful decommission timeout (for SCALE_DOWN operations), and 3) additional context into why the autoscaler couldn't add or remove more nodes. For example, it will indicate if it could not scale up further due to available regional or global quotas in the project (SCALING_CAPPED_DUE_TO_LACK_OF_QUOTA).

Change

New sub-minor versions of Cloud Dataproc images: 1.2.87-debian9, 1.3.47-debian9, 1.4.18-debian9, 1.5.0-RC3-debian10, 1.3.47-ubuntu18, 1.4.18-ubuntu18, 1.5.0-RC3-ubuntu18

Change

Disabled noisy Jersey logging in the YARN Timeline Server.

Change

1.5 preview image updates

  • Upgraded Debian OS to Debian 10 (buster)
  • Upgraded Apache Atlas to version 1.2.0
  • Upgraded Apache Solr to version 6.6.6
  • Upgraded Apache Knox to version 1.3.0
  • Backported SPARK-28921: Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1.13.10, 1.12.10, 1.11.10)
  • Enabled HBase optional component in Dataproc beta.
  • Installed Delta Lake 0.4.0 and Apache Iceberg 0.7.0 jars
  • Updated the Jupyter Python 3 kernel to add SparkMonitor 0.11 support. To create a SparkContext, users can import SparkContext and create it from the auto-created conf object using sc=SparkContext.getOrCreate(conf=conf). To create a SparkSession, users can import SparkSession and use SparkSession.builder.appName("Any Name").getOrCreate()

December 06, 2019

Change

Added a warning when clusters are created with Component Gateway and Kerberos, as they are not currently supported together.

Change

1.3 image update - Upgraded Hive to version 2.3.6

Change

1.4 image update - Fixed a bug in the Jupyter component that prevented creating text files using the Jupyter and JupyterLab UIs

Change

1.5 preview image update:

November 15, 2019

Feature
  • Allow creation of Dataproc master and worker VMs with Intel Optane DC Persistent Memory using the dataproc:dataproc.alpha.master.nvdimm.size.gb and dataproc:dataproc.alpha.worker.nvdimm.size.gb properties. Note: Optane VMs can only be created in zones that offer the n1-highmem-96-aep machine type (such as us-central1-f), and only with that machine type.

  • Support updating Cloud Storage connector version by setting the GCS_CONNECTOR_VERSION metadata key. For example, --metadata GCS_CONNECTOR_VERSION=2.0.0.

Breaking

Breaking Changes:

  • Dataproc will no longer accept requests to create new clusters using Debian 8 images. Existing clusters that are using Debian 8 can still be used and scaled.

  • The alpha version of the Kerberos optional component option has been removed from the v1beta2 API. To use Kerberos, please configure it via the SecurityConfig message instead.

  • Disallow setting or overriding goog-dataproc-* labels when creating or updating clusters. All attempts to set goog-dataproc-* labels on creation will be ignored. Any existing custom goog-dataproc-* labels on running clusters will be removed if cluster labels are updated, and attempts to add new custom goog-dataproc-* labels to existing clusters will be ignored.

  • Reset autoscaling if the same or a new autoscaling policy is attached to a cluster. E.g. through gcloud dataproc clusters update --autoscaling-policy=<policy-name>. Note that resetting autoscaling does not change the cluster's status. Autoscaling will only run while the cluster's status is RUNNING or UPDATING, and does not work if the cluster is in ERROR. In addition, if the cluster was previously resizing (UPDATING), the autoscaler will not be able to scale the cluster until the previous update operation has completed.

    • Also fixes an issue where autoscaling would not reset correctly if autoscaling was disabled and then re-enabled with the same policy within 30 seconds. Running gcloud dataproc clusters update --disable-autoscaling and then gcloud dataproc clusters update --autoscaling-policy=<same-policy> should now correctly reset autoscaling regardless of how long autoscaling was disabled.
Fixed
  • Fixed race condition where deleting a cluster shortly after updating it would not delete all of the cluster's VMs

  • Allow compute resource names with an Alpha API version. For example: https://compute.googleapis.com/compute/alpha/projects/my-project/global/networks/my-network.

    • This allows gcloud alpha dataproc clusters create <cluster-name> --network=my-network to succeed, since gcloud sets the compute API version to the same as the gcloud command track.

October 31, 2019

Feature

October 04, 2019

Feature

September 26, 2019

Fixed
  • Fixed an issue caused by expired Bigtop APT repo GPG key.

September 03, 2019

Change
  • New sub-minor versions of Cloud Dataproc images: 1.1.120-debian9, 1.2.83-debian9, 1.3.43-debian9, 1.4.14-debian9, 1.3.43-ubuntu18, 1.4.14-ubuntu18.
  • Druid Optional Component can now be configured via druid-broker, druid-overlord, druid-coordinator, druid-historical, and druid-middlemanager property prefixes.
  • Image 1.3
  • Image 1.4

August 23, 2019

Change
  • New sub-minor versions of Cloud Dataproc images: 1.1.119-debian9, 1.2.82-debian9, 1.3.42-debian9, 1.4.13-debian9, 1.3.42-ubuntu18, 1.4.13-ubuntu18.
  • Image 1.4
    • Upgraded Zeppelin to 0.8.1 with fix for compatibility issue with Spark 2.4.

August 09, 2019

Change
  • New sub-minor versions of Cloud Dataproc images: 1.1.118-debian9, 1.2.81-debian9, 1.3.41-debian9, 1.4.12-debian9, 1.3.41-ubuntu18, 1.4.12-ubuntu18.
  • Image 1.3
  • Image 1.4
    • Applied patch for MAPREDUCE-7101 to Hadoop.
    • Upgraded Hive to 2.3.5.
    • Added Sqoop version 1.4.6.
Fixed
  • Image 1.4
    • Reverted Zeppelin to version 0.8.0 due to an incompatibility between Spark 2.4 and Zeppelin 0.8.1.

July 02, 2019

Change
  • New sub-minor versions of Cloud Dataproc images: 1.1.116-debian9, 1.2.79-debian9, 1.3.39-debian9, 1.4.10-debian9, 1.3.39-ubuntu18, 1.4.10-ubuntu18.

June 28, 2019

Feature

June 18, 2019

Feature
Fixed
  • Image 1.2, 1.3, and 1.4:
    • Fixed bug where agent would fail to mark jobs as done and complete diagnose requests when Resource Manager requests are slow.

June 11, 2019

Change
  • New major and sub-minor versions of Cloud Dataproc images: 1.1.113-debian9, 1.2.76-debian9, 1.3.36-debian9, 1.4.7-debian9, 1.3.36-ubuntu18, 1.4.7-ubuntu18
  • Failed autoscaling operations now appear with error messages in logs. Previously, the log included only the failing operation ID, which the user was expected to use to look up the error message.
  • Image 1.4:
    • Upgraded Tez to 0.9.2.
  • Image 1.3:
    • The zstandard codec is now supported in Hadoop.

June 01, 2019

Issue

June 2019 Release Note Bulletin

This bulletin applies if you override the defaultFS at job-submission time for Spark jobs.

An inadvertent change in behavior has been identified for determining the location of the Spark eventlog directory when setting spark.hadoop.fs.defaultFS to any setting other than HDFS at Spark job-submission time in the following Cloud Dataproc June 2019 image versions:

1.1.113-debian9, 1.2.76-debian9, 1.3.36-debian9, 1.4.7-debian9, 1.3.36-ubuntu18, 1.4.7-ubuntu18
1.1.114-debian9, 1.2.77-debian9, 1.3.37-debian9, 1.4.8-debian9, 1.3.37-ubuntu18, 1.4.8-ubuntu18

Although the /user/spark/eventlog directory normally stays in HDFS, it began referring to the specified filesystem from spark.hadoop.fs.defaultFS in these image versions, and can result in the failure of Spark jobs with an error similar to "java.io.FileNotFoundException: File not found : /user/spark/eventlog".

Use any of the following strategies as a workaround:

  1. Run the following command manually before submitting jobs:

    hdfs dfs -mkdir -p <filesystem base/bucket>/user/spark/eventlog

  2. Add the following Spark properties to Spark jobs to override defaultFS:

    spark.eventLog.dir=hdfs:///user/spark/eventlog,spark.history.fs.logDirectory=hdfs:///user/spark/eventlog

  3. Temporarily pin to the immediately prior image version numbers until image versions newer than those listed above are available.

May 16, 2019

Change
  • New major and sub-minor versions of Cloud Dataproc images: 1.1.111-debian9, 1.2.74-debian9, 1.3.34-debian9, 1.4.5-debian9, 1.3.34-ubuntu18, 1.4.5-ubuntu18
  • Image 1.1:
    • Parallelized Hive.copyFiles method in Spark's fork of Hive.
    • Resolved scheme-less Spark event log directory relative to default filesystem.
  • Image 1.4:

May 09, 2019

Fixed
  • Fixed a rare issue where a managed instance group could be leaked if a delete cluster request is processed while a labels update is in progress for a cluster that uses preemptible VMs.

May 06, 2019

Change
  • New major and sub-minor versions of Cloud Dataproc images: 1.0.118-debian9, 1.1.109-debian9, 1.2.72-debian9, 1.3.32-debian9, 1.4.3-debian9, 1.3.32-ubuntu18, 1.4.3-ubuntu18
  • Image 1.3:
    • Upgraded Spark to 2.3.3.
Fixed
  • Image 1.2, 1.3, and 1.4:
  • Image 1.3 and 1.4:

April 29, 2019

Fixed
  • Fixed issue where the dataproc.googleapis.com/cluster/job/running_count metric was not consistent with the active running jobs count on the Cloud Dataproc Job page in Google Cloud Platform console.
  • Fixed an issue in which the HDFS Datanode blocks and MapReduce and Spark local directories on clusters with local SSDs were stored on both the cluster's boot disk and on the local SSDs. They are now stored only on the local SSDs.

April 26, 2019

Feature
  • Announcing the Beta release of Cloud Dataproc Job Logs in Stackdriver. Enable this feature to view, search, filter, and archive job driver and YARN container logs in Stackdriver Logging.
    • The dataproc:dataproc.logging.stackdriver.job.driver.enable=true service property can now be used to enable sending Job driver logs to Stackdriver. Logs will be sent to the Stackdriver "Cloud Dataproc Job" resource under the dataproc.job.driver log.
    • The dataproc:dataproc.logging.stackdriver.job.yarn.container.enable service property can now be used to enable sending YARN container logs to Stackdriver. Logs will be sent to the Stackdriver "Cloud Dataproc Job" resource under the dataproc.job.yarn.container log (instead of the "Cloud Dataproc Cluster" resource under the yarn-userlogs log).
Fixed
  • Fixed quota validation errors in which scaling down clusters with accelerators on primary workers failed with java.lang.IllegalArgumentException: occurrences cannot be negative.
  • Fixed an issue where the Tez UI failed to load.
  • Fixed a race condition on cluster startup in which a cluster accepted Hive jobs before the hive-server2 was available via beeline to run the jobs.
  • Fixed an issue where removing the yarn.nodemanager.local-dirs property from yarn-site.xml in a Kerberized cluster failed cluster creation.

April 18, 2019

Feature
  • Cloud Dataproc is now available in the asia-northeast2 region (Osaka).

April 04, 2019

Feature
  • Initial Alpha release of Cloud Dataproc Enhanced Flexibility Mode. Enhanced Flexibility Mode can provide stability and scalability benefits by preserving stateful node data, such as MapReduce shuffle data, in HDFS. Enhanced Flexibility Mode can be used with clusters created with image version 1.4.x and later.

March 12, 2019

Feature
  • Announcing the General Availability (GA) release of Cloud Dataproc Optional Components. This feature allows users to specify additional components to install when creating new Cloud Dataproc clusters (applies to clusters created with image version 1.3 and later).

March 11, 2019

Feature
  • Cloud Dataproc is now available in the europe-west6 region (Zürich, Switzerland).

March 08, 2019

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.114-deb9, 1.1.105-deb9, 1.2.68-deb9, 1.3.28-deb9, 1.4.0-RC12-deb9
  • Image 1.3 and 1.4 preview
    • Cloud Storage connector upgraded to version 1.9.16.
  • Image 1.4 preview
    • Added Spylon (Scala) kernel for Jupyter.

March 04, 2019

Fixed
  • Fixed regression in creating autoscaling clusters where dataproc:alpha.autoscaling.secondary.max_workers had to be greater than 0.

February 26, 2019

Change
  • Due to the issue announced on February 22, 2019, Cloud Dataproc image versions announced on February 14, 2019 (1.0.111-deb9, 1.1.102-deb9, 1.2.65-deb9, 1.3.25-deb9, 1.4.0-RC9-deb9) and the changes and fixes associated with those image versions are currently not available. The latest image versions available for cluster creation, and the changes and fixes associated with those image versions, are those announced on February 11, 2019 (1.0.110-deb9, 1.1.101-deb9, 1.2.64-deb9, 1.3.24-deb9, 1.4.0-RC8-deb9).

February 11, 2019

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.110-deb9, 1.1.101-deb9, 1.2.64-deb9, 1.3.24-deb9, 1.4.0-RC8-deb9.
  • YARN applications that are submitted via the Cloud Dataproc Jobs API now include the Job UUID in the application tags with the format "dataprocuuid${UUID}".
  • Removed upper limit on total number of threads in job drivers.
  • Image 1.3 only:
    • Added a new yarn-site.xml property (yarn.resourcemanager.webapp.methods-allowed) which is set with a comma-separated list of the names of HTTP methods that the YARN Resource Manager will allow to be called in its web UI and REST API (default port 8088). The current default value is "ALL", which allows all HTTP methods (GET, POST, PUT, PATCH, DELETE, HEAD, CONNECT, OPTIONS, TRACE). It is recommended that this property be set to "GET" to disable remote code submission or other modifications via the REST API. This property will default to "GET" for all Cloud Dataproc clusters in a future release (see Important cross-update notes, above).
  • Image 1.4 preview only:
    • Upgraded Kafka to 1.1.1
    • Added a new yarn-site.xml property (yarn.resourcemanager.webapp.methods-allowed) which is set with a comma-separated list of the names of HTTP methods that the YARN Resource Manager will allow to be called in its web UI and REST API (default port 8088). The current default value is "ALL", which allows all HTTP methods (GET, POST, PUT, PATCH, DELETE, HEAD, CONNECT, OPTIONS, TRACE). It is recommended that this property be set to "GET" to disable remote code submission or other modifications via the REST API (see the example after this list). This property will default to "GET" for all Cloud Dataproc clusters in a future release (see Important cross-update notes, above).
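
A hedged sketch of applying the recommended "GET" setting at cluster creation; the yarn: property prefix for yarn-site.xml, the cluster name, and the region are assumptions for illustration:

  gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --properties=yarn:yarn.resourcemanager.webapp.methods-allowed=GET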

December 14, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.105-deb9, 1.1.96-deb9, 1.2.59-deb9, 1.3.19-deb9, 1.4.0-RC3-deb9.
  • Allow properties in zoo.cfg to be configured through 'zookeeper:' property prefix.
  • Allow users to parameterize the name of a managed cluster in workflow templates.

December 04, 2018

Feature

November 12, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.101-deb9, 1.1.92-deb9, 1.2.55-deb9, 1.3.15-deb9.
  • Minor image versions now redirect to Debian 9-based images. For example, 1.2 now points to 1.2-deb9. There will be no new Debian 8-based images.
  • Job UUIDs are now exposed to allow job runs to be uniquely identified.
  • The Cloud Storage Connector now sets fadvise to SEQUENTIAL for Hadoop DistCp jobs. This mode is optimized for streaming reads, which are most efficient for these workloads.
  • Removed the ALPN boot jar from Cloud Dataproc versions 1.0 and 1.1 due to incompatibility with the latest OpenJDK 8 distributed with Debian. gRPC users must use a form of netty-tcnative, for example, io.grpc:grpc-netty-shaded. This already applies to 1.2 and 1.3.
  • Reduced the Linux process priority of user jobs.
  • dfs.namenode.datanode.registration.retry-hostname-dns-lookup is now set to true.
  • Increased the maximum number of DistCp tasks scheduled per node. This improves DistCp performance.
  • Image 1.3 only:
    • Ported HDFS-13056 to Hadoop 2.9.
    • Upgraded Cloud Storage Connector to version 1.9.9. See the GitHub release notes.
    • Presto is now supported as an optional top-level component.
Fixed
  • Fixed a bug where CMEK was not passed to PD on preemptible workers.
  • Fixed a bug where changes to PATH in custom images broke Cloud Dataproc initialization. For example, changing the default Python to Python 3 previously broke initialization.
  • Fixed a bug where POST and PUT requests to the YARN REST API from anonymous users were blocked on Cloud Dataproc 1.3. This was fixed by adding org.apache.hadoop.http.lib.StaticUserWebFilter back to hadoop.http.filter.initializers in core-site.xml.
  • Fixed logging warnings in Hive 2 in Cloud Dataproc 1.1, 1.2, and 1.3.

October 22, 2018

Feature
  • Cloud Dataproc is now available in the asia-east2 region (Hong Kong).

October 19, 2018

Fixed
  • Image 1.0 only: Fixed a bug where Stackdriver metrics were failing to be published, which also impacted autoscaling functionality.

October 12, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.98-deb8, 1.1.89-deb8, 1.2.52-deb8, 1.3.12-deb8, 1.0.98-deb9, 1.1.89-deb9, 1.2.52-deb9, 1.3.12-deb9.
  • Image 1.3 only: Cloud Storage connector upgrade (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to 1.9.8 version.
  • Image 1.0 only: Upgrade Hadoop to 2.7.4.

October 05, 2018

Fixed
  • Fixed an issue in which the timeout for the first initialization action was used as the timeout for all other initialization actions.
  • Fixed infrequent issue in which cluster creation failed with debconf: DbDriver "config": /var/cache/debconf/config.dat is locked by another process: Resource temporarily unavailable error.

September 28, 2018

Feature
  • Feature (1.2+) - Enabled new dataproc:am.primary_only cluster property to prevent application master from running on preemptible workers. This feature is only enabled for Dataproc 1.2+ clusters. To use the cluster property, set --properties dataproc:am.primary_only=true when creating a cluster.

September 25, 2018

Feature

September 21, 2018

Fixed
  • Fixed issue where gRPC based clients can fail when calling Get/List on Operations after using v1beta2 API to perform cluster operations.

September 14, 2018

Fixed
  • Improved granularity and accuracy of Hadoop metrics in Stackdriver
  • Fixed an issue in which the Hue service failed to start on 1.3-deb9 images.

August 31, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.93-deb8, 1.1.84-deb8, 1.2.47-deb8, 1.3.7-deb8, 1.0.93-deb9, 1.1.84-deb9, 1.2.47-deb9, 1.3.7-deb9.

August 24, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.92-deb8, 1.1.83-deb8, 1.2.46-deb8, 1.3.6-deb8, 1.0.92-deb9, 1.1.83-deb9, 1.2.46-deb9, 1.3.6-deb9.
  • Image 1.0-1.2 only: Cloud Storage and BigQuery connector upgrades (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to 1.6.8 version
    • BigQuery connector has been upgraded to 0.10.9 version
  • Image 1.3 only: Cloud Storage connector upgrade to 1.9.5 version (for more information, review the change notes in the GitHub repository)
  • Image 1.3 with Debian 9 only:
    • Upgrade Spark to 2.3.1.
    • Add HBase 1.3.2.
    • Add Flink 1.5.0.

August 16, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.91-deb8, 1.0.91-deb9, 1.1.82-deb8, 1.1.82-deb9, 1.2.45-deb8, 1.2.45-deb9, 1.3.5-deb8, 1.3.5-deb9.
  • Security fix: Install Linux Kernel 4.9 in all image versions to get security fixes for CVE-2018-3590 and CVE-2018-3591 in all new Debian 8 images.

July 31, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.3.3
  • Changes to 1.3 image only:
    • Disabled node blacklisting in Tez jobs (set tez.am.node-blacklisting.enabled=false). This affects all Hive jobs, which run on Tez by default.
Fixed
  • Fixed issue breaking native Snappy compression in spark-shell (SPARK-24018) and Zeppelin.
  • Fixed issue where gsutil and gcloud do not work on cluster VMs when the ANACONDA optional component is selected.

July 18, 2018

Feature

July 06, 2018

Feature
  • Announcing the alpha release of Cloud Dataproc Optional Components. This feature allows users to specify additional components to install when creating new Dataproc clusters.
Fixed
  • A race condition in HA cluster creation with the resolveip utility has been resolved.

June 22, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.85, 1.1.76, 1.2.40
  • Upgraded the Cloud Storage and BigQuery connectors in 1.0.85, 1.1.76, 1.2.40 (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to version 1.6.7
    • BigQuery connector has been upgraded to version 0.10.8

June 11, 2018

Feature
  • Cloud Dataproc is now available in the europe-north1 region (Finland).

June 08, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.83, 1.1.74, 1.2.38
  • Changed the JDBC connect string passed to beeline when submitting Hive jobs to high-availability clusters through the Cloud Dataproc Jobs API. The new connect string takes advantage of HiveServer2 high availability.
Fixed
  • WorkflowTemplates will now correctly report Job failures.

May 28, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.82, 1.1.73, 1.2.37
  • Hive Server 2 now runs on all three masters in High Availability mode.
  • Preview image changes (Dataproc 1.3):
    • Now requires a 15 GB minimum boot disk size.
    • The NameNode Service RPC port has changed from 8040 to 8051.
    • The SPARK_HOME environment variable is now globally set.

May 21, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.81, 1.1.72, 1.2.36
  • Upgraded the Cloud Storage and BigQuery connectors in 1.0.81, 1.1.72, 1.2.36 (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to version 1.6.6
    • BigQuery connector has been upgraded to version 0.10.7
  • New version of Cloud Dataproc 1.3 preview image:
    • Removed the BigQuery connector from the image. Users should instead include the BigQuery connector with the jars for their jobs.
    • Cloud Dataproc 1.3 is not supported.
    • See Cloud Dataproc version list for more information.
  • Hive Metastore is configured to run on all three masters in High Availability mode.

May 14, 2018

Feature
  • New Cloud Dataproc 1.3 image track available in preview.
    • Changes include:
      • Spark 2.3, Hadoop 2.9, Hive 2.3, Pig 0.17, Tez 0.9
      • Hive on Tez by default (no need for the Tez initialization action).
    • Cloud Dataproc 1.3 is not officially supported.
    • See Cloud Dataproc version list for more information.

May 04, 2018

Change

April 20, 2018

Change
  • New sub-minor versions of Cloud Dataproc images - 1.0.77, 1.1.68, 1.2.32
  • Changed the Namenode HTTP port from 50070 to 9870 on HA clusters in the preview image. WebHDFS, for example, is accessible at http://clustername-m-0:9870/webhdfs/v1/ (see the example after this list). This is consistent with standard and single node clusters in Dataproc 1.2+. Dataproc 1.0 and 1.1 clusters will continue to use port 50070 for all cluster modes.
  • Upgraded the Cloud Storage and BigQuery connectors (for more information, review the change notes in the GitHub repository):
    • Cloud Storage connector has been upgraded to version 1.6.5
    • BigQuery connector has been upgraded to version 0.10.6
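    As referenced in the port change item above, a directory listing through WebHDFS on the new port can be requested from a cluster node as follows; the cluster name is hypothetical:

      # List /tmp in HDFS through the WebHDFS REST API on port 9870.
      curl "http://example-cluster-m-0:9870/webhdfs/v1/tmp?op=LISTSTATUS"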
Fixed
  • Fixed issue where cluster can go into ERROR state due to error when resizing a managed instance group.
  • Backported PIG-4967 and MAPREDUCE-6762 into Cloud Dataproc image version 1.2 to fix an occasional NullPointerException in Pig jobs.
  • Fixed an uncommon issue in which a Cloud Dataproc agent restart during a small window of a cluster downscale operation could cause problems decommissioning data nodes.

April 13, 2018

Change
Fixed
  • Fixed rare issue where Cloud Dataproc agent fails to initialize HDFS configuration and reports too few DataNodes reporting.
  • Fixed how Cloud Dataproc determines that HDFS decommission is complete.

April 06, 2018

Change

March 30, 2018

Change

March 22, 2018

Feature

March 16, 2018

Feature
Change
Fixed

February 23, 2018

Change

February 09, 2018

Fixed
  • Fixed an issue where a workflow can get stuck due to a manually deleted cluster.

February 02, 2018

Feature
  • Added support for setting hadoop-env, mapred-env, spark-env, and yarn-env Dataproc properties through new prefixes (see the example after this list). NOTE: applies only to new subminor image versions.
  • Added a button to link to Stackdriver logs for a cluster on the Google Cloud Platform Console Cluster details page.
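    As referenced in the first item above, a minimal sketch of the new environment-file prefixes, assuming values are written as environment variable assignments; the cluster name and variable values are illustrative only:

      gcloud dataproc clusters create example-cluster \
          --properties "spark-env:SPARK_DAEMON_MEMORY=4g,hadoop-env:HADOOP_HEAPSIZE=2048"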

December 20, 2017

Change
  • The Google Cloud Dataproc Graceful Decommissioning feature is now in public release (was Beta). Graceful Decommissioning enables the removal of nodes from the cluster without interrupting jobs in progress. A user-specified timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes. This feature is available when updating clusters using the gcloud command-line tool, the Cloud Dataproc REST API, and Google Cloud Platform Console. See Graceful Decommissioning for more information.
  • The Single Node Clusters feature is now in public release (was Beta). Single node clusters are Cloud Dataproc clusters with only one node that acts as the master and worker. Single node clusters are useful for a number of activities, including development, education, and lightweight data science.

    This feature is available when creating clusters with the gcloud command-line tool, the Cloud Dataproc REST API, and Google Cloud Platform Console. See Single node clusters for more information.

November 10, 2017

Change
  • New sub-minor versions of the Cloud Dataproc images - 1.0.57, 1.1.48, 1.2.12.
  • Apache Hadoop has been upgraded to 2.8.2 in the Cloud Dataproc 1.2 image.

November 01, 2017

Change
  • When using workflow cluster selectors, if more than one cluster matches the specified label(s), Cloud Dataproc will select the cluster with the most free YARN memory. This change replaces the old behavior of choosing a random cluster with the matching label.
  • New sub-minor versions of the Cloud Dataproc images - 1.0.56, 1.1.47, 1.2.11.
  • HTTP 404 and 409 errors will now show the full resource name in order to provide more useful error messages.

October 17, 2017

Fixed
  • Fixed a bug where HTTP keep-alive was causing java.lang.NullPointerException: ssl == null errors when accessing Cloud Storage.
  • The Apache Oozie initialization action has been fixed to work with Cloud Dataproc 1.2.

September 06, 2017

Feature
  • Cluster Scheduled Deletion (Beta) – Cloud Dataproc clusters can now be created with a scheduled deletion policy. Clusters can be scheduled for deletion after a specified duration, at a specified time, or after a specified period of inactivity. See Cluster Scheduled Deletion for more information.
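    A minimal sketch using the gcloud command-line tool; the cluster name and durations are hypothetical, and the Beta release may require the gcloud beta track:

      # Delete the cluster after 30 minutes of inactivity.
      gcloud beta dataproc clusters create example-cluster --max-idle=30m

      # Or delete the cluster after a fixed lifetime.
      gcloud beta dataproc clusters create example-cluster --max-age=8h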

September 05, 2017

Feature

Cloud Dataproc is now available in the southamerica-east1 region (São Paulo, Brazil).

August 18, 2017

Change
  • New Subminor Image Versions – The latest subminor image versions for 1.0, 1.1, and 1.2 now map to 1.0.49, 1.1.40, 1.2.4, respectively.
  • All Cloud Dataproc clusters now have a goog-dataproc-cluster-name label that is propagated to underlying Google Compute Engine resources and can be used to determine combined Cloud Dataproc related costs in exported billing data.
Fixed
  • PySpark drivers are now launched under a changed process group ID to allow the Cloud Dataproc agent to correctly clean up misbehaving or cancelled jobs.
  • Fixed a bug where updating cluster labels and the number of secondary workers in a single update resulted in a stuck update operation and an undeletable cluster.

August 08, 2017

Change

Beginning today, Cloud Dataproc 1.2 will be the default version for new clusters. To use older versions of Cloud Dataproc, you will need to manually select the version on cluster creation.

August 04, 2017

Change

Apache Hadoop on Cloud Dataproc 1.2 has been updated to version 2.8.1

August 01, 2017

Feature

Cloud Dataproc is now available in the europe-west3 region (Frankfurt, Germany).

July 21, 2017

Change

Starting with the Cloud Dataproc 1.2 release, the ALPN boot jars are no longer provided on the Cloud Dataproc image. To avoid Spark job breakage, upgrade Cloud Bigtable client versions, and bundle boringssl-static with Cloud Dataproc jobs. Our initialization action repository contains initialization actions to revert to the previous (deprecated) behavior of including the jetty-alpn boot jar. This change should only impact you if you use Cloud Bigtable or other Java gRPC clients from Cloud Dataproc.

July 11, 2017

Feature
  • Spark 2.2.0 in Preview – The Cloud Dataproc preview image has been updated to Spark 2.2.0.

June 26, 2017

Deprecated

The v1alpha1 and v1beta1 Cloud Dataproc APIs are now deprecated and cannot be used. Instead, you should use the current v1 API.

June 20, 2017

Feature

Cloud Dataproc is now available in the australia-southeast1 region (Sydney).

June 06, 2017

Feature

Cloud Dataproc is now available in the europe-west2 region (London).

April 28, 2017

Change

Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.6.1 and the BigQuery connector has been upgraded to bigquery-connector-0.10.2. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.

Deprecated

The v1alpha1 and v1beta1 Cloud Dataproc APIs are now deprecated and cannot be used. Instead, you should use the current v1 API.

April 07, 2017

Feature

Cloud Dataproc worker IAM role – A new Cloud Dataproc IAM role called Dataproc/Dataproc Worker has been added. This role is intended specifically for use with service accounts.

March 17, 2017

Deprecated

As mentioned in the February 9th release notes, Cloud Audit Logs for Cloud Dataproc are no longer emitted to the dataproc_cluster resource type. Starting with this release, Cloud Audit Logs are emitted to the new cloud_dataproc_cluster resource type.

March 07, 2017

Feature
  • User labels – User labels with Cloud Dataproc resources are now generally available. You can add and update labels on Cloud Dataproc clusters and jobs (see the example after this list). Labels are useful in situations such as cost accounting, work distribution, and testing.
  • Attaching GPUs to clusters (Beta) – Cloud Dataproc clusters now support Compute Engine GPUs. Clusters can have 1-8 GPUs attached to master and worker nodes. These GPUs can be used with applications on the cluster, such as Apache Spark. Attaching GPUs may benefit some types of data processing jobs.
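    As referenced in the user labels item above, labels can be attached at cluster creation and then used to filter list results; the label keys, values, and cluster name are hypothetical:

      gcloud dataproc clusters create example-cluster --labels env=dev,team=analytics
      gcloud dataproc clusters list --filter="labels.env = dev"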

March 01, 2017

Feature
  • Restartable jobs (Beta) – Cloud Dataproc jobs now have an optional setting to restart jobs that have failed. When you set a job to restart, you specify the maximum number of retries per hour (see the example after this list). Restartable jobs allow you to mitigate common types of job failure and are especially useful for long-running and streaming jobs.
  • Single node clusters (Beta) – Single node clusters are Cloud Dataproc clusters with only one node that acts as the master and worker. Single node clusters are useful for a number of activities, including development, education, and lightweight data science.
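    As referenced in the restartable jobs item above, a minimal sketch of submitting a restartable job; the bucket path and cluster name are hypothetical, and the Beta release may require the gcloud beta track:

      gcloud beta dataproc jobs submit pyspark gs://example-bucket/streaming_job.py \
          --cluster=example-cluster \
          --max-failures-per-hour=5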

January 19, 2017

Feature
  • Cloud Dataproc 1.2 preview – The preview image has been updated to reflect the planned Cloud Dataproc 1.2 release. This image includes Apache Spark 2.1 and Apache Hadoop 2.8-SNAPSHOT. The preview image is provided to give access to Hadoop 2.8 release candidates now, and to Hadoop 2.8 in Dataproc 1.2 once Hadoop 2.8 is formally released.

January 05, 2017

Feature
  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.6.0 and the BigQuery connector has been upgraded to bigquery-connector-0.10.1. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.

December 16, 2016

Feature
  • Google Stackdriver Agent Installed – The Stackdriver monitoring agent is now installed by default on Cloud Dataproc clusters. The Cloud Dataproc Stackdriver monitoring documentation has information on how to use Stackdriver monitoring with Cloud Dataproc. The agent for monitoring and logging can be enabled and disabled by adjusting cluster properties when you create a cluster (see the example after this list).
  • Cloud Dataproc 1.1.15 and 1.0.24 – The 1.1 and 1.0 images have been updated with non-impacting updates, bug fixes, and enhancements.
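    As referenced above, for example, the logging agent can be turned off at cluster creation with the property described in the September 01, 2016 entry later on this page; the cluster name is hypothetical:

      gcloud dataproc clusters create example-cluster \
          --properties "dataproc:dataproc.logging.stackdriver.enable=false"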

December 07, 2016

Feature
  • Cloud Dataproc 1.1.14 and 1.0.23 – The 1.1 and 1.0 images have been updated with non-impacting updates, bug fixes, and enhancements.
Change
  • Increased the number of situations in which Cloud Dataproc services are automatically restarted by systemd on clusters in the event of unexpected or unhealthy behavior.

November 14, 2016

Fixed
  • Fixed an issue where the --jars argument was missing from the gcloud dataproc jobs submit pyspark command.

November 08, 2016

Feature
  • Google BigQuery connector upgrade – The BigQuery connector has been upgraded to bigquery-connector-0.10.1-SNAPSHOT. This version introduces the new IndirectBigQueryOutputFormat that uses Hadoop output formats that write directly to a temporary Cloud Storage bucket, and issues a single BigQuery load job per Hadoop/Spark job at job-commit time. For more information, review the BigQuery change notes in the GitHub repository.

November 07, 2016

Feature

November 02, 2016

Change
  • Fixed an issue where failures during a cluster update caused the cluster to fail. Now, update failures return the cluster to Running state.
  • Fixed an issue where submitting a large number of jobs rapidly or over a long period of time caused a cluster to fail.
  • Increased the maximum number of concurrent jobs per cluster.

October 18, 2016

Change
  • Fixed an issue where HiveServer2 was not healthy for up to 60 seconds after the cluster was deployed. Hive jobs should now successfully connect to the required HiveServer2 immediately after a cluster is deployed.

October 11, 2016

Feature
  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.4 and the BigQuery connector has been upgraded to bigquery-connector-0.8.0. For more information, review the Cloud Storage or BigQuery change notes in the GitHub repository.
  • dataproc.localssd.mount.enable – Added the new property dataproc.localssd.mount.enable that can be set at cluster deployment time to make Cloud Dataproc ignore local SSDs. If set, Cloud Dataproc will use the main persistent disks for HDFS and temporary Hadoop directories so the local SSDs can be used separately for user-defined purposes. This property can be set by using the argument --properties dataproc:dataproc.localssd.mount.enable=false when creating a Cloud Dataproc cluster.
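    A full command sketch; the cluster name is hypothetical:

      # Attach one local SSD per worker but keep HDFS and scratch directories on the boot persistent disk.
      gcloud dataproc clusters create example-cluster \
          --num-worker-local-ssds=1 \
          --properties "dataproc:dataproc.localssd.mount.enable=false"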
Change
  • Fixed an issue where CPU quota validation for preemptible virtual machines was being performed against the non-preemptible CPU quota even when preemptible CPU quota was set.

October 07, 2016

Change
  • Optimized the listing of resources by state and cluster UUID. This should reduce several list operations from seconds to milliseconds.

September 29, 2016

Change
  • Optimized how jobs are listed based on state or cluster uuid. This should significantly decrease the time required to list jobs.

September 22, 2016

Feature
  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.3 and the BigQuery connector has been upgraded to bigquery-connector-0.7.9. For more information, review the change notes in the GitHub repository.
Change
  • While Cloud Dataproc has been using Java 8 since its Beta launch in September 2015, there is now a hard dependency on Java 8 or higher.
  • The --preemptible-worker-boot-disk-size flag no longer requires that you specify 0 preemptible workers if you do not want to add preemptible machines when you create a cluster.

September 16, 2016

Feature
  • Preemptible boot disk sizes - The boot disk size for preemptible workers can now be set with the --preemptible-worker-boot-disk-size flag in the gcloud command-line tool at cluster creation, even when no preemptible workers are added to the cluster.
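    A minimal sketch; the cluster name and disk size are hypothetical:

      # Set the preemptible worker boot disk size without adding any preemptible workers.
      gcloud dataproc clusters create example-cluster \
          --preemptible-worker-boot-disk-size=100GB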
Change

September 01, 2016

Feature
  • Identity & Access Management support [BETA] - Cloud Dataproc now has beta support for Identity and Access Management (IAM). Cloud Dataproc IAM permissions allow users to perform specific actions on Cloud Dataproc clusters, jobs, and operations. See Cloud Dataproc Permissions and IAM Roles for more information.
  • LZO support - Cloud Dataproc clusters now natively support the LZO data compression format.
  • Google Stackdriver logging toggle - It is now possible to disable Google Stackdriver logging on Cloud Dataproc clusters. To disable Stackdriver logging, use the command --properties dataproc:dataproc.logging.stackdriver.enable=false when creating a cluster with the gcloud command-line tool.

August 25, 2016

Fixed
  • Fixed an issue where adding many (~200+) nodes to a cluster would cause some nodes to fail.
  • Fixed an issue where output from initialization actions that timed out was not copied to Cloud Storage.
Feature
  • Cloud Storage and BigQuery connector upgrades – The Cloud Storage connector has been upgraded to gcs-connector-1.5.2 and the BigQuery connector has been upgraded to bigquery-connector-0.7.8, which bring performance improvements. See the release notes for the gcs-connector and bigquery-connector for more information.
  • Apache Zeppelin 0.6.1 – The Apache Zeppelin package built for Cloud Dataproc and installable with this initialization action has been upgraded to version 0.6.1. This new version of Zeppelin brings support for Google BigQuery.
Change

Cloud Dataproc 1.1 default – Cloud Dataproc 1.1 is now the default image version for new clusters.

August 08, 2016

Feature

Cloud Dataproc 1.1 – A new image version, Cloud Dataproc 1.1, has been released. Several components have been updated for this image version including:

To create a cluster with the 1.1 image, you can use the gcloud command-line tool with the --image-version argument, such as gcloud dataproc clusters create cluster-name --image-version 1.1.

August 02, 2016

Feature

Updated preview image – Several preview image components have been updated, including:

July 19, 2016

Fixed
  • GCP Console
    • The GCP Console now uses the Cloud Dataproc v1 instead of the v1beta1 API. Clicking on the equivalent REST link will show the appropriate v1 API paths and resource names.
  • Fixed an issue where some HDFS nodes did not join a cluster because their domain name could not be resolved on first boot.

July 01, 2016

Fixed
  • Fixed a bug introduced in a June release that caused clusters to fail only after waiting for ~30 minutes. This occurred most frequently when initialization actions failed during cluster creation. Now clusters should fail within 1 minute of an initialization action failure.
  • Decreased job startup time for SparkSQL jobs with partitioned/nested directories by applying a patch for Spark (SPARK-9926)
  • Further optimized job startup time for any job with a lot of file inputs by applying a patch for Hadoop (HADOOP-12810)

June 10, 2016

Feature

March 30, 2016

Feature
  • Spark 1.6.1 - The Cloud Dataproc image version 1.0 has been updated to include the Spark 1.6.1 maintenance release instead of Spark 1.6.0.
  • OSS upgrades - This release upgrades the Cloud Storage and Google BigQuery connectors to gcs-connector-1.4.5 and bigquery-connector-0.7.5, respectively.
Fixed
  • It is now possible to specify --num-preemptible-workers 0 through the gcloud command-line tool. Previously this would fail.
  • Fixed a validation issue which produced 500 HTTP errors when the response should have been 400 bad input or 200 OK.
  • Resolved a cache validation issue and re-enabled directory inference for the Cloud Storage connector (fs.gs.implicit.dir.infer.enable).
  • Adjusted Compute Engine migration settings due to unexpected host failures - normal VMs will automatically restart after migration and preemptible machines will not. Previously all VMs were set to not automatically restart after a migration.
  • Addressed an issue where rapid job submission would result in a Too many pending operations on a resource error.

February 22, 2016

Feature
  • Cloud Dataproc is now generally available! For more information, see our announcement blog post.

  • Custom Compute Engine machine types - Cloud Dataproc clusters now support custom Compute Engine machine types for both master and worker nodes. This means you can create clusters with customized amounts of virtual CPUs and memory. For more information, please read the Dataproc documentation for custom machine types.

  • OSS upgrades - We have released Cloud Dataproc version 1.0. This release includes an upgrade to Apache Spark 1.6.0 and Apache Hadoop 2.7.2. This release also includes new versions of the Cloud Storage and Google BigQuery connectors.

  • v1 API - The v1 API for Cloud Dataproc is now live. This API includes support for regionality along with minor fixes and adjustments. The API is available in the APIs Explorer and also has a Maven artifact in Maven Central. For more information, review the REST API documentation.

  • Support for --jars for PySpark - Added support for using the --jars option for PySpark jobs (see the example after this list).

  • API auto-enable - Enabling the Cloud Dataproc API now automatically enables required dependent APIs, such as Cloud Storage and Google Compute Engine
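    As referenced in the --jars item above, extra jars can be attached to a PySpark job as follows; the bucket, file, and cluster names are hypothetical:

      gcloud dataproc jobs submit pyspark gs://example-bucket/job.py \
          --cluster=example-cluster \
          --jars=gs://example-bucket/libs/custom-udfs.jar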

Fixed
  • Resolved several issues that occasionally caused some clusters to hang while being scaled down.
  • Improved validation of certain types of malformed URLs, which previously failed during cluster deployment.

January 21, 2016

Feature
  • The dataproc command in the Google Cloud SDK now includes a --properties option for adding or updating properties in some cluster configuration files, such as core-site.xml. Properties are mapped to configuration files by specifying a prefix, such as core:io.serializations. This option makes it possible to modify multiple properties and files when creating a cluster (see the example after this list). For more information, see the Cloud Dataproc documentation for the --properties option.
  • GCP Console
    • An option has been added to the "Create Clusters" form to enable the cloud-platform scope for a cluster. This lets you view and manage data across all Google Cloud Platform services from Cloud Dataproc clusters. You can find this option by expanding the Preemptible workers, bucket, network, version, initialization, & access options section at the bottom of the form.
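    As referenced in the first item above, a minimal sketch of the --properties option with file prefixes; the cluster name and property values are illustrative only:

      # core: maps to core-site.xml, spark: maps to spark-defaults.conf.
      gcloud dataproc clusters create example-cluster \
          --properties "core:fs.trash.interval=60,spark:spark.executor.memory=4g"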

December 16, 2015

Feature
  • Cloud Dataproc clusters now have vim, git, and bash-completion installed by default
  • The Cloud Dataproc API now has an official Maven artifact, Javadocs, and a downloadable .zip file
  • GCP Console
    • Properties can now be specified when submitting a job, and can be seen in the Configuration tab of a job
    • A Clone button has been added that allows you to easily copy all information about a job to a new job submission form
    • The left-side icons for Clusters and Jobs are now custom icons rather than generic ones
    • An Image version field has been added to the bottom of the create cluster form that allows you to select a specific Cloud Dataproc image version when creating a cluster
    • A VM Instances tab has been added on the cluster detail page, which you can use to display a list of all VMs in a cluster and easily SSH into the master node
    • An Initialization Actions field has been added to the bottom of the create cluster form, which allows you to specify initialization actions when creating a cluster
    • Paths to Cloud Storage buckets that are displayed in error messages are now clickable links.
Fixed
  • Forced distcp settings to match mapred-site.xml settings to provide additional fixes for the distcp command (see this related JIRA)
  • Ensured that workers created during an update do not join the cluster until after custom initialization actions are complete
  • Ensured that workers always disconnect from a cluster when the Cloud Dataproc agent is shutdown
  • Fixed a race condition in the API frontend that occurred when validating a request and marking cluster as updating
  • Enhanced validation checks for quota, Cloud Dataproc image, and initialization actions when updating clusters
  • Improved handling of jobs when the Cloud Dataproc agent is restarted
  • GCP Console
    • Allowed duplicate arguments when submitting a job
    • Replaced generic Failed to load message with details about the cause of an error when an error occurs that is not related to Cloud Dataproc
    • When a single jar file for a job is submitted, allowed it to be listed only in the Main class or jar field on the Submit a Job form, and no longer required it to also be listed in the Jar files field

November 18, 2015

Feature
  • Version selection - With the release of Cloud Dataproc version 0.2, you can now select among different versions of Cloud Dataproc (see Cloud Dataproc Versioning for information on support for previous versions and the Cloud Dataproc Version List for a list of the software components supported in each version). You can select a Cloud Dataproc version when creating a cluster through the Cloud Dataproc API, the Cloud SDK (using the gcloud beta dataproc clusters create --image-version command), or through the Google Cloud Platform Console (https://console.cloud.google.com/). Note that within four days of the release of a new version in a region, the new version will become the default version used to create new clusters in the region.

  • OSS upgrades - We have released Cloud Dataproc version 0.2. The new Spark component includes a number of bug fixes. The new Hive component enables use of the hive command, contains performance improvements, and has a new metastore.

  • Connector updates - We released updates to our BigQuery and Google Cloud Storage connectors (0.7.3 and 1.4.3, respectively.) These connectors fix a number of bugs and the new versions are now included in Cloud Dataproc version 0.2.

  • Hive Metastore - We introduced a MySQL-based per-cluster persistent metastore, which is shared between Hive and SparkSQL. This also fixes the hive command.

  • More Native Libraries - Cloud Dataproc now includes native Snappy libraries. It also includes native BLAS, LAPACK and ARPACK libraries for Spark's MLlib.

  • Clusters --diagnose command - The Cloud SDK now includes a --diagnose command for gathering logging and diagnostic information about your cluster. More details about this command are available in the Cloud Dataproc support documentation.
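    For example (the cluster name is hypothetical, and at the time of this release the command may have required the gcloud beta track):

      # Collect diagnostic logs and configuration into a Cloud Storage archive.
      gcloud dataproc clusters diagnose example-cluster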

Fixed
  • Fixed the ability to delete jobs that fast-failed before some cluster and staging directories were created
  • Fixed some remaining errors with vmem settings when using the distcp command
  • Fixed a rare bug in which underlying Compute Engine issues could lead to VM instances failing to be deleted after the Cloud Dataproc cluster had been successfully deleted
  • Hive command has been fixed
  • Fixed error reporting when updating the number of workers (standard and preemptible) in a cluster
  • Fixed some cases when Rate Limit Exceeded errors occurred
  • The maximum cluster name length is now correctly 55 instead of 56 characters
  • GCP Console
    • Cluster list now includes a Created column, and the cluster configuration tab now includes a Created field, showing the creation time of the cluster
    • In the cluster-create screen, cluster memory sizes greater than 999 GB are now displayed in TB
    • Fields that were missing from the PySpark and Hive job configuration tab (Additional Python Files and Jar Files) have been added
    • The option to add preemptible nodes when creating a cluster is now in the "expander" at the bottom of the form
    • Machine types with insufficient memory (less than 3.5 GB) are no longer displayed in the list of machine types (previously, selecting one of these small machine types would lead to an error from the backend)
    • The placeholder text in the Arguments field of the submit-job form has been corrected
Change

Core service improvements

  • If set, a project's default zone setting is now used as the default value for the zone in the create-cluster form in the GCP Console.

Optimizations

  • Hive performance has been greatly increased, especially for partitioned tables with a large number of partitions
  • Multithreaded listStatus has now been enabled, which should speed up job startup time for FileInputFormats reading large numbers of files and directories in Cloud Storage

October 15, 2015

Fixed
  • Fixed a bug in which DataNodes failed to register with the NameNode on startup, resulting in less-than-expected HDFS capacity.
  • Prevented the submission of jobs in an Error state.
  • Fixed bug that prevented clusters from deleting cleanly in some situations.
  • Reduced HTTP 500 errors when deploying Cloud Dataproc clusters.
  • Corrected distcp out-of-memory errors with better cluster configuration.
  • Fixed a situation in which jobs failed to delete properly and were stuck in a Deleting state.
Change

Core service improvements

  • Provided more detail about HTTP 500 errors instead of showing 4xx errors.
  • Added information on existing resources for Resource already exists errors.
  • Specific information now provided instead of generic error message for errors related to Cloud Storage.
  • Listing operations now support pagination.

Optimizations

  • Significantly improved YARN utilization for MapReduce jobs running directly against Cloud Storage.
  • Made adjustments to yarn.scheduler.capacity.maximum-am-resource-percent to enable better utilization and concurrent job support.