Use Lightning Engine

Lightning Engine is the next generation of Apache Spark performance, introducing exclusive enhancements designed to deliver substantial improvements in performance, cost-efficiency, and operational stability.

Benefits

Lightning Engine benefits include the following:

  • Accelerated data operations: Achieve significant performance gains and cost savings through optimizations to cloud storage interaction, including metadata handling, write workloads, and vectored I/O.

  • Intelligent query execution: Leverage advanced optimizer enhancements that dynamically reduce data scanned, optimize data processing, and generate more efficient execution plans for faster, more cost-effective queries.

  • Streamlined AI and ML workloads: Reduce cluster startup times for GPU-based workloads and simplify deployment in secure environments with native AI and ML images.

While Lightning Engine offers substantial performance gains, the specific impact varies with the workload. It is best suited for compute-intensive tasks that use the Spark DataFrame and Dataset APIs and Spark SQL queries, rather than for I/O-bound operations.

Comparison to standard engine

Lightning Engine is an alternative to the standard engine used to execute Spark jobs on a Managed Service for Apache Spark cluster. The following table compares the standard engine and Lightning Engine by activation property, workload applicability, and key benefits.

| Feature | Standard engine | Lightning Engine |
| --- | --- | --- |
| Activation property | `--engine=default` or unset the flag | `--engine=lightning` |
| Best for | General-purpose jobs, development, and testing | Enterprise-scale workloads requiring significant acceleration |
| Key benefits | Baseline performance | Optimized cloud storage interaction, intelligent query execution |

Requirements

The following requirements apply to the Lightning Engine feature:

  • Image version: Lightning Engine must be used with Managed Service for Apache Spark image version 2.3.3 or later.
  • Supported jobs: Spark, PySpark, SparkSQL, and SparkR jobs are supported. Other job types submitted to a Lightning Engine cluster run on the standard engine.

Native Query Execution

Native Query Execution (NQE) is an optional component of Lightning Engine that provides a deeper level of acceleration for specific jobs. It is a native engine based on Apache Gluten and Velox, optimized for Google hardware, which boosts performance by running parts of a Spark query outside the JVM.

NQE is recommended for:
Compute-intensive tasks (rather than I/O-bound operations) that use the Spark DataFrame and Dataset APIs and Spark SQL queries to read data from Parquet and ORC files. The output file format doesn't affect NQE performance.
NQE is not recommended for:
Jobs that rely heavily on Resilient Distributed Datasets (RDDs), user-defined functions (UDFs), or most Spark Machine Learning (ML) libraries.
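As an illustration of the guidance above, the following PySpark sketch shows the kind of rewrite that keeps a job NQE-eligible: replacing a Python UDF with an equivalent built-in DataFrame expression. The input path and column names are hypothetical, and the sketch assumes a cluster where Lightning Engine and NQE are enabled.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("nqe-friendly").getOrCreate()

# Hypothetical Parquet input; NQE targets Parquet and ORC scans.
df = spark.read.parquet("gs://BUCKET/sales/")

# Not recommended: a Python UDF is opaque to the native engine, so
# jobs that rely on it don't benefit from NQE.
to_usd = F.udf(lambda cents: cents / 100.0, DoubleType())
with_udf = df.withColumn("usd", to_usd(F.col("price_cents")))

# Preferred: the equivalent built-in expression stays NQE-eligible.
with_builtin = df.withColumn("usd", F.col("price_cents") / 100.0)
```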

Requirements

The following requirements apply to the Native Query Execution feature:

  • Execution engine: NQE is available only on clusters created with Lightning Engine enabled.

  • Operating system: Only Debian 12 images are supported. NQE-enabled jobs that use any other OS will fail.

  • Supported jobs: Spark, PySpark, SparkSQL, and SparkR jobs are supported. Other job types submitted to a Lightning Engine cluster run on the standard engine (without NQE).

  • Machine types: Only machine families using Intel or AMD processors are supported. NQE-enabled jobs using ARM processors will fail (but can benefit from Lightning Engine without NQE).

  • GPUs and accelerators: Not supported. NQE-enabled jobs submitted with GPU accelerators will fail (but can still benefit from Lightning Engine without NQE).

  • Data types: Inputs of the following data types are not supported:

    • Byte: ORC and Parquet
    • Struct, Array, Map: Parquet

Pricing

For pricing information, see Managed Service for Apache Spark on Compute Engine pricing.

Create a Lightning Engine cluster

This section shows you how to create a Managed Service for Apache Spark cluster that enables Lightning Engine on Spark jobs submitted to the cluster.

You can also enable Native Query Execution (NQE) on the cluster when you create the cluster, or you can enable NQE later for specific Spark jobs submitted to the cluster.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that you have the permissions required to complete this guide.

  4. Verify that billing is enabled for your Google Cloud project.

  5. Enable the Dataproc API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

  6. Install the Google Cloud CLI.

  7. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  8. To initialize the gcloud CLI, run the following command:

    gcloud init

Required roles

Certain IAM roles are required to create a Managed Service for Apache Spark cluster and submit jobs to the cluster. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?.

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:

Service account role

To ensure that the Compute Engine default service account has the necessary permissions to create a Managed Service for Apache Spark cluster, ask your administrator to grant the Managed Service for Apache Spark Worker (roles/dataproc.worker) IAM role to the Compute Engine default service account on the project.
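As a sketch, granting this role with the gcloud CLI might look like the following, assuming the Compute Engine default service account uses the standard PROJECT_NUMBER-compute@developer.gserviceaccount.com name:

```shell
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```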

Create the cluster

The following examples show you how to create a Lightning Engine cluster using the Google Cloud console, Google Cloud CLI, Dataproc API, Python, or Terraform. You can also create a Managed Service for Apache Spark cluster with the Lightning Engine enabled using the Go, Java, and Node.js client libraries.

Console

  1. In the Google Cloud console, go to Create an Apache Spark cluster on Compute Engine. For more information, see create a cluster with the Google Cloud console.

    Go to Create an Apache Spark cluster on Compute Engine

  2. Under Define your cluster, select the Enable Lightning Engine checkbox to create a cluster with Lightning Engine enabled.

  3. Optional: To enable the native execution runtime by default for Spark jobs, select the Enable Native Execution checkbox.

  4. Configure other cluster settings as needed.

  5. Click Create.

gcloud CLI

  1. To create a cluster with the Lightning Engine enabled, run the gcloud dataproc clusters create command with the --engine=lightning flag. For more information, see create a cluster with gcloud CLI.

    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --engine=lightning \
        --image-version=2.3
    
  2. Optional: To enable the native execution runtime by default for Spark jobs, include the spark:spark.dataproc.lightningEngine.runtime=native property.

    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --engine=lightning \
        --image-version=2.3 \
        --properties='spark:spark.dataproc.lightningEngine.runtime=native'
    

API

To create a cluster with the Lightning Engine enabled, send a clusters.create request. For more information, see create a cluster with the REST API.

  1. In the request body, set the engine field to LIGHTNING.

    {
      "projectId": "PROJECT_ID",
      "clusterName": "CLUSTER_NAME",
      "config": {
        "gceClusterConfig": {},
        "softwareConfig": {
          "imageVersion": "2.3"
        }
      },
      "engine": "LIGHTNING"
    }
    
  2. Optional: To enable the native execution runtime by default for all jobs, include the spark:spark.dataproc.lightningEngine.runtime property.

    {
      "projectId": "PROJECT_ID",
      "clusterName": "CLUSTER_NAME",
      "config": {
        "gceClusterConfig": {},
        "softwareConfig": {
          "imageVersion": "2.3",
          "properties": {
            "spark:spark.dataproc.lightningEngine.runtime": "native"
          }
        }
      },
      "engine": "LIGHTNING"
    }
    

Python

  1. To create a cluster with the Lightning Engine enabled, use the create_cluster method and set the engine field in the cluster configuration to LIGHTNING. For more information, see create a cluster with Python.

    from google.cloud import dataproc_v1
    
    def create_lightning_cluster(project_id, region, cluster_name):
        client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)
    
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "engine": "LIGHTNING",
                "software_config": {
                    "image_version": "2.3-debian12",
                },
            }
        }
    
        operation = cluster_client.create_cluster(
            project_id=project_id,
            region=region,
            cluster=cluster
        )
        result = operation.result()
        print(f"Cluster created successfully: {result.cluster_name}")
    
  2. Optional: To enable the native execution runtime by default for Spark jobs, include the spark:spark.dataproc.lightningEngine.runtime property.

    from google.cloud import dataproc_v1
    
    def create_lightning_native_cluster(project_id, region, cluster_name):
        client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)
    
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "engine": "LIGHTNING",
                "software_config": {
                    "image_version": "2.3-debian12",
                    "properties": {
                        "spark:spark.dataproc.lightningEngine.runtime": "native"
                    }
                }
            }
        }
    
        operation = cluster_client.create_cluster(
            project_id=project_id,
            region=region,
            cluster=cluster
        )
        result = operation.result()
        print(f"Cluster created successfully: {result.cluster_name}")
    

Terraform

  1. In your google_dataproc_cluster resource configuration, set the engine argument to LIGHTNING.
  2. For more details and advanced options, refer to the official Terraform documentation for the google_dataproc_cluster resource.
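A minimal sketch of such a configuration follows. The `name`, `region`, and `cluster_config` arguments are standard `google_dataproc_cluster` arguments; verify the exact placement of the `engine` argument and the provider versions that support it in the Terraform reference before use.

```hcl
resource "google_dataproc_cluster" "lightning_cluster" {
  name   = "lightning-engine-cluster"
  region = "us-central1"

  cluster_config {
    software_config {
      image_version = "2.3-debian12"
    }
  }

  # Selects Lightning Engine; check the provider reference for
  # the supported schema placement of this argument.
  engine = "LIGHTNING"
}
```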

Verify the cluster engine

Console

  1. In the Google Cloud console, go to the Cluster Details page.
  2. Verify that Lightning Engine is listed in the Engine field.
  3. If you enabled Native Query Execution, verify that native is listed in the Native Execution field.

gcloud

  1. To verify the engine and NQE (if enabled), run the gcloud dataproc clusters describe command:

    gcloud dataproc clusters describe CLUSTER_NAME --project=PROJECT_ID --region=REGION
    
  2. Check the output for the engine and lightningEngine.runtime properties:

    clusterName: lightning-engine-cluster
    engine: lightningEngine
    lightningEngine.runtime: native
    

Submit a job with Lightning Engine

After you create a Lightning Engine cluster, Lightning Engine is automatically enabled on any Spark job that you submit to the cluster.
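For example, a submission looks the same as on a standard cluster. The following sketch runs the SparkPi example class, assuming the examples jar ships at its usual location in the cluster's Spark installation:

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```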

Enable Native Query Execution for a job

If you enabled Native Query Execution (NQE) when you created a Lightning Engine cluster, all Spark jobs will run with NQE enabled unless you disable NQE on a specific job.

If you didn't enable NQE when you created the Lightning Engine cluster, you can enable NQE for a specific job when you submit the job, as shown in the following examples.

gcloud

To enable Native Query Execution when you submit a Spark job, include the spark.dataproc.lightningEngine.runtime=native property:

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=spark.dataproc.lightningEngine.runtime=native \
    -- ...
```

API

To enable Native Query Execution when you submit a Spark job, include the spark.dataproc.lightningEngine.runtime property in your request:

```json
{
  "job":{
    "placement":{
      "clusterName": ...
    },
    "sparkJob":{
      "mainClass": ...,
      "properties":{
         "spark.dataproc.lightningEngine.runtime":"native"
      }
    }
  }
}
```

Disable Native Query Execution for a job

If you enabled Native Query Execution (NQE) when you created a Lightning Engine cluster, all Spark jobs will run with NQE enabled unless you disable NQE on a specific job.

You can disable NQE for a specific Spark job when you submit the job, as shown in the following examples.

gcloud

To disable Native Query Execution when you submit a Spark job to a Lightning Engine cluster, include the spark.dataproc.lightningEngine.runtime=default property:

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=spark.dataproc.lightningEngine.runtime=default \
    -- ...
```

API

To disable Native Query Execution when you submit a Spark job to a Lightning Engine cluster, include the spark.dataproc.lightningEngine.runtime=default property:

```json
{
  "job":{
    "placement":{
      "clusterName": ...
    },
    "sparkJob":{
      "mainClass": ...,
      "properties":{
         "spark.dataproc.lightningEngine.runtime":"default"
      }
    }
  }
}
```

Verify Native Query Execution for a job

After you submit a job to a Lightning Engine cluster, you can verify that Native Query Execution is enabled for the job.

Console

  1. In the Google Cloud console, go to the Job Details page.
  2. Verify that native is listed in the Native Execution field.

gcloud

  1. Run the gcloud dataproc jobs describe command:

    gcloud dataproc jobs describe JOB_ID --project=PROJECT_ID --region=REGION
    
  2. Check the output for the lightningEngine.runtime in the Properties section:

    lightningEngine.runtime: native
    

Configuration parameters

The following table summarizes the main configuration parameters for Lightning Engine and Native Query Execution.

| Parameter name | Description | Applicable engine(s) | Default value | Default value (Lightning Engine) | User overridable (job level) | Scope |
| --- | --- | --- | --- | --- | --- | --- |
| `--engine` | Cluster-level setting to select the engine during cluster creation. | Cluster-wide | `default` | `lightning` | No | Cluster |
| `spark:spark.dataproc.lightningEngine.runtime` | Cluster-level setting to select the Lightning Engine runtime during cluster creation. | Lightning only | `default` | `default` | No | Cluster |
| `spark.dataproc.lightningEngine.runtime` | Enables or disables Native Query Execution (NQE) within Lightning Engine. | Lightning only | `default` | `default` | Yes (`native` or `default`) | Job |

Limitations

Enabling Native Query Execution in the following scenarios can cause exceptions, Spark incompatibilities, or workload fallback to the default Spark engine.

Fallbacks

Using Native Query Execution in the following scenarios can result in a workload fallback to the Spark execution engine:

  • ANSI mode: If ANSI mode is enabled, execution falls back to Spark.
  • Case-sensitive mode: Native Query Execution supports only the Spark default case-insensitive mode. If case-sensitive mode is enabled, incorrect results can occur.
  • Partitioned table scan: Native Query Execution supports the partitioned table scan only when the path contains the partition information. Otherwise, the workload falls back to the Spark execution engine.

Incompatible behavior

Incompatible behavior or incorrect results can occur when you use Native Query Execution in the following cases:

  • JSON functions: Native Query Execution supports strings surrounded by double quotes, not single quotes. Incorrect results occur with single quotes. Using * in the path with the get_json_object function returns NULL.
  • Parquet read configuration:
    • Native Query Execution treats spark.files.ignoreCorruptFiles as set to the default false value, even when set to true.
    • Native Query Execution ignores spark.sql.parquet.datetimeRebaseModeInRead, and returns only the Parquet file contents. Differences between the legacy hybrid calendar and the Proleptic Gregorian calendar are not considered. Spark results can differ.
  • NaN: not supported. Unexpected results can occur, for example, when you use NaN in a numeric comparison.
  • Spark columnar reading: a fatal error can occur because the Spark columnar vector is incompatible with Native Query Execution.
  • Spill: when you set shuffle partitions to a large number, the spill-to-disk feature can trigger an OutOfMemoryException. If this occurs, reducing the number of partitions can eliminate this exception.
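For example, if a job hits the spill-related OutOfMemoryException described above, lowering spark.sql.shuffle.partitions at submission time is one way to reduce the partition count (the value 200, Spark's default, is shown only as an illustration; tune it for your workload):

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=spark.sql.shuffle.partitions=200 \
    -- ...
```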

What's next