Use Lightning Engine
Lightning Engine is the next generation of Apache Spark performance, introducing exclusive enhancements designed to deliver substantial improvements in performance, cost-efficiency, and operational stability.
Benefits
Lightning Engine benefits include the following:
- Accelerated data operations: Achieve significant performance gains and cost savings through optimizations to cloud storage interaction, including metadata handling, write workloads, and vectored I/O.
- Intelligent query execution: Leverage advanced optimizer enhancements that dynamically reduce data scanned, optimize data processing, and generate more efficient execution plans for faster, more cost-effective queries.
- Streamlined AI and ML workloads: Reduce cluster startup times for GPU-based workloads and simplify deployment in secure environments with native AI and ML images.
While Lightning Engine offers substantial performance gains, the specific impact varies with the workload. It is best suited for compute-intensive tasks that use Spark DataFrame APIs, Spark Dataset APIs, and Spark SQL queries, rather than I/O-bound operations.
Comparison to standard engine
Lightning Engine is an alternative to the standard engine used to execute Spark jobs on a Managed Service for Apache Spark cluster. The following table compares the standard engine and Lightning Engine across activation properties, workload applicability, and key benefits.
| Feature | Standard engine | Lightning Engine |
|---|---|---|
| Activation property | `--engine=default` or unset the flag | `--engine=lightning` |
| Best for | General-purpose jobs, development, and testing | Enterprise-scale workloads requiring significant acceleration |
| Key benefits | Baseline performance | Optimized cloud storage interaction, intelligent query execution |
Requirements
The following requirements apply to the Lightning Engine feature:
- Image version: Lightning Engine must be used with Managed Service for Apache Spark image version `2.3.3` or later.
- Supported jobs: Spark, PySpark, SparkSQL, and SparkR are supported. The standard engine will run on other job types submitted to a Lightning Engine cluster.
Native Query Execution
Native Query Execution (NQE) is an optional component of Lightning Engine that provides a deeper level of acceleration for specific jobs. It is a native engine based on Apache Gluten and Velox, optimized for Google hardware, which boosts performance by running parts of a Spark query outside the JVM.
- NQE is recommended for: compute-intensive tasks (rather than I/O-bound operations) that use Spark DataFrame APIs, Spark Dataset APIs, and Spark SQL queries reading data from Parquet and ORC files. The output file format doesn't affect performance.
- NQE is not recommended for: jobs that rely heavily on Resilient Distributed Datasets (RDDs), User-Defined Functions (UDFs), or most Spark Machine Learning (ML) libraries.
Requirements
The following requirements apply to the Native Query Execution feature:
- Execution engine: NQE is available only on clusters created with Lightning Engine enabled.
- Operating system: only `Debian-12` images are supported. NQE-enabled jobs using any other OS will fail.
- Supported jobs: Spark, PySpark, SparkSQL, and SparkR are supported. The standard engine will run (without NQE) on other job types submitted to a Lightning Engine cluster.
- Machine types: only machine families using Intel or AMD processors are supported. NQE-enabled jobs using ARM processors will fail (but can benefit from Lightning Engine without NQE).
- GPUs and accelerators: NQE-enabled jobs submitted on GPU accelerators will fail (but can benefit from Lightning Engine without NQE).
- Data types: inputs of the following data types are not supported:
  - Byte: ORC and Parquet
  - Struct, Array, Map: Parquet
Pricing
For pricing information, see Managed Service for Apache Spark on Compute Engine pricing.
Create a Lightning Engine cluster
This section shows you how to create a Managed Service for Apache Spark cluster that enables Lightning Engine on Spark jobs submitted to the cluster.
You can also enable Native Query Execution (NQE) on the cluster when you create the cluster, or you can enable NQE later for specific Spark jobs submitted to the cluster.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (`roles/resourcemanager.projectCreator`), which contains the `resourcemanager.projects.create` permission. Learn how to grant roles.
- Verify that you have the permissions required to complete this guide.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Dataproc API. To enable APIs, you need the Service Usage Admin IAM role (`roles/serviceusage.serviceUsageAdmin`), which contains the `serviceusage.services.enable` permission. Learn how to grant roles.
- Install the Google Cloud CLI. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
- To initialize the gcloud CLI, run the following command:

  ```shell
  gcloud init
  ```
Required roles
Certain IAM roles are required to create a Managed Service for Apache Spark cluster and submit jobs to the cluster. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?.
For more information about granting roles, see Manage access to projects, folders, and organizations.
User roles
To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:
- Managed Service for Apache Spark Editor (`roles/dataproc.editor`) on the project
- Service Account User (`roles/iam.serviceAccountUser`) on the Compute Engine default service account
Service account role
To ensure that the Compute Engine default service account has the necessary
permissions to create a Managed Service for Apache Spark cluster,
ask your administrator to grant the
Managed Service for Apache Spark Worker (roles/dataproc.worker)
IAM role to the Compute Engine default service account on the project.
Create the cluster
The following examples show you how to create a Lightning Engine cluster using the Google Cloud console, Google Cloud CLI, Dataproc API, Python, or Terraform. You can also create a Managed Service for Apache Spark cluster with the Lightning Engine enabled using the Go, Java, and Node.js client libraries.
Console
- In the Google Cloud console, go to Create an Apache Spark cluster on Compute Engine. For more information, see create a cluster with the Google Cloud console.
- Under Define your cluster, select the Enable Lightning Engine checkbox to create a cluster with Lightning Engine enabled.
- Optional: To enable the native execution runtime by default for Spark jobs, select the Enable Native Execution checkbox.
- Configure other cluster settings as needed.
- Click Create.
gcloud CLI
To create a cluster with the Lightning Engine enabled, run the `gcloud dataproc clusters create` command with the `--engine=lightning` flag. For more information, see create a cluster with gcloud CLI.

```shell
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --engine=lightning \
    --image-version=2.3
```

Optional: To enable the native execution runtime by default for Spark jobs, include the `spark:spark.dataproc.lightningEngine.runtime=native` property.

```shell
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --engine=lightning \
    --image-version=2.3 \
    --properties='spark:spark.dataproc.lightningEngine.runtime=native'
```
API
To create a cluster with the Lightning Engine enabled, send a clusters.create request. For more information, see create a cluster with the REST API.

In the request body, set the `engine` field to `LIGHTNING`.

```json
{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    "gceClusterConfig": {},
    "softwareConfig": {
      "imageVersion": "2.3"
    }
  },
  "engine": "LIGHTNING"
}
```

Optional: To enable the native execution runtime by default for all jobs, include the `spark:spark.dataproc.lightningEngine.runtime` property.

```json
{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    "gceClusterConfig": {},
    "softwareConfig": {
      "imageVersion": "2.3",
      "properties": {
        "spark:spark.dataproc.lightningEngine.runtime": "native"
      }
    }
  },
  "engine": "LIGHTNING"
}
```
Python
To create a cluster with the Lightning Engine enabled, use the `create_cluster` method and set the `engine` field in the cluster configuration to `LIGHTNING`. For more information, see create a cluster with Python.

```python
from google.cloud import dataproc_v1


def create_lightning_cluster(project_id, region, cluster_name):
    client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "engine": "LIGHTNING",
            "software_config": {
                "image_version": "2.3-debian12",
            },
        },
    }

    operation = cluster_client.create_cluster(
        project_id=project_id, region=region, cluster=cluster
    )
    result = operation.result()
    print(f"Cluster created successfully: {result.cluster_name}")
```

Optional: To enable the native execution runtime by default for Spark jobs, include the `spark:spark.dataproc.lightningEngine.runtime` property.

```python
from google.cloud import dataproc_v1


def create_lightning_native_cluster(project_id, region, cluster_name):
    client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "engine": "LIGHTNING",
            "software_config": {
                "image_version": "2.3-debian12",
                "properties": {
                    "spark:spark.dataproc.lightningEngine.runtime": "native"
                },
            },
        },
    }

    operation = cluster_client.create_cluster(
        project_id=project_id, region=region, cluster=cluster
    )
    result = operation.result()
    print(f"Cluster created successfully: {result.cluster_name}")
```
Terraform
- In your `google_dataproc_cluster` resource configuration, set the `engine` argument to `LIGHTNING`.
- For more details and advanced options, refer to the official Terraform documentation for the `google_dataproc_cluster` resource.
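A minimal sketch of such a configuration might look like the following. The field layout here is an assumption based on the description above (the provider's exact schema, including where the `engine` argument sits, should be verified against the `google_dataproc_cluster` resource documentation):

```hcl
# Hypothetical sketch; verify field names and placement against the
# google_dataproc_cluster provider schema before use.
resource "google_dataproc_cluster" "lightning" {
  name   = "lightning-engine-cluster"
  region = "us-central1"

  cluster_config {
    software_config {
      image_version = "2.3-debian12"

      # Optional: enable the native execution runtime by default.
      properties = {
        "spark:spark.dataproc.lightningEngine.runtime" = "native"
      }
    }
  }

  # Selects Lightning Engine, per the engine argument described above.
  engine = "LIGHTNING"
}
```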
Verify the cluster engine
Console
- In the Google Cloud console, go to the Cluster Details page.
- Verify that `Lightning Engine` is listed in the Engine field.
- If you enabled Native Query Execution, verify that `native` is listed in the Native Execution field.
gcloud
To verify the engine and NQE (if enabled), run the `gcloud dataproc clusters describe` command:

```shell
gcloud dataproc clusters describe CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION
```

Check the output for the `engine` and `lightningEngine.runtime` properties:

```none
clusterName: lightning-engine-cluster
engine: lightningEngine
lightningEngine.runtime: native
```
Submit a job with Lightning Engine
After you create a Lightning Engine cluster, when you submit a Spark job to the cluster, Lightning Engine is automatically enabled on the job.
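No engine-specific flags are needed at submission time. As an illustrative template (the `SparkPi` class and example JAR path below are the common Dataproc quickstart values, shown here as an assumption; `CLUSTER_NAME` and `REGION` are placeholders):

```shell
# Submit a standard Spark job; Lightning Engine applies automatically
# because the target cluster was created with --engine=lightning.
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```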
Enable Native Query Execution for a job
If you enabled Native Query Execution (NQE) when you created a Lightning Engine cluster, all Spark jobs will run with NQE enabled unless you disable NQE on a specific job.
If you didn't enable NQE when you created the Lightning Engine cluster, you can enable NQE for a specific job when you submit the job, as shown in the following examples.
gcloud
To enable Native Query Execution when you
submit a Spark job,
include the `spark.dataproc.lightningEngine.runtime=native` property:
```shell
gcloud dataproc jobs submit spark \
--cluster=CLUSTER_NAME \
--region=REGION \
--properties=spark.dataproc.lightningEngine.runtime=native \
-- ...
```
API
To enable Native Query Execution when you
submit a Spark job,
include the `spark.dataproc.lightningEngine.runtime` property in your request:
```json
{
"job":{
"placement":{
"clusterName": ...
},
"sparkJob":{
"mainClass": ...,
"properties":{
"spark.dataproc.lightningEngine.runtime":"native"
}
}
}
}
```
Disable Native Query Execution for a job
If you enabled Native Query Execution (NQE) when you created a Lightning Engine cluster, all Spark jobs will run with NQE enabled unless you disable NQE on a specific job.
You can disable NQE for a specific Spark job when you submit the job, as shown in the following examples.
gcloud
To disable Native Query Execution when you
submit a Spark job
to a Lightning Engine cluster, include the `spark.dataproc.lightningEngine.runtime=default` property:
```shell
gcloud dataproc jobs submit spark \
--cluster=CLUSTER_NAME \
--region=REGION \
--properties=spark.dataproc.lightningEngine.runtime=default \
-- ...
```
API
To disable Native Query Execution when you
submit a Spark job
to a Lightning Engine cluster, include the `spark.dataproc.lightningEngine.runtime=default` property:
```json
{
"job":{
"placement":{
"clusterName": ...
},
"sparkJob":{
"mainClass": ...,
"properties":{
"spark.dataproc.lightningEngine.runtime":"default"
}
}
}
}
```
Verify Native Query Execution for a job
After you submit a job to a Lightning Engine cluster, you can verify that Native Query Execution is enabled for the job.
Console
- In the Google Cloud console, go to the Job Details page.
- Verify that `native` is listed in the Native Execution field.
gcloud
Run the `gcloud dataproc jobs describe` command:

```shell
gcloud dataproc jobs describe JOB_ID \
    --project=PROJECT_ID \
    --region=REGION
```

Check the output for `lightningEngine.runtime` in the Properties section:

```none
lightningEngine.runtime: native
```
Configuration parameters
The following table summarizes the main configuration parameters for Lightning Engine and Native Query Execution.
| Parameter name | Description | Applicable engine(s) | Default value | Default value (Lightning Engine) | User overridable (job level) | Scope |
|---|---|---|---|---|---|---|
| `--engine` | Cluster-level setting to select the engine during cluster creation. | Cluster-wide | `default` | `lightning` | No | Cluster |
| `spark:spark.dataproc.lightningEngine.runtime` | Cluster-level setting to select the Lightning Engine runtime during cluster creation. | Lightning only | `default` | `default` | No | Cluster |
| `spark.dataproc.lightningEngine.runtime` | Enables or disables Native Query Execution (NQE) within Lightning Engine. | Lightning only | `default` | `default` | Yes. Can be set to `native` or `default`. | Job |
Limitations
Enabling Native Query Execution in the following scenarios can cause exceptions, Spark incompatibilities, or workload fallback to the default Spark engine.
Fallbacks
Native Query Execution in the following scenarios can result in a workload fallback to the Spark execution engine:
- ANSI: if ANSI mode is enabled, execution falls back to Spark.
- Case-sensitive mode: Native Query Execution supports only the Spark default case-insensitive mode. If case-sensitive mode is enabled, incorrect results can occur.
- Partitioned table scan: Native Query Execution supports the partitioned table scan only when the path contains the partition information. Otherwise, the workload falls back to the Spark execution engine.
Incompatible behavior
Incompatible behavior or incorrect results can occur when you use Native Query Execution in the following cases:
- JSON functions: Native Query Execution supports strings surrounded by double quotes, not single quotes. Incorrect results occur with single quotes. Using `*` in the path with the `get_json_object` function returns `NULL`.
- Parquet read configuration:
  - Native Query Execution treats `spark.files.ignoreCorruptFiles` as set to the default `false` value, even when set to `true`.
  - Native Query Execution ignores `spark.sql.parquet.datetimeRebaseModeInRead`, and returns only the Parquet file contents. Differences between the legacy hybrid calendar and the Proleptic Gregorian calendar are not considered. Spark results can differ.
- NaN: not supported. Unexpected results can occur, for example, when you use `NaN` in a numeric comparison.
- Spark columnar reading: a fatal error can occur because the Spark columnar vector is incompatible with Native Query Execution.
- Spill: when you set shuffle partitions to a large number, the spill-to-disk feature can trigger an `OutOfMemoryException`. If this occurs, reducing the number of partitions can eliminate the exception.
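The NaN caveat reflects standard IEEE 754 comparison semantics, which the following plain-Python sketch illustrates (it uses only the standard library, not Spark, so it demonstrates the general comparison behavior rather than NQE itself):

```python
import math

nan = float("nan")

# NaN compares unequal to everything, including itself, so numeric
# comparisons involving NaN silently evaluate to False.
print(nan == nan)   # False
print(nan < 1.0)    # False
print(nan > 1.0)    # False

# Detect NaN explicitly instead of relying on comparisons.
print(math.isnan(nan))  # True
```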