Use Lightning Engine

Lightning Engine is the next generation of Apache Spark performance, introducing exclusive enhancements designed to deliver substantial improvements in performance, cost-efficiency, and operational stability.

Benefits

Lightning Engine benefits include the following:

  • Accelerated data operations: Achieve significant performance gains and cost savings through optimizations to cloud storage interaction, including metadata handling, write workloads, and vectored I/O.

  • Intelligent query execution: Leverage advanced optimizer enhancements that dynamically reduce data scanned, optimize data processing, and generate more efficient execution plans for faster, more cost-effective queries.

  • Streamlined AI and ML workloads: Reduce cluster startup times for GPU-based workloads and simplify deployment in secure environments with native AI and ML images.

While Lightning Engine offers substantial performance gains, the specific impact varies with the workload. It is best suited for compute-intensive tasks that use the Spark DataFrame and Dataset APIs and Spark SQL queries, rather than for I/O-bound operations.

Comparison to standard engine

Lightning Engine is an alternative to the standard engine used to execute Spark jobs on a Managed Service for Apache Spark cluster. The following table compares the standard engine and Lightning Engine by activation property, workload applicability, and key benefits.

| Feature | Standard engine | Lightning Engine |
| --- | --- | --- |
| Activation property | `--engine=default` or unset the flag | `--engine=lightning` |
| Best for | General-purpose jobs, development, and testing | Enterprise-scale workloads requiring significant acceleration |
| Key benefits | Baseline performance | Optimized cloud storage interaction, intelligent query execution |

Requirements

The following requirements apply to the Lightning Engine feature:

  • Image version: Lightning Engine must be used with Managed Service for Apache Spark image version 2.3.3 or later.
  • Supported jobs: Spark, PySpark, SparkSQL, and SparkR jobs are supported. Other job types submitted to a Lightning Engine cluster run on the standard engine.

Native Query Execution

Native Query Execution (NQE) is an optional component of Lightning Engine that provides a deeper level of acceleration for specific jobs. It is a native engine based on Apache Gluten and Velox, optimized for Google hardware, which boosts performance by running parts of a Spark query outside the JVM.

NQE is recommended for:
Compute-intensive tasks (rather than I/O-bound operations) that use the Spark DataFrame and Dataset APIs and Spark SQL queries to read data from Parquet and ORC files. The output file format doesn't affect NQE performance.
NQE is not recommended for:
Jobs that rely heavily on Resilient Distributed Datasets (RDDs), user-defined functions (UDFs), or most Spark Machine Learning (ML) libraries.
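As an illustration of the guidance above, the following PySpark sketch shows the kind of rewrite that keeps a job NQE-eligible: replacing a Python UDF with an equivalent built-in DataFrame expression. The input path and column names are hypothetical, and the sketch assumes a cluster where Lightning Engine and NQE are enabled.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("nqe-friendly").getOrCreate()

# Hypothetical Parquet input; NQE targets Parquet and ORC scans.
df = spark.read.parquet("gs://BUCKET/sales/")

# Not recommended: a Python UDF is opaque to the native engine, so
# jobs that rely on it don't benefit from NQE.
to_usd = F.udf(lambda cents: cents / 100.0, DoubleType())
with_udf = df.withColumn("usd", to_usd(F.col("price_cents")))

# Preferred: the equivalent built-in expression stays NQE-eligible.
with_builtin = df.withColumn("usd", F.col("price_cents") / 100.0)
```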

Requirements

The following requirements apply to the Native Query Execution feature:

  • Execution engine: NQE is available only on clusters created with Lightning Engine enabled.

  • Operating system: Only Debian 12 images are supported. NQE-enabled jobs that use any other OS will fail.

  • Supported jobs: Spark, PySpark, SparkSQL, and SparkR jobs are supported. Other job types submitted to a Lightning Engine cluster run on the standard engine (without NQE).

  • Machine types: Only machine families using Intel or AMD processors are supported. NQE-enabled jobs using ARM processors will fail (but can benefit from Lightning Engine without NQE).

  • GPUs and accelerators: Not supported. NQE-enabled jobs submitted with GPU accelerators will fail (but can still benefit from Lightning Engine without NQE).

  • Data types: Inputs of the following data types are not supported:

    • Byte: ORC and Parquet
    • Struct, Array, Map: Parquet

Pricing

For pricing information, see Managed Service for Apache Spark on Compute Engine pricing.

Create a Lightning Engine cluster

This section shows you how to create a Managed Service for Apache Spark cluster that enables Lightning Engine on Spark jobs submitted to the cluster.

You can also enable Native Query Execution (NQE) on the cluster when you create the cluster, or you can enable NQE later for specific Spark jobs submitted to the cluster.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  3. Verify that you have the permissions required to complete this guide.

  4. Verify that billing is enabled for your Google Cloud project.

  5. Enable the Dataproc API.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the API

  6. Install the Google Cloud CLI.

  7. If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

  8. To initialize the gcloud CLI, run the following command:

    gcloud init

Required roles

Certain IAM roles are required to create a Managed Service for Apache Spark cluster and submit jobs to the cluster. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?.

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:

Service account role

To ensure that the Compute Engine default service account has the necessary permissions to create a Managed Service for Apache Spark cluster, ask your administrator to grant the Managed Service for Apache Spark Worker (roles/dataproc.worker) IAM role to the Compute Engine default service account on the project.
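As a sketch, granting this role with the gcloud CLI might look like the following, assuming the Compute Engine default service account uses the standard PROJECT_NUMBER-compute@developer.gserviceaccount.com name:

```shell
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.worker"
```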

Create the cluster

The following examples show you how to create a Lightning Engine cluster using the Google Cloud console, Google Cloud CLI, Dataproc API, Python, or Terraform. You can also create a Managed Service for Apache Spark cluster with the Lightning Engine enabled using the Go, Java, and Node.js client libraries.

Console

  1. In the Google Cloud console, go to Create an Apache Spark cluster on Compute Engine. For more information, see create a cluster with the Google Cloud console.

    Go to Create an Apache Spark cluster on Compute Engine

  2. Under Define your cluster, select the Enable Lightning Engine checkbox to create a cluster with Lightning Engine enabled.

  3. Optional: To enable the native execution runtime by default for Spark jobs, select the Enable Native Execution checkbox.

  4. Configure other cluster settings as needed.

  5. Click Create.

gcloud CLI

  1. To create a cluster with the Lightning Engine enabled, run the gcloud dataproc clusters create command with the --engine=lightning flag. For more information, see create a cluster with gcloud CLI.

    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --engine=lightning \
        --image-version=2.3
    
  2. Optional: To enable the native execution runtime by default for Spark jobs, include the spark:spark.dataproc.lightningEngine.runtime=native property.

    gcloud dataproc clusters create CLUSTER_NAME \
        --region=REGION \
        --engine=lightning \
        --image-version=2.3 \
        --properties='spark:spark.dataproc.lightningEngine.runtime=native'
    

API

To create a cluster with the Lightning Engine enabled, send a clusters.create request. For more information, see create a cluster with the REST API.

  1. In the request body, set the engine field to LIGHTNING.

    {
      "projectId": "PROJECT_ID",
      "clusterName": "CLUSTER_NAME",
      "config": {
        "gceClusterConfig": {},
        "softwareConfig": {
          "imageVersion": "2.3"
        }
      },
      "engine": "LIGHTNING"
    }
    
  2. Optional: To enable the native execution runtime by default for all jobs, include the spark:spark.dataproc.lightningEngine.runtime property.

    {
      "projectId": "PROJECT_ID",
      "clusterName": "CLUSTER_NAME",
      "config": {
        "gceClusterConfig": {},
        "softwareConfig": {
          "imageVersion": "2.3",
          "properties": {
            "spark:spark.dataproc.lightningEngine.runtime": "native"
          }
        }
      },
      "engine": "LIGHTNING"
    }
    

Python

  1. To create a cluster with the Lightning Engine enabled, use the create_cluster method and set the engine field in the cluster configuration to LIGHTNING. For more information, see create a cluster with Python.

    from google.cloud import dataproc_v1
    
    def create_lightning_cluster(project_id, region, cluster_name):
        client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)
    
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "engine": "LIGHTNING",
                "software_config": {
                    "image_version": "2.3-debian12",
                },
            }
        }
    
        operation = cluster_client.create_cluster(
            project_id=project_id,
            region=region,
            cluster=cluster
        )
        result = operation.result()
        print(f"Cluster created successfully: {result.cluster_name}")
    
  2. Optional: To enable the native execution runtime by default for Spark jobs, include the spark:spark.dataproc.lightningEngine.runtime property.

    from google.cloud import dataproc_v1
    
    def create_lightning_native_cluster(project_id, region, cluster_name):
        client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
        cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)
    
        cluster = {
            "project_id": project_id,
            "cluster_name": cluster_name,
            "config": {
                "engine": "LIGHTNING",
                "software_config": {
                    "image_version": "2.3-debian12",
                    "properties": {
                        "spark:spark.dataproc.lightningEngine.runtime": "native"
                    }
                }
            }
        }
    
        operation = cluster_client.create_cluster(
            project_id=project_id,
            region=region,
            cluster=cluster
        )
        result = operation.result()
        print(f"Cluster created successfully: {result.cluster_name}")
    

Terraform

  1. In your google_dataproc_cluster resource configuration, set the engine argument to LIGHTNING.
  2. For more details and advanced options, refer to the official Terraform documentation for the google_dataproc_cluster resource.
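A minimal sketch of such a configuration follows. The `name`, `region`, and `cluster_config` arguments are standard `google_dataproc_cluster` arguments; verify the exact placement of the `engine` argument and the provider versions that support it in the Terraform reference before use.

```hcl
resource "google_dataproc_cluster" "lightning_cluster" {
  name   = "lightning-engine-cluster"
  region = "us-central1"

  cluster_config {
    software_config {
      image_version = "2.3-debian12"
    }
  }

  # Selects Lightning Engine; check the provider reference for
  # the supported schema placement of this argument.
  engine = "LIGHTNING"
}
```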

Verify the cluster engine

Console

  1. In the Google Cloud console, go to the Cluster Details page.
  2. Verify that Lightning Engine is listed in the Engine field.
  3. If you enabled Native Query Execution, verify that native is listed in the Native Execution field.

gcloud

  1. To verify the engine and NQE (if enabled), run the gcloud dataproc clusters describe command:

    gcloud dataproc clusters describe CLUSTER_NAME --project=PROJECT_ID --region=REGION
    
  2. Check the output for the engine and lightningEngine.runtime properties:

    clusterName: lightning-engine-cluster
    engine: lightningEngine
    lightningEngine.runtime: native
    

Submit a job with Lightning Engine

After you create a Lightning Engine cluster, Lightning Engine is automatically enabled on any Spark job that you submit to the cluster.
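For example, a submission looks the same as on a standard cluster. The following sketch runs the SparkPi example class, assuming the examples jar ships at its usual location in the cluster's Spark installation:

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```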

Enable Native Query Execution for a job

If you enabled Native Query Execution (NQE) when you created a Lightning Engine cluster, all Spark jobs will run with NQE enabled unless you disable NQE on a specific job.

If you didn't enable NQE when you created the Lightning Engine cluster, you can enable NQE for a specific job when you submit the job, as shown in the following examples.

gcloud

To enable Native Query Execution when you submit a Spark job, include the spark.dataproc.lightningEngine.runtime=native property:

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=spark.dataproc.lightningEngine.runtime=native \
    -- ...
```

API

To enable Native Query Execution when you submit a Spark job, include the spark.dataproc.lightningEngine.runtime property in your request:

```json
{
  "job":{
    "placement":{
      "clusterName": ...
    },
    "sparkJob":{
      "mainClass": ...,
      "properties":{
         "spark.dataproc.lightningEngine.runtime":"native"
      }
    }
  }
}
```

Disable Native Query Execution for a job

If you enabled Native Query Execution (NQE) when you created a Lightning Engine cluster, all Spark jobs will run with NQE enabled unless you disable NQE on a specific job.

You can disable NQE for a specific Spark job when you submit the job, as shown in the following examples.

gcloud

To disable Native Query Execution when you submit a Spark job to a Lightning Engine cluster, include the spark.dataproc.lightningEngine.runtime=default property:

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=spark.dataproc.lightningEngine.runtime=default \
    -- ...
```

API

To disable Native Query Execution when you submit a Spark job to a Lightning Engine cluster, include the spark.dataproc.lightningEngine.runtime=default property:

```json
{
  "job":{
    "placement":{
      "clusterName": ...
    },
    "sparkJob":{
      "mainClass": ...,
      "properties":{
         "spark.dataproc.lightningEngine.runtime":"default"
      }
    }
  }
}
```

Verify Native Query Execution for a job

After you submit a job to a Lightning Engine cluster, you can verify that Native Query Execution is enabled for the job.

Console

  1. In the Google Cloud console, go to the Job Details page.
  2. Verify that native is listed in the Native Execution field.

gcloud

  1. Run the gcloud dataproc jobs describe command:

    gcloud dataproc jobs describe JOB_ID --project=PROJECT_ID --region=REGION
    
  2. Check the output for the lightningEngine.runtime in the Properties section:

    lightningEngine.runtime: native
    

Configuration parameters

The following table summarizes the main configuration parameters for Lightning Engine and Native Query Execution.

| Parameter name | Description | Applicable engine(s) | Default value | Default value (Lightning Engine) | User overridable (job level) | Scope |
| --- | --- | --- | --- | --- | --- | --- |
| `--engine` | Cluster-level setting to select the engine during cluster creation. | Cluster-wide | `default` | `lightning` | No | Cluster |
| `spark:spark.dataproc.lightningEngine.runtime` | Cluster-level setting to select the Lightning Engine runtime during cluster creation. | Lightning only | `default` | `default` | No | Cluster |
| `spark.dataproc.lightningEngine.runtime` | Enables or disables Native Query Execution (NQE) within Lightning Engine. | Lightning only | `default` | `default` | Yes (`native` or `default`) | Job |

Limitations

Enabling Native Query Execution in the following scenarios can cause exceptions, Spark incompatibilities, or workload fallback to the default Spark engine.

Fallbacks

Using Native Query Execution in the following scenarios can result in a workload fallback to the Spark execution engine:

  • ANSI mode: If ANSI mode is enabled, execution falls back to Spark.
  • Case-sensitive mode: Native Query Execution supports only the Spark default case-insensitive mode. If case-sensitive mode is enabled, incorrect results can occur.
  • Partitioned table scan: Native Query Execution supports the partitioned table scan only when the path contains the partition information. Otherwise, the workload falls back to the Spark execution engine.

Incompatible behavior

Incompatible behavior or incorrect results can occur when you use Native Query Execution in the following cases:

  • JSON functions: Native Query Execution supports strings surrounded by double quotes, not single quotes. Incorrect results occur with single quotes. Using * in the path with the get_json_object function returns NULL.
  • Parquet read configuration:
    • Native Query Execution treats spark.files.ignoreCorruptFiles as set to the default false value, even when set to true.
    • Native Query Execution ignores spark.sql.parquet.datetimeRebaseModeInRead, and returns only the Parquet file contents. Differences between the legacy hybrid calendar and the Proleptic Gregorian calendar are not considered. Spark results can differ.
  • NaN: not supported. Unexpected results can occur, for example, when you use NaN in a numeric comparison.
  • Spark columnar reading: a fatal error can occur because the Spark columnar vector is incompatible with Native Query Execution.
  • Spill: when you set shuffle partitions to a large number, the spill-to-disk feature can trigger an OutOfMemoryException. If this occurs, reducing the number of partitions can eliminate this exception.
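For example, if a job hits the spill-related OutOfMemoryException described above, lowering spark.sql.shuffle.partitions at submission time is one way to reduce the partition count (the value 200, Spark's default, is shown only as an illustration; tune it for your workload):

```shell
gcloud dataproc jobs submit spark \
    --cluster=CLUSTER_NAME \
    --region=REGION \
    --properties=spark.sql.shuffle.partitions=200 \
    -- ...
```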

What's next