使用 Lightning Engine

Lightning Engine 是新一代 Apache Spark 性能，引入了多项独家增强功能，旨在大幅提升性能、成本效益和运营稳定性。

优势

Lightning Engine 的优势包括：

加速数据操作：通过优化云存储互动（包括元数据处理、写入工作负载和矢量化 I/O），显著提升性能并节省成本。
智能查询执行：利用高级优化器增强功能，动态减少扫描的数据量、优化数据处理，并生成更高效的执行计划，从而更快地执行查询并降低查询费用。
简化 AI 和机器学习工作负载：缩短基于 GPU 的工作负载的集群启动时间，并使用原生 AI 和机器学习映像简化安全环境中的部署。

虽然 Lightning Engine 可大幅提升性能，但具体影响因工作负载而异。它最适合利用 Spark Dataframe API、Spark Dataset API 和 Spark SQL 查询的计算密集型任务，而不是 I/O 绑定操作。

与标准引擎的比较

Lightning Engine 是标准引擎的替代方案，用于在 Managed Service for Apache Spark 集群上执行 Spark 作业。下表比较了 Lightning Engine 与标准引擎的激活属性、工作负载适用性和主要优势。

功能	标准引擎	Lightning Engine
激活属性	`--engine=default` 或取消设置标志	`--engine=lightning`
最适合	一般用途作业、开发和测试	需要大幅加速的企业级工作负载
主要优势	基准性能	优化的云存储互动，智能查询执行

要求

以下要求适用于 Lightning Engine 功能：

映像版本：Lightning Engine 必须与 Managed Service for Apache Spark 映像版本 2.3.3 或更高版本搭配使用。
支持的作业：支持 Spark、PySpark、SparkSQL 和 SparkR。标准引擎将运行在提交到 Lightning 引擎集群的其他作业类型上。

原生查询执行

原生查询执行 (NQE) 是 Lightning Engine 的可选组件，可为特定作业提供更深层次的加速。它是一款基于 Apache Gluten 和 Velox 的原生引擎，针对 Google 硬件进行了优化，可通过在 JVM 外部运行部分 Spark 查询来提升性能。

建议将 NQE 用于以下方面：: 利用 Spark Dataframe API、Spark Dataset API 和从 Parquet 及 ORC 文件读取数据的 Spark SQL 查询的计算密集型任务（而非 I/O 绑定操作）。输出文件格式不会影响其性能。
不建议将 NQE 用于以下方面：: 严重依赖弹性分布式数据集 (RDD)、用户定义的函数 (UDF) 或大多数 Spark 机器学习 (ML) 库的作业。

要求

以下要求适用于原生查询执行功能：

执行引擎：NQE 仅适用于在创建集群时启用了 Lightning 引擎的集群。
操作系统：仅支持 Debian-12 映像。使用任何其他操作系统的启用 NQE 的作业都会失败。
支持的作业：支持 Spark、PySpark、SparkSQL 和 SparkR。标准引擎将在提交到 Lightning Engine 集群的其他作业类型上运行（不使用 NQE）。
机器类型：仅支持使用 Intel 或 AMD 处理器的机器家族。使用 ARM 处理器的启用 NQE 的作业将失败（但可以从 Lightning Engine 中受益，而无需 NQE）。
无 GPU 和加速器：在 GPU 加速器上提交的启用 NQE 的作业将失败（但可以在没有 NQE 的情况下受益于 Lightning Engine）。
数据类型：不支持以下数据类型的输入：
- 字节：ORC 和 Parquet
- 结构体、数组、Map：Parquet

价格

如需了解价格信息，请参阅 Managed Service for Apache Spark on Compute Engine 价格。

创建 Lightning Engine 集群

本部分将介绍如何创建 Managed Service for Apache Spark 集群，以便在提交到该集群的 Spark 作业上启用 Lightning Engine。

您还可以在创建集群时为集群启用原生查询执行 (NQE)，也可以稍后为提交到集群的特定 Spark 作业启用 NQE。

准备工作

登录您的 Google Cloud 账号。如果您是 Google Cloud新手，请创建一个账号来评估我们的产品在实际场景中的表现。新客户还可获享 $300 赠金，用于运行、测试和部署工作负载。

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that you have the permissions required to complete this guide.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

安装 Google Cloud CLI。

注意：如果您之前安装了 gcloud CLI，请确保通过运行 gcloud components update 来获得最新版本。

如果您使用的是外部身份提供方 (IdP)，则必须先使用联合身份登录 gcloud CLI。

如需初始化 gcloud CLI，请运行以下命令：

gcloud init

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that you have the permissions required to complete this guide.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

安装 Google Cloud CLI。

注意：如果您之前安装了 gcloud CLI，请确保通过运行 gcloud components update 来获得最新版本。

如果您使用的是外部身份提供方 (IdP)，则必须先使用联合身份登录 gcloud CLI。

如需初始化 gcloud CLI，请运行以下命令：

gcloud init

所需的角色

您需要拥有某些 IAM 角色才能创建 Managed Service for Apache Spark 集群并向该集群提交作业。这些角色可能已获授予，具体取决于组织政策。如需检查角色授予情况，请参阅您是否需要授予角色？。

如需详细了解如何授予角色，请参阅管理对项目、文件夹和组织的访问权限。

用户角色

如需获得创建 Managed Service for Apache Spark 集群所需的权限，请让管理员向您授予以下 IAM 角色：

针对项目的 Managed Service for Apache Spark Editor (roles/dataproc.editor) 角色
Compute Engine 默认服务账号的 Service Account User (roles/iam.serviceAccountUser)

服务账号角色

为确保 Compute Engine 默认服务账号具有创建 Managed Service for Apache Spark 集群所需的权限，请让您的管理员向 Compute Engine 默认服务账号授予项目的 Managed Service for Apache Spark Worker (roles/dataproc.worker) IAM 角色。

创建集群

以下示例展示了如何使用 Google Cloud 控制台、Google Cloud CLI、Dataproc API、Python 或 Terraform 创建 Lightning Engine 集群。您还可以使用 Go、Java 和 Node.js 客户端库创建启用 Lightning Engine 的 Managed Service for Apache Spark 集群。

控制台

在 Google Cloud 控制台中，前往在 Compute Engine 上创建 Apache Spark 集群。如需了解详情，请参阅使用 Google Cloud 控制台创建集群。

前往“在 Compute Engine 上创建 Apache Spark 集群”
在定义集群下，选中启用 Lightning Engine 复选框，以创建启用了 Lightning Engine 的集群。
可选：如需默认为 Spark 作业启用原生执行运行时，请选中启用原生执行复选框。
根据需要配置其他集群设置。
点击创建。

gcloud CLI

如需创建启用了 Lightning Engine 的集群，请运行 gcloud dataproc clusters create 命令并添加 --engine=lightning 标志。如需了解详情，请参阅使用 gcloud CLI 创建集群。
```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --engine=lightning \
    --image-version=2.3
```

可选：如需默认针对 Spark 作业启用原生执行运行时，请添加 spark:spark.dataproc.lightningEngine.runtime=native 属性。

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --engine=lightning \
    --image-version=2.3 \
    --properties='spark:spark.dataproc.lightningEngine.runtime=native'

API

如需创建启用了 Lightning Engine 的集群，请发送 clusters.create 请求。如需了解详情，请参阅使用 REST API 创建集群。

在请求正文中，将 engine 字段设置为 LIGHTNING。

{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    "gceClusterConfig": {},
    "softwareConfig": {
      "imageVersion": "2.3"
    }
  },
  "engine": "LIGHTNING"
}

可选：如需默认针对所有作业启用原生执行运行时，请添加 spark:spark.dataproc.lightningEngine.runtime 属性。

{
  "projectId": "PROJECT_ID",
  "clusterName": "CLUSTER_NAME",
  "config": {
    "gceClusterConfig": {},
    "softwareConfig": {
      "imageVersion": "2.3",
      "properties": {
        "spark:spark.dataproc.lightningEngine.runtime": "native"
      }
    }
  },
  "engine": "LIGHTNING"
}

Python

如需创建启用了 Lightning Engine 的集群，请使用 create_cluster 方法，并将集群配置中的 engine 字段设置为 LIGHTNING。如需了解详情，请参阅使用 Python 创建集群。

from google.cloud import dataproc_v1

def create_lightning_cluster(project_id, region, cluster_name):
    client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "engine": "LIGHTNING",
            "software_config": {
                "image_version": "2.3-debian12",
            },
        }
    }

    operation = cluster_client.create_cluster(
        project_id=project_id,
        region=region,
        cluster=cluster
    )
    result = operation.result()
    print(f"Cluster created successfully: {result.cluster_name}")

可选：如需默认为 Spark 作业启用原生执行运行时，请添加 spark:spark.dataproc.lightningEngine.runtime 属性。

from google.cloud import dataproc_v1

def create_lightning_native_cluster(project_id, region, cluster_name):
    client_options = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    cluster_client = dataproc_v1.ClusterControllerClient(client_options=client_options)

    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "engine": "LIGHTNING",
            "software_config": {
                "image_version": "2.3-debian12",
                "properties": {
                    "spark:spark.dataproc.lightningEngine.runtime": "native"
                }
            }
        }
    }

    operation = cluster_client.create_cluster(
        project_id=project_id,
        region=region,
        cluster=cluster
    )
    result = operation.result()
    print(f"Cluster created successfully: {result.cluster_name}")

Terraform

在 google_dataproc_cluster 资源配置中，将 engine 实参设置为 LIGHTNING。
如需了解详情和高级选项，请参阅 google_dataproc_cluster 资源的官方 Terraform 文档。

验证集群引擎

控制台

在 Google Cloud 控制台中，前往集群详情页面。
验证 Lightning Engine 值是否列在引擎字段中。
如果您启用了原生查询执行，请验证 native 是否列在原生执行字段中。

gcloud

如需验证引擎和 NQE（如果已启用），请运行 gcloud dataproc clusters describe 命令：
```
gcloud dataproc clusters describe CLUSTER_NAME --project=PROJECT_ID --region=REGION
```

检查输出中的 engine 和 lightningEngine.runtime 属性：

clusterName: lightning-engine-cluster
engine: lightningEngine
lightningEngine.runtime: native