"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

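Use the Cloud Client Libraries for Python

This tutorial includes a Cloud Shell walkthrough that uses the Google Cloud client libraries for Python to programmatically call the Managed Service for Apache Spark gRPC APIs to create a cluster and submit a job to that cluster.

The following sections explain how the walkthrough code in the GitHub GoogleCloudPlatform/python-dataproc repository works.

Run the Cloud Shell walkthrough

Click Open in Cloud Shell to run the walkthrough.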

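Understand the code

This section describes how the tutorial code uses the Cloud Client Libraries for Python to authenticate with Google Cloud, create a cluster, submit a Spark job, and clean up by deleting the cluster.

Application Default Credentials

The Cloud Shell walkthrough in this tutorial authenticates with your Google Cloud project credentials. When you run the code locally, the recommended practice is to authenticate it with service account credentials.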
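
As a rough illustration of the local case, the following sketch checks whether the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, which Application Default Credentials consults for a service account key file, points at an existing file. The helper name `find_adc_key_file` is hypothetical and is not part of the tutorial code. The client libraries perform their own credential lookup, so treat this only as a pre-flight check.

```python
import os


def find_adc_key_file(environ=None):
    """Return the key file path named by GOOGLE_APPLICATION_CREDENTIALS, or None.

    Hypothetical helper: it only inspects the environment variable and does
    not load or validate the credentials themselves.
    """
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS", "")
    # Treat an unset variable or a missing file as "no key file configured".
    if path and os.path.isfile(path):
        return path
    return None
```

Calling `find_adc_key_file()` before constructing a client can make a missing key file easier to diagnose than an authentication error raised later.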

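Create a Managed Service for Apache Spark cluster

Set the following values to create a cluster:

The project in which to create the cluster
The region in which to create the cluster
The name of the cluster
The cluster config, which specifies one master and two primary workers

Default settings are used for the remaining cluster settings. You can override the default cluster config settings. For example, you can add secondary VMs (default = 0) or specify a non-default VPC network for the cluster. For more information, see CreateCluster.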

import re

from google.cloud import dataproc_v1
from google.cloud import storage


def quickstart(project_id, region, cluster_name, gcs_bucket, pyspark_file):
    # Create the cluster client.
    cluster_client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    print(f"Cluster created successfully: {result.cluster_name}")
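
The overrides mentioned earlier (secondary VMs and a non-default VPC network) could be expressed by extending the same cluster dictionary. The following is a minimal sketch, not the tutorial's code: the helper `build_cluster` and its parameters are hypothetical, and the field names `secondary_worker_config` and `gce_cluster_config.network_uri` are assumed to match the cluster config message, so check them against the CreateCluster reference before relying on them.

```python
def build_cluster(project_id, cluster_name, secondary_workers=0, network_uri=None):
    """Build a cluster dict like the tutorial's, with optional overrides.

    Hypothetical helper for illustration; it only assembles a dictionary and
    makes no API calls.
    """
    config = {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    }
    # Assumed field name: secondary workers are off (0) unless requested.
    if secondary_workers > 0:
        config["secondary_worker_config"] = {"num_instances": secondary_workers}
    # Assumed field name: only set a network when a non-default one is given.
    if network_uri is not None:
        config["gce_cluster_config"] = {"network_uri": network_uri}
    return {"project_id": project_id, "cluster_name": cluster_name, "config": config}
```

The returned dictionary has the same shape as the `cluster` value passed to `create_cluster` above, so it could be substituted for it in the request.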

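Submit a job

Set the following values to submit a job:

The project that contains the cluster
The region where the cluster is located
The job config, which specifies the cluster name and the Cloud Storage file path (URI) of the PySpark job

For more information, see SubmitJob.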

# Create the job client.
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

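# Create the job config.
# NOTE: spark_filename is not defined in this excerpt. It is assumed to be the
# object name of the PySpark file (pyspark_file) after upload to gcs_bucket.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": f"gs://{gcs_bucket}/{spark_filename}"},
}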

operation = job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
response = operation.result()

# Dataproc job output is saved to the Cloud Storage bucket
# allocated to the job. Use regex to obtain the bucket and blob info.
matches = re.match("gs://(.*?)/(.*)", response.driver_output_resource_uri)

output = (
    storage.Client()
    .get_bucket(matches.group(1))
    .blob(f"{matches.group(2)}.000000000")
    .download_as_bytes()
    .decode("utf-8")
)

print(f"Job finished successfully: {output}\r\n")
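
The excerpt above calls `matches.group(...)` directly, which raises `AttributeError` if the driver output URI does not have the expected `gs://bucket/object` form. The following sketch shows one way to make that step explicit. The helper name `split_gcs_uri` is hypothetical and is not part of the sample.

```python
import re

# Bucket name is everything up to the first slash; the rest is the object path.
_GCS_URI = re.compile(r"gs://([^/]+)/(.+)")


def split_gcs_uri(uri):
    """Split a gs://bucket/object URI into (bucket, object_path).

    Hypothetical helper: raises ValueError instead of failing later on a
    None match object.
    """
    match = _GCS_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"Not a gs://bucket/object URI: {uri!r}")
    return match.group(1), match.group(2)
```

The two returned values correspond to `matches.group(1)` and `matches.group(2)` in the excerpt, so they could be passed to `get_bucket` and `blob` in the same way.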

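Delete the cluster

Set the following values to delete the cluster:

The project that contains the cluster
The region where the cluster is located
The name of the cluster

For more information, see DeleteCluster.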

# Delete the cluster once the job has terminated.
operation = cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": cluster_name,
    }
)
operation.result()

print(f"Cluster {cluster_name} successfully deleted.")
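
In the excerpts above, the delete call is reached only if job submission succeeds. If the job raises an exception, the cluster would presumably keep running. One possible way to guarantee cleanup is a `try`/`finally` wrapper, sketched below. The function `run_then_cleanup` and its two callable parameters are hypothetical stand-ins for the submit and delete steps, not part of the sample.

```python
def run_then_cleanup(run_job, delete_cluster):
    """Run the job step, then always run the cleanup step.

    Hypothetical helper: `run_job` and `delete_cluster` are zero-argument
    callables supplied by the caller. Any exception from `run_job` still
    propagates after cleanup has run.
    """
    try:
        return run_job()
    finally:
        # Executes whether run_job returned normally or raised.
        delete_cluster()
```

For example, `run_job` could wrap the `submit_job_as_operation` call and `delete_cluster` could wrap the `delete_cluster` request shown above.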