"Managed Service for Apache Spark" is the new name for the products formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Create a cluster

Managed Service for Apache Spark prevents the creation of clusters with image versions earlier than 1.3.95, 1.4.77, 1.5.53, and 2.0.27, which are affected by Apache Log4j security vulnerabilities. It also prevents the creation of clusters with image versions 0.x, 1.0.x, 1.1.x, and 1.2.x. When possible, create clusters with the latest sub-minor image version.

Image version	log4j version	Customer guidance
2.0.29, 1.5.55, and 1.4.79 or later	log4j.2.17.1	Advised
2.0.28, 1.5.54, and 1.4.78	log4j.2.17.0	Advised
2.0.27, 1.5.53, and 1.4.77	log4j.2.16.0	Strongly advised
2.0.26, 1.5.52, and 1.4.76 or earlier	Older version	Discontinue use

For specific image and log4j update information, see the Managed Service for Apache Spark release notes.
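The blocking rules above can be expressed as a small version check. The following is a minimal sketch, not the service's actual validation logic: the function name `is_creation_blocked` is hypothetical, the handling of an OS suffix such as `-debian10` is an assumption, and image tracks that the text does not mention are assumed to be allowed.

```python
# Lowest sub-minor release that can still be used, per image track
# (values taken from the paragraph above).
MIN_SUBMINOR = {(1, 3): 95, (1, 4): 77, (1, 5): 53, (2, 0): 27}
# Image tracks on which cluster creation is blocked entirely.
BLOCKED_TRACKS = {(1, 0), (1, 1), (1, 2)}


def is_creation_blocked(image_version: str) -> bool:
    """Return True if the rules above block cluster creation on this image."""
    # Assumption: drop an optional OS suffix such as "-debian10" before parsing.
    numeric = image_version.split("-", 1)[0]
    major, minor, subminor = (int(part) for part in numeric.split("."))
    track = (major, minor)
    if major == 0 or track in BLOCKED_TRACKS:
        return True
    minimum = MIN_SUBMINOR.get(track)
    # Assumption: tracks the text does not mention (for example 2.1) are allowed.
    return minimum is not None and subminor < minimum


for version in ("2.0.26", "2.0.27", "1.2.100", "1.4.76-debian10"):
    print(version, is_creation_blocked(version))
```

A check like this only mirrors the documented thresholds; the release notes remain the authoritative source for which image versions are available.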

Create a cluster

Requirements:

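Name: The cluster name must start with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen.
Cluster region: You must specify a Compute Engine region for the cluster, such as us-east1 or europe-west1, so that cluster resources, such as VM instances and the cluster metadata stored in Cloud Storage, are isolated within that region.
- For more information about Compute Engine regions, see Cluster region.
- For information about selecting a region, see Available regions and zones. You can also run the gcloud compute regions list command to display a list of available regions.
Connectivity: The Compute Engine virtual machine instances (VMs) in a Managed Service for Apache Spark cluster, which consist of master and worker VMs, require full internal IP networking cross-connectivity. The default VPC network provides this connectivity (see Managed Service for Apache Spark cluster network configuration).
Machine type (recommended): Although specifying a machine type is optional, Google recommends that you explicitly select machine types for the master and worker VMs in your cluster. If you don't specify a machine type, Managed Service for Apache Spark selects one dynamically based on resource availability. This dynamic selection can lead to variations in cost and performance.
- For more information about choosing a machine type, see Supported machine types.
- To mitigate potential resource unavailability, we recommend using flexible VMs, a feature that lets you specify a list of acceptable machine types.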
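The naming rule in the requirements above maps onto a single regular expression. This is a minimal sketch for checking a name locally before sending a request; the helper name `is_valid_cluster_name` is hypothetical, and the service performs its own validation.

```python
import re

# One lowercase letter, then at most 51 more characters drawn from lowercase
# letters, digits, and hyphens, where the final character is not a hyphen.
_CLUSTER_NAME = re.compile(r"[a-z](?:[-a-z0-9]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the rule described above."""
    return _CLUSTER_NAME.fullmatch(name) is not None


for candidate in ("my-cluster", "My-cluster", "cluster-", "1cluster"):
    print(candidate, is_valid_cluster_name(candidate))
```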

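Console

Open the Create a cluster page in the Google Cloud console to display the default cluster settings. You can confirm or change the displayed defaults, then click Additional configuration to further customize your cluster.

Click Create cluster to create the cluster. The cluster name appears on the Clusters page, and its status is updated to Running after the cluster is provisioned. Click the cluster name to open the cluster details page, where you can examine the cluster's jobs, instances, and configuration settings, and connect to web interfaces running on the cluster.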

gcloud

To create a Managed Service for Apache Spark cluster on the command line, run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --master-machine-type=MASTER_MACHINE_TYPE \
    --worker-machine-type=WORKER_MACHINE_TYPE

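The command creates a cluster. Although machine types for the master and worker VMs are optional, we recommend that you specify them explicitly with the --master-machine-type and --worker-machine-type flags (for example, n4-standard-4) to ensure consistent cost and performance. If you don't specify machine types, default machine types are selected dynamically based on resource availability. For information about using command-line flags to customize cluster settings, see the gcloud dataproc clusters create command.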
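When the command is assembled by a script, the optional machine-type flags can be added only when values are supplied. The following is a minimal sketch that uses only the flags shown above; the function name `build_create_command` and the example values are illustrative assumptions.

```python
import shlex
from typing import List, Optional


def build_create_command(
    cluster_name: str,
    region: str,
    master_machine_type: Optional[str] = None,
    worker_machine_type: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `gcloud dataproc clusters create`."""
    command = ["gcloud", "dataproc", "clusters", "create", cluster_name,
               f"--region={region}"]
    # Machine types are optional, but specifying them keeps cost and
    # performance consistent, as recommended above.
    if master_machine_type:
        command.append(f"--master-machine-type={master_machine_type}")
    if worker_machine_type:
        command.append(f"--worker-machine-type={worker_machine_type}")
    return command


command = build_create_command(
    "example-cluster", "us-east1", "n4-standard-4", "n4-standard-4")
# Print the command instead of running it; pass the list to subprocess.run
# to execute it.
print(shlex.join(command))
```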

Create a cluster with a YAML file

Run the following gcloud command to export the configuration of an existing Managed Service for Apache Spark cluster into a cluster.yaml file.

gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
    --region=REGION \
    --destination=cluster.yaml

Create a new cluster by importing the YAML file configuration.

gcloud dataproc clusters import NEW_CLUSTER_NAME \
    --region=REGION \
    --source=cluster.yaml

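**Note**: During the export operation, cluster-specific fields (such as the cluster name), output-only fields, and automatically applied labels are filtered out. These fields are not allowed in the imported YAML file that is used to create a cluster.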
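The export and import steps can be chained in a script so that the second command runs only if the first one succeeds. The following is a minimal sketch under stated assumptions: `clone_cluster` is a hypothetical helper, and the `run` parameter exists only so that the commands can be inspected without calling gcloud.

```python
import subprocess
from typing import Callable, List


def clone_cluster(
    existing_cluster: str,
    new_cluster: str,
    region: str,
    yaml_path: str = "cluster.yaml",
    run: Callable = subprocess.run,
) -> List[List[str]]:
    """Export an existing cluster's config, then import it under a new name."""
    export_cmd = ["gcloud", "dataproc", "clusters", "export", existing_cluster,
                  f"--region={region}", f"--destination={yaml_path}"]
    import_cmd = ["gcloud", "dataproc", "clusters", "import", new_cluster,
                  f"--region={region}", f"--source={yaml_path}"]
    # check=True raises CalledProcessError on failure, so the import is
    # skipped when the export fails.
    for command in (export_cmd, import_cmd):
        run(command, check=True)
    return [export_cmd, import_cmd]
```

Calling `clone_cluster("existing-cluster", "new-cluster", "us-east1")` would run both gcloud commands with the default runner; the cluster names here are placeholders.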

REST

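This section shows how to create a cluster. Although specifying machine types is optional, we recommend that you explicitly include machine_type_uri in master_config and worker_config (for example, n4-standard-4) to ensure consistent cost and performance. If you don't specify machine types, default machine types are selected dynamically based on resource availability.

Before using any of the request data, make the following replacements:

CLUSTER_NAME: The cluster name.
PROJECT: The Google Cloud project ID.
REGION: An available Compute Engine region in which to create the cluster.
ZONE: An optional zone within the selected region in which to create the cluster.
MASTER_MACHINE_TYPE: (Recommended) The machine type for the master node (for example, n4-standard-4).
WORKER_MACHINE_TYPE: (Recommended) The machine type for the worker nodes (for example, n4-standard-4).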

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

{
  "project_id":"PROJECT",
  "cluster_name":"CLUSTER_NAME",
  "config":{
    "master_config":{
      "num_instances":1,
      "machine_type_uri":"MASTER_MACHINE_TYPE",
      "image_uri":""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config":{
      "num_instances":2,
      "machine_type_uri":"WORKER_MACHINE_TYPE",
      "image_uri":""
    },
    "gce_cluster_config":{
      "zone_uri":"ZONE"
    }
  }
}
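The request body can also be generated from the replacement values instead of being edited by hand. The following is a minimal sketch: the function names are hypothetical, and the sketch omits the fields that the template leaves empty (image_uri and softwareConfig) on the assumption that an omitted field behaves like an empty one.

```python
import json


def build_request_body(project, cluster_name, zone,
                       master_machine_type, worker_machine_type):
    """Return the create-cluster request body as a dictionary."""
    return {
        "project_id": project,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": master_machine_type,
            },
            "worker_config": {
                "num_instances": 2,
                "machine_type_uri": worker_machine_type,
            },
            "gce_cluster_config": {"zone_uri": zone},
        },
    }


def write_request_file(body, path="request.json"):
    """Save the body to a file that the curl or PowerShell command can send."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(body, handle, indent=2)


body = build_request_body(
    "PROJECT", "CLUSTER_NAME", "ZONE", "n4-standard-4", "n4-standard-4")
print(json.dumps(body, indent=2))
```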

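To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have signed in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or that you are using Cloud Shell, which signs you in to the gcloud CLI automatically. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and then run the following command: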

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"

PowerShell (Windows)

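Note: The following command assumes that you have signed in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and then run the following command: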

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters" | Select-Object -Expand Content
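The same request can be sent from Python with only the standard library. The following is a minimal sketch that mirrors the curl command above; the function names are hypothetical, and the call that contacts the API is left commented out because it needs real credentials and creates a real cluster.

```python
import subprocess
import urllib.request


def get_access_token() -> str:
    """Read an access token from the gcloud CLI, as the curl example does."""
    result = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()


def build_create_request(project: str, region: str,
                         body: bytes, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a cluster."""
    url = (f"https://dataproc.googleapis.com/v1/projects/{project}"
           f"/regions/{region}/clusters")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json; charset=utf-8",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Usage (performs a real API call, so it is commented out):
# with open("request.json", "rb") as handle:
#     request = build_create_request(
#         "PROJECT", "REGION", handle.read(), get_access_token())
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```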

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
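The response describes a long-running operation, not the finished cluster, so a script typically records the operation name and the current state. The following is a minimal sketch that extracts those fields from a response shaped like the sample above; `summarize_operation` is a hypothetical helper, and the sample text is abbreviated from that response with its truncated values left as they appear.

```python
import json


def summarize_operation(response_text: str) -> dict:
    """Pick out the fields of a create-cluster response that a script needs."""
    operation = json.loads(response_text)
    metadata = operation.get("metadata", {})
    return {
        # The operation ID is the last segment of the operation name.
        "operation_id": operation["name"].rsplit("/", 1)[-1],
        "cluster_name": metadata.get("clusterName"),
        "state": metadata.get("status", {}).get("state"),
        "warnings": metadata.get("warnings", []),
    }


sample = """
{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "clusterName": "CLUSTER_NAME",
    "status": {"state": "PENDING"},
    "operationType": "CREATE"
  }
}
"""
print(summarize_operation(sample))
```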

Go

Install the client library.
Set up Application Default Credentials.

Run the code.

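Note: Although specifying machine types is optional, we recommend that you explicitly set machine types for the master and worker VMs in the cluster configuration (for example, n4-standard-4) to ensure consistent cost and performance. If you omit them, default machine types are selected dynamically based on resource availability.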

import (
	"context"
	"fmt"
	"io"

	dataproc "cloud.google.com/go/dataproc/apiv1"
	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
	"google.golang.org/api/option"
)

func createCluster(w io.Writer, projectID, region, clusterName string) error {
	// projectID := "your-project-id"
	// region := "us-central1"
	// clusterName := "your-cluster"
	ctx := context.Background()

	// Create the cluster client.
	endpoint := region + "-dataproc.googleapis.com:443"
	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
	if err != nil {
		return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
	}
	defer clusterClient.Close()

	// Create the cluster config.
	req := &dataprocpb.CreateClusterRequest{
		ProjectId: projectID,
		Region:    region,
		Cluster: &dataprocpb.Cluster{
			ProjectId:   projectID,
			ClusterName: clusterName,
			Config: &dataprocpb.ClusterConfig{
				MasterConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   1,
					MachineTypeUri: "n1-standard-2",
				},
				WorkerConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   2,
					MachineTypeUri: "n1-standard-2",
				},
			},
		},
	}

	// Create the cluster.
	op, err := clusterClient.CreateCluster(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateCluster: %w", err)
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("CreateCluster.Wait: %w", err)
	}

	// Output a success message.
	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
	return nil
}

Java

Install the client library.
Set up Application Default Credentials.

Run the code.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
import com.google.cloud.dataproc.v1.InstanceGroupConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class CreateCluster {

  public static void createCluster() throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String region = "your-project-region";
    String clusterName = "your-cluster-name";
    createCluster(projectId, region, clusterName);
  }

  public static void createCluster(String projectId, String region, String clusterName)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the cluster controller client.
    ClusterControllerSettings clusterControllerSettings =
        ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create a cluster controller client with the configured settings. The client only needs to be
    // created once and can be reused for multiple requests. Using a try-with-resources
    // closes the client, but this can also be done manually with the .close() method.
    try (ClusterControllerClient clusterControllerClient =
        ClusterControllerClient.create(clusterControllerSettings)) {
      // Configure the settings for our cluster.
      InstanceGroupConfig masterConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(1)
              .build();
      InstanceGroupConfig workerConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(2)
              .build();
      ClusterConfig clusterConfig =
          ClusterConfig.newBuilder()
              .setMasterConfig(masterConfig)
              .setWorkerConfig(workerConfig)
              .build();
      // Create the cluster object with the desired cluster config.
      Cluster cluster =
          Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

      // Create the Cloud Dataproc cluster.
      OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
          clusterControllerClient.createClusterAsync(projectId, region, cluster);
      Cluster response = createClusterAsyncRequest.get();

      // Print out a success message.
      System.out.printf("Cluster created successfully: %s", response.getClusterName());

    } catch (ExecutionException e) {
      System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
    }
  }
}

Node.js

Install the client library.
Set up Application Default Credentials.

Run the code.

const dataproc = require('@google-cloud/dataproc');

// TODO(developer): Uncomment and set the following variables
// projectId = 'YOUR_PROJECT_ID'
// region = 'YOUR_CLUSTER_REGION'
// clusterName = 'YOUR_CLUSTER_NAME'

// Create a client with the endpoint set to the desired cluster region
const client = new dataproc.v1.ClusterControllerClient({
  apiEndpoint: `${region}-dataproc.googleapis.com`,
  projectId: projectId,
});

async function createCluster() {
  // Create the cluster config
  const request = {
    projectId: projectId,
    region: region,
    cluster: {
      clusterName: clusterName,
      config: {
        masterConfig: {
          numInstances: 1,
          machineTypeUri: 'n1-standard-2',
        },
        workerConfig: {
          numInstances: 2,
          machineTypeUri: 'n1-standard-2',
        },
      },
    },
  };

  // Create the cluster
  const [operation] = await client.createCluster(request);
  const [response] = await operation.promise();

  // Output a success message
  console.log(`Cluster created successfully: ${response.clusterName}`);
}

createCluster();

Python

Install the client library.
Set up Application Default Credentials.

Run the code.

from google.cloud import dataproc_v1 as dataproc


def create_cluster(project_id, region, cluster_name):
    """This sample walks a user through creating a Cloud Dataproc cluster
    using the Python client library.

    Args:
        project_id (string): Project to use for creating resources.
        region (string): Region where the resources should live.
        cluster_name (string): Name to use for creating a cluster.
    """

    # Create a client with the endpoint set to the desired cluster region.
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    # Output a success message.
    print(f"Cluster created successfully: {result.cluster_name}")
