"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Create a cluster

Managed Service for Apache Spark prohibits creating clusters with image versions earlier than 1.3.95, 1.4.77, 1.5.53, and 2.0.27, because these versions are affected by Apache Log4j security vulnerabilities. It also prohibits creating clusters with image versions 0.x, 1.0.x, 1.1.x, and 1.2.x. Where possible, create clusters with the latest sub-minor image version.

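Image version	log4j version	Customer guidance
2.0.29, 1.5.55, and 1.4.79 or later	log4j.2.17.1	Recommended
2.0.28, 1.5.54, and 1.4.78	log4j.2.17.0	Recommended
2.0.27, 1.5.53, and 1.4.77	log4j.2.16.0	Strongly recommended
2.0.26, 1.5.52, and 1.4.76, or earlier	Older versions	Discontinue use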

For information about specific images and log4j updates, see the Managed Service for Apache Spark release notes.
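The version rules above can be checked before you request a cluster. The following is a minimal sketch, not part of the product: the function name `is_image_version_allowed` is hypothetical, and it assumes that image tracks not mentioned on this page (for example, tracks newer than 2.0) are not restricted by these particular rules.

```python
import re

# Lowest permitted sub-minor version for each image track named above.
MIN_SUBMINOR = {(1, 3): 95, (1, 4): 77, (1, 5): 53, (2, 0): 27}


def is_image_version_allowed(version: str) -> bool:
    """Return True if the image version passes the rules stated above."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"expected MAJOR.MINOR.SUBMINOR, got {version!r}")
    major, minor, subminor = (int(part) for part in match.groups())

    # Tracks 0.x and 1.0.x through 1.2.x cannot be used at all.
    if major == 0 or (major == 1 and minor <= 2):
        return False

    minimum = MIN_SUBMINOR.get((major, minor))
    if minimum is None:
        # Assumption: tracks not listed on this page are not restricted here.
        return True
    return subminor >= minimum
```

A trailing suffix after the three numeric components, if present, is ignored by the pattern, so only the numeric part of the version is compared.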

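Create a Managed Service for Apache Spark cluster

Requirements:

Name: The cluster name must start with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens. It cannot end with a hyphen.
Cluster region: You must specify a Compute Engine region for the cluster, such as us-east1 or europe-west1, to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within that region.
- For more information about Compute Engine regions, see "Cluster region".
- For information about selecting a region, see "Regions and zones". You can also run the gcloud compute regions list command to display a list of available regions.
Connectivity: The Compute Engine virtual machine instances (VMs) in a Managed Service for Apache Spark cluster, which consists of master and worker VMs, require full internal IP networking cross connectivity. The default VPC network provides this connectivity (see "Managed Service for Apache Spark cluster network configuration").
Machine type (recommended): Specifying a machine type is optional, but Google recommends that you explicitly select machine types for the master and worker VMs in your cluster. If you don't specify a machine type, Managed Service for Apache Spark dynamically selects one based on resource availability. This dynamic selection can lead to variations in cost and performance.
- For more information about choosing a machine type, see "Supported machine types".
- To avoid potential resource unavailability, consider using flexible VMs to specify a list of acceptable machine types.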
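The cluster name requirement above can be expressed as a regular expression. This is a minimal sketch for client-side validation only; the helper name `is_valid_cluster_name` is hypothetical, and the service itself remains the authority on which names it accepts.

```python
import re

# One lowercase letter, then up to 51 more characters drawn from lowercase
# letters, digits, and hyphens, where the final character is not a hyphen.
_CLUSTER_NAME = re.compile(r"[a-z]([-a-z0-9]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule stated above."""
    return _CLUSTER_NAME.fullmatch(name) is not None
```

Checking the name locally lets a script fail early instead of waiting for the create request to be rejected.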

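Console

In your browser, open the Managed Service for Apache Spark "Create a cluster" page in the Google Cloud console, then click "Create" in the "Compute Engine" row on the "Create a Dataproc cluster on Compute Engine" page. The "Set up cluster" panel is selected, with its fields filled in with default values. You can select each panel and confirm or change the default values to customize your cluster.

Click "Create" to create the cluster. The cluster name appears on the "Clusters" page, and its status changes to "Running" after the cluster is provisioned. Click the cluster name to open the cluster details page, where you can examine the cluster's jobs, instances, and configuration settings, and connect to web interfaces running on the cluster.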

gcloud

To create a Managed Service for Apache Spark cluster from the command line, run the gcloud dataproc clusters create command in a local terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --master-machine-type=MASTER_MACHINE_TYPE \
    --worker-machine-type=WORKER_MACHINE_TYPE

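This command creates the cluster. Although the master and worker machine types are optional, explicitly specifying them with the --master-machine-type and --worker-machine-type flags (for example, n4-standard-4) is recommended for consistent cost and performance. If you don't specify machine types, the system dynamically selects default machine types based on resource availability. For information about using command-line flags to customize cluster settings, see the gcloud dataproc clusters create command reference.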
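If you drive gcloud from a script, the command above can be assembled as an argument list. The following is a minimal sketch that uses only the flags shown on this page; the function name `build_create_command` is hypothetical.

```python
import shlex


def build_create_command(cluster_name, region,
                         master_machine_type=None, worker_machine_type=None):
    """Return the gcloud argument list for creating a cluster."""
    args = ["gcloud", "dataproc", "clusters", "create", cluster_name,
            f"--region={region}"]
    # The machine type flags are optional, so add them only when given.
    if master_machine_type:
        args.append(f"--master-machine-type={master_machine_type}")
    if worker_machine_type:
        args.append(f"--worker-machine-type={worker_machine_type}")
    return args


if __name__ == "__main__":
    command = build_create_command(
        "example-cluster", "us-east1", "n4-standard-4", "n4-standard-4")
    # Print the command for review instead of running it.
    print(shlex.join(command))
```

Passing the returned list to `subprocess.run(command, check=True)` would run it, which avoids shell quoting issues because no shell is involved.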

Create a cluster using a YAML file

Run the following gcloud command to export the configuration of an existing Managed Service for Apache Spark cluster to a cluster.yaml file.

gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
    --region=REGION \
    --destination=cluster.yaml

Create a new cluster by importing the YAML file configuration.

gcloud dataproc clusters import NEW_CLUSTER_NAME \
    --region=REGION \
    --source=cluster.yaml

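Note: During the export operation, cluster-specific fields (such as the cluster name), output-only fields, and automatically applied labels are filtered out. These fields are not allowed in the imported YAML file used to create a cluster.

Note: At the bottom of the left panel of the "Create a cluster" page in the Managed Service for Apache Spark Google Cloud console, you can click the "Equivalent REST or command line" link to have the console construct an equivalent API REST request or gcloud tool command that you can use in your code or on the command line to create a cluster.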

REST

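This section shows how to create a cluster. Specifying machine types is optional, but explicitly including machine_type_uri in master_config and worker_config (for example, n4-standard-4) is recommended for consistent cost and performance. If you don't specify machine types, the system dynamically selects default machine types based on resource availability.

Before using any of the request data, make the following replacements:

CLUSTER_NAME: The cluster name.
PROJECT: The Google Cloud project ID.
REGION: An available Compute Engine region where the cluster will be created.
ZONE: An optional zone within the selected region where the cluster will be created.
MASTER_MACHINE_TYPE: (Recommended) The machine type for the master node (for example, n4-standard-4).
WORKER_MACHINE_TYPE: (Recommended) The machine type for the worker nodes (for example, n4-standard-4).

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body: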

{
  "project_id":"PROJECT",
  "cluster_name":"CLUSTER_NAME",
  "config":{
    "master_config":{
      "num_instances":1,
      "machine_type_uri":"MASTER_MACHINE_TYPE",
      "image_uri":""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config":{
      "num_instances":2,
      "machine_type_uri":"WORKER_MACHINE_TYPE",
      "image_uri":""
    },
    "gce_cluster_config":{
      "zone_uri":"ZONE"
    }
  }
}
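Rather than editing the placeholders by hand, you could generate request.json from a script. The following is a minimal sketch that mirrors the template above field for field; the function name `build_create_request` and the example project and cluster names are hypothetical, and the empty values are copied from the template as-is.

```python
import json


def build_create_request(project, cluster_name, zone="",
                         master_machine_type="", worker_machine_type=""):
    """Return the cluster-create request body as a Python dict."""
    return {
        "project_id": project,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": master_machine_type,
                "image_uri": "",
            },
            "softwareConfig": {
                "imageVersion": "",
                "properties": {},
                "optionalComponents": [],
            },
            "worker_config": {
                "num_instances": 2,
                "machine_type_uri": worker_machine_type,
                "image_uri": "",
            },
            "gce_cluster_config": {"zone_uri": zone},
        },
    }


if __name__ == "__main__":
    body = build_create_request(
        "example-project", "example-cluster",
        master_machine_type="n4-standard-4",
        worker_machine_type="n4-standard-4",
    )
    # Write the body to the file name that the commands on this page expect.
    with open("request.json", "w", encoding="utf-8") as f:
        json.dump(body, f, indent=2)
```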

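To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or that you are using Cloud Shell, which logs you in to the gcloud CLI automatically. You can run gcloud auth list to check the currently active account.

Save the request body in a file named request.json, then run the following command: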

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"

PowerShell (Windows)

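Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can run gcloud auth list to check the currently active account.

Save the request body in a file named request.json, then run the following command: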

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
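A script that sends the request typically needs only a few fields from this response. The following is a minimal sketch that reads the fields shown in the sample above; the function name `summarize_operation` and the sample values in the usage block are hypothetical.

```python
import json


def summarize_operation(response_text):
    """Extract the operation ID, cluster name, state, and warnings."""
    operation = json.loads(response_text)
    metadata = operation.get("metadata", {})
    return {
        # The operation ID is the last path segment of the "name" field.
        "operation_id": operation["name"].rsplit("/", 1)[-1],
        "cluster": metadata.get("clusterName"),
        "state": metadata.get("status", {}).get("state"),
        "warnings": metadata.get("warnings", []),
    }


if __name__ == "__main__":
    # Hypothetical response with made-up identifiers.
    sample = json.dumps({
        "name": "projects/example-project/regions/us-east1/operations/example-op",
        "metadata": {
            "clusterName": "example-cluster",
            "status": {"state": "PENDING"},
        },
    })
    print(summarize_operation(sample))
```

The `.get` calls with defaults keep the helper from raising if a response omits `metadata`, `status`, or `warnings`.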

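Note: At the bottom of the left panel of the "Create a cluster" page in the Managed Service for Apache Spark Google Cloud console, you can click the "Equivalent REST or command line" link to have the console construct an equivalent API REST request or gcloud tool command that you can use in your code or on the command line to create a cluster.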

Go

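Install the client library.
Set up application default credentials.

Run the code.

Note: Specifying machine types is optional, but explicitly setting the master and worker machine types (for example, n4-standard-4) in the cluster configuration is recommended for consistent cost and performance. If you omit them, the system dynamically selects default machine types based on resource availability.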

import (
	"context"
	"fmt"
	"io"

	dataproc "cloud.google.com/go/dataproc/apiv1"
	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
	"google.golang.org/api/option"
)

func createCluster(w io.Writer, projectID, region, clusterName string) error {
	// projectID := "your-project-id"
	// region := "us-central1"
	// clusterName := "your-cluster"
	ctx := context.Background()

	// Create the cluster client.
	endpoint := region + "-dataproc.googleapis.com:443"
	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
	if err != nil {
		return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
	}
	defer clusterClient.Close()

	// Create the cluster config.
	req := &dataprocpb.CreateClusterRequest{
		ProjectId: projectID,
		Region:    region,
		Cluster: &dataprocpb.Cluster{
			ProjectId:   projectID,
			ClusterName: clusterName,
			Config: &dataprocpb.ClusterConfig{
				MasterConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   1,
					MachineTypeUri: "n1-standard-2",
				},
				WorkerConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   2,
					MachineTypeUri: "n1-standard-2",
				},
			},
		},
	}

	// Create the cluster.
	op, err := clusterClient.CreateCluster(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateCluster: %w", err)
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("CreateCluster.Wait: %w", err)
	}

	// Output a success message.
	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
	return nil
}

Java

Install the client library.
Set up application default credentials.

Run the code.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
import com.google.cloud.dataproc.v1.InstanceGroupConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class CreateCluster {

  public static void createCluster() throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String region = "your-project-region";
    String clusterName = "your-cluster-name";
    createCluster(projectId, region, clusterName);
  }

  public static void createCluster(String projectId, String region, String clusterName)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the cluster controller client.
    ClusterControllerSettings clusterControllerSettings =
        ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create a cluster controller client with the configured settings. The client only needs to be
    // created once and can be reused for multiple requests. Using a try-with-resources
    // closes the client, but this can also be done manually with the .close() method.
    try (ClusterControllerClient clusterControllerClient =
        ClusterControllerClient.create(clusterControllerSettings)) {
      // Configure the settings for our cluster.
      InstanceGroupConfig masterConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(1)
              .build();
      InstanceGroupConfig workerConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(2)
              .build();
      ClusterConfig clusterConfig =
          ClusterConfig.newBuilder()
              .setMasterConfig(masterConfig)
              .setWorkerConfig(workerConfig)
              .build();
      // Create the cluster object with the desired cluster config.
      Cluster cluster =
          Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

      // Create the Cloud Dataproc cluster.
      OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
          clusterControllerClient.createClusterAsync(projectId, region, cluster);
      Cluster response = createClusterAsyncRequest.get();

      // Print out a success message.
      System.out.printf("Cluster created successfully: %s", response.getClusterName());

    } catch (ExecutionException e) {
      System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
    }
  }
}

Node.js

Install the client library.
Set up application default credentials.

Run the code.

const dataproc = require('@google-cloud/dataproc');

// TODO(developer): Uncomment and set the following variables
// projectId = 'YOUR_PROJECT_ID'
// region = 'YOUR_CLUSTER_REGION'
// clusterName = 'YOUR_CLUSTER_NAME'

// Create a client with the endpoint set to the desired cluster region
const client = new dataproc.v1.ClusterControllerClient({
  apiEndpoint: `${region}-dataproc.googleapis.com`,
  projectId: projectId,
});

async function createCluster() {
  // Create the cluster config
  const request = {
    projectId: projectId,
    region: region,
    cluster: {
      clusterName: clusterName,
      config: {
        masterConfig: {
          numInstances: 1,
          machineTypeUri: 'n1-standard-2',
        },
        workerConfig: {
          numInstances: 2,
          machineTypeUri: 'n1-standard-2',
        },
      },
    },
  };

  // Create the cluster
  const [operation] = await client.createCluster(request);
  const [response] = await operation.promise();

  // Output a success message
  console.log(`Cluster created successfully: ${response.clusterName}`);
}

createCluster();

Python

Install the client library.
Set up application default credentials.

Run the code.

from google.cloud import dataproc_v1 as dataproc


def create_cluster(project_id, region, cluster_name):
    """This sample walks a user through creating a Cloud Dataproc cluster
    using the Python client library.

    Args:
        project_id (string): Project to use for creating resources.
        region (string): Region where the resources should live.
        cluster_name (string): Name to use for creating a cluster.
    """

    # Create a client with the endpoint set to the desired cluster region.
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    # Output a success message.
    print(f"Cluster created successfully: {result.cluster_name}")
