"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).


Create a cluster

Managed Service for Apache Spark prevents the creation of clusters with image versions earlier than 1.3.95, 1.4.77, 1.5.53, and 2.0.27, which are affected by Apache Log4j security vulnerabilities. It also prevents the creation of clusters with image versions 0.x, 1.0.x, 1.1.x, and 1.2.x. When possible, create Managed Service for Apache Spark clusters with the latest sub-minor image version.

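Image version	log4j version	Customer guidance
2.0.29, 1.5.55, and 1.4.79, or a later version of each	log4j.2.17.1	Recommended
2.0.28, 1.5.54, and 1.4.78	log4j.2.17.0	Recommended
2.0.27, 1.5.53, and 1.4.77	log4j.2.16.0	Strongly recommended
2.0.26, 1.5.52, and 1.4.76, or an earlier version of each	Older version	Discontinue use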

For information about specific images and log4j updates, see the Managed Service for Apache Spark release notes.
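
The version rules above can be expressed as a small pre-flight check. The following is a minimal Python sketch, not part of the product: the function name is hypothetical, the thresholds are copied from the text, and it assumes that image tracks not named in the text (for example, 2.1) are not blocked.

```python
# Minimum allowed sub-minor version per image track, copied from the text above.
MIN_ALLOWED = {(1, 3): 95, (1, 4): 77, (1, 5): 53, (2, 0): 27}
# Image tracks for which cluster creation is prevented entirely (plus all 0.x).
BLOCKED_TRACKS = {(1, 0), (1, 1), (1, 2)}


def is_image_version_blocked(version: str) -> bool:
    """Return True if the rules above prevent creating a cluster with this image.

    Expects a "major.minor.sub-minor" string, optionally followed by an OS
    suffix such as "-debian10".
    """
    # Drop an optional OS suffix before parsing the numeric parts.
    numeric = version.split("-", 1)[0]
    major, minor, patch = (int(part) for part in numeric.split("."))
    if major == 0 or (major, minor) in BLOCKED_TRACKS:
        return True
    minimum = MIN_ALLOWED.get((major, minor))
    # Assumption: tracks that the text does not mention are treated as allowed.
    return minimum is not None and patch < minimum
```

For example, `is_image_version_blocked("2.0.26")` is true because 2.0.26 is earlier than 2.0.27, while `is_image_version_blocked("2.0.27")` is false.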

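Create a cluster

Requirements:

Name: The cluster name must start with a lowercase letter followed by up to 51 lowercase letters, numbers, and hyphens, and it cannot end with a hyphen.
Cluster region: You must specify a Compute Engine region for the cluster, such as us-east1 or europe-west1, to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within the region.
- For more information about Compute Engine regions, see Cluster regions.
- For information about selecting a region, see Available regions and zones. You can also run the gcloud compute regions list command to display a list of available regions.
Connectivity: The Compute Engine virtual machine instances (VMs) in a Managed Service for Apache Spark cluster, consisting of master and worker VMs, require full internal IP networking cross-connectivity. The default VPC network provides this connectivity (see Managed Service for Apache Spark cluster network configuration).
Machine type (recommended): Specifying a machine type is optional, but Google recommends that you explicitly select the machine types for your cluster's master and worker VMs. If you don't specify a machine type, Managed Service for Apache Spark dynamically selects one based on resource availability. This dynamic selection can cause both cost and performance to vary.
- For more information about selecting machine types, see Supported machine types.
- To mitigate potential resource unavailability issues, consider using flexible VMs, which let you specify a list of acceptable machine types.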
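
The naming rule above can be checked locally before a cluster is created. The following is a minimal Python sketch; the pattern is derived from the wording of the rule (not taken from the service's own validation), and the function name is hypothetical.

```python
import re

# One lowercase letter, then up to 51 lowercase letters, digits, or hyphens,
# where the final character (if any) must not be a hyphen.
CLUSTER_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule described above."""
    return CLUSTER_NAME_RE.fullmatch(name) is not None
```

With this pattern, `my-cluster` passes, while `My-cluster` (uppercase letter), `1cluster` (starts with a digit), and `cluster-` (trailing hyphen) are rejected.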

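Console

Open the Create a cluster page in the Google Cloud console to display the default cluster settings. You can confirm or change the displayed defaults, then click Additional configuration to further customize the cluster.

Click Create cluster to create the cluster. After the cluster is provisioned, the cluster name appears on the Clusters page and its status is updated to Running. Click the cluster name to open the cluster details page, where you can examine the cluster's jobs, instances, and configuration settings, and connect to the web interfaces running on the cluster.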

gcloud

To create a Managed Service for Apache Spark cluster on the command line, run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --master-machine-type=MASTER_MACHINE_TYPE \
    --worker-machine-type=WORKER_MACHINE_TYPE

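This command creates a cluster. The master and worker machine types are optional, but to ensure consistent cost and performance, we recommend specifying them explicitly with the --master-machine-type and --worker-machine-type flags (for example, n4-standard-4). If you don't specify machine types, default machine types are dynamically selected based on resource availability. For information about using command-line flags to customize cluster settings, see the gcloud dataproc clusters create command.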
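
If you script cluster creation, it can help to assemble the command from variables so that the optional machine-type flags are included only when they are set. The following is a minimal Python sketch under that assumption; the helper name and the example values are hypothetical, and only the flags shown above are used.

```python
import shlex


def build_create_command(cluster_name, region, master_type=None, worker_type=None):
    """Return the argument list for `gcloud dataproc clusters create`."""
    argv = ["gcloud", "dataproc", "clusters", "create", cluster_name,
            f"--region={region}"]
    # The machine-type flags are optional, so add them only when provided.
    if master_type:
        argv.append(f"--master-machine-type={master_type}")
    if worker_type:
        argv.append(f"--worker-machine-type={worker_type}")
    return argv


command = build_create_command(
    "example-cluster", "us-east1", "n4-standard-4", "n4-standard-4"
)
# Print the command for review instead of running it.
print(shlex.join(command))
```

To run the command, pass the list to `subprocess.run(command, check=True)`; this requires an installed and authenticated gcloud CLI.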

Create a cluster from a YAML file

Run the following gcloud command to export the configuration of an existing Managed Service for Apache Spark cluster to a cluster.yaml file.
```
gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
    --region=REGION \
    --destination=cluster.yaml
```

Create a new cluster by importing the YAML file configuration.

gcloud dataproc clusters import NEW_CLUSTER_NAME \
    --region=REGION \
    --source=cluster.yaml

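**Note:** During the export operation, cluster-specific fields, such as the cluster name, output-only fields, and automatically applied labels, are filtered out. These fields are not allowed in the imported YAML file that you use to create a cluster.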
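
If you write or edit the YAML file by hand, a quick check for fields that the import does not allow can catch mistakes early. The following is a minimal Python sketch that works on an already parsed configuration (a plain dictionary). The key list is an illustrative assumption, not the exact filter that the export operation applies, and the function name is hypothetical.

```python
# Assumption: an illustrative, non-exhaustive set of top-level keys to reject.
# It is not the authoritative list used by the export operation.
DISALLOWED_KEYS = {"clusterName", "clusterUuid", "status", "statusHistory", "metrics"}


def find_disallowed_fields(cluster_config: dict) -> list:
    """Return the disallowed top-level keys found in a parsed cluster config."""
    return sorted(DISALLOWED_KEYS.intersection(cluster_config))
```

An empty result means that none of the listed keys are present; it does not guarantee that the import will accept the file.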

REST

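This section shows how to create a cluster. Specifying machine types is optional, but to ensure consistent cost and performance, we recommend explicitly including machine_type_uri (for example, n4-standard-4) in master_config and worker_config. If you don't specify machine types, default machine types are dynamically selected based on resource availability.

Before using any of the request data, make the following replacements:

CLUSTER_NAME: the cluster name
PROJECT: the Google Cloud project ID
REGION: an available Compute Engine region where the cluster will be created
ZONE: an optional zone within the selected region where the cluster will be created
MASTER_MACHINE_TYPE: (recommended) the machine type for the master node (for example, n4-standard-4)
WORKER_MACHINE_TYPE: (recommended) the machine type for the worker nodes (for example, n4-standard-4)

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

JSON request body: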

{
  "project_id":"PROJECT",
  "cluster_name":"CLUSTER_NAME",
  "config":{
    "master_config":{
      "num_instances":1,
      "machine_type_uri":"MASTER_MACHINE_TYPE",
      "image_uri":""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config":{
      "num_instances":2,
      "machine_type_uri":"WORKER_MACHINE_TYPE",
      "image_uri":""
    },
    "gce_cluster_config":{
      "zone_uri":"ZONE"
    }
  }
}
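
Rather than editing the placeholders by hand, you can generate the request body from variables. The following is a minimal Python sketch; the helper name and example values are hypothetical, it covers only the machine-type and zone fields from the body above, and it assumes that gce_cluster_config can be left out when no zone is chosen.

```python
import json


def build_request_body(project, cluster_name, master_type, worker_type, zone=None):
    """Build a request body with the same field names as the JSON above."""
    config = {
        "master_config": {"num_instances": 1, "machine_type_uri": master_type},
        "worker_config": {"num_instances": 2, "machine_type_uri": worker_type},
    }
    # Assumption: the zone is optional, so the block is omitted when unset.
    if zone:
        config["gce_cluster_config"] = {"zone_uri": zone}
    return {"project_id": project, "cluster_name": cluster_name, "config": config}


body = build_request_body(
    "example-project", "example-cluster", "n4-standard-4", "n4-standard-4"
)
# Print the JSON; redirect the output to request.json to reuse it in a request.
print(json.dumps(body, indent=2))
```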

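To send your request, expand one of these options:

cURL (Linux, macOS, Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: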

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"
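
The same request can be prepared with the Python standard library. The following is a minimal sketch that only builds the request object and does not send it; the helper name and example values are hypothetical, and the access token is assumed to come from gcloud auth print-access-token.

```python
import json
import urllib.request


def build_create_request(project, region, body, access_token):
    """Build (but do not send) the POST request shown in the cURL command."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project}/regions/{region}/clusters"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_create_request(
    "example-project", "us-east1", {"cluster_name": "example-cluster"}, "EXAMPLE_TOKEN"
)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` with a real access token and a complete request body.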

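PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: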

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
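
A script that sends the request usually needs only a few fields from this response. The following is a minimal Python sketch that extracts them; the function name is hypothetical, OPERATION_ID is a stand-in for the elided operation ID, and the sample is a trimmed version of the response above.

```python
import json

# Trimmed sample of the response above; OPERATION_ID is a placeholder.
SAMPLE_RESPONSE = """
{
  "name": "projects/PROJECT/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "clusterName": "CLUSTER_NAME",
    "status": {"state": "PENDING"},
    "operationType": "CREATE",
    "warnings": []
  }
}
"""


def summarize_operation(response_text):
    """Extract the operation name, cluster name, state, and warnings."""
    operation = json.loads(response_text)
    metadata = operation.get("metadata", {})
    return {
        "operation": operation.get("name"),
        "cluster": metadata.get("clusterName"),
        "state": metadata.get("status", {}).get("state"),
        "warnings": metadata.get("warnings", []),
    }


summary = summarize_operation(SAMPLE_RESPONSE)
print(summary["operation"], summary["state"])
```

A PENDING state means that the create operation has been accepted but has not finished.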

Go

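Install the client library.
Set up Application Default Credentials.

Run the code.

Note: Specifying machine types is optional, but to ensure consistent cost and performance, we recommend explicitly setting the master and worker machine types in the cluster configuration (for example, n4-standard-4). If omitted, default machine types are dynamically selected based on resource availability.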

import (
	"context"
	"fmt"
	"io"

	dataproc "cloud.google.com/go/dataproc/apiv1"
	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
	"google.golang.org/api/option"
)

func createCluster(w io.Writer, projectID, region, clusterName string) error {
	// projectID := "your-project-id"
	// region := "us-central1"
	// clusterName := "your-cluster"
	ctx := context.Background()

	// Create the cluster client.
	endpoint := region + "-dataproc.googleapis.com:443"
	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
	if err != nil {
		return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
	}
	defer clusterClient.Close()

	// Create the cluster config.
	req := &dataprocpb.CreateClusterRequest{
		ProjectId: projectID,
		Region:    region,
		Cluster: &dataprocpb.Cluster{
			ProjectId:   projectID,
			ClusterName: clusterName,
			Config: &dataprocpb.ClusterConfig{
				MasterConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   1,
					MachineTypeUri: "n1-standard-2",
				},
				WorkerConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   2,
					MachineTypeUri: "n1-standard-2",
				},
			},
		},
	}

	// Create the cluster.
	op, err := clusterClient.CreateCluster(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateCluster: %w", err)
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("CreateCluster.Wait: %w", err)
	}

	// Output a success message.
	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
	return nil
}

Java

Install the client library.
Set up Application Default Credentials.

Run the code.

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
import com.google.cloud.dataproc.v1.InstanceGroupConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class CreateCluster {

  public static void createCluster() throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String region = "your-project-region";
    String clusterName = "your-cluster-name";
    createCluster(projectId, region, clusterName);
  }

  public static void createCluster(String projectId, String region, String clusterName)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the cluster controller client.
    ClusterControllerSettings clusterControllerSettings =
        ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create a cluster controller client with the configured settings. The client only needs to be
    // created once and can be reused for multiple requests. Using a try-with-resources
    // closes the client, but this can also be done manually with the .close() method.
    try (ClusterControllerClient clusterControllerClient =
        ClusterControllerClient.create(clusterControllerSettings)) {
      // Configure the settings for our cluster.
      InstanceGroupConfig masterConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(1)
              .build();
      InstanceGroupConfig workerConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(2)
              .build();
      ClusterConfig clusterConfig =
          ClusterConfig.newBuilder()
              .setMasterConfig(masterConfig)
              .setWorkerConfig(workerConfig)
              .build();
      // Create the cluster object with the desired cluster config.
      Cluster cluster =
          Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

      // Create the Cloud Dataproc cluster.
      OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
          clusterControllerClient.createClusterAsync(projectId, region, cluster);
      Cluster response = createClusterAsyncRequest.get();

      // Print out a success message.
      System.out.printf("Cluster created successfully: %s", response.getClusterName());

    } catch (ExecutionException e) {
      System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
    }
  }
}

Node.js

Install the client library.
Set up Application Default Credentials.

Run the code.

const dataproc = require('@google-cloud/dataproc');

// TODO(developer): Uncomment and set the following variables
// projectId = 'YOUR_PROJECT_ID'
// region = 'YOUR_CLUSTER_REGION'
// clusterName = 'YOUR_CLUSTER_NAME'

// Create a client with the endpoint set to the desired cluster region
const client = new dataproc.v1.ClusterControllerClient({
  apiEndpoint: `${region}-dataproc.googleapis.com`,
  projectId: projectId,
});

async function createCluster() {
  // Create the cluster config
  const request = {
    projectId: projectId,
    region: region,
    cluster: {
      clusterName: clusterName,
      config: {
        masterConfig: {
          numInstances: 1,
          machineTypeUri: 'n1-standard-2',
        },
        workerConfig: {
          numInstances: 2,
          machineTypeUri: 'n1-standard-2',
        },
      },
    },
  };

  // Create the cluster
  const [operation] = await client.createCluster(request);
  const [response] = await operation.promise();

  // Output a success message
  console.log(`Cluster created successfully: ${response.clusterName}`);
}

createCluster();

Python

Install the client library.
Set up Application Default Credentials.

Run the code.

from google.cloud import dataproc_v1 as dataproc


def create_cluster(project_id, region, cluster_name):
    """This sample walks a user through creating a Cloud Dataproc cluster
    using the Python client library.

    Args:
        project_id (string): Project to use for creating resources.
        region (string): Region where the resources should live.
        cluster_name (string): Name to use for creating a cluster.
    """

    # Create a client with the endpoint set to the desired cluster region.
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    # Output a success message.
    print(f"Cluster created successfully: {result.cluster_name}")
