"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

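Run Spark jobs on Google Kubernetes Engine

This document shows you how to create a Managed Service for Apache Spark on GKE virtual cluster and then run a Spark job on the cluster.

Options overview

Managed Service for Apache Spark on GKE gives you strong control over a containerized environment, but Google Cloud also offers fully managed and serverless options that can simplify operations and speed up development. For a comparison of the Managed Service for Apache Spark deployment options, see Decide on the best Spark service.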

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how Google products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that you have the permissions required to complete this guide.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
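
As an alternative to the console, the API can be enabled from the command line. The following is a minimal sketch, not the only way to do it: `PROJECT_ID` is a placeholder, and `run` is a hypothetical dry-run helper (it is not part of the gcloud CLI) that prints the command unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Placeholder value (assumption); replace it with your own project ID.
PROJECT_ID="${PROJECT_ID:-my-project}"

# Hypothetical helper: prints the command by default.
# Set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Enables the Dataproc API on the project.
enable_dataproc_api() {
  run gcloud services enable dataproc.googleapis.com --project="$PROJECT_ID"
}

enable_dataproc_api
```

Review the printed command first, then rerun with `DRY_RUN=0` once the project ID is correct.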

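Install the Google Cloud CLI.

Note: If you installed the gcloud CLI previously, run gcloud components update to make sure that you have the latest version.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command: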

gcloud init

Create a standard (not Autopilot) Google Kubernetes Engine (GKE) zonal or regional cluster that has Workload Identity enabled.

Performance tip: Enable image streaming for faster workload initialization.
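
The cluster requirements above can be sketched with the gcloud CLI as follows. This is an illustration under assumptions: the project, region, and cluster names are placeholders, the example creates a regional cluster, and `run` is a hypothetical dry-run helper that prints the command unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
PROJECT_ID="${PROJECT_ID:-my-project}"
REGION="${REGION:-us-central1}"
GKE_CLUSTER="${GKE_CLUSTER:-my-gke-cluster}"

# Hypothetical helper: prints the command by default.
# Set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Creates a standard regional GKE cluster with Workload Identity
# and image streaming enabled.
create_gke_cluster() {
  run gcloud container clusters create "$GKE_CLUSTER" \
    --project="$PROJECT_ID" \
    --region="$REGION" \
    --workload-pool="${PROJECT_ID}.svc.id.goog" \
    --enable-image-streaming
}

create_gke_cluster
```

`gcloud container clusters create` creates a standard cluster; Autopilot clusters are created with a different command, so this matches the "not Autopilot" requirement.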

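Required roles

Certain IAM roles are required to run the examples on this page. Depending on your organization policies, these roles might already be granted. To check role grants, see Do you need to grant roles?

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:

Dataproc Editor (roles/dataproc.editor) on the project
Service Account User (roles/iam.serviceAccountUser) on the Compute Engine default service account

Service account role

To make sure that the Compute Engine default service account has the permissions needed to create a Managed Service for Apache Spark cluster, ask your administrator to grant the Dataproc Worker (roles/dataproc.worker) IAM role on the project to the Compute Engine default service account.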
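
For administrators who grant these roles from the command line, the three bindings above might look like the following sketch. The project ID, project number, and user email are placeholders, and `run` is a hypothetical dry-run helper that prints each command unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
PROJECT_ID="${PROJECT_ID:-my-project}"
PROJECT_NUMBER="${PROJECT_NUMBER:-123456789012}"
USER_EMAIL="${USER_EMAIL:-user@example.com}"

# Hypothetical helper: prints the command by default.
# Set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

grant_roles() {
  # The Compute Engine default service account is named after the project number.
  gce_sa="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

  # User role: Dataproc Editor on the project.
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="user:${USER_EMAIL}" \
    --role="roles/dataproc.editor"

  # User role: Service Account User on the Compute Engine default service account.
  run gcloud iam service-accounts add-iam-policy-binding "$gce_sa" \
    --project="$PROJECT_ID" \
    --member="user:${USER_EMAIL}" \
    --role="roles/iam.serviceAccountUser"

  # Service account role: Dataproc Worker on the project.
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${gce_sa}" \
    --role="roles/dataproc.worker"
}

grant_roles
```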

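Create a virtual cluster

A Managed Service for Apache Spark on GKE virtual cluster is created as the deployment platform for Managed Service for Apache Spark components. It is a virtual resource and, unlike a Managed Service for Apache Spark on Compute Engine cluster, it doesn't include separate Managed Service for Apache Spark master and worker VMs.

- Managed Service for Apache Spark on GKE creates node pools within the GKE cluster when you create a Managed Service for Apache Spark on GKE virtual cluster.
- Managed Service for Apache Spark on GKE jobs run as pods on these node pools. The node pools and the scheduling of pods on the node pools are managed by GKE.
- Create multiple virtual clusters. You can create and run multiple virtual clusters on one GKE cluster and share node pools across them to improve resource utilization.
  - Each virtual cluster:
    - Is created with its own properties, including the Spark engine version and workload identity.
    - Is isolated within its own GKE namespace on the GKE cluster.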

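Console

In the Google Cloud console, go to the Managed Service for Apache Spark Clusters page.

Click Create cluster.
In the Create Managed Service for Apache Spark cluster dialog, click Create in the Cluster on GKE row.
In the Set up cluster panel, do the following:
1. In the Cluster Name field, enter a name for the cluster.
2. In the Region list, select a region for the Managed Service for Apache Spark on GKE virtual cluster. This must be the region where your existing GKE cluster is located (you select the cluster in the next item).
3. In the Kubernetes Cluster field, click Browse to select the region where your existing GKE cluster is located.
4. (Optional) In the Cloud Storage staging bucket field, click Browse to select an existing Cloud Storage bucket. Managed Service for Apache Spark on GKE stages artifacts in the bucket. Ignore this field to have Managed Service for Apache Spark on GKE create a staging bucket.
In the left panel, click Configure Node pools, and then in the Node pools panel, click Add a pool.
1. To reuse an existing Managed Service for Apache Spark on GKE node pool:
  1. Click Reuse existing node pool.
  2. Enter the name of the existing node pool and select its role. At least one node pool must have the default role.
  3. Click Done.
2. To create a new Managed Service for Apache Spark on GKE node pool:
  1. Click Create a new node pool.
  2. Enter the following node pool values:
    - Node pool name
    - Role: At least one node pool must have the default role.
    - Location: Specify a zone within the Managed Service for Apache Spark on GKE cluster region.
    - Node pool machine type
    - CPU platform
    - Preemptibility
    - Min: The minimum node count.
    - Max: The maximum node count. The maximum node count must be greater than 0.
3. Click Add a pool to add more node pools. All node pools must have the same location. You can add a total of four node pools.
(Optional) If you have set up a Managed Service for Apache Spark Persistent History Server (PHS) to view Spark job history for active and deleted Managed Service for Apache Spark on GKE clusters, click Customize cluster. Then, in the History server cluster field, browse for and select your PHS cluster. The PHS cluster must be in the same region as the Managed Service for Apache Spark on GKE virtual cluster.
Click Create to create the Managed Service for Apache Spark cluster. The Managed Service for Apache Spark on GKE cluster appears in the list on the Clusters page. Its status is Provisioning until the cluster is ready to use, and then the status changes to Running.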

gcloud

Set environment variables, and then run the gcloud dataproc clusters gke create command locally or in Cloud Shell to create a Managed Service for Apache Spark on GKE cluster.

Set environment variables:
```
# Replace each placeholder value with your own before running.
DP_CLUSTER=virtual-cluster-name
REGION=region
GKE_CLUSTER=gke-cluster-name
BUCKET=cloud-storage-bucket-name
DP_POOLNAME=node-pool-name
PHS_CLUSTER=phs-server-name
```
Notes:
- DP_CLUSTER: Set the Managed Service for Apache Spark virtual cluster name. The name must start with a lowercase letter, followed by up to 54 lowercase letters, numbers, or hyphens. It cannot end with a hyphen.
- REGION: The region must be the same as the region where the GKE cluster is located.
- GKE_CLUSTER: The name of your existing GKE cluster.
- BUCKET: (Optional) The name of a Cloud Storage bucket that Managed Service for Apache Spark uses to stage artifacts. If you don't specify a bucket, Managed Service for Apache Spark on GKE creates a staging bucket.
- DP_POOLNAME: The name of the node pool to create on the GKE cluster.
- PHS_CLUSTER: (Optional) The Managed Service for Apache Spark PHS server to use to view Spark job history for active and deleted Managed Service for Apache Spark on GKE clusters. The PHS cluster must be in the same region as the Managed Service for Apache Spark on GKE virtual cluster.
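
The naming rule for DP_CLUSTER can be checked locally before you call the API. The following sketch is one possible way to encode the rule as stated in the notes; `is_valid_cluster_name` is a hypothetical helper name, not a gcloud feature, and the service itself remains the authority on what it accepts.

```shell
#!/bin/sh
# Returns 0 if the name starts with a lowercase letter, is followed by
# at most 54 lowercase letters, digits, or hyphens, and does not end
# with a hyphen. Returns 1 otherwise.
is_valid_cluster_name() {
  name="$1"
  # One leading letter plus at most 54 further characters.
  [ "${#name}" -ge 1 ] && [ "${#name}" -le 55 ] || return 1
  case "$name" in
    *[!abcdefghijklmnopqrstuvwxyz0123456789-]*) return 1 ;;  # illegal character
    [0123456789-]*) return 1 ;;                              # must start with a letter
    *-) return 1 ;;                                          # must not end with a hyphen
  esac
  return 0
}

# Example check with a placeholder name.
if is_valid_cluster_name "${DP_CLUSTER:-my-virtual-cluster}"; then
  echo "cluster name looks valid"
else
  echo "cluster name is invalid" >&2
fi
```

The character class is spelled out instead of using ranges such as `a-z` so that the check does not depend on the shell's locale.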
Run the following command:
```
gcloud dataproc clusters gke create ${DP_CLUSTER} \
    --region=${REGION} \
    --gke-cluster=${GKE_CLUSTER} \
    --spark-engine-version=latest \
    --staging-bucket=${BUCKET} \
    --pools="name=${DP_POOLNAME},roles=default" \
    --setup-workload-identity \
    --history-server-cluster=${PHS_CLUSTER}
```
Notes:
- --spark-engine-version: The Spark image version used on the Managed Service for Apache Spark cluster. You can use an identifier, such as 3, 3.1, or latest, or you can specify the full subminor version, such as 3.1-dataproc-5.
- --staging-bucket: Remove this flag to have Managed Service for Apache Spark on GKE create a staging bucket.
- --pools: This flag specifies a new or existing node pool that Managed Service for Apache Spark creates or uses to run the workload. List the Managed Service for Apache Spark on GKE node pool settings, separated by commas. For example:
```
--pools=name=dp-default,roles=default,machineType=e2-standard-4,min=0,max=10
```
  You must specify the node pool name and roles. Other node pool settings are optional. You can use multiple --pools flags to specify multiple node pools. At least one node pool must have the default role. All node pools must have the same location.
- --setup-workload-identity: This flag enables Workload Identity bindings. These bindings allow the Kubernetes service accounts (KSAs) to act as the default Managed Service for Apache Spark VM service account (data plane identity) of the virtual cluster.
  Setting up Workload Identity on a Google service account (GSA) requires elevated permissions (see Managed Service for Apache Spark on Google Kubernetes Engine IAM permissions). To use your own GSA with a Managed Service for Apache Spark on GKE virtual cluster, see Custom IAM configuration.
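
When you pass several --pools flags, it can help to build each value from its parts. The sketch below is illustrative only: `pool_spec` is a hypothetical helper, the pool names are placeholders, and the `spark-executor` role name in the second call is an assumption that does not come from this page, so check the gcloud reference for the roles that are supported.

```shell
#!/bin/sh
# Builds a --pools value from its parts.
# Usage: pool_spec NAME ROLES [MACHINE_TYPE] [MIN] [MAX]
pool_spec() {
  if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
    echo "pool_spec: name and roles are required" >&2
    return 1
  fi
  spec="name=$1,roles=$2"
  if [ -n "${3:-}" ]; then spec="$spec,machineType=$3"; fi
  if [ -n "${4:-}" ]; then spec="$spec,min=$4"; fi
  if [ -n "${5:-}" ]; then spec="$spec,max=$5"; fi
  echo "$spec"
}

# Print the flags for two node pools. At least one pool has the default role.
echo "--pools=$(pool_spec dp-default default e2-standard-4 0 10)"
# The role name below is an assumption; verify it before use.
echo "--pools=$(pool_spec dp-executors spark-executor e2-standard-4 0 10)"
```

The printed flags can be appended to the gcloud dataproc clusters gke create command in place of the single --pools flag.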

REST

Complete a virtualClusterConfig as part of a cluster.create API request.

Before using any of the request data, make the following replacements:

PROJECT: Google Cloud project ID
REGION: Dataproc virtual cluster region (the same region as the existing GKE cluster)
DP_CLUSTER: Dataproc cluster name
GKE_CLUSTER: GKE cluster name
NODE_POOL: Node pool name
PHS_CLUSTER: Persistent History Server (PHS) cluster name
BUCKET: (Optional) Staging bucket name. Leave this empty to have Managed Service for Apache Spark on GKE create a staging bucket.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

{
  "clusterName":"DP_CLUSTER",
  "projectId":"PROJECT",
  "virtualClusterConfig":{
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/PROJECT/regions/REGION/clusters/PHS_CLUSTER"
      }
    },
    "kubernetesClusterConfig":{
      "gkeClusterConfig":{
        "gkeClusterTarget":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER",
        "nodePoolTarget":[
          {
            "nodePool":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER/nodePools/NODE_POOL",
            "roles":[
              "DEFAULT"
            ]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"latest"
        }
      }
    },
    "stagingBucket":"BUCKET"
  }
}
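
If you already keep these values in shell variables, the request body can be generated instead of edited by hand. The following is a sketch that writes the same JSON structure as above; `write_request_json` is a hypothetical helper name and the default values are placeholders.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
PROJECT="${PROJECT:-my-project}"
REGION="${REGION:-us-central1}"
DP_CLUSTER="${DP_CLUSTER:-my-virtual-cluster}"
GKE_CLUSTER="${GKE_CLUSTER:-my-gke-cluster}"
NODE_POOL="${NODE_POOL:-dp-default}"
PHS_CLUSTER="${PHS_CLUSTER:-my-phs-cluster}"
BUCKET="${BUCKET:-}"

# Writes the request body to the file named by the first argument.
write_request_json() {
  gke="projects/${PROJECT}/locations/${REGION}/clusters/${GKE_CLUSTER}"
  cat > "$1" <<EOF
{
  "clusterName":"${DP_CLUSTER}",
  "projectId":"${PROJECT}",
  "virtualClusterConfig":{
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/${PROJECT}/regions/${REGION}/clusters/${PHS_CLUSTER}"
      }
    },
    "kubernetesClusterConfig":{
      "gkeClusterConfig":{
        "gkeClusterTarget":"${gke}",
        "nodePoolTarget":[
          {
            "nodePool":"${gke}/nodePools/${NODE_POOL}",
            "roles":["DEFAULT"]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"latest"
        }
      }
    },
    "stagingBucket":"${BUCKET}"
  }
}
EOF
}

write_request_json request.json
```

The values are substituted as plain text, so this sketch assumes that none of them contain characters that need JSON escaping.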

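To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: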

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"

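PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: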

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "projectId":"PROJECT",
  "clusterName":"DP_CLUSTER",
  "status":{
    "state":"RUNNING",
    "stateStartTime":"2022-04-01T19:16:39.865716Z"
  },
  "clusterUuid":"98060b77-...",
  "statusHistory":[
    {
      "state":"CREATING",
      "stateStartTime":"2022-04-01T19:14:27.340544Z"
    }
  ],
  "labels":{
    "goog-dataproc-cluster-name":"DP_CLUSTER",
    "goog-dataproc-cluster-uuid":"98060b77-...",
    "goog-dataproc-location":"REGION",
    "goog-dataproc-environment":"prod"
  },
  "virtualClusterConfig":{
    "stagingBucket":"BUCKET",
    "kubernetesClusterConfig":{
      "kubernetesNamespace":"dp-cluster",
      "gkeClusterConfig":{
        "gkeClusterTarget":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER",
        "nodePoolTarget":[
          {
            "nodePool":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER/nodePools/NODE_POOL",
            "roles":[
              "DEFAULT"
            ]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"3.1-..."
        },
        "properties":{
          "dpgke:dpgke.unstable.outputOnly.endpoints.sparkHistoryServer":"https://...",
          "spark:spark.eventLog.dir":"gs://BUCKET/.../spark-job-history",
          "spark:spark.eventLog.enabled":"true"
        }
      }
    },
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/PROJECT/regions/REGION/clusters/PHS_CLUSTER"
      }
    }
  }
}
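
The status.state field in the response tells you when the virtual cluster is usable. The following is a rough sketch of how you might poll for the RUNNING state with the gcloud CLI. The cluster name and region are placeholders, and `wait_for_running`, `MAX_TRIES`, and `POLL_SECONDS` are hypothetical names introduced for this example.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
DP_CLUSTER="${DP_CLUSTER:-my-virtual-cluster}"
REGION="${REGION:-us-central1}"

# Reads the current cluster state (for example CREATING or RUNNING).
get_cluster_state() {
  gcloud dataproc clusters describe "$DP_CLUSTER" \
    --region="$REGION" \
    --format='value(status.state)'
}

# Polls until the state is RUNNING. Returns 1 if the cluster reports
# ERROR or if MAX_TRIES attempts pass without reaching RUNNING.
wait_for_running() {
  max_tries="${MAX_TRIES:-30}"
  delay="${POLL_SECONDS:-10}"
  try=1
  while [ "$try" -le "$max_tries" ]; do
    state="$(get_cluster_state)"
    echo "attempt $try: state=$state"
    case "$state" in
      RUNNING) return 0 ;;
      ERROR) return 1 ;;
    esac
    try=$((try + 1))
    sleep "$delay"
  done
  return 1
}
```

The block only defines the functions. Call `wait_for_running` after you send the create request, for example `wait_for_running && echo "cluster is ready"`.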

Submit a Spark job

After your Managed Service for Apache Spark on GKE virtual cluster is running, submit a Spark job by using the Google Cloud console, the gcloud CLI, or the Managed Service for Apache Spark jobs.submit API (through direct HTTP requests or the Cloud Client Libraries).

gcloud CLI Spark job example:

gcloud dataproc jobs submit spark \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    --class=org.apache.spark.examples.SparkPi \
    --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

gcloud CLI PySpark job example:

gcloud dataproc jobs submit pyspark \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    local:///usr/lib/spark/examples/src/main/python/pi.py \
    -- 10

gcloud CLI SparkR job example:

gcloud dataproc jobs submit spark-r \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    local:///usr/lib/spark/examples/src/main/r/dataframe.R

Clean up

If you don't want to keep using the following resources from this quickstart, delete them:
Delete the Managed Service for Apache Spark on GKE cluster.
Delete any node pools used by the Managed Service for Apache Spark on GKE cluster.
Delete the GKE cluster.
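
The cleanup steps above might be scripted as follows. This is a sketch under assumptions: it reuses the environment variable names from the gcloud section with placeholder defaults, it assumes a regional GKE cluster, and `run` is a hypothetical dry-run helper that prints each command unless `DRY_RUN=0`.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace them with your own.
DP_CLUSTER="${DP_CLUSTER:-my-virtual-cluster}"
REGION="${REGION:-us-central1}"
GKE_CLUSTER="${GKE_CLUSTER:-my-gke-cluster}"
DP_POOLNAME="${DP_POOLNAME:-dp-default}"

# Hypothetical helper: prints the command by default.
# Set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

cleanup() {
  # 1. Delete the virtual cluster.
  run gcloud dataproc clusters delete "$DP_CLUSTER" \
    --region="$REGION" --quiet
  # 2. Delete the node pool that the virtual cluster used.
  run gcloud container node-pools delete "$DP_POOLNAME" \
    --cluster="$GKE_CLUSTER" --region="$REGION" --quiet
  # 3. Delete the GKE cluster itself.
  run gcloud container clusters delete "$GKE_CLUSTER" \
    --region="$REGION" --quiet
}

cleanup
```

Because `--quiet` skips the confirmation prompts, check the printed commands carefully before you rerun with `DRY_RUN=0`. Skip the last step if other workloads still use the GKE cluster.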
