"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

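Submit an Apache Spark batch workload

Learn how to submit a batch workload to Managed Service for Apache Spark compute infrastructure, which scales resources as needed.

Before you begin

Set up your project and, if necessary, grant Identity and Access Management roles.

Set up your project

Perform one or more of the following steps as needed:

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how Google products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.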

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

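Grant IAM roles if required

Running the examples on this page requires certain IAM roles. Depending on your organization's policies, these roles might already have been granted. To check role grants, see Do you need to grant roles?.

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to submit a serverless batch workload, ask your administrator to grant you the following IAM roles:

Dataproc Editor (roles/dataproc.editor) on the project
Service Account User (roles/iam.serviceAccountUser) on the Compute Engine default service account

Service account role

To ensure that the Compute Engine default service account has the permissions needed to submit a serverless batch workload, ask your administrator to grant the Compute Engine default service account the Dataproc Worker (roles/dataproc.worker) IAM role on the project.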
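
The three role grants above can be expressed as gcloud CLI commands. The following Python sketch only assembles and prints those commands; it does not run them. The function names, the example project ID, project number, and user email are hypothetical, and the sketch assumes that the Compute Engine default service account uses the usual PROJECT_NUMBER-compute@developer.gserviceaccount.com address.

```python
from __future__ import annotations

import shlex


def compute_default_service_account(project_number: str) -> str:
    """Return the assumed email of the Compute Engine default service account."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def role_grant_commands(project_id: str, project_number: str, user_email: str) -> list[list[str]]:
    """Build (but do not run) the gcloud commands for the three role grants."""
    sa_email = compute_default_service_account(project_number)
    return [
        # Dataproc Editor on the project, granted to the user.
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member=user:{user_email}", "--role=roles/dataproc.editor"],
        # Service Account User on the Compute Engine default service account.
        ["gcloud", "iam", "service-accounts", "add-iam-policy-binding", sa_email,
         f"--project={project_id}",
         f"--member=user:{user_email}", "--role=roles/iam.serviceAccountUser"],
        # Dataproc Worker on the project, granted to the service account.
        ["gcloud", "projects", "add-iam-policy-binding", project_id,
         f"--member=serviceAccount:{sa_email}", "--role=roles/dataproc.worker"],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in role_grant_commands("my-project", "123456789012", "alice@example.com"):
        print(shlex.join(cmd))
```

An administrator can review the printed commands before running them in a terminal or in Cloud Shell.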

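Submit a Spark batch workload

You can use the Google Cloud console, the Google Cloud CLI, or the REST API to create and submit a Managed Service for Apache Spark batch workload.

Console

In the Google Cloud console, go to Managed Service for Apache Spark Batches.
Click Create.
Select and fill in the following fields to submit a Spark batch workload that computes the approximate value of pi:
- Batch info:
  - Batch ID: Specify an ID for your batch workload. This value must be 4 to 63 lowercase characters. Valid characters are /[a-z][0-9]-/.
  - Region: Select the region where your workload will run.
- Container:
  - Batch type: Spark.
  - Runtime version: Confirm or select the 3.0 runtime version.
  - Main class: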
```
org.apache.spark.examples.SparkPi
```
  - Jar files (this file is pre-installed in the Managed Service for Apache Spark execution environment):
```
file:///usr/lib/spark/examples/jars/spark-examples.jar
```
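  - Arguments: 1000.
- Execution configuration: Select a service account. By default, the batch runs using the Compute Engine default service account. You can specify a custom service account instead. The default or custom service account must have the Dataproc Worker role.
- Network configuration: Select a subnetwork in the session region. Managed Service for Apache Spark enables Private Google Access (PGA) on the specified subnet. For network connectivity requirements, see Managed Service for Apache Spark network configuration.
- Properties: Enter the Key (property name) and Value of each supported Spark property that you want to set on your Spark batch workload. Note: Unlike Managed Service for Apache Spark cluster properties, Managed Service for Apache Spark workload properties don't include a spark: prefix.
- Other options:
  - You can configure the batch workload to use an external self-managed Hive Metastore.
  - You can use a Persistent History Server (PHS). The PHS must be located in the region where you run batch workloads.
Click Submit to run the Spark batch workload.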
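
The batch ID rule above can be checked before you submit. The following Python sketch reads the rule as "4 to 63 characters drawn from lowercase letters, digits, and hyphens"; that reading, the regular expression, and the function name are assumptions for illustration, and the service might apply additional checks.

```python
import re

# Assumed reading of the documented rule: 4 to 63 characters, each of which
# is a lowercase letter, a digit, or a hyphen.
_BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the assumed length and character rule."""
    return _BATCH_ID_PATTERN.fullmatch(batch_id) is not None


if __name__ == "__main__":
    for candidate in ("spark-pi-001", "abc", "Spark-Pi"):
        print(candidate, is_valid_batch_id(candidate))
```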

gcloud

To submit a Spark batch workload that computes the approximate value of pi, run the following gcloud CLI gcloud dataproc batches submit spark command locally in a terminal window or in Cloud Shell.

gcloud dataproc batches submit spark \
    --region=REGION \
    --version=3.0 \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000

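Replace the following:

REGION: Specify the region where your workload will run.
Other options: You can add gcloud dataproc batches submit spark flags to specify other workload options and Spark properties.
- --jars: The example JAR file is pre-installed in the Spark execution environment. The 1000 command argument passed to the SparkPi workload specifies 1000 iterations of the pi estimation logic (workload input arguments are included after the "--").
- --subnet: You can add this flag to specify the name of a subnet in the session region. If you don't specify a subnet, Managed Service for Apache Spark selects the default subnet in the session region. Managed Service for Apache Spark enables Private Google Access (PGA) on the subnet. For network connectivity requirements, see Managed Service for Apache Spark network configuration.
- --tags: You can add this flag to specify network tags for traffic control. Use network tags to restrict connectivity. In production, we recommend restricting firewall rules to the IP addresses used by your Spark workloads.
- --properties: You can add this flag to enter the supported Spark properties that your Spark batch workload will use.
- --deps-bucket: You can add this flag to specify a Cloud Storage bucket where Managed Service for Apache Spark uploads workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or the bucket name. Managed Service for Apache Spark uploads local files to a /dependencies folder in the bucket before running the batch workload. Note: This flag is required if your batch workload references files on your local machine.
- --ttl: You can add the --ttl flag to specify the duration of the batch lifetime. When the workload exceeds this duration, it is terminated unconditionally, without waiting for ongoing work to finish. Specify the duration using an s, m, h, or d (seconds, minutes, hours, or days) suffix. The minimum value is 10 minutes (10m), and the maximum value is 14 days (14d).
  - 1.1 or 2.0 runtime batches: If --ttl is not specified for a 1.1 or 2.0 runtime batch workload, the workload runs until it exits naturally (or runs forever if it does not exit).
  - 2.1+ runtime batches: If --ttl is not specified for a 2.1 or later runtime batch workload, it defaults to 4h.
- --service-account: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. The service account must have the Dataproc Worker role.
- Hive Metastore: The following command configures a batch workload to use an external self-managed Hive Metastore by using a standard Spark configuration.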
```
gcloud dataproc batches submit spark \
    --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
    other args ...
```
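- Persistent History Server:
  1. The following command creates a PHS on a single-node Managed Service for Apache Spark cluster. The PHS must be located in the region where you run batch workloads, and the Cloud Storage bucket-name must exist.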
```
gcloud dataproc clusters create PHS_CLUSTER_NAME \
    --region=REGION \
    --single-node \
    --enable-component-gateway \
    --properties=spark:spark.history.fs.logDirectory=gs://bucket-name/phs/*/spark-job-history
             
```
  2. Submit a batch workload, specifying your running Persistent History Server.
```
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --history-server-cluster=projects/project-id/regions/region/clusters/PHS-cluster-name \
    -- 1000
              
```
- Runtime version: Use the --version flag to specify the Managed Service for Apache Spark runtime version for the workload.
```
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --version=VERSION \
    -- 1000
```
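
The flag rules in the list above can be checked on the client before calling the gcloud CLI. The following Python sketch validates a --ttl value against the documented 10m to 14d range, joins Spark properties into the comma-separated form that --properties takes in the Hive Metastore example, and assembles the SparkPi command as an argument list. The function names are hypothetical, and the sketch accepts only a single number followed by one suffix, which is a simplifying assumption.

```python
from __future__ import annotations

import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
MIN_TTL_SECONDS = 10 * 60      # documented minimum: 10m
MAX_TTL_SECONDS = 14 * 86400   # documented maximum: 14d


def ttl_to_seconds(ttl: str) -> int:
    """Convert a value such as '4h' to seconds, enforcing the documented range."""
    match = re.fullmatch(r"(\d+)([smhd])", ttl)
    if match is None:
        raise ValueError(f"unsupported ttl format: {ttl!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if not MIN_TTL_SECONDS <= seconds <= MAX_TTL_SECONDS:
        raise ValueError(f"ttl must be between 10m and 14d, got {ttl!r}")
    return seconds


def build_submit_command(region: str, *, version: str | None = None,
                         ttl: str | None = None,
                         properties: dict[str, str] | None = None,
                         app_args: tuple[str, ...] = ("1000",)) -> list[str]:
    """Assemble (but do not run) the SparkPi batch submission command."""
    cmd = ["gcloud", "dataproc", "batches", "submit", "spark",
           f"--region={region}",
           "--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar",
           "--class=org.apache.spark.examples.SparkPi"]
    if version is not None:
        cmd.append(f"--version={version}")
    if ttl is not None:
        ttl_to_seconds(ttl)  # raises ValueError if the value is out of range
        cmd.append(f"--ttl={ttl}")
    if properties:
        joined = ",".join(f"{key}={value}" for key, value in properties.items())
        cmd.append(f"--properties={joined}")
    # Workload input arguments go after the "--" separator.
    cmd.append("--")
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    print(build_submit_command("REGION", version="3.0", ttl="4h",
                               properties={"spark.executor.cores": "4"}))
```

Passing the resulting list to subprocess.run would execute the command; the sketch stops at building it so that the values can be inspected first.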

API

This section shows how to create a batch workload that computes the approximate value of pi by using the Managed Service for Apache Spark batches.create method.

Before using any of the request data, make the following replacements:

project-id: A Google Cloud project ID. Project IDs are listed in the Project info section of the Google Cloud console Dashboard.
region: The Compute Engine region where Managed Service for Apache Spark runs the workload.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches

Request JSON body:

{
  "sparkBatch":{
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ],
    "mainClass":"org.apache.spark.examples.SparkPi"
  },
  "runtimeConfig":{
    "version":"2.3"
  }
}
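
As an alternative to writing the request body by hand, the following Python sketch builds it as a dictionary and serializes it with the standard json module, which avoids syntax slips such as trailing commas. The function names are hypothetical. The sketch places runtimeConfig at the top level of the batch, which is where it appears in the response shown on this page.

```python
import json


def build_spark_pi_batch(version: str = "2.3", iterations: int = 1000) -> dict:
    """Return a request body for a SparkPi batch workload."""
    return {
        "sparkBatch": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": [
                "file:///usr/lib/spark/examples/jars/spark-examples.jar"
            ],
            # Workload arguments are passed as strings.
            "args": [str(iterations)],
        },
        "runtimeConfig": {"version": version},
    }


def batches_url(project_id: str, region: str) -> str:
    """Return the batches collection URL for a project and region."""
    return (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/batches"
    )


if __name__ == "__main__":
    # Print the body; redirect the output to request.json to reuse it.
    print(json.dumps(build_spark_pi_batch(), indent=2))
```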

To send your request, expand one of these options:

cURL(Linux, macOS, Cloud Shell)

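Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you in to the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: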

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches"

PowerShell(Windows)

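Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: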

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name":"projects/project-id/locations/region/batches/batch-id",
  "uuid":"uuid",
  "createTime":"2021-07-22T17:03:46.393957Z",
  "sparkBatch":{
    "mainClass":"org.apache.spark.examples.SparkPi",
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ]
  },
  "runtimeInfo":{
    "outputUri":"gs://dataproc-.../driveroutput"
  },
  "state":"SUCCEEDED",
  "stateTime":"2021-07-22T17:06:30.301789Z",
  "creator":"account-email-address",
  "runtimeConfig":{
    "version":"2.3",
    "properties":{
      "spark:spark.executor.instances":"2",
      "spark:spark.driver.cores":"2",
      "spark:spark.executor.cores":"2",
      "spark:spark.app.name":"projects/project-id/locations/region/batches/batch-id"
    }
  },
  "environmentConfig":{
    "peripheralsConfig":{
      "sparkHistoryServerConfig":{
      }
    }
  },
  "operation":"projects/project-id/regions/region/operation-id"
}
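
A response like the one above can be summarized with a few lines of Python. The following sketch uses only fields that appear in the sample response (name, state, createTime, stateTime, and runtimeInfo.outputUri); the function names and the shape of the summary are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2021-07-22T17:03:46.393957Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_batch(response: dict) -> dict:
    """Extract the batch ID, state, and time from creation to the last state change."""
    created = parse_timestamp(response["createTime"])
    state_changed = parse_timestamp(response["stateTime"])
    return {
        # The batch ID is the last segment of the resource name.
        "batch_id": response["name"].rsplit("/", 1)[-1],
        "state": response["state"],
        "seconds_to_state": (state_changed - created).total_seconds(),
        "output_uri": response.get("runtimeInfo", {}).get("outputUri"),
    }


if __name__ == "__main__":
    sample = {
        "name": "projects/project-id/locations/region/batches/batch-id",
        "createTime": "2021-07-22T17:03:46.393957Z",
        "state": "SUCCEEDED",
        "stateTime": "2021-07-22T17:06:30.301789Z",
    }
    print(summarize_batch(sample))
```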

Estimate workload costs

Managed Service for Apache Spark workloads consume Data Compute Unit (DCU) and shuffle storage resources. For an example that outputs Managed Service for Apache Spark UsageMetrics to estimate workload resource consumption and costs, see Managed Service for Apache Spark pricing.
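
As a rough illustration of the arithmetic, the following Python sketch converts usage counters into DCU-hours and shuffle-storage GB-hours and multiplies them by rates that the caller supplies. It assumes that UsageMetrics reports milliDcuSeconds and shuffleStorageGbSeconds; those field names, the function names, and the per-hour rate parameters are assumptions for illustration, and the sketch contains no real prices. Take actual rates and billing units from the pricing page.

```python
from __future__ import annotations


def usage_to_hours(usage: dict) -> tuple[float, float]:
    """Convert assumed UsageMetrics counters to (DCU-hours, shuffle GB-hours)."""
    # Counters may arrive as strings in JSON, so convert them to int first.
    milli_dcu_seconds = int(usage.get("milliDcuSeconds", 0))
    shuffle_gb_seconds = int(usage.get("shuffleStorageGbSeconds", 0))
    dcu_hours = milli_dcu_seconds / 1000 / 3600
    shuffle_gb_hours = shuffle_gb_seconds / 3600
    return dcu_hours, shuffle_gb_hours


def estimate_cost(usage: dict, dcu_hour_rate: float, shuffle_gb_hour_rate: float) -> float:
    """Estimate cost from usage and caller-supplied (hypothetical) hourly rates."""
    dcu_hours, shuffle_gb_hours = usage_to_hours(usage)
    return dcu_hours * dcu_hour_rate + shuffle_gb_hours * shuffle_gb_hour_rate


if __name__ == "__main__":
    # Made-up usage and rates, for illustration only.
    usage = {"milliDcuSeconds": "7200000", "shuffleStorageGbSeconds": "3600"}
    print(usage_to_hours(usage))
    print(estimate_cost(usage, dcu_hour_rate=0.5, shuffle_gb_hour_rate=0.25))
```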
