"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).


Run Spark jobs on Google Kubernetes Engine

This document describes how to create a Managed Service for Apache Spark on GKE virtual cluster and run a Spark job on the cluster.

Options overview

Managed Service for Apache Spark on GKE gives you fine-grained control over a containerized environment, but Google Cloud also offers fully managed and serverless options that can simplify operations and accelerate development. For a comparison of Managed Service for Apache Spark deployment options, see Decide on the best Spark service.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how the products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that you have the permissions required to complete this guide.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Install the Google Cloud CLI.

Note: If you installed the gcloud CLI previously, run gcloud components update to make sure that you have the latest version.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

You must have already created a Standard (not Autopilot) Google Kubernetes Engine (GKE) zonal or regional cluster that has Workload Identity enabled.

Performance tip: Enable image streaming to speed up workload initialization.
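
The GKE prerequisite above could be met with a command along the following lines. This is a minimal sketch, not the documented procedure: the project ID, region, and cluster name are placeholder values, and the `run` helper with its `DRY_RUN` switch is a hypothetical convenience that only prints the command unless `DRY_RUN=0` is set. It assumes the `--workload-pool` and `--enable-image-streaming` flags of `gcloud container clusters create`; confirm both against your gcloud CLI version before running it.

```shell
# Sketch: create a Standard (non-Autopilot) regional GKE cluster with
# Workload Identity and image streaming enabled.
# Every value below is a placeholder (assumption), not a documented default.
PROJECT_ID="${PROJECT_ID:-my-project}"
REGION="${REGION:-us-central1}"
GKE_CLUSTER="${GKE_CLUSTER:-my-gke-cluster}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# A project's Workload Identity pool is named PROJECT_ID.svc.id.goog.
workload_pool() {
  printf '%s.svc.id.goog\n' "$1"
}

create_gke_cluster() {
  run gcloud container clusters create "${GKE_CLUSTER}" \
    --project="${PROJECT_ID}" \
    --region="${REGION}" \
    --workload-pool="$(workload_pool "${PROJECT_ID}")" \
    --enable-image-streaming
}

create_gke_cluster
```

With the default dry run, the script prints the assembled command so that you can review it first. For a zonal cluster, you would replace `--region` with `--zone`.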

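Required roles

Certain IAM roles are required to run the examples on this page. Depending on organization policies, these roles may have already been granted. To check role grants, see Do you need to grant roles?.

For more information about granting roles, see Manage access to projects, folders, and organizations.

User roles

To get the permissions that you need to create a Managed Service for Apache Spark cluster, ask your administrator to grant you the following IAM roles:

- Dataproc Editor (roles/dataproc.editor) on the project
- Service Account User (roles/iam.serviceAccountUser) on the Compute Engine default service account

Service account role

To ensure that the Compute Engine default service account has the permissions needed to create a Managed Service for Apache Spark cluster, ask your administrator to grant the Dataproc Worker (roles/dataproc.worker) IAM role on the project to the Compute Engine default service account.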
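
An administrator could grant the three roles above with commands similar to the following sketch. The project ID, project number, and user email are placeholders, and the `run` helper with its `DRY_RUN` switch is a hypothetical convenience that prints each command instead of running it unless `DRY_RUN=0` is set. The sketch assumes the Compute Engine default service account uses the usual `PROJECT_NUMBER-compute@developer.gserviceaccount.com` address; verify the address in your project before granting anything.

```shell
# Sketch: grant the user roles and the service account role listed above.
# Every value below is a placeholder (assumption).
PROJECT_ID="${PROJECT_ID:-my-project}"
PROJECT_NUMBER="${PROJECT_NUMBER:-123456789012}"
USER_EMAIL="${USER_EMAIL:-user@example.com}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Assumed address pattern of the Compute Engine default service account.
default_compute_sa() {
  printf '%s-compute@developer.gserviceaccount.com\n' "$1"
}

grant_roles() {
  sa="$(default_compute_sa "${PROJECT_NUMBER}")"
  # User role: Dataproc Editor on the project.
  run gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="user:${USER_EMAIL}" --role=roles/dataproc.editor
  # User role: Service Account User on the default service account.
  run gcloud iam service-accounts add-iam-policy-binding "${sa}" \
    --project="${PROJECT_ID}" \
    --member="user:${USER_EMAIL}" --role=roles/iam.serviceAccountUser
  # Service account role: Dataproc Worker on the project.
  run gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${sa}" --role=roles/dataproc.worker
}

grant_roles
```

Printing the commands first makes it easier to check each member and role before any IAM policy changes.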

Create a virtual cluster

A Managed Service for Apache Spark on GKE virtual cluster is created as the deployment platform for Managed Service for Apache Spark components. It is a virtual resource and, unlike a Managed Service for Apache Spark on Compute Engine cluster, doesn't include separate master and worker VMs.

- Managed Service for Apache Spark on GKE creates node pools within a GKE cluster when you create a virtual cluster.
- Managed Service for Apache Spark on GKE jobs run as pods on these node pools. GKE manages the node pools and the scheduling of pods on them.
- You can create and run multiple virtual clusters on a GKE cluster and share node pools among them to improve resource utilization.
- Each virtual cluster:
  - Is created with separate properties, including the Spark engine version and workload identity.
  - Is isolated within a separate GKE namespace on the GKE cluster.

Console

In the Google Cloud console, go to the Managed Service for Apache Spark [Clusters] page.

Go to Clusters
Click [Create cluster].
In the [Create Managed Service for Apache Spark cluster] dialog, click [Create] in the [Cluster on GKE] row.
In the [Set up cluster] panel, do the following:
1. In the [Cluster Name] field, enter a name for the cluster.
2. In the [Region] list, select a region for the Managed Service for Apache Spark on GKE virtual cluster. This must be the same region where your existing GKE cluster is located (you select the GKE cluster in the next item).
3. In the [Kubernetes Cluster] field, click [Browse] to select the region where your existing GKE cluster is located.
4. Optional: In the [Cloud Storage staging bucket] field, click [Browse] to select an existing Cloud Storage bucket. Managed Service for Apache Spark on GKE stages artifacts in the bucket. Leave this field empty to have Managed Service for Apache Spark on GKE create a staging bucket.
In the left panel, click [Configure Node pools], and then in the [Node pools] panel, click [Add a pool].
1. To reuse an existing Managed Service for Apache Spark on GKE node pool:
  1. Click [Reuse existing node pool].
  2. Enter the name of the existing node pool and select its [Role]. At least one node pool must have the DEFAULT role.
  3. Click [Done].
2. To create a new Managed Service for Apache Spark on GKE node pool:
  1. Click [Create a new node pool].
  2. Enter the following node pool values:
    - Node pool name
    - Role: At least one node pool must have the DEFAULT role.
    - Location: Specify a zone within the Managed Service for Apache Spark on GKE cluster region.
    - Node pool machine type
    - CPU platform
    - Preemptibility
    - Min: Minimum node count.
    - Max: Maximum node count. The maximum node count must be greater than 0.
3. Click [Add a pool] to add more node pools. Every node pool must have a location. You can add a total of four node pools.
(Optional) If you have set up a Managed Service for Apache Spark Persistent History Server (PHS) to view Spark job history on active and deleted Managed Service for Apache Spark on GKE clusters, click [Customize cluster]. Then, in the [History server cluster] field, browse for and select your PHS cluster. The PHS cluster must be located in the same region as the Managed Service for Apache Spark on GKE virtual cluster.
Click [Create] to create the Managed Service for Apache Spark cluster. Your Managed Service for Apache Spark on GKE cluster appears in the list on the [Clusters] page. Its status is [Provisioning] until the cluster is ready to use, and then changes to [Running].

gcloud

Set environment variables, and then run the gcloud dataproc clusters gke create command locally or in Cloud Shell to create a Managed Service for Apache Spark on GKE cluster.

Set environment variables:
```
# Replace each placeholder value before running.
DP_CLUSTER="virtual-cluster-name"
REGION="region"
GKE_CLUSTER="gke-cluster-name"
BUCKET="cloud-storage-bucket-name"
DP_POOLNAME="node-pool-name"
PHS_CLUSTER="phs-server-name"
```
Notes:
- DP_CLUSTER: The Managed Service for Apache Spark virtual cluster name must start with a lowercase letter, followed by up to 54 lowercase letters, digits, or hyphens. It cannot end with a hyphen.
- REGION: The region must be the same as the region where the GKE cluster is located.
- GKE_CLUSTER: The name of the existing GKE cluster.
- BUCKET: (Optional) The name of a Cloud Storage bucket that Managed Service for Apache Spark uses to stage artifacts. If you don't specify a bucket, Managed Service for Apache Spark on GKE creates a staging bucket.
- DP_POOLNAME: The name of a node pool to create on the GKE cluster.
- PHS_CLUSTER: (Optional) The Managed Service for Apache Spark PHS server to use to view Spark job history on active and deleted Managed Service for Apache Spark on GKE clusters. The PHS cluster must be located in the same region as the Managed Service for Apache Spark on GKE virtual cluster.
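
The DP_CLUSTER naming rule can be checked locally before you create anything. The following sketch assumes the rule means one leading lowercase letter plus at most 54 further characters (55 in total), none of them uppercase, and no trailing hyphen. The function name `is_valid_cluster_name` is hypothetical, and the service itself remains the authority on which names it accepts.

```shell
# Sketch: validate a virtual cluster name against the assumed rule:
# a lowercase letter, then at most 54 lowercase letters, digits, or
# hyphens, with no trailing hyphen.
is_valid_cluster_name() {
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[a-z]([-a-z0-9]{0,53}[a-z0-9])?$'
}

# Example names (placeholders) to show how the check behaves.
for name in "spark-on-gke" "Spark-on-gke" "spark-on-gke-"; do
  if is_valid_cluster_name "$name"; then
    echo "$name: ok"
  else
    echo "$name: invalid"
  fi
done
```

Running the check first gives faster feedback than waiting for the create request to be rejected.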
Run the following command:
```
gcloud dataproc clusters gke create ${DP_CLUSTER} \
    --region=${REGION} \
    --gke-cluster=${GKE_CLUSTER} \
    --spark-engine-version=latest \
    --staging-bucket=${BUCKET} \
    --pools="name=${DP_POOLNAME},roles=default" \
    --setup-workload-identity \
    --history-server-cluster=${PHS_CLUSTER}
```
Notes:
- --spark-engine-version: The Spark image version used on the Managed Service for Apache Spark cluster. You can use an identifier, such as `3`, `3.1`, or `latest`, or you can specify the full subminor version, such as `3.1-dataproc-5`.
- --staging-bucket: Remove this flag to have Managed Service for Apache Spark on GKE create a staging bucket.
- --pools: Specifies a new or existing node pool that Managed Service for Apache Spark creates or uses to run the workload. List the Managed Service for Apache Spark on GKE node pool settings, separated by commas, for example:
```
--pools=name=dp-default,roles=default,machineType=e2-standard-4,min=0,max=10
```
  You must specify the node pool name and role. Other node pool settings are optional. You can use multiple --pools flags to specify multiple node pools. At least one node pool must have the default role. All node pools must have the same location.
- --setup-workload-identity: This flag enables Workload Identity bindings. These bindings allow the Kubernetes service accounts (KSAs) to act as the default Managed Service for Apache Spark VM service account (data plane identity) of the virtual cluster.
  Setting up Workload Identity with a Google service account (GSA) requires elevated permissions (see Managed Service for Apache Spark on Google Kubernetes Engine permissions). To use your own GSA with a Managed Service for Apache Spark on GKE virtual cluster, see Custom IAM configuration.
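
When you pass several --pools flags, it can help to assemble each value programmatically and confirm that one pool carries the default role. The sketch below is illustrative only: `pool_spec` and `has_default_pool` are hypothetical helper names, the setting keys are the ones shown in the example above, and the role check only recognizes a pool whose role list is exactly `default`.

```shell
# Sketch: build --pools values and check for a default-role pool.

# pool_spec NAME ROLES [KEY=VALUE ...]
# Joins the name, roles, and optional settings with commas.
pool_spec() {
  spec="name=$1,roles=$2"
  shift 2
  for setting in "$@"; do
    spec="${spec},${setting}"
  done
  printf '%s\n' "$spec"
}

# has_default_pool SPEC [SPEC ...]
# Succeeds if at least one spec contains roles=default.
has_default_pool() {
  for candidate in "$@"; do
    case ",${candidate}," in
      *,roles=default,*) return 0 ;;
    esac
  done
  return 1
}

# Pool name and settings are taken from the example above.
default_pool="$(pool_spec dp-default default machineType=e2-standard-4 min=0 max=10)"
if has_default_pool "$default_pool"; then
  echo "--pools=${default_pool}"
else
  echo "no node pool has the default role" >&2
fi
```

The printed flag can then be appended to the gcloud dataproc clusters gke create command.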

REST

Complete a virtualClusterConfig as part of a cluster.create API request.

Before using any of the request data, make the following replacements:

- PROJECT: Google Cloud project ID
- REGION: Managed Service for Apache Spark virtual cluster region (the same region as the existing GKE cluster)
- DP_CLUSTER: Managed Service for Apache Spark cluster name
- GKE_CLUSTER: GKE cluster name
- NODE_POOL: Node pool name
- PHS_CLUSTER: Persistent History Server (PHS) cluster name
- BUCKET: (Optional) Staging bucket name. Leave this empty to have Managed Service for Apache Spark on GKE create a staging bucket.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

{
  "clusterName":"DP_CLUSTER",
  "projectId":"PROJECT",
  "virtualClusterConfig":{
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/PROJECT/regions/REGION/clusters/PHS_CLUSTER"
      }
    },
    "kubernetesClusterConfig":{
      "gkeClusterConfig":{
        "gkeClusterTarget":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER",
        "nodePoolTarget":[
          {
            "nodePool":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER/nodePools/NODE_POOL",
            "roles":[
              "DEFAULT"
            ]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"latest"
        }
      }
    },
    "stagingBucket":"BUCKET"
  }
}
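
Rather than editing the placeholders by hand, you could render the request body from shell variables. The following is a minimal sketch under stated assumptions: the variable values are placeholders, `write_request` is a hypothetical helper name, and the JSON mirrors the request body shown above without adding fields.

```shell
# Sketch: render the request body from shell variables.
# Every value below is a placeholder (assumption); replace before use.
PROJECT="${PROJECT:-my-project}"
REGION="${REGION:-us-central1}"
DP_CLUSTER="${DP_CLUSTER:-my-virtual-cluster}"
GKE_CLUSTER="${GKE_CLUSTER:-my-gke-cluster}"
NODE_POOL="${NODE_POOL:-dp-default}"
PHS_CLUSTER="${PHS_CLUSTER:-my-phs-cluster}"
BUCKET="${BUCKET:-}"  # empty: the service creates the staging bucket

# write_request FILE: write the JSON request body to FILE.
write_request() {
  gke="projects/${PROJECT}/locations/${REGION}/clusters/${GKE_CLUSTER}"
  cat > "$1" <<EOF
{
  "clusterName": "${DP_CLUSTER}",
  "projectId": "${PROJECT}",
  "virtualClusterConfig": {
    "auxiliaryServicesConfig": {
      "sparkHistoryServerConfig": {
        "dataprocCluster": "projects/${PROJECT}/regions/${REGION}/clusters/${PHS_CLUSTER}"
      }
    },
    "kubernetesClusterConfig": {
      "gkeClusterConfig": {
        "gkeClusterTarget": "${gke}",
        "nodePoolTarget": [
          { "nodePool": "${gke}/nodePools/${NODE_POOL}", "roles": ["DEFAULT"] }
        ]
      },
      "kubernetesSoftwareConfig": {
        "componentVersion": { "SPARK": "latest" }
      }
    },
    "stagingBucket": "${BUCKET}"
  }
}
EOF
}

write_request request.json
```

Generating the file keeps the GKE cluster path and the node pool path consistent, because both are derived from the same variables.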

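To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have signed in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or that you are using Cloud Shell, which signs you in to the gcloud CLI automatically. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: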

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"

PowerShell (Windows)

Note: The following command assumes that you have signed in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "projectId":"PROJECT",
  "clusterName":"DP_CLUSTER",
  "status":{
    "state":"RUNNING",
    "stateStartTime":"2022-04-01T19:16:39.865716Z"
  },
  "clusterUuid":"98060b77-...",
  "statusHistory":[
    {
      "state":"CREATING",
      "stateStartTime":"2022-04-01T19:14:27.340544Z"
    }
  ],
  "labels":{
    "goog-dataproc-cluster-name":"DP_CLUSTER",
    "goog-dataproc-cluster-uuid":"98060b77-...",
    "goog-dataproc-location":"REGION",
    "goog-dataproc-environment":"prod"
  },
  "virtualClusterConfig":{
    "stagingBucket":"BUCKET",
    "kubernetesClusterConfig":{
      "kubernetesNamespace":"dp-cluster",
      "gkeClusterConfig":{
        "gkeClusterTarget":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER",
        "nodePoolTarget":[
          {
            "nodePool":"projects/PROJECT/locations/REGION/clusters/GKE_CLUSTER/nodePools/NODE_POOL",
            "roles":[
              "DEFAULT"
            ]
          }
        ]
      },
      "kubernetesSoftwareConfig":{
        "componentVersion":{
          "SPARK":"3.1-..."
        },
        "properties":{
          "dpgke:dpgke.unstable.outputOnly.endpoints.sparkHistoryServer":"https://...",
          "spark:spark.eventLog.dir":"gs://BUCKET/.../spark-job-history",
          "spark:spark.eventLog.enabled":"true"
        }
      }
    },
    "auxiliaryServicesConfig":{
      "sparkHistoryServerConfig":{
        "dataprocCluster":"projects/PROJECT/regions/REGION/clusters/PHS_CLUSTER"
      }
    }
  }
}
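
The status.state field in the response can be polled until the cluster is ready. The following is a hedged sketch rather than a documented workflow: `wait_for_running` and `gcloud_cluster_state` are hypothetical helper names, the retry count and delay are arbitrary, and treating any state that begins with ERROR as fatal is an assumption. It assumes `gcloud dataproc clusters describe` with a `--format='value(status.state)'` projection, and that DP_CLUSTER and REGION are already set.

```shell
# Sketch: poll the cluster state until it is RUNNING.

# Print the current cluster state (for example CREATING or RUNNING).
gcloud_cluster_state() {
  gcloud dataproc clusters describe "${DP_CLUSTER}" \
    --region="${REGION}" --format='value(status.state)'
}

# wait_for_running FETCH_CMD [MAX_TRIES] [DELAY_SECONDS]
# FETCH_CMD is any command or function that prints the current state.
# Returns 0 on RUNNING, 2 on an ERROR state, 1 if tries run out.
wait_for_running() {
  fetch_cmd=$1
  max_tries=${2:-30}
  delay=${3:-10}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    state="$("$fetch_cmd")"
    case "$state" in
      RUNNING) return 0 ;;
      ERROR*)  return 2 ;;
    esac
    tries=$((tries + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (uncomment to poll a real cluster):
# wait_for_running gcloud_cluster_state 30 10 && echo "cluster is running"
```

Passing the state fetcher as an argument keeps the loop independent of gcloud, so you can swap in a REST call or a stub when testing.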

Submit a Spark job

After your Managed Service for Apache Spark on GKE virtual cluster is running, submit a Spark job by using the Google Cloud console, the gcloud CLI, or the Managed Service for Apache Spark jobs.submit API (through direct HTTP requests or the Cloud Client Libraries).

gcloud CLI Spark job example:

gcloud dataproc jobs submit spark \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    --class=org.apache.spark.examples.SparkPi \
    --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

gcloud CLI PySpark job example:

gcloud dataproc jobs submit pyspark \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    local:///usr/lib/spark/examples/src/main/python/pi.py \
    -- 10

gcloud CLI SparkR job example:

gcloud dataproc jobs submit spark-r \
    --region=${REGION} \
    --cluster=${DP_CLUSTER} \
    local:///usr/lib/spark/examples/src/main/r/dataframe.R
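
The job examples above depend on REGION and DP_CLUSTER being set in the current shell. As an illustration only, the following sketch wraps the Spark example in a function that fails early when either variable is missing. The helper names `require_vars`, `run`, and `submit_spark_pi` are hypothetical, and the `DRY_RUN` switch prints the command instead of submitting a job unless `DRY_RUN=0` is set.

```shell
# Sketch: check required variables before submitting the SparkPi job.

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# require_vars NAME [NAME ...]
# Fails if any named variable is unset or empty.
require_vars() {
  for var_name in "$@"; do
    eval "var_value=\${${var_name}:-}"
    if [ -z "$var_value" ]; then
      echo "missing required variable: ${var_name}" >&2
      return 1
    fi
  done
}

submit_spark_pi() {
  require_vars REGION DP_CLUSTER || return 1
  run gcloud dataproc jobs submit spark \
    --region="${REGION}" \
    --cluster="${DP_CLUSTER}" \
    --class=org.apache.spark.examples.SparkPi \
    --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
}

# Usage (after setting REGION and DP_CLUSTER):
# submit_spark_pi
```

An empty variable would otherwise expand to a flag with no value, which produces a less obvious error from the CLI.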

Clean up

Delete any of the following resources used in this quickstart that you don't plan to keep using.
- Delete the Managed Service for Apache Spark on GKE cluster.
- Delete the node pools used by the Managed Service for Apache Spark on GKE cluster.
- Delete the GKE cluster.
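
The cleanup steps above could be scripted in the same order. This is a sketch under stated assumptions: the variable values are placeholders, the GKE cluster is regional (use `--zone` instead of `--region` for a zonal cluster), and the hypothetical `run` helper only prints each command unless `DRY_RUN=0` is set, so nothing is deleted by default.

```shell
# Sketch: delete the virtual cluster, its node pool, and the GKE cluster.
# Every value below is a placeholder (assumption).
DP_CLUSTER="${DP_CLUSTER:-my-virtual-cluster}"
REGION="${REGION:-us-central1}"
GKE_CLUSTER="${GKE_CLUSTER:-my-gke-cluster}"
DP_POOLNAME="${DP_POOLNAME:-dp-default}"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

cleanup() {
  # 1. The virtual cluster.
  run gcloud dataproc clusters delete "${DP_CLUSTER}" \
    --region="${REGION}"
  # 2. The node pool that the virtual cluster used.
  run gcloud container node-pools delete "${DP_POOLNAME}" \
    --cluster="${GKE_CLUSTER}" --region="${REGION}"
  # 3. The GKE cluster itself.
  run gcloud container clusters delete "${GKE_CLUSTER}" \
    --region="${REGION}"
}

cleanup
```

Skip any step for a resource that you want to keep, for example a GKE cluster that hosts other workloads.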
