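Submit an Apache Spark batch workload

Learn how to submit a batch workload on compute infrastructure managed by Google Cloud Serverless for Apache Spark, which scales resources as needed.

Before you begin

Set up your project and, if needed, grant Identity and Access Management roles.

Set up your project

Perform one or more of the following steps as needed: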

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

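Grant IAM roles if needed

Certain IAM roles are required to run the examples on this page. Depending on organization policies, these roles may have already been granted. To check role grants, see "Do you need to grant roles?".

For more information about granting roles, see "Manage access to projects, folders, and organizations".

User roles

To get the permissions that you need to submit a serverless batch workload, ask your administrator to grant you the following IAM roles:

Dataproc Editor (roles/dataproc.editor) on the project
Service Account User (roles/iam.serviceAccountUser) on the Compute Engine default service account

Service account role

To ensure that the Compute Engine default service account has the necessary permissions to submit a serverless batch workload, ask your administrator to grant the Compute Engine default service account the Dataproc Worker (roles/dataproc.worker) IAM role on the project.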

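Submit a Spark batch workload

You can use the Google Cloud console, the Google Cloud CLI, or the Dataproc API to create and submit a Serverless for Apache Spark batch workload.

Console

In the Google Cloud console, go to Dataproc Batches.
Click Create.
Select and fill in the following fields to submit a Spark batch workload that computes the approximate value of pi:
- Batch Info:
  - Batch ID: Specify an ID for the batch workload. This value must be 4 to 63 lowercase characters. Valid characters are /[a-z][0-9]-/.
  - Region: Select the region where the workload will run.
- Container:
  - Batch type: Spark.
  - Runtime version: Confirm or select the 2.3 runtime version.
  - Main class: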
```
org.apache.spark.examples.SparkPi
```
  - JAR files (this file is preinstalled in the Serverless for Apache Spark execution environment).
```
file:///usr/lib/spark/examples/jars/spark-examples.jar
```
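  - Arguments: 1000.
- Execution Configuration: Select Service account. By default, the batch runs using the Compute Engine default service account. You can specify a custom service account. The default or custom service account must have the Dataproc Worker role.
- Network configuration: Select a subnetwork in the session region. Serverless for Apache Spark enables Private Google Access (PGA) on the specified subnet. For network connectivity requirements, see "Google Cloud Serverless for Apache Spark network configuration".
- Properties: Enter the Key (property name) and Value of each supported Spark property to set on your Spark batch workload. Note: Unlike Dataproc on Compute Engine cluster properties, Serverless for Apache Spark workload properties don't include a spark: prefix.
- Other options:
  - You can configure the batch workload to use an external self-managed Hive Metastore.
  - You can use a Persistent History Server (PHS). The PHS must be located in the region where you run batch workloads.
Click Submit to run the Spark batch workload.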
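
The Batch ID rule in the console steps above can be checked locally before you submit. The following Python sketch is not part of the product: the function name is hypothetical, and it assumes the rule means 4 to 63 characters drawn from lowercase letters, digits, and hyphens. The service may enforce additional constraints that this check does not cover.

```python
import re

# Assumed reading of the console rule: 4 to 63 characters, each a lowercase
# letter, a digit, or a hyphen. The service may apply further constraints.
BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if the whole string matches the assumed pattern."""
    return BATCH_ID_PATTERN.fullmatch(batch_id) is not None


if __name__ == "__main__":
    # Print each candidate ID next to the result of the local check.
    for candidate in ("spark-pi-001", "pi", "Spark_Pi"):
        print(candidate, is_valid_batch_id(candidate))
```

A local check like this only catches obvious formatting mistakes early; the console remains the authority on whether an ID is accepted.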

gcloud

To submit a Spark batch workload that computes the approximate value of pi, run the following gcloud CLI gcloud dataproc batches submit spark command locally in a terminal window or in Cloud Shell.

gcloud dataproc batches submit spark \
    --region=REGION \
    --version=2.3 \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000

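Replace the following:

REGION: Specify the region where your workload will run.
Other options: You can add gcloud dataproc batches submit spark flags to specify other workload options and Spark properties.
- --jars: The example JAR file is preinstalled in the Spark execution environment. The 1000 command argument passed to the SparkPi workload specifies 1000 iterations of the pi estimation logic (workload input arguments are included after the "-- ").
- --subnet: You can add this flag to specify the name of a subnet in the session region. If you don't specify a subnet, Serverless for Apache Spark selects the default subnet in the session region. Serverless for Apache Spark enables Private Google Access (PGA) on the subnet. For network connectivity requirements, see "Google Cloud Serverless for Apache Spark network configuration".
- --properties: You can add this flag to enter supported Spark properties for your Spark batch workload to use.
- --deps-bucket: You can add this flag to specify a Cloud Storage bucket where Serverless for Apache Spark uploads workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or bucket name. Serverless for Apache Spark uploads the local files to a /dependencies folder in the bucket before running the batch workload. Note: This flag is required if your batch workload references files on your local machine.
- --ttl: You can add the --ttl flag to specify the duration of the batch lifetime. When the workload exceeds this duration, it is unconditionally terminated without waiting for ongoing work to finish. Specify the duration using an s, m, h, or d (seconds, minutes, hours, or days) suffix. The minimum value is 10 minutes (10m), and the maximum value is 14 days (14d).
  - 1.1 or 2.0 runtime batches: If --ttl is not specified for a 1.1 or 2.0 runtime batch workload, the workload runs until it exits naturally (or runs forever if it does not exit).
  - 2.1+ runtime batches: If --ttl is not specified for a 2.1 or later runtime batch workload, it defaults to 4h.
- --service-account: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. Your service account must have the Dataproc Worker role.
- Hive Metastore: The following command configures a batch workload to use an external self-managed Hive Metastore using a standard Spark configuration.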
```
gcloud dataproc batches submit spark \
    --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
    other args ...
```
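- Persistent History Server:
  1. The following command creates a PHS on a single-node Dataproc cluster. The PHS must be located in the region where you run batch workloads, and the Cloud Storage bucket-name must exist.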
```
gcloud dataproc clusters create PHS_CLUSTER_NAME \
    --region=REGION \
    --single-node \
    --enable-component-gateway \
    --properties=spark:spark.history.fs.logDirectory=gs://bucket-name/phs/*/spark-job-history
```
  2. Submit a batch workload, specifying your running Persistent History Server.
```
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --history-server-cluster=projects/project-id/regions/region/clusters/PHS-cluster-name \
    -- 1000
```
- Runtime version: Use the --version flag to specify the Serverless for Apache Spark runtime version for the workload.
```
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    --version=VERSION \
    -- 1000
```
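
The --ttl rules in the list above (suffixes s, m, h, or d, a 10m minimum, a 14d maximum, and a 4h default for 2.1 and later runtimes) can be sketched as a small local check. This is an illustrative Python sketch with hypothetical function names, not gcloud's own parser. It assumes a value is a single whole number followed by one suffix; the CLI may accept other duration forms.

```python
import re
from typing import Optional

# Seconds per unit for the suffixes named in the --ttl description.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
MIN_TTL_SECONDS = 10 * 60      # 10m
MAX_TTL_SECONDS = 14 * 86400   # 14d


def ttl_to_seconds(ttl: str) -> int:
    """Convert a value such as '90m' to seconds, enforcing the stated range."""
    match = re.fullmatch(r"(\d+)([smhd])", ttl)
    if match is None:
        raise ValueError(f"expected digits followed by s, m, h, or d: {ttl!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if not MIN_TTL_SECONDS <= seconds <= MAX_TTL_SECONDS:
        raise ValueError(f"{ttl!r} is outside the 10m to 14d range")
    return seconds


def effective_ttl_seconds(ttl: Optional[str], runtime_version: str) -> Optional[int]:
    """Return the TTL in seconds, or None when the batch has no time limit."""
    if ttl is not None:
        return ttl_to_seconds(ttl)
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    # 2.1 and later default to 4h; 1.1 and 2.0 run until the workload exits.
    return 4 * 3600 if (major, minor) >= (2, 1) else None
```

For example, effective_ttl_seconds(None, "2.3") returns 14400 (the 4h default), while effective_ttl_seconds(None, "2.0") returns None because those runtimes have no default limit.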

API

This section shows how to create a batch workload that computes the approximate value of pi using the Serverless for Apache Spark batches.create API.

Before using any of the request data, make the following replacements:

project-id: Your Google Cloud project ID. Project IDs are listed in the Project info section of the Google Cloud console Dashboard.
region: The Compute Engine region where Google Cloud Serverless for Apache Spark runs the workload.

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches

Request JSON body:

{
  "sparkBatch":{
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ],
    "mainClass":"org.apache.spark.examples.SparkPi"
  },
  "runtimeConfig":{
    "version":"2.3"
  }
}
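
If you prefer to generate the request body rather than write it by hand, the following Python sketch builds the same SparkPi request and the endpoint URL. The helper names are hypothetical. The field names and URL pattern are taken from this page, and runtimeConfig is placed next to sparkBatch, matching where it appears in the response shown later on this page. Treat it as a sketch, not an official client.

```python
import json

# Endpoint root as used in the HTTP method and URL on this page.
API_ROOT = "https://dataproc.googleapis.com/v1"


def batches_url(project_id: str, region: str) -> str:
    """Build the batches collection URL for a project and region."""
    return f"{API_ROOT}/projects/{project_id}/locations/{region}/batches"


def spark_pi_request(version: str = "2.3", iterations: int = 1000) -> dict:
    """Build a request body for the SparkPi example (hypothetical helper)."""
    return {
        "sparkBatch": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": [
                "file:///usr/lib/spark/examples/jars/spark-examples.jar"
            ],
            # Workload arguments are sent as strings.
            "args": [str(iterations)],
        },
        "runtimeConfig": {"version": version},
    }


if __name__ == "__main__":
    # Write the body to request.json for use with the commands on this page.
    with open("request.json", "w", encoding="utf-8") as handle:
        json.dump(spark_pi_request(), handle, indent=2)
    print("POST", batches_url("project-id", "region"))
```

Serializing with json.dump avoids hand-editing mistakes such as trailing commas, which make a request body invalid JSON.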

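To send your request, expand one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: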

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches"

PowerShell (Windows)

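Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and execute the following command: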

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name":"projects/project-id/locations/region/batches/batch-id",
  "uuid":"uuid",
  "createTime":"2021-07-22T17:03:46.393957Z",
  "sparkBatch":{
    "mainClass":"org.apache.spark.examples.SparkPi",
    "args":[
      "1000"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ]
  },
  "runtimeInfo":{
    "outputUri":"gs://dataproc-.../driveroutput"
  },
  "state":"SUCCEEDED",
  "stateTime":"2021-07-22T17:06:30.301789Z",
  "creator":"account-email-address",
  "runtimeConfig":{
    "version":"2.3",
    "properties":{
      "spark:spark.executor.instances":"2",
      "spark:spark.driver.cores":"2",
      "spark:spark.executor.cores":"2",
      "spark:spark.app.name":"projects/project-id/locations/region/batches/batch-id"
    }
  },
  "environmentConfig":{
    "peripheralsConfig":{
      "sparkHistoryServerConfig":{
      }
    }
  },
  "operation":"projects/project-id/regions/region/operation-id"
}
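
A response like the one above can be reduced to the few fields you usually care about. The following Python sketch is illustrative only: the function name is hypothetical, and the set of final states is an assumption based on the state value shown in the sample response; check the Batch resource reference for the full list.

```python
import json

# States treated as final in this sketch (an assumption, not taken from this page).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def summarize_batch(response_text: str) -> dict:
    """Extract the batch ID, state, and output location from a Batch response."""
    batch = json.loads(response_text)
    return {
        # The batch ID is the last segment of the resource name.
        "batch_id": batch["name"].rsplit("/", 1)[-1],
        "state": batch["state"],
        "finished": batch["state"] in TERMINAL_STATES,
        # Optional sections are read defensively because they may be absent.
        "output_uri": batch.get("runtimeInfo", {}).get("outputUri"),
        "runtime_version": batch.get("runtimeConfig", {}).get("version"),
    }
```

Reading optional sections with get keeps the helper usable on responses for batches that are still pending and have not produced output yet.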

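Estimate workload costs

Serverless for Apache Spark workloads consume Data Compute Unit (DCU) and shuffle storage resources. For an example that outputs Dataproc UsageMetrics to estimate workload resource consumption and costs, see "Serverless for Apache Spark pricing".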

What's next

Learn about:
