"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Cloud Spanner to Cloud Storage template

Use the Managed Service for Apache Spark Cloud Spanner to Cloud Storage template to extract data from a Spanner database to Cloud Storage.

Use the template

Run the template using the gcloud CLI or the Managed Service for Apache Spark API.

gcloud

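Before using any of the command data below, make the following replacements:

PROJECT_ID: Required. Your Google Cloud project ID, as listed in the IAM Settings.
REGION: Required. The Compute Engine region.
SUBNET: Optional. If you don't specify a subnet, the subnet in the specified REGION of the default network is selected.
Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example 2023-03-17_v0.1.0-beta (visit gs://templates-binaries or run gcloud storage ls gs://templates-binaries to list the available template versions).
INSTANCE: Required. The Spanner instance ID.
DATABASE: Required. The Spanner database ID.
TABLE: Required. The Spanner input table name, or a SQL query to run against the Spanner input table.
Example (the SQL query must be enclosed in parentheses): (select * from TABLE)
SPANNER_JDBC_DIALECT: Required. The Spanner JDBC dialect. Options: googlesql or postgresql. Defaults to googlesql.
CLOUD_STORAGE_OUTPUT_PATH: Required. The Cloud Storage path where the output is stored.
Example: gs://example-bucket/example-folder/
FORMAT: Required. The output data format. Options: avro, parquet, csv, or json. Note: If you choose avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field.
Example (the file:// prefix refers to a Managed Service for Apache Spark JAR file):
--jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
MODE: Required. The write mode for the Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
NUM_PARTITIONS: Optional. The maximum number of partitions that can be used for parallelism in table reads and writes.
INPUT_PARTITION_COLUMN, LOWERBOUND, UPPERBOUND: Optional. If you use any of these parameters, you must specify all of them:
- INPUT_PARTITION_COLUMN: The name of the partition column in the Spanner input table.
- LOWERBOUND: The lower bound of the partition column range in the Spanner input table, used to determine the partition stride.
- UPPERBOUND: The upper bound of the partition column range in the Spanner input table, used to determine the partition stride.
TEMP_VIEW and TEMP_QUERY: Optional. Use these two parameters to apply a Spark SQL transformation while loading data into Cloud Storage. TEMP_VIEW must match the table name used in the query, and TEMP_QUERY is the query statement.
SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
PROPERTY and PROPERTY_VALUE: Optional. A comma-separated list of Spark property=value pairs.
LABEL and LABEL_VALUE: Optional. A comma-separated list of label=value pairs.
LOG_LEVEL: Optional. The logging level. One of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If no key is specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.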
Example: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
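
The role of LOWERBOUND, UPPERBOUND, and NUM_PARTITIONS can be illustrated with a short Python sketch. It assumes the template passes these values to Spark's JDBC reader, where the bounds only set the partition stride and do not filter rows; this page does not state that explicitly. The stride arithmetic is a simplification and not Spark's exact code, and the function name partition_predicates is hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE predicate per partition (simplified illustration)."""
    if num_partitions <= 1 or upper_bound <= lower_bound:
        return [None]  # a single partition reads the whole table

    # Width of each partition's value range; at least 1 to avoid empty ranges.
    stride = max((upper_bound - lower_bound) // num_partitions, 1)

    predicates = []
    boundary = lower_bound
    for i in range(num_partitions):
        low = None if i == 0 else boundary
        boundary += stride
        high = None if i == num_partitions - 1 else boundary
        if low is None:
            # First partition is open-ended below and also picks up NULLs.
            predicates.append(f"{column} < {high} OR {column} IS NULL")
        elif high is None:
            # Last partition is open-ended above, so no rows are dropped.
            predicates.append(f"{column} >= {low}")
        else:
            predicates.append(f"{column} >= {low} AND {column} < {high}")
    return predicates


if __name__ == "__main__":
    for predicate in partition_predicates("id", 0, 100, 4):
        print(predicate)
```

Under these assumptions, rows below LOWERBOUND or above UPPERBOUND still land in the first or last partition, so bounds that are far from the real data range tend to produce skewed partitions rather than missing rows.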

Run the following command:

Linux, macOS, or Cloud Shell

gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --version="1.2" \
    --project="PROJECT_ID" \
    --region="REGION" \
    --jars="gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar" \
    --subnet="SUBNET" \
    --kms-key="KMS_KEY" \
    --service-account="SERVICE_ACCOUNT" \
    --properties="PROPERTY=PROPERTY_VALUE" \
    --labels="LABEL=LABEL_VALUE" \
    -- --template=SPANNERTOGCS \
    --templateProperty log.level="LOG_LEVEL" \
    --templateProperty project.id="PROJECT_ID" \
    --templateProperty spanner.gcs.input.spanner.id="INSTANCE" \
    --templateProperty spanner.gcs.input.database.id="DATABASE" \
    --templateProperty spanner.gcs.input.table.id="TABLE" \
    --templateProperty spanner.gcs.output.gcs.path="CLOUD_STORAGE_OUTPUT_PATH" \
    --templateProperty spanner.gcs.output.gcs.saveMode="MODE" \
    --templateProperty spanner.gcs.output.gcs.format="FORMAT" \
    --templateProperty spanner.gcs.input.sql.partitionColumn="INPUT_PARTITION_COLUMN" \
    --templateProperty spanner.gcs.input.sql.lowerBound="LOWERBOUND" \
    --templateProperty spanner.gcs.input.sql.upperBound="UPPERBOUND" \
    --templateProperty spanner.gcs.input.sql.numPartitions="NUM_PARTITIONS" \
    --templateProperty spanner.gcs.temp.table="TEMP_VIEW" \
    --templateProperty spanner.gcs.temp.query="TEMP_QUERY" \
    --templateProperty spanner.jdbc.dialect="SPANNER_JDBC_DIALECT"

Windows (PowerShell)

gcloud dataproc batches submit spark `
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
    --version="1.2" `
    --project="PROJECT_ID" `
    --region="REGION" `
    --jars="gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar" `
    --subnet="SUBNET" `
    --kms-key="KMS_KEY" `
    --service-account="SERVICE_ACCOUNT" `
    --properties="PROPERTY=PROPERTY_VALUE" `
    --labels="LABEL=LABEL_VALUE" `
    -- --template=SPANNERTOGCS `
    --templateProperty log.level="LOG_LEVEL" `
    --templateProperty project.id="PROJECT_ID" `
    --templateProperty spanner.gcs.input.spanner.id="INSTANCE" `
    --templateProperty spanner.gcs.input.database.id="DATABASE" `
    --templateProperty spanner.gcs.input.table.id="TABLE" `
    --templateProperty spanner.gcs.output.gcs.path="CLOUD_STORAGE_OUTPUT_PATH" `
    --templateProperty spanner.gcs.output.gcs.saveMode="MODE" `
    --templateProperty spanner.gcs.output.gcs.format="FORMAT" `
    --templateProperty spanner.gcs.input.sql.partitionColumn="INPUT_PARTITION_COLUMN" `
    --templateProperty spanner.gcs.input.sql.lowerBound="LOWERBOUND" `
    --templateProperty spanner.gcs.input.sql.upperBound="UPPERBOUND" `
    --templateProperty spanner.gcs.input.sql.numPartitions="NUM_PARTITIONS" `
    --templateProperty spanner.gcs.temp.table="TEMP_VIEW" `
    --templateProperty spanner.gcs.temp.query="TEMP_QUERY" `
    --templateProperty spanner.jdbc.dialect="SPANNER_JDBC_DIALECT"

Windows (cmd.exe)

gcloud dataproc batches submit spark ^
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
    --version="1.2" ^
    --project="PROJECT_ID" ^
    --region="REGION" ^
    --jars="gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar" ^
    --subnet="SUBNET" ^
    --kms-key="KMS_KEY" ^
    --service-account="SERVICE_ACCOUNT" ^
    --properties="PROPERTY=PROPERTY_VALUE" ^
    --labels="LABEL=LABEL_VALUE" ^
    -- --template=SPANNERTOGCS ^
    --templateProperty log.level="LOG_LEVEL" ^
    --templateProperty project.id="PROJECT_ID" ^
    --templateProperty spanner.gcs.input.spanner.id="INSTANCE" ^
    --templateProperty spanner.gcs.input.database.id="DATABASE" ^
    --templateProperty spanner.gcs.input.table.id="TABLE" ^
    --templateProperty spanner.gcs.output.gcs.path="CLOUD_STORAGE_OUTPUT_PATH" ^
    --templateProperty spanner.gcs.output.gcs.saveMode="MODE" ^
    --templateProperty spanner.gcs.output.gcs.format="FORMAT" ^
    --templateProperty spanner.gcs.input.sql.partitionColumn="INPUT_PARTITION_COLUMN" ^
    --templateProperty spanner.gcs.input.sql.lowerBound="LOWERBOUND" ^
    --templateProperty spanner.gcs.input.sql.upperBound="UPPERBOUND" ^
    --templateProperty spanner.gcs.input.sql.numPartitions="NUM_PARTITIONS" ^
    --templateProperty spanner.gcs.temp.table="TEMP_VIEW" ^
    --templateProperty spanner.gcs.temp.query="TEMP_QUERY" ^
    --templateProperty spanner.jdbc.dialect="SPANNER_JDBC_DIALECT"
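
Because most parameters are optional, a script that assembles the command can leave out any flag that has no value. The following Python sketch shows one way to do that under stated assumptions: build_command, its arguments, and the sample values are hypothetical, while the flags and property keys are the ones documented on this page. The check that the three partition properties are set together reflects the rule in the parameter list.

```python
import shlex

# Partition properties that must be supplied together (see the parameter list).
PARTITION_KEYS = (
    "spanner.gcs.input.sql.partitionColumn",
    "spanner.gcs.input.sql.lowerBound",
    "spanner.gcs.input.sql.upperBound",
)


def build_command(project_id, region, template_version, template_props,
                  gcloud_flags=None):
    """Assemble the gcloud argument list, skipping parameters with no value."""
    props = {k: v for k, v in template_props.items() if v not in (None, "")}

    present = [k for k in PARTITION_KEYS if k in props]
    if present and len(present) != len(PARTITION_KEYS):
        raise ValueError("set all three partition properties or none of them")

    cmd = [
        "gcloud", "dataproc", "batches", "submit", "spark",
        "--class=com.google.cloud.dataproc.templates.main.DataProcTemplate",
        "--version=1.2",
        f"--project={project_id}",
        f"--region={region}",
        f"--jars=gs://templates-binaries/{template_version}/java/templates.jar",
    ]
    # Optional gcloud flags such as subnet, kms-key, service-account, labels.
    for flag, value in (gcloud_flags or {}).items():
        if value:
            cmd.append(f"--{flag}={value}")

    # Everything after "--" is passed to the template itself.
    cmd += ["--", "--template=SPANNERTOGCS"]
    for key, value in props.items():
        cmd += ["--templateProperty", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    example = build_command(
        "my-project", "us-central1", "latest",
        {
            "project.id": "my-project",
            "spanner.gcs.input.spanner.id": "my-instance",
            "spanner.gcs.input.database.id": "my-database",
            "spanner.gcs.input.table.id": "my_table",
            "spanner.gcs.output.gcs.path": "gs://example-bucket/example-folder/",
            "spanner.gcs.output.gcs.saveMode": "overwrite",
            "spanner.gcs.output.gcs.format": "parquet",
            "spanner.jdbc.dialect": "googlesql",
            "spanner.gcs.temp.table": None,  # unset optional value, omitted
        },
        gcloud_flags={"subnet": None, "labels": "team=data"},
    )
    print(shlex.join(example))
```

The list form can be passed directly to subprocess.run without a shell, which avoids the per-shell differences in line continuation and quoting shown in the three variants above.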

REST

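Before using any of the request data, make the following replacements:

PROJECT_ID: Required. Your Google Cloud project ID, as listed in the IAM Settings.
REGION: Required. The Compute Engine region.
SUBNET: Optional. If you don't specify a subnet, the subnet in the specified REGION of the default network is selected.
Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME
TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example 2023-03-17_v0.1.0-beta (visit gs://templates-binaries or run gcloud storage ls gs://templates-binaries to list the available template versions).
INSTANCE: Required. The Spanner instance ID.
DATABASE: Required. The Spanner database ID.
TABLE: Required. The Spanner input table name, or a SQL query to run against the Spanner input table.
Example (the SQL query must be enclosed in parentheses): (select * from TABLE)
SPANNER_JDBC_DIALECT: Required. The Spanner JDBC dialect. Options: googlesql or postgresql. Defaults to googlesql.
CLOUD_STORAGE_OUTPUT_PATH: Required. The Cloud Storage path where the output is stored.
Example: gs://example-bucket/example-folder/
FORMAT: Required. The output data format. Options: avro, parquet, csv, or json. Note: If you choose avro, you must add file:///usr/lib/spark/connector/spark-avro.jar to the jars gcloud CLI flag or API field.
Example (the file:// prefix refers to a Managed Service for Apache Spark JAR file):
--jars=file:///usr/lib/spark/connector/spark-avro.jar, [ ... other jars]
MODE: Required. The write mode for the Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
NUM_PARTITIONS: Optional. The maximum number of partitions that can be used for parallelism in table reads and writes.
INPUT_PARTITION_COLUMN, LOWERBOUND, UPPERBOUND: Optional. If you use any of these parameters, you must specify all of them:
- INPUT_PARTITION_COLUMN: The name of the partition column in the Spanner input table.
- LOWERBOUND: The lower bound of the partition column range in the Spanner input table, used to determine the partition stride.
- UPPERBOUND: The upper bound of the partition column range in the Spanner input table, used to determine the partition stride.
TEMP_VIEW and TEMP_QUERY: Optional. Use these two parameters to apply a Spark SQL transformation while loading data into Cloud Storage. TEMP_VIEW must match the table name used in the query, and TEMP_QUERY is the query statement.
SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
PROPERTY and PROPERTY_VALUE: Optional. A comma-separated list of Spark property=value pairs.
LABEL and LABEL_VALUE: Optional. A comma-separated list of label=value pairs.
LOG_LEVEL: Optional. The logging level. One of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If no key is specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.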
Example: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
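
Several of these parameters accept only a fixed set of values, so a quick local check before submitting can catch typos early. The sketch below is a hypothetical helper, not part of the template: the allowed values are copied from the list above, the exact-match comparison is stricter than the service may be, and the rule that TEMP_VIEW and TEMP_QUERY are set together is an assumption based on how the list describes them.

```python
import re

# Allowed values as documented in the parameter list.
ALLOWED = {
    "FORMAT": {"avro", "parquet", "csv", "json"},
    "MODE": {"append", "overwrite", "ignore", "errorifexists"},
    "SPANNER_JDBC_DIALECT": {"googlesql", "postgresql"},
    "LOG_LEVEL": {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"},
}


def check_parameters(params):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    for name, allowed in ALLOWED.items():
        value = params.get(name)
        if value is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")

    view, query = params.get("TEMP_VIEW"), params.get("TEMP_QUERY")
    if bool(view) != bool(query):
        # Assumption: the two parameters are only meaningful as a pair.
        problems.append("TEMP_VIEW and TEMP_QUERY should be set together")
    elif view and not re.search(rf"\b{re.escape(view)}\b", query):
        # The list says TEMP_VIEW must match the table name used in the query.
        problems.append(f"TEMP_QUERY does not reference the view {view!r}")
    return problems


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(check_parameters({
        "FORMAT": "parquet",
        "MODE": "overwrite",
        "TEMP_VIEW": "spanner_rows",
        "TEMP_QUERY": "select id, name from spanner_rows where id > 100",
    }))
```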

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches

Request JSON body:

{
  "environmentConfig":{
    "executionConfig":{
      "subnetworkUri":"SUBNET",
      "kmsKey": "KMS_KEY",
      "serviceAccount": "SERVICE_ACCOUNT"
    }
  },
  "labels": {
    "LABEL": "LABEL_VALUE"
  },
  "runtimeConfig": {
    "version": "1.2",
    "properties": {
      "PROPERTY": "PROPERTY_VALUE"
    }
  },
  "sparkBatch":{
    "mainClass":"com.google.cloud.dataproc.templates.main.DataProcTemplate",
    "args":[
      "--template","SPANNERTOGCS",
      "--templateProperty","log.level=LOG_LEVEL",
      "--templateProperty","project.id=PROJECT_ID",
      "--templateProperty","spanner.gcs.input.spanner.id=INSTANCE",
      "--templateProperty","spanner.gcs.input.database.id=DATABASE",
      "--templateProperty","spanner.gcs.input.table.id=TABLE",
      "--templateProperty","spanner.gcs.output.gcs.path=CLOUD_STORAGE_OUTPUT_PATH",
      "--templateProperty","spanner.gcs.output.gcs.saveMode=MODE",
      "--templateProperty","spanner.gcs.output.gcs.format=FORMAT",
      "--templateProperty","spanner.gcs.input.sql.partitionColumn=INPUT_PARTITION_COLUMN",
      "--templateProperty","spanner.gcs.input.sql.lowerBound=LOWERBOUND",
      "--templateProperty","spanner.gcs.input.sql.upperBound=UPPERBOUND",
      "--templateProperty","spanner.gcs.input.sql.numPartitions=NUM_PARTITIONS",
      "--templateProperty","spanner.gcs.temp.table=TEMP_VIEW",
      "--templateProperty","spanner.gcs.temp.query=TEMP_QUERY",
      "--templateProperty","spanner.jdbc.dialect=SPANNER_JDBC_DIALECT"
    ],
    "jarFileUris":[
      "file:///usr/lib/spark/connector/spark-avro.jar", "gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar"
    ]
  }
}
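
The request body can also be generated rather than edited by hand, which keeps each --templateProperty flag and its key=value string as separate array elements and drops optional fields that have no value. The following Python sketch makes a few assumptions: build_batch_request and its arguments are hypothetical names, the field names are the ones in the JSON above, and the Avro JAR is added only when the output format is avro, following the FORMAT note in the parameter list.

```python
import json

AVRO_JAR = "file:///usr/lib/spark/connector/spark-avro.jar"
MAIN_CLASS = "com.google.cloud.dataproc.templates.main.DataProcTemplate"


def build_batch_request(template_version, template_props, subnet=None,
                        kms_key=None, service_account=None, labels=None,
                        spark_properties=None):
    """Build the batches.create request body as a plain dict."""
    # Each flag and its value are separate elements of the args array.
    args = ["--template", "SPANNERTOGCS"]
    for key, value in template_props.items():
        if value not in (None, ""):
            args += ["--templateProperty", f"{key}={value}"]

    jars = [f"gs://templates-binaries/{template_version}/java/templates.jar"]
    if template_props.get("spanner.gcs.output.gcs.format") == "avro":
        jars.insert(0, AVRO_JAR)

    body = {
        "runtimeConfig": {"version": "1.2"},
        "sparkBatch": {"mainClass": MAIN_CLASS, "args": args, "jarFileUris": jars},
    }
    if spark_properties:
        body["runtimeConfig"]["properties"] = dict(spark_properties)
    if labels:
        body["labels"] = dict(labels)

    # Only include execution settings that were actually provided.
    execution = {
        name: value
        for name, value in (
            ("subnetworkUri", subnet),
            ("kmsKey", kms_key),
            ("serviceAccount", service_account),
        )
        if value
    }
    if execution:
        body["environmentConfig"] = {"executionConfig": execution}
    return body


if __name__ == "__main__":
    # Placeholder values for illustration only.
    request = build_batch_request(
        "latest",
        {
            "project.id": "my-project",
            "spanner.gcs.input.spanner.id": "my-instance",
            "spanner.gcs.input.database.id": "my-database",
            "spanner.gcs.input.table.id": "my_table",
            "spanner.gcs.output.gcs.path": "gs://example-bucket/example-folder/",
            "spanner.gcs.output.gcs.saveMode": "overwrite",
            "spanner.gcs.output.gcs.format": "avro",
            "spanner.jdbc.dialect": "googlesql",
        },
        labels={"team": "data"},
    )
    with open("request.json", "w", encoding="utf-8") as handle:
        json.dump(request, handle, indent=2)
```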

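To send your request, choose one of the following options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: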

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"

PowerShell (Windows)

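Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command: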

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
    "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",
    "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",
    "createTime": "2023-02-24T03:31:03.440329Z",
    "operationType": "BATCH",
    "description": "Batch"
  }
}
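
The batch field in the response metadata carries the full resource name of the submitted batch, and its last path segment is the batch ID. A minimal Python sketch for extracting it follows; the function name batch_id_from_operation is hypothetical, and the sample input simply mirrors the response shown above.

```python
import json


def batch_id_from_operation(response_text):
    """Return the batch ID from a batches.create operation response."""
    operation = json.loads(response_text)
    # Full name: projects/PROJECT_ID/locations/REGION/batches/BATCH_ID
    batch_name = operation["metadata"]["batch"]
    return batch_name.rsplit("/", 1)[-1]


if __name__ == "__main__":
    sample = """
    {
      "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
      "metadata": {
        "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
        "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",
        "operationType": "BATCH"
      }
    }
    """
    print(batch_id_from_operation(sample))
```

With the ID in hand, the batch state can then be inspected, for example with gcloud dataproc batches describe BATCH_ID --region=REGION.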