"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

JDBC to Cloud Storage template

Use the Managed Service for Apache Spark JDBC to Cloud Storage template to extract data from JDBC databases to Cloud Storage.

This template supports the following databases as input:

MySQL
PostgreSQL
Microsoft SQL Server
Oracle

Use the template

Run the template using the gcloud CLI or the Managed Service for Apache Spark API.

gcloud

Before using any of the command data below, make the following replacements:

PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
REGION: Required. Compute Engine region.
SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.
Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

JDBC_CONNECTOR_CLOUD_STORAGE_PATH: Required. The full Cloud Storage path, including the filename, where the JDBC connector jar is stored. You can use the following commands to download JDBC connectors for uploading to Cloud Storage:

MySQL:

wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz

PostgreSQL:

wget https://jdbc.postgresql.org/download/postgresql-42.2.6.jar

Microsoft SQL Server:

wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar

Oracle:

wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar
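
Once a connector is downloaded, it still has to be copied to Cloud Storage before its path can serve as JDBC_CONNECTOR_CLOUD_STORAGE_PATH. The following is a minimal sketch of that step and is not part of the template: the helper names connector_gcs_path and upload_connector, the DRY_RUN switch, and the jars/ prefix are assumptions made for illustration. Note that the MySQL download is a .tar.gz archive, so the jar has to be extracted from it first.

```shell
# Hypothetical helpers. The parentheses run each function in a subshell,
# so its variables stay local.

# Print the gs:// path a local jar would have after upload.
# Usage: connector_gcs_path BUCKET LOCAL_JAR
connector_gcs_path() (
  bucket="$1"; jar="$2"
  printf 'gs://%s/jars/%s\n' "$bucket" "$(basename "$jar")"
)

# Copy the jar to Cloud Storage with "gcloud storage cp".
# Set DRY_RUN=1 to print the command instead of running it.
# Usage: upload_connector BUCKET LOCAL_JAR
upload_connector() (
  bucket="$1"; jar="$2"
  dest="$(connector_gcs_path "$bucket" "$jar")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gcloud storage cp $jar $dest"
  else
    gcloud storage cp "$jar" "$dest"
  fi
)
```

For example, `DRY_RUN=1 upload_connector my-bucket postgresql-42.2.6.jar` prints the copy command without running it; my-bucket is a placeholder bucket name. The path printed by connector_gcs_path is the value to use for JDBC_CONNECTOR_CLOUD_STORAGE_PATH.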

The following variables are used to construct the required JDBC_CONNECTION_URL:

JDBC_HOST
JDBC_PORT
JDBC_DATABASE, or, for Oracle, JDBC_SERVICE
JDBC_USERNAME
JDBC_PASSWORD

Construct the JDBC_CONNECTION_URL using one of the following connector-specific formats:

MySQL:

jdbc:mysql://JDBC_HOST:JDBC_PORT/JDBC_DATABASE?user=JDBC_USERNAME&password=JDBC_PASSWORD

PostgreSQL:

jdbc:postgresql://JDBC_HOST:JDBC_PORT/JDBC_DATABASE?user=JDBC_USERNAME&password=JDBC_PASSWORD

Microsoft SQL Server:

jdbc:sqlserver://JDBC_HOST:JDBC_PORT;databaseName=JDBC_DATABASE;user=JDBC_USERNAME;password=JDBC_PASSWORD

Oracle:

jdbc:oracle:thin:@//JDBC_HOST:JDBC_PORT/JDBC_SERVICE?user=JDBC_USERNAME&password=
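
The formats above are plain string templates, so the URL can be assembled from its parts. As a rough sketch, a hypothetical shell helper (jdbc_url is not part of the template or of gcloud) might look like the following. It covers only the MySQL, PostgreSQL, and SQL Server formats, leaves Oracle out, and assumes the values contain no characters that need escaping.

```shell
# Hypothetical helper: print a JDBC connection URL built from its parts.
# Usage: jdbc_url DB_TYPE HOST PORT DATABASE USERNAME PASSWORD
# The parentheses run the function in a subshell, so variables stay local.
jdbc_url() (
  db_type="$1"; host="$2"; port="$3"; database="$4"; user="$5"; password="$6"
  case "$db_type" in
    mysql|postgresql)
      # Both connectors use the same query-string layout.
      printf 'jdbc:%s://%s:%s/%s?user=%s&password=%s\n' \
        "$db_type" "$host" "$port" "$database" "$user" "$password" ;;
    sqlserver)
      # SQL Server separates properties with semicolons.
      printf 'jdbc:sqlserver://%s:%s;databaseName=%s;user=%s;password=%s\n' \
        "$host" "$port" "$database" "$user" "$password" ;;
    *)
      echo "unsupported database type: $db_type" >&2
      exit 1 ;;
  esac
)
```

Because the result contains `&` or `;`, keep it quoted when passing it to gcloud, as the commands on this page do.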

DRIVER: Required. The JDBC driver used for the connection:

MySQL:
```
com.mysql.cj.jdbc.Driver
```
PostgreSQL:
```
org.postgresql.Driver
```

Microsoft SQL Server:

com.microsoft.sqlserver.jdbc.SQLServerDriver

Oracle:

oracle.jdbc.driver.OracleDriver
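
Because each driver class corresponds to one URL prefix, DRIVER can be derived from JDBC_CONNECTION_URL instead of being set separately. The helper below is a sketch under that assumption; driver_for_url is a hypothetical name, and the class names are the ones listed above.

```shell
# Hypothetical helper: print the driver class that matches a JDBC URL prefix.
# Usage: driver_for_url JDBC_CONNECTION_URL
driver_for_url() (
  case "$1" in
    jdbc:mysql:*)      echo 'com.mysql.cj.jdbc.Driver' ;;
    jdbc:postgresql:*) echo 'org.postgresql.Driver' ;;
    jdbc:sqlserver:*)  echo 'com.microsoft.sqlserver.jdbc.SQLServerDriver' ;;
    jdbc:oracle:*)     echo 'oracle.jdbc.driver.OracleDriver' ;;
    *)
      echo "no known driver for URL: $1" >&2
      exit 1 ;;
  esac
)
```

Deriving the class this way keeps the URL and the driver from drifting apart when the source database changes.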

FORMAT: Required. Output data format. Options: avro, parquet, csv, or json. Default: avro. Note: If you use avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.
Example (the file:// prefix references a Managed Service for Apache Spark jar file):
--jars=file:///usr/lib/spark/connector/spark-avro.jar, [, ... other jars]
MODE: Required. Write mode for the Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://templates-binaries or run gcloud storage ls gs://templates-binaries to list the available template versions).
CLOUD_STORAGE_OUTPUT_PATH: Required. Cloud Storage path where the output is stored.
Example: gs://templates/jdbc_to_cloud_storage_output
LOG_LEVEL: Optional. Logging level. One of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
INPUT_PARTITION_COLUMN, LOWERBOUND, UPPERBOUND, NUM_PARTITIONS: Optional. If used, all of the following parameters must be specified:
- INPUT_PARTITION_COLUMN: Name of the partition column of the JDBC input table.
- LOWERBOUND: Lower bound of the JDBC input table partition column, used to determine the partition stride.
- UPPERBOUND: Upper bound of the JDBC input table partition column, used to determine the partition stride.
- NUM_PARTITIONS: The maximum number of partitions that can be used for parallel table reads and writes. If specified, this value is used for the JDBC input and output connections. Default: 10.
OUTPUT_PARTITION_COLUMN: Optional. Name of the output partition column.
FETCHSIZE: Optional. Number of rows to fetch per round trip. Default: 10.
QUERY or QUERY_FILE: Required. Set either QUERY or QUERY_FILE to specify the query used to extract data from JDBC.
TEMP_VIEW and TEMP_QUERY: Optional. You can use these two optional parameters to apply a Spark SQL transformation while loading data into Cloud Storage. TEMP_VIEW must match the table name used in the query, and TEMP_QUERY is the query statement.
SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.

LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.

JDBC_SESSION_INIT: Optional. Session initialization statement used by the Java template when reading from JDBC.

KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.
Example: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
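
To make the partitioning parameters more concrete: Spark's JDBC source uses LOWERBOUND, UPPERBOUND, and NUM_PARTITIONS only to compute a stride that splits INPUT_PARTITION_COLUMN into ranges; the bounds do not filter rows. The sketch below approximates that split with integer arithmetic. The function name partition_ranges is hypothetical, and Spark's actual implementation handles rounding and non-integer column types that this simplification ignores.

```shell
# Hypothetical helper: print an approximate row range for each partition.
# Usage: partition_ranges COLUMN LOWERBOUND UPPERBOUND NUM_PARTITIONS
# Assumes integer bounds with UPPERBOUND - LOWERBOUND >= NUM_PARTITIONS.
partition_ranges() (
  col="$1"; lower="$2"; upper="$3"; n="$4"
  stride=$(( upper / n - lower / n ))
  i=0
  cur="$lower"
  while [ "$i" -lt "$n" ]; do
    next=$(( cur + stride ))
    if [ "$n" -eq 1 ]; then
      echo "partition $i: all rows"
    elif [ "$i" -eq 0 ]; then
      # The first partition also picks up NULLs and values below LOWERBOUND.
      echo "partition $i: $col < $next OR $col IS NULL"
    elif [ "$i" -eq $(( n - 1 )) ]; then
      # The last partition is open-ended, so values above UPPERBOUND land here.
      echo "partition $i: $col >= $cur"
    else
      echo "partition $i: $col >= $cur AND $col < $next"
    fi
    cur="$next"
    i=$(( i + 1 ))
  done
)
```

For example, `partition_ranges id 0 100 4` uses a stride of 25 and prints four ranges: rows below 25 (and NULLs) fall in the first partition, and rows at or above 75 fall in the last. Bounds that are far from the real minimum and maximum of the column therefore produce skewed partitions rather than missing rows.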

Run the following command:

Linux, macOS, or Cloud Shell

gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --project="PROJECT_ID" \
    --region="REGION" \
    --version="1.2" \
    --jars="gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar,JDBC_CONNECTOR_CLOUD_STORAGE_PATH" \
    --subnet="SUBNET" \
    --kms-key="KMS_KEY" \
    --service-account="SERVICE_ACCOUNT" \
    --properties="PROPERTY=PROPERTY_VALUE" \
    --labels="LABEL=LABEL_VALUE" \
    -- --template=JDBCTOGCS \
    --templateProperty project.id="PROJECT_ID" \
    --templateProperty log.level="LOG_LEVEL" \
    --templateProperty jdbctogcs.jdbc.url="JDBC_CONNECTION_URL" \
    --templateProperty jdbctogcs.jdbc.driver.class.name="DRIVER" \
    --templateProperty jdbctogcs.output.format="FORMAT" \
    --templateProperty jdbctogcs.output.location="CLOUD_STORAGE_OUTPUT_PATH" \
    --templateProperty jdbctogcs.sql="QUERY" \
    --templateProperty jdbctogcs.sql.file="QUERY_FILE" \
    --templateProperty jdbctogcs.sql.partitionColumn="INPUT_PARTITION_COLUMN" \
    --templateProperty jdbctogcs.sql.lowerBound="LOWERBOUND" \
    --templateProperty jdbctogcs.sql.upperBound="UPPERBOUND" \
    --templateProperty jdbctogcs.jdbc.fetchsize="FETCHSIZE" \
    --templateProperty jdbctogcs.sql.numPartitions="NUM_PARTITIONS" \
    --templateProperty jdbctogcs.write.mode="MODE" \
    --templateProperty jdbctogcs.output.partition.col="OUTPUT_PARTITION_COLUMN" \
    --templateProperty jdbctogcs.temp.table="TEMP_VIEW" \
    --templateProperty jdbctogcs.temp.query="TEMP_QUERY"
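
The command above passes every optional placeholder. When some optional values are not needed, one possible approach is to generate the --templateProperty arguments only for values that are set. The following sketch assumes the placeholders are exported as environment variables of the same name; filter_props and run_batch are hypothetical helper names, and the example call shows only a subset of the properties.

```shell
# Call RUNNER with "--templateProperty KEY=VALUE" for each pair whose VALUE
# is non-empty. Usage: filter_props RUNNER KEY=VALUE [KEY=VALUE ...]
# The parentheses run the function in a subshell, so variables stay local.
filter_props() (
  runner="$1"; shift
  count="$#"
  while [ "$count" -gt 0 ]; do
    pair="$1"; shift
    count=$(( count - 1 ))
    # ${pair#*=} is the text after the first "=", that is, the value.
    if [ -n "${pair#*=}" ]; then
      set -- "$@" --templateProperty "$pair"
    fi
  done
  "$runner" "$@"
)

# Submit the batch with the flags used on this page. The avro connector jar
# is appended to --jars only when FORMAT is avro.
run_batch() (
  jars="gs://templates-binaries/${TEMPLATE_VERSION}/java/templates.jar,${JDBC_CONNECTOR_CLOUD_STORAGE_PATH}"
  if [ "${FORMAT}" = "avro" ]; then
    jars="${jars},file:///usr/lib/spark/connector/spark-avro.jar"
  fi
  gcloud dataproc batches submit spark \
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate \
    --project="${PROJECT_ID}" \
    --region="${REGION}" \
    --version="1.2" \
    --jars="${jars}" \
    -- --template=JDBCTOGCS "$@"
)

# Example call (not executed here):
# filter_props run_batch \
#   "project.id=${PROJECT_ID}" \
#   "jdbctogcs.jdbc.url=${JDBC_CONNECTION_URL}" \
#   "jdbctogcs.jdbc.driver.class.name=${DRIVER}" \
#   "jdbctogcs.output.format=${FORMAT}" \
#   "jdbctogcs.output.location=${CLOUD_STORAGE_OUTPUT_PATH}" \
#   "jdbctogcs.write.mode=${MODE}" \
#   "jdbctogcs.sql=${QUERY}" \
#   "jdbctogcs.output.partition.col=${OUTPUT_PARTITION_COLUMN}"
```

Passing `echo` as the runner, as in `filter_props echo a=1 b= c=3`, prints the arguments that would be sent, which is a cheap way to check the result before submitting a batch.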

Windows (PowerShell)

gcloud dataproc batches submit spark `
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate `
    --project="PROJECT_ID" `
    --region="REGION" `
    --version="1.2" `
    --jars="gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar,JDBC_CONNECTOR_CLOUD_STORAGE_PATH" `
    --subnet="SUBNET" `
    --kms-key="KMS_KEY" `
    --service-account="SERVICE_ACCOUNT" `
    --properties="PROPERTY=PROPERTY_VALUE" `
    --labels="LABEL=LABEL_VALUE" `
    -- --template=JDBCTOGCS `
    --templateProperty project.id="PROJECT_ID" `
    --templateProperty log.level="LOG_LEVEL" `
    --templateProperty jdbctogcs.jdbc.url="JDBC_CONNECTION_URL" `
    --templateProperty jdbctogcs.jdbc.driver.class.name="DRIVER" `
    --templateProperty jdbctogcs.output.format="FORMAT" `
    --templateProperty jdbctogcs.output.location="CLOUD_STORAGE_OUTPUT_PATH" `
    --templateProperty jdbctogcs.sql="QUERY" `
    --templateProperty jdbctogcs.sql.file="QUERY_FILE" `
    --templateProperty jdbctogcs.sql.partitionColumn="INPUT_PARTITION_COLUMN" `
    --templateProperty jdbctogcs.sql.lowerBound="LOWERBOUND" `
    --templateProperty jdbctogcs.sql.upperBound="UPPERBOUND" `
    --templateProperty jdbctogcs.jdbc.fetchsize="FETCHSIZE" `
    --templateProperty jdbctogcs.sql.numPartitions="NUM_PARTITIONS" `
    --templateProperty jdbctogcs.write.mode="MODE" `
    --templateProperty jdbctogcs.output.partition.col="OUTPUT_PARTITION_COLUMN" `
    --templateProperty jdbctogcs.temp.table="TEMP_VIEW" `
    --templateProperty jdbctogcs.temp.query="TEMP_QUERY"

Windows (cmd.exe)

gcloud dataproc batches submit spark ^
    --class=com.google.cloud.dataproc.templates.main.DataProcTemplate ^
    --project="PROJECT_ID" ^
    --region="REGION" ^
    --version="1.2" ^
    --jars="gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar,JDBC_CONNECTOR_CLOUD_STORAGE_PATH" ^
    --subnet="SUBNET" ^
    --kms-key="KMS_KEY" ^
    --service-account="SERVICE_ACCOUNT" ^
    --properties="PROPERTY=PROPERTY_VALUE" ^
    --labels="LABEL=LABEL_VALUE" ^
    -- --template=JDBCTOGCS ^
    --templateProperty project.id="PROJECT_ID" ^
    --templateProperty log.level="LOG_LEVEL" ^
    --templateProperty jdbctogcs.jdbc.url="JDBC_CONNECTION_URL" ^
    --templateProperty jdbctogcs.jdbc.driver.class.name="DRIVER" ^
    --templateProperty jdbctogcs.output.format="FORMAT" ^
    --templateProperty jdbctogcs.output.location="CLOUD_STORAGE_OUTPUT_PATH" ^
    --templateProperty jdbctogcs.sql="QUERY" ^
    --templateProperty jdbctogcs.sql.file="QUERY_FILE" ^
    --templateProperty jdbctogcs.sql.partitionColumn="INPUT_PARTITION_COLUMN" ^
    --templateProperty jdbctogcs.sql.lowerBound="LOWERBOUND" ^
    --templateProperty jdbctogcs.sql.upperBound="UPPERBOUND" ^
    --templateProperty jdbctogcs.jdbc.fetchsize="FETCHSIZE" ^
    --templateProperty jdbctogcs.sql.numPartitions="NUM_PARTITIONS" ^
    --templateProperty jdbctogcs.write.mode="MODE" ^
    --templateProperty jdbctogcs.output.partition.col="OUTPUT_PARTITION_COLUMN" ^
    --templateProperty jdbctogcs.temp.table="TEMP_VIEW" ^
    --templateProperty jdbctogcs.temp.query="TEMP_QUERY"

REST

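Before using any of the request data, make the following replacements:

PROJECT_ID: Required. Your Google Cloud project ID listed in the IAM Settings.
REGION: Required. Compute Engine region.
SUBNET: Optional. If a subnet is not specified, the subnet in the specified REGION in the default network is selected.
Example: projects/PROJECT_ID/regions/REGION/subnetworks/SUBNET_NAME

JDBC_CONNECTOR_CLOUD_STORAGE_PATH: Required. The full Cloud Storage path, including the filename, where the JDBC connector jar is stored. You can use the following commands to download JDBC connectors for uploading to Cloud Storage: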

MySQL:

wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.30.tar.gz

PostgreSQL:

wget https://jdbc.postgresql.org/download/postgresql-42.2.6.jar

Microsoft SQL Server:

wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar

Oracle:

wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar

The following variables are used to construct the required JDBC_CONNECTION_URL:

JDBC_HOST
JDBC_PORT
JDBC_DATABASE, or, for Oracle, JDBC_SERVICE
JDBC_USERNAME
JDBC_PASSWORD

Construct the JDBC_CONNECTION_URL using one of the following connector-specific formats:

MySQL:

jdbc:mysql://JDBC_HOST:JDBC_PORT/JDBC_DATABASE?user=JDBC_USERNAME&password=JDBC_PASSWORD

PostgreSQL:

jdbc:postgresql://JDBC_HOST:JDBC_PORT/JDBC_DATABASE?user=JDBC_USERNAME&password=JDBC_PASSWORD

Microsoft SQL Server:

jdbc:sqlserver://JDBC_HOST:JDBC_PORT;databaseName=JDBC_DATABASE;user=JDBC_USERNAME;password=JDBC_PASSWORD

Oracle:

jdbc:oracle:thin:@//JDBC_HOST:JDBC_PORT/JDBC_SERVICE?user=JDBC_USERNAME&password=

DRIVER: Required. The JDBC driver used for the connection:

MySQL:
```
com.mysql.cj.jdbc.Driver
```
PostgreSQL:
```
org.postgresql.Driver
```

Microsoft SQL Server:

com.microsoft.sqlserver.jdbc.SQLServerDriver

Oracle:

oracle.jdbc.driver.OracleDriver

FORMAT: Required. Output data format. Options: avro, parquet, csv, or json. Default: avro. Note: If you use avro, you must add "file:///usr/lib/spark/connector/spark-avro.jar" to the jars gcloud CLI flag or API field.
Example (the file:// prefix references a Managed Service for Apache Spark jar file):
--jars=file:///usr/lib/spark/connector/spark-avro.jar, [, ... other jars]
MODE: Required. Write mode for the Cloud Storage output. Options: append, overwrite, ignore, or errorifexists.
TEMPLATE_VERSION: Required. Specify latest for the latest template version, or the date of a specific version, for example, 2023-03-17_v0.1.0-beta (visit gs://templates-binaries or run gcloud storage ls gs://templates-binaries to list the available template versions).
CLOUD_STORAGE_OUTPUT_PATH: Required. Cloud Storage path where the output is stored.
Example: gs://templates/jdbc_to_cloud_storage_output
LOG_LEVEL: Optional. Logging level. One of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, or WARN. Default: INFO.
INPUT_PARTITION_COLUMN, LOWERBOUND, UPPERBOUND, NUM_PARTITIONS: Optional. If used, all of the following parameters must be specified:
- INPUT_PARTITION_COLUMN: Name of the partition column of the JDBC input table.
- LOWERBOUND: Lower bound of the JDBC input table partition column, used to determine the partition stride.
- UPPERBOUND: Upper bound of the JDBC input table partition column, used to determine the partition stride.
- NUM_PARTITIONS: The maximum number of partitions that can be used for parallel table reads and writes. If specified, this value is used for the JDBC input and output connections. Default: 10.
OUTPUT_PARTITION_COLUMN: Optional. Name of the output partition column.
FETCHSIZE: Optional. Number of rows to fetch per round trip. Default: 10.
QUERY or QUERY_FILE: Required. Set either QUERY or QUERY_FILE to specify the query used to extract data from JDBC.
TEMP_VIEW and TEMP_QUERY: Optional. You can use these two optional parameters to apply a Spark SQL transformation while loading data into Cloud Storage. TEMP_VIEW must match the table name used in the query, and TEMP_QUERY is the query statement.
SERVICE_ACCOUNT: Optional. If not provided, the default Compute Engine service account is used.
PROPERTY and PROPERTY_VALUE: Optional. Comma-separated list of Spark property=value pairs.

LABEL and LABEL_VALUE: Optional. Comma-separated list of label=value pairs.

JDBC_SESSION_INIT: Optional. Session initialization statement used by the Java template when reading from JDBC.

KMS_KEY: Optional. The Cloud Key Management Service key to use for encryption. If a key is not specified, data is encrypted at rest using a Google-owned and Google-managed encryption key.
Example: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches

Request JSON body:

{
  "environmentConfig": {
    "executionConfig": {
      "subnetworkUri": "SUBNET",
      "kmsKey": "KMS_KEY",
      "serviceAccount": "SERVICE_ACCOUNT"
    }
  },
  "labels": {
    "LABEL": "LABEL_VALUE"
  },
  "runtimeConfig": {
    "version": "1.2",
    "properties": {
      "PROPERTY": "PROPERTY_VALUE"
    }
  },
  "sparkBatch": {
    "mainClass": "com.google.cloud.dataproc.templates.main.DataProcTemplate",
    "args": [
      "--template=JDBCTOGCS",
      "--templateProperty","log.level=LOG_LEVEL",
      "--templateProperty","project.id=PROJECT_ID",
      "--templateProperty","jdbctogcs.jdbc.url=JDBC_CONNECTION_URL",
      "--templateProperty","jdbctogcs.jdbc.driver.class.name=DRIVER",
      "--templateProperty","jdbctogcs.output.location=CLOUD_STORAGE_OUTPUT_PATH",
      "--templateProperty","jdbctogcs.write.mode=MODE",
      "--templateProperty","jdbctogcs.output.format=FORMAT",
      "--templateProperty","jdbctogcs.sql.numPartitions=NUM_PARTITIONS",
      "--templateProperty","jdbctogcs.jdbc.fetchsize=FETCHSIZE",
      "--templateProperty","jdbctogcs.sql=QUERY",
      "--templateProperty","jdbctogcs.sql.file=QUERY_FILE",
      "--templateProperty","jdbctogcs.sql.partitionColumn=INPUT_PARTITION_COLUMN",
      "--templateProperty","jdbctogcs.sql.lowerBound=LOWERBOUND",
      "--templateProperty","jdbctogcs.sql.upperBound=UPPERBOUND",
      "--templateProperty","jdbctogcs.output.partition.col=OUTPUT_PARTITION_COLUMN",
      "--templateProperty","jdbctogcs.temp.table=TEMP_VIEW",
      "--templateProperty","jdbctogcs.temp.query=TEMP_QUERY",
      "--templateProperty","jdbctogcs.jdbc.sessioninitstatement=JDBC_SESSION_INIT"
    ],
    "jarFileUris": [
      "gs://templates-binaries/TEMPLATE_VERSION/java/templates.jar", "JDBC_CONNECTOR_CLOUD_STORAGE_PATH"
    ]
  }
}

To send your request, use one of the following options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or by using Cloud Shell, which automatically logs you into the gcloud CLI. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command:

curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches"

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. You can check the currently active account by running gcloud auth list.

Save the request body in a file named request.json, and run the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches" | Select-Object -Expand Content

You should receive a JSON response similar to the following:

{
  "name": "projects/PROJECT_ID/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.BatchOperationMetadata",
    "batch": "projects/PROJECT_ID/locations/REGION/batches/BATCH_ID",
    "batchUuid": "de8af8d4-3599-4a7c-915c-798201ed1583",
    "createTime": "2023-02-24T03:31:03.440329Z",
    "operationType": "BATCH",
    "description": "Batch"
  }
}
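
The batch field in this response identifies the submitted batch, which is what later status checks need. As a sketch, the resource name can be pulled out of the response with sed and passed to gcloud. The names batch_name_from_response and batch_state are hypothetical helpers, and the sed pattern assumes the "batch" key and its value sit on one line, as in the response shown above. A JSON-aware tool such as jq would be more robust.

```shell
# Read a create-batch response on stdin and print the "batch" resource name.
# The pattern requires a quote right after batch, so "batchUuid" is skipped.
batch_name_from_response() {
  sed -n 's/.*"batch": *"\([^"]*\)".*/\1/p'
}

# Print the current state of a batch, given its full resource name
# (projects/PROJECT_ID/locations/REGION/batches/BATCH_ID).
# The parentheses run the function in a subshell, so variables stay local.
batch_state() (
  name="$1"
  batch_id="${name##*/}"          # text after the last "/"
  project="${name#projects/}"     # drop the leading "projects/"
  project="${project%%/*}"        # keep the segment before the next "/"
  region="${name#*/locations/}"   # drop everything through "/locations/"
  region="${region%%/*}"          # keep the segment before the next "/"
  gcloud dataproc batches describe "$batch_id" \
    --project="$project" --region="$region" --format='value(state)'
)
```

For example, piping the curl output through batch_name_from_response prints the projects/PROJECT_ID/locations/REGION/batches/BATCH_ID value from the response, which can then be handed to batch_state.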