"Managed Service for Apache Spark" is the new name for the products formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Load data into BigQuery

This document shows how to use Managed Service for Apache Spark to run a Spark job that loads processed data from Cloud Storage into a BigQuery table. Managed Service for Apache Spark simplifies this process by managing the Spark environment and the required connectors.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc, BigQuery, and Cloud Storage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Create a Cloud Storage bucket.
Create a Managed Service for Apache Spark cluster that uses image version 2.1 or later.
Create a BigQuery dataset.
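
If you prefer the command line to the console for these three resources, the following is a minimal sketch that assembles one possible set of CLI commands. The helper name setup_commands and all argument values are hypothetical; the sketch assumes the gcloud and bq tools are installed and authenticated, and it only prints the commands rather than running them.

```python
import shlex


def setup_commands(project_id, bucket_name, cluster_name, region, dataset):
    """Return CLI commands (as argument lists) for the three prerequisites.

    Hypothetical helper for illustration; adjust flags to your environment.
    """
    return [
        # Cloud Storage bucket that holds the script and temporary load files.
        ["gcloud", "storage", "buckets", "create", f"gs://{bucket_name}",
         f"--project={project_id}", f"--location={region}"],
        # Cluster on image version 2.1, as required by this document.
        ["gcloud", "dataproc", "clusters", "create", cluster_name,
         f"--project={project_id}", f"--region={region}",
         "--image-version=2.1"],
        # BigQuery dataset that receives the output table.
        ["bq", "mk", "--dataset", f"{project_id}:{dataset}"],
    ]


if __name__ == "__main__":
    # Example values only; replace them with your own names.
    for cmd in setup_commands("my-project", "my-bucket", "my-cluster",
                              "us-central1", "my_dataset"):
        print(shlex.join(cmd))
```

Review the printed commands before running them, because resource names and locations must match the values you use in the rest of this document.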

Prepare the PySpark script

Create a local Python file named load_analytics_data.py.

Add the following code to the file. The script reads data from a Cloud Storage path, performs an aggregation, and writes the results to BigQuery.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

# --- Configuration ---
gcs_bucket = "BUCKET_NAME"
bq_project = "PROJECT_ID"
bq_dataset = "DATASET"
bq_table = "corpus_word_counts"

# --- Paths ---
processed_path = f"gs://{gcs_bucket}/processed/processed_data"
temp_gcs_path = f"{gcs_bucket}"

# --- Spark Session Initialization ---
spark = SparkSession.builder \
   .appName("Dataproc-BigQuery-Load") \
   .config("spark.hadoop.google.cloud.bigdata.connector.temporary.gcs.bucket", temp_gcs_path) \
   .getOrCreate()

# --- Read Processed Data from Cloud Storage ---
processed_df = spark.read.parquet(processed_path)

# --- Final Aggregation for Analytics-Ready Stage ---
analytics_df = processed_df.groupBy("corpus") \
   .agg(_sum("word_count_int").alias("total_word_count")) \
   .orderBy("corpus")

print("Aggregated Analytics-Ready data:")
analytics_df.show()

# --- Write DataFrame to BigQuery ---
print(f"Writing data to BigQuery table: {bq_dataset}.{bq_table}")

analytics_df.write \
   .format("bigquery") \
   .option("table", f"{bq_project}.{bq_dataset}.{bq_table}") \
   .mode("append") \
   .save()

print("Successfully wrote data to BigQuery.")

# --- Stop Spark Session ---
spark.stop()

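Replace the following placeholders:
- BUCKET_NAME: the name of your Cloud Storage bucket.
- PROJECT_ID: your Google Cloud project ID.
- DATASET: the name of your BigQuery dataset.

Upload the load_analytics_data.py script to the scripts/ folder in your Cloud Storage bucket, which is the path that the submit command in the next section uses.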
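
To check what the aggregation step in the script is expected to produce, you can reproduce it locally without Spark. The following is a minimal sketch in plain Python that mirrors groupBy("corpus"), the sum over word_count_int, and orderBy("corpus"). The function name and the sample rows are hypothetical, and the sketch assumes that word_count_int has no null values.

```python
from collections import defaultdict


def aggregate_word_counts(rows):
    """Sum word_count_int per corpus and return the rows sorted by corpus."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["corpus"]] += row["word_count_int"]
    # sorted() on the (corpus, total) pairs mirrors orderBy("corpus").
    return [
        {"corpus": corpus, "total_word_count": total}
        for corpus, total in sorted(totals.items())
    ]


# Hypothetical sample rows; the real input is the Parquet data in Cloud Storage.
sample_rows = [
    {"corpus": "hamlet", "word_count_int": 5},
    {"corpus": "macbeth", "word_count_int": 3},
    {"corpus": "hamlet", "word_count_int": 2},
]

for result in aggregate_word_counts(sample_rows):
    print(result)
# {'corpus': 'hamlet', 'total_word_count': 7}
# {'corpus': 'macbeth', 'total_word_count': 3}
```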

Submit the Managed Service for Apache Spark job

Submit the PySpark script as a job to your Managed Service for Apache Spark cluster.

In a terminal, run the gcloud dataproc jobs submit pyspark command:

gcloud dataproc jobs submit pyspark gs://BUCKET_NAME/scripts/load_analytics_data.py \
    --cluster=CLUSTER_NAME \
    --region=REGION

Replace the following placeholders:
- BUCKET_NAME: the name of your Cloud Storage bucket.
- CLUSTER_NAME: the name of your Managed Service for Apache Spark cluster.
- REGION: the region where your cluster is located.

This command submits the PySpark job to the Managed Service for Apache Spark service. The cluster fetches the script from the specified Cloud Storage path and runs it.
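
If you submit the job from automation rather than from an interactive terminal, the same command can be wrapped in Python. The following is a minimal sketch: build_submit_command and submit_job are hypothetical helper names, and the sketch assumes that the gcloud CLI is installed and authenticated on the machine that runs it.

```python
import subprocess


def build_submit_command(bucket_name, cluster_name, region,
                         script_path="scripts/load_analytics_data.py"):
    """Build the argument list for gcloud dataproc jobs submit pyspark."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        f"gs://{bucket_name}/{script_path}",
        f"--cluster={cluster_name}",
        f"--region={region}",
    ]


def submit_job(bucket_name, cluster_name, region):
    """Run the submit command and wait for gcloud to exit.

    check=True raises subprocess.CalledProcessError when gcloud exits
    with a non-zero status, so a failed submission is not ignored.
    """
    cmd = build_submit_command(bucket_name, cluster_name, region)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Example values only; this prints the command without running it.
    print(" ".join(build_submit_command("my-bucket", "my-cluster",
                                        "us-central1")))
```

Passing the arguments as a list instead of a single shell string avoids quoting problems when names contain special characters.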

Verify the data load

In the Google Cloud console, go to the Managed Service for Apache Spark Jobs page to monitor the job and view the driver output logs.
After the job finishes, go to the BigQuery page.
In the Explorer panel, find your project and dataset, then select the corpus_word_counts table.
Click the Preview tab to inspect the loaded data.
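
You can also check the table with a query instead of the Preview tab. The following is a minimal sketch: build_preview_query and preview_table are hypothetical helper names, the column names come from the script above, and preview_table assumes that the google-cloud-bigquery client library is installed and that application default credentials are configured.

```python
def build_preview_query(project_id, dataset, table="corpus_word_counts",
                        limit=10):
    """Return a SQL query that reads the first rows of the loaded table."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT corpus, total_word_count "
        f"FROM `{project_id}.{dataset}.{table}` "
        f"ORDER BY corpus LIMIT {limit}"
    )


def preview_table(project_id, dataset):
    """Run the preview query and return the rows as dictionaries."""
    # Imported here so that the query builder works without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    rows = client.query(build_preview_query(project_id, dataset)).result()
    return [dict(row.items()) for row in rows]


if __name__ == "__main__":
    # Example values only; prints the query without contacting BigQuery.
    print(build_preview_query("my-project", "my_dataset"))
```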

How the Spark-BigQuery connector works

The Spark-BigQuery connector lets Spark applications read from and write to BigQuery. On Managed Service for Apache Spark clusters with image version 2.1 or later, the connector is preinstalled.

The connector uses the indirect write method to load data. This method stages the data in Cloud Storage and then hands it to a BigQuery load job, so Spark handles the computation and BigQuery handles the ingestion:

The Spark job on the Managed Service for Apache Spark cluster writes the final DataFrame to a temporary location in the Cloud Storage bucket.
After the write to Cloud Storage completes, the connector triggers a BigQuery load job.
BigQuery ingests the data from the temporary Cloud Storage location into the target table.

This indirect approach decouples the Spark computation from the BigQuery ingestion. The separation lets each service operate efficiently and sustains high throughput for large data loads.
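
The script in this document configures the temporary bucket through the Spark session. As an alternative illustration of the indirect method, the following minimal sketch sets the connector's writeMethod and temporaryGcsBucket options on the write itself. The helper names indirect_write_options and write_to_bigquery are hypothetical, and the sketch assumes that df is a Spark DataFrame on a cluster where the connector is available.

```python
def indirect_write_options(project_id, dataset, table, temp_bucket):
    """Return connector options that request an indirect write.

    temp_bucket is a bucket name without the gs:// prefix.
    """
    return {
        "table": f"{project_id}.{dataset}.{table}",
        # Stage the data in Cloud Storage, then run a BigQuery load job.
        "writeMethod": "indirect",
        "temporaryGcsBucket": temp_bucket,
    }


def write_to_bigquery(df, options, mode="append"):
    """Write a Spark DataFrame to BigQuery with the given connector options."""
    df.write.format("bigquery").options(**options).mode(mode).save()


if __name__ == "__main__":
    # Example values only; prints the options without starting Spark.
    print(indirect_write_options("my-project", "my_dataset",
                                 "corpus_word_counts", "my-bucket"))
```

Setting the options per write keeps the staging bucket visible next to the table it serves, which can help when one job writes to several tables.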

What's next

Learn more about the Spark-BigQuery connector.
Learn how to query and visualize data in BigQuery.
Learn how to visualize BigQuery data using Looker.