Explore and ingest data
The medallion architecture organizes data into layers of progressively improving quality: Bronze, Silver, and Gold. It provides a structured approach to transforming raw data into clean, reliable assets.
This document shows how to build the Bronze layer, which ingests data from external source systems. This layer provides a single source of data, ensures complete data lineage, and supports reprocessing pipelines.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Compute Engine, BigQuery, and Cloud Storage APIs.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
- Create a service account:
  - Ensure that you have the Create Service Accounts IAM role (roles/iam.serviceAccountCreator) and the Project IAM Admin role (roles/resourcemanager.projectIamAdmin). Learn how to grant roles.
  - In the Google Cloud console, go to the Create service account page.
  - Select your project.
  - In the Service account name field, enter a name. The Google Cloud console fills in the Service account ID field based on this name.
  - In the Service account description field, enter a description. For example, Service account for quickstart.
  - Click Create and continue.
  - Grant the following roles to the service account: Storage Object Admin, Dataproc Worker.
    To grant a role, find the Select a role list, then select the role. To grant additional roles, click Add another role and add each additional role.
  - Click Continue.
  - Click Done to finish creating the service account.
- Install the Google Cloud CLI.
- If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
- To initialize the gcloud CLI, run the following command:
  gcloud init
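The console steps above can also be scripted with the gcloud CLI. A sketch under stated assumptions: PROJECT_ID and the service account name quickstart-sa are placeholders you substitute with your own values.

```shell
# Create the service account (quickstart-sa is a placeholder name).
gcloud iam service-accounts create quickstart-sa \
    --project=PROJECT_ID \
    --display-name="Service account for quickstart"

# Grant the Storage Object Admin and Dataproc Worker roles.
gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:quickstart-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:quickstart-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"

# Enable the required APIs.
gcloud services enable compute.googleapis.com bigquery.googleapis.com storage.googleapis.com
```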
Store the raw data in Cloud Storage
Ingest the raw data into the initial storage layer in Cloud Storage. To simulate data ingestion, export a public dataset from BigQuery to a Cloud Storage bucket. This process mimics an external system delivering raw data files to a landing zone.
Create a Cloud Storage bucket to serve as the data lake. Create it in the same region as your Managed Service for Apache Spark cluster to optimize performance.

gsutil mb -l REGION gs://BUCKET_NAME/

Export the bigquery-public-data:samples.shakespeare table to the Cloud Storage bucket in CSV format.

bq extract \
    --destination_format CSV \
    "bigquery-public-data:samples.shakespeare" \
    "gs://BUCKET_NAME/raw/shakespeare/shakespeare.csv"

This command starts an export job that writes the table's contents to the specified Cloud Storage path.
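A consistent landing-zone layout makes reprocessing and lineage tracking easier. The following Python sketch builds object URIs under an illustrative raw/SOURCE/DATE/FILE convention; this layout is an assumption for the example, not something the export command requires.

```python
from datetime import date


def landing_path(bucket: str, source: str, filename: str, ingest_date: date) -> str:
    """Build a Cloud Storage URI for a raw file in the landing zone.

    Layout: gs://BUCKET/raw/SOURCE/YYYY-MM-DD/FILE (illustrative convention).
    """
    return f"gs://{bucket}/raw/{source}/{ingest_date.isoformat()}/{filename}"


print(landing_path("my-bucket", "shakespeare", "shakespeare.csv", date(2024, 5, 1)))
# gs://my-bucket/raw/shakespeare/2024-05-01/shakespeare.csv
```

Partitioning the landing zone by source and ingestion date lets a reprocessing pipeline replay exactly one day of one source without scanning the whole bucket.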
Read the raw data with PySpark
After the data is in Cloud Storage, you can read and explore it with a PySpark job on your Managed Service for Apache Spark cluster. Apache Spark interacts with Cloud Storage through the Cloud Storage connector, which lets you read and write data using the gs:// URI scheme.
Use the following PySpark script to create a SparkSession, configure Cloud Storage access, and read the raw CSV file into a DataFrame.
from pyspark.sql import SparkSession

# --- Configuration ---
gcs_bucket = "BUCKET_NAME"
raw_path = f"gs://{gcs_bucket}/raw/shakespeare/shakespeare.csv"
# For local development only.
service_account_key_path = "/path/to/your/service-account-key.json"

# --- Spark Session Initialization ---
spark = SparkSession.builder \
    .appName("DataprocETL-RawIngestion") \
    .config("spark.jars", "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar") \
    .getOrCreate()

# --- Authentication for local development ---
# This step is not necessary when running on a Dataproc cluster
# with the service account attached to the cluster VMs.
spark.conf.set("google.cloud.auth.service.account.json.keyfile", service_account_key_path)

# --- Read Raw Data from Cloud Storage ---
# Read the raw CSV data into a DataFrame.
# inferSchema=True scans the data to determine column types.
raw_df = spark.read.csv(raw_path, header=True, inferSchema=True)

# --- Initial Exploration ---
print("Raw data count:", raw_df.count())
print("Schema:")
raw_df.printSchema()
print("Sample of raw data:")
raw_df.show(10, truncate=False)

# --- Stop Spark Session ---
spark.stop()

Run the script as a Managed Service for Apache Spark job to ingest and explore the raw data.
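Note that inferSchema=True costs an extra pass over the data. As a simplified illustration of what that scan does, the toy function below infers a column type by attempting progressively looser parses on sample values. Spark's actual inference handles many more types, nulls, and malformed rows; this is only a sketch of the idea.

```python
def infer_type(values):
    """Infer a column type from sample string values: integer, then double, else string.

    A toy version of CSV schema inference; Spark's implementation is far more thorough.
    """
    for caster, type_name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                caster(v)  # raises ValueError if any value doesn't parse
            return type_name
        except ValueError:
            continue
    return "string"


print(infer_type(["12", "7", "42"]))  # integer
print(infer_type(["1.5", "2"]))       # double
print(infer_type(["abc", "2"]))       # string
```

Because every value must be scanned before the narrowest type is known, large pipelines often skip inference and declare an explicit schema instead.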
Data ingestion patterns
The Bronze layer supports ingestion patterns beyond batch file uploads. Managed Service for Apache Spark is a versatile engine that can handle a variety of ingestion scenarios.
Streaming ingestion
For continuous data sources such as IoT sensor data or application logs, use a streaming pipeline. You can use Managed Service for Apache Spark to process high-volume streams from services such as Pub/Sub or Apache Kafka and land the data in the Bronze layer.
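Streaming ingestion typically lands messages in small time- or size-bounded batches of raw files rather than one object per message. The plain-Python sketch below illustrates that micro-batching idea; the class, batch size, and file naming are all hypothetical, and a real pipeline would use Spark Structured Streaming or a Pub/Sub subscriber writing objects to Cloud Storage.

```python
class MicroBatchWriter:
    """Buffer incoming records and flush them as numbered raw-file batches.

    Illustrative only: a real pipeline would write each batch to
    gs://BUCKET/raw/... via a storage client instead of keeping it in memory.
    """

    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.buffer = []
        self.batches = []  # stands in for files written to the Bronze layer

    def ingest(self, record: str) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Emit the buffered records as one batch "file" and reset the buffer.
        if self.buffer:
            self.batches.append((f"batch-{len(self.batches):05d}.jsonl", list(self.buffer)))
            self.buffer.clear()


writer = MicroBatchWriter(batch_size=2)
for event in ["sensor-a:20.1", "sensor-b:19.8", "sensor-a:20.3"]:
    writer.ingest(event)
writer.flush()  # flush the partial final batch
print([name for name, _ in writer.batches])  # ['batch-00000.jsonl', 'batch-00001.jsonl']
```

Bounding batches by size (or time) keeps the Bronze layer from accumulating millions of tiny objects, which degrades both Cloud Storage listing and Spark read performance.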
Database ingestion
To keep the data lake synchronized with operational databases, use change data capture (CDC). A Spark job on Managed Service for Apache Spark can subscribe to a Pub/Sub topic that receives change events, process the stream, and apply the changes to the raw data store in Cloud Storage.
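Conceptually, applying CDC means merging ordered, keyed change events into the current state of a table. A minimal Python sketch follows; the event shape ({"op": ..., "key": ..., "row": ...}) is a hypothetical example for illustration, not the schema of Pub/Sub or any specific CDC product.

```python
def apply_cdc_events(state: dict, events: list) -> dict:
    """Apply ordered change events to a keyed record store.

    Each event is a dict like {"op": "upsert" | "delete", "key": ..., "row": ...};
    this shape is illustrative, not a specific CDC product's schema.
    """
    for event in events:
        if event["op"] == "upsert":
            state[event["key"]] = event["row"]
        elif event["op"] == "delete":
            state.pop(event["key"], None)  # deleting a missing key is a no-op
    return state


current = {1: {"name": "alice"}}
changes = [
    {"op": "upsert", "key": 2, "row": {"name": "bob"}},
    {"op": "upsert", "key": 1, "row": {"name": "alicia"}},
    {"op": "delete", "key": 2, "row": None},
]
print(apply_cdc_events(current, changes))  # {1: {'name': 'alicia'}}
```

Because the merge depends on event order, a production pipeline must preserve per-key ordering (for example, by keying Pub/Sub messages) before applying changes to the raw store.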
What's next
- Learn more about Managed Service for Apache Spark.
- Learn more about the Cloud Storage connector.
- Explore BigQuery public datasets.