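Use data lineage in Dataflow

Data lineage is a Dataflow feature that lets you track how data moves through your systems: where it comes from, where it is passed to, and what transformations are applied to it.

Each pipeline that you run with Dataflow has several associated data assets. The lineage of a data asset includes its origin, what happens to it, and where it moves over time. With data lineage, you can track the end-to-end movement of your data assets, from origin to final destination.

When you enable data lineage for a Dataflow job, Dataflow captures lineage events and publishes them to the Dataplex Universal Catalog Data Lineage API.

To access lineage information through Dataplex Universal Catalog, see "Use data lineage with Google Cloud Platform systems".

Before you begin

Set up your project: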

Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataplex, BigQuery, and Data lineage APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

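In Dataflow, you also need to enable lineage at the job level. See the "Enable data lineage in Dataflow" section of this document.

Required roles

To get the permissions that you need to view lineage visualization graphs, ask your administrator to grant you the following IAM roles:

Dataplex Catalog Viewer (roles/dataplex.catalogViewer) on the Dataplex Universal Catalog resource project
Data Lineage Viewer (roles/datalineage.viewer) on the project where you use Dataflow
Dataflow Viewer (roles/dataflow.viewer) on the project where you use Dataflow

For more information about granting roles, see "Manage access to projects, folders, and organizations".

You might also be able to get the required permissions through custom roles or other predefined roles.

For more information about data lineage roles, see the predefined roles for data lineage.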

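Support and limitations

Data lineage in Dataflow has the following limitations:

Data lineage is supported in Apache Beam SDK versions 2.63.0 and later.
You must enable data lineage for each job.
Data capture isn't immediate. It can take a few minutes for Dataflow job lineage data to appear in Dataplex Universal Catalog.
The following sources and sinks are supported:
- Apache Kafka
- BigQuery (Streaming jobs in Python use the legacy STREAMING_INSERT method, which doesn't support data lineage. To use data lineage, switch to the recommended STORAGE_WRITE_API method. For more information, see "Write from Dataflow to BigQuery".)
- Bigtable
- Cloud Storage
- JDBC (Java Database Connectivity)
- Pub/Sub
- Spanner (change streams aren't supported)
Dataflow templates that use these sources and sinks also capture and publish lineage events automatically.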
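
The SDK version requirement in the list above can be checked before a job is launched. The following is a minimal sketch, not part of Dataflow or Apache Beam: the helper names parse_version and supports_lineage are hypothetical, and pre-release suffixes such as "rc1" are simply ignored. It compares versions numerically, because a plain string comparison would wrongly rank "2.7.0" above "2.63.0".

```python
from typing import Tuple

# Minimum Apache Beam SDK version that supports data lineage, per the list above.
MIN_LINEAGE_VERSION = (2, 63, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading 'major.minor.patch' of a version string into integers.

    Hypothetical helper: non-digit suffixes such as 'rc1' or '.dev' are dropped.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_lineage(sdk_version: str) -> bool:
    """Return True if the given SDK version meets the 2.63.0 minimum."""
    return parse_version(sdk_version) >= MIN_LINEAGE_VERSION
```

In a Python pipeline, the installed SDK version is available as apache_beam.__version__, so a launcher script could call supports_lineage(apache_beam.__version__) and warn before it submits the job.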

Enable data lineage in Dataflow

You need to enable lineage at the job level. To enable data lineage, use the enable_lineage Dataflow service option, as follows:

Java

--dataflowServiceOptions=enable_lineage=true

Python

--dataflow_service_options=enable_lineage=true

Go

--dataflow_service_options=enable_lineage=true

gcloud

Use the gcloud dataflow jobs run command with the additional-experiments option. If you use Flex Templates, use the gcloud dataflow flex-template run command.

--additional-experiments=enable_lineage=true
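
For a Python pipeline that is launched from a script, the flag shown above can be appended to the argument list programmatically. The following is a minimal sketch under stated assumptions: with_lineage is a hypothetical helper, and the project and region values are placeholders.

```python
LINEAGE_FLAG = "--dataflow_service_options=enable_lineage=true"


def with_lineage(pipeline_args):
    """Return a copy of pipeline_args with the lineage service option appended.

    Hypothetical helper: the flag is added only if it is not already present,
    and the caller's list is left unmodified.
    """
    args = list(pipeline_args)
    if LINEAGE_FLAG not in args:
        args.append(LINEAGE_FLAG)
    return args


# Placeholder values for illustration only.
base_args = [
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=us-central1",
]
print(with_lineage(base_args))
```

The resulting list can be passed to PipelineOptions from apache_beam.options.pipeline_options when the pipeline is constructed.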

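You can also use the service option to specify one or more of the following parameters:

process_id: A unique identifier that Dataplex Universal Catalog uses to group job runs. If not specified, the job name is used.
process_name: A human-readable name for the data lineage process. If not specified, the job name prefixed with "Dataflow " is used.

Specify these options as follows: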

Java

--dataflowServiceOptions=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

Python

--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

Go

--dataflow_service_options=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME

gcloud

--additional-experiments=enable_lineage=process_id=PROCESS_ID;process_name=DISPLAY_NAME
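
Because the option value contains a semicolon, which most shells treat as a command separator, the flag usually needs to be quoted when it is typed on a command line. The following is a minimal sketch of building the value and a shell-safe form of the Python flag. The helper names and the example ID and name are hypothetical.

```python
import shlex


def lineage_option(process_id=None, process_name=None):
    """Build the value of the enable_lineage service option.

    Hypothetical helper: with no parameters it returns the plain on switch,
    otherwise it joins the given parameters with ';' as shown above.
    """
    params = []
    if process_id:
        params.append(f"process_id={process_id}")
    if process_name:
        params.append(f"process_name={process_name}")
    return "enable_lineage=" + (";".join(params) if params else "true")


def python_flag(process_id=None, process_name=None):
    """Return the full flag in the form used by the Python SDK."""
    return "--dataflow_service_options=" + lineage_option(process_id, process_name)


# Example values are placeholders, not required formats.
flag = python_flag("nightly-orders", "nightly-orders-load")

# shlex.quote wraps the flag in single quotes so the ';' is not
# interpreted by the shell.
print(shlex.quote(flag))
```

When the arguments are passed as a list directly to the pipeline, rather than through a shell, no quoting is needed.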

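View lineage in Dataplex Universal Catalog

Data lineage provides information about the relationships between your project resources and the processes that created them. In the Google Cloud console, you can view data lineage information as a graph or as a single table. You can also retrieve data lineage information from the Data Lineage API in JSON format.

For more information, see "Use data lineage with Google Cloud Platform systems".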
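
As an illustration of the JSON route, the following sketch only builds a request and doesn't send it. It assumes the Data Lineage API v1 REST method projects.locations.searchLinks and the fully qualified name format bigquery:PROJECT.DATASET.TABLE. Treat both as assumptions to confirm against the API reference. The function name and all example values are hypothetical.

```python
import json


def build_search_links_request(project, location, target_fqn):
    """Build the URL and JSON body for a lineage link search.

    Assumption: the endpoint and the 'target.fullyQualifiedName' field follow
    the Data Lineage API v1 searchLinks method. Verify before use.
    """
    url = (
        "https://datalineage.googleapis.com/v1/"
        f"projects/{project}/locations/{location}:searchLinks"
    )
    body = {"target": {"fullyQualifiedName": target_fqn}}
    return url, json.dumps(body)


# Placeholder project, location, and table names.
url, body = build_search_links_request(
    "my-project", "us", "bigquery:my-project.my_dataset.my_table"
)
print(url)
print(body)
```

An actual call would send the body as an authenticated POST request, for example with an OAuth 2.0 access token in the Authorization header. The caller needs the viewer roles listed earlier in this document.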

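Disable data lineage in Dataflow

If data lineage is enabled for a specific job and you want to disable it, cancel the existing job, and then run a new version of the job without the enable_lineage service option.

Billing

Using data lineage in Dataflow doesn't affect your Dataflow bill, but it might incur additional charges on your Dataplex Universal Catalog bill. For more information, see "Data lineage considerations" and "Dataplex Universal Catalog pricing".

What's next

Learn more about data lineage.
Learn how to use data lineage.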
