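The Data Lineage API can ingest lineage information from systems that integrate with OpenLineage, an open standard for lineage collection. When you send OpenLineage-formatted events to the Data Lineage API with the ProcessOpenLineageRunEvent method, the API maps the attributes in the OpenLineage message to the corresponding Data Lineage API attributes.
This document provides reference tables for these mappings.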
Attribute mapping
The ProcessOpenLineageRunEvent REST API method maps OpenLineage attributes to Data Lineage API attributes as follows:
| Data Lineage API attribute | OpenLineage attribute |
|---|---|
| Process.name | projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME |
| Process.displayName | Job.namespace + ":" + Job.name |
| Process.attributes | Job.facets (see "Stored data") |
| Run.name | projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME/runs/HASH_OF_RUNID |
| Run.displayName | Run.runId |
| Run.attributes | Run.facets (see "Stored data") |
| Run.startTime | eventTime |
| Run.endTime | eventTime |
| Run.state | eventType |
| LineageEvent.name | projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME/runs/HASH_OF_RUNID/lineageEvents/HASH_OF_JOB_RUN_INPUT_OUTPUTS_OF_EVENT (for example, projects/11111111/locations/us/processes/1234/runs/4321/lineageEvents/111-222-333) |
| LineageEvent.EventLinks.source | inputs (the FQN is the concatenation of namespace and name) |
| LineageEvent.EventLinks.target | outputs (the FQN is the concatenation of namespace and name) |
| LineageEvent.startTime | eventTime |
| LineageEvent.endTime | eventTime |
| requestId | Defined by the caller of the method |
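The direct mappings in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `map_run_event` and `sample_event` are hypothetical names, the event type is passed through unchanged, and the hashed resource names are not reproduced because the hashing scheme is not described here.

```python
from typing import Any


def map_run_event(event: dict[str, Any]) -> dict[str, Any]:
    """Collect the values that the mapping table derives directly from an event."""
    job = event["job"]
    run = event["run"]

    def namespace_name_pairs(datasets: list[dict[str, Any]]) -> list[tuple[str, str]]:
        # The FQN is built from namespace and name; the joining rule depends on
        # the system, so this sketch keeps the two parts separate.
        return [(d["namespace"], d["name"]) for d in datasets]

    return {
        "Process.displayName": f'{job["namespace"]}:{job["name"]}',
        "Run.displayName": run["runId"],
        "Run.state": event["eventType"],  # passed through untranslated (assumption)
        "eventTime": event["eventTime"],  # feeds the Run and LineageEvent start/end times
        "sources": namespace_name_pairs(event.get("inputs", [])),
        "targets": namespace_name_pairs(event.get("outputs", [])),
    }


# Hypothetical event used only for illustration.
sample_event = {
    "eventType": "COMPLETE",
    "eventTime": "2024-01-01T00:00:00Z",
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "run": {"runId": "1d0a5c3e-0000-4000-8000-000000000001"},
    "inputs": [{"namespace": "bigquery", "name": "project.dataset.inputtable"}],
    "outputs": [{"namespace": "bigquery", "name": "project.dataset.outputtable"}],
}

mapped = map_run_event(sample_event)
```

Keeping the function free of network calls makes it easy to inspect what an event will look like on the Data Lineage API side before sending it.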
Column-level lineage produced by Managed Service for Apache Spark is also supported, as long as the columnLineage facet is used in the outputs object. For example:

    "outputs": [ {
      "namespace": "bigquery",
      "name": "project.dataset.outputtable",
      "columnLineage": {
        "_producer": "https://github.com/OpenLineage/OpenLineage/tree/1.39.0/integration/spark",
        "_schemaURL": "https://openlineage.io/spec/facets/1-2-0/ColumnLineageDatasetFacet.json#/$defs/ColumnLineageDatasetFacet",
        "fields": {
          "output_field": { // This is the name of the output field
            "inputFields": [
              {
                "namespace": "bigquery",
                "name": "project.dataset.inputtable",
                "field": "input_field", // This is the name of the input field
                "transformations": [
                  {
                    "type": "DIRECT",
                    "subtype": "IDENTITY",
                    "description": "",
                    "masking": false
                  }
                ]
              }
            ]
          }
        }
      }
    }]
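In this example, you create a column-level lineage link between input_field and output_field. You must include the transformations field; otherwise, column-level lineage is not captured. For more information about the OpenLineage definition of this facet, see Column level lineage dataset facet.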
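Because a missing transformations field means that no column-level lineage is captured, it can help to check an output dataset before sending the event. The following is a minimal sketch, not part of any official client: `inputs_missing_transformations` is a hypothetical helper, it treats an empty transformations list as missing (an assumption), and it looks for the facet both directly on the output, as in the example above, and under a `facets` key (also an assumption).

```python
def inputs_missing_transformations(output: dict) -> list[tuple[str, str]]:
    """Return (output_field, input_field) pairs that have no transformations."""
    # Look for the facet directly on the output, then under "facets" (assumption).
    facet = output.get("columnLineage") or output.get("facets", {}).get("columnLineage", {})
    missing = []
    for output_field, spec in facet.get("fields", {}).items():
        for input_field in spec.get("inputFields", []):
            # An absent or empty list is treated as missing in this sketch.
            if not input_field.get("transformations"):
                missing.append((output_field, input_field.get("field", "")))
    return missing


# Hypothetical output dataset: the second input field lacks transformations.
sample_output = {
    "namespace": "bigquery",
    "name": "project.dataset.outputtable",
    "columnLineage": {
        "fields": {
            "output_field": {
                "inputFields": [
                    {
                        "namespace": "bigquery",
                        "name": "project.dataset.inputtable",
                        "field": "input_field",
                        "transformations": [
                            {"type": "DIRECT", "subtype": "IDENTITY",
                             "description": "", "masking": False}
                        ],
                    },
                    {
                        "namespace": "bigquery",
                        "name": "project.dataset.inputtable",
                        "field": "other_field",
                    },
                ]
            }
        }
    },
}

problems = inputs_missing_transformations(sample_output)
```

An empty result means every input field carries at least one transformation entry.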
FQN mapping
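The following table lists example OpenLineage namespace and name pairs for various systems, together with the corresponding Knowledge Catalog (formerly Dataplex Universal Catalog) fully qualified names (FQNs):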
| System | OpenLineage namespace | OpenLineage name | Knowledge Catalog FQN |
|---|---|---|---|
| Athena | awsathena://athena.{region_name}.amazonaws.com | | |
| AWS Glue | arn:aws:glue:{region}:{account id} | table/{database name}/{table name} | aws_glue:table:{region}.{account id}.{database name}.{table name} |
| Azure Cosmos DB | azurecosmos://{host}/dbs/{database} | colls/{table} | |
| Azure Data Explorer | azurekusto://{host}.kusto.windows.net | {database}/{table} | |
| Azure Synapse | sqlserver://{host}:{port} | | Not supported |
| BigQuery | bigquery | | |
| Cassandra | cassandra://{host}:{port} | | |
| MySQL | mysql://{host}:{port} | | |
| CrateDB | crate://{host}:{port} | {database}.{schema}.{table} | Not supported |
| DB2 | db2://{host}:{port} | | |
| Hive | hive://{host}:{port} | {database}.{table} | Not supported |
| MSSQL | mssql://{host}:{port} | {database}.{schema}.{table} | Not supported |
| OceanBase | oceanbase://{host}:{port} | {database}.{table} | Not supported |
| Oracle | oracle://{host}:{port} | {serviceName}.{schema}.{table} or {sid}.{schema}.{table} | |
| Postgres | postgres://{host}:{port} | | |
| Teradata | teradata://{host}:{port} | {database}.{table} | Not supported |
| Redshift | redshift://{cluster_identifier}.{region_name}:{port} | | |
| Snowflake | snowflake://{organization name}-{account name} or snowflake://{account-locator}(.{compliance})(.{cloud_region_id})(.{cloud}) | | |
| Spanner | spanner://{projectId}:{instanceId} | {database}.{schema}.{table} | Supported by Knowledge Catalog, but not by data lineage |
| Trino | trino://{host}:{port} | | |
| ABFSS (Azure Data Lake Gen2) | abfss://{container name}@{service name}.dfs.core.windows.net | {path} | |
| DBFS (Databricks File System) | dbfs://{workspace name} | {path} | |
| Cloud Storage | gs://{bucket name} | {object key} | |
| HDFS | hdfs://{namenode host}:{namenode port} | {path} | |
| Kafka | kafka://{bootstrap server host}:{port} | {topic} | kafka:{serverHostWithPort}.{topicId} |
| Local file system | file | {path} | filesystem:localhost.{path} |
| Remote file system | file://{host} | {path} | filesystem:{hostWithPort}.{path} |
| S3 | s3://{bucket name} | {object key} | The namespace prefixes s3a and s3n are also accepted and are converted to s3 |
| WASBS (Azure Blob Storage) | wasbs://{container name}@{service name}.dfs.core.windows.net | {object key} | |
| Pub/Sub topic | pubsub | topic:{projectId}:{topicId} | pubsub:topic:{projectId}.{topicId} |
| Pub/Sub subscription | pubsub | subscription:{projectId}:{subscriptionId} | pubsub:subscription:{projectId}.{subscriptionId} |
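For the rows where the table gives both the OpenLineage pair and the resulting FQN, the conversion can be sketched as follows. This is a minimal illustration, not the service's implementation: `to_fqn` and `normalize_s3_namespace` are hypothetical helpers, only the AWS Glue, Kafka, file system, and Pub/Sub rows are covered, and any escaping of special characters inside FQN segments is out of scope.

```python
from urllib.parse import urlsplit


def normalize_s3_namespace(namespace: str) -> str:
    """Rewrite the accepted s3a:// and s3n:// prefixes to s3://."""
    for prefix in ("s3a://", "s3n://"):
        if namespace.startswith(prefix):
            return "s3://" + namespace[len(prefix):]
    return namespace


def to_fqn(namespace: str, name: str) -> str:
    """Build an FQN from an OpenLineage namespace/name pair (selected rows only)."""
    if namespace == "pubsub":
        # name is "topic:{projectId}:{topicId}" or "subscription:{projectId}:{subscriptionId}"
        kind, project_id, resource_id = name.split(":", 2)
        return f"pubsub:{kind}:{project_id}.{resource_id}"
    if namespace.startswith("arn:aws:glue:"):
        # namespace is "arn:aws:glue:{region}:{account id}"
        region, account_id = namespace.split(":")[3:5]
        # name is "table/{database name}/{table name}"
        _, database, table = name.split("/", 2)
        return f"aws_glue:table:{region}.{account_id}.{database}.{table}"
    if namespace.startswith("kafka://"):
        return f"kafka:{urlsplit(namespace).netloc}.{name}"
    if namespace == "file":
        return f"filesystem:localhost.{name}"
    if namespace.startswith("file://"):
        return f"filesystem:{urlsplit(namespace).netloc}.{name}"
    raise ValueError(f"namespace not covered by this sketch: {namespace}")
```

For example, `to_fqn("pubsub", "topic:my-project:my-topic")` follows the Pub/Sub topic row, and `normalize_s3_namespace("s3a://my-bucket")` applies the prefix conversion noted in the S3 row.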
Other accepted formats
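Although OpenLineage does not define a standard namespace/name pair for the following systems, the Data Lineage API accepts lineage events for them as long as they are formatted as shown in the following table. Resources referenced with the namespace custom in an OpenLineage message are interpreted as custom fully qualified names.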
| System | OpenLineage namespace | OpenLineage name | Knowledge Catalog FQN |
|---|---|---|---|
| Custom FQN | custom | {some reference} | custom:{someReference} |
| Dataproc Metastore | dataproc_metastore | | |
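The custom row above can be illustrated with a short sketch. Both helpers are hypothetical, and the sketch assumes that the reference is carried over into the FQN unchanged.

```python
def custom_dataset(reference: str) -> dict[str, str]:
    """Build an OpenLineage dataset entry that uses the custom namespace."""
    return {"namespace": "custom", "name": reference}


def custom_fqn(dataset: dict[str, str]) -> str:
    """Return the custom FQN for a dataset in the custom namespace."""
    if dataset["namespace"] != "custom":
        raise ValueError("dataset does not use the custom namespace")
    # Assumption: the reference is copied into the FQN without modification.
    return f"custom:{dataset['name']}"


# Hypothetical reference used only for illustration.
dataset = custom_dataset("my-internal-report")
fqn = custom_fqn(dataset)
```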
What's next
- Learn how to integrate OpenLineage.
- See the reference for fully qualified names.
- Explore the Data Lineage API.
- Learn how to view lineage information.