OpenLineage 對應

Data Lineage API 可從與 OpenLineage 整合的系統擷取沿襲資訊。OpenLineage 是沿襲資訊收集的開放標準。使用 ProcessOpenLineageRunEvent 方法將 OpenLineage 格式的事件傳送至 Data Lineage API 時,Data Lineage API 會將 OpenLineage 訊息中的屬性對應至 Data Lineage API 中的相應屬性。

本文提供這些對應項目的參考表格。

屬性對應

ProcessOpenLineageRunEvent REST API 方法會將 OpenLineage 屬性對應至 Data Lineage API 屬性,如下所示:

Data Lineage API 屬性 OpenLineage 屬性
Process.name projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME
Process.displayName Job.namespace + ":" + Job.name
Process.attributes Job.facets (請參閱「儲存的資料」)
Run.name projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME/runs/HASH_OF_RUNID
Run.displayName Run.runId
Run.attributes Run.facets (請參閱「儲存的資料」)
Run.startTime eventTime
Run.endTime eventTime
Run.state eventType
LineageEvent.name projects/PROJECT_NUMBER/locations/LOCATION/processes/HASH_OF_NAMESPACE_AND_NAME/runs/HASH_OF_RUNID/lineageEvents/HASH_OF_JOB_RUN_INPUT_OUTPUTS_OF_EVENT (例如 projects/11111111/locations/us/processes/1234/runs/4321/lineageEvents/111-222-333)
LineageEvent.EventLinks.source 輸入 (fqn 是命名空間和名稱的串連)
LineageEvent.EventLinks.target 輸出 (fqn 是命名空間和名稱的串連)
LineageEvent.startTime eventTime
LineageEvent.endTime eventTime
requestId 由方法使用者定義

FQN 對應

下表列出各種系統的 OpenLineage 命名空間和名稱配對範例,以及對應的 Dataplex Universal Catalog 完整名稱 (FQN):

系統 OpenLineage 命名空間 OpenLineage 名稱 Dataplex Universal Catalog FQN
雅典娜 awsathena://athena.{region_name}.amazonaws.com
  • {catalog}
  • {catalog}.{database}
  • {catalog}.{database}.{table}
  • athena:{catalogId}.{region}
  • athena:{catalogId}.{region}.{databaseId}
  • athena:{catalogId}.{region}.{databaseId}.{tableId}
AWS Glue arn:aws:glue:{region}:{account id} table/{database name}/{table name} aws_glue:table:{region}.{account id}.{database name}.{table name}
Azure Cosmos DB azurecosmos://{host}/dbs/{database} colls/{table}
  • cosmos-db:{host}.{database}
  • cosmos-db:{host}.{database}.{table}
Azure 資料探索工具 azurekusto://{host}.kusto.windows.net {database}/{table}
  • kusto:{host}.{region}.{database}
  • kusto:{host}.{region}.{database}.{table}
Azure Synapse sqlserver://{host}:{port}
  • {database}
  • {database}.{schema}
  • {database}.{schema}.{table}
  • sqlserver:{hostWithPort}.{databaseId}
  • sqlserver:{hostWithPort}.{databaseId}.{schemaId}
  • sqlserver:{hostWithPort}.{databaseId}.{schemaId}.{tableId}
BigQuery bigquery
  • {project id}.{dataset name}
  • {project id}.{dataset name}.{table name}
  • bigquery:{projectId}.{datasetId}
  • bigquery:{projectId}.{datasetId}.{assetId}
Cassandra cassandra://{host}:{port}
  • {keyspace}
  • {keyspace}.{table}
  • cassandra:{hostWithPort}.{keyspaceId}
  • cassandra:{hostWithPort}.{keyspaceId}.{tableId}
MySQL mysql://{host}:{port}
  • {database}
  • {database}.{table}
  • mysql:{hostWithPort}.{databaseId}
  • mysql:{hostWithPort}.{databaseId}.{tableId}
CrateDB crate://{host}:{port} {database}.{schema}.{table} 不支援
DB2 db2://{host}:{port}
  • {database}
  • {database}.{schema}
  • {database}.{schema}.{table}
  • db2:{dns}.{databaseId}
  • db2:{dns}.{databaseId}.{schemaId}
  • db2:{dns}.{databaseId}.{schemaId}.{tableId}
Hive hive://{host}:{port} {database}.{table} 不支援
MSSQL mssql://{host}:{port} {database}.{schema}.{table} 不支援
OceanBase oceanbase://{host}:{port} {database}.{table} 不支援
Oracle oracle://{host}:{port} {serviceName}.{schema}.{table} or {sid}.{schema}.{table}
  • oracle:{hostWithPort}.{databaseId}
  • oracle:{hostWithPort}.{databaseId}.{schemaId}
  • oracle:{hostWithPort}.{databaseId}.{schemaId}.{tableId}
Postgres postgres://{host}:{port}
  • {database}
  • {database}.{schema}
  • {database}.{schema}.{table}
  • postgresql:{hostWithPort}.{databaseId}
  • postgresql:{hostWithPort}.{databaseId}.{schemaId}
  • postgresql:{hostWithPort}.{databaseId}.{schemaId}.{tableId}
Teradata teradata://{host}:{port} {database}.{table} 不支援
Redshift redshift://{cluster_identifier}.{region_name}:{port}
  • {database}
  • {database}.{schema}
  • {database}.{schema}.{table}
  • redshift:{clusterId}.{region}.{port}.{databaseId}
  • redshift:{clusterId}.{region}.{port}.{databaseId}.{schemaId}
  • redshift:{clusterId}.{region}.{port}.{databaseId}.{schemaId}.{tableId}
雪花 snowflake://{organization name}-{account name} or snowflake://{account-locator}(.{compliance})(.{cloud_region_id})(.{cloud})
  • {database}
  • {database}.{schema}
  • {database}.{schema}.{table}
  • snowflake:{accountName}.{databaseId}
  • snowflake:{accountName}.{databaseId}.{schemaId}
  • snowflake:{accountName}.{databaseId}.{schemaId}.{tableId}
Spanner spanner://{projectId}:{instanceId} {database}.{schema}.{table} Dataplex Universal Catalog 支援,但資料歷程不支援
Trino trino://{host}:{port}
  • {catalog}
  • {catalog}.{schema}
  • {catalog}.{schema}.{table}
  • trino:{hostWithPort}.{catalogId}
  • trino:{hostWithPort}.{catalogId}.{schemaId}
  • trino:{hostWithPort}.{catalogId}.{schemaId}.{tableId}
ABFSS (Azure Data Lake Gen2) abfss://{container name}@{service name}.dfs.core.windows.net {path}
  • abs:{serviceName}.{containerName}
  • abs:{serviceName}.{containerName}.{path}
DBFS (Databricks 檔案系統) dbfs://{workspace name} {path}
  • dbfs:{workspace}
  • dbfs:{workspace}.{path}
Cloud Storage gs://{bucket name} {object key}
  • gcs:{bucketName}
  • gcs:{bucketName}.{virtualPath}
HDFS hdfs://{namenode host}:{namenode port} {path}
  • hdfs:{namenodeHostWithPort}
  • hdfs:{namenodeHostWithPort}.{path}
Kafka kafka://{bootstrap server host}:{port} {topic} kafka:{serverHostWithPort}.{topicId}
本機檔案系統 file {path} filesystem:localhost.{path}
遠端檔案系統 file://{host} {path} filesystem:{hostWithPort}.{path}
S3 s3://{bucket name} {object key}
  • s3:{bucketName}
  • s3:{bucketName}.{objectKey}

命名空間前置字元 s3as3n 也可接受,並會轉換為 s3
WASBS (Azure Blob 儲存空間) wasbs://{container name}@{service name}.dfs.core.windows.net {object key}
  • abs:{serviceName}.{containerName}
  • abs:{serviceName}.{containerName}.{objectKey}
Pub/Sub 主題 pubsub topic:{projectId}:{topicId} pubsub:topic:{projectId}.{topicId}
Pub/Sub 訂閱項目 pubsub subscription:{projectId}:{subscriptionId} pubsub:subscription:{projectId}.{subscriptionId}

其他可接受的格式

雖然 OpenLineage 並未為下列系統定義標準 namespace/name 配對,但只要按照下表格式設定,Data Lineage API 就能接受這些系統的沿襲事件。在 OpenLineage 訊息中,以命名空間 custom 參照的資源會解讀為自訂完整名稱。

系統 OpenLineage 命名空間 OpenLineage 名稱 Dataplex Universal Catalog FQN
自訂 FQN custom {some reference} custom:{someReference}
Dataproc Metastore dataproc_metastore
  • dataproc_metastore:{projectId}.{location}.{instanceId}
  • dataproc_metastore:{projectId}.{location}.{instanceId}.{databaseId}
  • dataproc_metastore:{projectId}.{location}.{instanceId}.{databaseId}.{tableId}
  • dataproc_metastore:{projectId}.{location}.{instanceId}
  • dataproc_metastore:{projectId}.{location}.{instanceId}.{databaseId}
  • dataproc_metastore:{projectId}.{location}.{instanceId}.{databaseId}.{tableId}

後續步驟