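This document describes how to create a Knowledge Catalog (formerly Dataplex Universal Catalog) lake. You can create a lake in any region that supports Knowledge Catalog.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how the products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.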
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project
  - Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Enable the Dataplex, Managed Service for Apache Spark, Dataproc Metastore, BigQuery, and Cloud Storage APIs.
  Roles required to enable APIs
  To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
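If you prefer the command line to the console, the API enablement step above could be sketched roughly as follows. The service identifiers (for example, dataproc.googleapis.com for Managed Service for Apache Spark) are assumptions inferred from the product names, so check them against your project before relying on them. The names enable_required_apis and DRY_RUN are hypothetical helpers introduced only for this illustration. By default the helper prints the command instead of running it.

```shell
# Assumed service identifiers for the APIs listed above; verify before use.
REQUIRED_APIS="dataplex.googleapis.com dataproc.googleapis.com metastore.googleapis.com bigquery.googleapis.com storage.googleapis.com"

# Hypothetical helper: enable every API in REQUIRED_APIS for one project.
# Prints the command by default; set DRY_RUN=0 to actually execute it.
enable_required_apis() {
  project_id="$1"
  # REQUIRED_APIS is intentionally unquoted so it splits into one argument per service.
  set -- gcloud services enable $REQUIRED_APIS --project "$project_id"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `enable_required_apis PROJECT_ID` prints the full gcloud services enable command for review, and `DRY_RUN=0 enable_required_apis PROJECT_ID` runs it.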
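Access control
To create and manage a lake, make sure that you have been granted the predefined role roles/dataplex.admin or roles/dataplex.editor. For more information, see Grant a single role.
To attach a Cloud Storage bucket from another project to your lake, run the following command to grant the Knowledge Catalog service account an administrator role on the bucket:
gcloud dataplex lakes authorize \
  --project PROJECT_ID_OF_LAKE \
  --storage-bucket-resource BUCKET_NAME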
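For the predefined roles mentioned at the start of this section, a project-level grant might look like the sketch below. It assumes the role is granted on the project with gcloud projects add-iam-policy-binding. The function name grant_lake_role, the DRY_RUN switch, and the example member are hypothetical and not part of the original instructions. The helper prints the command by default so that you can review it first.

```shell
# Hypothetical helper: grant a Dataplex predefined role on a project.
# Arguments: PROJECT_ID MEMBER [ROLE]; ROLE defaults to roles/dataplex.editor.
# Prints the command by default; set DRY_RUN=0 to actually execute it.
grant_lake_role() {
  project_id="$1"
  member="$2"                         # for example, user:alice@example.com
  role="${3:-roles/dataplex.editor}"  # or roles/dataplex.admin
  set -- gcloud projects add-iam-policy-binding "$project_id" \
    --member "$member" --role "$role"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Defaulting to the editor role rather than the admin role is a design choice of this sketch, in the spirit of least privilege. Pass roles/dataplex.admin as the third argument if the user needs full administrative access.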
Create a metastore
You can associate a Dataproc Metastore service instance with your Knowledge Catalog lake so that Spark queries can access Knowledge Catalog metadata through Hive Metastore. The lake must be associated with a gRPC-enabled Dataproc Metastore (version 3.1.2 or later).
Configure the Dataproc Metastore service instance to expose a gRPC endpoint (instead of the default Thrift Metastore endpoint):
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://metastore.googleapis.com/v1beta/projects/PROJECT_ID/locations/LOCATION/services/SERVICE_ID?updateMask=hiveMetastoreConfig.endpointProtocol" \
  -d '{"hiveMetastoreConfig": {"endpointProtocol": "GRPC"}}'
View the gRPC endpoint:
gcloud metastore services describe SERVICE_ID \
  --project PROJECT_ID \
  --location LOCATION \
  --format "value(endpointUri)"
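Create a lake
Console
In the Google Cloud console, go to the Knowledge Catalog Lakes page.
Click Create.
Enter a display name.
The lake ID is generated automatically. If you prefer, you can provide your own ID. See Resource naming conventions.
Optional: Enter a description.
Specify the Region in which to create the lake.
For a lake created in a given region (for example, us-central1), you can attach both single-region (us-central1) data and multi-region (us multi-region) data, depending on the zone settings.
Optional: Add labels to your lake.
Optional: In the Metastore section, click the Metastore service menu, and then select the service that you created in the Before you begin section.
Click Create.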
gcloud
To create a lake, use the gcloud dataplex lakes create command:
gcloud dataplex lakes create LAKE \
  --location=LOCATION \
  --labels=k1=v1,k2=v2,k3=v3 \
  --metastore-service=METASTORE_SERVICE
Replace the following:
- LAKE: the name of the new lake
- LOCATION: the Google Cloud region
- k1=v1,k2=v2,k3=v3: the labels to use (if any)
- METASTORE_SERVICE: the Dataproc Metastore service (if created)
REST
To create a lake, use the lakes.create method.
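As a rough illustration of what a lakes.create call could look like with curl, the sketch below assumes the v1 Dataplex REST endpoint, with the lake ID passed as the lakeId query parameter and a minimal request body containing only displayName. The helper names lake_create_url, lake_create_body, and create_lake are hypothetical, and the authentication pattern mirrors the Dataproc Metastore curl command shown earlier on this page.

```shell
# Build the assumed lakes.create URL from PROJECT_ID, LOCATION, and LAKE_ID.
lake_create_url() {
  printf 'https://dataplex.googleapis.com/v1/projects/%s/locations/%s/lakes?lakeId=%s\n' "$1" "$2" "$3"
}

# Build a minimal request body; only displayName is set in this sketch.
# The display name must not contain double quotes or backslashes here.
lake_create_body() {
  printf '{"displayName": "%s"}\n' "$1"
}

# Send the request. Arguments: PROJECT_ID LOCATION LAKE_ID DISPLAY_NAME.
# Requires gcloud credentials; not invoked automatically by this snippet.
create_lake() {
  curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "$(lake_create_url "$1" "$2" "$3")" \
    -d "$(lake_create_body "$4")"
}
```

To preview the request without sending it, call `lake_create_url PROJECT_ID LOCATION LAKE_ID` and `lake_create_body "DISPLAY_NAME"` on their own. Call `create_lake` only once the values look right.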