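BigQuery to Elasticsearch template

The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. The template can either read the entire table or read specific records by using a supplied query.

Pipeline requirements

The source BigQuery table must exist.
An Elasticsearch host on a Google Cloud instance or on Elastic Cloud, running Elasticsearch version 7.0 or later. The host must be accessible from the Dataflow worker machines.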

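Template parameters

Required parameters

connectionUrl: The Elasticsearch URL in the format https://hostname:[port]. If you use Elastic Cloud, specify the CloudID. For example, https://elasticsearch-host:9200.
apiKey: The Base64-encoded API key to use for authentication.
index: The Elasticsearch index that the requests are issued to. For example, my-index.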
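
The apiKey parameter expects a Base64-encoded value. The following is a minimal sketch of producing one, assuming the key follows the Elasticsearch convention of encoding the UTF-8 string `id:api_key`, and assuming a Node.js runtime for `Buffer`. The helper name `encodeApiKey` and the sample values are hypothetical.

```javascript
// Hypothetical helper, not part of the template.
// Assumption: the encoded key is base64("<id>:<api_key>").
function encodeApiKey(id, apiKey) {
  // Buffer is Node.js-specific; other runtimes need a different encoder.
  return Buffer.from(id + ":" + apiKey, "utf8").toString("base64");
}

// Placeholder credentials for illustration only.
var encoded = encodeApiKey("my-key-id", "my-key-secret");
console.log(encoded);
```

If Elasticsearch already returned an `encoded` value when the key was created, that value can be passed as-is and this step is unnecessary.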

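Optional parameters

inputTableSpec: The BigQuery table to read from. If you specify inputTableSpec, the template reads the data directly from BigQuery storage by using the BigQuery Storage Read API (https://cloud.google.com/bigquery/docs/reference/storage). For information about limitations in the Storage Read API, see https://cloud.google.com/bigquery/docs/reference/storage#limitations. You must specify either inputTableSpec or query. If you set both parameters, the template uses the query parameter. For example, <BIGQUERY_PROJECT>:<DATASET_NAME>.<INPUT_TABLE>.
outputDeadletterTable: The BigQuery table for messages that failed to reach the output table. If the table doesn't exist, it is created during pipeline execution. If not specified, <outputTableSpec>_error_records is used. For example, <PROJECT_ID>:<DATASET_NAME>.<DEADLETTER_TABLE>.
query: The SQL query to use to read data from BigQuery. If the BigQuery dataset is in a different project than the Dataflow job, specify the full dataset name in the SQL query, for example: <PROJECT_ID>.<DATASET_NAME>.<TABLE_NAME>. By default, the query parameter uses GoogleSQL (https://cloud.google.com/bigquery/docs/introduction-sql), unless useLegacySql is true. You must specify either inputTableSpec or query. If you set both parameters, the template uses the query parameter. For example, select * from sampledb.sample_table.
useLegacySql: Set to true to use legacy SQL. This parameter applies only when you use the query parameter. Defaults to false.
queryLocation: Needed when reading from an authorized view without permission on the underlying table. For example, US.
queryTempDataset: An existing dataset in which to create the temporary table that stores the query results. For example, temp_dataset.
KMSEncryptionKey: If you read from BigQuery by using the query source, use this Cloud KMS key to encrypt any temporary tables that are created. For example, projects/your-project/locations/global/keyRings/your-keyring/cryptoKeys/your-key.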
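elasticsearchUsername: The Elasticsearch username to authenticate with. If specified, the value of apiKey is ignored.
elasticsearchPassword: The Elasticsearch password to authenticate with. If specified, the value of apiKey is ignored.
batchSize: The batch size in number of documents. Defaults to 1000.
batchSizeBytes: The batch size in number of bytes. Defaults to 5242880 (5 MB).
maxRetryAttempts: The maximum number of retry attempts. Must be greater than zero. By default, no retries are attempted.
maxRetryDuration: The maximum retry duration in milliseconds. Must be greater than zero. By default, no retries are attempted.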
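propertyAsIndex: The property in the document being indexed whose value specifies the _index metadata to include with the document in bulk requests. Takes precedence over an _index UDF. Defaults to none.
javaScriptIndexFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _index metadata to include with the document in bulk requests. Defaults to none.
javaScriptIndexFnName: The name of the UDF JavaScript function that specifies the _index metadata to include with the document in bulk requests. Defaults to none.
propertyAsId: The property in the document being indexed whose value specifies the _id metadata to include with the document in bulk requests. Takes precedence over an _id UDF. Defaults to none.
javaScriptIdFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _id metadata to include with the document in bulk requests. Defaults to none.
javaScriptIdFnName: The name of the UDF JavaScript function that specifies the _id metadata to include with the document in bulk requests. Defaults to none.
javaScriptTypeFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _type metadata to include with the document in bulk requests. Defaults to none.
javaScriptTypeFnName: The name of the UDF JavaScript function that specifies the _type metadata to include with the document in bulk requests. Defaults to none.
javaScriptIsDeleteFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that determines whether to delete the document rather than insert or update it. The function returns a string value of true or false. Defaults to none.
javaScriptIsDeleteFnName: The name of the UDF JavaScript function that determines whether to delete the document rather than insert or update it. The function returns a string value of true or false. Defaults to none.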
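usePartialUpdate: Whether to use partial updates (update rather than create or index, allowing partial documents) with Elasticsearch requests. Defaults to false.
bulkInsertMethod: Whether to use INDEX (index, allows upserts) or CREATE (create, errors on a duplicate _id) with Elasticsearch bulk requests. Defaults to CREATE.
trustSelfSignedCerts: Whether to trust self-signed certificates. An installed Elasticsearch instance might have a self-signed certificate. Set this parameter to true to skip validation of the SSL certificate. Defaults to false.
disableCertificateValidation: If true, trust the self-signed SSL certificate. An Elasticsearch instance might have a self-signed certificate. To skip certificate validation, set this parameter to true. Defaults to false.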
apiKeyKMSEncryptionKey: The Cloud KMS key used to decrypt the API key. This parameter is required if apiKeySource is set to KMS. If you provide this parameter, pass in an encrypted apiKey string. Encrypt parameters by using the KMS API encrypt endpoint. For the key, use the format projects/<PROJECT_ID>/locations/<KEY_REGION>/keyRings/<KEY_RING>/cryptoKeys/<KMS_KEY_NAME>. See https://cloud.google.com/kms/docs/reference/rest/v1/projects.locations.keyRings.cryptoKeys/encrypt. For example, projects/your-project-id/locations/global/keyRings/your-keyring/cryptoKeys/your-key-name.
apiKeySecretId: The Secret Manager secret ID for the apiKey. If apiKeySource is set to SECRET_MANAGER, provide this parameter. Use the format projects/<PROJECT_ID>/secrets/<SECRET_ID>/versions/<SECRET_VERSION>. For example, projects/your-project-id/secrets/your-secret/versions/your-secret-version.
apiKeySource: The source of the API key. Allowed values are PLAINTEXT, KMS, and SECRET_MANAGER. This parameter is required when you use Secret Manager or KMS. If apiKeySource is set to KMS, you must provide apiKeyKMSEncryptionKey and an encrypted apiKey. If apiKeySource is set to SECRET_MANAGER, you must provide apiKeySecretId. If apiKeySource is set to PLAINTEXT, you must provide apiKey. Defaults to PLAINTEXT.
socketTimeout: If set, overrides the default maximum retry timeout and the default socket timeout (30000 ms) in the Elastic RestClient.
javascriptTextTransformGcsPath: The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) to use. For example, gs://my-bucket/my-udfs/my_file.js.
javascriptTextTransformFunctionName: The name of the JavaScript user-defined function (UDF) to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples (https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples).
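
Several parameters above depend on one another: inputTableSpec and query, and apiKeySource with its companion parameters. The following sketch restates those rules as a pre-flight check that you could run before launching a job. It is an illustration of the parameter descriptions only, not the template's own validation logic, and the function names `checkParameters` and `effectiveReadSource` are hypothetical.

```javascript
// Hypothetical pre-flight check that mirrors the parameter descriptions.
// It is NOT the validation that the template itself performs.
function checkParameters(params) {
  var problems = [];

  // You must specify either inputTableSpec or query.
  if (!params.inputTableSpec && !params.query) {
    problems.push("Specify inputTableSpec or query.");
  }

  // apiKeySource defaults to PLAINTEXT.
  var source = params.apiKeySource || "PLAINTEXT";
  if (source === "KMS") {
    if (!params.apiKeyKMSEncryptionKey || !params.apiKey) {
      problems.push("KMS needs apiKeyKMSEncryptionKey and an encrypted apiKey.");
    }
  } else if (source === "SECRET_MANAGER") {
    if (!params.apiKeySecretId) {
      problems.push("SECRET_MANAGER needs apiKeySecretId.");
    }
  } else if (source === "PLAINTEXT") {
    if (!params.apiKey) {
      problems.push("PLAINTEXT needs apiKey.");
    }
  } else {
    problems.push("apiKeySource must be PLAINTEXT, KMS, or SECRET_MANAGER.");
  }
  return problems;
}

// If both read parameters are set, the description says query is used.
function effectiveReadSource(params) {
  return params.query ? "query" : "inputTableSpec";
}
```

An empty array from `checkParameters` means that none of the rules listed above were violated; it says nothing about whether the values themselves are valid.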

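User-defined functions

This template supports user-defined functions (UDFs) at several points in the pipeline, described below. For more information, see Create user-defined functions for Dataflow templates.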
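
Besides the metadata functions described below, the javascriptTextTransformGcsPath and javascriptTextTransformFunctionName parameters name a general transform UDF. A minimal sketch of such a function follows, assuming it receives one row as a JSON string and returns a JSON string. The added field `ingested_by` is hypothetical, and the ES5-style syntax is a conservative choice rather than a documented requirement.

```javascript
/**
 * Hypothetical transform UDF: tags each document with its origin.
 * @param {string} inJson One BigQuery row serialized as a JSON string.
 * @return {string} The transformed document as a JSON string.
 */
function myTransform(inJson) {
  var doc = JSON.parse(inJson);
  // "ingested_by" is an illustrative field name, not one the template defines.
  doc.ingested_by = "bigquery-to-elasticsearch";
  return JSON.stringify(doc);
}
```

With this file uploaded to Cloud Storage, javascriptTextTransformFunctionName would be set to myTransform.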

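Index function

Returns the index to which the document belongs.

Template parameters:

javaScriptIndexFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIndexFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _index metadata field.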
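
As an illustration of this specification, the following sketch routes each document to an index based on one of its fields. The function name `getIndex`, the `region` field, and the `orders-` prefix are all hypothetical.

```javascript
// Hypothetical index UDF: picks an index name from a document field.
function getIndex(docJson) {
  var doc = JSON.parse(docJson);
  // Fall back to a default index when the (assumed) region field is absent.
  if (!doc.region) {
    return "orders-default";
  }
  // Lowercased here because Elasticsearch index names must be lowercase.
  return "orders-" + String(doc.region).toLowerCase();
}
```

Note that propertyAsIndex, when set, takes precedence over a function like this one.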

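Document ID function

Returns the document ID.

Template parameters:

javaScriptIdFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIdFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _id metadata field.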
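
A document ID function can be as small as the following sketch, which assumes each row carries a unique `order_id` column. Both the function name `getId` and that column are hypothetical.

```javascript
// Hypothetical document ID UDF: uses a unique column as the _id value.
function getId(docJson) {
  var doc = JSON.parse(docJson);
  // String() keeps numeric IDs usable as _id, which is a string in Elasticsearch.
  return String(doc.order_id);
}
```

A stable ID matters most when bulkInsertMethod is INDEX, because repeated runs then overwrite the same documents instead of creating new ones.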

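Document deletion function

Specifies whether to delete a document. To use this function, set the bulk insert mode (the bulkInsertMethod parameter) to INDEX and provide a document ID function.

Template parameters:

javaScriptIsDeleteFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIsDeleteFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the string "true" to delete the document, or "false" to insert or update the document.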
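
The sketch below shows one way to meet this specification, assuming the source rows mark removals with a boolean `is_deleted` column. The function name `isDelete` and that column are hypothetical. Note that the return value is a string, not a boolean.

```javascript
// Hypothetical deletion UDF: returns the STRING "true" or "false".
function isDelete(docJson) {
  var doc = JSON.parse(docJson);
  // Only an explicit boolean true triggers deletion in this sketch.
  return doc.is_deleted === true ? "true" : "false";
}
```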

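Mapping type function

Returns the document's mapping type.

Template parameters:

javaScriptTypeFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptTypeFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _type metadata field.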
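
A mapping type function follows the same shape as the other metadata functions. The sketch below reads a hypothetical `doc_type` field and falls back to `_doc`; the function name `getType` is also hypothetical. The fallback is an assumption based on `_doc` being the default type name in Elasticsearch 7, so check whether your cluster version still accepts _type metadata before relying on it.

```javascript
// Hypothetical mapping type UDF.
function getType(docJson) {
  var doc = JSON.parse(docJson);
  // Assumption: "_doc" is an acceptable default type for the target cluster.
  return doc.doc_type ? String(doc.doc_type) : "_doc";
}
```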

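Run the template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the BigQuery to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.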

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/BigQuery_to_Elasticsearch \
    --parameters \
inputTableSpec=INPUT_TABLE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX

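Replace the following:

PROJECT_ID: the ID of the Google Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
- the version name, such as 2023-09-12-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
Note: The latest version of a template might be updated with breaking changes. Production environments should use templates from the most recent dated parent folder so that these breaking changes don't affect production workflows.
INPUT_TABLE_SPEC: your BigQuery table name.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.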
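
The --parameters value in the command above is a comma-separated list of key=value pairs, which becomes awkward when a value such as a SQL query contains commas. The following sketch builds that value from an object and switches to gcloud's alternate-delimiter syntax (described in `gcloud topic escaping`) when needed. The helper name `toParametersValue` and the choice of `~` as the delimiter are assumptions, and the result must still be quoted for your shell.

```javascript
// Hypothetical helper that builds the value passed to --parameters.
function toParametersValue(params) {
  var pairs = Object.keys(params).map(function (key) {
    return key + "=" + params[key];
  });
  var hasComma = pairs.some(function (pair) {
    return pair.indexOf(",") !== -1;
  });
  // "^~^" tells gcloud to split on "~" instead of ",".
  // This sketch does not handle values that themselves contain "~".
  return hasComma ? "^~^" + pairs.join("~") : pairs.join(",");
}

console.log(toParametersValue({
  query: "select id, name from sampledb.sample_table",
  index: "my-index"
}));
```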

API

To run the template by using the REST API, send an HTTP POST request. For more information about the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputTableSpec": "INPUT_TABLE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/BigQuery_to_Elasticsearch"
   }
}

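Replace the following:

PROJECT_ID: the ID of the Google Cloud project where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
- the version name, such as 2023-09-12-00_RC00, to use a specific version of the template, which is nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
Note: The latest version of a template might be updated with breaking changes. Production environments should use templates from the most recent dated parent folder so that these breaking changes don't affect production workflows.
INPUT_TABLE_SPEC: your BigQuery table name.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.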
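
The request above can also be assembled programmatically. The sketch below only builds the URL and JSON body from the placeholders described in this section; it does not send the request or handle authentication. The function name `buildLaunchRequest` and the shape of its `opts` argument are hypothetical.

```javascript
// Hypothetical helper: builds the URL and body for the launch request.
// Sending the request and attaching an OAuth token are left to the caller.
function buildLaunchRequest(opts) {
  var url = "https://dataflow.googleapis.com/v1b3/projects/" + opts.projectId +
      "/locations/" + opts.location + "/flexTemplates:launch";
  var body = {
    launch_parameter: {
      jobName: opts.jobName,
      parameters: opts.parameters,
      containerSpecGcsPath: "gs://dataflow-templates-" + opts.location + "/" +
          opts.version + "/flex/BigQuery_to_Elasticsearch"
    }
  };
  return { url: url, body: JSON.stringify(body) };
}

// Placeholder values for illustration only.
var request = buildLaunchRequest({
  projectId: "my-project",
  location: "us-central1",
  version: "latest",
  jobName: "bq-to-es-job",
  parameters: {
    inputTableSpec: "my-project:my_dataset.my_table",
    connectionUrl: "https://elasticsearch-host:9200",
    apiKey: "APIKEY",
    index: "my-index"
  }
});
console.log(request.url);
```

Serializing with JSON.stringify avoids hand-written JSON mistakes such as trailing commas.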

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.BigQueryToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.ReadBigQueryTableRows;
import com.google.cloud.teleport.v2.transforms.BigQueryConverters.TableRowToJsonFn;
import com.google.cloud.teleport.v2.transforms.JavascriptTextTransformer.TransformTextViaJavascript;
import com.google.cloud.teleport.v2.transforms.PythonExternalTextTransformer;
import com.google.common.base.Strings;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

/**
 * The {@link BigQueryToElasticsearch} pipeline exports data from a BigQuery table to Elasticsearch.
 *
 * <p>Check out <a
 * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/README_BigQuery_to_Elasticsearch.md">README</a>
 * for instructions on how to use or modify this template.
 */
@MultiTemplate({
  @Template(
      name = "BigQuery_to_Elasticsearch",
      category = TemplateCategory.BATCH,
      displayName = "BigQuery to Elasticsearch",
      description =
          "The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. "
              + "The template can either read the entire table or read specific records using a supplied query.",
      optionsClass = BigQueryToElasticsearchOptions.class,
      skipOptions = {
        "javascriptTextTransformReloadIntervalMinutes",
        "pythonExternalTextTransformGcsPath",
        "pythonExternalTextTransformFunctionName"
      },
      flexContainerName = "bigquery-to-elasticsearch",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/bigquery-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The source BigQuery table must exist.",
        "A Elasticsearch host on a Google Cloud instance or on Elastic Cloud with Elasticsearch version 7.0 or above and should be accessible from the Dataflow worker machines.",
      }),
  @Template(
      name = "BigQuery_to_Elasticsearch_Xlang",
      category = TemplateCategory.BATCH,
      displayName = "BigQuery to Elasticsearch with Python UDFs",
      type = Template.TemplateType.XLANG,
      description =
          "The BigQuery to Elasticsearch template is a batch pipeline that ingests data from a BigQuery table into Elasticsearch as documents. "
              + "The template can either read the entire table or read specific records using a supplied query.",
      optionsClass = BigQueryToElasticsearchOptions.class,
      skipOptions = {
        "javascriptTextTransformReloadIntervalMinutes",
        "javascriptTextTransformGcsPath",
        "javascriptTextTransformFunctionName"
      },
      flexContainerName = "bigquery-to-elasticsearch-xlang",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/bigquery-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The source BigQuery table must exist.",
        "A Elasticsearch host on a Google Cloud instance or on Elastic Cloud with Elasticsearch version 7.0 or above and should be accessible from the Dataflow worker machines.",
      })
})
public class BigQueryToElasticsearch {
  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    BigQueryToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(BigQueryToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline with the supplied options.
   *
   * @param options The execution parameters to the pipeline.
   * @return The result of the pipeline execution.
   */
  private static PipelineResult run(BigQueryToElasticsearchOptions options) {

    // Create the pipeline.
    Pipeline pipeline = Pipeline.create(options);
    /*
     * Steps: 1) Read records from BigQuery via BigQueryIO.
     *        2) Create json string from Table Row.
     *        3) Apply a JavaScript or Python UDF (if specified).
     *        4) Write records to Elasticsearch.
     */

    boolean useJavascriptUdf = !Strings.isNullOrEmpty(options.getJavascriptTextTransformGcsPath());
    boolean usePythonUdf = !Strings.isNullOrEmpty(options.getPythonExternalTextTransformGcsPath());
    if (useJavascriptUdf && usePythonUdf) {
      throw new IllegalArgumentException(
          "Either javascript or Python gcs path must be provided, but not both.");
    }

    /*
     * Step #1: Read from BigQuery. If a query is provided then it is used to get the TableRows.
     */
    PCollection<String> readJsonDocuments =
        pipeline
            .apply(
                "ReadFromBigQuery",
                ReadBigQueryTableRows.newBuilder()
                    .setOptions(options.as(BigQueryToElasticsearchOptions.class))
                    .build())

            /*
             * Step #2: Convert table rows to JSON documents.
             */
            .apply("TableRowsToJsonDocument", ParDo.of(new TableRowToJsonFn()));

    /*
     * Step #3: Apply UDF functions (if specified)
     */
    PCollection<String> udfOut;
    if (usePythonUdf) {
      udfOut =
          readJsonDocuments
              .apply(
                  "MapToRecord",
                  PythonExternalTextTransformer.FailsafeRowPythonExternalUdf
                      .stringMappingFunction())
              .setRowSchema(PythonExternalTextTransformer.FailsafeRowPythonExternalUdf.ROW_SCHEMA)
              .apply(
                  "InvokeUDF",
                  PythonExternalTextTransformer.FailsafePythonExternalUdf.newBuilder()
                      .setFileSystemPath(options.getPythonExternalTextTransformGcsPath())
                      .setFunctionName(options.getPythonExternalTextTransformFunctionName())
                      .build())
              .apply(
                  "MapToStringElements",
                  ParDo.of(new PythonExternalTextTransformer.RowToStringElementFn()));
    } else {
      udfOut =
          readJsonDocuments.apply(
              TransformTextViaJavascript.newBuilder()
                  .setFileSystemPath(options.getJavascriptTextTransformGcsPath())
                  .setFunctionName(options.getJavascriptTextTransformFunctionName())
                  .build());
    }

    /*
     * Step #4: Write converted records to Elasticsearch
     */
    udfOut.apply(
        "WriteToElasticsearch",
        WriteToElasticsearch.newBuilder()
            .setUserAgent("dataflow-bigquery-to-elasticsearch-template/v2")
            .setOptions(options.as(BigQueryToElasticsearchOptions.class))
            .build());

    return pipeline.run();
  }
}

What's next

Learn about Dataflow templates.
See the list of Google-provided templates.
