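Cloud Storage to Elasticsearch template

The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.

Pipeline requirements

The Cloud Storage bucket must exist.
An Elasticsearch host on a Google Cloud instance or on Elastic Cloud that is accessible from Dataflow must exist.
A BigQuery table for error output must exist.

CSV schema

If the CSV files contain headers, set the containsHeaders template parameter to true.

Otherwise, create a JSON schema file that describes the data. Specify the Cloud Storage URI of the schema file in the jsonSchemaPath template parameter. The following example shows a JSON schema: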

[{"name":"id", "type":"text"}, {"name":"age", "type":"integer"}]
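
To make the role of the schema concrete, the following is a minimal sketch of how a schema like the one above could map the columns of one CSV line onto named, typed fields. It is not the template's actual conversion code: the helper name lineToDocument, the naive comma split (which ignores quoted delimiters), and the handling of the integer type are all assumptions made for illustration.

```javascript
// Hypothetical illustration only: the template performs this conversion internally.
var exampleSchema = [{"name":"id", "type":"text"}, {"name":"age", "type":"integer"}];

function lineToDocument(line, schema, delimiter) {
  // Naive split: does not handle quoted fields that contain the delimiter.
  var values = line.split(delimiter);
  var doc = {};
  for (var i = 0; i < schema.length; i++) {
    var raw = values[i];
    // Assumption: "integer" columns become JSON numbers, everything else stays a string.
    doc[schema[i].name] = schema[i].type === "integer" ? parseInt(raw, 10) : raw;
  }
  return JSON.stringify(doc);
}
```

With the example schema, the line abc,42 would become a document with a string id and a numeric age.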

Alternatively, you can provide a user-defined function (UDF) that parses the CSV text and outputs Elasticsearch documents.

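Template parameters

Required parameters

deadletterTable: The BigQuery dead-letter table to send failed inserts to. For example, your-project:your-dataset.your-table-name.
inputFileSpec: The Cloud Storage file pattern to search for CSV files. For example, gs://mybucket/test-*.csv.
connectionUrl: The Elasticsearch URL in the format https://hostname:[port]. If you use Elastic Cloud, specify the CloudID. For example, https://elasticsearch-host:9200.
apiKey: The Base64-encoded API key to use for authentication.
index: The Elasticsearch index that the requests are issued to. For example, my-index.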

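Optional parameters

inputFormat: The input file format. Defaults to CSV.
containsHeaders: Whether the input CSV files contain a header record (true/false). Only required when reading CSV files. Defaults to false.
delimiter: The column delimiter of the input text files. Defaults to a comma (,).
csvFormat: The CSV format specification to use for parsing records. Defaults to Default. For details, see https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html. The value must exactly match one of the format names listed at https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.Predefined.html.
jsonSchemaPath: The path to the JSON schema. Defaults to null. For example, gs://path/to/schema.
largeNumFiles: Set to true if the number of files is in the tens of thousands. Defaults to false.
csvFileEncoding: The character encoding of the CSV files. Allowed values are US-ASCII, ISO-8859-1, UTF-8, and UTF-16. Defaults to UTF-8.
logDetailedCsvConversionErrors: Set to true to enable detailed error logging when CSV parsing fails. Note that this might expose sensitive data in the logs (for example, if the CSV file contains passwords). Defaults to false.
elasticsearchUsername: The Elasticsearch username to authenticate with. If specified, the value of apiKey is ignored.
elasticsearchPassword: The Elasticsearch password to authenticate with. If specified, the value of apiKey is ignored.
batchSize: The batch size in number of documents. Defaults to 1000.
batchSizeBytes: The batch size in number of bytes. Defaults to 5242880 (5 MB).
maxRetryAttempts: The maximum number of retry attempts. Must be greater than zero. If set, maxRetryDuration must also be set. Defaults to no retries.
maxRetryDuration: The maximum retry duration in milliseconds. Must be greater than zero. If set, maxRetryAttempts must also be set. Defaults to no retries.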
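propertyAsIndex: The property in the document being indexed whose value specifies the _index metadata to include with the document in bulk requests. Takes precedence over an _index UDF. Defaults to none.
javaScriptIndexFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _index metadata to include with the document in bulk requests. Defaults to none.
javaScriptIndexFnName: The name of the UDF JavaScript function that specifies the _index metadata to include with the document in bulk requests. Defaults to none.
propertyAsId: The property in the document being indexed whose value specifies the _id metadata to include with the document in bulk requests. Takes precedence over an _id UDF. Defaults to none.
javaScriptIdFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _id metadata to include with the document in bulk requests. Defaults to none.
javaScriptIdFnName: The name of the UDF JavaScript function that specifies the _id metadata to include with the document in bulk requests. Defaults to none.
javaScriptTypeFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that specifies the _type metadata to include with the document in bulk requests. Defaults to none.
javaScriptTypeFnName: The name of the UDF JavaScript function that specifies the _type metadata to include with the document in bulk requests. Defaults to none.
javaScriptIsDeleteFnGcsPath: The Cloud Storage path to the JavaScript UDF source for a function that determines whether to delete the document rather than insert or update it. The function returns a string value of true or false. Defaults to none.
javaScriptIsDeleteFnName: The name of the UDF JavaScript function that determines whether to delete the document rather than insert or update it. The function returns a string value of true or false. Defaults to none.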
usePartialUpdate: Whether to use partial updates (update rather than create or index, allowing partial documents) with Elasticsearch requests. Defaults to false.
bulkInsertMethod: Whether to use INDEX (index, allows upserts) or CREATE (create, errors on a duplicate _id) with Elasticsearch bulk requests. Defaults to CREATE.
trustSelfSignedCerts: Whether to trust self-signed certificates. An installed Elasticsearch instance might have a self-signed certificate. Set this to true to bypass validation of the SSL certificate. Defaults to false.
disableCertificateValidation: If true, trust the self-signed SSL certificate. An Elasticsearch instance might have a self-signed certificate. To bypass validation of the certificate, set this parameter to true. Defaults to false.
apiKeyKMSEncryptionKey: The Cloud KMS key to decrypt the API key. This parameter is required if apiKeySource is set to KMS. If this parameter is provided, pass in an encrypted apiKey string. Encrypt parameters using the KMS API encrypt endpoint. For the key, use the format projects/<PROJECT_ID>/locations/<KEY_REGION>/keyRings/<KEY_RING>/cryptoKeys/<KMS_KEY_NAME>. See https://cloud.google.com/kms/docs/reference/rest/v1/projects.locations.keyRings.cryptoKeys/encrypt. For example, projects/your-project-id/locations/global/keyRings/your-keyring/cryptoKeys/your-key-name.
apiKeySecretId: The Secret Manager secret ID for the apiKey. If apiKeySource is set to SECRET_MANAGER, provide this parameter. Use the format projects/<PROJECT_ID>/secrets/<SECRET_ID>/versions/<SECRET_VERSION>. For example, projects/your-project-id/secrets/your-secret/versions/your-secret-version.
apiKeySource: The source of the API key. Allowed values are PLAINTEXT, KMS, and SECRET_MANAGER. This parameter is required when you use Secret Manager or KMS. If apiKeySource is set to KMS, apiKeyKMSEncryptionKey and an encrypted apiKey must be provided. If apiKeySource is set to SECRET_MANAGER, apiKeySecretId must be provided. If apiKeySource is set to PLAINTEXT, apiKey must be provided. Defaults to PLAINTEXT.
socketTimeout: If set, overwrites the default max retry timeout and the default socket timeout (30000 ms) in the Elastic RestClient.
javascriptTextTransformGcsPath: The Cloud Storage URI of the .js file that defines the JavaScript user-defined function (UDF) to use. For example, gs://my-bucket/my-udfs/my_file.js.
javascriptTextTransformFunctionName: The name of the JavaScript user-defined function (UDF) to use. For example, if your JavaScript function code is myTransform(inJson) { /*...do stuff...*/ }, then the function name is myTransform. For sample JavaScript UDFs, see UDF Examples (https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples).
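
The apiKeySource rules listed above can be summarized as a small pre-flight check. The following sketch is only an illustration of those rules: the function name missingApiKeyParams and the plain-object parameter map are hypothetical and not part of the template. It also ignores the elasticsearchUsername and elasticsearchPassword override.

```javascript
// Hypothetical pre-flight check: returns the names of parameters that the
// apiKeySource rules require but that are missing from the given map.
function missingApiKeyParams(params) {
  var source = params.apiKeySource || "PLAINTEXT"; // PLAINTEXT is the documented default
  var required;
  if (source === "KMS") {
    required = ["apiKeyKMSEncryptionKey", "apiKey"]; // apiKey must be the encrypted string
  } else if (source === "SECRET_MANAGER") {
    required = ["apiKeySecretId"];
  } else if (source === "PLAINTEXT") {
    required = ["apiKey"];
  } else {
    throw new Error("Unknown apiKeySource: " + source);
  }
  return required.filter(function (name) { return !params[name]; });
}
```

An empty result means the combination satisfies the documented rules. It does not mean the key or secret itself is valid.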

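User-defined functions

This template supports user-defined functions (UDFs) at several points in the pipeline, described below. For more information, see Create user-defined functions for Dataflow templates.

Text transform function

Transforms the CSV data into an Elasticsearch document.

Template parameters:

javascriptTextTransformGcsPath: the Cloud Storage URI of the JavaScript file.
javascriptTextTransformFunctionName: the name of the JavaScript function.

Function specification:

Input: a single line from an input CSV file.
Output: a stringified JSON document to insert into Elasticsearch.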
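
As a minimal sketch of a function that meets this specification, the following UDF turns one CSV line into a stringified JSON document. The function name transform, the three-column layout (id, name, age), and the naive comma split are assumptions chosen for illustration. Adapt them to your own files.

```javascript
// Hypothetical text transform UDF. Assumes lines shaped like: id,name,age
function transform(line) {
  // Naive split: does not handle quoted fields that contain commas.
  var fields = line.split(",");
  var doc = {
    id: fields[0],
    name: fields[1],
    age: parseInt(fields[2], 10)
  };
  // The specification asks for a stringified JSON document.
  return JSON.stringify(doc);
}
```

To use a function like this, upload the file to Cloud Storage, pass its URI in javascriptTextTransformGcsPath, and pass the function name in javascriptTextTransformFunctionName.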

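Index function

Returns the index to which the document belongs.

Template parameters:

javaScriptIndexFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIndexFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _index metadata field.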
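
The following sketch shows one way an index function could route documents to different indexes. The function name getIndex, the country field, and the users- index naming scheme are hypothetical.

```javascript
// Hypothetical index UDF: routes each document to an index based on a "country" field.
function getIndex(docJson) {
  var doc = JSON.parse(docJson);
  if (doc.country) {
    // Index names are lowercased here because Elasticsearch index names must be lowercase.
    return "users-" + String(doc.country).toLowerCase();
  }
  return "users-default";
}
```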

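Document ID function

Returns the document ID.

Template parameters:

javaScriptIdFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIdFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _id metadata field.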
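
A document ID function can be as small as the sketch below, which reuses a field of the document as its _id. The function name getId and the id field are assumptions. Converting the value to a string is also a choice made here, not something the specification above states.

```javascript
// Hypothetical document ID UDF: uses the document's own "id" field as the _id value.
function getId(docJson) {
  var doc = JSON.parse(docJson);
  // String() keeps numeric and string IDs consistent.
  return String(doc.id);
}
```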

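Document deletion function

Specifies whether to delete a document. To use this function, set the bulk insert mode to INDEX and provide a document ID function.

Template parameters:

javaScriptIsDeleteFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptIsDeleteFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: return the string "true" to delete the document, or "false" to insert or update the document.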
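
The sketch below illustrates the string return convention described above. The function name isDelete and the deleted field are hypothetical. The field is compared against both the boolean and the string form because values derived from CSV text might arrive as strings.

```javascript
// Hypothetical deletion UDF: deletes documents whose "deleted" field is set.
function isDelete(docJson) {
  var doc = JSON.parse(docJson);
  var flagged = doc.deleted === true || doc.deleted === "true";
  // The specification asks for the strings "true" or "false", not booleans.
  return flagged ? "true" : "false";
}
```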

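Mapping type function

Returns the document's mapping type.

Template parameters:

javaScriptTypeFnGcsPath: the Cloud Storage URI of the JavaScript file.
javaScriptTypeFnName: the name of the JavaScript function.

Function specification:

Input: the Elasticsearch document, serialized as a JSON string.
Output: the value of the document's _type metadata field.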
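
A mapping type function follows the same shape as the other metadata functions. In the sketch below, the function name getType, the kind field, and the _doc fallback are assumptions. Recent Elasticsearch versions have deprecated mapping types, so check whether your cluster still accepts a _type value before relying on this.

```javascript
// Hypothetical mapping type UDF: derives _type from a "kind" field, with a fallback.
function getType(docJson) {
  var doc = JSON.parse(docJson);
  return doc.kind ? String(doc.kind) : "_doc";
}
```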

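Run the template

Console

Go to the Dataflow Create job from template page.

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Storage to Elasticsearch template.
In the provided parameter fields, enter your parameter values.
Click Run job.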

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \
    --project=PROJECT_ID \
    --region=REGION_NAME \
    --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/GCS_to_Elasticsearch \
    --parameters \
inputFileSpec=INPUT_FILE_SPEC,\
connectionUrl=CONNECTION_URL,\
apiKey=APIKEY,\
index=INDEX,\
deadletterTable=DEADLETTER_TABLE

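Replace the following:

PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/latest/
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-REGION_NAME/
Note: The latest version of the template might be updated with breaking changes. Production environments should use a template from the most recent dated parent folder so that these breaking changes do not affect production workflows.
REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.
DEADLETTER_TABLE: your BigQuery table.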

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch
{
   "launch_parameter": {
      "jobName": "JOB_NAME",
      "parameters": {
          "inputFileSpec": "INPUT_FILE_SPEC",
          "connectionUrl": "CONNECTION_URL",
          "apiKey": "APIKEY",
          "index": "INDEX",
          "deadletterTable": "DEADLETTER_TABLE"
      },
      "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/GCS_to_Elasticsearch"
   }
}

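Replace the following:

PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
VERSION: the version of the template that you want to use
You can use the following values:
- latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket: gs://dataflow-templates-LOCATION/latest/
- the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket: gs://dataflow-templates-LOCATION/
Note: The latest version of the template might be updated with breaking changes. Production environments should use a template from the most recent dated parent folder so that these breaking changes do not affect production workflows.
LOCATION: the region where you want to deploy your Dataflow job, for example us-central1
INPUT_FILE_SPEC: your Cloud Storage file pattern.
CONNECTION_URL: your Elasticsearch URL.
APIKEY: your Base64-encoded API key for authentication.
INDEX: your Elasticsearch index.
DEADLETTER_TABLE: your BigQuery table.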
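
The placeholder substitution above can also be done programmatically. The following sketch only assembles the URL and JSON body shown in the request. It does not send anything and does not handle authentication. The function name buildLaunchRequest and its argument order are hypothetical.

```javascript
// Hypothetical helper: assembles the launch request shown above without sending it.
function buildLaunchRequest(projectId, location, version, jobName, parameters) {
  return {
    method: "POST",
    url: "https://dataflow.googleapis.com/v1b3/projects/" + projectId +
      "/locations/" + location + "/flexTemplates:launch",
    body: {
      launch_parameter: {
        jobName: jobName,
        parameters: parameters,
        containerSpecGcsPath: "gs://dataflow-templates-" + location + "/" +
          version + "/flex/GCS_to_Elasticsearch"
      }
    }
  };
}

// Example with placeholder values, not real resources.
var request = buildLaunchRequest("my-project", "us-central1", "latest", "gcs-to-es-job", {
  inputFileSpec: "gs://mybucket/test-*.csv",
  connectionUrl: "https://elasticsearch-host:9200",
  apiKey: "APIKEY",
  index: "my-index",
  deadletterTable: "your-project:your-dataset.your-table-name"
});
```

Serializing request.body with JSON.stringify yields a body with the same structure as the one in the request above.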

Template source code

Java

/*
 * Copyright (C) 2021 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License"); you may not
 * use this file except in compliance with the License. You may obtain a copy of
 * the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations under
 * the License.
 */
package com.google.cloud.teleport.v2.elasticsearch.templates;

import static org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkArgument;

import com.google.cloud.teleport.metadata.MultiTemplate;
import com.google.cloud.teleport.metadata.Template;
import com.google.cloud.teleport.metadata.TemplateCategory;
import com.google.cloud.teleport.v2.coders.FailsafeElementCoder;
import com.google.cloud.teleport.v2.common.UncaughtExceptionLogger;
import com.google.cloud.teleport.v2.elasticsearch.options.GCSToElasticsearchOptions;
import com.google.cloud.teleport.v2.elasticsearch.transforms.WriteToElasticsearch;
import com.google.cloud.teleport.v2.transforms.CsvConverters;
import com.google.cloud.teleport.v2.transforms.ErrorConverters.WriteStringMessageErrors;
import com.google.cloud.teleport.v2.utils.SchemaUtils;
import com.google.cloud.teleport.v2.values.FailsafeElement;
import com.google.common.base.Strings;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.coders.CoderRegistry;
import org.apache.beam.sdk.coders.NullableCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Instant;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * The {@link GCSToElasticsearch} pipeline exports data from one or more CSV files in Cloud Storage
 * to Elasticsearch.
 *
 * <p>Check out <a
 * href="https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/googlecloud-to-elasticsearch/README_GCS_to_Elasticsearch.md">README</a>
 * for instructions on how to use or modify this template.
 */
@MultiTemplate({
  @Template(
      name = "GCS_to_Elasticsearch",
      category = TemplateCategory.BATCH,
      displayName = "Cloud Storage to Elasticsearch",
      description = {
        "The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.",
        "If the CSV files contain headers, set the <code>containsHeaders</code> template parameter to <code>true</code>.\n"
            + "Otherwise, create a JSON schema file that describes the data. Specify the Cloud Storage URI of the schema file in the jsonSchemaPath template parameter. "
            + "The following example shows a JSON schema:\n"
            + "<code>[{\"name\":\"id\", \"type\":\"text\"}, {\"name\":\"age\", \"type\":\"integer\"}]</code>\n"
            + "Alternatively, you can provide a Javascript user-defined function (UDF) that parses the CSV text and outputs Elasticsearch documents."
      },
      optionsClass = GCSToElasticsearchOptions.class,
      skipOptions = {
        "javascriptTextTransformReloadIntervalMinutes",
        "pythonExternalTextTransformGcsPath",
        "pythonExternalTextTransformFunctionName"
      },
      flexContainerName = "gcs-to-elasticsearch",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The Cloud Storage bucket must exist.",
        "A Elasticsearch host on a Google Cloud instance or on Elasticsearch Cloud that is accessible from Dataflow must exist.",
        "A BigQuery table for error output must exist."
      }),
  @Template(
      name = "GCS_to_Elasticsearch_Xlang",
      category = TemplateCategory.BATCH,
      displayName = "Cloud Storage to Elasticsearch with Python UDFs",
      type = Template.TemplateType.XLANG,
      description = {
        "The Cloud Storage to Elasticsearch template is a batch pipeline that reads data from CSV files stored in a Cloud Storage bucket and writes the data into Elasticsearch as JSON documents.",
        "If the CSV files contain headers, set the <code>containsHeaders</code> template parameter to <code>true</code>.\n"
            + "Otherwise, create a JSON schema file that describes the data. Specify the Cloud Storage URI of the schema file in the jsonSchemaPath template parameter. "
            + "The following example shows a JSON schema:\n"
            + "<code>[{\"name\":\"id\", \"type\":\"text\"}, {\"name\":\"age\", \"type\":\"integer\"}]</code>\n"
            + "Alternatively, you can provide a Python user-defined function (UDF) that parses the CSV text and outputs Elasticsearch documents."
      },
      optionsClass = GCSToElasticsearchOptions.class,
      skipOptions = {
        "javascriptTextTransformGcsPath",
        "javascriptTextTransformFunctionName",
        "javascriptTextTransformReloadIntervalMinutes"
      },
      flexContainerName = "gcs-to-elasticsearch-xlang",
      documentation =
          "https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-to-elasticsearch",
      contactInformation = "https://cloud.google.com/support",
      preview = true,
      requirements = {
        "The Cloud Storage bucket must exist.",
        "A Elasticsearch host on a Google Cloud instance or on Elasticsearch Cloud that is accessible from Dataflow must exist.",
        "A BigQuery table for error output must exist."
      })
})
public class GCSToElasticsearch {

  /** The tag for the headers of the CSV if required. */
  static final TupleTag<String> CSV_HEADERS = new TupleTag<String>() {};

  /** The tag for the lines of the CSV. */
  static final TupleTag<String> CSV_LINES = new TupleTag<String>() {};

  /** The tag for the dead-letter output of the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_DEADLETTER_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /** The tag for the main output for the UDF. */
  static final TupleTag<FailsafeElement<String, String>> PROCESSING_OUT =
      new TupleTag<FailsafeElement<String, String>>() {};

  /* Logger for class. */
  private static final Logger LOG = LoggerFactory.getLogger(GCSToElasticsearch.class);

  /** String/String Coder for FailsafeElement. */
  private static final FailsafeElementCoder<String, String> FAILSAFE_ELEMENT_CODER =
      FailsafeElementCoder.of(
          NullableCoder.of(StringUtf8Coder.of()), NullableCoder.of(StringUtf8Coder.of()));

  /**
   * Main entry point for pipeline execution.
   *
   * @param args Command line arguments to the pipeline.
   */
  public static void main(String[] args) {
    UncaughtExceptionLogger.register();

    GCSToElasticsearchOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(GCSToElasticsearchOptions.class);

    run(options);
  }

  /**
   * Runs the pipeline to completion with the specified options.
   *
   * @param options The execution options.
   * @return The pipeline result.
   */
  private static PipelineResult run(GCSToElasticsearchOptions options) {
    // Create the pipeline
    Pipeline pipeline = Pipeline.create(options);

    // Register the coder for pipeline
    CoderRegistry coderRegistry = pipeline.getCoderRegistry();
    coderRegistry.registerCoderForType(
        FAILSAFE_ELEMENT_CODER.getEncodedTypeDescriptor(), FAILSAFE_ELEMENT_CODER);

    // Throw error if containsHeaders is true and a schema or Udf is also set.
    if (options.getContainsHeaders()) {
      checkArgument(
          options.getJavascriptTextTransformGcsPath() == null
              && options.getJsonSchemaPath() == null
              && options.getPythonExternalTextTransformGcsPath() == null,
          "Cannot parse file containing headers with UDF or Json schema.");
    }

    // Throw error if only one retry configuration parameter is set.
    checkArgument(
        (options.getMaxRetryAttempts() == null && options.getMaxRetryDuration() == null)
            || (options.getMaxRetryAttempts() != null && options.getMaxRetryDuration() != null),
        "To specify retry configuration both max attempts and max duration must be set.");

    // Throw error if both Javascript UDF and Python UDF are set. We can only apply one or the
    // other.
    boolean useJavascriptUdf = !Strings.isNullOrEmpty(options.getJavascriptTextTransformGcsPath());
    boolean usePythonUdf = !Strings.isNullOrEmpty(options.getPythonExternalTextTransformGcsPath());
    if (useJavascriptUdf && usePythonUdf) {
      throw new IllegalArgumentException(
          "Either javascript or Python gcs path must be provided, but not both.");
    }

    /*
     * Steps: 1) Read records from CSV(s) via {@link CsvConverters.ReadCsv}.
     *        2) Convert lines to JSON strings via {@link CsvConverters.LineToFailsafeJson}.
     *        3a) Write JSON strings as documents to Elasticsearch via {@link ElasticsearchIO}.
     *        3b) Write elements that failed processing to {@link org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO}.
     */
    PCollectionTuple readCsvLines =
        pipeline
            /*
             * Step 1: Read CSV file(s) from Cloud Storage using {@link CsvConverters.ReadCsv}.
             */
            .apply(
            "ReadCsv",
            CsvConverters.ReadCsv.newBuilder()
                .setCsvFormat(options.getCsvFormat())
                .setDelimiter(options.getDelimiter())
                .setHasHeaders(options.getContainsHeaders())
                .setInputFileSpec(options.getInputFileSpec())
                .setHeaderTag(CSV_HEADERS)
                .setLineTag(CSV_LINES)
                .setFileEncoding(options.getCsvFileEncoding())
                .build());
    /*
     * Step 2: Convert lines to Elasticsearch document.
     */
    CsvConverters.LineToFailsafeJson.Builder lineToFailsafeJsonBuilder =
        CsvConverters.LineToFailsafeJson.newBuilder()
            .setDelimiter(options.getDelimiter())
            .setJsonSchemaPath(options.getJsonSchemaPath())
            .setHeaderTag(CSV_HEADERS)
            .setLineTag(CSV_LINES)
            .setUdfOutputTag(PROCESSING_OUT)
            .setUdfDeadletterTag(PROCESSING_DEADLETTER_OUT);
    if (options.getPythonExternalTextTransformGcsPath() != null) {
      lineToFailsafeJsonBuilder
          .setPythonUdfFileSystemPath(options.getPythonExternalTextTransformGcsPath())
          .setPythonUdfFunctionName(options.getPythonExternalTextTransformFunctionName());
    } else {
      lineToFailsafeJsonBuilder
          .setJavascriptUdfFileSystemPath(options.getJavascriptTextTransformGcsPath())
          .setJavascriptUdfFunctionName(options.getJavascriptTextTransformFunctionName());
    }
    PCollectionTuple convertedCsvLines =
        readCsvLines.apply("ConvertLine", lineToFailsafeJsonBuilder.build());
    /*
     * Step 3a: Write elements that were successfully processed to Elasticsearch using {@link WriteToElasticsearch}.
     */
    convertedCsvLines
        .get(PROCESSING_OUT)
        .apply(
            "GetJsonDocuments",
            MapElements.into(TypeDescriptors.strings()).via(FailsafeElement::getPayload))
        .apply(
            "WriteToElasticsearch",
            WriteToElasticsearch.newBuilder()
                .setUserAgent("dataflow-gcs-to-elasticsearch-template/v2")
                .setOptions(options.as(GCSToElasticsearchOptions.class))
                .build());

    /*
     * Step 3b: Write elements that failed processing to deadletter table via {@link BigQueryIO}.
     */
    convertedCsvLines
        .get(PROCESSING_DEADLETTER_OUT)
        .apply(
            "AddTimestamps",
            WithTimestamps.of((FailsafeElement<String, String> failures) -> new Instant()))
        .apply(
            "WriteFailedElementsToBigQuery",
            WriteStringMessageErrors.newBuilder()
                .setErrorRecordsTable(options.getDeadletterTable())
                .setErrorRecordsTableSchema(SchemaUtils.DEADLETTER_SCHEMA)
                .build());

    return pipeline.run();
  }
}

What's next

Learn about Dataflow templates.
See the list of Google-provided templates.
