Use Trino with Dataproc

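Trino (formerly Presto) is a distributed SQL query engine designed to query large datasets distributed over one or more heterogeneous data sources. Through connectors, Trino can query Hive, MySQL, Kafka, and other data sources. This tutorial shows you how to:

- Install the Trino service on a Dataproc cluster
- Query public data from a Trino client installed on your local machine that communicates with the Trino service on the cluster
- Run queries from a Java application that communicates with the Trino service on the cluster through the Trino Java JDBC driver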

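Objectives

Create a Dataproc cluster with Trino installed.

Prepare the data. This tutorial uses the Chicago Taxi Trips public dataset, which is available in BigQuery.

Extract the data from BigQuery.
Load the data into Cloud Storage as CSV files.
Transform the data:
1. Expose the data as a Hive external table so that Trino can query it.
2. Convert the data from CSV format to Parquet format so that queries run faster.

Send queries from the Trino CLI or from application code to the Trino coordinator running on the cluster, using an SSH tunnel or the Trino JDBC driver, respectively.

Check the logs and monitor the Trino service through the Trino web UI.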

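Costs

In this document, you use billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

Before you begin

If you haven't already done so, create a Google Cloud project and a Cloud Storage bucket to hold the data used in this tutorial.

Set up your project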

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Dataproc, Compute Engine, Cloud Storage, and BigQuery APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the APIs

Install the Google Cloud CLI.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command:

gcloud init

Create a Cloud Storage bucket

In the Google Cloud console, go to the Cloud Storage Buckets page.
Go to Buckets
Click Create.
On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
1. In the Get started section, do the following:
  - Enter a globally unique name that meets the bucket naming requirements.
  - To add a bucket label, expand the Labels section, click Add label, and specify a key and a value for your label.
2. In the Choose where to store your data section, do the following:
  1. Select a Location type.
  2. Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
    - If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
  3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
    Set up cross-bucket replication
    
    In the Bucket menu, select a bucket.
    
    In the Replication settings section, click Configure to configure settings for the replication job.
    
    The Configure cross-bucket replication pane appears.
    
    To filter objects to replicate by object name prefix, enter a prefix that you want to include or exclude objects from, then click Add a prefix.
    
    To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
    
    Click Done.
3. In the Choose how to store your data section, do the following:
  1. Select a default storage class for the bucket or Autoclass for automatic storage class management of your bucket's data.
  2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
    Note: You cannot enable hierarchical namespace in existing buckets.
4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
  Note: You cannot change the Prevent public access setting if this setting is enforced at an organization policy.
5. In the Choose how to protect object data section, do the following:
  - Select any of the options under Data protection that you want to set for your bucket.
    - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
    - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
    - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
      - To enable Object Retention Lock, click the Enable object retention checkbox.
      - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
  - To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
Click Create.

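Create a Dataproc cluster

Create a Dataproc cluster with the optional-components flag (available on image version 2.1 and later) to install the Trino optional component on the cluster, and with the enable-component-gateway flag to enable the Component Gateway so that you can access the Trino web UI from the Google Cloud console.

Set environment variables:
- PROJECT: your project ID
- BUCKET_NAME: the name of the Cloud Storage bucket that you created in Before you begin
- REGION: the region where the cluster used in this tutorial is created, for example, 'us-west1'
- WORKERS: 3 to 5 workers are recommended for this tutorial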
```
export PROJECT=project-id
export WORKERS=number
export REGION=region
export BUCKET_NAME=bucket-name
```
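
Every later command expands these variables, so an unset one tends to surface only as a confusing gcloud or bq error. The following is a minimal sketch of a pre-flight check; the function name require_vars is illustrative and is not part of any Google Cloud tool.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 when all variables are set, 1 otherwise.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "Missing environment variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Check the four variables this tutorial relies on.
require_vars PROJECT WORKERS REGION BUCKET_NAME \
  || echo "Set the variables listed above before continuing." >&2
```

Running the check in the same shell session as the export commands confirms that the values are visible to the commands that follow.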
Run the Google Cloud CLI on your local machine to create the cluster.
```
gcloud beta dataproc clusters create trino-cluster \
    --project=${PROJECT} \
    --region=${REGION} \
    --num-workers=${WORKERS} \
    --scopes=cloud-platform \
    --optional-components=TRINO \
    --image-version=2.1  \
    --enable-component-gateway
```
Creating a high availability cluster (a cluster with multiple master nodes) is not recommended, because the coordinator starts on master node 0 and all other master nodes remain idle.
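
Before moving on to the data steps, it can help to confirm that the cluster reached a usable state. The sketch below wraps gcloud dataproc clusters describe in a small function; the name cluster_state is an assumption made for illustration, and the function expects REGION to be set as described earlier.

```shell
# Hypothetical helper: print the lifecycle state of a Dataproc cluster.
# Defaults to the trino-cluster name used in this tutorial.
# Assumes REGION is exported and gcloud is authenticated.
cluster_state() {
  gcloud dataproc clusters describe "${1:-trino-cluster}" \
    --region="${REGION}" \
    --format="value(status.state)"
}
```

Calling cluster_state with no arguments after the create command returns is expected to print a state such as RUNNING when the cluster is ready to accept jobs.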

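Prepare the data

Export the bigquery-public-data chicago_taxi_trips dataset to Cloud Storage as CSV files, then create a Hive external table to reference the data.

On your local machine, run the following command to export the taxi data from BigQuery as CSV files without headers into the Cloud Storage bucket that you created in Before you begin.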

bq --location=us extract --destination_format=CSV \
     --field_delimiter=',' --print_header=false \
       "bigquery-public-data:chicago_taxi_trips.taxi_trips" \
       gs://${BUCKET_NAME}/chicago_taxi_trips/csv/shard-*.csv
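
The wildcard in the destination URI lets BigQuery split the export across multiple files, so the result is a set of shard files rather than a single CSV. As a rough sanity check, the sketch below counts the shards that landed in the bucket; count_csv_shards is a hypothetical helper name, and it assumes BUCKET_NAME is set as described earlier.

```shell
# Hypothetical helper: count the CSV shards written by the bq extract command.
# Assumes BUCKET_NAME is exported and gcloud is authenticated.
count_csv_shards() {
  gcloud storage ls "gs://${BUCKET_NAME}/chicago_taxi_trips/csv/shard-*.csv" \
    | wc -l \
    | tr -d ' '
}
```

A count of zero suggests that the export did not complete or that it wrote to a different path.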

Create Hive external tables that are backed by the CSV and Parquet files in your Cloud Storage bucket.

Create the Hive external table chicago_taxi_trips_csv.

gcloud dataproc jobs submit hive \
    --cluster trino-cluster \
    --region=${REGION} \
    --execute "
        CREATE EXTERNAL TABLE chicago_taxi_trips_csv(
          unique_key   STRING,
          taxi_id  STRING,
          trip_start_timestamp  TIMESTAMP,
          trip_end_timestamp  TIMESTAMP,
          trip_seconds  INT,
          trip_miles   FLOAT,
          pickup_census_tract  INT,
          dropoff_census_tract  INT,
          pickup_community_area  INT,
          dropoff_community_area  INT,
          fare  FLOAT,
          tips  FLOAT,
          tolls  FLOAT,
          extras  FLOAT,
          trip_total  FLOAT,
          payment_type  STRING,
          company  STRING,
          pickup_latitude  FLOAT,
          pickup_longitude  FLOAT,
          pickup_location  STRING,
          dropoff_latitude  FLOAT,
          dropoff_longitude  FLOAT,
          dropoff_location  STRING)
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        location 'gs://${BUCKET_NAME}/chicago_taxi_trips/csv/';"

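Verify that the Hive external table was created.
Note: This check is optional. On a cluster with three or fewer nodes, the query can take several minutes. To save time and resources, you can skip it and verify only the final Hive Parquet table, as shown later in this section, which is much faster.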
```
gcloud dataproc jobs submit hive \
    --cluster trino-cluster \
    --region=${REGION} \
    --execute "SELECT COUNT(*) FROM chicago_taxi_trips_csv;"
```

Create another Hive external table, chicago_taxi_trips_parquet, with the same columns, but this time store the data in Parquet format for better query performance.

gcloud dataproc jobs submit hive \
    --cluster trino-cluster \
    --region=${REGION} \
    --execute "
        CREATE EXTERNAL TABLE chicago_taxi_trips_parquet(
          unique_key   STRING,
          taxi_id  STRING,
          trip_start_timestamp  TIMESTAMP,
          trip_end_timestamp  TIMESTAMP,
          trip_seconds  INT,
          trip_miles   FLOAT,
          pickup_census_tract  INT,
          dropoff_census_tract  INT,
          pickup_community_area  INT,
          dropoff_community_area  INT,
          fare  FLOAT,
          tips  FLOAT,
          tolls  FLOAT,
          extras  FLOAT,
          trip_total  FLOAT,
          payment_type  STRING,
          company  STRING,
          pickup_latitude  FLOAT,
          pickup_longitude  FLOAT,
          pickup_location  STRING,
          dropoff_latitude  FLOAT,
          dropoff_longitude  FLOAT,
          dropoff_location  STRING)
        STORED AS PARQUET
        location 'gs://${BUCKET_NAME}/chicago_taxi_trips/parquet/';"

Load the data from the Hive CSV table into the Hive Parquet table.

gcloud dataproc jobs submit hive \
    --cluster trino-cluster \
    --region=${REGION} \
    --execute "
        INSERT OVERWRITE TABLE chicago_taxi_trips_parquet
        SELECT * FROM chicago_taxi_trips_csv;"

Verify that the data loaded correctly.

gcloud dataproc jobs submit hive \
    --cluster trino-cluster \
    --region=${REGION} \
    --execute "SELECT COUNT(*) FROM chicago_taxi_trips_parquet;"

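Run queries

You can run queries locally from the Trino CLI or from an application.

Trino CLI queries

This section shows how to query the Hive Parquet taxi dataset using the Trino CLI.

On your local machine, run the following command to SSH into the cluster's master node. The local terminal stops responding while the command runs.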
```
gcloud compute ssh trino-cluster-m
```
In the SSH terminal window on the cluster's master node, run the Trino CLI, which connects to the Trino server running on the master node.
```
trino --catalog hive --schema default
```

At the trino:default prompt, verify that Trino can find the Hive tables.

show tables;

 Table
-----------------------------
 chicago_taxi_trips_csv
 chicago_taxi_trips_parquet
(2 rows)
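
Outside the interactive prompt, the Trino CLI can also run a single statement with its --execute option, which makes checks like this scriptable. The sketch below compares the row counts of the two tables from the master node's shell; trino_scalar and compare_counts are hypothetical helper names. Counting the CSV table scans the full dataset, so expect that query to take noticeably longer.

```shell
# Hypothetical helper: run one query non-interactively and print the raw value.
# Assumes the trino CLI is on the PATH, as on the cluster's master node.
trino_scalar() {
  trino --catalog hive --schema default --output-format TSV --execute "$1"
}

# Hypothetical helper: compare row counts of the CSV and Parquet tables.
# Returns 0 when the counts match, 1 otherwise.
compare_counts() {
  csv_count="$(trino_scalar 'SELECT count(*) FROM chicago_taxi_trips_csv')"
  parquet_count="$(trino_scalar 'SELECT count(*) FROM chicago_taxi_trips_parquet')"
  if [ "${csv_count}" = "${parquet_count}" ]; then
    echo "Row counts match: ${csv_count}"
  else
    echo "Row count mismatch: csv=${csv_count} parquet=${parquet_count}" >&2
    return 1
  fi
}
```

Matching counts indicate that the INSERT OVERWRITE step copied every row from the CSV table into the Parquet table.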

At the trino:default prompt, run queries and compare the performance of querying Parquet data versus CSV data.

Parquet data query

select count(*) from chicago_taxi_trips_parquet where trip_miles > 50;

 _col0
--------
 117957
(1 row)

Query 20180928_171735_00006_2sz8c, FINISHED, 3 nodes
Splits: 308 total, 308 done (100.00%)
0:16 [113M rows, 297MB] [6.91M rows/s, 18.2MB/s]

CSV data query

select count(*) from chicago_taxi_trips_csv where trip_miles > 50;

 _col0
--------
 117957
(1 row)

Query 20180928_171936_00009_2sz8c, FINISHED, 3 nodes
Splits: 881 total, 881 done (100.00%)
0:47 [113M rows, 41.5GB] [2.42M rows/s, 911MB/s]
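
To put the two sample runs side by side, the arithmetic below works out the ratios from the figures printed above. Both queries scan the same 113M rows, but the Parquet run reads far less data, which is consistent with a columnar format letting the engine read only the trip_miles column. The conversion of 41.5 GB to 42496 MB assumes 1 GB = 1024 MB, and the numbers come from single runs on a 3-node cluster, so treat them as indicative only.

```shell
# Ratio of two numbers, rounded to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Elapsed time: 47 seconds for CSV versus 16 seconds for Parquet.
ratio 47 16      # prints 2.9

# Data read: 41.5 GB (about 42496 MB) for CSV versus 297 MB for Parquet.
ratio 42496 297  # prints 143.1
```

In this sample, the Parquet query finishes roughly 3 times faster while reading about 140 times less data.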

Java application queries

To run queries from a Java application through the Trino Java JDBC driver:
1. Download the Trino Java JDBC driver.
2. Add a trino-jdbc dependency to your Maven pom.xml.

<dependency>
  <groupId>io.trino</groupId>
  <artifactId>trino-jdbc</artifactId>
  <version>376</version>
</dependency>

Sample Java code

package dataproc.codelab.trino;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;
public class TrinoQuery {
  private static final String URL = "jdbc:trino://trino-cluster-m:8080/hive/default";
  private static final String SOCKS_PROXY = "localhost:1080";
  private static final String USER = "user";
  private static final String QUERY =
      "select count(*) as count from chicago_taxi_trips_parquet where trip_miles > 50";
  public static void main(String[] args) {
    Properties properties = new Properties();
    properties.setProperty("user", USER);
    // Route the connection through the SOCKS proxy listening on SOCKS_PROXY.
    properties.setProperty("socksProxy", SOCKS_PROXY);
    // try-with-resources closes the connection, statement, and result set.
    try (Connection connection = DriverManager.getConnection(URL, properties);
        Statement stmt = connection.createStatement();
        ResultSet rs = stmt.executeQuery(QUERY)) {
      while (rs.next()) {
        // count(*) returns a BIGINT, so read it as a long.
        long count = rs.getLong("count");
        System.out.println("The number of long trips: " + count);
      }
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
}
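
The sample code expects a SOCKS proxy listening on localhost:1080. One way to provide it is an SSH tunnel with dynamic port forwarding to the cluster's master node, sketched below. The ZONE variable is an assumption that this tutorial does not define: set it to the Compute Engine zone of trino-cluster-m. The function name open_socks_tunnel is also illustrative.

```shell
# Hypothetical helper: open a SOCKS proxy on localhost:1080 through the
# cluster's master node. The command stays in the foreground until interrupted.
# Assumes PROJECT is exported; ZONE is not defined by this tutorial and must
# be set to the zone of trino-cluster-m.
open_socks_tunnel() {
  # -D 1080 enables dynamic port forwarding; -N skips running a remote command.
  gcloud compute ssh trino-cluster-m \
    --project="${PROJECT}" \
    --zone="${ZONE}" \
    -- -D 1080 -N
}
```

Run open_socks_tunnel in a separate terminal on your local machine and leave it open while the Java application runs.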

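Logging and monitoring

Logging

The Trino logs are located at /var/log/trino/ on the cluster's master and worker nodes.

Web UI

To open the Trino web UI running on the cluster's master node in your local browser, see Viewing and accessing Component Gateway URLs.

Monitoring

Trino exposes cluster runtime information through runtime tables. In a Trino session (at the trino:default prompt), run the following query to view runtime table data: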

select * FROM system.runtime.nodes;
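
Alongside the node list, Trino's built-in system.runtime.queries table records the queries that the coordinator is tracking. The sketch below lists the most recent ones from the master node's shell using the CLI's --execute option; list_queries is a hypothetical helper name, and the default limit of 10 is arbitrary.

```shell
# Hypothetical helper: show the most recently created queries and their states.
# Takes an optional row limit (default 10).
# Assumes the trino CLI is on the PATH, as on the cluster's master node.
list_queries() {
  trino --catalog hive --schema default --execute \
    "SELECT query_id, state, query FROM system.runtime.queries ORDER BY created DESC LIMIT ${1:-10}"
}
```

For example, list_queries 5 prints the five newest entries, which is a quick way to spot queries that are still running or that failed.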

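Clean up

After you finish the tutorial, you can clean up the resources that you created so that they stop using quota and incurring charges. The following sections describe how to delete or turn off these resources.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

To delete the project: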

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the cluster

To delete the cluster:

gcloud dataproc clusters delete --project=${PROJECT} trino-cluster \
    --region=${REGION}

Delete the bucket

To delete the Cloud Storage bucket that you created in Before you begin, including the data files stored in it:
```
gcloud storage rm gs://${BUCKET_NAME} --recursive
```
