Use the Cloud Storage connector with Apache Spark

This tutorial shows you how to run example code that uses the Cloud Storage connector with Apache Spark.

Objectives

Write a simple WordCount Spark job in Java, Scala, or Python, then run the job on a Dataproc cluster.

Costs

This document uses the following billable components of Google Cloud:

Compute Engine
Dataproc
Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

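Before you begin

Follow the steps below to prepare to run the code in this tutorial.

Set up your project. If necessary, set up a project with the Dataproc, Compute Engine, and Cloud Storage APIs enabled and the Google Cloud CLI installed on your local machine.
1. Create a Cloud Storage bucket. You need a Cloud Storage bucket to hold the tutorial data. If you do not have one ready to use, create a new bucket in your project.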
  1. In the Google Cloud console, go to the Cloud Storage Buckets page.
    Go to Buckets
  2. Click Create.
  3. On the Create a bucket page, enter your bucket information. To go to the next step, click Continue.
    1. In the Get started section, do the following:
      - Enter a globally unique name that meets the bucket naming requirements.
      - To add a bucket label, expand the Labels section, click Add label, and specify a key and a value for your label.
    2. In the Choose where to store your data section, do the following:
      1. Select a Location type.
      2. Choose a location where your bucket's data is permanently stored from the Location type drop-down menu.
        
        If you select the dual-region location type, you can also choose to enable turbo replication by using the relevant checkbox.
      3. To set up cross-bucket replication, select Add cross-bucket replication via Storage Transfer Service and follow these steps:
        1. In the Bucket menu, select a bucket.
        2. In the Replication settings section, click Configure to configure settings for the replication job.
          The Configure cross-bucket replication pane appears.
        3. To filter objects to replicate by object name prefix, enter a prefix that you want to include or exclude objects from, then click Add a prefix.
        4. To set a storage class for the replicated objects, select a storage class from the Storage class menu. If you skip this step, the replicated objects will use the destination bucket's storage class by default.
        5. Click Done.
    3. In the Choose how to store your data section, do the following:
      1. Select a default storage class for the bucket or Autoclass for automatic storage class management of your bucket's data.
      2. To enable hierarchical namespace, in the Optimize storage for data-intensive workloads section, select Enable hierarchical namespace on this bucket.
        Note: You cannot enable hierarchical namespace in existing buckets.
    4. In the Choose how to control access to objects section, select whether or not your bucket enforces public access prevention, and select an access control method for your bucket's objects.
      Note: You cannot change the Prevent public access setting if this setting is enforced by an organization policy.
    5. In the Choose how to protect object data section, do the following:
      - Select any of the options under Data protection that you want to set for your bucket.
        - To enable soft delete, click the Soft delete policy (For data recovery) checkbox, and specify the number of days you want to retain objects after deletion.
        - To set Object Versioning, click the Object versioning (For version control) checkbox, and specify the maximum number of versions per object and the number of days after which the noncurrent versions expire.
        - To enable the retention policy on objects and buckets, click the Retention (For compliance) checkbox, and then do the following:
          - To enable Object Retention Lock, click the Enable object retention checkbox.
          - To enable Bucket Lock, click the Set bucket retention policy checkbox, and choose a unit of time and a length of time for your retention period.
      - To choose how your object data will be encrypted, expand the Data encryption section, and select a Data encryption method.
  4. Click Create.
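If you prefer the command line to the console steps above, a bucket can also be created with the Google Cloud CLI. The following is a minimal sketch rather than part of the original tutorial: the function name `create_tutorial_bucket` is hypothetical, and it assumes that the default bucket settings (storage class, access control, data protection) are acceptable for tutorial data.

```shell
# Hypothetical helper (not part of the original tutorial): create a bucket
# with default settings in the given project and location.
# Usage: create_tutorial_bucket BUCKET_NAME PROJECT_ID LOCATION
create_tutorial_bucket() {
  gcloud storage buckets create "gs://$1" \
      --project="$2" \
      --location="$3"
}
```

For example, `create_tutorial_bucket bucket-name project-id us-central1` would request a bucket in `us-central1`. Nothing runs until the function is called.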
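2. Set local environment variables. Set environment variables on your local machine. Set your Google Cloud project ID and the name of the Cloud Storage bucket that you will use for this tutorial. Also provide the name and region of an existing or new Dataproc cluster. You can create a cluster to use in this tutorial in the next step.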
```
PROJECT=project-id
BUCKET_NAME=bucket-name
CLUSTER=cluster-name
REGION=cluster-region  # Example: "us-central1"
```
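Because the later commands expand these variables directly, an empty variable tends to surface only as a confusing gcloud error. As a minimal sketch, the hypothetical function below (`check_tutorial_env` is not part of the original tutorial) reports which of the four variables are still unset or empty.

```shell
# Hypothetical helper: report which tutorial variables are unset or empty.
# Returns 0 when all four are set, 1 otherwise.
check_tutorial_env() {
  missing=0
  for name in PROJECT BUCKET_NAME CLUSTER REGION; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_tutorial_env` before creating the cluster gives a quick confirmation that the setup step was completed in the current shell session.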
3. Create a Dataproc cluster. Run the following command to create a single-node Dataproc cluster in the specified region.
```
gcloud dataproc clusters create ${CLUSTER} \
    --project=${PROJECT} \
    --region=${REGION} \
    --single-node
```
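  The above command installs the default cluster image version. You can use the --image-version flag to select an image version for your cluster. Each image version installs specific versions of the Spark and Scala library components. If you prepare the Spark WordCount job in Java or Scala, refer to the Spark and Scala versions installed on your cluster when you prepare the job package.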
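To find out which image version an existing cluster is running, you can query the cluster description. This is a minimal sketch: the function name `cluster_image_version` is hypothetical, and it assumes that CLUSTER, PROJECT, and REGION are set as described earlier. The Spark and Scala versions for that image version can then be looked up in the Dataproc image version lists.

```shell
# Hypothetical helper: print the image version of the tutorial cluster.
# Assumes CLUSTER, PROJECT, and REGION are set in the current shell.
cluster_image_version() {
  gcloud dataproc clusters describe "${CLUSTER}" \
      --project="${PROJECT}" \
      --region="${REGION}" \
      --format="value(config.softwareConfig.imageVersion)"
}
```

Run `cluster_image_version` after the cluster has been created.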
4. Copy public data to your Cloud Storage bucket. Copy a public data Shakespeare text snippet into the input folder of your Cloud Storage bucket.
```
gcloud storage cp gs://pub/shakespeare/rose.txt \
    gs://${BUCKET_NAME}/input/rose.txt
```
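To confirm that the copy succeeded, you can list the contents of the input folder. This is a minimal sketch: the function name `list_tutorial_input` is hypothetical, and it assumes that BUCKET_NAME is set as described earlier.

```shell
# Hypothetical helper: list the objects in the tutorial bucket's input folder.
# Assumes BUCKET_NAME is set in the current shell.
list_tutorial_input() {
  gcloud storage ls "gs://${BUCKET_NAME}/input/"
}
```

If the copy worked, calling `list_tutorial_input` should include `rose.txt` in its listing.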
5. Set up a Java (Apache Maven), Scala (SBT), or Python development environment.
  Use Cloud Shell. Cloud Shell includes the tools used in this tutorial, including Apache Maven, Python, and the Google Cloud CLI.
