Crie um pipeline do Dataflow com Python

Este documento mostra como usar o SDK Apache Beam para Python para criar um programa que define um pipeline. Em seguida, executa o pipeline através de um executor local direto ou de um executor baseado na nuvem, como o Dataflow. Para uma introdução ao pipeline WordCount, consulte o vídeo Como usar o WordCount no Apache Beam.

Para seguir orientações passo a passo para esta tarefa diretamente na Google Cloud consola, clique em Orientar-me:

Visita guiada

Antes de começar

Sign in to your Google Cloud Platform account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

Se estiver a usar um fornecedor de identidade (IdP) externo, tem primeiro de iniciar sessão na CLI gcloud com a sua identidade federada.

Para inicializar a CLI gcloud, execute o seguinte comando:

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable dataflow compute_component logging storage_component storage_api bigquery pubsub datastore.googleapis.com cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/iam.supportUser, roles/datastream.admin, roles/monitoring.metricsScopesViewer, roles/cloudaicompanion.settingsAdmin

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: Your project ID.
USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
ROLE: The IAM role that you grant to your user account.

Install the Google Cloud CLI.

Se estiver a usar um fornecedor de identidade (IdP) externo, tem primeiro de iniciar sessão na CLI gcloud com a sua identidade federada.

Para inicializar a CLI gcloud, execute o seguinte comando:

gcloud init

Create or select a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Dataflow, Compute Engine, Cloud Logging, Cloud Storage, Google Cloud Storage JSON, BigQuery, Cloud Pub/Sub, Cloud Datastore, and Cloud Resource Manager APIs:

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

gcloud services enable dataflow compute_component logging storage_component storage_api bigquery pubsub datastore.googleapis.com cloudresourcemanager.googleapis.com

Create local authentication credentials for your user account:

gcloud auth application-default login

If an authentication error is returned, and you are using an external identity provider (IdP), confirm that you have signed in to the gcloud CLI with your federated identity.

gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE

Replace the following:

PROJECT_ID: Your project ID.
USER_IDENTIFIER: The identifier for your user account. For example, myemail@example.com.
ROLE: The IAM role that you grant to your user account.

Conceda funções à conta de serviço predefinida do Compute Engine. Execute o seguinte comando uma vez para cada uma das seguintes funções do IAM:
- roles/dataflow.admin
- roles/dataflow.worker
- roles/storage.objectAdmin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" --role=SERVICE_ACCOUNT_ROLE
```
- Substitua PROJECT_ID pelo ID do seu projeto.
- Substitua PROJECT_NUMBER pelo número do seu projeto. Para encontrar o número do projeto, consulte o artigo Identifique projetos ou use o comando gcloud projects describe.
- Substitua SERVICE_ACCOUNT_ROLE por cada função individual.
Create a Cloud Storage bucket and configure it as follows:
- Set the storage class to S (Padrão).
- Defina a localização do armazenamento para o seguinte: US (Estados Unidos).
- Substitua BUCKET_NAME por um nome de contentor exclusivo. Não inclua informações confidenciais no nome do contentor, uma vez que o espaço de nomes do contentor é global e visível publicamente.
- Copie o Google Cloud ID do projeto e o nome do contentor do Cloud Storage. Vai precisar destes valores mais tarde neste documento.

Crie um pipeline do Dataflow com Python

Antes de começar

Configure o seu ambiente

Obtenha o SDK do Apache Beam

Execute o pipeline localmente

Execute o pipeline no serviço Dataflow

Veja os resultados

Google Cloud consola

Terminal local

Modifique o código da conduta

Limpar

O que se segue?