To better promote the success of OSS solutions and customers' growing preference for them, Cloud Composer is evolving into Managed Service for Apache Airflow. This name change improves customers' understanding of our portfolio and reinforces our commitment to being the most open cloud ecosystem.

Run Managed Service for Apache Spark workloads with Managed Airflow

This page describes how to use Managed Airflow (Gen 2) to run Managed Service for Apache Spark workloads on Google Cloud.

The examples in the following sections show how to use operators to manage Managed Service for Apache Spark batch workloads. You use these operators in DAGs that create, delete, list, and get a Managed Service for Apache Spark batch workload:

- Create DAGs for operators that work with Managed Service for Apache Spark batch workloads.
- Create DAGs that use custom containers and Dataproc Metastore.
- Configure a Persistent History Server for these DAGs.

Before you begin

Enable the Dataproc API. To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission.

Console

Enable the Managed Service for Apache Spark API in the Google Cloud console.

gcloud

Enable the Managed Service for Apache Spark API:
```
gcloud services enable dataproc.googleapis.com
```
Select the location for the batch workload file. You can use any of the following options:
- Create a Cloud Storage bucket that stores this file.
- Use your environment's bucket. Because you don't need to sync this file with Airflow, you can create a separate subfolder outside the /dags or /data folders. For example: /batches.
- Use an existing bucket.
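
Whichever option you choose, the DAG needs the full Cloud Storage URI of the workload file. The following is a minimal sketch of how such a URI could be assembled. The helper name `workload_file_uri` and the sample bucket name are hypothetical and not part of this tutorial. Note that the DAG code on this page builds the URI as gs://BUCKET/spark-job.py, so a subfolder such as /batches would have to be reflected there as well.

```python
def workload_file_uri(bucket_name: str, file_name: str = "spark-job.py", folder: str = "") -> str:
    """Return the gs:// URI of a workload file stored in a bucket.

    Hypothetical helper: accepts a bare bucket name or a gs:// URI.
    """
    # Strip an optional gs:// prefix and any surrounding slashes.
    if bucket_name.startswith("gs://"):
        bucket_name = bucket_name[len("gs://"):]
    bucket = bucket_name.strip("/")
    if not bucket:
        raise ValueError("bucket_name must not be empty")

    # Join bucket, optional subfolder, and file name into one path.
    parts = [bucket]
    folder = folder.strip("/")
    if folder:
        parts.append(folder)
    parts.append(file_name)
    return "gs://" + "/".join(parts)


# Sample values are placeholders.
print(workload_file_uri("my-bucket"))
print(workload_file_uri("gs://my-bucket/", folder="/batches"))
```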

Configure files and Airflow variables

This section shows how to configure files and Airflow variables for this tutorial.

Upload a Managed Service for Apache Spark ML workload file to a bucket

The workload in this tutorial runs a pyspark script:

1. Save any pyspark script to a local file named spark-job.py. For example, you can use the sample pyspark script.
2. Upload the file to the location that you selected in Before you begin.

Set Airflow variables

The examples in the following sections use Airflow variables. You set values for these variables in Airflow, and the DAG code can then access these values.

Which variables you need depends on the example that you use. Set the following Airflow variables for use in your DAG code:

- project_id: Project ID.
- bucket_name: name of the bucket where the main Python file of the workload (spark-job.py) is located. You selected this location in Before you begin. The DAG code on this page prepends gs:// to this value, so specify the bucket name without the gs:// prefix.
- phs_cluster: Persistent History Server cluster name. You set this variable when you create a Persistent History Server.
- image_name: name and tag of the custom container image (image:tag). You set this variable when you use a custom container image with DataprocCreateBatchOperator.
- metastore_cluster: Dataproc Metastore service name. You set this variable when you use the Dataproc Metastore service with DataprocCreateBatchOperator.
- region_name: region where the Dataproc Metastore service is located. You set this variable when you use the Dataproc Metastore service with DataprocCreateBatchOperator. The example DAG on this page also passes this value as the region of every batch operator.
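
To make the role of each variable concrete, the following sketch renders the same resource paths that the example DAG on this page builds with Jinja templates, using plain Python string formatting instead. The function name `render_resource_paths` and all sample values are hypothetical; the path formats are copied from the DAG code.

```python
def render_resource_paths(variables: dict) -> dict:
    """Render the resource paths that the example DAG derives from Airflow variables."""
    project = variables["project_id"]
    region = variables["region_name"]
    paths = {
        # Location of the main Python file of the workload.
        "python_file_location": f"gs://{variables['bucket_name']}/spark-job.py",
    }
    # The remaining variables are optional, so render a path only when its input is set.
    if variables.get("phs_cluster"):
        paths["phs_cluster_path"] = (
            f"projects/{project}/regions/{region}/clusters/{variables['phs_cluster']}"
        )
    if variables.get("metastore_cluster"):
        paths["metastore_service_location"] = (
            f"projects/{project}/locations/{region}/services/{variables['metastore_cluster']}"
        )
    if variables.get("image_name"):
        paths["custom_container"] = f"us.gcr.io/{project}/{variables['image_name']}"
    return paths


# Placeholder values, not values from this tutorial.
example_paths = render_resource_paths(
    {
        "project_id": "my-project",
        "region_name": "us-central1",
        "bucket_name": "my-bucket",
        "phs_cluster": "my-phs",
    }
)
for key, value in example_paths.items():
    print(f"{key}: {value}")
```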

Use the Google Cloud console and Airflow UI to set each Airflow variable

1. In the Google Cloud console, go to the Environments page.
2. In the list of environments, click the Airflow link for your environment. The Airflow UI opens.
3. In the Airflow UI, select Admin > Variables.
4. Click Add a new record.
5. Specify the name of the variable in the Key field, and set its value in the Val field.
6. Click Save.
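
Entering several variables one by one is error-prone. As an alternative, Airflow can import variables from a JSON file that maps variable names to values, either through the import option on the Admin > Variables page or with the `airflow variables import` CLI command. The following sketch produces such a JSON document. The function name `build_variables_json` and all values are placeholders, not values from this tutorial.

```python
import json


def build_variables_json(variables: dict) -> str:
    """Serialize Airflow variables as a JSON object of name/value pairs."""
    # Drop unset values so that optional variables are simply absent from the file.
    cleaned = {key: str(value) for key, value in variables.items() if value is not None}
    return json.dumps(cleaned, indent=2, sort_keys=True)


# Placeholder values; optional variables that you don't use can be left as None.
variables_json = build_variables_json(
    {
        "project_id": "my-project",
        "region_name": "us-central1",
        "bucket_name": "my-bucket",
        "phs_cluster": "my-phs",
        "image_name": None,
        "metastore_cluster": None,
    }
)
print(variables_json)
```

You could save the printed output to a file, for example variables.json, and then import that file into your environment.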

Create a Persistent History Server

Use a Persistent History Server (PHS) to view Spark history files of your batch workloads:

1. Create a Persistent History Server.
2. Make sure that you specified the name of the PHS cluster in the phs_cluster Airflow variable.

DataprocCreateBatchOperator

The following DAG starts a Managed Service for Apache Spark batch workload.

For more information about DataprocCreateBatchOperator arguments, see the operator's source code.

For more information about attributes that you can pass in the batch parameter of DataprocCreateBatchOperator, see the description of the Batch class.


"""
Examples below show how to use operators for managing Dataproc Serverless batch workloads.
You use these operators in DAGs that create, delete, list, and get a Dataproc Serverless Spark batch workload.

This DAG relies on the following Airflow variables
(https://airflow.apache.org/docs/apache-airflow/stable/concepts/variables.html):
* project_id is the Google Cloud Project ID to use for the Cloud Dataproc Serverless.
* bucket_name is the name of the bucket (without the gs:// prefix) where the main python file of the workload (spark-job.py) is located.
* phs_cluster is the Persistent History Server cluster name.
* image_name is the name and tag of the custom container image (image:tag).
* metastore_cluster is the Dataproc Metastore service name.
* region_name is the region where the Dataproc Metastore service is located.
"""

import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateBatchOperator,
    DataprocDeleteBatchOperator,
    DataprocGetBatchOperator,
    DataprocListBatchesOperator,
)
from airflow.utils.dates import days_ago

PROJECT_ID = "{{ var.value.project_id }}"
REGION = "{{ var.value.region_name}}"
BUCKET = "{{ var.value.bucket_name }}"
PHS_CLUSTER = "{{ var.value.phs_cluster }}"
METASTORE_CLUSTER = "{{var.value.metastore_cluster}}"
DOCKER_IMAGE = "{{var.value.image_name}}"

PYTHON_FILE_LOCATION = "gs://{{var.value.bucket_name }}/spark-job.py"
# e.g. "gs://my-bucket/spark-job.py"
# Start a single node Dataproc Cluster for viewing Persistent History of Spark jobs
PHS_CLUSTER_PATH = "projects/{{ var.value.project_id }}/regions/{{ var.value.region_name}}/clusters/{{ var.value.phs_cluster }}"
# e.g. "projects/my-project/regions/my-region/clusters/my-cluster"
SPARK_BIGQUERY_JAR_FILE = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"
# use this for those pyspark jobs that need a spark-bigquery connector
# https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
# Start a Dataproc MetaStore Cluster
METASTORE_SERVICE_LOCATION = "projects/{{var.value.project_id}}/locations/{{var.value.region_name}}/services/{{var.value.metastore_cluster }}"
# for e.g. projects/my-project/locations/my-region/services/my-cluster
CUSTOM_CONTAINER = "us.gcr.io/{{var.value.project_id}}/{{ var.value.image_name}}"
# e.g. "us.gcr.io/my-project/quickstart-image"

default_args = {
    # Tell airflow to start one day ago, so that it runs as soon as you upload it
    "start_date": days_ago(1),
    "project_id": PROJECT_ID,
    "region": REGION,
}
with models.DAG(
    "dataproc_batch_operators",  # The id you will see in the DAG airflow page
    default_args=default_args,
    # The interval with which to schedule the DAG. Override to match your needs.
    schedule_interval=datetime.timedelta(days=1),
) as dag:
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": PYTHON_FILE_LOCATION,
                "jar_file_uris": [SPARK_BIGQUERY_JAR_FILE],
            },
            "environment_config": {
                "peripherals_config": {
                    "spark_history_server_config": {
                        "dataproc_cluster": PHS_CLUSTER_PATH,
                    },
                },
            },
        },
        batch_id="batch-create-phs",
    )
    list_batches = DataprocListBatchesOperator(
        task_id="list-all-batches",
    )

    get_batch = DataprocGetBatchOperator(
        task_id="get_batch",
        batch_id="batch-create-phs",
    )
    delete_batch = DataprocDeleteBatchOperator(
        task_id="delete_batch",
        batch_id="batch-create-phs",
    )
    create_batch >> list_batches >> get_batch >> delete_batch
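
The three DataprocCreateBatchOperator examples on this page pass batch dictionaries that differ only in a few optional keys. The following sketch, which is not part of the original sample, gathers that structure into one hypothetical helper, `build_pyspark_batch`. The key names are the ones used in the examples on this page, and the sample argument values are placeholders.

```python
from typing import Optional, Sequence


def build_pyspark_batch(
    main_python_file_uri: str,
    jar_file_uris: Sequence[str] = (),
    phs_cluster_path: Optional[str] = None,
    metastore_service: Optional[str] = None,
    container_image: Optional[str] = None,
) -> dict:
    """Build a batch dictionary with the same layout as the examples on this page."""
    pyspark_batch = {"main_python_file_uri": main_python_file_uri}
    if jar_file_uris:
        pyspark_batch["jar_file_uris"] = list(jar_file_uris)
    batch = {"pyspark_batch": pyspark_batch}

    # Peripherals: optional Dataproc Metastore service and Persistent History Server.
    peripherals = {}
    if metastore_service:
        peripherals["metastore_service"] = metastore_service
    if phs_cluster_path:
        peripherals["spark_history_server_config"] = {"dataproc_cluster": phs_cluster_path}
    if peripherals:
        batch["environment_config"] = {"peripherals_config": peripherals}

    # Optional custom container image.
    if container_image:
        batch["runtime_config"] = {"container_image": container_image}
    return batch


# Placeholder values, not values from this tutorial.
sample_batch = build_pyspark_batch(
    "gs://my-bucket/spark-job.py",
    phs_cluster_path="projects/my-project/regions/us-central1/clusters/my-phs",
)
print(sample_batch)
```

The returned dictionary could then be passed as the batch argument of DataprocCreateBatchOperator.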

Use custom container image with DataprocCreateBatchOperator

The following example shows how to use a custom container image to run your workloads. You can use a custom container, for example, to add Python dependencies that the default container image doesn't provide.

To use a custom container image:

1. Create a custom container image and upload it to Container Registry.
2. Specify the image in the image_name Airflow variable.
3. Use DataprocCreateBatchOperator with your custom image:

create_batch_with_custom_container = DataprocCreateBatchOperator(
    task_id="dataproc_custom_container",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": PYTHON_FILE_LOCATION,
            "jar_file_uris": [SPARK_BIGQUERY_JAR_FILE],
        },
        "environment_config": {
            "peripherals_config": {
                "spark_history_server_config": {
                    "dataproc_cluster": PHS_CLUSTER_PATH,
                },
            },
        },
        "runtime_config": {
            "container_image": CUSTOM_CONTAINER,
        },
    },
    batch_id="batch-custom-container",
)
get_batch_custom = DataprocGetBatchOperator(
    task_id="get_batch_custom",
    batch_id="batch-custom-container",
)
delete_batch_custom = DataprocDeleteBatchOperator(
    task_id="delete_batch_custom",
    batch_id="batch-custom-container",
)
create_batch_with_custom_container >> get_batch_custom >> delete_batch_custom

Use Dataproc Metastore service with DataprocCreateBatchOperator

To use a Dataproc Metastore service from a DAG:

1. Check that your metastore service is already started.

   To learn about starting a metastore service, see Enable and disable Dataproc Metastore.

   For detailed information about the batch operator for creating the configuration, see PeripheralsConfig.
2. After the metastore service is up and running, specify its name in the metastore_cluster variable and its region in the region_name Airflow variable.
3. Use the metastore service in DataprocCreateBatchOperator:

create_batch_with_metastore = DataprocCreateBatchOperator(
    task_id="dataproc_metastore",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": PYTHON_FILE_LOCATION,
            "jar_file_uris": [SPARK_BIGQUERY_JAR_FILE],
        },
        "environment_config": {
            "peripherals_config": {
                "metastore_service": METASTORE_SERVICE_LOCATION,
                "spark_history_server_config": {
                    "dataproc_cluster": PHS_CLUSTER_PATH,
                },
            },
        },
    },
    batch_id="dataproc-metastore",
)
get_batch_metastore = DataprocGetBatchOperator(
    task_id="get_batch_metastore",
    batch_id="dataproc-metastore",
)
delete_batch_metastore = DataprocDeleteBatchOperator(
    task_id="delete_batch_metastore",
    batch_id="dataproc-metastore",
)

create_batch_with_metastore >> get_batch_metastore >> delete_batch_metastore

DataprocDeleteBatchOperator

You can use DataprocDeleteBatchOperator to delete a batch based on the batch ID of the workload.

delete_batch = DataprocDeleteBatchOperator(
    task_id="delete_batch",
    batch_id="batch-create-phs",
)
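
Because the create, get, and delete tasks must all refer to the same batch ID, it can help to check the ID once before you use it. The following sketch assumes the constraint given in the Dataproc API reference for batch IDs (4 to 63 characters, using only lowercase letters, digits, and hyphens); verify this rule against the current API documentation. The function name `is_valid_batch_id` is hypothetical.

```python
import re

# Assumed constraint: 4 to 63 characters, only lowercase letters, digits, and hyphens.
_BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the assumed batch ID constraint."""
    return _BATCH_ID_PATTERN.fullmatch(batch_id) is not None


# The ID used in the example DAG passes the check; an ID with uppercase
# letters or underscores does not.
print(is_valid_batch_id("batch-create-phs"))
print(is_valid_batch_id("Batch_Create"))
```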

DataprocListBatchesOperator

DataprocListBatchesOperator lists batches that exist within a given project_id and region.

list_batches = DataprocListBatchesOperator(
    task_id="list-all-batches",
)
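
When you work with listed batches, you often need only the batch ID, for example to pass it to DataprocGetBatchOperator or DataprocDeleteBatchOperator. The following sketch assumes that each batch is identified by a resource name of the form projects/PROJECT/locations/REGION/batches/BATCH_ID, as in the Dataproc API. The helper name `batch_id_from_name` and the sample values are hypothetical.

```python
def batch_id_from_name(batch_name: str) -> str:
    """Extract the batch ID from a batch resource name.

    Assumed layout: projects/<project>/locations/<region>/batches/<batch_id>
    """
    parts = batch_name.split("/")
    layout_ok = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "batches"
        and all(parts)  # no empty path segments
    )
    if not layout_ok:
        raise ValueError(f"Unexpected batch resource name: {batch_name!r}")
    return parts[5]


# Placeholder resource name.
print(batch_id_from_name("projects/my-project/locations/us-central1/batches/batch-create-phs"))
```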

DataprocGetBatchOperator

DataprocGetBatchOperator gets one particular batch workload.

get_batch = DataprocGetBatchOperator(
    task_id="get_batch",
    batch_id="batch-create-phs",
)