The Dataproc Spark Connect client is a wrapper of the Apache Spark Connect client. It lets applications communicate with a remote Serverless for Apache Spark session using the Spark Connect protocol. This document shows you how to install, configure, and use the client.
Before you begin
Ensure you have the Identity and Access Management roles that contain the permissions needed to manage interactive sessions and session templates.
If you run the client outside of Google Cloud, provide authentication credentials. Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your service account key file.
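If you prefer to set the variable from a notebook or script instead of your shell, a minimal sketch in Python is the following; the key file path is a placeholder.
import os
# Placeholder path; point this at your service account key file.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account-key.json'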
Install or uninstall the client
You can install or uninstall the dataproc-spark-connect package using pip.
Install
To install the latest version of the client, run the following command:
pip install -U dataproc-spark-connect
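As an optional check, the import used throughout this document should succeed after installation:
# Optional check: this import succeeds if the package installed correctly.
from google.cloud.dataproc_spark_connect import DataprocSparkSession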
Uninstall
To uninstall the client, run the following command:
pip uninstall dataproc-spark-connect
Configure the client
Specify the project and region for your session. You can set these values using environment variables or by using the builder API in your code.
Environment variables
Set the GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_REGION environment
variables.
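For example, you can set these values from Python before creating a session; the project ID and region shown here are placeholders.
import os
# Placeholder values; replace with your project ID and region.
os.environ['GOOGLE_CLOUD_PROJECT'] = 'my-project'
os.environ['GOOGLE_CLOUD_REGION'] = 'us-central1'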
Builder API
Use the .projectId() and .location() methods.
spark = DataprocSparkSession.builder.projectId("my-project").location("us-central1").getOrCreate()
Start a Spark session
To start a Spark session, add the required imports to your PySpark application
or notebook, then call the DataprocSparkSession.builder.getOrCreate() API.
1. Import the DataprocSparkSession class.
2. Call the getOrCreate() method to start the session.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
spark = DataprocSparkSession.builder.getOrCreate()
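After the session starts, you can run Spark operations against it. For example, the following illustrative line creates and displays a small DataFrame:
# Illustrative check: create and display a small DataFrame on the remote session.
spark.range(5).show()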
Configure Spark properties
To configure Spark properties, chain one or more .config() methods to the
builder.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
spark = DataprocSparkSession.builder.config('spark.executor.memory', '48g').config('spark.executor.cores', '8').getOrCreate()
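You can read a property back from the active session to confirm the value, for example with the standard spark.conf.get API:
# Confirm the configured value on the active session.
print(spark.conf.get('spark.executor.memory'))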
Use advanced configuration
For advanced configuration, use the Session class to customize settings such
as the subnetwork or runtime version.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session_config = Session()
session_config.environment_config.execution_config.subnetwork_uri = 'SUBNET'
session_config.runtime_config.version = '3.0'
spark = DataprocSparkSession.builder.projectId('my-project').location('us-central1').dataprocSessionConfig(session_config).getOrCreate()
Reuse a named session
Named sessions let you share a single Spark session across multiple notebooks and avoid the startup delay of creating a new session each time.
In your first notebook, create a session with a custom ID.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
session_id = 'my-ml-pipeline-session'
spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
df = spark.createDataFrame([(1, 'data')], ['id', 'value'])
df.show()
In another notebook, reuse the session by specifying the same session ID.
from google.cloud.dataproc_spark_connect import DataprocSparkSession
session_id = 'my-ml-pipeline-session'
spark = DataprocSparkSession.builder.dataprocSessionId(session_id).getOrCreate()
df = spark.createDataFrame([(2, 'more-data')], ['id', 'value'])
df.show()
Session IDs must be 4-63 characters long, start with a lowercase letter, and contain only lowercase letters, numbers, and hyphens. The ID cannot end with a hyphen. You cannot reuse a session that is in the TERMINATED state.
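If you generate session IDs programmatically, a small validation sketch like the following (the helper function is illustrative, not part of the client) can catch invalid IDs before you create a session:
import re

def is_valid_session_id(session_id: str) -> bool:
    # 4-63 characters, starting with a lowercase letter, containing only
    # lowercase letters, numbers, and hyphens, and not ending with a hyphen.
    return re.fullmatch(r'[a-z][a-z0-9-]{2,61}[a-z0-9]', session_id) is not None

print(is_valid_session_id('my-ml-pipeline-session'))  # True
print(is_valid_session_id('Bad-ID'))                  # False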
Use Spark SQL magic commands
The package supports the
sparksql-magic library to execute
Spark SQL queries in Jupyter notebooks. Magic commands are an optional feature.
1. Install the required dependencies.
pip install IPython sparksql-magic
2. Load the magic extension.
%load_ext sparksql_magic
3. Optional: Configure default settings.
%config SparkSql.limit=20
4. Execute SQL queries.
%%sparksql
SELECT * FROM your_table
To use advanced options, add flags to the %%sparksql command. For example, to
cache results and create a view, run the following command:
%%sparksql --cache --view result_view df
SELECT * FROM your_table WHERE condition = true
The following options are available:
--cache or -c: caches the DataFrame.
--eager or -e: caches with eager loading.
--view VIEW or -v VIEW: creates a temporary view.
--limit N or -l N: overrides the default row display limit.
variable_name: stores the result in a variable.
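For example, the following illustrative cell limits the display to five rows and stores the result in a variable named top_rows (your_table is a placeholder), which you can then use as a DataFrame in later cells:
%%sparksql --limit 5 top_rows
SELECT id, value FROM your_table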
What's next
Learn more about Dataproc sessions.