Create a Managed Service for Apache Spark cluster
Requirements:
Name: The cluster name must start with a lowercase letter followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
Cluster region: You must specify a Compute Engine region for the cluster, such as us-east1 or europe-west1, to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within the region.
- See Cluster region for more information on Compute Engine regions.
- See Available regions & zones for information on selecting a region. You can also run the gcloud compute regions list command to display a listing of available regions.
Connectivity: Compute Engine Virtual Machine instances (VMs) in a Managed Service for Apache Spark cluster, consisting of master and worker VMs, require full internal IP networking cross connectivity. The default VPC network provides this connectivity (see Managed Service for Apache Spark Cluster Network Configuration).
Machine type (recommended): While specifying a machine type is optional, Google recommends that you explicitly select a machine type for the master and worker VMs in your cluster. If you don't specify a machine type, Managed Service for Apache Spark dynamically selects machine types based on resource availability. This dynamic selection can result in variations in both cost and performance.
- For more information on choosing a machine type, see Supported machine types.
- To mitigate potential resource unavailability issues, we recommend using Flexible VMs, which let you specify a list of acceptable machine types.
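The naming rule in the requirements above can be checked locally before a cluster is created. The following is a minimal sketch: the regular expression is derived only from the rule as stated here (a lowercase letter, then up to 51 lowercase letters, numbers, or hyphens, not ending in a hyphen), and the function name is_valid_cluster_name is a hypothetical helper, not part of any Google Cloud library. The service may apply additional checks that this sketch does not cover.

```python
import re

# Pattern derived from the stated rule: one lowercase letter, then up to 51
# more characters (lowercase letters, digits, hyphens), where the last
# character is not a hyphen. This is an illustration, not the service's
# own validation logic.
_CLUSTER_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described above."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("example-cluster", "Example", "cluster-", "1cluster"):
        print(candidate, is_valid_cluster_name(candidate))
```

Running a check like this in a deployment script catches an invalid name before any API call is made.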
Console
Open the Managed Service for Apache Spark Create a cluster page in the Google Cloud console, then click Create in the cluster on Compute Engine row of the Create a Dataproc cluster on Compute Engine page. The Set up cluster panel is selected, with its fields filled in with default values. Select each panel to confirm or change the default values and customize your cluster.
Click Create to create the cluster. The cluster name appears in the Clusters page, and its status is updated to Running after the cluster is provisioned. Click the cluster name to open the cluster details page where you can examine jobs, instances, and configuration settings for your cluster and connect to web interfaces running on your cluster.
gcloud
To create a Managed Service for Apache Spark cluster on the command line, run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --master-machine-type=MASTER_MACHINE_TYPE \
    --worker-machine-type=WORKER_MACHINE_TYPE
The command creates a cluster. While master and worker machine types are optional, we recommend specifying them explicitly with the --master-machine-type and --worker-machine-type flags (for example, n4-standard-4) to ensure consistent cost and performance. If you don't specify machine types, default machine types are selected dynamically based on resource availability. See the gcloud dataproc clusters create command for information on using command line flags to customize cluster settings.
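If you launch the command from a script, it can help to assemble the arguments programmatically so that the optional machine type flags are added only when they are set. The following is a minimal sketch that uses only the flags shown above. The function name build_create_command and the example values are hypothetical.

```python
import shlex
from typing import List, Optional


def build_create_command(
    cluster_name: str,
    region: str,
    master_machine_type: Optional[str] = None,
    worker_machine_type: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `gcloud dataproc clusters create`."""
    args = [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
    ]
    # The machine type flags are optional; omit them to accept the
    # dynamically selected defaults described above.
    if master_machine_type:
        args.append(f"--master-machine-type={master_machine_type}")
    if worker_machine_type:
        args.append(f"--worker-machine-type={worker_machine_type}")
    return args


if __name__ == "__main__":
    cmd = build_create_command(
        "example-cluster", "us-east1", "n4-standard-4", "n4-standard-4"
    )
    # Print the command for review. To run it, pass `cmd` to
    # subprocess.run(cmd, check=True) on a machine with gcloud installed.
    print(shlex.join(cmd))
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems if a value contains unexpected characters.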
Create a cluster with a YAML file
- Run the following gcloud command to export the configuration of an existing Managed Service for Apache Spark cluster into a cluster.yaml file.
  gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
      --region=REGION \
      --destination=cluster.yaml
- Create a new cluster by importing the YAML file configuration.
  gcloud dataproc clusters import NEW_CLUSTER_NAME \
      --region=REGION \
      --source=cluster.yaml
Note: During the export operation, cluster-specific fields (such as the cluster name), output-only fields, and automatically applied labels are filtered out. These fields are disallowed in the imported YAML file used to create a cluster.
REST
This section shows how to create a cluster. While specifying machine types is optional, it is recommended to explicitly include machine_type_uri in your master_config and worker_config (for example, n4-standard-4) to ensure consistent cost and performance. If you don't specify machine types, default machine types are selected dynamically based on resource availability.
Before using any of the request data, make the following replacements:
- CLUSTER_NAME: cluster name
- PROJECT: Google Cloud project ID
- REGION: An available Compute Engine region where the cluster will be created.
- ZONE: An optional zone within the selected region where the cluster will be created.
- MASTER_MACHINE_TYPE: (Recommended) The machine type for the master node (for example, n4-standard-4).
- WORKER_MACHINE_TYPE: (Recommended) The machine type for worker nodes (for example, n4-standard-4).
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters
Request JSON body:
{
  "project_id": "PROJECT",
  "cluster_name": "CLUSTER_NAME",
  "config": {
    "master_config": {
      "num_instances": 1,
      "machine_type_uri": "MASTER_MACHINE_TYPE",
      "image_uri": ""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config": {
      "num_instances": 2,
      "machine_type_uri": "WORKER_MACHINE_TYPE",
      "image_uri": ""
    },
    "gce_cluster_config": {
      "zone_uri": "ZONE"
    }
  }
}
You should receive a JSON response similar to the following:
{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
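The REST request and response shown in this section can also be assembled and inspected in a script. The following is a minimal sketch under a few assumptions: the helper names build_create_request and operation_state are hypothetical, the empty image_uri and softwareConfig fields from the request body are left out on the assumption that omitted fields take their defaults, and the zone is added only when one is supplied. The sketch does not send the request. Sending it requires an authenticated HTTP client that supplies an access token.

```python
import json
from typing import Any, Dict, Optional, Tuple


def build_create_request(
    project: str,
    region: str,
    cluster_name: str,
    master_machine_type: str,
    worker_machine_type: str,
    zone: Optional[str] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return the POST URL and JSON body for a cluster create request."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project}/regions/{region}/clusters"
    )
    config: Dict[str, Any] = {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": master_machine_type,
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": worker_machine_type,
        },
    }
    # The zone is optional, so include it only when the caller provides one.
    if zone:
        config["gce_cluster_config"] = {"zone_uri": zone}
    body = {
        "project_id": project,
        "cluster_name": cluster_name,
        "config": config,
    }
    return url, body


def operation_state(response_text: str) -> str:
    """Read the operation state (for example, PENDING) from a response."""
    response = json.loads(response_text)
    return response["metadata"]["status"]["state"]


if __name__ == "__main__":
    url, body = build_create_request(
        "example-project", "us-east1", "example-cluster",
        "n4-standard-4", "n4-standard-4",
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

The response describes a long-running operation, so a state of PENDING means that the create request was accepted, not that the cluster is ready.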
Go
- Install the client library.
- Set up application default credentials.
- Run the code.
Note: While specifying machine types is optional, it is recommended to explicitly set the master and worker machine types in your cluster configuration (for example, to n4-standard-4) to ensure consistent cost and performance. If omitted, default machine types are selected dynamically based on resource availability.
Java
- Install the client library.
- Set up application default credentials.
- Run the code.
Note: While specifying machine types is optional, it is recommended to explicitly set the master and worker machine types in your cluster configuration (for example, to n4-standard-4) to ensure consistent cost and performance. If omitted, default machine types are selected dynamically based on resource availability.
Node.js
- Install the client library.
- Set up application default credentials.
- Run the code.
Note: While specifying machine types is optional, it is recommended to explicitly set the master and worker machine types in your cluster configuration (for example, to n4-standard-4) to ensure consistent cost and performance. If omitted, default machine types are selected dynamically based on resource availability.
Python
- Install the client library.
- Set up application default credentials.
- Run the code.
Note: While specifying machine types is optional, it is recommended to explicitly set the master and worker machine types in your cluster configuration (for example, to n4-standard-4) to ensure consistent cost and performance. If omitted, default machine types are selected dynamically based on resource availability.
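As an illustration of the Python steps above, the following is a minimal sketch of creating a cluster with explicit machine types. It assumes that the google-cloud-dataproc client library is installed and that application default credentials are configured. The helper names build_cluster and create_cluster and the example project, region, and cluster values are hypothetical. The library is imported inside create_cluster so that the configuration helper can be used on its own.

```python
from typing import Any, Dict


def build_cluster(
    project_id: str,
    cluster_name: str,
    master_machine_type: str = "n4-standard-4",
    worker_machine_type: str = "n4-standard-4",
) -> Dict[str, Any]:
    """Build a cluster configuration with explicit machine types."""
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": master_machine_type,
            },
            "worker_config": {
                "num_instances": 2,
                "machine_type_uri": worker_machine_type,
            },
        },
    }


def create_cluster(project_id: str, region: str, cluster_name: str) -> str:
    """Create the cluster and return its name once provisioning finishes."""
    # Assumption: the google-cloud-dataproc package is installed.
    from google.cloud import dataproc_v1

    # The client is pointed at the regional endpoint for the chosen region.
    client = dataproc_v1.ClusterControllerClient(
        client_options={
            "api_endpoint": f"{region}-dataproc.googleapis.com:443"
        }
    )
    operation = client.create_cluster(
        request={
            "project_id": project_id,
            "region": region,
            "cluster": build_cluster(project_id, cluster_name),
        }
    )
    # create_cluster returns a long-running operation; result() blocks
    # until the cluster is provisioned or the operation fails.
    result = operation.result()
    return result.cluster_name


if __name__ == "__main__":
    print(create_cluster("example-project", "us-east1", "example-cluster"))
```

Setting machine_type_uri on both master_config and worker_config follows the recommendation in the note above. Leaving those keys out would fall back to the dynamically selected defaults.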