"Managed Service for Apache Spark" is the new name for the product formerly known as "Dataproc on Compute Engine" (cluster deployment) and "Google Cloud Serverless for Apache Spark" (serverless deployment).

Create a cluster

Managed Service for Apache Spark prevents the creation of clusters with image versions earlier than 1.3.95, 1.4.77, 1.5.53, and 2.0.27, which were affected by Apache Log4j security vulnerabilities. It also prevents cluster creation for Managed Service for Apache Spark image versions 0.x, 1.0.x, 1.1.x, and 1.2.x. Where possible, create clusters with the latest sub-minor image versions.

Image version	log4j version	Customer guidance
2.0.29, 1.5.55, and 1.4.79, or later of each	log4j.2.17.1	Advised
2.0.28, 1.5.54, and 1.4.78	log4j.2.17.0	Advised
2.0.27, 1.5.53, and 1.4.77	log4j.2.16.0	Strongly recommended
2.0.26, 1.5.52, and 1.4.76, or earlier of each	Older version	Discontinue use

For specific image and log4j update information, see the Managed Service for Apache Spark release notes.

Create a cluster

Requirements:

Name: The cluster name must start with a lowercase letter, followed by up to 51 lowercase letters, numbers, and hyphens, and cannot end with a hyphen.
Cluster region: You must specify a Compute Engine region for the cluster, such as us-east1 or europe-west1, to isolate cluster resources, such as VM instances and cluster metadata stored in Cloud Storage, within the region.
- For more information about Compute Engine regions, see Cluster region.
- For information on selecting a region, see Available regions and zones. You can also run the gcloud compute regions list command to display a list of available regions.
Connectivity: Compute Engine virtual machine (VM) instances in a Managed Service for Apache Spark cluster, which consists of master and worker VMs, require full internal IP networking cross connectivity. The default VPC network provides this connectivity (see Managed Service for Apache Spark cluster network configuration).
Machine type (recommended): Setting a machine type is optional, but Google recommends explicitly selecting a machine type for the cluster's master and worker VMs. If you don't specify a machine type, Managed Service for Apache Spark dynamically selects machine types based on resource availability. This dynamic selection can lead to variations in cost and performance.
- For more information about selecting a machine type, see Supported machine types.
- To reduce the chance of resource availability issues, consider using flexible VMs, which let you specify a list of acceptable machine types.
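
The naming rule in the requirements can be checked locally before a create request is sent. The following is a minimal Python sketch, assuming the rule is exactly as stated above (a lowercase letter, then up to 51 lowercase letters, digits, or hyphens, with no trailing hyphen). The function name is_valid_cluster_name is hypothetical, and the service remains the authority on which names it accepts.

```python
import re

# Pattern derived from the stated rule: one lowercase letter, then up to 51
# more characters drawn from lowercase letters, digits, and hyphens, where the
# final character is not a hyphen (52 characters at most in total).
_CLUSTER_NAME_RE = re.compile(r"[a-z](?:[-a-z0-9]{0,50}[a-z0-9])?")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if name satisfies the naming rule described above."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    # Illustrative candidates only; none of these names come from the source.
    for candidate in ("example-cluster", "Example", "cluster-", "9lives"):
        print(candidate, is_valid_cluster_name(candidate))
```

A local check like this only catches formatting mistakes early; it does not check whether the name is already in use.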

Console

Open the Create cluster page in the Google Cloud console to view the default cluster settings. You can accept or change the displayed defaults, then click Additional configuration to customize the cluster.

Click Create cluster to create the cluster. The cluster name appears on the Clusters page, and its status changes to Running after the cluster is provisioned. Click the cluster name to open the cluster details page, where you can examine the cluster's jobs, instances, and configuration settings, and connect to web interfaces running on the cluster.

gcloud

To create a Managed Service for Apache Spark cluster on the command line, run the gcloud dataproc clusters create command locally in a terminal window or in Cloud Shell.

```
gcloud dataproc clusters create CLUSTER_NAME \
    --region=REGION \
    --master-machine-type=MASTER_MACHINE_TYPE \
    --worker-machine-type=WORKER_MACHINE_TYPE
```

The command creates a cluster. The master and worker machine types are optional, but we recommend specifying them explicitly with the --master-machine-type and --worker-machine-type flags (for example, n4-standard-4) to ensure consistent cost and performance. If you don't specify machine types, default machine types are selected dynamically based on resource availability. For information on using command-line flags to customize cluster settings, see the gcloud dataproc clusters create command reference.
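
When the command is assembled by a script, the optional machine-type flags can be added only when a value is supplied. The following Python sketch illustrates this under stated assumptions: it uses only the flags shown above, the helper name build_create_command is hypothetical, and the sketch prints the command rather than running it.

```python
import shlex


def build_create_command(cluster_name, region,
                         master_machine_type=None, worker_machine_type=None):
    """Assemble the argument list for `gcloud dataproc clusters create`.

    The machine-type flags are optional, so each one is appended only when
    a value is provided.
    """
    args = ["gcloud", "dataproc", "clusters", "create", cluster_name,
            f"--region={region}"]
    if master_machine_type:
        args.append(f"--master-machine-type={master_machine_type}")
    if worker_machine_type:
        args.append(f"--worker-machine-type={worker_machine_type}")
    return args


if __name__ == "__main__":
    # Example values; the cluster name is a placeholder.
    cmd = build_create_command("example-cluster", "us-east1",
                               "n4-standard-4", "n4-standard-4")
    # Print a shell-safe rendering for review. To execute it, pass the list
    # to subprocess.run(cmd, check=True) on a machine with the gcloud CLI.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems if a value ever contains unusual characters.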

Create a cluster using a YAML file

Run the following gcloud command to export the configuration of an existing Managed Service for Apache Spark cluster to a cluster.yaml file.
```
gcloud dataproc clusters export EXISTING_CLUSTER_NAME \
    --region=REGION \
    --destination=cluster.yaml
```

Create a new cluster by importing the YAML configuration file.

```
gcloud dataproc clusters import NEW_CLUSTER_NAME \
    --region=REGION \
    --source=cluster.yaml
```

**Note:** During the export operation, cluster-specific fields, such as the cluster name, output-only fields, and automatically applied labels, are filtered out. These fields are not allowed in the imported YAML file used to create a cluster.
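
If a YAML file is edited by hand rather than produced by the export command, a quick check for disallowed fields can catch mistakes before the import. The following Python sketch operates on an already-parsed mapping. The source does not enumerate the filtered fields, so the key names and the label prefix below are assumptions for illustration only, and find_disallowed_fields is a hypothetical helper.

```python
# ASSUMPTION: illustrative key names for cluster-specific and output-only
# fields. The source does not list them; adjust to match the actual resource.
ASSUMED_CLUSTER_SPECIFIC_KEYS = {
    "clusterName", "clusterUuid", "status", "statusHistory", "metrics",
}
# ASSUMPTION: prefix used here to recognize automatically applied labels.
ASSUMED_AUTO_LABEL_PREFIX = "goog-dataproc-"


def find_disallowed_fields(cluster: dict) -> list:
    """Return the top-level keys and labels that the assumed rules reject."""
    found = sorted(key for key in cluster if key in ASSUMED_CLUSTER_SPECIFIC_KEYS)
    for label in sorted(cluster.get("labels", {})):
        if label.startswith(ASSUMED_AUTO_LABEL_PREFIX):
            found.append(f"labels.{label}")
    return found


if __name__ == "__main__":
    # Hypothetical hand-edited configuration, already parsed from YAML.
    parsed = {
        "clusterName": "old-cluster",
        "config": {"masterConfig": {"numInstances": 1}},
        "labels": {"team": "data", "goog-dataproc-cluster-name": "old-cluster"},
    }
    print(find_disallowed_fields(parsed))
```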

REST

This section shows how to create a cluster. Setting machine types is optional, but we recommend explicitly including machine_type_uri in master_config and worker_config (for example, n4-standard-4) to ensure consistent cost and performance. If you don't specify machine types, default machine types are selected dynamically based on resource availability.

Before using any of the request data, make the following replacements:

CLUSTER_NAME: The cluster name.
PROJECT: The Google Cloud project ID.
REGION: An available Compute Engine region where the cluster will be created.
ZONE: An optional zone within the selected region where the cluster will be created.
MASTER_MACHINE_TYPE: (Recommended) The machine type of the master node (for example, n4-standard-4).
WORKER_MACHINE_TYPE: (Recommended) The machine type of the worker nodes (for example, n4-standard-4).

HTTP method and URL:

POST https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters

Request JSON body:

```
{
  "project_id":"PROJECT",
  "cluster_name":"CLUSTER_NAME",
  "config":{
    "master_config":{
      "num_instances":1,
      "machine_type_uri":"MASTER_MACHINE_TYPE",
      "image_uri":""
    },
    "softwareConfig": {
      "imageVersion": "",
      "properties": {},
      "optionalComponents": []
    },
    "worker_config":{
      "num_instances":2,
      "machine_type_uri":"WORKER_MACHINE_TYPE",
      "image_uri":""
    },
    "gce_cluster_config":{
      "zone_uri":"ZONE"
    }
  }
}
```
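
The placeholder substitution can also be scripted. The following Python sketch builds the URL and request body from the values described above and prints the body in a form that could be saved as request.json. It is an illustration under stated assumptions: build_create_cluster_request is a hypothetical helper, the sketch leaves out the fields that are empty in the template, and it includes the zone only when one is given.

```python
import json
from typing import Optional


def build_create_cluster_request(project: str, region: str, cluster_name: str,
                                 master_machine_type: str,
                                 worker_machine_type: str,
                                 zone: Optional[str] = None):
    """Return (url, body) for the clusters create request shown above."""
    url = (f"https://dataproc.googleapis.com/v1/projects/{project}"
           f"/regions/{region}/clusters")
    config = {
        "master_config": {"num_instances": 1,
                          "machine_type_uri": master_machine_type},
        "worker_config": {"num_instances": 2,
                          "machine_type_uri": worker_machine_type},
    }
    if zone:
        # The zone is optional, so it is added only when supplied.
        config["gce_cluster_config"] = {"zone_uri": zone}
    body = {"project_id": project, "cluster_name": cluster_name,
            "config": config}
    return url, body


if __name__ == "__main__":
    # Placeholder project and cluster names for illustration.
    url, body = build_create_cluster_request(
        "example-project", "us-east1", "example-cluster",
        "n4-standard-4", "n4-standard-4")
    print("POST", url)
    print(json.dumps(body, indent=2))
```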

To send your request, choose one of these options:

curl (Linux, macOS, or Cloud Shell)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login, or that you are using Cloud Shell, which automatically logs you in to the gcloud CLI. To check the currently active account, run gcloud auth list.

Save the request body in a file named request.json, and run the following command:

```
curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @request.json \
     "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters"
```

PowerShell (Windows)

Note: The following command assumes that you have logged in to the gcloud CLI with your user account by running gcloud init or gcloud auth login. To check the currently active account, run gcloud auth list.

Save the request body in a file named request.json, and run the following command:

```
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
    -Method POST `
    -Headers $headers `
    -ContentType: "application/json; charset=utf-8" `
    -InFile request.json `
    -Uri "https://dataproc.googleapis.com/v1/projects/PROJECT/regions/REGION/clusters" | Select-Object -Expand Content
```

You should receive a JSON response similar to the following:

```
{
  "name": "projects/PROJECT/regions/REGION/operations/b5706e31......",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.dataproc.v1.ClusterOperationMetadata",
    "clusterName": "CLUSTER_NAME",
    "clusterUuid": "5fe882b2-...",
    "status": {
      "state": "PENDING",
      "innerState": "PENDING",
      "stateStartTime": "2019-11-21T00:37:56.220Z"
    },
    "operationType": "CREATE",
    "description": "Create cluster with 2 workers",
    "warnings": [
      "For PD-Standard without local SSDs, we strongly recommend provisioning 1TB ..."
    ]
  }
}
```
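
A script that sends the request usually needs only a few fields from this response. The following Python sketch pulls out the operation name, cluster name, state, and warnings. It assumes the response has the shape shown above; summarize_operation is a hypothetical helper, and the sample text uses placeholder values (OPERATION_ID, an example warning) rather than real output.

```python
import json

# Trimmed sample with the same structure as the response shown above.
# OPERATION_ID and the warning text are placeholders, not real values.
SAMPLE_RESPONSE = """
{
  "name": "projects/PROJECT/regions/REGION/operations/OPERATION_ID",
  "metadata": {
    "clusterName": "CLUSTER_NAME",
    "status": {"state": "PENDING", "innerState": "PENDING"},
    "operationType": "CREATE",
    "warnings": ["example warning"]
  }
}
"""


def summarize_operation(response_text: str) -> dict:
    """Extract the fields a caller typically checks after a create request."""
    operation = json.loads(response_text)
    metadata = operation.get("metadata", {})
    return {
        "operation": operation.get("name"),
        "cluster": metadata.get("clusterName"),
        # A missing status yields None rather than raising an error.
        "state": metadata.get("status", {}).get("state"),
        "warnings": metadata.get("warnings", []),
    }


if __name__ == "__main__":
    print(summarize_operation(SAMPLE_RESPONSE))
```

The operation name is the value to keep: the create call is asynchronous, and the initial state in the sample is PENDING.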

Go

Install the client library.
Set up Application Default Credentials.

Run the code.

Note: Setting machine types is optional, but we recommend explicitly setting the master and worker machine types in the cluster config (for example, to n4-standard-4) to ensure consistent cost and performance. If you don't specify a machine type, default machine types are selected dynamically based on resource availability.

```
import (
	"context"
	"fmt"
	"io"

	dataproc "cloud.google.com/go/dataproc/apiv1"
	"cloud.google.com/go/dataproc/apiv1/dataprocpb"
	"google.golang.org/api/option"
)

func createCluster(w io.Writer, projectID, region, clusterName string) error {
	// projectID := "your-project-id"
	// region := "us-central1"
	// clusterName := "your-cluster"
	ctx := context.Background()

	// Create the cluster client.
	endpoint := region + "-dataproc.googleapis.com:443"
	clusterClient, err := dataproc.NewClusterControllerClient(ctx, option.WithEndpoint(endpoint))
	if err != nil {
		return fmt.Errorf("dataproc.NewClusterControllerClient: %w", err)
	}
	defer clusterClient.Close()

	// Create the cluster config.
	req := &dataprocpb.CreateClusterRequest{
		ProjectId: projectID,
		Region:    region,
		Cluster: &dataprocpb.Cluster{
			ProjectId:   projectID,
			ClusterName: clusterName,
			Config: &dataprocpb.ClusterConfig{
				MasterConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   1,
					MachineTypeUri: "n1-standard-2",
				},
				WorkerConfig: &dataprocpb.InstanceGroupConfig{
					NumInstances:   2,
					MachineTypeUri: "n1-standard-2",
				},
			},
		},
	}

	// Create the cluster.
	op, err := clusterClient.CreateCluster(ctx, req)
	if err != nil {
		return fmt.Errorf("CreateCluster: %w", err)
	}

	resp, err := op.Wait(ctx)
	if err != nil {
		return fmt.Errorf("CreateCluster.Wait: %w", err)
	}

	// Output a success message.
	fmt.Fprintf(w, "Cluster created successfully: %s", resp.ClusterName)
	return nil
}
```

Java

Install the client library.
Set up Application Default Credentials.

Run the code.

```
import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.dataproc.v1.Cluster;
import com.google.cloud.dataproc.v1.ClusterConfig;
import com.google.cloud.dataproc.v1.ClusterControllerClient;
import com.google.cloud.dataproc.v1.ClusterControllerSettings;
import com.google.cloud.dataproc.v1.ClusterOperationMetadata;
import com.google.cloud.dataproc.v1.InstanceGroupConfig;
import java.io.IOException;
import java.util.concurrent.ExecutionException;

public class CreateCluster {

  public static void createCluster() throws IOException, InterruptedException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-project-id";
    String region = "your-project-region";
    String clusterName = "your-cluster-name";
    createCluster(projectId, region, clusterName);
  }

  public static void createCluster(String projectId, String region, String clusterName)
      throws IOException, InterruptedException {
    String myEndpoint = String.format("%s-dataproc.googleapis.com:443", region);

    // Configure the settings for the cluster controller client.
    ClusterControllerSettings clusterControllerSettings =
        ClusterControllerSettings.newBuilder().setEndpoint(myEndpoint).build();

    // Create a cluster controller client with the configured settings. The client only needs to be
    // created once and can be reused for multiple requests. Using a try-with-resources
    // closes the client, but this can also be done manually with the .close() method.
    try (ClusterControllerClient clusterControllerClient =
        ClusterControllerClient.create(clusterControllerSettings)) {
      // Configure the settings for our cluster.
      InstanceGroupConfig masterConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(1)
              .build();
      InstanceGroupConfig workerConfig =
          InstanceGroupConfig.newBuilder()
              .setMachineTypeUri("n1-standard-2")
              .setNumInstances(2)
              .build();
      ClusterConfig clusterConfig =
          ClusterConfig.newBuilder()
              .setMasterConfig(masterConfig)
              .setWorkerConfig(workerConfig)
              .build();
      // Create the cluster object with the desired cluster config.
      Cluster cluster =
          Cluster.newBuilder().setClusterName(clusterName).setConfig(clusterConfig).build();

      // Create the Cloud Dataproc cluster.
      OperationFuture<Cluster, ClusterOperationMetadata> createClusterAsyncRequest =
          clusterControllerClient.createClusterAsync(projectId, region, cluster);
      Cluster response = createClusterAsyncRequest.get();

      // Print out a success message.
      System.out.printf("Cluster created successfully: %s", response.getClusterName());

    } catch (ExecutionException e) {
      System.err.println(String.format("Error executing createCluster: %s ", e.getMessage()));
    }
  }
}
```

Node.js

Install the client library.
Set up Application Default Credentials.

Run the code.

```
const dataproc = require('@google-cloud/dataproc');

// TODO(developer): Uncomment and set the following variables
// const projectId = 'YOUR_PROJECT_ID';
// const region = 'YOUR_CLUSTER_REGION';
// const clusterName = 'YOUR_CLUSTER_NAME';

// Create a client with the endpoint set to the desired cluster region
const client = new dataproc.v1.ClusterControllerClient({
  apiEndpoint: `${region}-dataproc.googleapis.com`,
  projectId: projectId,
});

async function createCluster() {
  // Create the cluster config
  const request = {
    projectId: projectId,
    region: region,
    cluster: {
      clusterName: clusterName,
      config: {
        masterConfig: {
          numInstances: 1,
          machineTypeUri: 'n1-standard-2',
        },
        workerConfig: {
          numInstances: 2,
          machineTypeUri: 'n1-standard-2',
        },
      },
    },
  };

  // Create the cluster
  const [operation] = await client.createCluster(request);
  const [response] = await operation.promise();

  // Output a success message
  console.log(`Cluster created successfully: ${response.clusterName}`);
}

// Run the sample and report any failure instead of leaving it unhandled
createCluster().catch(err => {
  console.error(err);
  process.exitCode = 1;
});
```

Python

Install the client library.
Set up Application Default Credentials.

Run the code.

```
from google.cloud import dataproc_v1 as dataproc


def create_cluster(project_id, region, cluster_name):
    """This sample walks a user through creating a Cloud Dataproc cluster
    using the Python client library.

    Args:
        project_id (string): Project to use for creating resources.
        region (string): Region where the resources should live.
        cluster_name (string): Name to use for creating a cluster.
    """

    # Create a client with the endpoint set to the desired cluster region.
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    # Output a success message.
    print(f"Cluster created successfully: {result.cluster_name}")
```
