Serve open source models using TPUs on GKE with Optimum TPU

This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.

This guide provides a starting point if you need the granular control, customization, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve your AI/ML workloads.

This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who want to use Kubernetes container orchestration capabilities to serve LLMs.

Google Cloud products such as GKE, Vertex AI, and Compute Engine support several model-serving libraries, including JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.

Optimum TPU supports the following features:

- Continuous batching
- Token streaming
- Greedy search and multinomial sampling using transformers

Objectives

- Prepare a GKE Standard cluster with the recommended TPU topology based on the model characteristics.
- Deploy Optimum TPU on GKE.
- Use Optimum TPU to serve the supported models through curl.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

If you're using an existing project for this guide, verify that you have the permissions required to complete this guide. If you created a new project, then you already have the required permissions.

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Create a Hugging Face account, if you don't already have one.
Make sure that your project has sufficient quota for Cloud TPU in GKE.

Required roles

To get the permissions that you need to set up clusters and workloads, ask your administrator to grant you the following IAM roles on your project:

- Service Account Admin (roles/iam.serviceAccountAdmin)
- To manage GKE clusters: Kubernetes Engine Admin (roles/container.admin)
- To build images and push them to Artifact Registry: Artifact Registry Administrator (roles/artifactregistry.admin)

For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Prepare your environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, activate Cloud Shell.

A Cloud Shell session starts at the bottom of the Google Cloud console and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. The session can take a few seconds to initialize.

Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
export HF_TOKEN=HF_TOKEN
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage bucket, and TPU nodes are located. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard cluster only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.
- HF_TOKEN: your Hugging Face token.
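
Before you continue, it can help to confirm that none of these variables is empty or still holds a placeholder. The following is a minimal sketch and not part of the official setup; the helper name `check_vars` is hypothetical, and the placeholder list is taken from the export commands in this section.

```shell
# Hypothetical helper: fail if any named variable is unset, empty, or still
# equal to one of the placeholder strings used in this tutorial.
check_vars() {
  rc=0
  for name in "$@"; do
    # Read the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    case "$value" in
      ""|PROJECT_ID|CLUSTER_NAME|REGION_NAME|ZONE|HF_TOKEN)
        echo "Variable $name is empty or still a placeholder" >&2
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Example use in Cloud Shell, after the export commands:
# check_vars PROJECT_ID CLUSTER_NAME REGION ZONE HF_TOKEN && echo "environment looks complete"
```

The check only looks at the variable values; it does not verify that the project, region, or token actually exists.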

Clone the Optimum TPU repository:

git clone https://github.com/huggingface/optimum-tpu.git

Get access to the model

You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.

Gemma 2B

To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement and then generate a Hugging Face access token.

Sign the license consent agreement

You must sign the consent agreement to use Gemma. Follow these instructions:

Go to the model consent page.
Verify consent using your Hugging Face account.
Accept the model terms.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

Click Your Profile > Settings > Access Tokens.
Click New Token.
Specify a name of your choice and a role of at least Read.
Click Generate a token.
Copy the generated token to your clipboard.

Llama3 8B

You must sign the consent agreement to use Llama3 8B in the Hugging Face repository.

Generate an access token

Generate a new Hugging Face token if you don't already have one:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a name of your choice and a role of at least Read.
Click Generate a token.
Copy the generated token to your clipboard.

Create a GKE cluster

Create a GKE Standard cluster with one CPU node:

gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --num-nodes=1 \
    --location=REGION_NAME

Create a TPU node pool

Create a TPU v5e node pool with one node and 8 chips:

gcloud container node-pools create tpunodepool \
    --location=REGION_NAME \
    --num-nodes=1 \
    --machine-type=ct5lp-hightpu-8t \
    --node-locations=ZONE \
    --cluster=CLUSTER_NAME

If TPU resources are available, GKE provisions the node pool. If TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT error message. To troubleshoot stockout errors, see Insufficient TPU resources to satisfy the TPU request.
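
The machine type in the node pool command, ct5lp-hightpu-8t, carries the per-node chip count in its suffix, and that count is what a workload later requests as a TPU resource limit. The following sketch extracts the number from a machine type name. It assumes the "-<N>t" naming pattern seen in this tutorial, and the helper name `tpu_chips_per_node` is hypothetical.

```shell
# Hypothetical helper: derive the per-node chip count from a TPU machine type
# whose name ends in "-<N>t", such as ct5lp-hightpu-8t.
tpu_chips_per_node() {
  suffix="${1##*-}"                 # text after the last "-", for example "8t"
  case "$suffix" in
    *t) count="${suffix%t}" ;;      # drop the trailing "t"
    *)  count="" ;;                 # not a "-<N>t" style name
  esac
  case "$count" in
    ""|*[!0-9]*)
      echo "Unrecognized TPU machine type: $1" >&2
      return 1
      ;;
  esac
  echo "$count"
}

# Example use:
# tpu_chips_per_node ct5lp-hightpu-8t
```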

Build the container

Run the make command to build the image:

cd optimum-tpu && make tpu-tgi

Push the image to Artifact Registry

gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
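
The tag and push commands, and later the Deployment manifests, all repeat the same image path. As a small sketch, the path can be built in one place from the region and project ID. The function name `image_uri` is hypothetical; the repository name (optimum-tpu) and image name (tgi-tpu) are the ones used in the commands above.

```shell
# Hypothetical helper: build the Artifact Registry image path used by the
# docker tag and docker push commands. The tag defaults to "latest".
image_uri() {
  region="$1"
  project="$2"
  tag="${3:-latest}"
  echo "${region}-docker.pkg.dev/${project}/optimum-tpu/tgi-tpu:${tag}"
}

# Example use with the variables exported earlier:
# docker image tag huggingface/optimum-tpu "$(image_uri "$REGION" "$PROJECT_ID")"
# docker push "$(image_uri "$REGION" "$PROJECT_ID")"
```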

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
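
In this command, `--dry-run=client -o yaml` renders the Secret manifest locally and `kubectl apply` then creates or updates it, so the command can be rerun safely. Kubernetes keeps Secret values base64-encoded in the `data` field, which is an encoding and not encryption. The sketch below shows that round trip with a made-up sample token; the variable names are illustrative only.

```shell
# Sample value for illustration only; never print a real token.
sample_token="hf_example_token"

# Encode the value the way it appears in the Secret's data field.
encoded="$(printf '%s' "$sample_token" | base64)"

# Decode it again to recover the original value.
decoded="$(printf '%s' "$encoded" | base64 -d)"

echo "encoded: $encoded"
echo "decoded: $decoded"

# To inspect the value stored in the cluster (requires cluster access):
# kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 -d
```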

Deploy Optimum TPU

To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Gemma 2B

Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

This manifest describes an Optimum TPU Deployment together with an internal Service that load-balances traffic on TCP port 8080 and forwards it to port 80 on the container.
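
The image field in the manifest still contains the REGION_NAME and PROJECT_ID placeholders, which must be replaced before the manifest is applied. One way to do that without editing the file is sketched below. The function name `fill_placeholders` is hypothetical, and the sketch assumes the replacement values contain no "|" or "&" characters.

```shell
# Hypothetical helper: replace the REGION_NAME and PROJECT_ID placeholders in
# manifest text read from stdin and write the result to stdout.
fill_placeholders() {
  sed -e "s|REGION_NAME|$1|g" -e "s|PROJECT_ID|$2|g"
}

# Example use with the variables exported earlier:
# fill_placeholders "$REGION" "$PROJECT_ID" < optimum-tpu-gemma-2b-2x4.yaml | kubectl apply -f -
```

The same helper works for the Llama3 8B manifest, because both manifests use the same image path.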

Apply the manifest

kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml

Llama3 8B

Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        args:
        - --model-id=meta-llama/Meta-Llama-3-8B
        - --max-concurrent-requests=4
        - --max-input-length=8191
        - --max-total-tokens=8192
        - --max-batch-prefill-tokens=32768
        - --max-batch-size=16
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
          initialDelaySeconds: 300
          periodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80

This manifest describes an Optimum TPU Deployment together with an internal Service that load-balances traffic on TCP port 8080 and forwards it to port 80 on the container.
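
Both manifests pass the same token limits to the server. If you change them, the sketch below checks two relationships that the values in this tutorial satisfy: the input limit stays below the total token limit, and the prefill budget is at least as large as the input limit. These rules are an assumption about what the server expects, not a documented contract, and the function name `check_token_limits` is hypothetical.

```shell
# Hypothetical helper: sanity-check the token limits passed as container args.
# Arguments: max-input-length, max-total-tokens, max-batch-prefill-tokens.
check_token_limits() {
  max_input="$1"
  max_total="$2"
  max_prefill="$3"
  if [ "$max_input" -ge "$max_total" ]; then
    echo "--max-input-length must be lower than --max-total-tokens" >&2
    return 1
  fi
  if [ "$max_prefill" -lt "$max_input" ]; then
    echo "--max-batch-prefill-tokens must be at least --max-input-length" >&2
    return 1
  fi
}

# Example use with the values from the manifests:
# check_token_limits 8191 8192 32768 && echo "limits are consistent"
```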

Apply the manifest

kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml

View the logs from the running Deployment:

kubectl logs -f -l app=tgi-tpu

The output should be similar to the following:

2024-07-09T22:39:34.365472Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

Make sure that the model is fully downloaded before you proceed to the next section.
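
Instead of watching the log stream by eye, you can test for the router's Connected line. The sketch below assumes that this line, as shown in the sample output, marks the point at which the model is loaded and the server accepts requests; the function name `router_connected` is hypothetical.

```shell
# Hypothetical helper: succeed if the log text on stdin contains the
# text_generation_router "Connected" line from the sample output.
router_connected() {
  grep -q 'text_generation_router.*Connected'
}

# Example use (requires cluster access). --tail=-1 requests the full log,
# because kubectl limits the output to recent lines when a label selector is used:
# kubectl logs -l app=tgi-tpu --tail=-1 | router_connected && echo "server is ready"
```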

Serve the model

Set up port forwarding to the model:

kubectl port-forward svc/service 8080:8080

Interact with the model server using curl

Verify your deployed models:

In a new terminal session, use curl to chat with the model:

curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'

The output should be similar to the following:

{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}
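
To try other prompts or token limits, the request body can be generated instead of typed by hand. The following sketch builds the same JSON shape as the curl command above. The function name `generate_payload` is hypothetical, and because it does no JSON escaping, the prompt is assumed to contain no double quotes, backslashes, or control characters.

```shell
# Hypothetical helper: build a /generate request body.
# Arguments: prompt text, maximum number of new tokens.
generate_payload() {
  printf '{"inputs":"%s","parameters":{"max_new_tokens":%d}}' "$1" "$2"
}

# Example use while the port forwarding session is running:
# curl 127.0.0.1:8080/generate -X POST \
#   -H 'Content-Type: application/json' \
#   -d "$(generate_payload 'What is Deep Learning?' 40)"
```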

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To delete the GKE cluster and its node pools that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
  --location=REGION_NAME

What's next

- Explore the Optimum TPU documentation.
- Learn how to run Gemma models on GKE and how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
- Learn more about TPUs in GKE.
