Serve an LLM using TPUs on GKE with JetStream and PyTorch

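This guide shows you how to serve a large language model (LLM) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with JetStream through PyTorch. In this guide, you download model weights to Cloud Storage and deploy them on a GKE Autopilot or Standard cluster using a container that runs JetStream.

If you need the scalability, resilience, and cost-effectiveness that Kubernetes features provide when you deploy your model on JetStream, this guide is a good starting point.

This guide is intended for generative AI customers who use PyTorch, new or existing users of GKE, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.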

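Background

By serving an LLM using TPUs on GKE with JetStream, you can build a robust, production-ready serving solution with all the benefits of managed Kubernetes, including cost efficiency, scalability, and higher availability. This section describes the key technologies used in this tutorial.

About TPUs

TPUs are application-specific integrated circuits (ASICs) that Google custom-developed to accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

Before you use TPUs in GKE, we recommend that you complete the following learning path:

Learn about current TPU version availability with the Cloud TPU system architecture.
Learn about TPUs in GKE.

This tutorial covers serving various LLMs. GKE deploys the model on single-host TPUv5e nodes, with TPU topologies configured based on the model requirements for serving prompts with low latency.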

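About JetStream

JetStream is an open source inference serving framework developed by Google. JetStream enables high-performance, high-throughput, and memory-optimized inference on TPUs and GPUs. It provides advanced performance optimizations, including continuous batching, KV cache optimization, and quantization techniques, to facilitate LLM deployment. JetStream supports PyTorch/XLA and JAX TPU serving to achieve optimal performance.

Continuous batching

Continuous batching is a technique that dynamically groups incoming inference requests into batches, which reduces latency and increases throughput.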
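The scheduling idea can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not JetStream's scheduler: the `Request` class, the `continuous_batching` function, and the one-token-per-step decode loop are all hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical inference request that needs `tokens_left` decode steps."""
    name: str
    tokens_left: int
    generated: list = field(default_factory=list)


def continuous_batching(pending, max_batch_size):
    """Run decode steps, refilling free batch slots before every step.

    Returns (names in completion order, number of decode steps executed).
    """
    queue = deque(pending)
    active = []
    finished = []
    steps = 0
    while queue or active:
        # Admit waiting requests as soon as a slot is free, rather than
        # waiting for the whole batch to drain.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        # One decode step produces one token for every active request.
        for req in active:
            req.generated.append(f"{req.name}-tok{len(req.generated)}")
            req.tokens_left -= 1
        steps += 1
        # Retire completed requests so their slots can be reused next step.
        finished.extend(r.name for r in active if r.tokens_left <= 0)
        active = [r for r in active if r.tokens_left > 0]
    return finished, steps


requests = [Request("a", 1), Request("b", 4), Request("c", 1), Request("d", 2)]
order, steps = continuous_batching(requests, max_batch_size=2)
print(order, steps)
```

In this toy run, the short requests `a` and `c` finish while the long request `b` is still decoding, and all four requests complete in 4 steps. A scheduler that waited for each batch of two to drain before admitting the next would need 6 steps for the same workload (4 for `a` and `b`, then 2 for `c` and `d`).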

KV cache quantization

KV cache quantization compresses the key-value cache used in attention mechanisms, which reduces memory requirements.
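As an illustration of the compression step, the following sketch stores each cached key and value vector as 8-bit integer codes plus one floating-point scale. The symmetric per-vector scheme and the `QuantizedKVCache` class are assumptions made for this example; the source does not specify which scheme JetStream uses.

```python
def quantize_int8(vector):
    """Symmetric quantization: floats -> (integer codes in [-127, 127], scale)."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


class QuantizedKVCache:
    """Hypothetical cache holding one quantized key and value vector per token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        # Quantize once, when the token's key and value are first computed.
        self.keys.append(quantize_int8(key))
        self.values.append(quantize_int8(value))

    def read(self, position):
        # Dequantize on read, when attention needs the cached vectors.
        k_codes, k_scale = self.keys[position]
        v_codes, v_scale = self.values[position]
        return dequantize_int8(k_codes, k_scale), dequantize_int8(v_codes, v_scale)


cache = QuantizedKVCache()
key = [0.5, -1.0, 0.25, 0.0]
value = [2.0, 1.0, -0.5, 0.1]
cache.append(key, value)
restored_key, restored_value = cache.read(0)
max_error = max(
    abs(a - b) for a, b in zip(key + value, restored_key + restored_value)
)
print(max_error)
```

The rounding error of each element is bounded by half of that vector's scale, which is the price paid for storing one byte per element instead of a wider floating-point value.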

Int8 weight quantization

Int8 weight quantization reduces the precision of model weights from 32-bit floating point to 8-bit integers, which speeds up computation and reduces memory usage.
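The memory saving can be made concrete with a short sketch. It assumes a per-row symmetric scheme with one float32 scale per output row, and the 4096 x 4096 layer size is an illustrative figure, not a value taken from any model in this tutorial.

```python
def quantize_rows(weights):
    """Quantize each row of a weight matrix to int8 codes with its own scale."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def int8_matvec(quantized, scales, x):
    """Accumulate over the integer weight codes, then rescale each output once."""
    return [
        scale * sum(q * xi for q, xi in zip(row, x))
        for row, scale in zip(quantized, scales)
    ]


def storage_bytes(rows, cols):
    """Return (float32 bytes, int8 bytes including one float32 scale per row)."""
    return rows * cols * 4, rows * cols * 1 + rows * 4


weights = [[0.12, -0.5, 0.33], [1.5, 0.75, -0.25]]
x = [1.0, 2.0, 3.0]
q, s = quantize_rows(weights)
approx = int8_matvec(q, s, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weights]

fp32_bytes, int8_bytes = storage_bytes(4096, 4096)
print(approx, exact)
print(fp32_bytes, int8_bytes)
```

For the illustrative 4096 x 4096 layer, the float32 matrix takes 67,108,864 bytes, while the int8 codes plus per-row scales take 16,793,600 bytes, roughly a quarter of the original.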

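To learn more about these optimizations, refer to the JetStream PyTorch and JetStream MaxText project repositories.

About PyTorch

PyTorch is an open source machine learning framework developed by Meta and now part of the Linux Foundation. PyTorch provides high-level features such as tensor computation and deep neural networks.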

Objectives

Prepare a GKE Autopilot or Standard cluster with the recommended TPU topology based on the model characteristics.
Deploy JetStream components on GKE.
Get and publish your model.
Serve and interact with the published model.

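Architecture

This section describes the GKE architecture used in this tutorial. The architecture includes a GKE Autopilot or Standard cluster that provisions TPUs and hosts JetStream components to deploy and serve the models.

The following diagram shows the components of this architecture:

Architecture of a GKE cluster with single-host TPU node pools containing the JetStream-PyTorch and JetStream HTTP components.

This architecture includes the following components:

A GKE Autopilot or Standard regional cluster.
Two single-host TPU slice node pools that host the JetStream deployment.
The Service component, which spreads inbound traffic across all JetStream HTTP replicas.
JetStream HTTP, an HTTP server that accepts requests, wraps them in JetStream's required format, and sends them to JetStream's gRPC client.
JetStream-PyTorch, a JetStream server that performs inference with continuous batching.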
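The request path through these components can be sketched as a toy model. Everything here is hypothetical and for illustration only: the `prompt` and `max_tokens` field names, the `DecodeRequest` type, the stub server, and the round-robin routing are assumptions rather than the actual JetStream or Kubernetes behavior, and a real deployment reaches the inference server over gRPC rather than through a direct method call.

```python
import itertools
import json
from dataclasses import dataclass


@dataclass
class DecodeRequest:
    """Hypothetical stand-in for the request format the inference server expects."""
    text: str
    max_tokens: int


class StubInferenceServer:
    """Stub for the inference server; it only echoes its input."""

    def decode(self, request):
        return f"generated up to {request.max_tokens} tokens for: {request.text}"


class HttpWrapper:
    """Accepts a JSON body, wraps it in the server's request type, and forwards it."""

    def __init__(self, name, server):
        self.name = name
        self.server = server

    def handle(self, body):
        payload = json.loads(body)
        request = DecodeRequest(
            text=payload["prompt"], max_tokens=int(payload["max_tokens"])
        )
        return {"replica": self.name, "response": self.server.decode(request)}


class Service:
    """Spreads inbound requests across replicas (round-robin as a simplification)."""

    def __init__(self, replicas):
        self._next = itertools.cycle(replicas)

    def route(self, body):
        return next(self._next).handle(body)


service = Service(
    [
        HttpWrapper("http-0", StubInferenceServer()),
        HttpWrapper("http-1", StubInferenceServer()),
    ]
)
body = json.dumps({"prompt": "What is a TPU?", "max_tokens": 16})
replies = [service.route(body) for _ in range(3)]
print([reply["replica"] for reply in replies])
```

The point of the sketch is the separation of concerns: the Service only chooses a replica, the HTTP wrapper only translates formats, and the inference server only generates text.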

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the required API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/resourcemanager.projectIamAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
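The check in the steps above amounts to a set difference between the required roles and the roles bound to you or your groups. The following sketch assumes you have the project's IAM policy bindings available as a list of `{"role": ..., "members": [...]}` entries; the `missing_roles` helper and the example principals are hypothetical. It does not account for roles inherited from a folder or organization, or for broader roles that already include the required permissions.

```python
REQUIRED_ROLES = frozenset(
    {
        "roles/container.admin",
        "roles/iam.serviceAccountAdmin",
        "roles/resourcemanager.projectIamAdmin",
    }
)


def missing_roles(bindings, principals):
    """Return the required roles that none of the given principals hold.

    bindings: list of {"role": str, "members": [str]} policy entries.
    principals: set of member strings for you and the groups you belong to.
    """
    granted = {
        binding["role"]
        for binding in bindings
        if principals & set(binding["members"])
    }
    return sorted(REQUIRED_ROLES - granted)


# Example data: the user and group names are made up for illustration.
bindings = [
    {"role": "roles/container.admin", "members": ["user:alice@example.com"]},
    {"role": "roles/iam.serviceAccountAdmin", "members": ["group:ml-team@example.com"]},
    {"role": "roles/viewer", "members": ["user:alice@example.com"]},
]
me = {"user:alice@example.com", "group:ml-team@example.com"}
missing = missing_roles(bindings, me)
print(missing)
```

Any role that the helper reports as missing is one that you would need to be granted by following the steps under "Grant the roles".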
Grant the roles
1. In the Google Cloud console, go to the IAM page.
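  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.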
