Learning Path: Scalable applications - Monitoring with Prometheus


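This set of tutorials is for IT administrators and operators who want to deploy, run, and manage modern application environments that run on Google Kubernetes Engine (GKE). As you progress through the series, you learn how to configure monitoring and alerts, scale workloads, and simulate failure, all using the Cymbal Bank sample microservices application:

Create a cluster and deploy a sample application
Monitor with Google Cloud Managed Service for Prometheus (this tutorial)
Scale workloads
Simulate a failure
Centralize change management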

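Overview and objectives

The Cymbal Bank sample application used in this set of tutorials is made up of several microservices that all run in the GKE cluster. Problems with any of these services could result in a bad experience for the bank's customers, such as not being able to access the bank's application. The sooner you learn about a problem with the services, the sooner you can start to troubleshoot and resolve it.

In this tutorial, you learn how to monitor workloads in a GKE cluster using Google Cloud Managed Service for Prometheus and Cloud Monitoring. You learn how to complete the following tasks:

Create a Slack webhook for Alertmanager.
Configure Prometheus to monitor the status of a sample microservices-based application.
Simulate an outage and review the alerts sent using the Slack webhook.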

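Costs

Enabling GKE and deploying the Cymbal Bank sample application for this series of tutorials means that you incur GKE cluster charges, as listed on the Google Cloud pricing page, until you disable GKE or delete the project.

You are also responsible for other Google Cloud costs incurred while running the Cymbal Bank sample application, such as charges for Compute Engine VMs and Cloud Monitoring.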

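Before you begin

To learn how to monitor your workloads, you must first complete the first tutorial to create a GKE cluster that uses Autopilot and deploy the Cymbal Bank sample microservices-based application.

We recommend that you complete this set of tutorials for scalable applications in order. As you progress through the set, you learn new skills and use additional Google Cloud products and services.

To show how a GKE Autopilot cluster can use Google Cloud Managed Service for Prometheus to generate messages on a communications platform, this tutorial uses Slack. In your own production deployments, you can use your organization's preferred communication tool to process and deliver messages when your GKE cluster has a problem.

Join a Slack workspace, either by registering with your email or by using an invitation sent by a workspace admin.

Note: If you aren't an admin of the Slack workspace, you might need approval from a workspace admin before you can deploy an application to the workspace.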

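Create a Slack application

An important part of setting up monitoring is making sure that you're notified when actionable events, such as outages, occur. A common pattern is to send notifications to a communication tool such as Slack, which is what this tutorial does. Slack provides a webhooks feature that lets external applications, such as your production deployments, generate messages. In your own environment, you can use other communication tools in your organization to process and deliver messages when your GKE cluster has a problem.

GKE clusters that use Autopilot include a Google Cloud Managed Service for Prometheus instance. This instance can generate alerts when something happens to your applications. These alerts can then use a Slack webhook to send a message to your Slack workspace, so that you're notified promptly when there's a problem.

To set up Slack notifications based on alerts generated by Prometheus, create a Slack application, activate incoming webhooks for the application, and then install the application to a Slack workspace.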

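Sign in to Slack using your workspace name and your Slack account credentials.
Create a new Slack app:
1. In the "Create an app" dialog, click "From scratch".
2. Specify an "App Name" and choose your Slack workspace.
3. Click "Create App".
4. Under "Add features and functionality", click "Incoming Webhooks".
5. Click the "Activate Incoming Webhooks" toggle.
6. In the "Webhook URLs for Your Workspace" section, click "Add New Webhook to Workspace".
7. On the authorization page that opens, select a channel to receive notifications.
8. Click "Allow".
9. A webhook for your Slack application is displayed in the "Webhook URLs for Your Workspace" section. Save the URL for later.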
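
Before you wire the webhook into Alertmanager, it can help to confirm that it accepts messages. The following is a minimal sketch and not part of the tutorial's manifests: slack_payload and send_slack_test are hypothetical helper names, and the sketch assumes that curl and sed are installed. It relies on Slack incoming webhooks accepting a JSON body with a text field.

```shell
# Hypothetical helpers for smoke-testing a Slack incoming webhook.

# Build the JSON body. Only backslashes and double quotes are escaped,
# so keep test messages to a single line of plain text.
slack_payload() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text":"%s"}\n' "$escaped"
}

# POST a test message. $1 = the webhook URL you saved, $2 = the message.
send_slack_test() {
  curl --fail --silent --show-error \
    -X POST \
    -H 'Content-type: application/json' \
    --data "$(slack_payload "$2")" \
    "$1"
}

# Example (not run here; MY_WEBHOOK_URL is an assumed variable that holds
# the URL you saved in the last step):
#   send_slack_test "$MY_WEBHOOK_URL" "Test message from the monitoring tutorial"
```

If the webhook is valid, the test message should appear in the channel that you selected on the authorization page.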

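Configure Alertmanager

In Prometheus, Alertmanager processes the monitoring events that your deployments generate. Alertmanager can skip duplicate events, group related events, and send notifications, such as by using a Slack webhook. This section shows you how to configure Alertmanager to use your new Slack webhook. Specifying which events Alertmanager processes and sends is covered in the next section of the tutorial, "Configure Prometheus".

To configure Alertmanager to use your Slack webhook, complete the following steps:

Change directories to the Git repository that includes all the sample manifests for Cymbal Bank from the previous tutorial: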
```
cd ~/bank-of-anthos/
```
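If needed, change the directory location to where you previously cloned the repository.
Update the Alertmanager sample YAML manifest with the webhook URL of your Slack application:
```
sed -i "s@SLACK_WEBHOOK_URL@SLACK_WEBHOOK_URL@g" "extras/prometheus/gmp/alertmanager.yaml"
```
Before you run the command, replace the second SLACK_WEBHOOK_URL (the replacement part of the expression) with the webhook URL from the previous section. The first SLACK_WEBHOOK_URL is the placeholder that the command searches for in the manifest, so leave it unchanged. Run exactly as written, the command changes nothing.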
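To use your unique Slack webhook URL dynamically without changing the application code, you can use a Kubernetes Secret. The application code reads the value of this Secret. In more complex applications, this approach lets you change or rotate values for security or compliance reasons.

Create a Kubernetes Secret for Alertmanager using the sample YAML manifest that contains the Slack webhook URL: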
```
kubectl create secret generic alertmanager \
  -n gmp-public \
  --from-file=extras/prometheus/gmp/alertmanager.yaml
```
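Prometheus can use exporters to get metrics from applications without code changes. The Prometheus blackbox exporter lets you probe endpoints over protocols such as HTTP or HTTPS. It works well when you can't, or don't want to, expose the inner workings of your application to Prometheus, because it runs alongside your application and needs no changes to the application code.

Deploy the Prometheus blackbox exporter to your cluster: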
```
kubectl apply -f extras/prometheus/gmp/blackbox-exporter.yaml
```
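
The placeholder substitution from earlier in this procedure can be wrapped in a small helper that fails if the placeholder is still present afterward. This is a sketch under stated assumptions: set_webhook_url is a hypothetical name, the manifest contains the literal placeholder SLACK_WEBHOOK_URL, and your webhook URL contains neither @ (the sed delimiter that the tutorial uses) nor & (which sed treats specially in replacements).

```shell
# Hypothetical helper: replace the SLACK_WEBHOOK_URL placeholder in a file.
# $1 = path to the manifest, $2 = your real webhook URL.
set_webhook_url() {
  file=$1
  url=$2
  # Write to a temporary file instead of using sed -i, whose syntax
  # differs between GNU and BSD sed.
  tmp=$(mktemp) || return 1
  sed "s@SLACK_WEBHOOK_URL@${url}@g" "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Copy the content back so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
  # Succeed only if no placeholder is left behind.
  ! grep -q 'SLACK_WEBHOOK_URL' "$file"
}

# Example (not run here; MY_WEBHOOK_URL is an assumed variable):
#   set_webhook_url extras/prometheus/gmp/alertmanager.yaml "$MY_WEBHOOK_URL"
```

Run a helper like this before you create the Secret, because the Secret is built from the contents of the manifest file at that moment.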

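Configure Prometheus

After you have configured Alertmanager to use your Slack webhook, you need to tell Prometheus what to monitor in Cymbal Bank, and what kinds of events you want Alertmanager to notify you about using the Slack webhook.

In the Cymbal Bank sample application that you use in these tutorials, various microservices run in the GKE cluster. One problem that you probably want to know about as soon as possible is whether any of the Cymbal Bank services have stopped responding normally to requests, which could mean that your customers can't access the application. You can configure Prometheus to respond to events based on your organization's policies.

Probes

You can configure Prometheus probes for the resources that you want to monitor. These probes can generate alerts based on the response that they receive. In the Cymbal Bank sample application, you can use HTTP probes that check for 200-level response codes from the Services. An HTTP 200-level response indicates that the Service is running correctly and can respond to requests. If there's a problem and the probe doesn't receive the expected response, you can define Prometheus rules that generate alerts for Alertmanager to process and act on.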
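
To make the mechanism concrete: a blackbox exporter probe is an HTTP GET request to the exporter's /probe path, with the endpoint to check passed as the target query parameter and the type of check passed as module. The following sketch only builds that URL. probe_url is a hypothetical helper, and localhost:9115 in the example assumes the exporter's default port and a local port-forward, neither of which this tutorial sets up.

```shell
# Hypothetical helper: build the URL that a blackbox exporter probe requests.
# $1 = exporter host:port, $2 = probe target, $3 = module (defaults to http_2xx).
probe_url() {
  printf 'http://%s/probe?target=%s&module=%s\n' "$1" "$2" "${3:-http_2xx}"
}

# Example (not run here; assumes the exporter is reachable on localhost:9115):
#   curl "$(probe_url localhost:9115 frontend:80)"
```

The exporter's response includes the probe_success metric, which is 1 when the check passed and 0 when it failed.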

Create some Prometheus probes to monitor the HTTP status of the various microservices of the Cymbal Bank sample application. Review the following sample manifest:

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: frontend-probe
  labels:
    app.kubernetes.io/name: frontend-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [frontend:80]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: userservice-probe
  labels:
    app.kubernetes.io/name: userservice-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [userservice:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: balancereader-probe
  labels:
    app.kubernetes.io/name: balancereader-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [balancereader:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: contacts-probe
  labels:
    app.kubernetes.io/name: contacts-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [contacts:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: ledgerwriter-probe
  labels:
    app.kubernetes.io/name: ledgerwriter-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [ledgerwriter:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: transactionhistory-probe
  labels:
    app.kubernetes.io/name: transactionhistory-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [transactionhistory:8080/ready]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
```

As this manifest file shows, it's a best practice to define a separate PodMonitoring liveness probe for each Deployment.

To create the Prometheus liveness probes, apply the manifest to your cluster:
```
kubectl apply -f extras/prometheus/gmp/probes.yaml
```
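
Because every probe in the manifest follows the same pattern and differs only in the service name and target, the pattern can be expressed as a small template. The following sketch is an illustration, not something the tutorial's repository provides: render_probe is a hypothetical helper that prints one PodMonitoring document with the same fields and values as the probes in the sample manifest.

```shell
# Hypothetical helper: print a PodMonitoring manifest for one service,
# following the pattern of the probes in the tutorial's sample manifest.
# $1 = service name, $2 = probe target (host:port and optional path).
render_probe() {
  cat <<EOF
---
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: $1-probe
  labels:
    app.kubernetes.io/name: $1-probe
spec:
  selector:
    matchLabels:
      app: blackbox-exporter
  endpoints:
  - port: metrics
    path: /probe
    params:
      target: [$2]
      module: [http_2xx]
    timeout: 30s
    interval: 60s
EOF
}

# Example (not run here): review the output, then pipe it to kubectl.
#   render_probe contacts contacts:8080/ready | kubectl apply -f -
```

Generating one document per Deployment keeps each probe separate, which matches the practice that the sample manifest follows.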

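Rules

Prometheus needs to know what you want to do based on the response that the probes you created in the previous steps receive. You define this response using Prometheus rules.

In this tutorial, you create Prometheus rules that generate alerts depending on the response to the liveness probes. Alertmanager then processes the output of these rules to generate notifications using the Slack webhook.

Create rules that generate events based on the response to the liveness probes. Review the following sample manifest: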

```
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: uptime-rule
spec:
  groups:
  - name: Micro services uptime
    interval: 60s
    rules:
    - alert: BalancereaderUnavailable
      expr: probe_success{job="balancereader-probe"} == 0
      for: 1m
      annotations:
        summary: Balance Reader Service is unavailable
        description: Check Balance Reader pods and its logs
      labels:
        severity: 'critical'
    - alert: ContactsUnavailable
      expr: probe_success{job="contacts-probe"} == 0
      for: 1m
      annotations:
        summary: Contacts Service is unavailable
        description: Check Contacts pods and its logs
      labels:
        severity: 'warning'
    - alert: FrontendUnavailable
      expr: probe_success{job="frontend-probe"} == 0
      for: 1m
      annotations:
        summary: Frontend Service is unavailable
        description: Check Frontend pods and its logs
      labels:
        severity: 'critical'
    - alert: LedgerwriterUnavailable
      expr: probe_success{job="ledgerwriter-probe"} == 0
      for: 1m
      annotations:
        summary: Ledger Writer Service is unavailable
        description: Check Ledger Writer pods and its logs
      labels:
        severity: 'critical'
    - alert: TransactionhistoryUnavailable
      expr: probe_success{job="transactionhistory-probe"} == 0
      for: 1m
      annotations:
        summary: Transaction History Service is unavailable
        description: Check Transaction History pods and its logs
      labels:
        severity: 'critical'
    - alert: UserserviceUnavailable
      expr: probe_success{job="userservice-probe"} == 0
      for: 1m
      annotations:
        summary: User Service is unavailable
        description: Check User Service pods and its logs
      labels:
        severity: 'critical'
```

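This manifest defines a Rules resource (kind: Rules), which holds Prometheus alerting rules, and includes the following fields:

spec.groups[*].name: the name of the rule group.
spec.groups[*].interval: how often the rules in the group are evaluated.
spec.groups[*].rules[*].alert: the name of the alert.
spec.groups[*].rules[*].expr: the PromQL expression to evaluate.
spec.groups[*].rules[*].for: how long the expression must keep returning a result before the alert is considered firing.
spec.groups[*].rules[*].annotations: a list of annotations to add to each alert. Only valid for alerting rules.
spec.groups[*].rules[*].labels: the labels to add or overwrite.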
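
The interaction between interval and for can be sketched with a simplified model of how an alerting rule moves from inactive to pending to firing. This is an illustration and not the rule evaluator's actual code: alert_states is a hypothetical helper, it treats each argument as the probe_success value seen at one evaluation, and it assumes that an alert fires once the expression has held for at least the for duration.

```shell
# Hypothetical model of alert state transitions for a rule such as
# probe_success == 0 with a "for" duration.
# $1 = evaluation interval in seconds, $2 = "for" duration in seconds,
# remaining arguments = probe_success value (1 or 0) at each evaluation.
alert_states() {
  interval=$1
  hold=$2
  shift 2
  active_for=-1   # seconds since the expression first held; -1 = inactive
  out=""
  for sample in "$@"; do
    if [ "$sample" -eq 0 ]; then
      if [ "$active_for" -lt 0 ]; then
        active_for=0
      else
        active_for=$((active_for + interval))
      fi
      if [ "$active_for" -ge "$hold" ]; then
        state=firing
      else
        state=pending
      fi
    else
      active_for=-1
      state=inactive
    fi
    out="${out:+$out }$state"
  done
  printf '%s\n' "$out"
}

# Tutorial values: interval 60s, for 1m (60 seconds).
#   alert_states 60 60 1 1 0 0 1
#   prints: inactive inactive pending firing inactive
```

In this model, with the tutorial's values, an alert reaches the firing state on the second consecutive failed evaluation, and a single failed evaluation only makes it pending.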

To create the rules, apply the manifest to your cluster:
```
kubectl apply -f extras/prometheus/gmp/rules.yaml
```

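Simulate an outage

To make sure that your Prometheus probes, rules, and Alertmanager configuration are correct, test that alerts and notifications are sent when there's a problem. If you don't test this flow, you might not realize that your production services have an outage when something goes wrong.

To simulate an outage of one of the microservices, scale the contacts Deployment to zero. With zero instances of the Service, the Cymbal Bank sample application can't read contact information for customers: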
```
kubectl scale deployment contacts --replicas 0
```
GKE might take up to 5 minutes to scale down the Deployment.

Check the status of the Deployments in your cluster and verify that the contacts Deployment scaled down correctly:

```
kubectl get deployments
```

In the following example output, the contacts Deployment has successfully scaled down to 0 instances:

```
NAME                 READY   UP-TO-DATE   AVAILABLE   AGE
balancereader        1/1     1            1           17m
blackbox-exporter    1/1     1            1           5m7s
contacts             0/0     0            0           17m
frontend             1/1     1            1           17m
ledgerwriter         1/1     1            1           17m
loadgenerator        1/1     1            1           17m
transactionhistory   1/1     1            1           17m
userservice          1/1     1            1           17m
```
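
If you want to check a single Deployment from a script rather than by reading the table, the READY column can be extracted with awk. The following is a minimal sketch: deployment_ready is a hypothetical helper, and it assumes the default column layout shown in the example output.

```shell
# Hypothetical helper: read `kubectl get deployments` output on stdin and
# print the READY column for the Deployment named in $1.
# Returns a non-zero status if the Deployment is not listed.
deployment_ready() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Example (not run here):
#   kubectl get deployments | deployment_ready contacts
# For the example output shown above, this prints 0/0.
```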

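After the contacts Deployment has scaled down to zero, the Prometheus probe fails: it no longer receives an HTTP 200-level response, so probe_success drops to 0. That failure generates an alert for Alertmanager to process.

Check your Slack workspace channel for an outage notification message similar to the following example:
```
[FIRING:1] ContactsUnavailable
Severity: Warning :warning:
Summary: Contacts Service is unavailable
Namespace: default
Check Contacts pods and its logs
```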
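In a real outage scenario, you would start troubleshooting and restoring services after you receive the Slack notification. For this tutorial, simulate that process by scaling the contacts Deployment back up to its original replica count: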
```
kubectl scale deployment contacts --replicas 1
```
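It might take up to 5 minutes for the Deployment to scale up and for the Prometheus probe to receive an HTTP 200 response. You can check the status of the Deployments with the kubectl get deployments command.

When the Prometheus probe receives a healthy response, Alertmanager clears the event. You should see an alert resolution notification in your Slack workspace channel, similar to the following example:
```
[RESOLVED] ContactsUnavailable
Severity: Warning :warning:
Summary: Contacts Service is unavailable
Namespace: default
Check Contacts pods and its logs
```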

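Clean up

We recommend that you complete this set of tutorials for Cymbal Bank in order. As you progress through the set, you learn new skills and use additional Google Cloud products and services.

If you want to take a break before you move on to the next tutorial, and you want to avoid incurring charges to your Google Cloud account for the resources used in this tutorial, delete the project that you created.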

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.
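
If you prefer the command line to the console, the same deletion can be done with the gcloud CLI. The following is a sketch, assuming that gcloud is installed and authenticated: delete_project is a hypothetical wrapper whose only job is to refuse to run without a project ID, and gcloud projects delete is the command that performs the deletion.

```shell
# Hypothetical wrapper around the gcloud command that deletes a project.
# $1 = the ID of the project to delete. Refuses to run without one.
delete_project() {
  if [ -z "$1" ]; then
    echo "usage: delete_project PROJECT_ID" >&2
    return 1
  fi
  # gcloud asks for confirmation before it shuts the project down.
  gcloud projects delete "$1"
}

# Example (not run here; replace PROJECT_ID with your own project ID):
#   delete_project PROJECT_ID
```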

What's next

Learn how to scale your deployments in GKE in the next tutorial.