Check cluster connectivity with nettest

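Google Distributed Cloud nettest identifies connectivity issues among the Kubernetes objects in your cluster, such as Pods, nodes, Services, and some external targets. nettest doesn't check connections from external targets to Pods, nodes, or Services. This document describes how to deploy and run nettest using a manifest (nettest.yaml or nettest_rhel.yaml) from the anthos-samples GitHub repository. If you run Google Distributed Cloud on Red Hat Enterprise Linux (RHEL), use nettest_rhel.yaml. If you run Google Distributed Cloud on Ubuntu, use nettest.yaml.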

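This document also describes how to interpret the logs that nettest generates so that you can identify connectivity issues in your cluster.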

About `nettest`

The nettest diagnostic tool consists of the following Kubernetes objects. Each object is specified in the nettest YAML manifest file.

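cloudprober: a DaemonSet and Service that collect network connectivity status, such as error rate and latency.
echoserver: a DaemonSet and Service that respond to cloudprober, which provides the network connectivity metrics.
nettest: a Pod that contains the prometheus and nettest containers.
- prometheus collects metrics from cloudprober.
- nettest queries prometheus and displays the network test results in its logs.
nettest-engine: a ConfigMap that configures the nettest container in the nettest Pod.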

The manifest also specifies the nettest namespace and a dedicated ServiceAccount (along with a ClusterRole and ClusterRoleBinding) to isolate nettest from other cluster resources.

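Run the network test

To deploy nettest, run the command for your operating system. The test runs automatically once the nettest Pod starts, and takes about five minutes to complete.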

Ubuntu:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

RHEL:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest_rhel.yaml
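
The choice between the two manifests can be captured in a small shell helper. The following is a minimal sketch, not part of nettest itself: the function name nettest_manifest_url and its ubuntu and rhel arguments are assumptions made for illustration, while the URLs are the ones shown above.

```shell
# Hypothetical helper (not shipped with nettest): map the node operating
# system to the matching manifest URL from the anthos-samples repository.
nettest_manifest_url() {
  nettest_base="https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest"
  case "$1" in
    ubuntu) echo "$nettest_base/nettest.yaml" ;;
    rhel)   echo "$nettest_base/nettest_rhel.yaml" ;;
    *)      echo "unsupported OS: $1 (expected ubuntu or rhel)" >&2
            return 1 ;;
  esac
}

# Example usage (requires kubectl access to the cluster):
#   kubectl apply -f "$(nettest_manifest_url rhel)"
#   kubectl -n nettest wait --for=condition=Ready pod/nettest --timeout=120s
```

Note that kubectl wait only confirms that the Pod is ready. The test itself still needs about five minutes after that point before the results are complete.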

Get the test results

After the test completes (about five minutes after you deploy the manifest), run the following command to view the results:

kubectl -n nettest logs nettest -c nettest

While nettest runs, it writes messages like the following to stdout:

I0413 03:33:04.879141       1 collectorui.go:130] Listening on ":8999"
I0413 03:33:04.879258       1 prometheus.go:172] Running prometheus controller
E0413 03:33:04.879628       1 prometheus.go:178] Prometheus controller: failed to
retries probers: Get "http://127.0.0.1:9090/api/v1/targets": dial tcp 127.0.0.1:9090:
connect: connection refused

If nettest runs successfully and finds no connectivity failures, you see the following log entry:

I0211 21:58:34.689290       1 validate_metrics.go:78] Metric validation passed!

If nettest finds connectivity problems, it writes log entries like the following:

E0211 06:40:11.948634       1 collector.go:65] Engine error: step validateMetrics failed:
"Error rate in percentage": probe from "10.200.0.3" to "172.26.115.210:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "10.200.0.3" to "172.26.27.229:80" has value 100.000000,
threshold is 1.000000
"Error rate in percentage": probe from "192.168.3.248" to "echoserver-hostnetwork_10.200.0.2_8080"
has value 2.007046, threshold is 1.000000

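Although the default threshold is 1% (1.000000), you can safely ignore error rates of up to 5%. In the preceding example, the error rate for connections from IP address 192.168.3.248 to echoserver-hostnetwork_10.200.0.2_8080 is about 2% (2.007046). This is an example of a reported connectivity issue that you can ignore.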
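
To separate findings that matter from those under the 5% guideline, you could filter the log output by error rate. The following is a sketch under stated assumptions: the function name nettest_significant_errors is hypothetical, and the script assumes that each finding appears on a single line in the raw logs, with the wording shown in the preceding example.

```shell
# Hypothetical helper: read nettest log text on stdin and print only the
# probes whose error rate exceeds the cutoff (default 5, in percent).
# Assumes each finding is on one line, in the form:
#   "Error rate in percentage": probe from "SRC" to "DST" has value N, threshold is T
nettest_significant_errors() {
  awk -v cutoff="${1:-5}" '
    /"Error rate in percentage": probe from/ {
      src = ""; dst = ""; rate = ""
      # Pick up the token that follows each of "from", "to", and "value".
      for (i = 1; i < NF; i++) {
        if ($i == "from")  src  = $(i + 1)
        if ($i == "to")    dst  = $(i + 1)
        if ($i == "value") rate = $(i + 1)
      }
      gsub(/"/, "", src)    # strip the surrounding quotes
      gsub(/"/, "", dst)
      sub(/,$/, "", rate)   # strip the trailing comma after the value
      if (rate != "" && rate + 0 > cutoff + 0)
        printf "%s -> %s %.2f%%\n", src, dst, rate
    }'
}

# Example usage (requires kubectl access to the cluster):
#   kubectl -n nettest logs nettest -c nettest | nettest_significant_errors
```

With the default cutoff, the two probes at 100% in the preceding example would be printed and the 2% probe would be dropped. Pass a different cutoff as the first argument, for example 1, to match the default nettest threshold.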

Interpret the test results

When nettest completes and has found connectivity problems, you see entries like the following in the nettest Pod logs:

"Error rate in percentage": probe from {src} to {dst} has value 100.000000, threshold is 1.000000

where {src} and {dst} can be any of the following:

echoserver Pod IP: a connection involving a Pod on a node.
Node IP: a connection to or from a node.
Service IP: see the details that follow these lists.

In addition, {dst} can be:

google.com: an external connection.
dns: a connection through DNS to the non-hostNetwork Service, that is, echoserver-non-hostnetwork.nettest.svc.cluster.local.

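The details for a Service IP are in the JSON-formatted probe entries in the logs. The following probe example shows that 172.26.27.229:80 is the address of service-clusterip. Two probes have this targets value: one for Pods (pod-service-clusterip) and one for nodes (node-service-clusterip).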
```
probe {
  name: "node-service-clusterip"
  …
  targets {
    host_names: "172.26.27.229:80"
  }
```
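
To look up which probe a reported address belongs to, you could list probe names next to their target addresses. This is only a sketch: the function name nettest_probe_targets is hypothetical, and it assumes that probe entries in the logs are laid out like the preceding excerpt, with name: and host_names: each at the start of their own line.

```shell
# Hypothetical helper: read nettest log text on stdin and print
# "probe-name address" pairs, one per line.
nettest_probe_targets() {
  awk '
    # Remember the most recent probe name.
    $1 == "name:"       { name = $2; gsub(/"/, "", name) }
    # Pair each target address with the probe it was listed under.
    $1 == "host_names:" { host = $2; gsub(/"/, "", host); print name, host }
  '
}

# Example usage (requires kubectl access to the cluster):
#   kubectl -n nettest logs nettest -c nettest | nettest_probe_targets | grep 172.26.27.229
```

If the raw log lines carry a prefix before the probe fields, the field matching would need to be adjusted accordingly.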

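Verify the fix

After you resolve all reported connectivity issues, remove the nettest Pod and reapply the nettest manifest to run the connectivity test again.

For example, to rerun nettest on Ubuntu, run the following commands: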

kubectl -n nettest delete pod nettest
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/anthos-samples/main/anthos-bm-utils/abm-nettest/nettest.yaml

Clean up `nettest`

When you finish testing, run the following commands to remove all nettest resources:

kubectl delete namespace nettest
kubectl delete clusterroles nettest:nettest
kubectl delete clusterrolebindings nettest:nettest

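What's next

If you need additional assistance, contact Cloud Customer Care. To learn more about support resources, including the following, see "Getting support":

Requirements for opening a support case.
Tools to help you troubleshoot, such as environment configuration, logs, and metrics.
Supported components.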