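Introduction
Several new services recently went live in production, and their metrics need to be monitored. I'm using the ServiceMonitor resource from Prometheus-Operator for this.
Let's get started.
Getting started
Here are the ServiceMonitor manifests, one per service: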
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lobby-bank-consumer
  namespace: lobby
  labels:
    app.kubernetes.io/name: lobby-bank-consumer
    app.kubernetes.io/part-of: lobby
spec:
  selector:
    matchLabels:
      app: lobby-bank-consumer
  namespaceSelector:
    matchNames:
    - lobby
  endpoints:
  - port: tcp-63200
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lobby-bank-producer
  namespace: lobby
  labels:
    app.kubernetes.io/name: lobby-bank-producer
    app.kubernetes.io/part-of: lobby
spec:
  selector:
    matchLabels:
      app: lobby-bank-producer
  namespaceSelector:
    matchNames:
    - lobby
  endpoints:
  - port: tcp-63100
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: lobby-bank-server
  namespace: lobby
  labels:
    app.kubernetes.io/name: lobby-bank-server
    app.kubernetes.io/part-of: lobby
spec:
  selector:
    matchLabels:
      app: lobby-bank-server
  namespaceSelector:
    matchNames:
    - lobby
  endpoints:
  - port: tcp-63001
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s
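A ServiceMonitor selects Service objects by label, and the `port` under `endpoints` is the name of a port on that Service, not a container port. The Service manifests are not shown in this post, so the following is only a sketch of what a matching Service for lobby-bank-server might look like. The Service name, the pod selector, and the port number 63001 are assumptions inferred from the port name.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: lobby-bank-server      # assumed name, not taken from the post
  namespace: lobby             # must be listed in namespaceSelector.matchNames
  labels:
    app: lobby-bank-server     # must match spec.selector.matchLabels of the ServiceMonitor
spec:
  selector:
    app: lobby-bank-server     # assumed pod label
  ports:
  - name: tcp-63001            # must equal endpoints[].port in the ServiceMonitor
    port: 63001                # assumed from the port name
    targetPort: 63001          # assumed container port serving /metrics
```

If the Service label or the port name differs from what the ServiceMonitor expects, no target is generated at all, so these two fields are worth checking first.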
Deploy:
$ kubectl apply -f lobby-bank-sm.yaml
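After deploying, no data shows up:
[Image]
Time to troubleshoot.
Troubleshooting
I went through my ServiceMonitor YAML carefully and found nothing wrong with it, which was odd.
After thinking it over for a while, I doubted it was something like RBAC, but with no other leads left I went to look at the Prometheus logs.
To my surprise, that is exactly where the problem was:
[Image: Prometheus logs]
The fix was to add the corresponding resources and verbs to the ClusterRole rules: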
- apiGroups:
  - "monitoring.coreos.com"
  resources:
  - servicemonitors
  - podmonitors
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
Here is the complete YAML file:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 3.0.1
  name: prometheus-k8s
rules:
- apiGroups:
  - "monitoring.coreos.com"
  resources:
  - servicemonitors
  - podmonitors
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
- nonResourceURLs:
  - /metrics
  - /metrics/slis
  verbs:
  - get
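The ClusterRole above grants read access across the whole cluster. If that is broader than you want, a narrower option is a Role and RoleBinding scoped to the lobby namespace only, which is the pattern kube-prometheus itself uses for its default namespaces. This is a sketch, not what was deployed in this post; it assumes Prometheus runs as the ServiceAccount prometheus-k8s in the monitoring namespace (the kube-prometheus defaults).

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: lobby               # permissions apply to this namespace only
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: lobby
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s           # assumed ServiceAccount name
  namespace: monitoring          # assumed namespace where Prometheus runs
```

With this approach, each newly monitored namespace needs its own Role and RoleBinding. That is more to maintain but keeps Prometheus's access limited to the namespaces it actually scrapes.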
Redeploy Prometheus-Operator:
$ kubectl delete -f .
$ kubectl create -f .
Wait for all components to finish starting.
Check again:
[Image]
Finally, it is worth confirming with a PromQL query as well:
[Image]