Grafana shows "no data" for Pod monitoring
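The cluster is a Kubernetes cluster created by Rancher (RKE), running Kubernetes 1.24.x. The Monitoring app was installed from Rancher, but Pod metrics cannot be viewed: Grafana shows "no data".
Querying with PromQL in the Prometheus UI shows that the metric series are missing the container, image, name, namespace and pod labels.
The raw cAdvisor output confirms this. The container, image and name labels are empty on every series, and namespace and pod are empty on most of them: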
curl -k -H "Authorization: Bearer $TOKEN" https://10.6.128.7:10250/metrics/cadvisor
container_cpu_load_average_10s{container="",id="/",image="",name="",namespace="",pod=""} 0 1666834382282
container_cpu_load_average_10s{container="",id="/docker/5678922ca0bd7afc30b75ffa4ae5fb96298170c3f58a47ae335940b20cd6fa7b",image="",name="",namespace="",pod=""} 0 1666834372644
container_cpu_load_average_10s{container="",id="/kubepods",image="",name="",namespace="",pod=""} 0 1666834372281
container_cpu_load_average_10s{container="",id="/kubepods/besteffort",image="",name="",namespace="",pod=""} 0 1666834378893
container_cpu_load_average_10s{container="",id="/kubepods/besteffort/pod25a7ff7b-7058-4015-8f35-62b2b2a07035",image="",name="",namespace="longhorn-system",pod="csi-resizer-67c8b75747-bgz4c"} 0 1666834369247
> ...
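To quantify how many series are affected rather than eyeballing the output, the text returned by the curl command can be scanned for empty labels. The following is a minimal sketch, not part of the original troubleshooting: the helper names `parse_labels` and `summarize` are made up for illustration, the label parsing is a simplified regular expression rather than a full Prometheus text-format parser, and `SAMPLE` is simply the five lines shown above.

```python
import re

# Simplified matcher for name="value" pairs inside {...}; an assumption that is
# good enough for cAdvisor output, not a complete exposition-format parser.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_labels(line):
    """Return the label dict of one Prometheus text-format sample line."""
    start, end = line.find("{"), line.rfind("}")
    if start == -1 or end == -1:
        return {}
    return dict(LABEL_RE.findall(line[start + 1:end]))


def summarize(lines, wanted=("container", "image", "name", "namespace", "pod")):
    """Count, per label, how many samples carry an empty (or absent) value."""
    empty = {label: 0 for label in wanted}
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        labels = parse_labels(line)
        total += 1
        for label in wanted:
            if not labels.get(label):
                empty[label] += 1
    return total, empty


SAMPLE = [
    'container_cpu_load_average_10s{container="",id="/",image="",name="",namespace="",pod=""} 0 1666834382282',
    'container_cpu_load_average_10s{container="",id="/docker/5678922ca0bd7afc30b75ffa4ae5fb96298170c3f58a47ae335940b20cd6fa7b",image="",name="",namespace="",pod=""} 0 1666834372644',
    'container_cpu_load_average_10s{container="",id="/kubepods",image="",name="",namespace="",pod=""} 0 1666834372281',
    'container_cpu_load_average_10s{container="",id="/kubepods/besteffort",image="",name="",namespace="",pod=""} 0 1666834378893',
    'container_cpu_load_average_10s{container="",id="/kubepods/besteffort/pod25a7ff7b-7058-4015-8f35-62b2b2a07035",image="",name="",namespace="longhorn-system",pod="csi-resizer-67c8b75747-bgz4c"} 0 1666834369247',
]

if __name__ == "__main__":
    total, empty = summarize(SAMPLE)
    print(f"{total} samples, empty label counts: {empty}")
```

In practice the curl output could be saved to a file and passed to `summarize` line by line. If container, image and name are empty on every series, the cluster is most likely hitting the issue described here.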
A search turned up the following issues:
- Missing image, name and container labels from cAdvisor metrics in 1.24 · Issue #111077 · kubernetes/kubernetes · GitHub
- [BUG, RKE1, Monitoring V2] RKE1 1.24 seems to be omitting relevant cadvisor container labels and metric series that break Monitoring V2 dashboards · Issue #38934 · rancher/rancher · GitHub
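On Kubernetes 1.24 the problem reproduces whenever Docker is the container runtime, because 1.24 removed dockershim support. Since the cAdvisor built into the kubelet is the part that misbehaves, the idea is to try a standalone cAdvisor instead.
Digging through the Kubernetes source code shows that the dockershim-related code was already removed in v1.24.0-alpha.1: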
https://github.com/kubernetes/kubernetes/commit/bc78dff42ec6be929648e91f3ef2dd6dae5169fb
In the issue "[BUG, RKE1, Monitoring V2] RKE1 1.24 seems to be omitting relevant cadvisor container labels and metric series that break Monitoring V2 dashboards" (Issue #38934 · rancher/rancher), someone pointed out that the repository https://github.com/fe-ax/cadvisor-k8s-fix provides a fix.
The YAML in that repository has a few small problems. I modified it, and after testing, the issue was resolved.
How to use it:
- Upgrade the Monitoring app and go to the Edit YAML page
- Set defaultRules.rules.k8s to false
- Set kubelet.serviceMonitor.cAdvisor to false
- Set additionalPrometheusRulesMap to the following:
additionalPrometheusRulesMap:
  k8s-custom-rules:
    groups:
      - name: k8s.rules
        rules:
          - expr: >-
              sum by (cluster, namespace, pod, container) (
              irate(container_cpu_usage_seconds_total{job="cadvisor", metrics_path="/metrics", image!=""}[5m])
              ) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) (
              1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})
              )
            record: >-
              node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
          - expr: >-
              container_memory_working_set_bytes{job="cadvisor", metrics_path="/metrics", image!=""}
              * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
              max by(namespace, pod, node) (kube_pod_info{node!=""})
              )
            record: node_namespace_pod_container:container_memory_working_set_bytes
          - expr: >-
              container_memory_rss{job="cadvisor", metrics_path="/metrics", image!=""}
              * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
              max by(namespace, pod, node) (kube_pod_info{node!=""})
              )
            record: node_namespace_pod_container:container_memory_rss
          - expr: >-
              container_memory_cache{job="cadvisor", metrics_path="/metrics", image!=""}
              * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
              max by(namespace, pod, node) (kube_pod_info{node!=""})
              )
            record: node_namespace_pod_container:container_memory_cache
          - expr: >-
              container_memory_swap{job="cadvisor", metrics_path="/metrics", image!=""}
              * on (namespace, pod) group_left(node) topk by(namespace, pod) (1,
              max by(namespace, pod, node) (kube_pod_info{node!=""})
              )
            record: node_namespace_pod_container:container_memory_swap
          - expr: >-
              kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"}
              * on (namespace, pod, cluster) group_left() max by (namespace, pod) (
              (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
              )
            record: >-
              cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
          - expr: |-
              sum by (namespace, cluster) (
                  sum by (namespace, pod, cluster) (
                      max by (namespace, pod, container, cluster) (
                        kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"}
                      ) * on(namespace, pod, cluster) group_left() max by (namespace, pod) (
                        kube_pod_status_phase{phase=~"Pending|Running"} == 1
                      )
                  )
              )
            record: namespace_memory:kube_pod_container_resource_requests:sum
          - expr: >-
              kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"}
              * on (namespace, pod, cluster) group_left() max by (namespace, pod) (
              (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
              )
            record: >-
              cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
          - expr: |-
              sum by (namespace, cluster) (
                  sum by (namespace, pod, cluster) (
                      max by (namespace, pod, container, cluster) (
                        kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"}
                      ) * on(namespace, pod, cluster) group_left() max by (namespace, pod) (
                        kube_pod_status_phase{phase=~"Pending|Running"} == 1
                      )
                  )
              )
            record: namespace_cpu:kube_pod_container_resource_requests:sum
          - expr: >-
              kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"}
              * on (namespace, pod, cluster) group_left() max by (namespace, pod) (
              (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
              )
            record: >-
              cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
          - expr: |-
              sum by (namespace, cluster) (
                  sum by (namespace, pod, cluster) (
                      max by (namespace, pod, container, cluster) (
                        kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"}
                      ) * on(namespace, pod, cluster) group_left() max by (namespace, pod) (
                        kube_pod_status_phase{phase=~"Pending|Running"} == 1
                      )
                  )
              )
            record: namespace_memory:kube_pod_container_resource_limits:sum
          - expr: >-
              kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"}
              * on (namespace, pod, cluster) group_left() max by (namespace, pod) (
              (kube_pod_status_phase{phase=~"Pending|Running"} == 1)
              )
            record: >-
              cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
          - expr: |-
              sum by (namespace, cluster) (
                  sum by (namespace, pod, cluster) (
                      max by (namespace, pod, container, cluster) (
                        kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"}
                      ) * on(namespace, pod, cluster) group_left() max by (namespace, pod) (
                        kube_pod_status_phase{phase=~"Pending|Running"} == 1
                      )
                  )
              )
            record: namespace_cpu:kube_pod_container_resource_limits:sum
          - expr: |-
              max by (cluster, namespace, workload, pod) (
                label_replace(
                  label_replace(
                    kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
                    "replicaset", "$1", "owner_name", "(.*)"
                  ) * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (
                    1, max by (replicaset, namespace, owner_name) (
                      kube_replicaset_owner{job="kube-state-metrics"}
                    )
                  ),
                  "workload", "$1", "owner_name", "(.*)"
                )
              )
            labels:
              workload_type: deployment
            record: namespace_workload_pod:kube_pod_owner:relabel
          - expr: |-
              max by (cluster, namespace, workload, pod) (
                label_replace(
                  kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
                  "workload", "$1", "owner_name", "(.*)"
                )
              )
            labels:
              workload_type: daemonset
            record: namespace_workload_pod:kube_pod_owner:relabel
          - expr: |-
              max by (cluster, namespace, workload, pod) (
                label_replace(
                  kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
                  "workload", "$1", "owner_name", "(.*)"
                )
              )
            labels:
              workload_type: statefulset
            record: namespace_workload_pod:kube_pod_owner:relabel
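Apply the following YAML. Note that it does not create the cadvisor namespace, so that namespace must already exist: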
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
---
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  labels:
    app: cadvisor
  name: cadvisor
spec:
  allowedHostPaths:
    - pathPrefix: /
    - pathPrefix: /var/run
    - pathPrefix: /sys
    - pathPrefix: /var/lib/docker
    - pathPrefix: /dev/disk
    - pathPrefix: /var/lib/containerd
    - pathPrefix: /run/containerd
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app: cadvisor
  name: cadvisor
rules:
  - apiGroups:
      - policy
    resourceNames:
      - cadvisor
    resources:
      - podsecuritypolicies
    verbs:
      - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app: cadvisor
  name: cadvisor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cadvisor
subjects:
  - kind: ServiceAccount
    name: cadvisor
    namespace: cadvisor
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
spec:
  clusterIP: None
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: http
  selector:
    app: cadvisor
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
      name: cadvisor
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        app: cadvisor
        name: cadvisor
    spec:
      automountServiceAccountToken: false
      containers:
        - args:
            - --housekeeping_interval=10s
            - --max_housekeeping_interval=15s
            - --event_storage_event_limit=default=0
            - --event_storage_age_limit=default=0
            - --enable_metrics=app,cpu,disk,diskIO,memory,network,process
            - --docker_only
            - --store_container_labels=false
            - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace
          image: zcube/cadvisor:v0.45.0
          name: cadvisor
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          resources:
            limits:
              cpu: 1
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 256Mi
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /dev
              name: dev
            - mountPath: /rootfs
              name: rootfs
              readOnly: true
            - mountPath: /var/run
              name: var-run
              readOnly: true
            - mountPath: /sys
              name: sys
              readOnly: true
            - mountPath: /var/lib/docker
              name: docker
              readOnly: true
            - mountPath: /dev/disk
              name: disk
              readOnly: true
            - mountPath: /run/containerd
              name: containerd
              readOnly: true
            - mountPath: /var/lib/containerd
              name: containerd-var
              readOnly: true
      priorityClassName: system-node-critical
      serviceAccountName: cadvisor
      terminationGracePeriodSeconds: 30
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          operator: Exists
        - effect: NoExecute
          operator: Exists
      volumes:
        - hostPath:
            path: /dev
          name: dev
        - hostPath:
            path: /
          name: rootfs
        - hostPath:
            path: /var/run
          name: var-run
        - hostPath:
            path: /sys
          name: sys
        - hostPath:
            path: /var/lib/docker
          name: docker
        - hostPath:
            path: /dev/disk
          name: disk
        - hostPath:
            path: /var/lib/containerd
            type: ""
          name: containerd-var
        - hostPath:
            path: /run/containerd
            type: ""
          name: containerd
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: cadvisor
  name: cadvisor
  namespace: cadvisor
spec:
  endpoints:
    - honorLabels: true
      metricRelabelings:
        - sourceLabels:
            - container_label_io_kubernetes_pod_name
          targetLabel: pod
        - sourceLabels:
            - container_label_io_kubernetes_pod_namespace
          targetLabel: namespace
        - sourceLabels:
            - container_label_io_kubernetes_container_name
          targetLabel: container
        - replacement: ""
          targetLabel: cluster
      path: /metrics
      port: http
      relabelings:
        - sourceLabels:
            - __metrics_path__
          targetLabel: metrics_path
  namespaceSelector:
    matchNames:
      - cadvisor
  selector:
    matchLabels:
      app: cadvisor
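The metricRelabelings in the ServiceMonitor are what restore the pod, namespace and container labels that the dashboards and the recording rules expect. The sketch below imitates the default "replace" relabel action to show the effect on a single series. It is a simplified model written for illustration, not Prometheus code: `apply_replace` and `RULES` are hypothetical names, only `$N` references in the replacement are handled, and the container name `csi-resizer` and the cluster value in the sample are assumed values.

```python
import re


def apply_replace(labels, source_labels=(), target_label="", regex="(.*)",
                  replacement="$1", separator=";"):
    """Simplified model of a Prometheus 'replace' relabel step."""
    # Source label values are joined with the separator; absent labels count as "".
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex is anchored on both ends
    out = dict(labels)
    if match is None:
        return out  # no match: the label set is left unchanged
    # Expand $1, $2, ... in the replacement with the captured groups.
    new_value = re.sub(r"\$(\d+)",
                       lambda m: match.group(int(m.group(1))) or "",
                       replacement)
    if new_value == "":
        out.pop(target_label, None)  # an empty value removes the label
    else:
        out[target_label] = new_value
    return out


# The four metricRelabelings entries from the ServiceMonitor above.
RULES = [
    {"source_labels": ["container_label_io_kubernetes_pod_name"], "target_label": "pod"},
    {"source_labels": ["container_label_io_kubernetes_pod_namespace"], "target_label": "namespace"},
    {"source_labels": ["container_label_io_kubernetes_container_name"], "target_label": "container"},
    {"target_label": "cluster", "replacement": ""},
]


def relabel(labels):
    """Apply all rules in order, as Prometheus does."""
    for rule in RULES:
        labels = apply_replace(labels, **rule)
    return labels


if __name__ == "__main__":
    series = {
        "__name__": "container_cpu_usage_seconds_total",
        "container_label_io_kubernetes_pod_name": "csi-resizer-67c8b75747-bgz4c",
        "container_label_io_kubernetes_pod_namespace": "longhorn-system",
        "container_label_io_kubernetes_container_name": "csi-resizer",  # assumed value
        "cluster": "example",  # assumed value, removed by the last rule
    }
    print(relabel(series))
```

The last rule has no source labels and an empty replacement, so in this model it removes any cluster label from the scraped series. The recording rules given earlier filter on job="cadvisor" and metrics_path="/metrics", which correspond to the Service name and the relabelings entry in the same ServiceMonitor.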
Checking the Grafana dashboards again, the monitoring data is now there, and the problem is solved.
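The same check can be made against Prometheus directly instead of through Grafana, by running an instant query and confirming that the returned series carry non-empty namespace, pod and container labels. This is a rough sketch under stated assumptions: it uses the standard Prometheus HTTP API endpoint /api/v1/query, the base URL passed to `fetch` is a placeholder to replace with your own, and `SAMPLE_RESPONSE` is a hand-written example of the response shape, not output captured from the cluster.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL of the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def labelled_series(response, required=("namespace", "pod", "container")):
    """Return the label sets whose required labels are all non-empty."""
    if response.get("status") != "success":
        return []
    return [
        sample["metric"]
        for sample in response["data"]["result"]
        if all(sample["metric"].get(label) for label in required)
    ]


def fetch(base_url, promql):
    """Run the query against a live Prometheus (not called in this sketch)."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        return json.load(resp)


# Hand-written example of an instant-query response; values are illustrative.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "container_memory_working_set_bytes",
                        "namespace": "longhorn-system",
                        "pod": "csi-resizer-67c8b75747-bgz4c",
                        "container": "csi-resizer"},
             "value": [1666834369.247, "1"]},
            {"metric": {"__name__": "container_memory_working_set_bytes",
                        "id": "/kubepods"},
             "value": [1666834369.247, "1"]},
        ],
    },
}

if __name__ == "__main__":
    good = labelled_series(SAMPLE_RESPONSE)
    print(f"{len(good)} of {len(SAMPLE_RESPONSE['data']['result'])} series are fully labelled")
```

Against a real cluster, something like `labelled_series(fetch(base_url, 'container_memory_working_set_bytes{job="cadvisor"}'))` should return a non-empty list once the standalone cAdvisor is being scraped.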