Monitoring Kubernetes with kube-state-metrics and WeChat Alerts
Preface
| Monitoring target | Implementation | Examples |
|---|---|---|
| Pod performance | cAdvisor | Container CPU and memory utilization |
| Node performance | node-exporter | Node CPU and memory utilization |
| K8S resource objects | kube-state-metrics | Pod/Deployment/Service |
Data Collection
Here we use kube-state-metrics to collect data about Kubernetes resource objects.
Architecture diagram
Monitoring Metrics
The metric categories include:
- CronJob Metrics
- DaemonSet Metrics
- Deployment Metrics
- Job Metrics
- LimitRange Metrics
- Node Metrics
- PersistentVolume Metrics
- PersistentVolumeClaim Metrics
- Pod Metrics
- Pod Disruption Budget Metrics
- ReplicaSet Metrics
- ReplicationController Metrics
- ResourceQuota Metrics
- Service Metrics
- StatefulSet Metrics
- Namespace Metrics
- Horizontal Pod Autoscaler Metrics
- Endpoint Metrics
- Secret Metrics
- ConfigMap Metrics
Taking Pod metrics as an example (see the sample queries after this list):
- kube_pod_info
- kube_pod_owner
- kube_pod_status_phase
- kube_pod_status_ready
- kube_pod_status_scheduled
- kube_pod_container_status_waiting
- kube_pod_container_status_terminated_reason
- ...
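These series can be queried directly in the Prometheus expression browser. A couple of illustrative queries (a sketch — the metric and label names are standard kube-state-metrics series, but the results obviously depend on your cluster):

```promql
# Pods currently not in the Running phase, grouped by namespace and phase
sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"})

# Pods whose readiness condition is currently false
kube_pod_status_ready{condition="false"} == 1
```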
Deploying kube-state-metrics
By default the manifests create their resources in the kube-system namespace; it is best not to change the namespace in the yaml files.
```bash
# Fetch the yml files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git

# Deploy kube-state-metrics
kubectl apply -f kube-state-metrics-configs

# Check the pod status
kubectl get pod -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE
kube-state-metrics-c698dc7b5-zstz9   1/1     Running   0          32m
```
Check the logs of the kube-state-metrics-c698dc7b5-zstz9 container, as shown below:
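Before wiring it into Prometheus you can also hit the metrics endpoint directly. A minimal sketch, assuming the manifests expose a kube-state-metrics Service on port 8080 in kube-system (the same address the scrape job below points at):

```bash
# Forward the kube-state-metrics Service to localhost
kubectl port-forward -n kube-system svc/kube-state-metrics 8080:8080 &

# The endpoint returns plain-text Prometheus metrics
curl -s http://localhost:8080/metrics | grep kube_pod_status_phase | head
```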
Feeding the Data into Prometheus
Once Prometheus is ready, simply add the corresponding scrape job:
```yaml
# Append to the end of sidecar/cm-kube-mon-sidecar.yaml
- job_name: 'kube-state-metrics'
  static_configs:
  - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
```
```bash
# Re-apply the ConfigMap; no need to restart Prometheus (hot reload is configured)
kubectl apply -f sidecar/cm-kube-mon-sidecar.yaml
```
Open the Prometheus web UI and check whether the new target has taken effect, as shown below:
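The same check can also be scripted against the Prometheus HTTP API. A sketch — the Service address and the use of jq are assumptions; substitute a port-forward or your own address:

```bash
# Show the scrape URL and health of every target belonging to the kube-state-metrics job
curl -s http://prometheus.kube-mon.svc.cluster.local:9090/api/v1/targets \
  | jq '.data.activeTargets[] | select(.labels.job == "kube-state-metrics") | {scrapeUrl, health}'
```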
Feeding the Data into Grafana
Once Grafana is ready, import a selected or custom dashboard in Grafana and add Prometheus as a data source:
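If you manage Grafana declaratively, the data source can also be added through file-based provisioning instead of the UI. A minimal sketch, assuming a hypothetical provisioning file path and an assumed in-cluster Prometheus Service address:

```yaml
# provisioning/datasources/prometheus.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.kube-mon.svc.cluster.local:9090
    isDefault: true
```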
Prometheus Alerting Rules
Edit the sidecar/rules-cm-kube-mon-sidecar.yaml config file and add the following alerting rules.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-mon
data:
  alert-rules.yaml: |-
    groups:
    - name: White box monitoring
      rules:
      - alert: Pod-重启
        expr: changes(kube_pod_container_status_restarts_total{pod !~ "analyzer.*"}[10m]) > 0
        for: 1m
        labels:
          severity: 警告
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} Restart"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"
      - alert: Pod-未知错误/失败
        expr: kube_pod_status_phase{phase="Unknown"} == 1 or kube_pod_status_phase{phase="Failed"} == 1
        for: 1m
        labels:
          severity: 紧急
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} 未知错误/失败"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"
      - alert: Daemonset Unavailable
        expr: kube_daemonset_status_number_unavailable > 0
        for: 5m
        labels:
          severity: 紧急
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Daemonset: {{ $labels.daemonset }} 守护进程不可用"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          daemonset: "{{ $labels.daemonset }}"
      - alert: Job-失败
        expr: kube_job_status_failed == 1
        for: 5m
        labels:
          severity: 警告
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Job: {{ $labels.job_name }} Failed"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          job: "{{ $labels.job_name }}"
      - alert: Pod NotReady
        expr: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id)(kube_pod_status_phase{job=~".*kubernetes-service-endpoints",phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
        for: 5m
        labels:
          severity: 警告
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "pod: {{ $labels.pod }} 处于 NotReady 状态超过15分钟"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
      - alert: Deployment副本数
        expr: (kube_deployment_spec_replicas{job=~".*kubernetes-service-endpoints"} != kube_deployment_status_replicas_available{job=~".*kubernetes-service-endpoints"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: 警告
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Deployment: {{ $labels.deployment }} 实际副本数和设置副本数不一致"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          deployment: "{{ $labels.deployment }}"
      - alert: Statefulset副本数
        expr: (kube_statefulset_status_replicas_ready{job=~".*kubernetes-service-endpoints"} != kube_statefulset_status_replicas{job=~".*kubernetes-service-endpoints"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: 警告
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Statefulset: {{ $labels.statefulset }} 实际副本数和设置副本数不一致"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          statefulset: "{{ $labels.statefulset }}"
      - alert: 存储卷PV
        expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: 紧急
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "存储卷PV: {{ $labels.persistentvolume }} 处于Failed或Pending状态"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          persistentvolume: "{{ $labels.persistentvolume }}"
      - alert: 存储卷PVC
        expr: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: 紧急
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "存储卷PVC: {{ $labels.persistentvolumeclaim }} Failed或Pending状态"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          persistentvolumeclaim: "{{ $labels.persistentvolumeclaim }}"
      - alert: k8s service
        expr: kube_service_status_load_balancer_ingress != 1
        for: 5m
        labels:
          severity: 紧急
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Service: {{ $labels.service }} 服务负载均衡器入口状态DOWN!"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          service_name: "{{ $labels.service }}"
```
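Optionally, the rule syntax can be validated locally with promtool before applying. A sketch, assuming the contents of the alert-rules.yaml key (everything under the |- block) have been saved to a local file named alert-rules.yaml:

```bash
# promtool ships with the Prometheus release; this only checks rule syntax
promtool check rules alert-rules.yaml
```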
Apply the updated sidecar/rules-cm-kube-mon-sidecar.yaml config file, as follows:
```bash
# Apply the updated ConfigMap and wait about a minute; Prometheus hot-reloads the rules, so no restart is needed
kubectl apply -f rules-cm-kube-mon-sidecar.yaml
```
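To confirm the new groups were picked up after the hot reload, check Status -> Rules in the web UI, or query the rules API (a sketch; the Service address and the use of jq are assumptions):

```bash
# List the names of all rules Prometheus currently has loaded
curl -s http://prometheus.kube-mon.svc.cluster.local:9090/api/v1/rules \
  | jq -r '.data.groups[].rules[].name'
```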
Integrating with AlertManager
Deploying WeChat Alerting
```bash
# Fetch the yml files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git && cd thanos/AlertManager

# Deploy AlertManager
## Replace with your own WeChat credentials first
kubectl apply -f cm-kube-mon-alertmanager.yaml
kubectl apply -f wechat-template-kube-mon.yaml
kubectl apply -f deploy-kube-mon-alertmanager.yaml
kubectl apply -f svc-kube-mon-alertmanager.yaml
```
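The WeChat (WeChat Work) credentials themselves live in cm-kube-mon-alertmanager.yaml from the cloned repo. For orientation, the receiver section of an Alertmanager configuration for WeChat typically looks like the sketch below; every value here is a placeholder you must replace with your own:

```yaml
# Sketch of the WeChat receiver in alertmanager.yml (placeholder values)
route:
  receiver: wechat
receivers:
  - name: wechat
    wechat_configs:
      - corp_id: 'your-corp-id'        # WeChat Work corporation ID
        agent_id: 'your-agent-id'      # application (agent) ID
        api_secret: 'your-api-secret'  # application secret
        to_party: '1'                  # target department ID
        send_resolved: true            # also send recovery messages
```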
Check the alert status
WeChat alert and recovery messages
Alert message
Recovery message
