
Using Prometheus + Alertmanager for Monitoring and Alerting in Kubernetes

Date: 2018-12-16

Monitoring and alerting prototype diagram


Diagram explanation

Prometheus and Alertmanager run as containers in the same pod, managed by a Deployment controller. Alertmanager listens on port 9093 by default; because Prometheus and Alertmanager share a pod, Prometheus can reach Alertmanager directly at localhost:9093 to deliver alert notifications. The alert rules (rules.yml) are mounted into the Prometheus container as a ConfigMap, and the notification-receiver configuration is likewise mounted into the Alertmanager container as a ConfigMap. In this example we send alert notifications by email; the details are in alertmanager.yml.
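To make the pod layout concrete, here is a minimal sketch of what the containers section of such a Deployment's pod template could look like; the image tags are assumptions, not values taken from the original article:

# Minimal sketch: both containers in one pod template
containers:
- name: prometheus
  image: prom/prometheus:v2.3.2        # assumed image tag
  ports:
  - containerPort: 9090                # Prometheus web UI and API
- name: alertmanager
  image: prom/alertmanager:v0.15.2     # assumed image tag
  ports:
  - containerPort: 9093                # reachable from the prometheus container as localhost:9093

Both containers share the pod's network namespace, which is why localhost:9093 works without any Service in between.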

Test environment

Environment: Linux 3.10.0-693.el7.x86_64 x86_64 GNU/Linux
Platform: Kubernetes v1.10.5
Tip: the complete Prometheus and Alertmanager configuration is at the end of this document.

Creating alert rules

The path to the alert rules is specified in the Prometheus configuration; rules.yml holds the alerting rules. We mount rules.yml into the /etc/prometheus directory as a ConfigMap:
rule_files:
  - /etc/prometheus/rules.yml
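As a rough sketch, the corresponding volume wiring in the Prometheus pod spec might look like this; the volume name is an assumption, while the ConfigMap name matches prometheus-cm.yaml further down:

# Under the prometheus container (assumed volume name):
volumeMounts:
- name: prometheus-config
  mountPath: /etc/prometheus           # prometheus.yml and rules.yml both land here
# Under the pod spec:
volumes:
- name: prometheus-config
  configMap:
    name: prometheus-config-v0.1.0     # ConfigMap defined in prometheus-cm.yaml below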

Here we define a single InstanceDown alert: when a host has been down for 1 minute, Prometheus fires the alert.

rules.yml: |
  groups:
  - name: example
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 1m
      labels:
        severity: page
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

Configuring Prometheus-to-Alertmanager communication (so Prometheus can send alerts to Alertmanager)

Alertmanager listens on port 9093 by default, and since Prometheus and Alertmanager are in the same pod, Prometheus can talk to Alertmanager directly at localhost:9093:
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]
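For comparison only: if Alertmanager ran in a separate pod behind a Service instead of sharing the pod (not the layout used in this article), the target would be the Service DNS name rather than localhost; the service name below is purely illustrative:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager.kube-system.svc:9093"]   # hypothetical Service name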

Configuring notification receivers in Alertmanager

Here we use email notifications as an example: when Alertmanager receives an alert from Prometheus, it sends an alert email to the specified mailbox. This configuration is also mounted into the Alertmanager container as a ConfigMap:
alertmanager.yml: |-
  global:
    smtp_smarthost: 'smtp.exmail.qq.com:465'
    smtp_from: 'xin.liu@woqutech.com'
    smtp_auth_username: 'xin.liu@woqutech.com'
    smtp_auth_password: 'xxxxxxxxxxxx'
    smtp_require_tls: false
  route:
    group_by: [alertname]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 10m
    receiver: default-receiver
  receivers:
  - name: 'default-receiver'
    email_configs:
    - to: '1148576125@qq.com'
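A sketch of how this ConfigMap could be wired into the alertmanager container; the volume name and mount path are assumptions, and Alertmanager is pointed at the mounted file via --config.file:

# Under the alertmanager container:
args:
- --config.file=/etc/alertmanager/alertmanager.yml
volumeMounts:
- name: alertmanager-config            # assumed volume name
  mountPath: /etc/alertmanager
# Under the pod spec:
volumes:
- name: alertmanager-config
  configMap:
    name: alertmanager                 # ConfigMap defined in alertmanager-cm.yaml below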

Prototype in action

The configured alert rules are visible in the Prometheus web UI.


To test the effect, shut down one of the host nodes.
In the Prometheus web UI you can see that an InstanceDown alert has been triggered.


In the Alertmanager web UI you can see that Alertmanager has received the alert sent by Prometheus.


The mailbox configured as the receiver gets the alert email sent by Alertmanager.


Full configuration

node_exporter_daemonset.yaml

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    app: node_exporter
spec:
  selector:
    matchLabels:
      name: node_exporter
  template:
    metadata:
      labels:
        name: node_exporter
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: alery/node-exporter:1.0
        ports:
        - name: node-exporter
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: localtime
          mountPath: /etc/localtime
        - name: host
          mountPath: /host
          readOnly: true
      volumes:
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai
      - name: host
        hostPath:
          path: /

alertmanager-cm.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  alertmanager.yml: |-
    global:
      smtp_smarthost: 'smtp.exmail.qq.com:465'
      smtp_from: 'xin.liu@woqutech.com'
      smtp_auth_username: 'xin.liu@woqutech.com'
      smtp_auth_password: 'xxxxxxxxxxxx'
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: '1148576125@qq.com'

prometheus-rbac.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: kube-system
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system

prometheus-cm.yaml

kind: ConfigMap
apiVersion: v1
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["localhost:9093"]
    scrape_configs:
    - job_name: 'node'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      # Scrape node_exporter on port 9100 at the pod IP
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        replacement: $1:9100
      # Use the host IP as the instance label
      - source_labels: [__meta_kubernetes_pod_host_ip]
        action: replace
        target_label: instance
      # Record the node name as a node_name label
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node_name
      # Copy the pod's "name" label onto the scraped series
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(name)
      # Keep only pods labeled name=node_exporter
      - source_labels: [__meta_kubernetes_pod_label_name]
        regex: node_exporter
        action: keep
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
metadata:
  name: prometheus-config-v0.1.0
  namespace: kube-system

prometheus.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: kube-system
  name: prometheus
  labels:
    name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
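The Deployment manifest above ends before its container section. Pulling the earlier sketches together, the remainder of the pod spec might look roughly like this; image tags, volume names, and mount paths are assumptions rather than the article's original values:

      containers:
      - name: prometheus
        image: prom/prometheus:v2.3.2                    # assumed image tag
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config                        # assumed volume name
          mountPath: /etc/prometheus
      - name: alertmanager
        image: prom/alertmanager:v0.15.2                 # assumed image tag
        args:
        - --config.file=/etc/alertmanager/alertmanager.yml
        ports:
        - containerPort: 9093
        volumeMounts:
        - name: alertmanager-config                      # assumed volume name
          mountPath: /etc/alertmanager
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config-v0.1.0                 # ConfigMap from prometheus-cm.yaml
      - name: alertmanager-config
        configMap:
          name: alertmanager                             # ConfigMap from alertmanager-cm.yaml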
This article was reposted from SegmentFault: Using Prometheus + Alertmanager for Monitoring and Alerting in Kubernetes
Original link: https://yq.aliyun.com/articles/680013