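Practical Guide | How to Configure Custom Alerting Rules with Prometheus
Introduction
Prometheus is an open-source system for monitoring and alerting. It was originally developed at SoundCloud; in 2016 it moved to the CNCF and became one of the most popular projects after Kubernetes. It can monitor anything from an entire Linux server to a stand-alone web server, a database service, or a single process. In Prometheus terminology, the things it monitors are called targets, and each unit measured on a target is called a metric. At a configured interval it scrapes the targets over HTTP to collect metrics and stores the data in its time series database. You can then query a target's metrics with the PromQL query language.
In this article, we will show step by step how to: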
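- Install Prometheus (using the prometheus-operator Helm chart) to monitor and alert on custom events
- Create and configure a custom alerting rule that fires when its condition is met
- Integrate Alertmanager to handle the alerts sent by client applications (in this case, the Prometheus server)
- Integrate Alertmanager with an email account that sends the alert notifications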
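Understanding Prometheus and Its Abstractions
The diagram below shows all the components that make up the Prometheus ecosystem:
Here is a quick rundown of the terms relevant to this article: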
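- Prometheus Server: the main component, which scrapes metrics and stores them in the time series database
- Scraping: the pull method used to fetch metrics. It typically runs at an interval of 10 to 60 seconds.
- Target: the server or client from which data is retrieved
- Service discovery: lets Prometheus identify the applications it needs to monitor and pull metrics from them in a dynamic environment
- Alert Manager: the component responsible for handling alerts (including silencing, inhibition, and aggregation of alerts, and sending notifications via email, PagerDuty, Slack, and so on)
- Data visualization: the scraped data is kept in local storage and can be queried directly with PromQL or viewed through Grafana dashboards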
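To make "scraping a target" more concrete, the sketch below imitates the plain-text body a target serves to Prometheus and prints each sample in it. The metric name and values are invented for illustration, and `sample_scrape` is a hypothetical stand-in for a real HTTP request to a target.

```shell
#!/bin/sh
# Stand-in for the body a target would return when scraped.
# The metric and its values are made up for illustration.
sample_scrape() {
  cat <<'EOF'
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3
EOF
}

# Print each sample as "<series> = <value>", skipping the comment lines.
sample_scrape | awk '!/^#/ && NF == 2 { print $1 " = " $2 }'
```

Each non-comment line is one time series (a metric name plus its labels) with its current value, which is what Prometheus stores on every scrape.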
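Understanding Prometheus Operator
According to CoreOS, the owner of the Prometheus Operator project, the Prometheus Operator provides Kubernetes-native configuration and can manage and operate Prometheus and Alertmanager clusters.
The Operator introduces the following Kubernetes custom resource definitions (CRDs): Prometheus, ServiceMonitor, PrometheusRule, and Alertmanager. To learn more, see: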
https://github.com/coreos/prometheus-operator/blob/master/Documentation/design.md
In our demo, we will use PrometheusRule to define custom rules.
First, we need to install the Prometheus Operator using the stable/prometheus-operator Helm chart, available at:
https://github.com/helm/charts/tree/master/stable/prometheus-operator
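The default installation deploys the following components: prometheus-operator, prometheus, alertmanager, node-exporter, kube-state-metrics, and grafana. By default, Prometheus scrapes the main Kubernetes components: kube-apiserver, kube-controller-manager, and etcd.
Installing Prometheus
Prerequisites
To follow this demo, you need the following: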
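- A Google Cloud Platform account (the free tier is enough). Any other cloud works too
- Rancher v2.3.5 (the latest version at the time of writing)
- A Kubernetes cluster running on GKE (version 1.15.9-gke.12); EKS or AKS also works
- The Helm binary installed on your computer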
Starting a Rancher Instance
Just follow this intuitive getting-started guide:
https://rancher.com/quick-start
Deploying a GKE Cluster with Rancher
Use Rancher to set up and configure your Kubernetes cluster:
https://rancher.com/docs/rancher/v2.x/en/cluster-provisioning/hosted-kubernetes-clusters/gke/
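Once the cluster is deployed and the kubeconfig file is configured with the appropriate credentials and endpoint information, you can point kubectl at that cluster.
Deploying Prometheus
First, check which Helm version we are running: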
$ helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.13.8"}
Because we are using Helm 3, we need to add the stable chart repository, which is not configured by default.
$ helm repo add stable https://kubernetes-charts.storage.googleapis.com
"stable" has been added to your repositories
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "stable" chart repository
Update Complete. ⎈ Happy Helming!⎈
$ helm repo list
NAME URL
stable https://kubernetes-charts.storage.googleapis.com
With Helm configured, we can install prometheus-operator:
$ kubectl create namespace monitoring
namespace/monitoring created
$ helm install --namespace monitoring demo stable/prometheus-operator
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
manifest_sorter.go:192: info: skipping unknown hook: "crd-install"
NAME: demo
LAST DEPLOYED: Sat Mar 14 09:40:35 2020
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
The Prometheus Operator has been installed. Check its status by running:
kubectl --namespace monitoring get pods -l "release=demo"
Visit https://github.com/coreos/prometheus-operator for instructions on how
to create & configure Alertmanager and Prometheus instances using the Operator.
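Rules
Besides monitoring, Prometheus lets us create rules that trigger alerts. These rules are based on the Prometheus expression language. Whenever a rule's condition is met, an alert is triggered and sent to Alertmanager. We will see what a rule looks like later on.
Back to the demo. Once Helm has finished the deployment, we can check which pods were created: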
$ kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-demo-prometheus-operator-alertmanager-0 2/2 Running 0 61s
demo-grafana-5576fbf669-9l57b 3/3 Running 0 72s
demo-kube-state-metrics-67bf64b7f4-4786k 1/1 Running 0 72s
demo-prometheus-node-exporter-ll8zx 1/1 Running 0 72s
demo-prometheus-node-exporter-nqnr6 1/1 Running 0 72s
demo-prometheus-node-exporter-sdndf 1/1 Running 0 72s
demo-prometheus-operator-operator-b9c9b5457-db9dj 2/2 Running 0 72s
prometheus-demo-prometheus-operator-prometheus-0 3/3 Running 1 50s
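To reach Prometheus and Alertmanager from a web browser, we need port forwarding.
Because this example uses a GCP instance and all kubectl commands run from that instance, we access the resources through the instance's external IP address.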
$ kubectl port-forward --address 0.0.0.0 -n monitoring prometheus-demo-prometheus-operator-prometheus-0 9090 >/dev/null 2>&1 &
$ kubectl port-forward --address 0.0.0.0 -n monitoring alertmanager-demo-prometheus-operator-alertmanager-0 9093 >/dev/null 2>&1 &
The "Alerts" tab shows all the alerts that are currently running or configured. You can also check them from the CLI by querying the CRD named prometheusrules:
$ kubectl -n monitoring get prometheusrules
NAME AGE
demo-prometheus-operator-alertmanager.rules 3m21s
demo-prometheus-operator-etcd 3m21s
demo-prometheus-operator-general.rules 3m21s
demo-prometheus-operator-k8s.rules 3m21s
demo-prometheus-operator-kube-apiserver-error 3m21s
demo-prometheus-operator-kube-apiserver.rules 3m21s
demo-prometheus-operator-kube-prometheus-node-recording.rules 3m21s
demo-prometheus-operator-kube-scheduler.rules 3m21s
demo-prometheus-operator-kubernetes-absent 3m21s
demo-prometheus-operator-kubernetes-apps 3m21s
demo-prometheus-operator-kubernetes-resources 3m21s
demo-prometheus-operator-kubernetes-storage 3m21s
demo-prometheus-operator-kubernetes-system 3m21s
demo-prometheus-operator-kubernetes-system-apiserver 3m21s
demo-prometheus-operator-kubernetes-system-controller-manager 3m21s
demo-prometheus-operator-kubernetes-system-kubelet 3m21s
demo-prometheus-operator-kubernetes-system-scheduler 3m21s
demo-prometheus-operator-node-exporter 3m21s
demo-prometheus-operator-node-exporter.rules 3m21s
demo-prometheus-operator-node-network 3m21s
demo-prometheus-operator-node-time 3m21s
demo-prometheus-operator-node.rules 3m21s
demo-prometheus-operator-prometheus 3m21s
demo-prometheus-operator-prometheus-operator 3m21s
We can also inspect the physical rule files inside the prometheus container of the Prometheus pod.
$ kubectl -n monitoring exec -it prometheus-demo-prometheus-operator-prometheus-0 -- /bin/sh
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-demo-prometheus-operator-prometheus-0 -n monitoring' to see all of the containers in this pod.
Inside the container, we can check the path where the rules are stored:
/prometheus $ ls /etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0/
monitoring-demo-prometheus-operator-alertmanager.rules.yaml monitoring-demo-prometheus-operator-kubernetes-system-apiserver.yaml
monitoring-demo-prometheus-operator-etcd.yaml monitoring-demo-prometheus-operator-kubernetes-system-controller-manager.yaml
monitoring-demo-prometheus-operator-general.rules.yaml monitoring-demo-prometheus-operator-kubernetes-system-kubelet.yaml
monitoring-demo-prometheus-operator-k8s.rules.yaml monitoring-demo-prometheus-operator-kubernetes-system-scheduler.yaml
monitoring-demo-prometheus-operator-kube-apiserver-error.yaml monitoring-demo-prometheus-operator-kubernetes-system.yaml
monitoring-demo-prometheus-operator-kube-apiserver.rules.yaml monitoring-demo-prometheus-operator-node-exporter.rules.yaml
monitoring-demo-prometheus-operator-kube-prometheus-node-recording.rules.yaml monitoring-demo-prometheus-operator-node-exporter.yaml
monitoring-demo-prometheus-operator-kube-scheduler.rules.yaml monitoring-demo-prometheus-operator-node-network.yaml
monitoring-demo-prometheus-operator-kubernetes-absent.yaml monitoring-demo-prometheus-operator-node-time.yaml
monitoring-demo-prometheus-operator-kubernetes-apps.yaml monitoring-demo-prometheus-operator-node.rules.yaml
monitoring-demo-prometheus-operator-kubernetes-resources.yaml monitoring-demo-prometheus-operator-prometheus-operator.yaml
monitoring-demo-prometheus-operator-kubernetes-storage.yaml monitoring-demo-prometheus-operator-prometheus.yaml
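To see in detail how these rules are loaded into Prometheus, inspect the pod details. We can see that the configuration file used by the prometheus container is /etc/prometheus/config_out/prometheus.env.yaml. That configuration file shows where the rule files are located and how often the YAML is re-checked.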
$ kubectl -n monitoring describe pod prometheus-demo-prometheus-operator-prometheus-0
The full command output is:
Name: prometheus-demo-prometheus-operator-prometheus-0
Namespace: monitoring
Priority: 0
Node: gke-c-7dkls-default-0-c6ca178a-gmcq/10.132.0.15
Start Time: Wed, 11 Mar 2020 18:06:47 +0000
Labels: app=prometheus
controller-revision-hash=prometheus-demo-prometheus-operator-prometheus-5ccbbd8578
prometheus=demo-prometheus-operator-prometheus
statefulset.kubernetes.io/pod-name=prometheus-demo-prometheus-operator-prometheus-0
Annotations: <none>
Status: Running
IP: 10.40.0.7
IPs: <none>
Controlled By: StatefulSet/prometheus-demo-prometheus-operator-prometheus
Containers:
prometheus:
Container ID: docker://360db8a9f1cce8d72edd81fcdf8c03fe75992e6c2c59198b89807aa0ce03454c
Image: quay.io/prometheus/prometheus:v2.15.2
Image ID: docker-pullable://quay.io/prometheus/prometheus@sha256:914525123cf76a15a6aaeac069fcb445ce8fb125113d1bc5b15854bc1e8b6353
Port: 9090/TCP
Host Port: 0/TCP
Args:
--web.console.templates=/etc/prometheus/consoles
--web.console.libraries=/etc/prometheus/console_libraries
--config.file=/etc/prometheus/config_out/prometheus.env.yaml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention.time=10d
--web.enable-lifecycle
--storage.tsdb.no-lockfile
--web.external-url=http://demo-prometheus-operator-prometheus.monitoring:9090
--web.route-prefix=/
State: Running
Started: Wed, 11 Mar 2020 18:07:07 +0000
Last State: Terminated
Reason: Error
Message: caller=main.go:648 msg="Starting TSDB ..."
level=info ts=2020-03-11T18:07:02.185Z caller=web.go:506 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-11T18:07:02.192Z caller=head.go:584 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-03-11T18:07:02.192Z caller=head.go:632 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:663 fs_type=EXT4_SUPER_MAGIC
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:664 msg="TSDB started"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:734 msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:517 msg="Stopping scrape discovery manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:531 msg="Stopping notify discovery manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:553 msg="Stopping scrape manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=manager.go:814 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-03-11T18:07:02.194Z caller=manager.go:820 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:513 msg="Scrape discovery manager stopped"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:527 msg="Notify discovery manager stopped"
level=info ts=2020-03-11T18:07:02.194Z caller=main.go:547 msg="Scrape manager stopped"
level=info ts=2020-03-11T18:07:02.197Z caller=notifier.go:598 component=notifier msg="Stopping notification manager..."
level=info ts=2020-03-11T18:07:02.197Z caller=main.go:718 msg="Notifier manager stopped"
level=error ts=2020-03-11T18:07:02.197Z caller=main.go:727 err="error loading config from \"/etc/prometheus/config_out/prometheus.env.yaml\": couldn't load configuration (--config.file=\"/etc/prometheus/config_out/prometheus.env.yaml\"): open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory"
Exit Code: 1
Started: Wed, 11 Mar 2020 18:07:02 +0000
Finished: Wed, 11 Mar 2020 18:07:02 +0000
Ready: True
Restart Count: 1
Liveness: http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
Readiness: http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120
Environment: <none>
Mounts:
/etc/prometheus/certs from tls-assets (ro)
/etc/prometheus/config_out from config-out (ro)
/etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0 from prometheus-demo-prometheus-operator-prometheus-rulefiles-0 (rw)
/prometheus from prometheus-demo-prometheus-operator-prometheus-db (rw)
/var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
prometheus-config-reloader:
Container ID: docker://de27cdad7067ebd5154c61b918401b2544299c161850daf3e317311d2d17af3d
Image: quay.io/coreos/prometheus-config-reloader:v0.37.0
Image ID: docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:5e870e7a99d55a5ccf086063efd3263445a63732bc4c04b05cf8b664f4d0246e
Port: <none>
Host Port: <none>
Command:
/bin/prometheus-config-reloader
Args:
--log-format=logfmt
--reload-url=http://127.0.0.1:9090/-/reload
--config-file=/etc/prometheus/config/prometheus.yaml.gz
--config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
State: Running
Started: Wed, 11 Mar 2020 18:07:04 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 25Mi
Requests:
cpu: 100m
memory: 25Mi
Environment:
POD_NAME: prometheus-demo-prometheus-operator-prometheus-0 (v1:metadata.name)
Mounts:
/etc/prometheus/config from config (rw)
/etc/prometheus/config_out from config-out (rw)
/var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
rules-configmap-reloader:
Container ID: docker://5804e45380ed1b5374a4c2c9ee4c9c4e365bee93b9ccd8b5a21f50886ea81a91
Image: quay.io/coreos/configmap-reload:v0.0.1
Image ID: docker-pullable://quay.io/coreos/configmap-reload@sha256:e2fd60ff0ae4500a75b80ebaa30e0e7deba9ad107833e8ca53f0047c42c5a057
Port: <none>
Host Port: <none>
Args:
--webhook-url=http://127.0.0.1:9090/-/reload
--volume-dir=/etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0
State: Running
Started: Wed, 11 Mar 2020 18:07:06 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 25Mi
Requests:
cpu: 100m
memory: 25Mi
Environment: <none>
Mounts:
/etc/prometheus/rules/prometheus-demo-prometheus-operator-prometheus-rulefiles-0 from prometheus-demo-prometheus-operator-prometheus-rulefiles-0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from demo-prometheus-operator-prometheus-token-jvbrr (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
config:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-demo-prometheus-operator-prometheus
Optional: false
tls-assets:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-demo-prometheus-operator-prometheus-tls-assets
Optional: false
config-out:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
prometheus-demo-prometheus-operator-prometheus-rulefiles-0:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: prometheus-demo-prometheus-operator-prometheus-rulefiles-0
Optional: false
prometheus-demo-prometheus-operator-prometheus-db:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
demo-prometheus-operator-prometheus-token-jvbrr:
Type: Secret (a volume populated by a Secret)
SecretName: demo-prometheus-operator-prometheus-token-jvbrr
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m51s default-scheduler Successfully assigned monitoring/prometheus-demo-prometheus-operator-prometheus-0 to gke-c-7dkls-default-0-c6ca178a-gmcq
Normal Pulling 4m45s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Pulling image "quay.io/prometheus/prometheus:v2.15.2"
Normal Pulled 4m39s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Successfully pulled image "quay.io/prometheus/prometheus:v2.15.2"
Normal Pulling 4m36s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Pulling image "quay.io/coreos/prometheus-config-reloader:v0.37.0"
Normal Pulled 4m35s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Successfully pulled image "quay.io/coreos/prometheus-config-reloader:v0.37.0"
Normal Pulling 4m34s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Pulling image "quay.io/coreos/configmap-reload:v0.0.1"
Normal Started 4m34s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Started container prometheus-config-reloader
Normal Created 4m34s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Created container prometheus-config-reloader
Normal Pulled 4m33s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Successfully pulled image "quay.io/coreos/configmap-reload:v0.0.1"
Normal Created 4m32s (x2 over 4m36s) kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Created container prometheus
Normal Created 4m32s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Created container rules-configmap-reloader
Normal Started 4m32s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Started container rules-configmap-reloader
Normal Pulled 4m32s kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Container image "quay.io/prometheus/prometheus:v2.15.2" already present on machine
Normal Started 4m31s (x2 over 4m36s) kubelet, gke-c-7dkls-default-0-c6ca178a-gmcq Started container prometheus
Let's clean up the default rules so that we can better observe the one we are about to create. The following command deletes all the rules except demo-prometheus-operator-alertmanager.rules.
$ kubectl -n monitoring delete prometheusrules $(kubectl -n monitoring get prometheusrules | grep -v alert)
$ kubectl -n monitoring get prometheusrules
NAME AGE
demo-prometheus-operator-alertmanager.rules 8m53s
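The inner `get | grep -v alert` in the command above also lets through the `NAME`/`AGE` header and the age column, so kubectl is handed those strings as if they were rule names. As a hedged alternative, the sketch below keeps only the first column and drops the header. `rules_to_delete` is a hypothetical helper name, and the listing fed to it is a shortened copy of the output shown earlier.

```shell
#!/bin/sh
# rules_to_delete: read `kubectl get prometheusrules` output on stdin and
# print only the names of rules that do not contain "alert".
rules_to_delete() {
  awk 'NR > 1 && $1 !~ /alert/ { print $1 }'
}

# Self-contained run on a shortened sample listing.
printf '%s\n' \
  'NAME AGE' \
  'demo-prometheus-operator-alertmanager.rules 3m21s' \
  'demo-prometheus-operator-etcd 3m21s' \
  'demo-prometheus-operator-general.rules 3m21s' \
  | rules_to_delete

# Against the cluster (not run here), the names could be piped on:
#   kubectl -n monitoring get prometheusrules | rules_to_delete \
#     | xargs kubectl -n monitoring delete prometheusrules
```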
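Note: we keep only one rule to make the demo easier to follow. There is one rule, however, that you should never delete in a real setup. It lives in monitoring-demo-prometheus-operator-general.rules.yaml and is called Watchdog. This alert is always firing, and its purpose is to confirm that the entire alerting pipeline is working.
Let's check the remaining rule from the CLI and compare it with what we see in the browser.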
$ kubectl -n monitoring describe prometheusrule demo-prometheus-operator-alertmanager.rules
Name: demo-prometheus-operator-alertmanager.rules
Namespace: monitoring
Labels: app=prometheus-operator
chart=prometheus-operator-8.12.1
heritage=Tiller
release=demo
Annotations: prometheus-operator-validated: true
API Version: monitoring.coreos.com/v1
Kind: PrometheusRule
Metadata:
Creation Timestamp: 2020-03-11T18:06:25Z
Generation: 1
Resource Version: 4871
Self Link: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/demo-prometheus-operator-alertmanager.rules
UID: 6a84dbb0-feba-4f17-b3dc-4b6486818bc0
Spec:
Groups:
Name: alertmanager.rules
Rules:
Alert: AlertmanagerConfigInconsistent
Annotations:
Message: The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync.
Expr: count_values("config_hash", alertmanager_config_hash{job="demo-prometheus-operator-alertmanager",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="demo-prometheus-operator-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "$1", "name", "(.*)") != 1
For: 5m
Labels:
Severity: critical
Alert: AlertmanagerFailedReload
Annotations:
Message: Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}.
Expr: alertmanager_config_last_reload_successful{job="demo-prometheus-operator-alertmanager",namespace="monitoring"} == 0
For: 10m
Labels:
Severity: warning
Alert: AlertmanagerMembersInconsistent
Annotations:
Message: Alertmanager has not found all other members of the cluster.
Expr: alertmanager_cluster_members{job="demo-prometheus-operator-alertmanager",namespace="monitoring"}
!= on (service) GROUP_LEFT()
count by (service) (alertmanager_cluster_members{job="demo-prometheus-operator-alertmanager",namespace="monitoring"})
For: 5m
Labels:
Severity: critical
Events: <none>
Let's remove all the default alerts and create one of our own:
$ kubectl -n monitoring edit prometheusrules demo-prometheus-operator-alertmanager.rules
prometheusrule.monitoring.coreos.com/demo-prometheus-operator-alertmanager.rules edited
Our custom alert looks like this:
$ kubectl -n monitoring describe prometheusrule demo-prometheus-operator-alertmanager.rules
Name: demo-prometheus-operator-alertmanager.rules
Namespace: monitoring
Labels: app=prometheus-operator
chart=prometheus-operator-8.12.1
heritage=Tiller
release=demo
Annotations: prometheus-operator-validated: true
API Version: monitoring.coreos.com/v1
Kind: PrometheusRule
Metadata:
Creation Timestamp: 2020-03-11T18:06:25Z
Generation: 3
Resource Version: 18180
Self Link: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/demo-prometheus-operator-alertmanager.rules
UID: 6a84dbb0-feba-4f17-b3dc-4b6486818bc0
Spec:
Groups:
Name: alertmanager.rules
Rules:
Alert: PodHighCpuLoad
Annotations:
Message: Alertmanager has found {{ $labels.instance }} with CPU too high
Expr: rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m]) > 0.04
For: 1m
Labels:
Severity: critical
Events: <none>
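These are the options of the alert we created:
- annotations: a set of informational labels that describe the alert.
- expr: the expression, written in PromQL.
- for: optional. When set, Prometheus checks that the alert condition stays active for the whole duration you define, and fires the alert only after that time has passed.
- labels: extra labels to attach to the alert. To learn more about alerting rules, see: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/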
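For reference, here is a sketch of the same rule written out as a PrometheusRule manifest rather than edited in place. `kubectl describe` capitalizes the field names, whereas the manifest uses the lowercase forms listed above. The file name `pod-high-cpu-rule.yaml` is an assumption, and the metadata labels are copied from the describe output above. Also note that the `pod_name` label matches the Kubernetes 1.15 cluster used here; from Kubernetes 1.16 on, the cAdvisor label is `pod`.

```shell
#!/bin/sh
# Write the custom rule as a standalone PrometheusRule manifest.
# The quoted 'EOF' stops the shell from expanding $labels.
# The file name is hypothetical; pick any name you like.
cat > pod-high-cpu-rule.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: demo-prometheus-operator-alertmanager.rules
  namespace: monitoring
  labels:
    app: prometheus-operator
    release: demo
spec:
  groups:
  - name: alertmanager.rules
    rules:
    - alert: PodHighCpuLoad
      annotations:
        message: 'Alertmanager has found {{ $labels.instance }} with CPU too high'
      expr: 'rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m]) > 0.04'
      for: 1m
      labels:
        severity: critical
EOF

# Against the cluster (not run here):
#   kubectl apply -f pod-high-cpu-rule.yaml
```

Keeping the rule in a file like this makes it easier to review and version than an interactive `kubectl edit`.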
Now that the Prometheus alert is set up, let's configure Alertmanager so that we receive alert notifications by email. The Alertmanager configuration lives in a Kubernetes Secret object.
$ kubectl get secrets -n monitoring
NAME TYPE DATA AGE
alertmanager-demo-prometheus-operator-alertmanager Opaque 1 32m
default-token-x4rgq kubernetes.io/service-account-token 3 37m
demo-grafana Opaque 3 32m
demo-grafana-test-token-p6qnk kubernetes.io/service-account-token 3 32m
demo-grafana-token-ff6nl kubernetes.io/service-account-token 3 32m
demo-kube-state-metrics-token-vmvbr kubernetes.io/service-account-token 3 32m
demo-prometheus-node-exporter-token-wlnk9 kubernetes.io/service-account-token 3 32m
demo-prometheus-operator-admission Opaque 3 32m
demo-prometheus-operator-alertmanager-token-rrx4k kubernetes.io/service-account-token 3 32m
demo-prometheus-operator-operator-token-q9744 kubernetes.io/service-account-token 3 32m
demo-prometheus-operator-prometheus-token-jvbrr kubernetes.io/service-account-token 3 32m
prometheus-demo-prometheus-operator-prometheus Opaque 1 31m
prometheus-demo-prometheus-operator-prometheus-tls-assets Opaque 0 31m
We are only interested in alertmanager-demo-prometheus-operator-alertmanager. Let's take a look at it:
$ kubectl -n monitoring get secret alertmanager-demo-prometheus-operator-alertmanager -o yaml
apiVersion: v1
data:
  alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==
kind: Secret
metadata:
  creationTimestamp: "2020-03-11T18:06:24Z"
  labels:
    app: prometheus-operator-alertmanager
    chart: prometheus-operator-8.12.1
    heritage: Tiller
    release: demo
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
  resourceVersion: "3018"
  selfLink: /api/v1/namespaces/monitoring/secrets/alertmanager-demo-prometheus-operator-alertmanager
  uid: 6baf6883-f690-47a1-bb49-491935956c22
type: Opaque
The alertmanager.yaml field is base64-encoded. Let's decode it:
$ echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6ICJudWxsIgpyb3V0ZToKICBncm91cF9ieToKICAtIGpvYgogIGdyb3VwX2ludGVydmFsOiA1bQogIGdyb3VwX3dhaXQ6IDMwcwogIHJlY2VpdmVyOiAibnVsbCIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogIm51bGwiCg==' | base64 --decode
global:
  resolve_timeout: 5m
receivers:
- name: "null"
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: "null"
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: "null"
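Copying the base64 string by hand is error-prone, so the decode step can also be scripted. The sketch below pulls the `alertmanager.yaml` field out of a Secret manifest and decodes it. `decode_alertmanager_config` is a hypothetical helper name, and the manifest fed to it here is a tiny stand-in rather than the real Secret.

```shell
#!/bin/sh
# decode_alertmanager_config: read a Secret manifest (YAML) on stdin and
# print the decoded alertmanager.yaml payload.
decode_alertmanager_config() {
  awk '$1 == "alertmanager.yaml:" { print $2 }' | base64 --decode
}

# Self-contained run with a stand-in manifest (GNU base64 assumed for -w0).
sample_payload=$(printf 'global:\n  resolve_timeout: 5m\n' | base64 -w0)
printf 'apiVersion: v1\ndata:\n  alertmanager.yaml: %s\nkind: Secret\n' "$sample_payload" \
  | decode_alertmanager_config

# Against the cluster (not run here):
#   kubectl -n monitoring get secret \
#     alertmanager-demo-prometheus-operator-alertmanager -o yaml \
#     | decode_alertmanager_config
```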
As we can see, this is the default Alertmanager configuration. You can also view it in the Status tab of the Alertmanager UI. Next, let's change it so that, in this example, it sends email:
$ cat alertmanager.yaml
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  # Send all notifications to me.
  receiver: demo-alert
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
  - match:
      alertname: DemoAlertName
    receiver: 'demo-alert'
receivers:
- name: demo-alert
  email_configs:
  - to: your_email@gmail.com
    from: from_email@gmail.com
    # Your smtp server address
    smarthost: smtp.gmail.com:587
    auth_username: from_email@gmail.com
    auth_identity: from_email@gmail.com
    auth_password: 16letter_generated token # you can use gmail account password, but better create a dedicated token for this
    headers:
      From: from_email@gmail.com
      Subject: 'Demo ALERT'
First, we need to encode it:
$ cat alertmanager.yaml | base64 -w0
Once we have the encoded output, we paste it into the YAML file that we are going to apply:
$ cat alertmanager-secret-k8s.yaml
apiVersion: v1
data:
  alertmanager.yaml: <paste the base64-encoded content of alertmanager.yaml here>
kind: Secret
metadata:
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
type: Opaque
$ kubectl apply -f alertmanager-secret-k8s.yaml
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
secret/alertmanager-demo-prometheus-operator-alertmanager configured
The configuration reloads automatically, and the change shows up in the UI.
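The encode-and-paste step can be scripted as well. This is a minimal sketch, assuming GNU `base64` (for `-w0`). `build_alertmanager_secret` is a hypothetical helper that prints the Secret manifest for a given config file, and the run below uses a throwaway placeholder config rather than the real one.

```shell
#!/bin/sh
# build_alertmanager_secret CONFIG_FILE: print a Secret manifest whose
# alertmanager.yaml key holds the base64-encoded contents of CONFIG_FILE.
build_alertmanager_secret() {
  encoded=$(base64 -w0 < "$1")
  cat <<EOF
apiVersion: v1
data:
  alertmanager.yaml: ${encoded}
kind: Secret
metadata:
  name: alertmanager-demo-prometheus-operator-alertmanager
  namespace: monitoring
type: Opaque
EOF
}

# Example run against a throwaway config file (placeholder content).
tmp_config=$(mktemp)
printf 'global:\n  resolve_timeout: 5m\n' > "$tmp_config"
build_alertmanager_secret "$tmp_config"
rm -f "$tmp_config"

# With the real file (not run here):
#   build_alertmanager_secret alertmanager.yaml > alertmanager-secret-k8s.yaml
#   kubectl apply -f alertmanager-secret-k8s.yaml
```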
Next, let's deploy something to monitor. For this example, a simple nginx deployment is enough:
$ cat nginx-deployment.yaml
apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 3 # tells deployment to run 3 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
$ kubectl apply -f nginx-deployment.yaml
deployment.apps/nginx-deployment created
As specified in the YAML, we have 3 replicas:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-5754944d6c-7g6gq 1/1 Running 0 67s
nginx-deployment-5754944d6c-lhvx8 1/1 Running 0 67s
nginx-deployment-5754944d6c-whhtr 1/1 Running 0 67s
In the Prometheus UI, use the same expression we configured for the alert:
rate (container_cpu_usage_seconds_total{pod_name=~"nginx-.*", image!="", container!="POD"}[5m])
This lets us check the data for these pods. The value for every pod should be close to 0.
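To get a feel for what the 0.04 threshold means, here is a simplified sketch of the arithmetic behind the expression: the increase of a counter divided by the elapsed seconds. The sample values are made up, and `per_second_rate` is a hypothetical helper. The real `rate()` function also handles counter resets and extrapolates to the window boundaries, so treat this as an approximation only.

```shell
#!/bin/sh
# per_second_rate V1 T1 V2 T2: average per-second increase of a counter
# between two samples (value V1 at time T1, value V2 at time T2).
per_second_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" \
    'BEGIN { printf "%.3f\n", (v2 - v1) / (t2 - t1) }'
}

# A container that used 15 CPU-seconds over a 5-minute (300 s) window:
per_second_rate 120.0 0 135.0 300   # prints 0.050
```

Because container_cpu_usage_seconds_total counts CPU seconds, a rate of 0.05 corresponds to roughly 5% of one core, which would be above the 0.04 threshold.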
Let's add some load to one of the pods and watch the value change. Once it goes above 0.04, we should receive an alert:
$ kubectl exec -it nginx-deployment-5754944d6c-7g6gq -- /bin/sh
# yes > /dev/null
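The `yes` loop above runs until it is interrupted. If you would rather have the load stop on its own, a bounded variant could look like the sketch below. It assumes GNU coreutils `timeout` is available in the container image, and `burn_cpu` is a hypothetical helper name.

```shell
#!/bin/sh
# burn_cpu SECONDS: keep one CPU core busy for SECONDS, then stop.
burn_cpu() {
  status=0
  timeout "$1" yes > /dev/null || status=$?
  # timeout exits with 124 when it had to stop the command,
  # which is the expected outcome here.
  [ "$status" -eq 124 ]
}

# Short smoke test. For the demo, use a duration longer than the rule's
# 1m "for" period plus some warm-up, for example: burn_cpu 300
burn_cpu 2 && echo "load finished"
```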
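The alert goes through 3 states:
- Inactive: the alert condition is not met
- Pending: the condition is met, but not yet for the full "for" duration
- Firing: the alert has been triggered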
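The state transitions can be sketched with a small simulation. It assumes the rule is evaluated every 30 seconds (an assumed interval, not taken from this setup) and uses the threshold of 0.04 and the 1m "for" period from our rule. `alert_states` is a hypothetical helper, and the sample values are invented.

```shell
#!/bin/sh
# alert_states THRESHOLD FOR_SECONDS INTERVAL_SECONDS SAMPLE...
# Print the alert state at each evaluation, one sample per evaluation.
alert_states() {
  threshold=$1; hold=$2; interval=$3; shift 3
  printf '%s\n' "$@" | awk -v thr="$threshold" -v hold="$hold" -v step="$interval" '
    {
      now = (NR - 1) * step
      if ($1 + 0 > thr + 0) {
        # Condition met: remember when it first became true.
        if (!active) { active = 1; since = now }
        state = (now - since >= hold) ? "firing" : "pending"
      } else {
        # Condition not met: reset to inactive.
        active = 0
        state = "inactive"
      }
      print now "s", $1, state
    }'
}

alert_states 0.04 60 30 0.00 0.05 0.90 0.95 0.01
# prints:
#   0s 0.00 inactive
#   30s 0.05 pending
#   60s 0.90 pending
#   90s 0.95 firing
#   120s 0.01 inactive
```

The alert only moves from pending to firing once the condition has held for the whole "for" period, and it drops back to inactive as soon as the condition stops being true.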
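We have already seen the alert in the inactive state, so let's put some load on the CPU to observe the remaining two states:
Once the alert fires, it shows up in Alertmanager:
Alertmanager is configured to send an email when it receives an alert. So if we check the inbox at this point, we will see something like the following:
Conclusion
We all know how important monitoring is, but without alerting it is incomplete. When a problem occurs, alerts notify us right away, so we learn immediately that something is wrong with the system. Prometheus covers both aspects: it provides the monitoring solution and, through its Alertmanager component, the alerting. In this article, we saw how to define an alert in the Prometheus configuration and how the alert reaches Alertmanager when it fires. Then, based on the Alertmanager definition and integration, we received an email with the details of the triggered alert (it could also have been sent through Slack or PagerDuty).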