Recovering an OpenShift Cluster After Losing Master Nodes
Failure simulation:
Simulate the loss of 2 master nodes that need to be reinstalled by simply powering off the failed nodes. etcd has been backed up beforehand.
The etcd data must be backed up in advance. In an offline environment the image registry.redhat.io/rhel8/support-tools is required; without it, oc debug cannot start.
The backup procedure is as follows:
[root@bastion ~]# oc debug node/master-1.offline.nielasaran.com
sh-4.4# chroot /host
sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
sh-4.4# scp -r /home/core/assets/backup/ root@bastion:
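cluster-backup.sh writes two artifacts into the target directory: an etcd snapshot (snapshot_<timestamp>.db) and a static-pod resources archive (static_kuberesources_<timestamp>.tar.gz). The helper below is only a sketch for sanity-checking the backup before copying it off the node — verify_backup is a hypothetical name, not part of OpenShift:

```shell
# verify_backup DIR - check that cluster-backup.sh produced both expected
# artifacts: snapshot_*.db and static_kuberesources_*.tar.gz.
# Hypothetical helper for illustration; requires bash (uses compgen).
verify_backup() {
  local dir="$1"
  compgen -G "$dir/snapshot_*.db" > /dev/null &&
    compgen -G "$dir/static_kuberesources_*.tar.gz" > /dev/null
}

# On the debug node:
#   verify_backup /home/core/assets/backup && echo "backup looks complete"
```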
1. After powering off the nodes to simulate the failure, oc times out:
[root@bastion ~]# oc get nodes
Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request (get nodes)
2. Copy the etcd backup and the kubeconfig credentials over; master-2 will later need to run oc commands with administrator privileges:
[root@bastion ~]# scp -r backup/ core@master-2:/home/core/
[root@bastion ~]# scp .kube/config core@master-2:
3. Note: if only 1 master node was lost, 2 master nodes are still available, and the following steps must be performed on the non-recovery host. In other words, if you plan to use master-2 as the recovery host, do NOT run these steps on master-2. Per the original documentation, if only the recovery node remains, these steps are optional either way — the restore script will do them for you.
3.1 Log in to the other remaining control plane node:
[root@bastion ~]# ssh core@master-1
3.2 Move the etcd static pod manifest out of the way, then wait 1-2 minutes:
[core@master-1 ~]$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
3.3 Check whether the etcd pod is still running; if it is, keep waiting until it is gone:
[core@master-1 ~]$ sudo crictl ps | grep etcd | grep -v operator
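Instead of re-running the check by hand, the wait can be scripted. A minimal sketch — wait_gone is a hypothetical helper, not part of the tooling — that polls a command until it produces no output:

```shell
# wait_gone CMD [ARGS...] - poll every 10s until CMD produces no output.
# Hypothetical helper for waiting out the static pod shutdown.
wait_gone() {
  while [ -n "$("$@" 2>/dev/null)" ]; do
    sleep 10
  done
}

# On the node (same applies to the kube-apiserver check in step 3.5):
#   wait_gone sh -c 'sudo crictl ps | grep etcd | grep -v operator'
```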
3.4 Move the kube-apiserver static pod manifest out of the way, then wait 1-2 minutes:
[core@master-1 ~]$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
3.5 Verify that the kube-apiserver has stopped:
[core@master-1 ~]$ sudo crictl ps | grep kube-apiserver | grep -v operator
3.6 Move the etcd data directory out of the way:
[core@master-1 ~]$ sudo mv /var/lib/etcd/ /tmp
4. Log in to the recovery master node and run the restore script:
[core@master-2 ~]$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
...stopping kube-apiserver-pod.yaml
...stopping kube-controller-manager-pod.yaml
...stopping kube-scheduler-pod.yaml
...stopping etcd-pod.yaml
Waiting for container etcd to stop
complete
Waiting for container etcdctl to stop
complete
Waiting for container etcd-metrics to stop
complete
Waiting for container kube-controller-manager to stop
.complete
Waiting for container kube-apiserver to stop
complete
Waiting for container kube-scheduler to stop
complete
starting restore-etcd static pod
starting kube-apiserver-pod.yaml
static-pod-resources/kube-apiserver-pod-4/kube-apiserver-pod.yaml
starting kube-controller-manager-pod.yaml
static-pod-resources/kube-controller-manager-pod-6/kube-controller-manager-pod.yaml
starting kube-scheduler-pod.yaml
static-pod-resources/kube-scheduler-pod-5/kube-scheduler-pod.yaml
5. Restart the kubelet service. If only 1 master node failed, restart kubelet on both remaining masters:
[core@master-1 ~]$ sudo systemctl restart kubelet.service
Warning: The unit file, source configuration file or drop-ins of kubelet.service changed on disk. Run 'systemctl daemon-reload' to reload units.
If this warning appears, run sudo systemctl daemon-reload and then restart kubelet again.
6. Verify that the single-member etcd has been restored:
[core@master-2 ~]$ sudo crictl ps | grep etcd | grep -v operator
00f72c57e27b9   621d5c808fe6847daf29bf02a1c47aef440f5ce4a08749cb8c7014a712565b6d   3 minutes ago   Running   etcd   0   0db089710e5a4
7. At this point oc is usable again:
[core@master-2 ~]$ oc get nodes
NAME
8. Check the etcd cluster status. There is currently only one member, so no manual member removal is needed:
$ oc get pods -n openshift-etcd | grep etcd
[core@master-2 ~]$ oc rsh etcd-master-2.offline.nielasaran.com
sh-4.4# etcdctl endpoint status -w table
9. Remove the failed master nodes
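The original gives no command for this step. A minimal sketch using the standard oc delete node call — remove_lost_masters is a hypothetical helper, and the node names in the example are assumptions about which hosts failed in this environment:

```shell
# remove_lost_masters NODE... - delete the node objects of the failed
# masters from the cluster. Hypothetical helper for illustration only;
# run it from a host with admin kubeconfig access.
remove_lost_masters() {
  local n
  for n in "$@"; do
    oc delete node "$n"
  done
}

# Example (assumed failed-node names):
#   remove_lost_masters master-0.offline.nielasaran.com master-1.offline.nielasaran.com
```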
10. Reinstall the master nodes; once they are up, wait for the CSRs:
# oc get csr -o name | xargs oc adm certificate approve
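The one-liner above approves every CSR indiscriminately. A slightly more selective sketch — pending_csrs is a hypothetical helper — filters only the Pending entries from the plain `oc get csr` table, whose CONDITION column comes last:

```shell
# pending_csrs - read `oc get csr` table output on stdin and print the
# names of CSRs whose CONDITION (last column) is "Pending".
# Hypothetical helper for illustration.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# On the bastion (CSRs arrive in waves, so this may need repeating):
#   oc get csr | pending_csrs | xargs -r -n1 oc adm certificate approve
```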
11. Check the cluster status. If anything looks abnormal, wait a while longer; the test environment recovered within 10 minutes, and larger clusters may take longer.
Check the node statuses and confirm they are consistent.
The CLI commands used for the checks in the screenshot above:
[root@bastion ~]# oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'
[root@bastion ~]# oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'
[root@bastion ~]# oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'
[root@bastion ~]# oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'
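The four commands above differ only in the resource name, so they can be folded into a loop. A sketch — check_installer_progress is a hypothetical wrapper around the same oc invocation:

```shell
# check_installer_progress RESOURCE... - print the NodeInstallerProgressing
# reason and message for each static-pod operator resource.
# Hypothetical wrapper around the individual commands above.
check_installer_progress() {
  local res
  for res in "$@"; do
    echo "== $res =="
    oc get "$res" -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}{end}'
  done
}

# Usage:
#   check_installer_progress etcd kubeapiserver kubecontrollermanager kubescheduler
```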







