MySQL InnoDB Cluster: Analysis of Common Failure Scenarios
Author: Zhao Liming, member of the ActionSky MySQL DBA team. Familiar with Oracle and MySQL, he specializes in diagnosing database performance problems and analyzing transaction and locking issues, handles day-to-day MySQL issues for customers as well as for our in-house DMP platform, and has a strong interest in open-source database technologies.
Produced by the ActionSky open-source community. Original content may not be used without authorization; to republish, please contact the editor and credit the source.
This article is about 4,800 words and takes roughly 15 minutes to read.
1. Background
As business requirements for database high availability keep rising, MySQL InnoDB Cluster[1] (hereafter MIC), the official MySQL high-availability solution, has been widely adopted in production. Built on MySQL Group Replication[2] and managed through MySQL Shell[3], it provides automatic failure detection and recovery, data consistency guarantees, and other key capabilities.
In day-to-day operations, however, hardware failures, network problems, configuration mistakes and other factors can still put the cluster into abnormal states. To keep the database service running steadily, MySQL operators need to be able to diagnose and recover cluster failures quickly.
Based on experience from real production environments, this article walks through tests of the common MIC failure scenarios and provides the corresponding recovery procedures, with the goal of helping MySQL operators handle failures more efficiently and keep the business running.
Test environment
- OS: Red Hat Enterprise Linux release 8.10 (Ootpa)
- MySQL: mysql-community-server-8.4.5-1.el8.x86_64
- mysqlshell: mysql-shell-8.4.5-1.el8.x86_64
- mysqlrouter: mysql-router-community-8.4.5-1.el8.x86_64
2. Scenario Examples
Scenario 1: Restarting a non-primary node instance
Fault simulation: restart the secondary node node2
-- node2
systemctl restart mysql
-- node1
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster=dba.getCluster()
cluster.status()
Recovery process
-- Log in with mysqlsh and check the cluster's initial status
[root@node1 ~]# mysqlsh --uri clusteruser@10.186.65.13:3306
MySQL Shell 8.4.5
Copyright (c) 2016, 2025, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.
Type '\help' or '\?' for help; '\quit' to exit.
Creating a Classic session to 'clusteruser@10.186.65.13:3306'
Fetching global names for auto-completion... Press ^C to stop.
Your MySQL connection id is 402320
Server version: 8.4.5 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
MySQL 10.186.65.13:3306 ssl SQL > \js
Switching to JavaScript mode...
MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# All cluster members are "ONLINE"; node1 is the "PRIMARY" node and node2/3 are "SECONDARY" nodes
-- Observe the cluster status while the instance on node2 is restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2013: Could not open connection to 'node2:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104",
"status": "(MISSING)"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# mysqlsh fails to connect to node2 with Error 2013, and the node's status changes to "(MISSING)"
-- Check the cluster status again after the instance on node2 has finished restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# Once the instance on node2 has finished restarting, the cluster recovers on its own; no further action is needed
Summary
Symptoms
- During the restart, node2 is shown as (MISSING)
- The cluster status changes to OK_NO_TOLERANCE_PARTIAL
- A connection error appears: Lost connection to MySQL server
Recovery points
- After the secondary restarts, the cluster detects it automatically and lets it rejoin (membership can also be watched in SQL, as sketched below)
- No manual intervention is needed; the node returns to ONLINE on its own
- The cluster keeps serving traffic while the single node is down
Impact
- Fault tolerance drops to 0 during the restart
- Application reads and writes are unaffected
- Read load balancing is briefly affected
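Membership can also be watched directly in SQL from any surviving node while the restart is in progress. A minimal sketch using the standard performance_schema view (the expectations in the comments reflect what this scenario shows, not additional captured output):
-- Run on node1 or node3 while node2 is restarting
SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;
-- While node2 is down it disappears from the list or shows up as UNREACHABLE;
-- after the restart completes, its member_state returns to ONLINE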
Scenario 2: A non-primary node instance goes down
Fault simulation: stop the secondary node node3, then start it again
-- node3:
systemctl stop mysql
-- node1:
mysqlsh --uri clusteruser@10.186.65.13:3306
var cluster=dba.getCluster()
cluster.status()
-- node3:
systemctl start mysql
Recovery process
-- Initial cluster status
Same as above, omitted...
-- Check the cluster status after the instance on node3 has been stopped
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2013: Could not open connection to 'node3:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104",
"status": "(MISSING)"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# mysqlsh fails to connect to the instance on node3, and the node's status is "(MISSING)"
-- Check the cluster status again after the instance on node3 has been started
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# After the stopped instance on node3 is started again, the cluster recovers on its own; no further action is needed
Summary
Symptoms
- node3 is shown as (MISSING)
- The cluster status is OK_NO_TOLERANCE_PARTIAL
- A connection error appears: Lost connection to MySQL server
Recovery points
- The stopped secondary recovers automatically once it is started again
- The cluster resynchronizes data automatically
- The whole process needs no manual intervention (the settings this relies on are sketched below)
Impact
- The cluster loses fault tolerance while the node is down
- Read capacity is reduced
- Writes on the primary are unaffected
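The automatic recovery in scenarios 1 and 2 assumes Group Replication starts together with the instance and is allowed to rejoin on its own. A quick sanity check of the relevant settings (the variable names are standard server variables; the expected values are assumptions about how such a cluster is configured, not output captured from this environment):
-- Run on any node in SQL mode
SHOW GLOBAL VARIABLES LIKE 'group_replication_start_on_boot';    -- ON lets GR start when mysqld starts
SHOW GLOBAL VARIABLES LIKE 'group_replication_autorejoin_tries'; -- a value > 0 allows automatic rejoin after transient failures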
Scenario 3: Network partition between the primary and the secondaries
Fault simulation: cut off the network between the primary node node1 and the two secondaries
-- node1:
./stop_net.sh
cat stop_net.sh
#!/bin/bash
# Drop all traffic between node1 and the two secondaries (node2 = 10.186.65.18, node3 = 10.186.65.24),
# wait 30 seconds, then flush the rules to restore the network
iptables -A INPUT -s 10.186.65.18 -j DROP
iptables -A OUTPUT -d 10.186.65.18 -j DROP
iptables -A INPUT -s 10.186.65.24 -j DROP
iptables -A OUTPUT -d 10.186.65.24 -j DROP
sleep 30
iptables -F
-- node1:
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster=dba.getCluster()
cluster.status()
Recovery process
-- Initial cluster status
Same as above, omitted...
-- Cluster status while the network-cut script is running
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)",
"status": "UNREACHABLE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to 'node3:3306': Can't connect to MySQL server on 'node3:3306' (110)",
"status": "UNREACHABLE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# At this point the memberState of node2/3 is "(MISSING)", mysqlsh cannot connect to them, and their status is "UNREACHABLE"
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)",
"status": "UNREACHABLE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"instanceErrors": [
"ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE"
],
"memberRole": "SECONDARY",
"memberState": "ONLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "UNREACHABLE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# The picture then changes slightly: node2's memberState and status stay the same, while node3's memberState becomes "ONLINE" and a split-brain warning appears
-- At the same time, check the cluster status from node2
MySQL 10.186.65.18:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"instanceErrors": [
"WARNING: Instance is NOT a PRIMARY but super_read_only option is OFF."
],
"memberRole": "SECONDARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# The network between node2 and node3 is intact; only the links between node1 and the other two are cut. Seen from node2, the cluster still looks healthy and no primary switchover has taken place
-- After the network is restored, check the cluster status on node1 again
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# Once the network is back, the cluster status seen from node1 also returns to normal
Summary
Symptoms
- The primary reports the NO_QUORUM status
- The secondaries are shown as UNREACHABLE
- A split-brain warning may appear: split-brain! Instance is not part of the majority group
Recovery points
- Once the network recovers, the cluster re-establishes connections automatically (if quorum does not come back on its own, see the sketch below)
- Data consistency checks are performed automatically
- Primary and secondary roles are renegotiated automatically
Impact
- Write operations may be blocked during the network partition
- A majority of nodes must reach agreement
- Data conflicts may arise and need to be resolved
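In this test the partition healed by itself after 30 seconds. If the partition persists and the side you want to keep stays in NO_QUORUM, the AdminAPI provides forceQuorumUsingPartitionOf() to rebuild quorum from the members that are still reachable. A sketch only, to be run on the partition you intend to keep and only after the other side has been confirmed down or fenced off:
-- Connect to a member of the partition to keep, e.g. node2
mysqlsh --uri clusteruser@10.186.65.18:3306
\js
var cluster = dba.getCluster()
-- Restore quorum using the members visible from node2; the isolated node must rejoin afterwards
cluster.forceQuorumUsingPartitionOf("clusteruser@node2:3306")
cluster.status()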
Scenario 4: Restarting the primary node
Fault simulation: restart the primary node node1
-- node1:
systemctl restart mysql
-- node1:
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster=dba.getCluster()
cluster.status()
Recovery process
-- Check the cluster status while the primary node node1 is restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node2:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2013: Could not open connection to 'node1:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104",
"status": "(MISSING)"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node2:3306"
}
# While the primary is restarting, mysqlsh cannot connect to it; a primary switchover has taken place and node1's memberRole changes from "PRIMARY" to "SECONDARY"
-- Check the cluster status again after node1 has finished restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node2:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node2:3306"
}
# After the instance on node1 finishes restarting, mysqlsh can connect to it again; its memberState is "OFFLINE" and its status is "(MISSING)". If it does not rejoin the cluster automatically, run rejoinInstance manually to add it back, as sketched below
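A minimal sketch of the manual rejoin, run from a mysqlsh session connected to the cluster (the connection details mirror the ones used elsewhere in this article):
-- From the session connected to the current primary (node2)
var cluster = dba.getCluster()
cluster.rejoinInstance("clusteruser@node1:3306")
cluster.status()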
Summary
Symptoms
- An automatic failover occurs while the primary is restarting
- node2 is elected as the new primary
- After the restart, the original primary is shown as (MISSING)
Recovery points
- The cluster completes the primary switchover automatically
- The original primary has to rejoin, either manually or automatically
- Make sure the application can follow the primary change (see the sketch below for switching the primary role back)
Impact
- Writes are briefly interrupted during the primary switchover
- Application connections may need to reconnect
- Cluster fault tolerance is reduced until the original primary rejoins
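If the application expects node1 to remain the writer, the primary role can be moved back once node1 is ONLINE again; setPrimaryInstance() is the AdminAPI call for this. A sketch, assuming node1 has fully caught up with the group:
-- Move the primary role back to node1 after it has rejoined
cluster.setPrimaryInstance("clusteruser@node1:3306")
cluster.status()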
Scenario 5: All nodes restart at the same time and quorum is lost
Fault simulation: restart every node in the cluster simultaneously
-- node1 & node2 & node3
./restart_node.sh
cat restart_node.sh
#!/bin/bash
# Stop the instance, wait 30 seconds, then start it again
systemctl stop mysql
sleep 30
systemctl start mysql
-- node1
cluster.status()
var cluster=dba.getCluster()
\sql
show binary log status;
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;
\js
var cluster=dba.getCluster()
cluster.status()
cluster.rejoinInstance("clusteruser@node2:3306")
cluster.rejoinInstance("clusteruser@node3:3306")
cluster.status()
# PS: If the rejoin still fails, you may need to run rescan first and then addInstance to bring the failed node back into the cluster
cluster.rescan()
cluster.addInstance("clusteruser@node2:3306")
Recovery process
-- Restart the instances on all 3 nodes at the same time, then check the mysqld processes on every node once the restart has finished
[root@node1 ~]# ./restart_node.sh
[root@node1 ~]# ps -ef|grep mysql
mysqlro+ 65096 1 3 Nov19 ? 01:36:34 /usr/bin/mysqlrouter -c /mysql_router/mysqlrouter.conf
root 77313 77260 0 13:48 pts/0 00:00:02 mysqlsh --uri clusteruser@10.186.65.13:3306
mysql 77963 1 4 15:45 ? 00:00:01 /usr/sbin/mysqld
root 78021 77883 0 15:45 pts/2 00:00:00 grep --color=auto mysql
[root@node1 ~]#
[root@node2 ~]# ./restart_node.sh
[root@node2 ~]# ps -ef|grep mysql
root 27868 27780 0 15:29 pts/0 00:00:01 mysqlsh --uri clusteruser@10.186.65.18:3306
mysql 28001 1 19 15:45 ? 00:00:03 /usr/sbin/mysqld
root 28078 27924 0 15:45 pts/2 00:00:00 grep --color=auto mysql
[root@node2 ~]#
[root@node3 ~]# ./restart_node.sh
[root@node3 ~]# ps -ef|grep mysql
root 127839 127035 0 15:29 pts/0 00:00:00 mysqlsh --uri clusteruser@10.186.65.24:3306
mysql 128073 1 10 15:44 ? 00:00:02 /usr/sbin/mysqld
root 128153 127992 0 15:44 pts/2 00:00:00 grep --color=auto mysql
[root@node3 ~]#
-- While all cluster nodes are restarting, the cluster status cannot be retrieved
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
Cluster.status: The cluster object is disconnected. Please use dba.getCluster() to obtain a fresh cluster handle. (RuntimeError)
MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()
Dba.getCluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active) (MYSQLSH 51314)
-- Identify the node with the most complete GTID set (assume it is node1, the old primary)
MySQL 10.186.65.13:3306 ssl SQL > show binary log status;
ERROR: 2013 (HY000): Lost connection to MySQL server during query
The global session got disconnected..
Attempting to reconnect to 'mysql://clusteruser@10.186.65.13:3306'..
The global session was successfully reconnected.
MySQL 10.186.65.13:3306 ssl SQL > show binary log status;
+------------------+----------+--------------+------------------+---------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+---------------------------------------------+
| mysql_bin.000003 | 194 | | | aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-2540 |
+------------------+----------+--------------+------------------+---------------------------------------------+
1 row in set (0.0001 sec)
-- Use this node to bootstrap the group and start Group Replication
MySQL 10.186.65.13:3306 ssl SQL > set global group_replication_bootstrap_group=on;
Query OK, 0 rows affected (0.0006 sec)
MySQL 10.186.65.13:3306 ssl SQL > start group_replication;
Query OK, 0 rows affected (1.1069 sec)
MySQL 10.186.65.13:3306 ssl SQL > set global group_replication_bootstrap_group=off;
Query OK, 0 rows affected (0.0004 sec)
-- Check the cluster status again
MySQL 10.186.65.13:3306 ssl SQL > \js
Switching to JavaScript mode...
MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
MySQL 10.186.65.13:3306 ssl JS >
# At this point the memberState of the other two secondaries is "OFFLINE" and their status is "(MISSING)"; run rejoinInstance to add them back to the cluster
-- Rejoin the other nodes one by one
MySQL 10.186.65.13:3306 ssl JS > cluster.rejoinInstance("clusteruser@node2:3306")
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'node2:3306' with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
The incremental state recovery may be safely used if you are sure all updates ever executed in the cluster were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the cluster or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Incremental state recovery was selected because it seems to be safely usable.
Validating instance configuration at node2:3306...
This instance reports its own address as node2:3306
Instance configuration is suitable.
Rejoining instance 'node2:3306' to cluster 'mycluster'...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
State recovery already finished for 'node2:3306'
The instance 'node2:3306' was successfully rejoined to the cluster.
MySQL 10.186.65.13:3306 ssl JS > cluster.rejoinInstance("clusteruser@node3:3306")
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'node3:3306' with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
The incremental state recovery may be safely used if you are sure all updates ever executed in the cluster were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the cluster or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Incremental state recovery was selected because it seems to be safely usable.
Validating instance configuration at node3:3306...
This instance reports its own address as node3:3306
Instance configuration is suitable.
Rejoining instance 'node3:3306' to cluster 'mycluster'...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
State recovery already finished for 'node3:3306'
The instance 'node3:3306' was successfully rejoined to the cluster.
-- Check the cluster status once more
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
MySQL 10.186.65.13:3306 ssl JS >
# After the instances restart, Group Replication is not started by default, so the group may lack quorum and the cluster status cannot be retrieved. You need to pick one node as the bootstrap node, start Group Replication on it, and then bring the remaining nodes back with rejoinInstance
# For planned maintenance such as Linux kernel upgrades on the cluster hosts, restart the instances in a rolling fashion, one node at a time, and avoid restarting all nodes at the same time
Summary
Symptoms
- After all nodes restart, the cluster may not recover automatically
- The error The cluster object is disconnected appears
- Group Replication is not started automatically
Recovery points
- Manually pick the node with the most complete GTID set as the bootstrap node
- Bootstrap the cluster with the group_replication_bootstrap_group parameter (or via the AdminAPI alternative sketched below)
- Rejoin the other nodes with rejoinInstance
Impact
- The entire cluster service is interrupted
- Manual intervention is required to recover
- Recovery takes comparatively long
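As an alternative to bootstrapping Group Replication by hand in SQL, MySQL Shell's AdminAPI offers dba.rebootClusterFromCompleteOutage(), which reuses the stored metadata to re-bootstrap the group and rejoin the reachable members. A sketch under the same connection details used above; recent Shell versions also compare the GTID sets of the reachable members before proceeding:
-- Run from mysqlsh on one of the nodes after all instances are back up
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster = dba.rebootClusterFromCompleteOutage("mycluster")
cluster.status()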
Conclusion
This article tested the common failure scenarios that MIC runs into in practice, so that MySQL operators can better understand its failure characteristics and recovery mechanisms, improve their ability to handle abnormal situations, and keep the database service highly available and the data safe.
The failure scenarios fall into two broad categories:
- Scenarios the cluster recovers from automatically:
  - Single-node failures (restart, crash): the cluster recovers on its own
  - After a network partition heals, the cluster re-establishes consistency automatically
  - When the primary fails, failover happens automatically
- Scenarios that require manual intervention:
  - All nodes restarting at the same time, causing quorum loss
  - An inappropriate choice of bootstrap node
  - Data inconsistencies that must be repaired manually
References
[1] MySQL InnoDB Cluster: https://dev.mysql.com/doc/mysql-shell/8.4/en/monitoring-innodb-cluster.html
[2] MySQL Group Replication: https://dev.mysql.com/doc/en/group-replication.html
[3] MySQL Shell: https://dev.mysql.com/doc/mysql-shell/8.4/en/rejoin-cluster.html