MySQL InnoDB Cluster: Analysis of Common Failure Scenarios
Author: Zhao Liming, member of the ActionSky MySQL DBA team. Familiar with Oracle and MySQL, he specializes in diagnosing database performance problems and analyzing transaction and locking issues, handles day-to-day MySQL issues for customers as well as for our in-house DMP platform, and has a strong interest in open-source database technologies.
Produced by the ActionSky open-source community. Original content may not be used without authorization; to republish, please contact the editor and credit the source.
This article is about 4,800 words and takes roughly 15 minutes to read.
1. Background
As business requirements for database high availability keep rising, MySQL InnoDB Cluster[1] (hereafter MIC), the official MySQL high-availability solution, has been widely adopted in production. Built on MySQL Group Replication[2] and managed through MySQL Shell[3], it provides automatic failure detection and recovery, data consistency guarantees, and other key capabilities.
In day-to-day operations, however, hardware failures, network problems, configuration mistakes and other factors can still put the cluster into abnormal states. To keep the database service running steadily, MySQL operators need to be able to diagnose and recover cluster failures quickly.
Based on experience from real production environments, this article walks through tests of the common MIC failure scenarios and provides the corresponding recovery procedures, with the goal of helping MySQL operators handle failures more efficiently and keep the business running.
Test environment
- OS: Red Hat Enterprise Linux release 8.10 (Ootpa)
- MySQL: mysql-community-server-8.4.5-1.el8.x86_64
- mysqlshell: mysql-shell-8.4.5-1.el8.x86_64
- mysqlrouter: mysql-router-community-8.4.5-1.el8.x86_64
2. Scenario Examples
Scenario 1: Restarting a non-primary node instance
Fault simulation: restart the secondary node node2
-- node2
systemctl restart mysql
-- node1
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster=dba.getCluster()
cluster.status()
Recovery process
-- Log in with mysqlsh and check the cluster's initial status
[root@node1 ~]# mysqlsh --uri clusteruser@10.186.65.13:3306
MySQL Shell 8.4.5
Copyright (c) 2016, 2025, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.
Type '\help' or '\?' for help; '\quit' to exit.
Creating a Classic session to 'clusteruser@10.186.65.13:3306'
Fetching global names for auto-completion... Press ^C to stop.
Your MySQL connection id is 402320
Server version: 8.4.5 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
MySQL 10.186.65.13:3306 ssl SQL > \js
Switching to JavaScript mode...
MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# All cluster members are "ONLINE"; node1 is the "PRIMARY" node and node2/3 are "SECONDARY" nodes
-- Observe the cluster status while the instance on node2 is restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2013: Could not open connection to 'node2:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104",
"status": "(MISSING)"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# mysqlsh fails to connect to node2 with Error 2013, and the node's status changes to "(MISSING)"
-- Check the cluster status again after the instance on node2 has finished restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# Once the instance on node2 has finished restarting, the cluster recovers on its own; no further action is needed
Summary
Symptoms
- During the restart, node2 is shown as (MISSING)
- The cluster status changes to OK_NO_TOLERANCE_PARTIAL
- A connection error appears: Lost connection to MySQL server
Recovery points
- After the secondary restarts, the cluster detects it automatically and lets it rejoin (membership can also be watched in SQL, as sketched below)
- No manual intervention is needed; the node returns to ONLINE on its own
- The cluster keeps serving traffic while the single node is down
Impact
- Fault tolerance drops to 0 during the restart
- Application reads and writes are unaffected
- Read load balancing is briefly affected
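Membership can also be watched directly in SQL from any surviving node while the restart is in progress. A minimal sketch using the standard performance_schema view (the expectations in the comments reflect what this scenario shows, not additional captured output):
-- Run on node1 or node3 while node2 is restarting
SELECT member_host, member_port, member_state, member_role
  FROM performance_schema.replication_group_members;
-- While node2 is down it disappears from the list or shows up as UNREACHABLE;
-- after the restart completes, its member_state returns to ONLINE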
Scenario 2: A non-primary node instance goes down
Fault simulation: stop the secondary node node3, then start it again
-- node3:
systemctl stop mysql
-- node1:
mysqlsh --uri clusteruser@10.186.65.13:3306
var cluster=dba.getCluster()
cluster.status()
-- node3:
systemctl start mysql
Recovery process
-- Initial cluster status
Same as above, omitted...
-- Check the cluster status after the instance on node3 has been stopped
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2013: Could not open connection to 'node3:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104",
"status": "(MISSING)"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# mysqlsh fails to connect to the instance on node3, and the node's status is "(MISSING)"
-- Check the cluster status again after the instance on node3 has been started
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# After the stopped instance on node3 is started again, the cluster recovers on its own; no further action is needed
Summary
Symptoms
- node3 is shown as (MISSING)
- The cluster status is OK_NO_TOLERANCE_PARTIAL
- A connection error appears: Lost connection to MySQL server
Recovery points
- The stopped secondary recovers automatically once it is started again
- The cluster resynchronizes data automatically
- The whole process needs no manual intervention (the settings this relies on are sketched below)
Impact
- The cluster loses fault tolerance while the node is down
- Read capacity is reduced
- Writes on the primary are unaffected
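The automatic recovery in scenarios 1 and 2 assumes Group Replication starts together with the instance and is allowed to rejoin on its own. A quick sanity check of the relevant settings (the variable names are standard server variables; the expected values are assumptions about how such a cluster is configured, not output captured from this environment):
-- Run on any node in SQL mode
SHOW GLOBAL VARIABLES LIKE 'group_replication_start_on_boot';    -- ON lets GR start when mysqld starts
SHOW GLOBAL VARIABLES LIKE 'group_replication_autorejoin_tries'; -- a value > 0 allows automatic rejoin after transient failures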
Scenario 3: Network partition between the primary and the secondaries
Fault simulation: cut off the network between the primary node node1 and the two secondaries
-- node1:
./stop_net.sh
cat stop_net.sh
#!/bin/bash
# Drop all traffic between node1 and the two secondaries (node2 = 10.186.65.18, node3 = 10.186.65.24),
# wait 30 seconds, then flush the rules to restore the network
iptables -A INPUT -s 10.186.65.18 -j DROP
iptables -A OUTPUT -d 10.186.65.18 -j DROP
iptables -A INPUT -s 10.186.65.24 -j DROP
iptables -A OUTPUT -d 10.186.65.24 -j DROP
sleep 30
iptables -F
-- node1:
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster=dba.getCluster()
cluster.status()
Recovery process
-- Initial cluster status
Same as above, omitted...
-- Cluster status while the network-cut script is running
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)",
"status": "UNREACHABLE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to 'node3:3306': Can't connect to MySQL server on 'node3:3306' (110)",
"status": "UNREACHABLE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# At this point the memberState of node2/3 is "(MISSING)", mysqlsh cannot connect to them, and their status is "UNREACHABLE"
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions. 2 members are not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to 'node2:3306': Can't connect to MySQL server on 'node2:3306' (110)",
"status": "UNREACHABLE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"instanceErrors": [
"ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE"
],
"memberRole": "SECONDARY",
"memberState": "ONLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "UNREACHABLE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# The picture then changes slightly: node2's memberState and status stay the same, while node3's memberState becomes "ONLINE" and a split-brain warning appears
-- At the same time, check the cluster status from node2
MySQL 10.186.65.18:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"instanceErrors": [
"WARNING: Instance is NOT a PRIMARY but super_read_only option is OFF."
],
"memberRole": "SECONDARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# The network between node2 and node3 is intact; only the links between node1 and the other two are cut. Seen from node2, the cluster still looks healthy and no primary switchover has taken place
-- After the network is restored, check the cluster status on node1 again
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
# Once the network is back, the cluster status seen from node1 also returns to normal
Summary
Symptoms
- The primary reports the NO_QUORUM status
- The secondaries are shown as UNREACHABLE
- A split-brain warning may appear: split-brain! Instance is not part of the majority group
Recovery points
- Once the network recovers, the cluster re-establishes connections automatically (if quorum does not come back on its own, see the sketch below)
- Data consistency checks are performed automatically
- Primary and secondary roles are renegotiated automatically
Impact
- Write operations may be blocked during the network partition
- A majority of nodes must reach agreement
- Data conflicts may arise and need to be resolved
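In this test the partition healed by itself after 30 seconds. If the partition persists and the side you want to keep stays in NO_QUORUM, the AdminAPI provides forceQuorumUsingPartitionOf() to rebuild quorum from the members that are still reachable. A sketch only, to be run on the partition you intend to keep and only after the other side has been confirmed down or fenced off:
-- Connect to a member of the partition to keep, e.g. node2
mysqlsh --uri clusteruser@10.186.65.18:3306
\js
var cluster = dba.getCluster()
-- Restore quorum using the members visible from node2; the isolated node must rejoin afterwards
cluster.forceQuorumUsingPartitionOf("clusteruser@node2:3306")
cluster.status()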
Scenario 4: Restarting the primary node
Fault simulation: restart the primary node node1
-- node1:
systemctl restart mysql
-- node1:
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster=dba.getCluster()
cluster.status()
Recovery process
-- Check the cluster status while the primary node node1 is restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node2:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2013: Could not open connection to 'node1:3306': Lost connection to MySQL server at 'reading initial communication packet', system error: 104",
"status": "(MISSING)"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node2:3306"
}
# While the primary is restarting, mysqlsh cannot connect to it; a primary switchover has taken place and node1's memberRole changes from "PRIMARY" to "SECONDARY"
-- Check the cluster status again after node1 has finished restarting
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node2:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node2:3306"
}
# After the instance on node1 finishes restarting, mysqlsh can connect to it again; its memberState is "OFFLINE" and its status is "(MISSING)". If it does not rejoin the cluster automatically, run rejoinInstance manually to add it back, as sketched below
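A minimal sketch of the manual rejoin, run from a mysqlsh session connected to the cluster (the connection details mirror the ones used elsewhere in this article):
-- From the session connected to the current primary (node2)
var cluster = dba.getCluster()
cluster.rejoinInstance("clusteruser@node1:3306")
cluster.status()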
Summary
Symptoms
- An automatic failover occurs while the primary is restarting
- node2 is elected as the new primary
- After the restart, the original primary is shown as (MISSING)
Recovery points
- The cluster completes the primary switchover automatically
- The original primary has to rejoin, either manually or automatically
- Make sure the application can follow the primary change (see the sketch below for switching the primary role back)
Impact
- Writes are briefly interrupted during the primary switchover
- Application connections may need to reconnect
- Cluster fault tolerance is reduced until the original primary rejoins
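If the application expects node1 to remain the writer, the primary role can be moved back once node1 is ONLINE again; setPrimaryInstance() is the AdminAPI call for this. A sketch, assuming node1 has fully caught up with the group:
-- Move the primary role back to node1 after it has rejoined
cluster.setPrimaryInstance("clusteruser@node1:3306")
cluster.status()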
Scenario 5: All nodes restart at the same time and quorum is lost
Fault simulation: restart every node in the cluster simultaneously
-- node1 & node2 & node3
./restart_node.sh
cat restart_node.sh
#!/bin/bash
# Stop the instance, wait 30 seconds, then start it again
systemctl stop mysql
sleep 30
systemctl start mysql
-- node1
cluster.status()
var cluster=dba.getCluster()
\sql
show binary log status;
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=off;
\js
var cluster=dba.getCluster()
cluster.status()
cluster.rejoinInstance("clusteruser@node2:3306")
cluster.rejoinInstance("clusteruser@node3:3306")
cluster.status()
# PS: If the rejoin still fails, you may need to run rescan first and then addInstance to bring the failed node back into the cluster
cluster.rescan()
cluster.addInstance("clusteruser@node2:3306")
Recovery process
-- Restart the instances on all 3 nodes at the same time, then check the mysqld processes on every node once the restart has finished
[root@node1 ~]# ./restart_node.sh
[root@node1 ~]# ps -ef|grep mysql
mysqlro+ 65096 1 3 Nov19 ? 01:36:34 /usr/bin/mysqlrouter -c /mysql_router/mysqlrouter.conf
root 77313 77260 0 13:48 pts/0 00:00:02 mysqlsh --uri clusteruser@10.186.65.13:3306
mysql 77963 1 4 15:45 ? 00:00:01 /usr/sbin/mysqld
root 78021 77883 0 15:45 pts/2 00:00:00 grep --color=auto mysql
[root@node1 ~]#
[root@node2 ~]# ./restart_node.sh
[root@node2 ~]# ps -ef|grep mysql
root 27868 27780 0 15:29 pts/0 00:00:01 mysqlsh --uri clusteruser@10.186.65.18:3306
mysql 28001 1 19 15:45 ? 00:00:03 /usr/sbin/mysqld
root 28078 27924 0 15:45 pts/2 00:00:00 grep --color=auto mysql
[root@node2 ~]#
[root@node3 ~]# ./restart_node.sh
[root@node3 ~]# ps -ef|grep mysql
root 127839 127035 0 15:29 pts/0 00:00:00 mysqlsh --uri clusteruser@10.186.65.24:3306
mysql 128073 1 10 15:44 ? 00:00:02 /usr/sbin/mysqld
root 128153 127992 0 15:44 pts/2 00:00:00 grep --color=auto mysql
[root@node3 ~]#
-- While all cluster nodes are restarting, the cluster status cannot be retrieved
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
Cluster.status: The cluster object is disconnected. Please use dba.getCluster() to obtain a fresh cluster handle. (RuntimeError)
MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()
Dba.getCluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active) (MYSQLSH 51314)
-- Identify the node with the most complete GTID set (assume it is node1, the old primary)
MySQL 10.186.65.13:3306 ssl SQL > show binary log status;
ERROR: 2013 (HY000): Lost connection to MySQL server during query
The global session got disconnected..
Attempting to reconnect to 'mysql://clusteruser@10.186.65.13:3306'..
The global session was successfully reconnected.
MySQL 10.186.65.13:3306 ssl SQL > show binary log status;
+------------------+----------+--------------+------------------+---------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+---------------------------------------------+
| mysql_bin.000003 | 194 | | | aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-2540 |
+------------------+----------+--------------+------------------+---------------------------------------------+
1 row in set (0.0001 sec)
-- Use this node to bootstrap the group and start Group Replication
MySQL 10.186.65.13:3306 ssl SQL > set global group_replication_bootstrap_group=on;
Query OK, 0 rows affected (0.0006 sec)
MySQL 10.186.65.13:3306 ssl SQL > start group_replication;
Query OK, 0 rows affected (1.1069 sec)
MySQL 10.186.65.13:3306 ssl SQL > set global group_replication_bootstrap_group=off;
Query OK, 0 rows affected (0.0004 sec)
-- Check the cluster status again
MySQL 10.186.65.13:3306 ssl SQL > \js
Switching to JavaScript mode...
MySQL 10.186.65.13:3306 ssl JS > var cluster=dba.getCluster()
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
MySQL 10.186.65.13:3306 ssl JS >
# At this point the memberState of the other two secondaries is "OFFLINE" and their status is "(MISSING)"; run rejoinInstance to add them back to the cluster
-- Rejoin the other nodes one by one
MySQL 10.186.65.13:3306 ssl JS > cluster.rejoinInstance("clusteruser@node2:3306")
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'node2:3306' with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
The incremental state recovery may be safely used if you are sure all updates ever executed in the cluster were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the cluster or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Incremental state recovery was selected because it seems to be safely usable.
Validating instance configuration at node2:3306...
This instance reports its own address as node2:3306
Instance configuration is suitable.
Rejoining instance 'node2:3306' to cluster 'mycluster'...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
State recovery already finished for 'node2:3306'
The instance 'node2:3306' was successfully rejoined to the cluster.
MySQL 10.186.65.13:3306 ssl JS > cluster.rejoinInstance("clusteruser@node3:3306")
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'node3:3306' with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.
The incremental state recovery may be safely used if you are sure all updates ever executed in the cluster were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the cluster or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Incremental state recovery was selected because it seems to be safely usable.
Validating instance configuration at node3:3306...
This instance reports its own address as node3:3306
Instance configuration is suitable.
Rejoining instance 'node3:3306' to cluster 'mycluster'...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
State recovery already finished for 'node3:3306'
The instance 'node3:3306' was successfully rejoined to the cluster.
-- Check the cluster status once more
MySQL 10.186.65.13:3306 ssl JS > cluster.status()
{
"clusterName": "mycluster",
"defaultReplicaSet": {
"name": "default",
"primary": "node1:3306",
"ssl": "DISABLED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"node1:3306": {
"address": "node1:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node2:3306": {
"address": "node2:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
},
"node3:3306": {
"address": "node3:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": "applier_queue_applied",
"role": "HA",
"status": "ONLINE",
"version": "8.4.5"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "node1:3306"
}
MySQL 10.186.65.13:3306 ssl JS >
# After the instances restart, Group Replication is not started by default, so the group may lack quorum and the cluster status cannot be retrieved. You need to pick one node as the bootstrap node, start Group Replication on it, and then bring the remaining nodes back with rejoinInstance
# For planned maintenance such as Linux kernel upgrades on the cluster hosts, restart the instances in a rolling fashion, one node at a time, and avoid restarting all nodes at the same time
Summary
Symptoms
- After all nodes restart, the cluster may not recover automatically
- The error The cluster object is disconnected appears
- Group Replication is not started automatically
Recovery points
- Manually pick the node with the most complete GTID set as the bootstrap node
- Bootstrap the cluster with the group_replication_bootstrap_group parameter (or via the AdminAPI alternative sketched below)
- Rejoin the other nodes with rejoinInstance
Impact
- The entire cluster service is interrupted
- Manual intervention is required to recover
- Recovery takes comparatively long
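As an alternative to bootstrapping Group Replication by hand in SQL, MySQL Shell's AdminAPI offers dba.rebootClusterFromCompleteOutage(), which reuses the stored metadata to re-bootstrap the group and rejoin the reachable members. A sketch under the same connection details used above; recent Shell versions also compare the GTID sets of the reachable members before proceeding:
-- Run from mysqlsh on one of the nodes after all instances are back up
mysqlsh --uri clusteruser@10.186.65.13:3306
\js
var cluster = dba.rebootClusterFromCompleteOutage("mycluster")
cluster.status()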
Conclusion
This article tested the common failure scenarios that MIC runs into in practice, so that MySQL operators can better understand its failure characteristics and recovery mechanisms, improve their ability to handle abnormal situations, and keep the database service highly available and the data safe.
The failure scenarios fall into two broad categories:
- Scenarios the cluster recovers from automatically:
  - Single-node failures (restart, crash): the cluster recovers on its own
  - After a network partition heals, the cluster re-establishes consistency automatically
  - When the primary fails, failover happens automatically
- Scenarios that require manual intervention:
  - All nodes restarting at the same time, causing quorum loss
  - An inappropriate choice of bootstrap node
  - Data inconsistencies that must be repaired manually
References
[1] MySQL InnoDB Cluster: https://dev.mysql.com/doc/mysql-shell/8.4/en/monitoring-innodb-cluster.html
[2] MySQL Group Replication: https://dev.mysql.com/doc/en/group-replication.html
[3] MySQL Shell: https://dev.mysql.com/doc/mysql-shell/8.4/en/rejoin-cluster.html