Zookeeper的选举算法和脑裂问题深度讲解

2019-09-24 798

ZK介绍

ZK = zookeeper

ZK是微服务解决方案中拥有服务注册发现最为核心的环境，是微服务的基石。作为服务注册发现模块，并不是只有ZK一种产品，目前得到行业认可的还有：Eureka、Consul。

这里我们只聊ZK，这个工具本身很小zip包就几兆，安装非常傻瓜，能够支持集群部署。

官网地址：https://zookeeper.apache.org/

背景

在集群环境下ZK的leader&follower的概念，已经节点异常ZK面临的问题以及如何解决。ZK本身是java语言开发，也开源到Github上但官方文档对内部介绍的很少，零散的博客很多，有些写的很不错。

提问：

zookeeper选举算法中的过半票数才提供正常服务，这是什么逻辑？

ZK节点状态角色

ZK集群单节点状态（每个节点有且只有一个状态），ZK的定位一定需要一个leader节点处于lading状态。

looking：寻找leader状态，当前集群没有leader，进入leader选举流程。
following：跟随者状态，接受leading节点同步和指挥。
leading：领导者状态。
observing：观察者状态，表名当前服务器是observer。

ZAB协议（原子广播）

Zookeeper专门设计了一种名为原子广播（ZAB）的支持崩溃恢复的一致性协议。ZK实现了一种主从模式的系统架构来保持集群中各个副本之间的数据一致性，所有的写操作都必须通过Leader完成，Leader写入本地日志后再复制到所有的Follower节点。一旦Leader节点无法工作，ZAB协议能够自动从Follower节点中重新选出一个合适的替代者，即新的Leader，该过程即为领导选举。

ZK集群中事务处理是leader负责，follower会转发到leader来统一处理。简单理解就是ZK的写统一leader来做，读可以follower处理，这也就是CAP理论中ZK更适合读多写少的服务。

过半选举算法

ZK投票处理策略

投票信息包含：所选举leader的Serverid，Zxid，SelectionEpoch

Epoch判断，自身logicEpoch与SelectionEpoch判断：大于、小于、等于。
优先检查ZXID。ZXID比较大的服务器优先作为Leader。
如果ZXID相同，那么就比较myid。myid较大的服务器作为Leader服务器。

ZK中有三种选举算法，分别是LeaderElection,FastLeaderElection,AuthLeaderElection，FastLeaderElection和AuthLeaderElection是类似的选举算法，唯一区别是后者加入了认证信息， FastLeaderElection比LeaderElection更高效，后续的版本只保留FastLeaderElection。

理解：

在集群环境下多个节点启动，ZK首先需要在多个节点中选出一个节点作为leader并处于Leading状态，这样就面临一个选举问题，同时选举规则是什么样的。“过半选举算法”：投票选举中获得票数过半的节点胜出，即状态从looking变为leading，效率更高。

官网资料描述：Clustered (Multi-Server) Setup，如下图：

As long as a majority of the ensemble are up, the service will be available. Because Zookeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.

以5台服务器讲解思路：

服务器1启动,此时只有它一台服务器启动了,它发出去的Vote没有任何响应,所以它的选举状态一直是LOOKING状态；
服务器2启动,它与最开始启动的服务器1进行通信,互相交换自己的选举结果,由于两者都没有历史数据,所以id值较大的服务器2胜出,但是由于没有达到超过半数以上的服务器都同意选举它(这个例子中的半数以上是3),所以服务器1,2还是继续保持LOOKING状态.
服务器3启动,根据前面的理论，分析有三台服务器选举了它,服务器3成为服务器1,2,3中的老大,所以它成为了这次选举的leader.
服务器4启动,根据前面的分析，理论上服务器4应该是服务器1,2,3,4中最大的,但是由于前面已经有半数以上的服务器选举了服务器3,所以它只能接收当小弟的命了.
服务器5启动,同4一样,当小弟.

假设5台中挂了2台（3、4），其中leader也挂掉：

leader和follower间有检查心跳，需要同步数据 Leader节点挂了，整个Zookeeper集群将暂停对外服务，进入新一轮Leader选举

1）服务器1、2、5发现与leader失联，状态转为looking，开始新的投票 2）服务器1、2、5分别开始投票并广播投票信息，自身Epoch自增； 3) 服务器1、2、5分别处理投票，判断出leader分别广播 4）根据投票处理逻辑会选出一台（2票过半） 5）各自服务器重新变更为leader、follower状态 6）重新提供服务

源码解析：

URL: FastLeaderElection

/**
 * Starts a new round of leader election. Whenever our QuorumPeer
 * changes its state to LOOKING, this method is invoked, and it
 * sends notifications to all other peers.
 */
public Vote lookForLeader() throws InterruptedException {
    try {
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }

    self.start_fle = Time.currentElapsedTime();
    try {
        Map<Long, Vote> recvset = new HashMap<Long, Vote>();

        Map<Long, Vote> outofelection = new HashMap<Long, Vote>();

        int notTimeout = minNotificationInterval;

        synchronized (this) {
            logicalclock.incrementAndGet();
            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
        }

        LOG.info("New election. My id =  " + self.getId() + ", proposed zxid=0x" + Long.toHexString(proposedZxid));
        sendNotifications();

        SyncedLearnerTracker voteSet;

        /*
         * Loop in which we exchange notifications until we find a leader
         */

        while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
            /*
             * Remove next notification from queue, times out after 2 times
             * the termination time
             */
            Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);

            /*
             * Sends more notifications if haven't received enough.
             * Otherwise processes new notification.
             */
            if (n == null) {
                if (manager.haveDelivered()) {
                    sendNotifications();
                } else {
                    manager.connectAll();
                }

                /*
                 * Exponential backoff
                 */
                int tmpTimeOut = notTimeout * 2;
                notTimeout = (tmpTimeOut < maxNotificationInterval ? tmpTimeOut : maxNotificationInterval);
                LOG.info("Notification time out: " + notTimeout);
            } else if (validVoter(n.sid) && validVoter(n.leader)) {
                /*
                 * Only proceed if the vote comes from a replica in the current or next
                 * voting view for a replica in the current or next voting view.
                 */
                switch (n.state) {
                case LOOKING:
                    if (getInitLastLoggedZxid() == -1) {
                        LOG.debug("Ignoring notification as our zxid is -1");
                        break;
                    }
                    if (n.zxid == -1) {
                        LOG.debug("Ignoring notification from member with -1 zxid {}", n.sid);
                        break;
                    }
                    // If notification > current, replace and send messages out
                    if (n.electionEpoch > logicalclock.get()) {
                        logicalclock.set(n.electionEpoch);
                        recvset.clear();
                        if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                        } else {
                            updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                        }
                        sendNotifications();
                    } else if (n.electionEpoch < logicalclock.get()) {
                        if (LOG.isDebugEnabled()) {
                            LOG.debug(
                                "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x" + Long.toHexString(n.electionEpoch)
                                + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                        }
                        break;
                    } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                        sendNotifications();
                    }

                    if (LOG.isDebugEnabled()) {
                        LOG.debug("Adding vote: from=" + n.sid
                                  + ", proposed leader=" + n.leader
                                  + ", proposed zxid=0x" + Long.toHexString(n.zxid)
                                  + ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                    }

                    // don't care about the version if it's in LOOKING state
                    recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                    voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));

                    if (voteSet.hasAllQuorums()) {

                        // Verify if there is any change in the proposed leader
                        while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                recvqueue.put(n);
                                break;
                            }
                        }

                        /*
                         * This predicate is true once we don't read any new
                         * relevant message from the reception queue
                         */
                        if (n == null) {
                            setPeerState(proposedLeader, voteSet);
                            Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }
                    break;
                case OBSERVING:
                    LOG.debug("Notification from observer: {}", n.sid);
                    break;
                case FOLLOWING:
                case LEADING:
                    /*
                     * Consider all notifications from the same epoch
                     * together.
                     */
                    if (n.electionEpoch == logicalclock.get()) {
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            setPeerState(n.leader, voteSet);
                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                    }

                    /*
                     * Before joining an established ensemble, verify that
                     * a majority are following the same leader.
                     */
                    outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                    voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));

                    if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                        synchronized (this) {
                            logicalclock.set(n.electionEpoch);
                            setPeerState(n.leader, voteSet);
                        }
                        Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                        leaveInstance(endVote);
                        return endVote;
                    }
                    break;
                default:
                    LOG.warn("Notification state unrecoginized: " + n.state + " (n.state), " + n.sid + " (n.sid)");
                    break;
                }
            } else {
                if (!validVoter(n.leader)) {
                    LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                }
                if (!validVoter(n.sid)) {
                    LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                }
            }
        }
        return null;
    } finally {
        try {
            if (self.jmxLeaderElectionBean != null) {
                MBeanRegistry.getInstance().unregister(self.jmxLeaderElectionBean);
            }
        } catch (Exception e) {
            LOG.warn("Failed to unregister with JMX", e);
        }
        self.jmxLeaderElectionBean = null;
        LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount());
    }
}

/*
* We return true if one of the following three cases hold:
* 1- New epoch is higher
* 2- New epoch is the same as current epoch, but new zxid is higher
* 3- New epoch is the same as current epoch, new zxid is the same
*  as current zxid, but server id is higher.
*/

return ((newEpoch > curEpoch)
	|| ((newEpoch == curEpoch)
		&& ((newZxid > curZxid)
			|| ((newZxid == curZxid)
				&& (newId > curId)))));

脑裂问题

脑裂问题出现在集群中leader死掉，follower选出了新leader而原leader又复活了的情况下，因为ZK的过半机制是允许损失一定数量的机器而扔能正常提供给服务，当leader死亡判断不一致时就会出现多个leader。

方案：

ZK的过半机制一定程度上也减少了脑裂情况的出现，起码不会出现三个leader同时。ZK中的Epoch机制（时钟）每次选举都是递增+1，当通信时需要判断epoch是否一致，小于自己的则抛弃，大于自己则重置自己，等于则选举；

// If notification > current, replace and send messages out
if (n.electionEpoch > logicalclock.get()) {
    logicalclock.set(n.electionEpoch);
    recvset.clear();
    if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
        updateProposal(n.leader, n.zxid, n.peerEpoch);
    } else {
        updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
    }
    sendNotifications();
} else if (n.electionEpoch < logicalclock.get()) {
    if (LOG.isDebugEnabled()) {
        LOG.debug(
            "Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x" + Long.toHexString(n.electionEpoch)
            + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
    }
    break;
} else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
    updateProposal(n.leader, n.zxid, n.peerEpoch);
    sendNotifications();
}

归纳

在日常的ZK运维时需要注意以上场景在极端情况下出现问题，特别是脑裂的出现，可以采用：

过半选举策略下部署原则：

服务器群部署要单数，如：3、5、7、...，单数是最容易选出leader的配置量。
ZK允许节点最大损失数，原则就是“保证过半选举正常”，多了就是浪费。

详细的算法逻辑是很复杂要考虑很多情况，其中有个Epoch的概念（自增长），分为：LogicEpoch和ElectionEpoch，每次投票都有判断每个投票周期是否一致等等。

在思考ZK策略时经常遇到这样的问题（上文中两块），梳理了一下思路以便于理解也作为后续回顾，特别感谢下面几篇博文的支持，感谢分享；

作者：Owen Jia

可以关注他的博客：Owen Blog

参考博文资料：

zookeeper3.3.5

理解zookeeper选举机制

Zookeeper选举算法原理

看完这篇文章你就清楚的知道 ZooKeeper的概念了

脑裂是什么？Zookeeper是如何解决的？

zookeeper脑裂问题

微信关注我们

原文链接：https://my.oschina.net/timestorm/blog/3110115

转载内容版权归作者及来源网站所有！

低调大师中文资讯倾力打造互联网数据资讯、行业资源、电子商务、移动互联网、网络营销平台。持续更新报道IT业界、互联网、市场资讯、驱动更新,是最及时权威的产业资讯及硬件资讯报道平台。

使用K8s遇难题？Istio来帮您！

如果你正在使用容器，特别是Kubernetes，那么你应该也听说过Istio。对于初学者来说，Istio是Kubernetes的服务网格（service mesh）。所谓服务网格，它是一个网络层，并且可以动态管理服务流量，然后以安全的方式进行管理。如何充分使用Istio，这不是一篇博客文章能阐述清楚的。因此，在本文中我将介绍一些它的特性，更重要的是，你可以通过这篇文章，了解到一些方法来自动化解决某些实际问题。 Istio可以让你使用一组自定义Kubernetes资源来管理网络流量，并且可以帮助你保护和加密服务之间以及集群内外的网络流量。它全面集成了Kubernetes API，这意味着可以使用与其他Kubernetes配置完全相同的方式来定义和管理Istio设置。权衡利弊,再做选择如果要开始使用Istio，首先应该问自己为什么。Istio提供了一些非常有价值的功能，如金丝雀发布等，但是如果不增加一些复杂性，就无法使用它们。你还需要投入一定的时间来学习它。也就是说，如果你的情况合适使用它，你可以（并且应该）在自己的集群中谨慎且逐步地采用Istio的功能。如果你要从头开始构建新环境...

2019-09-24

650

Spring 最重要的概念是 IOC 和 AOP，本篇文章其实就是要带领大家来分析下 Spring 的 IOC 容器。既然大家平时都要用到 Spring，怎么可以不好好了解 Spring 呢？阅读本文并不能让你成为 Spring 专家，不过一定有助于大家理解 Spring 的很多概念，帮助大家排查应用中和 Spring 相关的一些问题。本文采用的源码版本是 4.3.11.RELEASE，算是 5.0.x 前比较新的版本了。为了降低难度，本文所说的所有的内容都是基于 xml 的配置的方式，实际使用已经很少人这么做了，至少不是纯 xml 配置，不过从理解源码的角度来看用这种方式来说无疑是最合适的。阅读建议：读者至少需要知道怎么配置 Spring，了解 Spring 中的各种概念，少部分内容我还假设读者使用过 SpringMVC。本文要说的 IOC 总体来说有两处地方最重要，一个是创建 Bean 容器，一个是初始化 Bean，如果读者觉得一次性看完本文压力有点大，那么可以按这个思路分两次消化。读者不一定对 Spring 容器的源码感兴趣，也许附录部分介绍的知识对读者有些许作用。希望通过...

2019-09-24

632

资源下载

更多资源

Mario

马里奥是站在游戏界顶峰的超人气多面角色。马里奥靠吃蘑菇成长，特征是大鼻子、头戴帽子、身穿背带裤，还留着胡子。与他的双胞胎兄弟路易基一起，长年担任任天堂的招牌角色。

腾讯云软件源

为解决软件依赖安装时官方源访问速度慢的问题，腾讯云为一些软件搭建了缓存服务。您可以通过使用腾讯云软件源站来提升依赖包的安装速度。为了方便用户自由搭建服务架构，目前腾讯云软件源站支持公网访问和内网访问。

Spring

Spring框架（Spring Framework）是由Rod Johnson于2002年提出的开源Java企业级应用框架，旨在通过使用JavaBean替代传统EJB实现方式降低企业级编程开发的复杂性。该框架基于简单性、可测试性和松耦合性设计理念，提供核心容器、应用上下文、数据访问集成等模块，支持整合Hibernate、Struts等第三方框架，其适用范围不仅限于服务器端开发，绝大多数Java应用均可从中受益。

Rocky Linux

Rocky Linux（中文名：洛基）是由Gregory Kurtzer于2020年12月发起的企业级Linux发行版，作为CentOS稳定版停止维护后与RHEL（Red Hat Enterprise Linux）完全兼容的开源替代方案，由社区拥有并管理，支持x86_64、aarch64等架构。其通过重新编译RHEL源代码提供长期稳定性，采用模块化包装和SELinux安全架构，默认包含GNOME桌面环境及XFS文件系统，支持十年生命周期更新。