Kubernetes Endpoints Controller Source Code Analysis

2018-11-04

Author: xidianwangtao@gmail.com

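Abstract: We have recently been writing our own Kubernetes service-routing component to integrate with our company's in-house load balancer. This work touches the core Endpoints logic, so an in-depth analysis of the Endpoints Controller is necessary. For example, we need to know whether the controller behaves the way we expect when Pod labels change, when a Pod is orphaned, or when a Pod's hostname changes.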

Endpoints Controller Configuration Options

--concurrent-endpoint-syncs int32 Default: 5 The number of endpoint syncing operations that will be done concurrently. Larger number = faster endpoint updating, but more CPU (and network) load.
--leader-elect-resource-lock endpoints Default: "endpoints" The type of resource object that is used for locking during leader election. Supported options are endpoints (default) and configmaps.

GVKs Watched by the Endpoints Controller

Core/V1/Pods
Core/V1/Services
Core/V1/Endpoints

Endpoints Controller Event Handler

Add Service Event --> enqueueService
Update Service Event --> enqueueService(new)
Delete Service Event --> enqueueService
Add Pod Event --> addPod
Update Pod Event --> updatePod
Delete Pod Event --> deletePod
Add/Update/Delete Endpoints Event --> nil

Run Endpoints Controller

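Run starts two kinds of goroutines:

The first kind has as many goroutines as the value of --concurrent-endpoint-syncs (default 5). Each worker pops a service key from the service queue and runs syncService on it, over and over. The workers are started through wait.Until, which re-invokes the worker function 1s (workerLoopPeriod) after it returns.
The second kind is a single goroutine that runs checkLeftoverEndpoints. It runs only once, at startup.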

// Run will not return until stopCh is closed. workers determines how many
// endpoints will be handled in parallel.
func (e *EndpointController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer e.queue.ShutDown()

	glog.Infof("Starting endpoint controller")
	defer glog.Infof("Shutting down endpoint controller")

	if !controller.WaitForCacheSync("endpoint", stopCh, e.podsSynced, e.servicesSynced, e.endpointsSynced) {
		return
	}

	// workers = --concurrent-endpoint-syncs's value (default 5)
	for i := 0; i < workers; i++ {
		// workerLoopPeriod = 1s
		go wait.Until(e.worker, e.workerLoopPeriod, stopCh)
	}

	go func() {
		defer utilruntime.HandleCrash()
		e.checkLeftoverEndpoints()
	}()

	<-stopCh
}
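
The worker pattern in Run can be reduced to a few lines of standard-library Go. The following is only a minimal sketch under stated assumptions: a buffered channel stands in for the controller's work queue, and `runWorkers`, `syncAll` and `syncFn` are hypothetical names that do not appear in the controller.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runWorkers starts `workers` goroutines that each pop keys from queue and
// call syncFn on them. It returns once the queue is closed and drained.
// (Assumption: a channel replaces the controller's real work queue.)
func runWorkers(workers int, queue <-chan string, syncFn func(key string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				syncFn(key)
			}
		}()
	}
	wg.Wait()
}

// syncAll enqueues keys, lets the workers process them, and returns the
// processed keys in sorted order so the result is deterministic.
func syncAll(workers int, keys []string) []string {
	queue := make(chan string, len(keys))
	for _, k := range keys {
		queue <- k
	}
	close(queue)

	var mu sync.Mutex
	var synced []string
	runWorkers(workers, queue, func(key string) {
		mu.Lock()
		defer mu.Unlock()
		synced = append(synced, key) // placeholder for syncService(key)
	})
	sort.Strings(synced)
	return synced
}

func main() {
	fmt.Println(syncAll(5, []string{"default/web", "default/db", "kube-system/dns"}))
}
```

The order in which workers pick up keys is not deterministic, which is why the sketch sorts the result before printing it.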

checkLeftoverEndpoints

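checkLeftoverEndpoints lists all Endpoints currently in the cluster and adds the keys of their corresponding Services to the queue, where the workers process them with syncService.

This covers the case where, while controller-manager was restarting, a user deleted some Services, or some Endpoints had not been fully cleaned up, and the Endpoints Controller never handled the change. When the Endpoints Controller starts again, checkLeftoverEndpoints detects these orphaned Endpoints (those with no corresponding Service) and puts fictitious Services back into the queue for syncService, which then cleans up the orphaned Endpoints.

The fictitious Service mentioned above simply uses the Endpoints key (namespace/name) as the Service key. This is one of the reasons why an Endpoints object and its Service must have the same name.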

func (e *EndpointController) checkLeftoverEndpoints() {
	list, err := e.endpointsLister.List(labels.Everything())
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("Unable to list endpoints (%v); orphaned endpoints will not be cleaned up. (They're pretty harmless, but you can restart this component if you want another attempt made.)", err))
		return
	}
	for _, ep := range list {
		if _, ok := ep.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]; ok {
			// when there are multiple controller-manager instances,
			// we observe that it will delete leader-election endpoints after 5min
			// and cause re-election
			// so skip the delete here
			// as leader-election only have endpoints without service
			continue
		}
		key, err := keyFunc(ep)
		if err != nil {
			utilruntime.HandleError(fmt.Errorf("Unable to get key for endpoint %#v", ep))
			continue
		}
		e.queue.Add(key)
	}
}

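One more point to note: when kube-controller-manager is deployed as multiple instances for HA, the controller-manager leader-election Endpoints have no corresponding Service. In this case the fictitious Service must not be enqueued, or these intentionally orphaned Endpoints would be cleaned up. These Endpoints therefore carry the annotation "control-plane.alpha.kubernetes.io/leader", and checkLeftoverEndpoints skips any Endpoints object that has it.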
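
The effect of checkLeftoverEndpoints combined with the first step of syncService can be illustrated without any Kubernetes dependency. This is a minimal sketch under stated assumptions: the `endpoints` type is a trimmed stand-in for v1.Endpoints, and `leftoverKeys` and `orphanKeys` are hypothetical helper names used only for illustration.

```go
package main

import "fmt"

// leaderAnnotation is the annotation key quoted in the text above.
const leaderAnnotation = "control-plane.alpha.kubernetes.io/leader"

// endpoints is a simplified stand-in for v1.Endpoints (assumption: only the
// fields needed for this illustration are modelled).
type endpoints struct {
	Namespace   string
	Name        string
	Annotations map[string]string
}

// leftoverKeys returns the namespace/name keys that would be enqueued,
// skipping Endpoints that carry the leader-election annotation.
func leftoverKeys(list []endpoints) []string {
	var keys []string
	for _, ep := range list {
		if _, ok := ep.Annotations[leaderAnnotation]; ok {
			continue
		}
		keys = append(keys, ep.Namespace+"/"+ep.Name)
	}
	return keys
}

// orphanKeys reports which keys have no Service with the same key. These are
// the Endpoints that the sync step would delete.
func orphanKeys(keys []string, services map[string]bool) []string {
	var orphans []string
	for _, k := range keys {
		if !services[k] {
			orphans = append(orphans, k)
		}
	}
	return orphans
}

func main() {
	list := []endpoints{
		{Namespace: "default", Name: "web"},
		{Namespace: "default", Name: "stale"},
		{Namespace: "kube-system", Name: "kube-controller-manager",
			Annotations: map[string]string{leaderAnnotation: "{}"}},
	}
	services := map[string]bool{"default/web": true}

	keys := leftoverKeys(list)
	fmt.Println(keys)                       // [default/web default/stale]
	fmt.Println(orphanKeys(keys, services)) // [default/stale]
}
```

Because the key is the only link between the two objects, an Endpoints object whose name differs from its Service would look orphaned to this logic.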

syncService: the Core Logic of the Endpoints Controller

The Add/Update/Delete event handlers for Services all add the Service key to the queue, where it waits for a worker to process it with syncService:

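Using the service key (namespace/name) taken from the queue, fetch the corresponding Service object from the indexer. If it is not found, call the API to delete the Endpoints object with the same key (namespace/name). This is the cleanup of orphaned Endpoints described under checkLeftoverEndpoints.
A Service matches Pods through its LabelSelector, and the matching Pods are turned into Endpoints subsets that are added to the Endpoints object, so Services without a LabelSelector are filtered out at this point.
Then use the Service's LabelSelector to fetch all matching Pods in the same namespace.
Check whether service.Spec.PublishNotReadyAddresses is true, or whether the Service annotation "service.alpha.kubernetes.io/tolerate-unready-endpoints" is true (t/T/True/TRUE/1). If so, the Service tolerates unready endpoints, meaning that Pods that are not Ready are also added to the Service's Endpoints.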

Note: the annotation "service.alpha.kubernetes.io/tolerate-unready-endpoints" will be deprecated in Kubernetes 1.13; after that only the .Spec.PublishNotReadyAddresses field is used.
Next, iterate over the Pods fetched above and build the Endpoints subsets from each Pod's IP, ContainerPorts and HostName together with the Service's ports. Note the following special cases:
1. Skip Pods whose pod.Status.PodIP is empty.
2. When tolerate Unready Endpoints is false, skip Pods that are marked for deletion (DeletionTimestamp != nil).
3. For a headless Service without Service ports, the Ports of the resulting EndpointSubset are empty.
4. When tolerate Unready Endpoints is true (even if the Pod is not Ready), or when the Pod is Ready, the Pod's EndpointAddress is added to the (ready) Addresses.
5. When tolerate Unready Endpoints is false and the Pod is not Ready:
 - If pod.Spec.RestartPolicy is Never, the Pod's EndpointAddress is added to NotReadyAddresses only while Status.Phase is not terminal (neither Failed nor Succeeded).
 - If pod.Spec.RestartPolicy is OnFailure, the Pod's EndpointAddress is added to NotReadyAddresses only while Status.Phase is not Succeeded.
 - In all other cases, the Pod's EndpointAddress is added to NotReadyAddresses.
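Fetch the Service's Endpoints object (currentEndpoints) from the indexer. If the indexer returns none, build an Endpoints object with the same name and labels as the Service; since it has not been stored yet, its ResourceVersion is empty.
If currentEndpoints has a non-empty ResourceVersion, compare currentEndpoints.Subsets and Labels with the subsets built above and Service.Labels using DeepEqual. If they are equal, no update is needed and the flow ends.
Otherwise, DeepCopy currentEndpoints into newEndpoints and replace its Subsets and Labels with the subsets built above and Service.Labels.
If currentEndpoints has an empty ResourceVersion, call the Create API to create the newEndpoints object from the previous step. If the ResourceVersion is non-empty, the Endpoints object already exists, so call the Update API to update it with newEndpoints.
The flow ends.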
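
The per-Pod rules and the final create/update decision described above can be condensed into two small functions. This is a minimal sketch of the rules as this article states them, not the controller's code: the `pod` type, `classify`, `shouldBeInEndpoints` and `action` are hypothetical names, and policies and phases are modelled as plain strings.

```go
package main

import "fmt"

// pod is a simplified stand-in for v1.Pod (assumption: only the fields the
// rules above look at are modelled).
type pod struct {
	IP            string
	Deleting      bool   // DeletionTimestamp != nil
	Ready         bool   // Ready condition is true
	RestartPolicy string // "Always", "OnFailure" or "Never"
	Phase         string // "Pending", "Running", "Succeeded" or "Failed"
}

// shouldBeInEndpoints covers rule 5: whether a not-Ready Pod still belongs
// in NotReadyAddresses.
func shouldBeInEndpoints(p pod) bool {
	switch p.RestartPolicy {
	case "Never":
		return p.Phase != "Failed" && p.Phase != "Succeeded"
	case "OnFailure":
		return p.Phase != "Succeeded"
	default:
		return true
	}
}

// classify returns "ready", "notReady" or "skip" for one Pod.
func classify(p pod, tolerateUnready bool) string {
	if p.IP == "" {
		return "skip" // rule 1
	}
	if !tolerateUnready && p.Deleting {
		return "skip" // rule 2
	}
	if tolerateUnready || p.Ready {
		return "ready" // rule 4
	}
	if shouldBeInEndpoints(p) {
		return "notReady" // rule 5
	}
	return "skip"
}

// action picks the API call from the ResourceVersion of the current
// Endpoints and whether subsets and labels are already equal.
func action(resourceVersion string, equal bool) string {
	if resourceVersion == "" {
		return "create"
	}
	if equal {
		return "none"
	}
	return "update"
}

func main() {
	running := pod{IP: "10.0.0.1", RestartPolicy: "Never", Phase: "Running"}
	done := pod{IP: "10.0.0.2", RestartPolicy: "Never", Phase: "Succeeded"}

	fmt.Println(classify(running, false)) // notReady
	fmt.Println(classify(done, false))    // skip
	fmt.Println(classify(done, true))     // ready
	fmt.Println(action("", false))        // create
	fmt.Println(action("42", true))       // none
	fmt.Println(action("42", false))      // update
}
```

The third call shows why tolerating unready endpoints matters for a custom router: even a Pod that has already Succeeded is published as a ready address.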

Pod Event Handler

Add Pod

Match the Services' LabelSelectors against the Pod's labels to find every Service the Pod belongs to, then add each Service key (namespace/name) to the queue to wait for sync.

// When a pod is added, figure out what services it will be a member of and
// enqueue them. obj must have *v1.Pod type.
func (e *EndpointController) addPod(obj interface{}) {
	pod := obj.(*v1.Pod)
	services, err := e.getPodServiceMemberships(pod)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("Unable to get pod %s/%s's service memberships: %v", pod.Namespace, pod.Name, err))
		return
	}
	for key := range services {
		e.queue.Add(key)
	}
}
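
The matching that getPodServiceMemberships relies on comes down to an equality-based selector check. The sketch below is an illustration under stated assumptions, not the lister's implementation: `service`, `selectorMatches` and `podServiceKeys` are hypothetical names, and a Service with a nil or empty selector is treated as matching nothing.

```go
package main

import (
	"fmt"
	"sort"
)

// service is a simplified stand-in for v1.Service.
type service struct {
	Namespace string
	Name      string
	Selector  map[string]string
}

// selectorMatches reports whether every key/value pair of selector is
// present in labels. Assumption: an empty selector matches nothing.
func selectorMatches(selector, labels map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// podServiceKeys returns the sorted namespace/name keys of all Services in
// the Pod's namespace whose selector matches the Pod's labels.
func podServiceKeys(podNamespace string, podLabels map[string]string, services []service) []string {
	var keys []string
	for _, svc := range services {
		if svc.Namespace != podNamespace {
			continue
		}
		if selectorMatches(svc.Selector, podLabels) {
			keys = append(keys, svc.Namespace+"/"+svc.Name)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	services := []service{
		{Namespace: "default", Name: "web", Selector: map[string]string{"app": "web"}},
		{Namespace: "default", Name: "web-canary", Selector: map[string]string{"app": "web", "track": "canary"}},
		{Namespace: "default", Name: "external"},
		{Namespace: "other", Name: "web", Selector: map[string]string{"app": "web"}},
	}
	labels := map[string]string{"app": "web", "track": "stable"}
	fmt.Println(podServiceKeys("default", labels, services)) // [default/web]
}
```

A single Pod can belong to several Services at once, which is why addPod enqueues a set of keys rather than one key.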

Update Pod

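If newPod.ResourceVersion equals oldPod.ResourceVersion, skip the event; nothing is updated.
Check whether the DeletionTimestamp, the Ready condition, or the EndpointAddress built from PodIP, Hostname, etc. differs between the old and new Pod. If any of them changed, podChangedFlag is true.
Check whether the Labels, Hostname or Subdomain differ between the old and new Pod. If any of them changed, labelChangedFlag is true.
If both podChangedFlag and labelChangedFlag are false, skip the event; nothing is updated.
Match the Services' LabelSelectors against the Pod's labels to find every Service that newPod belongs to (recorded as services). If labelChangedFlag is false, these keys are enqueued as they are. If labelChangedFlag is true, find the Services matched by oldPod (oldServices) the same way:
- If podChangedFlag is true, take the union of services and oldServices and add every Service key in it to the queue to wait for sync.
- If podChangedFlag is false, take the symmetric difference of services and oldServices and add every Service key in it to the queue to wait for sync.
The symmetric difference is computed as services.Difference(oldServices).Union(oldServices.Difference(services)).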
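
The set arithmetic for the labels-changed case is easy to check with a tiny string set. This is a minimal sketch using only the standard library; the `set` type and `servicesToSync` are hypothetical names that stand in for the set helpers the controller uses.

```go
package main

import (
	"fmt"
	"sort"
)

// set is a minimal string set (assumption: a stand-in for the string-set
// type used by the controller).
type set map[string]struct{}

func newSet(items ...string) set {
	s := set{}
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// union returns the elements found in s or o.
func (s set) union(o set) set {
	out := set{}
	for k := range s {
		out[k] = struct{}{}
	}
	for k := range o {
		out[k] = struct{}{}
	}
	return out
}

// difference returns the elements of s that are not in o.
func (s set) difference(o set) set {
	out := set{}
	for k := range s {
		if _, ok := o[k]; !ok {
			out[k] = struct{}{}
		}
	}
	return out
}

// sorted returns the elements in sorted order for stable printing.
func (s set) sorted() []string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// servicesToSync applies the rule for the case where the labels changed:
// the union if the Pod itself changed, else the symmetric difference.
func servicesToSync(oldServices, services set, podChanged bool) set {
	if podChanged {
		return services.union(oldServices)
	}
	return services.difference(oldServices).union(oldServices.difference(services))
}

func main() {
	oldServices := newSet("default/a", "default/b")
	services := newSet("default/b", "default/c")

	fmt.Println(servicesToSync(oldServices, services, true).sorted())  // [default/a default/b default/c]
	fmt.Println(servicesToSync(oldServices, services, false).sorted()) // [default/a default/c]
}
```

In the second call default/b is left out: the Pod was and still is a member and its address did not change, so that Service's Endpoints need no resync.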

Delete Pod

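If the object is still a fully recorded Pod, the logic is the same as addPod: match the Services' LabelSelectors against the Pod's labels to find every Service the Pod belongs to, then add each Service key (namespace/name) to the queue to wait for sync.
If the object is a tombstone (the Pod's final state is unrecorded), convert it to a v1.Pod and then call addPod. Compared with a normal Pod, the only extra step is the conversion from the tombstone to a v1.Pod.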

// When a pod is deleted, enqueue the services the pod used to be a member of.
// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.
func (e *EndpointController) deletePod(obj interface{}) {
	if _, ok := obj.(*v1.Pod); ok {
		// Enqueue all the services that the pod used to be a member
		// of. This happens to be exactly the same thing we do when a
		// pod is added.
		e.addPod(obj)
		return
	}
	// If we reached here it means the pod was deleted but its final state is unrecorded.
	tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
	if !ok {
		utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
		return
	}
	pod, ok := tombstone.Obj.(*v1.Pod)
	if !ok {
		utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a Pod: %#v", obj))
		return
	}
	glog.V(4).Infof("Enqueuing services of deleted pod %s/%s having final state unrecorded", pod.Namespace, pod.Name)
	e.addPod(pod)
}

Core Structs

Several structs are involved here and they are easy to confuse. The original post compares them side by side in a simple diagram.
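
The nesting of these types can also be written out as Go declarations. The following is a trimmed sketch that mirrors the shape of the core/v1 Endpoints types, with most fields omitted and metadata flattened into plain fields; `countAddresses` is a hypothetical helper added only for illustration.

```go
package main

import "fmt"

// ObjectReference points back to the Pod behind an address (trimmed).
type ObjectReference struct {
	Kind      string
	Namespace string
	Name      string
}

// EndpointAddress is one Pod IP, optionally with a hostname.
type EndpointAddress struct {
	IP        string
	Hostname  string
	TargetRef *ObjectReference
}

// EndpointPort is one port shared by all addresses of a subset.
type EndpointPort struct {
	Name     string
	Port     int32
	Protocol string
}

// EndpointSubset groups addresses that expose the same set of ports.
type EndpointSubset struct {
	Addresses         []EndpointAddress // ready Pods
	NotReadyAddresses []EndpointAddress // Pods that are not Ready
	Ports             []EndpointPort
}

// Endpoints has the same namespace/name as its Service.
// (Assumption: object metadata is flattened into plain fields here.)
type Endpoints struct {
	Namespace string
	Name      string
	Labels    map[string]string
	Subsets   []EndpointSubset
}

// countAddresses sums the ready and not-ready addresses over all subsets.
func countAddresses(ep Endpoints) (ready, notReady int) {
	for _, s := range ep.Subsets {
		ready += len(s.Addresses)
		notReady += len(s.NotReadyAddresses)
	}
	return ready, notReady
}

func main() {
	ep := Endpoints{
		Namespace: "default",
		Name:      "web",
		Subsets: []EndpointSubset{{
			Addresses: []EndpointAddress{{
				IP:        "10.0.0.1",
				TargetRef: &ObjectReference{Kind: "Pod", Namespace: "default", Name: "web-0"},
			}},
			NotReadyAddresses: []EndpointAddress{{IP: "10.0.0.2"}},
			Ports:             []EndpointPort{{Name: "http", Port: 8080, Protocol: "TCP"}},
		}},
	}
	ready, notReady := countAddresses(ep)
	fmt.Println(ready, notReady) // 1 1
}
```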

Summary

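By analyzing the Endpoints Controller source code we learned many of its details, such as how Service and Pod events are handled, how orphaned Pods are treated, and what effect a change in Pod labels has. This helps when writing our own Ingress component that watches Endpoints and integrates with our company's internal routing component.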

Original link: https://my.oschina.net/jxcdwangtao/blog/2721719
