Storm Source Code Analysis: Topology Submit - Executor

2017-05-01

Inside the worker, each executor is created by calling (executor/mk-executor worker e).

(defn mk-executor [worker executor-id]
  (let [executor-data (mk-executor-data worker executor-id) ;;1.mk-executor-data 
        _ (log-message "Loading executor " (:component-id executor-data) ":" (pr-str executor-id))
        task-datas (->> executor-data
                        :task-ids
                        (map (fn [t] [t (task/mk-task executor-data t)])) ;;2.mk-task 
                        (into {})
                        (HashMap.))
        _ (log-message "Loaded executor tasks " (:component-id executor-data) ":" (pr-str executor-id))
        report-error-and-die (:report-error-and-die executor-data)
        component-id (:component-id executor-data)

        ;; 3. create the threads

        ;; starting the batch-transfer->worker ensures that anything publishing to that queue 
        ;; doesn't block (because it's a single threaded queue and the caching/consumer started
        ;; trick isn't thread-safe)
        system-threads [(start-batch-transfer->worker-handler! worker executor-data)]
        handlers (with-error-reaction report-error-and-die
                   (mk-threads executor-data task-datas))
        threads (concat handlers system-threads)]

    
    ;; setup-ticks! uses schedule-recurring to emit SYSTEM_TICK periodically (this triggers rotation of the spout's pending map)

    (setup-ticks! worker executor-data)

1. mk-executor-data

(defn mk-executor-data [worker executor-id]
  (let [worker-context (worker-context worker)
        task-ids (executor-id->tasks executor-id) ;; the tasks this executor contains
        component-id (.getComponentId worker-context (first task-ids)) ;; the component it belongs to
        storm-conf (normalized-component-conf (:storm-conf worker) worker-context component-id)
        executor-type (executor-type worker-context component-id) ;; executor type: bolt or spout
        batch-transfer->worker (disruptor/disruptor-queue   ;; the executor's send buffer queue
                                  (storm-conf TOPOLOGY-EXECUTOR-SEND-BUFFER-SIZE)
                                  :claim-strategy :single-threaded
                                  :wait-strategy (storm-conf TOPOLOGY-DISRUPTOR-WAIT-STRATEGY))
        ]
    (recursive-map
     :worker worker
     :worker-context worker-context
     :executor-id executor-id
     :task-ids task-ids
     :component-id component-id
     :open-or-prepare-was-called? (atom false)
     :storm-conf storm-conf
     :receive-queue ((:executor-receive-queue-map worker) executor-id) ;; look up the disruptor queue for this executor
     :storm-id (:storm-id worker)
     :conf (:conf worker)
     :shared-executor-data (HashMap.)
     :storm-active-atom (:storm-active-atom worker)
     :batch-transfer-queue batch-transfer->worker
     :transfer-fn (mk-executor-transfer-fn batch-transfer->worker) ;;(1.1) 
     :suicide-fn (:suicide-fn worker)
     :storm-cluster-state (cluster/mk-storm-cluster-state (:cluster-state worker))
     :type executor-type
     ;; TODO: should refactor this to be part of the executor specific map (spout or bolt with :common field)
     :stats (mk-executor-stats <> (sampling-rate storm-conf)) ;;(1.2)
     :interval->task->metric-registry (HashMap.)
     :task->component (:task->component worker)
     :stream->component->grouper (outbound-components worker-context component-id)
     :report-error (throttled-report-error-fn <>)
     :report-error-and-die (fn [error] ;; writes the error under the errors directory in ZooKeeper so that other daemons can see it
                             ((:report-error <>) error)
                             ((:suicide-fn <>)))
     :deserializer (KryoTupleDeserializer. storm-conf worker-context)
     :sampler (mk-stats-sampler storm-conf) ;;1.3 mk-stats-sampler 
     ;; TODO: add in the executor-specific stuff in a :specific... or make a spout-data, bolt-data function?
     )))

1.1 mk-executor-transfer-fn

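The executor buffers the tuples it needs to send in the batch-transfer->worker queue.
As the comments below explain, an extra overflow buffer is created so that the component does not block when a large number of tuples have not yet been processed. Only when this buffer is full as well does the spout stop calling nextTuple (the overflow buffer matters most for spout executors).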

        ;; the overflow buffer is used to ensure that spouts never block when emitting
        ;; this ensures that the spout can always clear the incoming buffer (acks and fails), which
        ;; prevents deadlock from occuring across the topology (e.g. Spout -> Bolt -> Acker -> Spout, and all
        ;; buffers filled up)
        ;; when the overflow buffer is full, spouts stop calling nextTuple until it's able to clear the overflow buffer
        ;; this limits the size of the overflow buffer to however many tuples a spout emits in one call of nextTuple, 
        ;; preventing memory issues
        overflow-buffer (LinkedList.)]

It returns a fn that puts [task, tuple] either into the overflow-buffer or into the batch-transfer->worker queue.

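Note that this is the executor's transfer-fn, which is different from the worker's transfer-fn; the naming is unfortunate and easy to confuse.
The executor's transfer-fn buffers tuples in the executor's batch-transfer->worker queue, whereas the worker's transfer-fn sends tuples to the worker's transfer queue.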

;; in its own function so that it can be mocked out by tracked topologies
(defn mk-executor-transfer-fn [batch-transfer->worker]
  (fn this
    ([task tuple block? ^List overflow-buffer]
      (if (and overflow-buffer (not (.isEmpty overflow-buffer))) ;; overflow-buffer exists and is non-empty, so the queue is already full: append to overflow-buffer directly
        (.add overflow-buffer [task tuple])
        (try-cause
          (disruptor/publish batch-transfer->worker [task tuple] block?)
        (catch InsufficientCapacityException e
          (if overflow-buffer
            (.add overflow-buffer [task tuple])
            (throw e))
          ))))
    ([task tuple overflow-buffer]
      (this task tuple (nil? overflow-buffer) overflow-buffer))
    ([task tuple]
      (this task tuple nil)
      )))
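
The spill-over behaviour can be tried out without Storm. The following is only a sketch under stated assumptions: a java.util.concurrent.ArrayBlockingQueue stands in for the disruptor queue (its non-blocking .offer returns false when full, where disruptor/publish throws InsufficientCapacityException), and mk-transfer-fn-sketch is a made-up name, not part of Storm.

```clojure
(import '(java.util LinkedList)
        '(java.util.concurrent ArrayBlockingQueue))

;; Hypothetical stand-in for mk-executor-transfer-fn.
;; Assumption: ArrayBlockingQueue replaces the disruptor queue.
(defn mk-transfer-fn-sketch [^ArrayBlockingQueue queue]
  (fn [task tuple ^LinkedList overflow-buffer]
    (cond
      ;; no overflow buffer supplied: block until there is room
      (nil? overflow-buffer)
      (.put queue [task tuple])

      ;; something has already spilled over: keep appending there to preserve order
      (not (.isEmpty overflow-buffer))
      (.add overflow-buffer [task tuple])

      ;; non-blocking publish succeeded
      (.offer queue [task tuple])
      true

      ;; queue is full: spill into the overflow buffer
      :else
      (.add overflow-buffer [task tuple]))))

(def q (ArrayBlockingQueue. 2))
(def overflow (LinkedList.))
(def transfer! (mk-transfer-fn-sketch q))

;; five tuples for task 1 into a queue of capacity 2
(doseq [i (range 5)]
  (transfer! 1 i overflow))
```

After the loop the queue holds the first two pairs and the overflow buffer holds the remaining three, in emission order.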

1.2 mk-executor-stats <> (sampling-rate storm-conf)

See: Storm Source Code Analysis: Stats (backtype.storm.stats)

1.3 mk-stats-sampler

Creates a sampler from the sampling rate in conf.

(defn mk-stats-sampler [conf]
  (even-sampler (sampling-rate conf)))

What gets created here is an even-sampler:

(defn even-sampler [freq]
  (let [freq (int freq)
        start (int 0)
        r (java.util.Random.)
        curr (MutableInt. -1)
        target (MutableInt. (.nextInt r freq))] ;; random value in [0, freq)
    (with-meta
      (fn []
        (let [i (.increment curr)]
          (when (>= i freq)
            (.set curr start)
            (.set target (.nextInt r freq))))
        (= (.get curr) (.get target))) ;; FP has no assignment operator, so equality is written = instead of ==
      {:rate freq})))

(defn sampler-rate [sampler]
  (:rate (meta sampler)))

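even-sampler returns a fn and attaches metadata ({:rate freq}) to it via with-meta.
So (:rate (meta sampler)) retrieves the rate from the sampler's metadata.

The sampler is just a fn; every call returns (= curr target).
curr counts up from start; until it reaches target, calling the fn returns false.
When curr equals target, the call returns true.
When curr reaches freq, a new random target is generated and curr is reset to 0.

The net effect is that repeated calls produce exactly one true at a random position within every freq calls, and false otherwise.
That is what makes it a sampler: a sample is taken only when it returns true.

A simple sampler with, say, a 20% rate could just take one item and skip the next four, but such sampling is too regular: if the data happens to follow the same pattern, for example every fifth item being identical, the samples are biased.
This implementation adds randomness to avoid that.
The use of closures and metadata here is also worth borrowing.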
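
The one-hit-per-window behaviour can be checked with a small stand-alone sketch. It is not Storm's code: it assumes plain atoms in place of MutableInt and rand-int in place of java.util.Random, and the name even-sampler-sketch is made up for illustration.

```clojure
;; Hypothetical re-implementation of even-sampler using atoms.
(defn even-sampler-sketch [freq]
  (let [curr   (atom -1)                 ; position inside the current window
        target (atom (rand-int freq))]   ; random value in [0, freq)
    (with-meta
      (fn []
        ;; advance; when the window is exhausted, start a new one with a new target
        (when (>= (swap! curr inc) freq)
          (reset! curr 0)
          (reset! target (rand-int freq)))
        (= @curr @target))
      {:rate freq})))

(def sampler (even-sampler-sketch 5))

;; call the sampler 50 times and count the hits in each window of 5 calls
(def hits-per-window
  (->> (repeatedly 50 sampler)
       (partition 5)
       (map #(count (filter true? %)))
       doall))
```

Every window of five calls contains exactly one true, but its position inside the window varies from window to window, and (:rate (meta sampler)) still returns the configured rate.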

2. mk-task: creating tasks

(task/mk-task executor-data t)

See: Storm Source Code Analysis: Topology Submit - Task

3. Creating threads

3.1 The batch-transfer-queue handler thread (the executor's send thread)

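Messages are taken from the batch-transfer-queue; until the end of the batch is reached, they are collected in the ArrayList held by cached-emit.
When the end of the batch is reached, the worker's transfer-fn is used to send the messages to the transfer-queue (a spout presumably never sends tuples to itself).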

(defn start-batch-transfer->worker-handler! [worker executor-data]
  (let [worker-transfer-fn (:transfer-fn worker)
        cached-emit (MutableObject. (ArrayList.)) ;; caches all messages until the end of the batch
        storm-conf (:storm-conf executor-data)
        serializer (KryoTupleSerializer. storm-conf (:worker-context executor-data))
        ]
    (disruptor/consume-loop*
      (:batch-transfer-queue executor-data)
      (disruptor/handler [o seq-id batch-end?]
        (let [^ArrayList alist (.getObject cached-emit)]
          (.add alist o)
          (when batch-end?
            (worker-transfer-fn serializer alist)
            (.setObject cached-emit (ArrayList.))
            )))
      :kill-fn (:report-error-and-die executor-data))))
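
The accumulate-then-flush pattern used by this handler can be sketched in isolation. The sketch assumes an atom in place of MutableObject and a plain callback in place of the worker's transfer-fn; mk-batching-handler, flushed and handle! are illustrative names only.

```clojure
;; Hypothetical handler: collect events, hand them over in one go at batch end.
(defn mk-batching-handler [flush-fn]
  (let [cached (atom [])]
    (fn [o batch-end?]
      (swap! cached conj o)
      (when batch-end?
        (flush-fn @cached)      ; pass the whole batch downstream
        (reset! cached [])))))  ; start a fresh batch

(def flushed (atom []))
(def handle! (mk-batching-handler #(swap! flushed conj %)))

;; :a and :b form the first batch, :c is a batch of its own
(doseq [[o end?] [[:a false] [:b true] [:c true]]]
  (handle! o end?))
```

Flushing only at batch end means the downstream function is invoked once per batch rather than once per message.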

Worker, transfer-fn

Tasks are split into local and remote.
For local tasks, local-transfer delivers the messages to the corresponding receive-queue.
For remote tasks, disruptor/publish puts them on the transfer-queue.

Storm uses Kryo as its Java serialization framework (http://code.google.com/p/kryo/)

(defn mk-transfer-fn [worker]
  (let [local-tasks (-> worker :task-ids set)
        local-transfer (:transfer-local-fn worker)
        ^DisruptorQueue transfer-queue (:transfer-queue worker)]
    (fn [^KryoTupleSerializer serializer tuple-batch]
      (let [local (ArrayList.)
            remote (ArrayList.)]
        (fast-list-iter [[task tuple :as pair] tuple-batch]
          (if (local-tasks task)
            (.add local pair)
            (.add remote pair)
            ))
        (local-transfer local)
        ;; not using map because the lazy seq shows up in perf profiles
        (let [serialized-pairs (fast-list-for [[task ^TupleImpl tuple] remote] [task (.serialize serializer tuple)])]
          (disruptor/publish transfer-queue serialized-pairs)
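
The local/remote split on its own can be sketched as a pure function. This is an illustration under assumptions, not the worker's code: split-local-remote is a made-up name, it uses group-by instead of the fast-list-iter loop, and it leaves out the Kryo serialization of remote tuples and the actual publishing.

```clojure
;; Hypothetical helper: partition [task tuple] pairs by whether the task is local.
(defn split-local-remote [local-tasks tuple-batch]
  (let [{local true remote false}
        (group-by (fn [[task _]] (contains? local-tasks task)) tuple-batch)]
    {:local  (vec local)      ; would go to local-transfer
     :remote (vec remote)}))  ; would be serialized and published to transfer-queue

(def result
  (split-local-remote #{1 2} [[1 :a] [3 :b] [2 :c]]))
```

Tasks 1 and 2 are local to this worker in the example, so only the pair for task 3 lands in :remote.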

3.2 The executor's execution threads

The call to mk-threads is wrapped in a try...catch; if an exception is thrown, the error is written to ZooKeeper so that other daemons learn of it promptly.

handlers (with-error-reaction report-error-and-die
(mk-threads executor-data task-datas))
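
A minimal sketch of what such a wrapper can look like, assuming it simply evaluates its body and hands any Throwable to the supplied function. The macro name with-error-reaction-sketch and the seen atom are illustrative, and Storm's own macro may differ in detail.

```clojure
;; Hypothetical macro: run body, route any Throwable to the function afn.
(defmacro with-error-reaction-sketch [afn & body]
  `(try
     ~@body
     (catch Throwable t#
       (~afn t#))))

(def seen (atom nil))

;; the reaction function records the message instead of reporting to ZooKeeper
(with-error-reaction-sketch
  (fn [^Throwable t] (reset! seen (.getMessage t)))
  (throw (RuntimeException. "boom")))

;; when nothing is thrown, the value of the body is returned unchanged
(def ok-result
  (with-error-reaction-sketch identity (+ 1 2)))
```

In the executor the reaction function is report-error-and-die, so a failure while creating the threads is reported and then the worker kills itself.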

This article is excerpted from cnblogs (博客园); original publication date: 2013-08-05

Original link: https://yq.aliyun.com/articles/85682
