To understand stages, the key is to understand Narrow Dependency versus Wide Dependency, which may still feel hard to grasp at first.
The crux is whether a shuffle is needed: without a shuffle, computation can be pipelined and parallelized freely, so the stage boundaries fall exactly where shuffles occur.
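As a minimal, self-contained sketch (the local master, app name, and sample data are my own assumptions, not from the original post; the `SparkContext._` import is for the 0.8/0.9-era pair-RDD implicits this post's source code belongs to), you can inspect an RDD's dependencies to see which kind each transformation creates:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed in the 0.8/0.9-era API

object DepDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "dep-demo")

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val mapped  = pairs.mapValues(_ + 1)    // narrow: OneToOneDependency, same stage
    val reduced = mapped.reduceByKey(_ + _) // wide: ShuffleDependency, new stage boundary

    println(mapped.dependencies.head)  // a OneToOneDependency (narrow)
    println(reduced.dependencies.head) // a ShuffleDependency (wide)
    sc.stop()
  }
}
```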
Stages come in two kinds:
shuffle map stage, in which case its tasks' results are input for another stage
That is, any non-final stage: other stages come after it, so its output must be shuffled and fed to them as input.
result stage, in which case its tasks directly compute the action that initiated a job (e.g. count(), save(), etc)
The final stage: it produces no shuffle output, but directly computes the result or saves it.
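Continuing the sketch above (the input path is a hypothetical placeholder), a single action produces both kinds at once: reduceByKey forces a shuffle, so count() triggers a shuffle map stage for the flatMap/map side, followed by a result stage for the count itself. toDebugString indents the lineage at that shuffle boundary:

```scala
val words  = sc.textFile("input.txt").flatMap(_.split(" ")).map((_, 1))
val counts = words.reduceByKey(_ + _)

println(counts.toDebugString) // lineage printout breaks where the shuffle (stage boundary) sits
counts.count()                // one job: a shuffle map stage, then the result stage
```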
1 Stage class
The doc comment below explains it clearly.
Note that a Stage carries only a single RDD parameter, the final RDD, rather than the whole chain of RDDs.
That is because within one stage every RDD comes from map-like (narrow) transformations: the partitioning never changes, the data simply has a different map function applied at each step (the sketch after the class definition below checks this).
So for the task scheduler, the state of that one final RDD is enough to represent the whole stage.
/**
 * A stage is a set of independent tasks all computing the same function that need to run as part
 * of a Spark job, where all the tasks have the same shuffle dependencies. Each DAG of tasks run
 * by the scheduler is split up into stages at the boundaries where shuffle occurs, and then the
 * DAGScheduler runs these stages in topological order.
 *
 * Each Stage can either be a shuffle map stage, in which case its tasks' results are input for
 * another stage, or a result stage, in which case its tasks directly compute the action that
 * initiated a job (e.g. count(), save(), etc). For shuffle map stages, we also track the nodes
 * that each output partition is on.
 *
 * Each Stage also has a jobId, identifying the job that first submitted the stage. When FIFO
 * scheduling is used, this allows Stages from earlier jobs to be computed first or recovered
 * faster on failure.
 */
private[spark] class Stage(
    val id: Int,
    val rdd: RDD[_], // the final RDD of the stage
    val shuffleDep: Option[ShuffleDependency[_,_]], // Output shuffle if stage is a map stage
    val parents: List[Stage], // parent stages
    val jobId: Int,
    callSite: Option[String])
  extends Logging {

  // Whether this is a shuffle map stage is decided solely by whether it has a shuffleDep
  val isShuffleMap = shuffleDep != None
  val numPartitions = rdd.partitions.size
  // Buffers the MapStatus of each map task in the shuffle, one list per output partition
  val outputLocs = Array.fill[List[MapStatus]](numPartitions)(Nil)
  // How many partitions already have their map output registered
  var numAvailableOutputs = 0

  private var nextAttemptId = 0 // id to hand out for this stage's next task-set attempt

  // A result stage is always available; a shuffle map stage only once
  // every one of its partitions has produced a map output
  def isAvailable: Boolean = {
    if (!isShuffleMap) {
      true
    } else {
      numAvailableOutputs == numPartitions
    }
  }
}
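A quick check of the claim above, that a stage's chain of narrow, map-like transformations never changes the partitioning (again assuming the local sc from the first sketch):

```scala
val base      = sc.parallelize(1 to 100, 4)
val pipelined = base.map(_ * 2).filter(_ > 10) // narrow deps only: stays in one stage

// The pipelined RDD keeps the same 4 partitions as its source, which is
// why the final RDD alone can stand in for the whole stage.
assert(pipelined.partitions.length == base.partitions.length)
```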
2 newStage
If the new stage is a shuffle map stage, the shuffle must be registered with the mapOutputTracker here.
/**
 * Create a Stage for the given RDD, either as a shuffle map stage (for a ShuffleDependency) or
 * as a result stage for the final RDD used directly in an action. The stage will also be
 * associated with the provided jobId.
 */
private def newStage(
    rdd: RDD[_],
    shuffleDep: Option[ShuffleDependency[_,_]],
    jobId: Int,
    callSite: Option[String] = None)
  : Stage =
{
  if (shuffleDep != None) {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.origin + ")")
    mapOutputTracker.registerShuffle(shuffleDep.get.shuffleId, rdd.partitions.size)
  }
  val id = nextStageId.getAndIncrement()
  val stage = new Stage(id, rdd, shuffleDep, getParentStages(rdd, jobId), jobId, callSite)
  stageIdToStage(id) = stage
  stageToInfos(stage) = StageInfo(stage)
  stage
}
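mapOutputTracker.registerShuffle is internal to Spark, but the bookkeeping it performs is simple: reserve one slot per map partition, to be filled in with a MapStatus as each map task finishes. A toy stand-in (all names below are illustrative, not Spark's actual internals) makes the idea concrete, including why registration has to wait until the partition count is known:

```scala
import scala.collection.mutable

// Toy stand-in for a map task's output metadata (real Spark uses MapStatus).
case class ToyMapStatus(location: String, sizes: Array[Long])

class ToyMapOutputTracker {
  private val shuffles = mutable.Map[Int, Array[Option[ToyMapStatus]]]()

  // Called when a shuffle map stage is created. The number of map partitions
  // is only known once the stage's final RDD exists, which is why newStage
  // (and not the RDD constructor) performs the registration.
  def registerShuffle(shuffleId: Int, numMaps: Int): Unit =
    shuffles(shuffleId) = Array.fill(numMaps)(None)

  // Called as each map task finishes.
  def registerMapOutput(shuffleId: Int, mapId: Int, status: ToyMapStatus): Unit =
    shuffles(shuffleId)(mapId) = Some(status)

  // The reduce side asks where every map output lives before fetching.
  def getMapOutputs(shuffleId: Int): Array[Option[ToyMapStatus]] =
    shuffles(shuffleId)
}

val tracker = new ToyMapOutputTracker
tracker.registerShuffle(shuffleId = 0, numMaps = 4) // what newStage does
tracker.registerMapOutput(0, 2, ToyMapStatus("host-a", Array(10L, 20L)))
```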
3 getMissingParentStages
Starting from the final stage's RDD and walking its dependencies, this finds every parent stage whose output is not yet available.
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      if (getCacheLocs(rdd).contains(Nil)) { // only recurse if some partition of this RDD is not cached yet
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_,_] =>
              // A ShuffleDependency means we have hit the boundary of a new stage.
              // getShuffleMapStage checks shuffleToMapStage: it returns the stage
              // if it was already created, otherwise it calls newStage.
              val mapStage = getShuffleMapStage(shufDep, stage.jobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            case narrowDep: NarrowDependency[_] =>
              // A NarrowDependency stays within the current stage, so keep walking up.
              visit(narrowDep.rdd)
          }
        }
      }
    }
  }
  visit(stage.rdd)
  missing.toList
}
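The traversal is easy to replay on a toy dependency graph (everything below is illustrative, not Spark's classes): narrow edges are followed within the stage, while shuffle edges stop the walk and record the root of a parent stage:

```scala
import scala.collection.mutable

// Toy lineage: an RDD node plus its dependencies.
sealed trait ToyDep
case class Narrow(parent: ToyRdd)  extends ToyDep
case class Shuffle(parent: ToyRdd) extends ToyDep
case class ToyRdd(name: String, deps: List[ToyDep])

// Mirror of getMissingParentStages: collect the RDDs sitting just behind a
// shuffle boundary (each would head a parent shuffle map stage).
def parentStageRoots(finalRdd: ToyRdd): List[ToyRdd] = {
  val missing = mutable.ListBuffer[ToyRdd]()
  val visited = mutable.Set[ToyRdd]()
  def visit(rdd: ToyRdd): Unit = {
    if (!visited(rdd)) {
      visited += rdd
      rdd.deps.foreach {
        case Shuffle(parent) => missing += parent // new stage boundary
        case Narrow(parent)  => visit(parent)     // same stage: keep walking
      }
    }
  }
  visit(finalRdd)
  missing.toList
}

// map -> reduceByKey -> map: one shuffle, hence one parent stage root.
val source  = ToyRdd("source", Nil)
val mapped  = ToyRdd("mapped", List(Narrow(source)))
val reduced = ToyRdd("reduced", List(Shuffle(mapped)))
val result  = ToyRdd("result", List(Narrow(reduced)))
println(parentStageRoots(result).map(_.name)) // List(mapped)
```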
This article is excerpted from 博客园 (cnblogs); original publication date: 2013-12-26.