Spark源码分析 -- PairRDD-低调大师

Spark源码分析 -- PairRDD

2017-05-01 599

和一般RDD最大的不同就是有两个泛型参数, [K, V]表示pair的概念
关键的function是, combineByKey, 所有pair相关操作的抽象

combine是这样的操作, Turns an RDD[(K, V)] into a result of type RDD[(K, C)]
其中C有可能只是简单类型, 但经常是seq, 比如(Int, Int) to (Int, Seq[Int])

下面来看看combineByKey的参数,
首先需要用户自定义一些操作,
createCombiner: V => C, C不存在的情况下, 比如通过V创建seq C
mergeValue: (C, V) => C, 当C已经存在的情况下, 需要merge, 比如把item V加到seq C中, 或者叠加
mergeCombiners: (C, C) => C, 合并两个C
partitioner: Partitioner, Shuffle时需要的Partitioner
mapSideCombine: Boolean = true, 为了减小传输量, 很多combine可以在map端先做, 比如叠加, 可以先在一个partition中把所有相同的key的value叠加, 再shuffle
serializerClass: String = null, 传输需要序列化, 用户可以自定义序列化类

/**
 * Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
 * Import `org.apache.spark.SparkContext._` at the top of your program to use these functions.
 */
class PairRDDFunctions[K: ClassManifest, V: ClassManifest](self: RDD[(K, V)])
  extends Logging
  with SparkHadoopMapReduceUtil
  with Serializable {

  /**
   * Generic function to combine the elements for each key using a custom set of aggregation
   * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C
   * Note that V and C can be different -- for example, one might group an RDD of type
   * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
   *
   * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
   * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
   * - `mergeCombiners`, to combine two C's into a single one.
   *
   * In addition, users can control the partitioning of the output RDD, and whether to perform
   * map-side aggregation (if a mapper can produce multiple items with the same key).
   */
  def combineByKey[C](createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializerClass: String = null): RDD[(K, C)] = {

    val aggregator = new Aggregator[K, V, C](createCombiner, mergeValue, mergeCombiners) //1.Aggregator

    //RDD本身的partitioner和传入的partitioner相等时, 即不需要重新shuffle, 直接map即可
    if (self.partitioner == Some(partitioner)) {  
      self.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true) //2. mapPartitions, map端直接调用combineValuesByKey
    } else if (mapSideCombine) { //如果需要mapSideCombine
      val combined = self.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true) //先在partition内部做mapSideCombine
      val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner).setSerializer(serializerClass) //3. ShuffledRDD, 进行shuffle
      partitioned.mapPartitions(aggregator.combineCombinersByKey, preservesPartitioning = true) //Shuffle完后, 在reduce端再做一次combine, 使用combineCombinersByKey
    } else {
      // Don't apply map-side combiner.和上面的区别就是不做mapSideCombine
      // A sanity check to make sure mergeCombiners is not defined.
      assert(mergeCombiners == null)
      val values = new ShuffledRDD[K, V, (K, V)](self, partitioner).setSerializer(serializerClass)
      values.mapPartitions(aggregator.combineValuesByKey, preservesPartitioning = true)
    }
  }
}

1. Aggregator

在combineByKey中, 首先创建Aggregator, 其实在Aggregator中封装了两个函数,
combineValuesByKey, 用于处理将V加入到C的case, 比如加入一个item到一个seq里面, 用于map端
combineCombinersByKey, 用于处理两个C合并, 比如两个seq合并, 用于reduce端

case class Aggregator[K, V, C] (
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C) {

  def combineValuesByKey(iter: Iterator[_ <: Product2[K, V]]) : Iterator[(K, C)] = {
    val combiners = new JHashMap[K, C]
    for (kv <- iter) {
      val oldC = combiners.get(kv._1)
      if (oldC == null) {
        combiners.put(kv._1, createCombiner(kv._2))
      } else {
        combiners.put(kv._1, mergeValue(oldC, kv._2))
      }
    }
    combiners.iterator
  }

  def combineCombinersByKey(iter: Iterator[(K, C)]) : Iterator[(K, C)] = {
    val combiners = new JHashMap[K, C]
    iter.foreach { case(k, c) =>
      val oldC = combiners.get(k)
      if (oldC == null) {
        combiners.put(k, c)
      } else {
        combiners.put(k, mergeCombiners(oldC, c))
      }
    }
    combiners.iterator
  }
}

2. mapPartitions

mapPartitions其实就是使用MapPartitionsRDD
做的事情就是对当前partition执行map函数f, Iterator[T] => Iterator[U]
比如, 执行combineValuesByKey: Iterator[_ <: Product2[K, V]] to Iterator[(K, C)]

  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   */
  def mapPartitions[U: ClassManifest](f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] =
    new MapPartitionsRDD(this, sc.clean(f), preservesPartitioning)

private[spark]
class MapPartitionsRDD[U: ClassManifest, T: ClassManifest](
    prev: RDD[T],
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner =
    if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext) =
    f(firstParent[T].iterator(split, context)) // 对于map,就是调用f

3. ShuffledRDD

Shuffle实际上是由系统的shuffleFetcher完成的, Spark的抽象封装非常的好
所以在这里看不到Shuffle具体是怎么样做的, 这个需要分析到shuffleFetcher时候才能看到
因为每个shuffle是有一个全局的shuffleid的
所以在compute里面, 你只是看到用BlockStoreShuffleFetcher根据shuffleid和partitionid直接fetch到shuffle过后的数据

/**
* The resulting RDD from a shuffle (e.g. repartitioning of data).
* @param prev the parent RDD.
* @param part the partitioner used to partition the RDD
* @tparam K the key class.
* @tparam V the value class.
*/
class ShuffledRDD[K, V, P <: Product2[K, V] : ClassManifest](
    @transient var prev: RDD[P],
    part: Partitioner)
  extends RDD[P](prev.context, Nil) {

  override val partitioner = Some(part)  
  //ShuffleRDD会进行repartition, 所以从Partitioner中取出新的part数目

  //并用Array.tabulate动态创建相应个数的ShuffledRDDPartition
  override def getPartitions: Array[Partition] = {
    Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i)) 
  }
  
  override def compute(split: Partition, context: TaskContext): Iterator[P] = {
    val shuffledId = dependencies.head.asInstanceOf[ShuffleDependency[K, V]].shuffleId
    SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context.taskMetrics,
      SparkEnv.get.serializerManager.get(serializerClass))
  }
}

ShuffledRDDPartition没啥区别, 一样只是记录id

private[spark] class ShuffledRDDPartition(val idx: Int) extends Partition {
  override val index = idx
  override def hashCode(): Int = idx
}

下面再来看一下, 如果使用combineByKey来实现其他的操作的,

group

group是比较典型的例子, (Int, Int) to (Int, Seq[Int])
由于groupByKey不使用map side combine, 因为这样也无法减少传输空间, 所以不需要实现mergeCombiners

  /**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    def createCombiner(v: V) = ArrayBuffer(v) //创建seq
    def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v //添加item到seq
    val bufs = combineByKey[ArrayBuffer[V]](
      createCombiner _, mergeValue _, null, partitioner, mapSideCombine=false)
    bufs.asInstanceOf[RDD[(K, Seq[V])]]
  }

reduce

reduce是更简单的一种情况, 只是两个值合并成一个值, (Int, Int V) to (Int, Int C), 比如叠加
所以createCombiner很简单, 就是直接返回v
而mergeValue和mergeCombiners逻辑是相同的, 没有区别

  /**
   * Merge the values for each key using an associative reduce function. This will also perform
   * the merging locally on each mapper before sending results to a reducer, similarly to a
   * "combiner" in MapReduce.
   */
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = {
    combineByKey[V]((v: V) => v, func, func, partitioner)
  }

本文章摘自博客园，原文发布日期：2013-12-24

微信关注我们

原文链接：https://yq.aliyun.com/articles/85758

转载内容版权归作者及来源网站所有！

低调大师中文资讯倾力打造互联网数据资讯、行业资源、电子商务、移动互联网、网络营销平台。持续更新报道IT业界、互联网、市场资讯、驱动更新,是最及时权威的产业资讯及硬件资讯报道平台。

Apache Kylin权威指南导读

前言 “麒麟出没，必有祥瑞。” ——中国古谚语 “于我而言，与Apache Kylin团队一起合作使Kylin通过孵化成为顶级项目是非常激动人心的，诚然，Kylin在技术方面非常振奋人心，但同样令人兴奋的是Kylin代表了亚洲国家，特别是中国，在开源社区中越来越高的参与度。” ——Ted Dunning Apache孵化项目副总裁，MapR首席应用架构师今天，随着移动互联网、物联网、AI等技术的快速兴起，数据成为了所有这些技术背后最重要，也是最有价值的“资产”。如何从数据中获得有价值的信息？这个问题驱动了相关技术的发展，从最初的基于文件的检索、分析程序，到数据仓库理念的诞生，再到基于数据库的商业智能分析。而现在，这一问题已经变成了如何从海量的超大规模数据中快速获取有价值的信息，新的时代、新的挑战、新的技术必然应运而生。在数据分析领域，

2017-05-01

642

Dependency 依赖, 用于表示RDD之间的因果关系, 一个dependency表示一个parent rdd, 所以在RDD中使用Seq[Dependency[_]]来表示所有的依赖关系 Dependency的base class 可见Dependency唯一的成员就是rdd, 即所依赖的rdd, 或parent rdd /** * Base class for dependencies. */ abstract class Dependency[T](val rdd: RDD[T]) extends Serializable Dependency分为两种, narrow和shuffle NarrowDependency 先看看比较简单的narrow 定义, parent RDD中的每个partition最多被child RDD中的一个partition使用, 即不需要shuffle 更直白点, 就是Narrow只有map, partition本身范围不会改变, 一个parititon经过transform还是一个partition, 虽然内容发生了变化, 所以可以在local...

2017-05-01

713

资源下载

更多资源

Nacos

Nacos /nɑ:kəʊs/ 是 Dynamic Naming and Configuration Service 的首字母简称，一个易于构建 AI Agent 应用的动态服务发现、配置管理和AI智能体管理平台。Nacos 致力于帮助您发现、配置和管理微服务及AI智能体应用。Nacos 提供了一组简单易用的特性集，帮助您快速实现动态服务发现、服务配置、服务元数据、流量管理。Nacos 帮助您更敏捷和容易地构建、交付和管理微服务平台。

Spring

Spring框架（Spring Framework）是由Rod Johnson于2002年提出的开源Java企业级应用框架，旨在通过使用JavaBean替代传统EJB实现方式降低企业级编程开发的复杂性。该框架基于简单性、可测试性和松耦合性设计理念，提供核心容器、应用上下文、数据访问集成等模块，支持整合Hibernate、Struts等第三方框架，其适用范围不仅限于服务器端开发，绝大多数Java应用均可从中受益。

Rocky Linux

Rocky Linux（中文名：洛基）是由Gregory Kurtzer于2020年12月发起的企业级Linux发行版，作为CentOS稳定版停止维护后与RHEL（Red Hat Enterprise Linux）完全兼容的开源替代方案，由社区拥有并管理，支持x86_64、aarch64等架构。其通过重新编译RHEL源代码提供长期稳定性，采用模块化包装和SELinux安全架构，默认包含GNOME桌面环境及XFS文件系统，支持十年生命周期更新。

Sublime Text

Sublime Text具有漂亮的用户界面和强大的功能，例如代码缩略图，Python的插件，代码段等。还可自定义键绑定，菜单和工具栏。Sublime Text 的主要功能包括：拼写检查，书签，完整的 Python API ， Goto 功能，即时项目切换，多选择，多窗口等等。Sublime Text 是一个跨平台的编辑器，同时支持Windows、Linux、Mac OS X等操作系统。