Spark中两个类似的api是什么

发布时间：2022-01-14 17:00:08 来源：亿速云阅读：107 作者：iii 栏目：云计算

这篇“Spark中两个类似的api是什么”文章的知识点大部分人都不太理解，所以小编给大家总结了以下内容，内容详细，步骤清晰，具有一定的借鉴价值，希望大家阅读完这篇文章能有所收获，下面我们一起来看看这篇“Spark中两个类似的api是什么”文章吧。

Spark 中有两个类似的api，分别是 reduceByKey 和 groupByKey 。这两个的功能类似，但底层实现却有些不同，那么为什么要这样设计呢？我们来从源码的角度分析一下。

先看两者的调用顺序（都是使用默认的Partitioner，即defaultPartitioner）

所用 spark 版本：spark 2.1.0

#### 先看reduceByKey
Step1
```
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
```
Setp2
```
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
```
Setp3
```
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
```

姑且不去看方法里面的细节，我们会只要知道最后调用的是 combineByKeyWithClassTag 这个方法。这个方法有两个参数我们来重点看一下，
```
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)
```
首先是 **partitioner** 参数，这个即是 RDD 的分区设置。除了默认的 defaultPartitioner，Spark 还提供了 RangePartitioner 和 HashPartitioner 外，此外用户也可以自定义 partitioner 。通过源码可以发现如果是 HashPartitioner 的话，那么是会抛出一个错误的。

然后是 **mapSideCombine** 参数，这个参数正是 reduceByKey 和 groupByKey 最大不同的地方，它决定是是否会先在节点上进行一次 Combine 操作，下面会有更具体的例子来介绍。

#### 然后是groupByKey
Step1
```
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
```
Step2
```
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
```
Setp3
```
def combineByKeyWithClassTag[C](
createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
```

结合上面 reduceByKey 的调用链，可以发现最终其实都是调用 combineByKeyWithClassTag 这个方法的，但调用的参数不同。
reduceByKey的调用
```
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
```
groupByKey的调用
```
combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
```
正是两者不同的调用方式导致了两个方法的差别，我们分别来看
- reduceByKey的泛型参数直接是[V]，而groupByKey的泛型参数是[CompactBuffer[V]]。这直接导致了 reduceByKey 和 groupByKey 的返回值不同，前者是RDD[(K, V)]，而后者是RDD[(K, Iterable[V])]

- 然后就是mapSideCombine = false 了，这个mapSideCombine 参数的默认是true的。这个值有什么用呢，上面也说了，这个参数的作用是控制要不要在map端进行初步合并（Combine）。可以看看下面具体的例子。

从功能上来说，可以发现 ReduceByKey 其实就是会在每个节点先进行一次**合并**的操作，而 groupByKey 没有。

这么来看 ReduceByKey 的性能会比 groupByKey 好很多，因为有些工作在节点已经处理了。

以上就是关于“Spark中两个类似的api是什么”这篇文章的内容，相信大家都有了一定的了解，希望小编分享的内容对大家有帮助，若想了解更多相关的知识内容，请关注亿速云行业资讯频道。

向AI问一下细节

Spark中两个类似的api是什么

猜你喜欢

最新资讯

相关推荐

相关标签