PySpark: how to count the number of distinct values per key
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 2)])
>>> sorted(rdd.distinct().countByKey().items())
[('a', 2), ('b', 1)]

OR:

from operator import add
rdd.distinct().map(lambda x: (x[0], 1)).reduceByKey(add)
rdd.distinct().keys().map(lambda x: (x, 1)).reduceByKey(add)

From the PySpark RDD API docs:

distinct(numPartitions=None)
    Return a new RDD containing the distinct elements in this RDD.

    >>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
    [1, 2, 3]

countByKey()
    Count the number of elements for each key, and return the result to the
    master as a dictionary.
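If you want to sanity-check the result without a Spark cluster, the same computation can be sketched in plain Python: `distinct()` corresponds to deduplicating the (key, value) pairs, and `countByKey()` to counting the keys that remain. This is an illustration of the semantics only, not how Spark executes it (Spark distributes the work across partitions).

```python
from collections import Counter

def count_distinct_values_per_key(pairs):
    """Count the number of distinct values per key, mirroring
    rdd.distinct().countByKey() from the Spark answer above."""
    distinct_pairs = set(pairs)                   # mirrors rdd.distinct()
    return Counter(k for k, _ in distinct_pairs)  # mirrors countByKey()

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 2)]
print(sorted(count_distinct_values_per_key(pairs).items()))
# [('a', 2), ('b', 1)]
```

Note that, as with `distinct()` in Spark, this requires the values to be hashable.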