Spark RDDs vs DataFrames vs SparkSQL

2017-11-20

Introduction

A performance comparison of Spark's RDD, DataFrame, and SparkSQL APIs.

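Two comparisons

Random lookup of a single record
Aggregation with the output sorted

Each of the two problems above is solved in the following three ways in Spark, and the performance is compared.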

Using RDDs
Using DataFrames
Using SparkSQL

Data source

9 million unique records stored in 3 files in HDFS
11 fields per record
Total size: 1.4 GB
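
To make the record layout concrete, here is a minimal sketch of splitting one pipe-delimited line into its 11 fields. The field names are borrowed from the Row(...) definitions in the DataFrame scripts later in this article. The sample line and the names `Order` and `parse_order` are invented for illustration and are not taken from the real data set or the original scripts.

```python
from collections import namedtuple

# Field names follow the Row(...) definitions used in the scripts below.
FIELDS = ["cust_id", "order_id", "email_hash", "ssn_hash", "product_id",
          "product_desc", "country", "state", "shipping_carrier",
          "shipping_type", "shipping_class"]

Order = namedtuple("Order", FIELDS)


def parse_order(line):
    """Split one pipe-delimited line into an Order, converting the ID fields to int."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    parts[0] = int(parts[0])  # cust_id
    parts[1] = int(parts[1])  # order_id
    parts[4] = int(parts[4])  # product_id
    return Order(*parts)


# Hypothetical sample record (invented values, same shape as the real data).
sample = "1001|96922894|e3b0c4|9f86d0|42|widget|US|CA|UPS|ground|standard"
order = parse_order(sample)
print("%d %s" % (order.order_id, order.product_desc))
```

Rejecting lines with the wrong field count up front is a design choice of this sketch; the scripts in this article index into the split list directly.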

Test environment

HDP 2.4
Hadoop version 2.7
Spark 1.6
HDP Sandbox

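Test results

Raw RDDs performed better than DataFrames and SparkSQL
DataFrames and SparkSQL performed about the same
DataFrames and SparkSQL are more intuitive to use than RDD operations
Each job ran on its own, with no interference from other jobs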

The two operations

Random lookup of 1 order ID among 9 million unique order IDs
GROUP all the different products with their total COUNTS and SORT DESCENDING by count
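
As a reference for what the two operations should return, here is a plain-Python sketch that runs them on a tiny in-memory sample, without Spark. The sample rows and the function names are invented for illustration. The order ID is taken from the second field and the product description from the sixth, and the aggregation sorts by count, as the scripts in the code section do.

```python
from collections import Counter

# Invented sample lines: 11 pipe-delimited fields each.
SAMPLE_LINES = [
    "1|100|e1|s1|10|widget|US|CA|UPS|ground|standard",
    "2|101|e2|s2|11|gadget|US|NY|UPS|air|express",
    "3|102|e3|s3|10|widget|US|TX|FedEx|ground|standard",
    "4|103|e4|s4|12|gizmo|US|WA|USPS|ground|standard",
    "5|104|e5|s5|10|widget|US|OR|UPS|air|express",
    "6|105|e6|s6|11|gadget|US|NV|FedEx|ground|standard",
]


def lookup_order(lines, order_id):
    """Operation 1: return every record whose second field equals order_id."""
    records = (line.split("|") for line in lines)
    return [r for r in records if int(r[1]) == order_id]


def product_counts(lines):
    """Operation 2: count records per product and sort by count, descending."""
    counts = Counter(line.split("|")[5] for line in lines)
    # Ties are broken by product name so the output order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


print(lookup_order(SAMPLE_LINES, 102))
print(product_counts(SAMPLE_LINES))
```

Running the same sample through the Spark scripts and comparing against this output is one way to check that all three approaches compute the same answer before timing them.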

Code

RDD Random Lookup

#!/usr/bin/env python
 
from time import time
from pyspark import SparkConf, SparkContext
 
conf = (SparkConf()
  .setAppName("rdd_random_lookup")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", 2)
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.memory", "500MB"))
sc = SparkContext(conf = conf)
 
t0 = time()
 
path = "/data/customer_orders*"
lines = sc.textFile(path)
 
## filter where the order_id, the second field, is equal to 96922894
print(lines.map(lambda line: line.split('|')).filter(lambda fields: int(fields[1]) == 96922894).collect())

tt = str(time() - t0)
print("RDD lookup performed in " + tt + " seconds")
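
Every script in this article times itself the same way, by sampling time() before and after the work. A small context manager, sketched below, could factor that pattern out. The name `timed` and the `results` dictionary are illustrative and do not appear in the original scripts. Because Spark transformations such as textFile and map are lazy, the timed block has to include an action such as collect() or show() for the measurement to cover the actual work.

```python
from contextlib import contextmanager
from time import time


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the elapsed seconds, and optionally record them."""
    t0 = time()
    try:
        yield
    finally:
        elapsed = time() - t0
        if results is not None:
            results[label] = elapsed
        print("%s performed in %s seconds" % (label, elapsed))


# Usage with a stand-in workload; in the scripts, the block would hold the Spark action.
results = {}
with timed("sum of squares", results):
    total = sum(i * i for i in range(100000))
```

Collecting the timings in a dictionary makes it easier to print the six measurements side by side at the end of a run.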

DataFrame Random Lookup

#!/usr/bin/env python
 
from time import time
from pyspark.sql import SQLContext, Row
from pyspark import SparkConf, SparkContext
 
conf = (SparkConf()
  .setAppName("data_frame_random_lookup")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", 2)
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.memory", "500MB"))
sc = SparkContext(conf = conf)
 
sqlContext = SQLContext(sc)
 
t0 = time()
 
path = "/data/customer_orders*"
lines = sc.textFile(path)
 
## create data frame
orders_df = sqlContext.createDataFrame(
    lines.map(lambda line: line.split("|"))
         .map(lambda p: Row(cust_id=int(p[0]), order_id=int(p[1]), email_hash=p[2], ssn_hash=p[3],
                            product_id=int(p[4]), product_desc=p[5], country=p[6], state=p[7],
                            shipping_carrier=p[8], shipping_type=p[9], shipping_class=p[10])))
 
## filter where the order_id, the second field, is equal to 96922894
orders_df.where(orders_df['order_id'] == 96922894).show()
 
tt = str(time() - t0)
print("DataFrame performed in " + tt + " seconds")

SparkSQL Random Lookup

#!/usr/bin/env python
 
from time import time
from pyspark.sql import SQLContext, Row
from pyspark import SparkConf, SparkContext
 
conf = (SparkConf()
  .setAppName("spark_sql_random_lookup")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", 2)
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.memory", "500MB"))
sc = SparkContext(conf = conf)
 
sqlContext = SQLContext(sc)
 
t0 = time()
 
path = "/data/customer_orders*"
lines = sc.textFile(path)
 
## create data frame
orders_df = sqlContext.createDataFrame(
    lines.map(lambda line: line.split("|"))
         .map(lambda p: Row(cust_id=int(p[0]), order_id=int(p[1]), email_hash=p[2], ssn_hash=p[3],
                            product_id=int(p[4]), product_desc=p[5], country=p[6], state=p[7],
                            shipping_carrier=p[8], shipping_type=p[9], shipping_class=p[10])))
 
## register data frame as a temporary table
orders_df.registerTempTable("orders")
 
## filter where the order_id, the second field, is equal to 96922894
print(sqlContext.sql("SELECT * FROM orders where order_id = 96922894").collect())

tt = str(time() - t0)
print("SparkSQL performed in " + tt + " seconds")

RDD with GroupBy, Count, and Sort Descending

#!/usr/bin/env python
 
from time import time
from pyspark import SparkConf, SparkContext
 
conf = (SparkConf()
  .setAppName("rdd_aggregation_and_sort")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", 2)
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.memory", "500MB"))
sc = SparkContext(conf = conf)
 
t0 = time()
 
path = "/data/customer_orders*"
lines = sc.textFile(path)
 
## count records per product description (the sixth field),
## then swap to (count, product) so sortByKey orders by count, descending
counts = lines.map(lambda line: line.split('|')) \
    .map(lambda x: (x[5], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(ascending=False)

for x in counts.collect():
  print(x[1] + '\t' + str(x[0]))

tt = str(time() - t0)
print("RDD GroupBy performed in " + tt + " seconds")

DataFrame with GroupBy, Count, and Sort Descending

#!/usr/bin/env python
 
from time import time
from pyspark.sql import SQLContext, Row
from pyspark import SparkConf, SparkContext
 
conf = (SparkConf()
  .setAppName("data_frame_aggregation_and_sort")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", 2)
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.memory", "500MB"))
sc = SparkContext(conf = conf)
 
sqlContext = SQLContext(sc)
 
t0 = time()
 
path = "/data/customer_orders*"
lines = sc.textFile(path)
 
## create data frame
orders_df = sqlContext.createDataFrame(
    lines.map(lambda line: line.split("|"))
         .map(lambda p: Row(cust_id=int(p[0]), order_id=int(p[1]), email_hash=p[2], ssn_hash=p[3],
                            product_id=int(p[4]), product_desc=p[5], country=p[6], state=p[7],
                            shipping_carrier=p[8], shipping_type=p[9], shipping_class=p[10])))
 
results = orders_df.groupBy(orders_df['product_desc']).count().sort("count",ascending=False)
 
for x in results.collect():
  print(x)

tt = str(time() - t0)
print("DataFrame performed in " + tt + " seconds")

SparkSQL with GroupBy, Count, and Sort Descending

#!/usr/bin/env python
 
from time import time
from pyspark.sql import SQLContext, Row
from pyspark import SparkConf, SparkContext
 
conf = (SparkConf()
  .setAppName("spark_sql_aggregation_and_sort")
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", 2)
  .set("spark.dynamicAllocation.enabled", "false")
  .set("spark.shuffle.service.enabled", "false")
  .set("spark.executor.memory", "500MB"))
sc = SparkContext(conf = conf)
 
sqlContext = SQLContext(sc)
 
t0 = time()
 
path = "/data/customer_orders*"
lines = sc.textFile(path)
 
## create data frame
orders_df = sqlContext.createDataFrame(lines.map(lambda l: l.split("|")) \
.map(lambda r: Row(product=r[5])))
 
## register data frame as a temporary table
orders_df.registerTempTable("orders")
 
results = sqlContext.sql("SELECT product, count(*) AS total_count FROM orders GROUP BY product ORDER BY total_count DESC")
 
for x in results.collect():
  print(x)

tt = str(time() - t0)
print("SparkSQL performed in " + tt + " seconds")

Reposted from 阿凡卢's blog on cnblogs (博客园). Original article: http://www.cnblogs.com/luxiaoxun/p/6397996.html . To repost, please contact the original author.

Original link: https://yq.aliyun.com/articles/371911
