[Spark][Python] Extracting the UserID from a web log as the key to form a new RDD
For an RDD, use keyBy to build (key, line) pairs:

[training@localhost ~]$ cat webs.log
56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"
56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"
202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"
[training@localhost ~]$ hdfs dfs -put webs.log
[training@localhost ~]$ hdfs dfs -cat webs.log
56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"
56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"
202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"

In [23]: mylogs = sc.textFile("webs.log")

In [25]: mylogs001 = mylogs.keyBy(lambda line: line.split(' ')[2])

In [26]: mylogs001.take(1)
Out[26]: [(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"')]

In [28]: mylogs001.take(2)
Out[28]:
[(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"'),
 (u'90700', u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"')]

The user ID is the third space-separated field of each line (index 2), so it becomes the key, and the full line is kept as the value.

For comparison, look at mylogs001.take(3) and mylogs.take(3):

In [30]: mylogs001.take(3)
Out[30]:
[(u'90700', u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"'),
 (u'90700', u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"'),
 (u'25223', u'202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"')]

In [31]: mylogs.take(3)
Out[31]:
[u'56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
 u'56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
 u'202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"']

This article is reposted from the cnblogs blog 健哥的数据花园. Original link: http://www.cnblogs.com/gaojian/p/008-Aggregating-Data-with-Pair-RDDs-keyBy.html. For reprints, please contact the original author.
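
To make the keyBy call in the session above more concrete: rdd.keyBy(f) produces the same pairs as rdd.map(lambda x: (f(x), x)). The following is a minimal plain-Python sketch of that behaviour, runnable without Spark. The helper name key_by and the variable names are hypothetical, chosen only for illustration; they are not part of the PySpark API.

```python
# Plain-Python sketch of what keyBy does; no Spark required.
# key_by is a hypothetical helper, not part of the PySpark API.

def key_by(records, key_func):
    """Pair each record with its key, like rdd.map(lambda x: (key_func(x), x))."""
    return [(key_func(record), record) for record in records]

# The three sample lines from webs.log shown in the session.
log_lines = [
    '56.31.230.188 - 90700 "GET/KDDOC-00101.html HTTP/1.0"',
    '56.32.230.186 - 90700 "GET/contents.css HTTP/1.0"',
    '202.156.27.99 - 25223 "GET /KDDOC-00220.html HTTP/1.0"',
]

# The user ID is the third space-separated field (index 2).
pairs = key_by(log_lines, lambda line: line.split(' ')[2])

# Collect only the keys to see which user IDs were extracted.
user_ids = [user_id for user_id, _ in pairs]
print(user_ids)  # ['90700', '90700', '25223']
```

Once every line carries its user ID as the key, pair-RDD operations in Spark such as countByKey, groupByKey, or reduceByKey can aggregate the log by user.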