Sample input, two of the three files (the expected output below also references a c.txt whose contents are not shown here):

a.txt:
hello tom
hello jerry
hello tom

b.txt:
hello jerry
hello jerry
tom jerry

Expected inverted-index output, one word per line followed by its file->count postings:

hello a.txt->3 b.txt->2 c.txt->2
jerry b.txt->3 a.txt->1 c.txt->1
tom a.txt->2 b.txt->1 c.txt->1
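Before the Hadoop job itself, here is a minimal in-memory sketch of the same map -> combine -> reduce flow in plain Java, useful for tracing how the records above become the output. Since c.txt's contents are not shown, the values used for it here are an assumption chosen only to reproduce the expected counts (the order of postings per word may differ):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InverseIndexTrace {
    public static void main(String[] args) {
        Map<String, String[]> files = new LinkedHashMap<>();
        files.put("a.txt", new String[]{"hello tom", "hello jerry", "hello tom"});
        files.put("b.txt", new String[]{"hello jerry", "hello jerry", "tom jerry"});
        // hypothetical c.txt contents, chosen to be consistent with the expected output above
        files.put("c.txt", new String[]{"hello tom", "hello jerry"});

        // map + combine: count occurrences per file under keys like "hello->a.txt"
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.split(" ")) {
                    combined.merge(word + "->" + file.getKey(), 1, Integer::sum);
                }
            }
        }

        // shuffle + reduce: regroup by word and collect the "file->count" postings
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : combined.entrySet()) {
            String[] wordAndPath = e.getKey().split("->");
            grouped.computeIfAbsent(wordAndPath[0], w -> new ArrayList<>())
                   .add(wordAndPath[1] + "->" + e.getValue());
        }
        grouped.forEach((word, postings) ->
                System.out.println(word + "\t" + String.join("\t", postings)));
    }
}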
package itcastmr.inverseindex; // matches the fully qualified class name used in the run command below

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InverseIndex {

    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            // getInputSplit() reveals which split, and therefore which file, this mapper is reading
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            // use the bare file name (a.txt) so the result matches the sample output;
            // getPath().toString() would instead yield the full URI, e.g. hdfs://itcast:9000/ii/a.txt
            String path = inputSplit.getPath().getName();
            // emit k2,v2 pairs of the form hello->a.txt {1,1,1}
            for (String word : words) {
                k.set(word + "->" + path);
                v.set("1");
                context.write(k, v);
            }
        }
    }
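A quick standalone check of the two Path accessors used above (assuming hadoop-common is on the classpath); the original post used getPath().toString(), which yields the full URI rather than the bare file name shown in the expected output:

import org.apache.hadoop.fs.Path;

public class PathNameDemo {
    public static void main(String[] args) {
        Path p = new Path("hdfs://itcast:9000/ii/a.txt");
        System.out.println(p.toString()); // hdfs://itcast:9000/ii/a.txt
        System.out.println(p.getName());  // a.txt
    }
}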

    public static class IndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text k = new Text();
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // turn k2,v2 = hello->a.txt {1,1,1} into k3,v3 = hello, a.txt->3
            int counter = 0;
            for (Text text : values) {
                counter += Integer.parseInt(text.toString());
            }
            String[] wordAndPath = key.toString().split("->");
            String word = wordAndPath[0];
            String path = wordAndPath[1];
            k.set(word);
            v.set(path + "->" + counter);
            context.write(k, v);
        }
    }
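With the sample data, the grouped call the reducer receives for one key looks roughly like this (a trace, not real API syntax):

reduce("hello", ["a.txt->3", "b.txt->2", "c.txt->2"], context)

One caveat: Hadoop treats the combiner as an optimization that may run zero, one, or several times, and this combiner rewrites the key (hello->a.txt becomes hello). The job therefore only produces the output shown at the top when the combiner actually runs; a more defensive design would do the per-file counting in a separate first job.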

    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text v = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // The framework groups every record with the same key before calling reduce(),
            // which is why values is an Iterable: the k2s arriving from different map tasks
            // are merged so that each key's values form a single collection here.
            // The combiner has already sent pairs like hello, a.txt->3; concatenate them
            // into one tab-separated posting list per word.
            StringBuilder result = new StringBuilder();
            for (Text t : values) {
                result.append(t.toString()).append("\t");
            }
            v.set(result.toString());
            context.write(key, v);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverseIndex.class);

        job.setMapperClass(IndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(IndexCombiner.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1); // exit code 0 on success, 1 on failure
    }
}
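One way to build the jar referenced in the run command below, a sketch assuming the source file sits in the current directory and a Hadoop client is installed (your paths and jar name may differ):

javac -classpath "$(hadoop classpath)" -d . InverseIndex.java
jar cf /root/itcastmr.jar itcastmr/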
hadoop jar /root/itcastmr.jar itcastmr.inverseindex.InverseIndex /user/root/InverseIndex /InverseIndexResult
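Assuming the job completes, the index can be inspected with the standard HDFS shell; part-r-00000 is the default output file name for a single reducer:

hadoop fs -cat /InverseIndexResult/part-r-00000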