Joins in MapReduce

2017-11-12

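I. Two ways to join in MR:

1. Reduce side join (interview question)

Reduce side join is the simplest kind of join. The main idea is as follows:

In the map phase, the map function reads both files, File1 and File2. To tell apart the key/value pairs from the two sources, each record is given a tag: for example, tag=1 means the record comes from File1 and tag=2 means it comes from File2. In other words, the main job of the map phase is to tag the records from the different files; the shuffle phase then groups them by key as a matter of course.

In the reduce phase, the reduce function receives the list of v2 values that share the same k2 (the v2 values come from both File1 and File2) and, for each key, joins the File1 records with the File2 records (a Cartesian product). In other words, the actual join is performed in the reduce phase.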
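The tag-then-join flow described above can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class name `ReduceSideJoinSketch`, the `Tagged` holder, the `key<TAB>value` record layout, and the sample records are all made up for this sketch, and a `TreeMap` stands in for the shuffle's sort and group by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simulates a reduce side join in plain Java (hypothetical sketch, no Hadoop). */
class ReduceSideJoinSketch {

    /** A value tagged with its source: tag 1 = File1, tag 2 = File2. */
    static final class Tagged {
        final int tag;
        final String value;

        Tagged(int tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    /** "Map" phase: split each "key<TAB>value" line and tag it with its source. */
    static void mapPhase(List<String> lines, int tag, Map<String, List<Tagged>> shuffle) {
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length < 2) {
                continue; // skip malformed records
            }
            shuffle.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(new Tagged(tag, kv[1]));
        }
    }

    /** "Reduce" phase: split one key's values by tag and emit the Cartesian product. */
    static List<String> reducePhase(String key, List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Tagged t : values) {
            (t.tag == 1 ? left : right).add(t.value);
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + "\t" + l + "\t" + r);
            }
        }
        return out;
    }

    static List<String> join(List<String> file1, List<String> file2) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>(); // stands in for sort + group by key
        mapPhase(file1, 1, shuffle);
        mapPhase(file2, 2, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            out.addAll(reducePhase(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> file1 = Arrays.asList("1\tChangchun", "2\tJilin", "9\tYanji");
        List<String> file2 = Arrays.asList("1\tuser1", "1\tuser3", "2\tuser2");
        for (String row : join(file1, file2)) {
            System.out.println(row);
        }
    }
}
```

Note that both lists in `reducePhase` are held in memory at once, and every input record passes through the simulated shuffle whether or not it finds a partner.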

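This approach has two problems:

1. The map phase does not slim down the data, so every record goes through the shuffle's network transfer and sort, which performs poorly.

2. The reduce side computes the product of two collections, which is memory-hungry and can easily lead to OOM.

My post summarizing reduce side join: http://www.cnblogs.com/DreamDrive/p/7692042.html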

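2. Map side join (interview question)

Reduce side join exists because the map phase cannot see all the fields needed for the join: the records for one key may be spread across different map tasks. Reduce side join is very inefficient because the shuffle phase has to transfer a large amount of data.

Map side join is an optimization for the following scenario:

Of the two tables to be joined, one is very large and the other is small enough to fit in memory. The small table can then be replicated so that every map task holds its own copy in memory (for example, in a hash table),

and only the large table is scanned: for each key/value record of the large table, look up the key in the hash table and, if a matching record exists, output the joined result.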
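The build-then-probe idea above can be sketched in plain Java. The class and method names (`MapSideHashJoinSketch`, `loadSmallTable`, `probe`), the `key<TAB>rest` layout, and the sample records are assumptions made for illustration; in a real job the load step would run once per map task and the probe step once per input record.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hash join of a small in-memory table against a stream of big-table records. */
class MapSideHashJoinSketch {

    /** Build step: load "key<TAB>rest" lines of the small table into a hash table. */
    static Map<String, String> loadSmallTable(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                table.put(kv[0], kv[1]);
            }
        }
        return table;
    }

    /** Probe step: returns the joined line, or null when the key has no match. */
    static String probe(Map<String, String> smallTable, String bigRecord) {
        String[] kv = bigRecord.split("\t", 2);
        if (kv.length < 2) {
            return null; // malformed record
        }
        String match = smallTable.get(kv[0]);
        return match == null ? null : kv[0] + "\t" + kv[1] + "\t" + match;
    }

    public static void main(String[] args) {
        Map<String, String> small = loadSmallTable(Arrays.asList("1\tShangHai", "2\tBeiJing"));
        for (String record : Arrays.asList("1\torderA", "7\torderB", "2\torderC")) {
            String joined = probe(small, record);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

No record leaves the "map task" unless it has already been joined, which is why this variant needs no shuffle for the join itself.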

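To support this file replication, Hadoop provides the class DistributedCache, which is used as follows:

(1) Call the static method DistributedCache.addCacheFile() to specify the file to replicate. Its argument is the file's URI (for a file on HDFS it looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the framework collects this list of URIs and copies the corresponding files to the local disk of each container.

(2) Call DistributedCache.getLocalCacheFiles() to obtain the local paths of the files, then read them with the standard file I/O API.

In Hadoop 2.x the DistributedCache class is deprecated; Job.addCacheFile() and context.getCacheFiles() are the replacements.

Limitations of this approach:

DistributedCache has to ship the small dataset to every compute node, and every map task has to load it into memory and index it by the join key. The obvious limitation is that one of the two datasets must be small enough to be loaded into memory on the map side so that the join can be done there.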

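3. Ways around the limitation of map side join:

① Use a memory server to extend the memory available to a node

For a map join, one dataset can be stored in a dedicated memory server. In the map() method, for each input <key,value> pair, fetch the matching data from the memory server by key and perform the join.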
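A minimal sketch of the lookup side of this idea, assuming the memory server is reachable through some client call that is represented here by a plain `Function<String, String>`. That function, the class name `CachedRemoteLookupSketch`, and the `remoteCalls` counter are all hypothetical; the small LRU cache in front of the remote call is one possible way for each map task to keep only a little data locally.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Looks up join values remotely, keeping a small LRU cache in the map task. */
class CachedRemoteLookupSketch {
    private final Function<String, String> remote; // hypothetical memory-server client
    private final Map<String, String> cache;
    int remoteCalls = 0; // counts round trips, for demonstration only

    CachedRemoteLookupSketch(Function<String, String> remote, final int capacity) {
        this.remote = remote;
        // An access-ordered LinkedHashMap evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the value for the key, or null if the server has no such key. */
    String lookup(String key) {
        String value = cache.get(key);
        if (value == null) {
            remoteCalls++;
            value = remote.apply(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> server = new HashMap<>(); // stands in for the memory server
        server.put("1", "hanmeimei");
        server.put("2", "leilei");
        CachedRemoteLookupSketch lookup = new CachedRemoteLookupSketch(server::get, 1);
        System.out.println(lookup.lookup("1") + " " + lookup.lookup("1") + " " + lookup.lookup("2"));
        System.out.println("remote calls: " + lookup.remoteCalls);
    }
}
```

The trade-off is a network round trip on every cache miss, so this only pays off when the same keys recur often within a map task.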

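② Use a BloomFilter to filter out records that cannot join

Build a BloomFilter in memory over the keys of one dataset. Before joining, test each key of the other dataset against the BloomFilter; if the key is not present, the record has nothing to join with and can be skipped.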
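To make the filtering step concrete, here is a small self-contained Bloom filter in plain Java. It is a teaching sketch under assumptions: the class name, the bit-array size, the number of hash functions, and the double-hashing scheme (String.hashCode combined with an FNV-1a hash) are choices made for this example only. Hadoop also ships its own implementation in the package org.apache.hadoop.util.bloom, which is not used here.

```java
import java.util.BitSet;

/** A tiny Bloom filter: no false negatives, occasional false positives. */
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** 32-bit FNV-1a over the characters, used as the second hash. */
    private static int fnv1a(String s) {
        int h = 0x811C9DC5;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 16777619;
        }
        return h;
    }

    /** Double hashing: the i-th position is (h1 + i * h2) mod size. */
    private int index(String key, int i) {
        return Math.floorMod(key.hashCode() + i * fnv1a(key), size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means the key was definitely never added; true means "possibly added". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch customerIds = new BloomFilterSketch(1024, 3);
        for (String id : new String[] {"1", "2", "3"}) {
            customerIds.add(id);
        }
        // Records whose key fails the test can be dropped before the join.
        for (String orderCustId : new String[] {"1", "3", "42"}) {
            System.out.println(orderCustId + " -> " + customerIds.mightContain(orderCustId));
        }
    }
}
```

Because a Bloom filter can report false positives, a record that passes the test still needs the real lookup; the filter only saves work on records it rejects.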

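③ Use the package that MapReduce provides specifically for joins

I noticed that the MapReduce library contains a package designed specifically for joins. I have not studied it yet and do not know how to use it; I am only recording it here as a reminder.

jar: hadoop-mapreduce-client-core.jar

package: org.apache.hadoop.mapreduce.lib.join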

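4. Map side join in practice

We have customer data (customer) and order data (orders).

customer

Customer ID	Name	Address	Phone
1	hanmeimei	ShangHai	110
2	leilei	BeiJing	112
3	lucy	GuangZhou	119

orders

Order ID	Customer ID	Order amount (other fields omitted)
1	1	50
2	1	200
3	3	15
4	3	350
5	3	58
6	1	42
7	1	352
8	2	1135
9	2	400
10	2	2000
11	2	300

The task is to join customer and orders on the customer ID. The result must be grouped by customer ID and sorted by order ID; there are no requirements on the other fields.

Customer ID	Order ID	Order amount	Name	Address	Phone
1	1	50	hanmeimei	ShangHai	110
1	2	200	hanmeimei	ShangHai	110
1	6	42	hanmeimei	ShangHai	110
1	7	352	hanmeimei	ShangHai	110
2	8	1135	leilei	BeiJing	112
2	9	400	leilei	BeiJing	112
2	10	2000	leilei	BeiJing	112
2	11	300	leilei	BeiJing	112
3	3	15	lucy	GuangZhou	119
3	4	350	lucy	GuangZhou	119
3	5	58	lucy	GuangZhou	119

1) When submitting the job, distribute the small dataset to every node through DistributedCache.
2) On the map side, read the data from DistributedCache and build the lookup mapping in memory. If a dedicated memory server is used, load the data into the memory server so that each map task only keeps a small local cache; if a BloomFilter is used to speed things up, this is where it is built.
3) In the map() function, for each <key,value> pair, look up the key in the mapping built in step 2), join, and output.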
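Before turning to Hadoop, the requirement above (join on customer ID, group by customer ID, sort by order ID) can be checked in plain Java on the sample data. This is only a sketch: the class name `CustomerOrderJoinSketch` and its helper methods are assumptions made for this example, and an in-memory sort with a two-level comparator stands in for what a composite map output key would achieve during the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Joins the sample customer and orders tables and sorts by (custId, orderId). */
class CustomerOrderJoinSketch {

    static List<String> sampleCustomers() {
        return Arrays.asList("1\thanmeimei\tShangHai\t110", "2\tleilei\tBeiJing\t112",
                "3\tlucy\tGuangZhou\t119");
    }

    static List<String> sampleOrders() {
        return Arrays.asList("1\t1\t50", "2\t1\t200", "3\t3\t15", "4\t3\t350", "5\t3\t58",
                "6\t1\t42", "7\t1\t352", "8\t2\t1135", "9\t2\t400", "10\t2\t2000", "11\t2\t300");
    }

    static List<String> join(List<String> customers, List<String> orders) {
        // Build side: custId -> "name<TAB>address<TAB>phone".
        Map<Integer, String> customerById = new HashMap<>();
        for (String line : customers) {
            String[] c = line.split("\t", 2);
            if (c.length == 2) {
                customerById.put(Integer.parseInt(c[0]), c[1]);
            }
        }
        // Probe side: each order is "orderId<TAB>custId<TAB>amount".
        List<String[]> rows = new ArrayList<>();
        for (String line : orders) {
            String[] o = line.split("\t");
            if (o.length < 3) {
                continue; // malformed record
            }
            String customer = customerById.get(Integer.parseInt(o[1]));
            if (customer != null) {
                rows.add(new String[] {o[1], o[0], o[2], customer});
            }
        }
        // Composite ordering: customer ID first, then order ID (numeric, not lexical).
        rows.sort(Comparator.<String[]>comparingInt(r -> Integer.parseInt(r[0]))
                .thenComparingInt(r -> Integer.parseInt(r[1])));
        List<String> out = new ArrayList<>();
        for (String[] r : rows) {
            out.add(String.join("\t", r));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String row : join(sampleCustomers(), sampleOrders())) {
            System.out.println(row);
        }
    }
}
```

The comparison is numeric on purpose: comparing the IDs as strings would place order 10 before order 8.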

The code:

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapSideJoin extends Configured implements Tool {
    // Location of the customer file on HDFS.
    private static final String CUSTOMER_CACHE_URL = "hdfs://hadoop1:9000/user/hadoop/mapreduce/cache/customer.txt";

    // Entity class for one row of the customer table.
    private static class CustomerBean {
        private final int custId;
        private final String name;
        private final String address;
        private final String phone;

        public CustomerBean(int custId, String name, String address, String phone) {
            this.custId = custId;
            this.name = name;
            this.address = address;
            this.phone = phone;
        }

        public int getCustId() {
            return custId;
        }

        public String getName() {
            return name;
        }

        public String getAddress() {
            return address;
        }

        public String getPhone() {
            return phone;
        }
    }

    // Composite map output key: (customer ID, order ID).
    public static class CustOrderMapOutKey implements WritableComparable<CustOrderMapOutKey> {
        private int custId;
        private int orderId;

        public void set(int custId, int orderId) {
            this.custId = custId;
            this.orderId = orderId;
        }

        public int getCustId() {
            return custId;
        }

        public int getOrderId() {
            return orderId;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(custId);
            out.writeInt(orderId);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            custId = in.readInt();
            orderId = in.readInt();
        }

        // Sort by customer ID first, then by order ID.
        @Override
        public int compareTo(CustOrderMapOutKey o) {
            int res = Integer.compare(custId, o.custId);
            return res == 0 ? Integer.compare(orderId, o.orderId) : res;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj instanceof CustOrderMapOutKey) {
                CustOrderMapOutKey o = (CustOrderMapOutKey) obj;
                return custId == o.custId && orderId == o.orderId;
            }
            return false;
        }

        // Added: the default HashPartitioner relies on hashCode(). Hashing on custId
        // only keeps all orders of one customer in the same reduce partition.
        @Override
        public int hashCode() {
            return Integer.hashCode(custId);
        }

        @Override
        public String toString() {
            return custId + "\t" + orderId;
        }
    }

    public static class JoinMapper extends Mapper<LongWritable, Text, CustOrderMapOutKey, Text> {
        private final CustOrderMapOutKey outputKey = new CustOrderMapOutKey();
        private final Text outputValue = new Text();
        /**
         * Every row of the customer table is kept in memory in this map.
         * The key is the customer ID, the value is the customer bean.
         */
        private final Map<Integer, CustomerBean> customerMap = new HashMap<>();

        // Runs once per map task, before any call to map().
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Note: this reads the file straight from HDFS rather than the localized cache copy.
            FileSystem fs = FileSystem.get(URI.create(CUSTOMER_CACHE_URL), context.getConfiguration());
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(CUSTOMER_CACHE_URL)), StandardCharsets.UTF_8))) {
                String line;
                // Format: custId  name  address  phone
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split("\t");
                    if (cols.length < 4) { // malformed record, skip it
                        continue;
                    }
                    CustomerBean bean = new CustomerBean(Integer.parseInt(cols[0]), cols[1], cols[2], cols[3]);
                    customerMap.put(bean.getCustId(), bean);
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Format: orderId  custId  amount
            String[] cols = value.toString().split("\t");
            if (cols.length < 3) {
                return;
            }

            int custId = Integer.parseInt(cols[1]); // customer ID
            CustomerBean customerBean = customerMap.get(custId);
            if (customerBean == null) { // no customer record to join with
                return;
            }

            StringBuilder sb = new StringBuilder();
            sb.append(cols[2]).append("\t")
                .append(customerBean.getName()).append("\t")
                .append(customerBean.getAddress()).append("\t")
                .append(customerBean.getPhone());
            outputValue.set(sb.toString());
            outputKey.set(custId, Integer.parseInt(cols[0]));
            context.write(outputKey, outputValue);
        }
    }

    /**
     * The join is already done in the mapper; the reducer exists only so that
     * the shuffle sorts the records by the composite key.
     */
    public static class JoinReducer extends Reducer<CustOrderMapOutKey, Text, CustOrderMapOutKey, Text> {
        @Override
        protected void reduce(CustOrderMapOutKey key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Nothing to compute, write the values through.
            for (Text value : values) {
                context.write(key, value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: <inpath> <outpath>");
            System.exit(2);
        }
        System.exit(ToolRunner.run(new Configuration(), new MapSideJoin(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, MapSideJoin.class.getSimpleName());
        job.setJarByClass(MapSideJoin.class);

        // Register the customer file as a cache file.
        job.addCacheFile(URI.create(CUSTOMER_CACHE_URL));

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // map settings
        job.setMapperClass(JoinMapper.class);
        job.setMapOutputKeyClass(CustOrderMapOutKey.class);
        job.setMapOutputValueClass(Text.class);

        // reduce settings
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(CustOrderMapOutKey.class);
        job.setOutputValueClass(Text.class);

        boolean res = job.waitForCompletion(true);
        return res ? 0 : 1;
    }
}

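The code above does not use the DistributedCache class: setup() opens the customer file directly from HDFS.

5. Another map side join example, this time reading the file through DistributedCache: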

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * Purpose:
 * Map side join of two files that represent two tables. The join fields are
 * id in table1 and cityID in table2. The original description calls this a
 * left outer join, but the mapper only emits records that find a match, so it
 * behaves as an inner join (see the result below).
 * table1 (left table): tb_dim_city
 * (id int, name string, orderid int, city_code int, is_show int),
 * assumed to contain very few records.
 * Contents of tb_dim_city.dat, delimiter "|":
 * id     name       orderid  city_code  is_show
 * 0       Other      9999     9999         0
 * 1       Changchun  1        901          1
 * 2       Jilin      2        902          1
 * 3       Siping     3        903          1
 * 4       Songyuan   4        904          1
 * 5       Tonghua    5        905          1
 * 6       Liaoyuan   6        906          1
 * 7       Baicheng   7        907          1
 * 8       Baishan    8        908          1
 * 9       Yanji      9        909          1
 * ---------------------------------------------------------------
 * table2 (right table): tb_user_profiles
 * (userID int, network string, flow double, cityID int)
 * Contents of tb_user_profiles.dat, delimiter "|":
 * userID   network     flow    cityID
 * 1           2G       123      1
 * 2           3G       333      2
 * 3           3G       555      1
 * 4           2G       777      3
 * 5           3G       666      4
 * ..................................
 * ..................................
 * ---------------------------------------------------------------
 *  Result:
 *  1   Changchun  1   901 1   1   2G  123
 *  1   Changchun  1   901 1   3   3G  555
 *  2   Jilin      2   902 1   2   3G  333
 *  3   Siping     3   903 1   4   2G  777
 *  4   Songyuan   4   904 1   5   3G  666
 */
public class MapSideJoinMain extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(MapSideJoinMain.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, Text> {
        private final HashMap<String, String> cityInfoMap = new HashMap<String, String>();
        private final Text outPutKey = new Text();
        private final Text outPutValue = new Text();

        /**
         * Runs once before each task starts. It fetches tb_dim_city from
         * DistributedCache and loads its records into memory.
         */
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Local paths of the cache files of the current job.
            // DistributedCache is deprecated in Hadoop 2.x in favor of context.getCacheFiles().
            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (distributePaths == null) {
                return; // no cache file was registered
            }
            for (Path p : distributePaths) {
                if (p.toString().endsWith("tb_dim_city.dat")) {
                    // Read the cached file into memory; the reader is closed automatically.
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(new FileInputStream(p.toString()), StandardCharsets.UTF_8))) {
                        String cityInfo;
                        while (null != (cityInfo = br.readLine())) {
                            String[] cityPart = cityInfo.split("\\|", 5);
                            if (cityPart.length == 5) {
                                cityInfoMap.put(cityPart[0], cityPart[1] + "\t" + cityPart[2] + "\t" + cityPart[3] + "\t" + cityPart[4]);
                            }
                        }
                    }
                }
            }
        }

        /**
         * The map side is simple: check whether the cityID of each
         * tb_user_profiles.dat record exists in the in-memory map.
         */
        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Skip empty lines.
            if (value == null || value.toString().equals("")) {
                return;
            }
            String[] mapInputSplit = value.toString().split("\\|", 4);
            // Skip malformed records.
            if (mapInputSplit.length != 4) {
                return;
            }
            // Emit only when the join field exists in the map.
            String citySecondPart = cityInfoMap.get(mapInputSplit[3]);
            if (citySecondPart != null) {
                outPutKey.set(mapInputSplit[3]);
                outPutValue.set(citySecondPart + "\t" + mapInputSplit[0] + "\t" + mapInputSplit[1] + "\t" + mapInputSplit[2]);
                context.write(outPutKey, outPutValue);
            }
        }
    }

    // args[0]: input path, args[1]: cache file (tb_dim_city.dat), args[2]: output path
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Register the cache file before the Job is created, because the Job copies conf.
        DistributedCache.addCacheFile(new Path(args[1]).toUri(), conf);
        Job job = Job.getInstance(conf, "MapJoinMR");
        job.setNumReduceTasks(0); // map-only job

        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path
        FileOutputFormat.setOutputPath(job, new Path(args[2])); // output path

        job.setJarByClass(MapSideJoinMain.class);
        job.setMapperClass(LeftOutJoinMapper.class);

        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // default output format

        // Map output key and value types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Job output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int returnCode = ToolRunner.run(new MapSideJoinMain(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
            System.exit(1);
        }
    }
}

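6. SemiJoin

SemiJoin, the so-called semi join, is on closer inspection a variant of reduce join. Records are filtered on the map side so that only records that take part in the join are sent over the network; records that do not take part are never transferred. This reduces the shuffle's network traffic and improves overall efficiency. Everything else works exactly as in reduce join. Put more plainly: extract the keys of the small table that take part in the join, distribute them to the relevant nodes through DistributedCache, and load them into memory (a HashSet works). In the map phase, scan the tables to be joined and drop every record whose join key is not in the HashSet; the remaining records go through the shuffle to the reduce side, where the join is performed as in reduce join. See the code: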

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * @author zengzhaozheng
 *
 * Purpose:
 * Semi join variant of the reduce side join. Two files represent two tables;
 * the join fields are id in table1 and cityID in table2. The original
 * description calls this a left outer join, but the reducer only emits
 * matching pairs, so it behaves as an inner join (see the result below).
 * CombineValues is a custom Writable holding joinKey, flag and secondPart;
 * it is not defined in this post.
 * table1 (left table): tb_dim_city
 * (id int, name string, orderid int, city_code int, is_show int)
 * Contents of tb_dim_city.dat, delimiter "|":
 * id     name       orderid  city_code  is_show
 * 0       Other      9999     9999         0
 * 1       Changchun  1        901          1
 * 2       Jilin      2        902          1
 * 3       Siping     3        903          1
 * 4       Songyuan   4        904          1
 * 5       Tonghua    5        905          1
 * 6       Liaoyuan   6        906          1
 * 7       Baicheng   7        907          1
 * 8       Baishan    8        908          1
 * 9       Yanji      9        909          1
 * ---------------------------------------------------------------
 * table2 (right table): tb_user_profiles
 * (userID int, network string, flow double, cityID int)
 * Contents of tb_user_profiles.dat, delimiter "|":
 * userID   network     flow    cityID
 * 1           2G       123      1
 * 2           3G       333      2
 * 3           3G       555      1
 * 4           2G       777      3
 * 5           3G       666      4
 * ..................................
 * ..................................
 * ---------------------------------------------------------------
 * Contents of joinKey.dat:
 * city_code
 * 1
 * 2
 * 3
 * 4
 * ---------------------------------------------------------------
 *  Result:
 *  1   Changchun  1   901 1   1   2G  123
 *  1   Changchun  1   901 1   3   3G  555
 *  2   Jilin      2   902 1   2   3G  333
 *  3   Siping     3   903 1   4   2G  777
 *  4   Songyuan   4   904 1   5   3G  666
 */
public class SemiJoin extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(SemiJoin.class);

    public static class SemiJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
        private final CombineValues combineValues = new CombineValues();
        private final HashSet<String> joinKeySet = new HashSet<String>();
        private final Text flag = new Text();
        private final Text joinKey = new Text();
        private final Text secondPart = new Text();

        /**
         * Loads the join keys from DistributedCache into memory so that the
         * map side can keep only the records that take part in the join.
         */
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Local paths of the cache files of the current job.
            // DistributedCache is deprecated in Hadoop 2.x in favor of context.getCacheFiles().
            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (distributePaths == null) {
                return; // no cache file was registered
            }
            for (Path p : distributePaths) {
                if (p.toString().endsWith("joinKey.dat")) {
                    // Read the cached file into memory; the reader is closed automatically.
                    try (BufferedReader br = new BufferedReader(
                            new InputStreamReader(new FileInputStream(p.toString()), StandardCharsets.UTF_8))) {
                        String joinKeyStr;
                        while (null != (joinKeyStr = br.readLine())) {
                            joinKeySet.add(joinKeyStr);
                        }
                    }
                }
            }
        }

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Path of the file this input split belongs to.
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            String[] valueItems = value.toString().split("\\|");
            if (pathName.endsWith("tb_dim_city.dat")) {
                // Records from tb_dim_city.dat get flag "0". Skip malformed records
                // and records whose key does not take part in the join.
                if (valueItems.length != 5 || !joinKeySet.contains(valueItems[0])) {
                    return;
                }
                flag.set("0");
                joinKey.set(valueItems[0]);
                secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3] + "\t" + valueItems[4]);
            } else if (pathName.endsWith("tb_user_profiles.dat")) {
                // Records from tb_user_profiles.dat get flag "1".
                if (valueItems.length != 4 || !joinKeySet.contains(valueItems[3])) {
                    return;
                }
                flag.set("1");
                joinKey.set(valueItems[3]);
                secondPart.set(valueItems[0] + "\t" + valueItems[1] + "\t" + valueItems[2]);
            } else {
                return; // unknown input file
            }
            combineValues.setFlag(flag);
            combineValues.setJoinKey(joinKey);
            combineValues.setSecondPart(secondPart);
            context.write(combineValues.getJoinKey(), combineValues);
        }
    }

    public static class SemiJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
        // Left-table records of the current group.
        private final ArrayList<Text> leftTable = new ArrayList<Text>();
        // Right-table records of the current group.
        private final ArrayList<Text> rightTable = new ArrayList<Text>();
        private final Text output = new Text();

        /**
         * reduce() is called once per group.
         */
        @Override
        protected void reduce(Text key, Iterable<CombineValues> value, Context context) throws IOException, InterruptedException {
            leftTable.clear();
            rightTable.clear();
            /**
             * Store the elements of the group separately by source file.
             * Caveat: if a group contains too many elements, the reduce phase
             * may run out of memory. Before tackling a distributed problem it
             * is best to understand how the data is distributed and choose the
             * approach accordingly, which helps avoid OOM and excessive skew.
             */
            for (CombineValues cv : value) {
                // Copy the value: Hadoop reuses the CombineValues instance across iterations.
                Text secondPar = new Text(cv.getSecondPart().toString());
                String cvFlag = cv.getFlag().toString().trim();
                if ("0".equals(cvFlag)) {        // left table tb_dim_city
                    leftTable.add(secondPar);
                } else if ("1".equals(cvFlag)) { // right table tb_user_profiles
                    rightTable.add(secondPar);
                }
            }
            logger.info("tb_dim_city:" + leftTable.toString());
            logger.info("tb_user_profiles:" + rightTable.toString());
            for (Text leftPart : leftTable) {
                for (Text rightPart : rightTable) {
                    output.set(leftPart + "\t" + rightPart);
                    context.write(key, output);
                }
            }
        }
    }

    // args[0]: input path (both tables), args[1]: output path, args[2]: cache file (joinKey.dat)
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Register the cache file before the Job is created, because the Job copies conf.
        DistributedCache.addCacheFile(new Path(args[2]).toUri(), conf);
        Job job = Job.getInstance(conf, "SemiJoinMR");
        job.setJarByClass(SemiJoin.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path

        job.setMapperClass(SemiJoinMapper.class);
        job.setReducerClass(SemiJoinReducer.class);

        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // default output format

        // Map output key and value types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CombineValues.class);

        // Reduce output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int returnCode = ToolRunner.run(new SemiJoin(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
            System.exit(1);
        }
    }
}

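Note that SemiJoin also has a limited range of applicability: the extracted join keys have to be held in memory, so the key set must not be too large, or it can cause OOM on the map side.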
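The map-side filtering step of the semi join can also be looked at in isolation. The following plain-Java sketch assumes "|"-delimited records like those in the listing above; the class name `SemiJoinFilterSketch`, the `filter` method and its parameters, and the sample records are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Keeps only the records whose join key appears in the in-memory key set. */
class SemiJoinFilterSketch {

    /**
     * @param records         "|"-delimited input lines
     * @param keyColumn       index of the join key column
     * @param expectedColumns records with a different column count are dropped
     * @param joinKeys        keys that take part in the join
     */
    static List<String> filter(List<String> records, int keyColumn, int expectedColumns,
                               Set<String> joinKeys) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            String[] cols = record.split("\\|");
            if (cols.length != expectedColumns) {
                continue; // malformed record
            }
            if (joinKeys.contains(cols[keyColumn])) {
                kept.add(record); // only these records would reach the shuffle
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> joinKeys = new HashSet<>(Arrays.asList("1", "2", "3", "4"));
        List<String> cities = Arrays.asList("0|Other|9999|9999|0", "1|Changchun|1|901|1",
                "5|Tonghua|5|905|1");
        List<String> profiles = Arrays.asList("1|2G|123|1", "9|3G|100|7");
        System.out.println(filter(cities, 0, 5, joinKeys));
        System.out.println(filter(profiles, 3, 4, joinKeys));
    }
}
```

When the key set itself is too large for a HashSet, a Bloom filter over the keys is one possible substitute, at the cost of letting a few non-joining records through to the reduce side.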

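II. Summary

This post covered three kinds of join. They suit different scenarios and differ considerably in efficiency, mainly because of network transfer. Map join is the most efficient, followed by SemiJoin, and reduce join is the least efficient. In addition, when writing distributed big-data programs it is best to understand the overall distribution of the data to be processed; this helps make the code more efficient and keeps data skew to a minimum.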

This article is reposted from SummerChill's cnblogs blog. Original link: http://www.cnblogs.com/DreamDrive/p/7692618.html. To repost it, please contact the original author.

Source: https://yq.aliyun.com/articles/376362
