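I wrote a Java program to compare several compression methods commonly used for Hadoop files.
The goal: given a large file (bigfile.txt), compress it with each method and copy the result to HDFS.
The code is simple: construct an instance of each codec and let it create the output stream to HDFS: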

    package com.charles.hadoop.fs;

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class HadoopCodec {

        public static void main(String[] args) throws Exception {
            // Local file to compress
            String inputFile = "bigfile.txt";
            // HDFS directory that receives the compressed copies
            String outputFolder = "hdfs://192.168.129.35:9000/user/hadoop-user/codec/";

            Configuration conf = new Configuration();
            conf.set("hadoop.job.ugi", "hadoop-user,hadoop-user");

            // Compress the same file with each codec and record the elapsed time
            long gzipTime = copyAndZipFile(conf, inputFile, outputFolder, "org.apache.hadoop.io.compress.GzipCodec", "gz");
            long bzip2Time = copyAndZipFile(conf, inputFile, outputFolder, "org.apache.hadoop.io.compress.BZip2Codec", "bz2");
            long deflateTime = copyAndZipFile(conf, inputFile, outputFolder, "org.apache.hadoop.io.compress.DefaultCodec", "deflate");

            System.out.println("Compressed file: " + inputFile);
            System.out.println("gzip compression took " + gzipTime + " ms");
            System.out.println("bzip2 compression took " + bzip2Time + " ms");
            System.out.println("deflate compression took " + deflateTime + " ms");
        }

        public static long copyAndZipFile(Configuration conf, String inputFile, String outputFolder,
                String codecClassName, String suffixName) throws Exception {
            long startTime = System.currentTimeMillis();

            // Read the local input file through a buffer
            InputStream in = new BufferedInputStream(new FileInputStream(inputFile));

            // Strip the extension (everything from the last dot) and append the codec's suffix
            String baseName = inputFile.substring(0, inputFile.lastIndexOf("."));
            String outputFile = outputFolder + baseName + "." + suffixName;

            FileSystem fs = FileSystem.get(URI.create(outputFile), conf);

            // Instantiate the codec by class name
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(Class.forName(codecClassName), conf);

            // Let the codec wrap the HDFS output stream so that written bytes are compressed
            OutputStream out = codec.createOutputStream(fs.create(new Path(outputFile)));

            try {
                IOUtils.copyBytes(in, out, conf);
            } finally {
                IOUtils.closeStream(in);
                IOUtils.closeStream(out);
            }

            long endTime = System.currentTimeMillis();
            return endTime - startTime;
        }
    }

The program prints:

    Compressed file: bigfile.txt
    gzip compression took 11807 ms
    bzip2 compression took 44982 ms
    deflate compression took 3696 ms

Listing the HDFS directory confirms that all three compressed files were created.
Analysis:
We can compare the codecs on two dimensions: speed and compression ratio.
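Speed: the timings give deflate > gzip > bzip2, and bzip2 is far slower than the other two (about 45 s, versus about 11.8 s for gzip and 3.7 s for deflate).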
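Each codec above was timed once, in a fixed order, so the figures may include one-off costs such as JVM warm-up or opening the first HDFS connection. The following is a minimal sketch of a repeated measurement, not the original program: it assumes `java.util.zip` streams as a stand-in for the Hadoop codecs, compresses into memory instead of HDFS, and the names `CodecTimingSketch`, `StreamWrapper`, `timeOnce`, `medianNanos`, and `sampleData` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the original program.
class CodecTimingSketch {

    /** Wraps a raw sink in a compressing stream (stand-in for CompressionCodec.createOutputStream). */
    interface StreamWrapper {
        OutputStream wrap(OutputStream sink) throws IOException;
    }

    /** Compresses data once into memory and returns the elapsed time in nanoseconds. */
    static long timeOnce(byte[] data, StreamWrapper wrapper) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long start = System.nanoTime();
        try (OutputStream out = wrapper.wrap(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so finishing the output is included in the timing.
        return System.nanoTime() - start;
    }

    /** Runs unmeasured warm-up rounds, then returns the median of the measured rounds. */
    static long medianNanos(byte[] data, StreamWrapper wrapper, int warmups, int rounds) {
        for (int i = 0; i < warmups; i++) {
            timeOnce(data, wrapper);
        }
        long[] samples = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            samples[i] = timeOnce(data, wrapper);
        }
        Arrays.sort(samples);
        return samples[rounds / 2];
    }

    /** Builds repetitive text data of the requested size (assumed test input). */
    static byte[] sampleData(int size) {
        byte[] line = "hadoop compression benchmark sample line\n".getBytes(StandardCharsets.US_ASCII);
        byte[] data = new byte[size];
        for (int i = 0; i < size; i++) {
            data[i] = line[i % line.length];
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sampleData(4 * 1024 * 1024);
        long gzipNanos = medianNanos(data, GZIPOutputStream::new, 3, 7);
        long deflateNanos = medianNanos(data, DeflaterOutputStream::new, 3, 7);
        System.out.println("gzip median:    " + gzipNanos / 1_000_000.0 + " ms");
        System.out.println("deflate median: " + deflateNanos / 1_000_000.0 + " ms");
    }
}
```

Taking the median of several rounds after a warm-up makes a single slow first run less likely to decide the ranking. The absolute numbers from this sketch are not comparable to the HDFS timings above.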
Compression ratio:
The original file is 114,576,640 bytes.
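Taking the ratio as compressed size divided by original size (lower is better): gzip is 9,513,416 / 114,576,640 = 8.30%, bzip2 is 5,006,568 / 114,576,640 = 4.37%, and deflate is 9,513,404 / 114,576,640 = 8.30%.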
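The percentages can be reproduced from the byte counts quoted above. This is a small sketch for checking the arithmetic; the class and method names (`CompressionRatio`, `ratioPercent`, `format`) are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original program.
class CompressionRatio {

    /** Compressed size as a percentage of the original size. */
    static double ratioPercent(long compressedBytes, long originalBytes) {
        return 100.0 * compressedBytes / originalBytes;
    }

    /** Formats a percentage with two decimals, independent of the default locale. */
    static String format(double percent) {
        return String.format(Locale.ROOT, "%.2f%%", percent);
    }

    public static void main(String[] args) {
        long original = 114_576_640L; // size of bigfile.txt reported in the article
        System.out.println("gzip:    " + format(ratioPercent(9_513_416L, original))); // 8.30%
        System.out.println("bzip2:   " + format(ratioPercent(5_006_568L, original))); // 4.37%
        System.out.println("deflate: " + format(ratioPercent(9_513_404L, original))); // 8.30%
    }
}
```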
So, ranked by how much they shrink the file: bzip2 > deflate ≈ gzip (the gzip and deflate outputs differ by only 12 bytes).
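The 12-byte gap between the gzip and deflate outputs is consistent with both using the same DEFLATE algorithm and differing only in framing: a gzip stream carries a 10-byte header and an 8-byte trailer, while a zlib stream carries a 2-byte header and a 4-byte checksum. The sketch below illustrates this with `java.util.zip` as a stand-in for the Hadoop codecs (an assumption; it does not use `GzipCodec` or `DefaultCodec`), and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of the original program.
class ContainerOverhead {

    /** Size in bytes of data compressed as a gzip stream (default compression level). */
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    /** Size in bytes of data compressed as a zlib stream (default compression level). */
    static int zlibSize(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size();
    }

    public static void main(String[] args) {
        // Assumed sample input: any byte array works, the payload is compressed identically.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            text.append("line ").append(i).append(" of some sample text\n");
        }
        byte[] data = text.toString().getBytes(StandardCharsets.UTF_8);

        int gzip = gzipSize(data);
        int zlib = zlibSize(data);
        System.out.println("gzip bytes: " + gzip);
        System.out.println("zlib bytes: " + zlib);
        System.out.println("difference: " + (gzip - zlib));
    }
}
```

With the default settings used here the compressed payload is the same in both cases, so the difference printed should be just the extra framing bytes.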
In summary: bzip2 gives the best compression, and deflate is the fastest.
Reposted from the 51CTO blog of charles_wang888; original article: http://blog.51cto.com/supercharles888/879179 (please contact the original author before republishing).