[Hadoop]Hadoop Archives
1. What are Hadoop archives?
Hadoop archives are a special archive format. A Hadoop archive maps onto a filesystem directory and always carries a *.har extension. A Hadoop archive contains metadata (in the form of _index and _masterindex files) and data (part-*) files. The _index file records the names of the files inside the archive and their locations within the part files.
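For instance, a small hypothetical archive /user/hadoop/foo.har would typically contain entries like the following (small archives get a single part-0; larger ones are split across several part-* files):

/user/hadoop/foo.har/_index
/user/hadoop/foo.har/_masterindex
/user/hadoop/foo.har/part-0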
2. How do you create an archive?
2.1 Syntax
hadoop archive -archiveName name -p <parent> <src>* <dest>
2.2 Parameters
(1) The -archiveName option specifies the name of the archive you are creating, e.g. user_order.har. The name should carry the *.har extension.
(2) The -p option specifies the parent path that the src paths are relative to, for example (a worked version follows this list):
-p /foo/bar a/b/c e/f/g
(3) src is one or more input directories to be archived, given relative to the parent path.
(4) dest is the destination directory; the created archive is saved under it.
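To make the -p example concrete: with a hypothetical destination directory /outputdir, the command below would archive /foo/bar/a/b/c and /foo/bar/e/f/g, i.e. each relative src is resolved against the parent given to -p:

hadoop archive -archiveName zoo.har -p /foo/bar a/b/c e/f/g /outputdir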
2.3 Example
hadoop archive -archiveName user_order.har -p /user/xiaosi user/user_active_new_user order/entrance_order test/archive
Note that creating an archive runs a MapReduce job, so you should run this command on a MapReduce cluster:
xiaosi@yoona:~/opt/hadoop-2.7.3$ hadoop archive -archiveName user_order.har -p /user/xiaosi user/user_active_new_user order/entrance_order test/archive
16/12/26 20:45:36 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/12/26 20:45:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/12/26 20:45:37 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
16/12/26 20:45:37 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
16/12/26 20:45:37 INFO mapreduce.JobSubmitter: number of splits:1
16/12/26 20:45:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local133687258_0001
16/12/26 20:45:37 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/12/26 20:45:37 INFO mapreduce.Job: Running job: job_local133687258_0001
...
16/12/26 20:45:38 INFO mapred.LocalJobRunner: reduce task executor complete.
16/12/26 20:45:39 INFO mapreduce.Job: map 100% reduce 100%
16/12/26 20:45:39 INFO mapreduce.Job: Job job_local133687258_0001 completed successfully
16/12/26 20:45:39 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=95398
        FILE: Number of bytes written=678069
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=974540
        HDFS: Number of bytes written=975292
        HDFS: Number of read operations=55
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=11
    Map-Reduce Framework
        Map input records=8
        Map output records=8
        Map output bytes=761
        Map output materialized bytes=783
        Input split bytes=147
        Combine input records=0
        Combine output records=0
        Reduce input groups=8
        Reduce shuffle bytes=783
        Reduce input records=8
        Reduce output records=0
        Spilled Records=16
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=593494016
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=689
    File Output Format Counters
        Bytes Written=0
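Once the job completes, the archive appears as an ordinary directory on HDFS; listing it with a plain path should reveal the _index, _masterindex and part-* files described in section 1 (output omitted here):

hadoop fs -ls /user/xiaosi/test/archive/user_order.har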
3. How do you view the files in an archive?
Archives are exposed as a filesystem layer, so all of the fs shell commands work on them; you just have to use a different URI. Note, however, that archives are immutable, so rename, delete and create operations all return an error.
The URI for Hadoop Archives is:
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided, the archive is assumed to live on the cluster's default filesystem, and a shorter form can be used:
har:///archivepath/fileinarchive
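For example, assuming the archive created above sits on an HDFS cluster whose NameNode is namenode:8020 (a hypothetical address), these two URIs refer to the same directory:

har://hdfs-namenode:8020/user/xiaosi/test/archive/user_order.har/user
har:///user/xiaosi/test/archive/user_order.har/user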
hadoop dfs -ls har:///user/xiaosi/test/archive/user_order.har
xiaosi@yoona:~/opt/hadoop-2.7.3$ hadoop dfs -ls har:///user/xiaosi/test/archive/user_order.har
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Found 2 items
drwxr-xr-x   - xiaosi supergroup          0 2016-12-13 13:39 har:///user/xiaosi/test/archive/user_order.har/order
drwxr-xr-x   - xiaosi supergroup          0 2016-12-24 15:51 har:///user/xiaosi/test/archive/user_order.har/user
hadoop dfs -cat har:///user/xiaosi/test/archive/user_order.har/order/entrance_order/entrance_order.txt
xiaosi@yoona:~/opt/hadoop-2.7.3$ hadoop dfs -cat har:///user/xiaosi/test/archive/user_order.har/order/entrance_order/entrance_order.txt
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

{"clickTime":"20161210 14:47:35.000","entrance":"306","actionTime":"20161210 14:48:14.000","orderId":"21014149","businessType":"TRAIN","gid":"1B369BF1D","uid":"8661840271741","vid":"01151","income":105.5,"status":140}
{"clickTime":"20161210 14:47:35.000","entrance":"306","actionTime":"20161210 14:48:18.000","orderId":"121818e46","businessType":"TRAIN","gid":"69BF1D","uid":"86618471741","vid":"01151","income":105.5,"status":140}
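Because the regular fs shell commands understand har:// URIs, pulling a file back out of an archive is just a copy. A minimal sketch (the destination directory /user/xiaosi/tmp is an assumption):

hadoop fs -cp har:///user/xiaosi/test/archive/user_order.har/order/entrance_order/entrance_order.txt /user/xiaosi/tmp/

For unarchiving many files at once, hadoop distcp accepts the same har:// URIs and copies in parallel.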