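The data file must be read with the class that matches the value type stored in it. (For this test I copied the data file to the root of the D: drive on my local Windows machine.)
Code: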
package cn.summerchill.nutch;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;

/**
 * Reads a SequenceFile generated by Nutch.
 * @author Administrator
 */
public class SeFileReader {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path dataPath = new Path("D:\\data");
        FileSystem fs = dataPath.getFileSystem(conf);
        // try-with-resources closes the reader even if reading fails
        try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, dataPath, conf)) {
            Text key = new Text();
            // The value object must match the value type stored in the file.
            // Swap in one of the commented-out lines for other kinds of data file.
            CrawlDatum value = new CrawlDatum();
            //Content value = new Content();
            //Inlinks value = new Inlinks();
            //ParseText value = new ParseText();
            //ParseData value = new ParseData();
            while (reader.next(key, value)) {
                System.out.println("key->\n" + key);
                // Note: the value goes to stderr while the key goes to stdout,
                // so the console may show the two out of order.
                System.err.println("value->\n" + value);
                try {
                    // Pause for a second between records
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                System.out.println("=======================================");
            }
        }
    }
}
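Output (the key is printed to System.out and the value to System.err, so the two streams can interleave out of order, as in the second record):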
key->
http://bbs.superwu.cn/
value->
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue Nov 08 08:31:30 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.6153846
Signature: 22defcd7cb4e7b1dc8a16a0a2f339ecb
Metadata:
Content-Type=application/xhtml+xml
_pst_=success(1), lastModified=0
_rs_=610
=======================================
value->
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Oct 09 08:31:35 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.23076925
Signature: null
Metadata:
key->
http://bbs.superwu.cn/archiver/
=======================================
key->
http://bbs.superwu.cn/forum.php
value->
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Oct 09 08:31:35 CST 2016
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.15384616
Signature: null
Metadata:
=======================================
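Since the value class has to match what is stored in the file, it helps to check which class a data file expects before choosing one. Below is a minimal sketch that reads the key and value class names from the start of a SequenceFile using only the JDK. The class name `SeqFileHeaderPeek` and its methods are hypothetical names made up for this illustration. The sketch assumes a version 6 header (the bytes `SEQ`, a version byte, then the two class names, each written as a length followed by UTF-8 bytes) and class names shorter than 128 bytes, so that each length fits in a single byte. With Hadoop on the classpath, `SequenceFile.Reader.getKeyClassName()` and `getValueClassName()` should give the same information without hand-parsing.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: peeks at a SequenceFile header without Hadoop. */
class SeqFileHeaderPeek {

    /** Returns {keyClassName, valueClassName} read from the start of the stream. */
    static String[] readKeyValueClassNames(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);

        // The file starts with the three magic bytes 'S', 'E', 'Q'.
        byte[] magic = new byte[3];
        data.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a SequenceFile: missing SEQ magic bytes");
        }

        // Assumption: only the version 6 header layout is handled here.
        int version = data.readUnsignedByte();
        if (version != 6) {
            throw new IOException("Header version " + version + " is not handled by this sketch");
        }

        String keyClassName = readShortString(data);
        String valueClassName = readShortString(data);
        return new String[] {keyClassName, valueClassName};
    }

    /** Reads a string stored as a one-byte length followed by UTF-8 bytes. */
    private static String readShortString(DataInputStream data) throws IOException {
        int length = data.readByte();
        if (length < 0) {
            // A negative first byte marks a multi-byte length, which this sketch skips.
            throw new IOException("Multi-byte string length is not handled by this sketch");
        }
        byte[] bytes = new byte[length];
        data.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SeqFileHeaderPeek <path-to-data-file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String[] names = readKeyValueClassNames(in);
            System.out.println("key class:   " + names[0]);
            System.out.println("value class: " + names[1]);
        }
    }
}
```

Run against the file used above (for example with `D:\data` as the argument), this would be expected to report `org.apache.hadoop.io.Text` as the key class and `org.apache.nutch.crawl.CrawlDatum` as the value class, since those are the classes the reader accepted. For other data files, the reported value class tells you which of the commented-out declarations to use.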
This article is reposted from SummerChill's blog on cnblogs. Original link: http://www.cnblogs.com/DreamDrive/p/5944073.html. To repost it, please contact the original author.