Using HanLP for Word Segmentation in Spark

Date: 2018-10-30

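1. Upload HanLP's data directory (which contains the dictionaries and models) to HDFS, then set root in the project's hanlp.properties configuration file to the parent directory of data, for example:
root=hdfs://localhost:9000/tmp/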
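To check where the data directory has to live on HDFS, the sketch below joins root with a relative resource path. It assumes HanLP 1.x builds each resource location by prefixing root to a relative path such as data/dictionary/CoreNatureDictionary.txt. The class HanLpRootSketch and its methods are hypothetical helpers for illustration and are not part of HanLP.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of HanLP: shows how root and a relative path combine.
final class HanLpRootSketch {

    // Reads the root entry from the text of a hanlp.properties file.
    static String rootFrom(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("root", "");
    }

    // Joins root and a relative path, adding the separating slash if it is missing.
    static String resolve(String root, String relativePath) {
        String normalized = root.endsWith("/") ? root : root + "/";
        return normalized + relativePath;
    }

    public static void main(String[] args) {
        String root = rootFrom("root=hdfs://localhost:9000/tmp/\n");
        // Assumed relative path of the core dictionary inside the data directory.
        System.out.println(resolve(root, "data/dictionary/CoreNatureDictionary.txt"));
    }
}
```

With the root shown above, the data directory would therefore be expected at hdfs://localhost:9000/tmp/data/.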

2. Implement the com.hankcs.hanlp.corpus.io.IIOAdapter interface:

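// Requires: java.io.IOException, java.io.InputStream, java.io.OutputStream, java.net.URI,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
// com.hankcs.hanlp.corpus.io.IIOAdapter
public static class HadoopFileIoAdapter implements IIOAdapter {
    // Opens a HanLP resource (dictionary or model) for reading from HDFS.
    @Override
    public InputStream open(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.open(new Path(path));
    }

    // Creates a file on HDFS for writing, for example a cache file generated by HanLP.
    @Override
    public OutputStream create(String path) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(path), conf);
        return fs.create(new Path(path));
    }
}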

3. Set the IOAdapter and create the segmenter:

private static Segment segment;

static {
    HanLP.Config.IOAdapter = new HadoopFileIoAdapter();
    segment = new CRFSegment();
}

After that, segment can be used for word segmentation inside Spark operations.
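A static field works here because static state is not serialized with a Spark closure. Each executor JVM runs the static initializer itself the first time the class is loaded. The sketch below illustrates that once-per-JVM behaviour using only the standard library. Tokenizer is a stand-in for HanLP's Segment, the whitespace splitter is a placeholder rather than real Chinese segmentation, and segmentPartition only mimics the body of a per-partition Spark function. All of these names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class SegmenterHolderSketch {

    // Stand-in for HanLP's Segment; NOT the real HanLP API.
    interface Tokenizer {
        List<String> seg(String text);
    }

    // Counts how many times the (expensive) segmenter is constructed.
    static final AtomicInteger INIT_COUNT = new AtomicInteger();

    // Holder class: its static initializer runs once per JVM, on first access.
    static final class Holder {
        static final Tokenizer SEGMENT = create();

        private static Tokenizer create() {
            INIT_COUNT.incrementAndGet();
            // Real code would set HanLP.Config.IOAdapter and build the segmenter here.
            // Placeholder logic: split on whitespace.
            return text -> Arrays.asList(text.trim().split("\\s+"));
        }
    }

    // Mimics the body of a per-partition function: every line reuses the shared instance.
    static List<List<String>> segmentPartition(List<String> lines) {
        List<List<String>> out = new ArrayList<>();
        for (String line : lines) {
            out.add(Holder.SEGMENT.seg(line));
        }
        return out;
    }

    public static void main(String[] args) {
        segmentPartition(Arrays.asList("a b", "c d e"));
        segmentPartition(Arrays.asList("f g"));
        System.out.println("initializations: " + INIT_COUNT.get());
    }
}
```

In a real job the same shape would apply: the function passed to a Spark transformation refers to the static segmenter, so the dictionaries and models are loaded from HDFS once per executor rather than once per record.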

Source: 云聪's blog

Original link: https://yq.aliyun.com/articles/662106
