
Featured Articles

An excellent personal blog: 低调大师

Jcseg 2.5.0 Released: a Lightweight Open-Source NLP Toolkit for Java

Jcseg is a lightweight Chinese tokenizer based on the mmseg algorithm. It also integrates keyword extraction, key-phrase extraction, key-sentence extraction, and automatic document summarization; ships a Jetty-based web server so that any language can call it directly over HTTP; and provides analyzer plugins for the latest versions of Lucene, Solr, and Elasticsearch.

Changes in Jcseg 2.5.0:

1. Fixed a bug in NLP mode where the position of some "第xx" (ordinal) entities was computed incorrectly (reported in https://gitee.com/lionsoul/jcseg/issues/I10FKC).
2. Fixed a lexicon autoload bug in the Elasticsearch plugin (reported in https://gitee.com/lionsoul/jcseg/issues/IWT2P).
3. Added automatic part-of-speech inheritance for synonyms in all segmentation modes.
4. Added support for Elasticsearch 7.2.0 and for Lucene and Solr 8.0.0 (reported in https://gitee.com/lionsoul/jcseg/issues/IZ7GS).
5. Integrated the Lucene/Solr/Elasticsearch synonym-retrieval solution with Jcseg's own synonym mechanism.
6. Fixed highlighting bugs for synonyms and derived words (for example, Chinese numerals converted to Arabic numerals) in Lucene and its downstream products Elasticsearch and Solr. This problem had existed ever since synonym support was added to Jcseg and was raised many times in the issue tracker; it has now been tested OK. Thanks to the reporters of the following issues:
   https://gitee.com/lionsoul/jcseg/issues/IM8GD
   https://gitee.com/lionsoul/jcseg/issues/IMBLD
   https://gitee.com/lionsoul/jcseg/issues/IRLA2
   https://gitee.com/lionsoul/jcseg/issues/IXA40
   https://gitee.com/lionsoul/jcseg/issues/I11505
   https://github.com/lionsoul2014/jcseg/issues/46
7. Upgraded the Jetty version used by jcseg-server to 9.4.18.v20190429.
8. Changed the lexicon entry format to: "word/part-of-speech set/pinyin/entity set/custom parameters".
9. Minor lexicon optimizations.

Downloads:
Gitee: https://gitee.com/lionsoul/jcseg/tree/v2.5.0-release
Github: https://github.com/lionsoul2014/jcseg/releases/tag/v2.5.0-release

Maven dependency:

    <dependency>
        <groupId>org.lionsoul</groupId>
        <artifactId>jcseg-core</artifactId>
        <version>2.5.0</version>
    </dependency>
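Change 8 redefines the lexicon entry layout as five slash-separated fields. As an illustration only (the '/' separator between fields and ',' inside the part-of-speech and entity sets are assumptions for this sketch, not taken from the Jcseg source), parsing that format in Java might look like:

```java
// Minimal sketch of parsing the "word/POS set/pinyin/entity set/custom params"
// lexicon entry format. Field and sub-field separators are assumptions.
public class LexEntry {
    public final String word;
    public final String[] posSet;
    public final String pinyin;
    public final String[] entitySet;
    public final String params;

    public LexEntry(String word, String[] posSet, String pinyin,
                    String[] entitySet, String params) {
        this.word = word;
        this.posSet = posSet;
        this.pinyin = pinyin;
        this.entitySet = entitySet;
        this.params = params;
    }

    public static LexEntry parse(String line) {
        // limit -1 keeps trailing empty fields so all five slots are present
        String[] f = line.split("/", -1);
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + f.length);
        }
        return new LexEntry(f[0], f[1].split(","), f[2], f[3].split(","), f[4]);
    }

    public static void main(String[] args) {
        LexEntry e = parse("重庆/ns/chong'qing/city.china/null");
        System.out.println(e.word + " " + e.pinyin);
    }
}
```

The entry string in `main` is a made-up example in the documented shape, not a real line from the Jcseg lexicon.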


August 28 Community Livestream: An Introduction to Spark Streaming SQL

Livestream link (replay available): https://tianchi.aliyun.com/course/live?liveId=41084 , or scan the QR code on the poster with DingTalk to join the group and watch directly. Time: August 28, 19:00. Speaker: 云魄, senior development engineer on Alibaba Cloud E-MapReduce, focused on stream computing; Spark contributor and open-source enthusiast. Abstract: this livestream gives a brief introduction to EMR Spark Streaming SQL, mainly covering Streaming SQL syntax and usage, and closes with a demo.


NLP Tool HanLP: Place-Name Recognition Based on Cascaded HMMs

This post follows up on "HanLP: Person-Name Recognition Based on HMM-Viterbi" and introduces the principle behind cascaded hidden Markov models. First, a comparison of the person-name recognition results discussed last time:

Names recognized only by Jieba had very low precision; most were in fact place names, parts of complex place names, or parts of complex organization names. Examples: [1] 战乱的阿富汗地区,qiang zhi可随意买卖,AK47价格约500人民币: "阿富汗" (Afghanistan) was tagged as a person name. [2] 安庆到桂林自驾游如何规划?: "桂林" (Guilin) was tagged as a person name. [3] 2018天津市和平分局招聘社区戒毒、社区康复工作人员成绩查询入口: "康复" was tagged as a person name.

Names recognized only by HanLP were wrong except for names built on very common surnames. Examples: [1] 纳溪区副区长李明带队到"花田酒地"景区检查节前安全工作: "花田酒" was tagged as a person name. [2] 秀英"线上线下"齐发力 助力贫困户"微互动"拓宽农产品销路: "齐发力" was tagged as a person name. [3] 紧急通知:秦报融媒粉团祖山一日游日报名费大调整!: "秦报" was tagged as a person name.

Among names recognized by both HanLP and Jieba, those with uncommon surnames were mostly wrong: [1] 房产高管薪酬大起底 万科郁亮年薪1189.9万仅排第二 [2] 生生不息 南通支云发布汶川地震十周年海报呼吁赛前默哀 [3] 为什么伊郎不能有he wu qi,而美国有he wu qi? Names whose characters themselves form ordinary words were also mostly wrong: [1] 周口一村庄杨絮着火,对付杨絮用啥方法好呢? [2] 上联: 三国魏蜀吴,如何对下联? [3] 上联:灯火辉煌万家乐。求下联?

How to fix these bad cases depends on how much time you have. If time allows, you can tune the emission-probability file, i.e. nr.txt. If not (my situation at the moment), just keep the common surnames plus the specific names you particularly care about.

So much for the recap. The topic of this post is named-entity recognition based on cascaded HMMs; my main reference is the paper 《基于层叠隐马尔可夫模型的中文命名实体识别》 (Chinese Named Entity Recognition Based on Cascaded Hidden Markov Models). "Cascaded" simply means chaining models together: as the system diagram in the paper shows, three HMMs are trained, each one tagging a single kind of entity, and the three models are connected in a cascade.

Different entity types use different role-tag sets, which are effectively features. Designing them takes linguistic knowledge, which in practice means wide reading and distilled experience. For example, a surname can serve as a feature for person names (张, 王, 李, 赵), common place-name suffixes can serve as features for place names (省, 市, 区, 县), and the trailing character of an organization name that denotes a location can serve as a feature (局, 处, 所, 院). The original post ends with a short role-tagging table for place names (not reproduced here).
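Whatever the role-tag set, each layer of the cascade is still decoded with the standard HMM Viterbi algorithm. The following is a self-contained sketch of that dynamic program; the two-role model and probabilities in `demo()` are toy values for illustration, not HanLP's real role-tagging data from nr.txt:

```java
import java.util.Arrays;

// Self-contained Viterbi decoder for one HMM layer of the cascade.
// The roles, probabilities and observations in demo() are toy values.
public class Viterbi {

    /** Most likely state sequence for obs, given log initial probs,
     *  log transition matrix and log emission matrix. */
    public static int[] decode(double[] logPi, double[][] logA,
                               double[][] logB, int[] obs) {
        int n = obs.length, k = logPi.length;
        double[][] dp = new double[n][k]; // best log-score ending in state s at step t
        int[][] back = new int[n][k];     // backpointers for path recovery
        for (int s = 0; s < k; s++) dp[0][s] = logPi[s] + logB[s][obs[0]];
        for (int t = 1; t < n; t++) {
            for (int s = 0; s < k; s++) {
                dp[t][s] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double sc = dp[t - 1][p] + logA[p][s] + logB[s][obs[t]];
                    if (sc > dp[t][s]) { dp[t][s] = sc; back[t][s] = p; }
                }
            }
        }
        int best = 0;
        for (int s = 1; s < k; s++) if (dp[n - 1][s] > dp[n - 1][best]) best = s;
        int[] path = new int[n];
        path[n - 1] = best;
        for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    /** Toy model: role 0 = inside a place name, role 1 = outside. */
    public static int[] demo() {
        double[] pi  = {Math.log(0.5), Math.log(0.5)};
        double[][] a = {{Math.log(0.8), Math.log(0.2)},
                        {Math.log(0.2), Math.log(0.8)}};
        double[][] b = {{Math.log(0.9), Math.log(0.1)},   // role 0 favors obs 0
                        {Math.log(0.1), Math.log(0.9)}};  // role 1 favors obs 1
        return decode(pi, a, b, new int[]{0, 0, 1});
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // expect [0, 0, 1]
    }
}
```

In the cascade, the tag sequence produced by one such layer (e.g. person names) is folded back into the token stream before the next layer (e.g. place names) is decoded.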


Reproducing and Analyzing Incorrect Time Handling in Flink

While chatting with a friend, he mentioned that the timestamps in his output were wrong; earlier tests had not caught this. I reproduced it in processing-time mode, confirmed the timestamps were indeed incorrect, then debugged until I found the cause. The notes below record the analysis; I also posted the reproduction steps and root-cause analysis at the bottom of https://github.com/apache/flink/pull/7180

Background: processing time, tumbling window, Flink 1.5.0. Here is how to generate the wrong result. The invoke stack is as follows:

    [1] org.apache.calcite.runtime.SqlFunctions.internalToTimestamp (SqlFunctions.java:1747)
    [2] org.apache.flink.table.runtime.aggregate.TimeWindowPropertyCollector.collect (TimeWindowPropertyCollector.scala:53)
    [3] org.apache.flink.table.runtime.aggregate.IncrementalAggregateWindowFunction.apply (IncrementalAggregateWindowFunction.scala:74)
    [4] org.apache.flink.table.runtime.aggregate.IncrementalAggregateTimeWindowFunction.apply (IncrementalAggregateTimeWindowFunction.scala:72)
    [5] org.apache.flink.table.runtime.aggregate.IncrementalAggregateTimeWindowFunction.apply (IncrementalAggregateTimeWindowFunction.scala:39)
    [6] org.apache.flink.streaming.runtime.operators.windowing.functions.InternalSingleValueWindowFunction.process (InternalSingleValueWindowFunction.java:46)
    [7] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.emitWindowContents (WindowOperator.java:550)
    [8] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.onProcessingTime (WindowOperator.java:505)
    [9] org.apache.flink.streaming.api.operators.HeapInternalTimerService.onProcessingTime (HeapInternalTimerService.java:266)
    [10] org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run (SystemProcessingTimeService.java:281)
    [11] java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
    [12] java.util.concurrent.FutureTask.run (FutureTask.java:266)
    [13] java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180)
    [14] java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293)
    [15] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    [16] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    [17] java.lang.Thread.run (Thread.java:748)

We are now at frame [1], org.apache.calcite.runtime.SqlFunctions.internalToTimestamp (SqlFunctions.java:1747), whose code is:

    public static Timestamp internalToTimestamp(long v) {
      return new Timestamp(v - LOCAL_TZ.getOffset(v));
    }

Printing the window bounds in the debugger:

    windowStart: v = 1544074830000
    windowEnd:   v = 1544074833000

Back in org.apache.flink.table.runtime.aggregate.TimeWindowPropertyCollector.collect (TimeWindowPropertyCollector.scala:51), the following code then executes:

    if (windowStartOffset.isDefined) {
      output.setField(
        lastFieldPos + windowStartOffset.get,
        SqlFunctions.internalToTimestamp(windowStart))
    }
    if (windowEndOffset.isDefined) {
      output.setField(
        lastFieldPos + windowEndOffset.get,
        SqlFunctions.internalToTimestamp(windowEnd))
    }

Before it runs, the output row is:

    pro0,throwable0,ERROR,ip0,1,ymm-appmetric-dev-self1_5_924367729,null,null,null

After it runs, the output row is:

    pro0,throwable0,ERROR,ip0,1,ymm-appmetric-dev-self1_5_924367729,2018-12-06 05:40:30.0,2018-12-06 05:40:33.0,null

So the long value 1544074830000 was rendered as 2018-12-06 05:40:30.0 and 1544074833000 as 2018-12-06 05:40:33.0. Is that right? I am in China (UTC+8), so the timestamps should read 2018-12-06 13:40:30.0 and 2018-12-06 13:40:33.0.

Next, the data is written to Kafka, and it gets serialized before the write. Let us see what happens there.
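The shift performed by internalToTimestamp can be reproduced outside Flink. Below is a minimal sketch: Calcite's LOCAL_TZ is normally the JVM default timezone, but here it is pinned to Asia/Shanghai so the result is deterministic wherever the code runs.

```java
import java.sql.Timestamp;
import java.util.TimeZone;

// Reproduces the shift performed by SqlFunctions.internalToTimestamp,
// with the timezone pinned to Asia/Shanghai (UTC+8) for reproducibility.
public class InternalToTimestampDemo {
    static final TimeZone LOCAL_TZ = TimeZone.getTimeZone("Asia/Shanghai");

    static Timestamp internalToTimestamp(long v) {
        return new Timestamp(v - LOCAL_TZ.getOffset(v));
    }

    public static void main(String[] args) {
        long windowStart = 1544074830000L; // epoch millis seen in the debugger
        Timestamp ts = internalToTimestamp(windowStart);
        // The stored epoch value is 8 hours (28800000 ms) earlier than the input
        System.out.println(windowStart - ts.getTime()); // prints 28800000
    }
}
```

Timestamp's own toString() renders the stored epoch in the local timezone, which is why the shifted value still prints as a plausible-looking wall-clock time.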
The call stack at serialization time is as follows:

    [1] org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ser.std.DateSerializer._timestamp (DateSerializer.java:41)
    [2] org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ser.std.DateSerializer.serialize (DateSerializer.java:48)
    [3] org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ser.std.DateSerializer.serialize (DateSerializer.java:15)
    [4] org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue (DefaultSerializerProvider.java:130)
    [5] org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.writeValue (ObjectMapper.java:2444)
    [6] org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.valueToTree (ObjectMapper.java:2586)
    [7] org.apache.flink.formats.json.JsonRowSerializationSchema.convert (JsonRowSerializationSchema.java:189)
    [8] org.apache.flink.formats.json.JsonRowSerializationSchema.convertRow (JsonRowSerializationSchema.java:128)
    [9] org.apache.flink.formats.json.JsonRowSerializationSchema.serialize (JsonRowSerializationSchema.java:102)
    [10] org.apache.flink.formats.json.JsonRowSerializationSchema.serialize (JsonRowSerializationSchema.java:51)
    [11] org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper.serializeValue (KeyedSerializationSchemaWrapper.java:46)
    [12] org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010.invoke (FlinkKafkaProducer010.java:355)
    [13] org.apache.flink.streaming.api.operators.StreamSink.processElement (StreamSink.java:56)
    [14] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
    [15] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
    [16] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
    [17] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
    [18] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
    [19] org.apache.flink.streaming.api.operators.StreamMap.processElement (StreamMap.java:41)
    [20] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
    [21] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
    [22] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
    [23] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
    [24] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
    [25] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
    [26] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:37)
    [27] org.apache.flink.table.runtime.CRowWrappingCollector.collect (CRowWrappingCollector.scala:28)
    [28] DataStreamCalcRule$88.processElement (null)
    [29] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:66)
    [30] org.apache.flink.table.runtime.CRowProcessRunner.processElement (CRowProcessRunner.scala:35)
    [31] org.apache.flink.streaming.api.operators.ProcessOperator.processElement (ProcessOperator.java:66)
    [32] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator (OperatorChain.java:560)
    [33] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:535)
    [34] org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect (OperatorChain.java:515)
    [35] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:679)
    [36] org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect (AbstractStreamOperator.java:657)
    [37] org.apache.flink.streaming.api.operators.TimestampedCollector.collect (TimestampedCollector.java:51)
    [38] org.apache.flink.table.runtime.aggregate.TimeWindowPropertyCollector.collect (TimeWindowPropertyCollector.scala:65)
    [39] org.apache.flink.table.runtime.aggregate.IncrementalAggregateWindowFunction.apply (IncrementalAggregateWindowFunction.scala:74)
    [40] org.apache.flink.table.runtime.aggregate.IncrementalAggregateTimeWindowFunction.apply (IncrementalAggregateTimeWindowFunction.scala:72)
    [41] org.apache.flink.table.runtime.aggregate.IncrementalAggregateTimeWindowFunction.apply (IncrementalAggregateTimeWindowFunction.scala:39)
    [42] org.apache.flink.streaming.runtime.operators.windowing.functions.InternalSingleValueWindowFunction.process (InternalSingleValueWindowFunction.java:46)
    [43] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.emitWindowContents (WindowOperator.java:550)
    [44] org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.onProcessingTime (WindowOperator.java:505)
    [45] org.apache.flink.streaming.api.operators.HeapInternalTimerService.onProcessingTime (HeapInternalTimerService.java:266)
    [46] org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run (SystemProcessingTimeService.java:281)
    [47] java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:511)
    [48] java.util.concurrent.FutureTask.run (FutureTask.java:266)
    [49] java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201 (ScheduledThreadPoolExecutor.java:180)
    [50] java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run (ScheduledThreadPoolExecutor.java:293)
    [51] java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1142)
    [52] java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:617)
    [53] java.lang.Thread.run (Thread.java:748)

The serializer code is:

    protected long _timestamp(Date value) {
        return value == null ? 0L : value.getTime();
    }

Take windowEnd as the example: here value = "2018-12-06 05:40:33.0" and value.getTime() = 1544046033000. The initial value was 1544074833000 and the final value is 1544046033000; the difference is 28800000 ms, i.e. 8 hours, because I am in China (UTC+8).

Why? The key is SqlFunctions.internalToTimestamp:

    public static Timestamp internalToTimestamp(long v) {
      return new Timestamp(v - LOCAL_TZ.getOffset(v));
    }

It subtracts the LOCAL_TZ offset, which I believe is redundant. Looking at it again, the root cause is that the value is converted back and forth through two different classes whose timezone conventions do not agree, so the result ends up scrambled. The fix belongs in the SqlFunctions class.
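The net effect of the two mismatched conversions can be sketched end to end: the write path shifts the epoch value by the local offset, while the serialization path reads the raw epoch back unchanged, so the round trip loses 8 hours. A minimal sketch follows (timezone pinned to Asia/Shanghai; the class and method names here are illustrative stand-ins, not Flink's or Calcite's actual classes):

```java
import java.sql.Timestamp;
import java.util.TimeZone;

// End-to-end sketch of the bug: internalToTimestamp mimics the Calcite-style
// write (subtracts the local offset), serialize mimics the Jackson-style read
// (returns the raw epoch via Date.getTime()), so the round trip is 8h off.
public class RoundTripDemo {
    static final TimeZone LOCAL_TZ = TimeZone.getTimeZone("Asia/Shanghai");

    static Timestamp internalToTimestamp(long v) {
        return new Timestamp(v - LOCAL_TZ.getOffset(v));
    }

    static long serialize(Timestamp value) {
        return value == null ? 0L : value.getTime();
    }

    public static void main(String[] args) {
        long windowEnd = 1544074833000L;
        long out = serialize(internalToTimestamp(windowEnd));
        System.out.println(out);              // 1544046033000, not 1544074833000
        System.out.println(windowEnd - out);  // 28800000 ms = 8 hours
    }
}
```

If both paths agreed on one convention (either both treating the long as a plain UTC epoch, or both applying the same timezone-aware conversion), the round trip would return the original value, which is exactly the inconsistency the analysis above attributes to SqlFunctions.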
