
Featured articles

Search: [国密算法] (Chinese national SM cryptographic algorithms), 10,000 articles found
An excellent personal blog: 低调大师

Elasticsearch: configuring the BM25 and TF-IDF similarity algorithms

Pluggable Similarity Algorithms

Before we move on from relevance and scoring, we will finish this chapter with a more advanced subject: pluggable similarity algorithms. While Elasticsearch uses Lucene's Practical Scoring Function as its default similarity algorithm, it supports other algorithms out of the box, which are listed in the Similarity Modules documentation.

Okapi BM25

The most interesting competitor to TF/IDF and the vector space model is called Okapi BM25, which is considered to be a state-of-the-art ranking function. BM25 originates from the probabilistic relevance model, rather than the vector space model, yet the algorithm has a lot in common with Lucene's practical scoring function. Both use term frequency, inverse document frequency, and field-length normalization, but the definition of each of these factors is a little different. Rather than explaining the BM25 formula in detail, we will focus on the practical advantages that BM25 offers.

Term-frequency saturation

Both TF/IDF and BM25 use inverse document frequency to distinguish between common (low value) words and uncommon (high value) words. Both also recognize (see Term frequency) that the more often a word appears in a document, the more likely it is that the document is relevant for that word.

However, common words occur commonly. The fact that a common word appears many times in one document is offset by the fact that the word appears many times in all documents.

TF/IDF was designed in an era when it was standard practice to remove the most common words (or stopwords; see Stopwords: Performance Versus Precision) from the index altogether. The algorithm didn't need to worry about an upper limit for term frequency because the most frequent terms had already been removed.

In Elasticsearch, the standard analyzer (the default for string fields) doesn't remove stopwords because, even though they are words of little value, they do still have some value. The result is that, for very long documents, the sheer number of occurrences of words like "the" and "and" can artificially boost their weight.

BM25, on the other hand, does have an upper limit. Terms that appear 5 to 10 times in a document have a significantly larger impact on relevance than terms that appear just once or twice, but, as can be seen in Figure 34, "Term frequency saturation for TF/IDF and BM25", terms that appear 20 times in a document have almost the same impact as terms that appear a thousand times or more. This is known as nonlinear term-frequency saturation.

Figure 34. Term frequency saturation for TF/IDF and BM25

Field-length normalization

In Field-length norm, we said that Lucene considers shorter fields to have more weight than longer fields: the frequency of a term in a field is offset by the length of the field. However, the practical scoring function treats all fields in the same way. It will treat all title fields (because they are short) as more important than all body fields (because they are long).

BM25 also considers shorter fields to have more weight than longer fields, but it considers each field separately by taking the average length of the field into account. It can distinguish between a short title field and a long title field.

In Query-Time Boosting, we said that the title field has a natural boost over the body field because of its length. This natural boost disappears with BM25, as differences in field length apply only within a single field.

Excerpted from: https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html
Reposted from 张昺华-sky's Cnblogs blog; original post: http://www.cnblogs.com/bonelee/p/6472820.html. For reprinting, please contact the original author.
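The nonlinear saturation described above can be sketched numerically. The following is a minimal illustration, assuming the standard BM25 term-frequency component tf * (k1 + 1) / (tf + k1) with Elasticsearch's default k1 = 1.2, field-length normalization ignored (b = 0), and Lucene's classic sqrt(tf) as the TF/IDF comparison; it is not taken from the article itself:

```python
import math

def tfidf_tf(tf):
    # Lucene's classical TF component: sqrt(term frequency), unbounded.
    return math.sqrt(tf)

def bm25_tf(tf, k1=1.2):
    # BM25 TF component with length normalization ignored (b = 0).
    # As tf grows, this saturates toward the asymptote k1 + 1.
    return tf * (k1 + 1) / (tf + k1)

# Compare how each scheme rewards increasing term frequency.
for tf in (1, 2, 5, 10, 20, 1000):
    print(f"tf={tf:5d}  sqrt-tf={tfidf_tf(tf):6.2f}  bm25-tf={bm25_tf(tf):5.3f}")
```

With these defaults, a term appearing 20 times scores almost the same as one appearing 1,000 times under BM25 (both close to the 2.2 asymptote), while the square-root TF keeps growing without bound, which is exactly the behavior Figure 34 illustrates.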


Hadoop: configuring the HDFS heartbeat interval, and the heartbeat-detection algorithm

A datanode sends a heartbeat to the namenode at a fixed interval; if the namenode receives no heartbeat within a certain window, it marks the datanode as down. That window is computed as:

timeout = 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval

The default heartbeat.recheck.interval is 5 minutes, and the default dfs.heartbeat.interval is 3 seconds. So if the namenode still has not received a heartbeat from a datanode after 10 minutes and 30 seconds, it considers the datanode down and marks it as dead.

Note: in hdfs-site.xml, heartbeat.recheck.interval is specified in milliseconds, while dfs.heartbeat.interval is specified in seconds.

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
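The timeout formula above can be verified with a short calculation. This is a minimal sketch using the default values quoted in the article, with the unit difference between the two properties (milliseconds vs. seconds in hdfs-site.xml) handled explicitly; the function name is ours, not a Hadoop API:

```python
def hdfs_heartbeat_timeout(recheck_interval_ms=5 * 60 * 1000,
                           heartbeat_interval_s=3):
    # timeout = 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
    # heartbeat.recheck.interval is configured in milliseconds and
    # dfs.heartbeat.interval in seconds, so convert to a common unit first.
    return 2 * (recheck_interval_ms / 1000) + 10 * heartbeat_interval_s

timeout = hdfs_heartbeat_timeout()
print(f"{timeout:.0f} seconds = {timeout / 60:.1f} minutes")  # 630 seconds = 10.5 minutes
```

With the defaults this gives 2 * 300 s + 10 * 3 s = 630 s, i.e. the 10 minutes 30 seconds stated in the article.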


Learning Mahout: errors when running the canopy algorithm, and how to fix them

1. Converting Text to Vector sequence files

When running the compiled and packaged jar on Hadoop, you may hit the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/common/AbstractJob

The fix given in books and online is to copy the relevant jars from the Mahout root directory into Hadoop's lib directory and restart Hadoop. In my case this never worked no matter what I tried, so I gave up on running a packaged jar, modified the source instead, and ran it from Eclipse.

2. java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text

This error is caused by a mismatch between the input/output types declared on the map side and the reduce side. Also note a particularly easy mistake here: the functions in the Mapper and Reducer classes must be named map and reduce. The names cannot be changed, because the classes extend Mapper and Reducer; renaming the methods can also trigger this error, or leave the Reducer producing no output at all.

3. Converting the file format directly from the command line throws:

ERROR common.AbstractJob: Unexpected --seqFileDir while processing Job-Specific Options

Note: the conversion command was: bin/mahout clusterdump --seqFileDir /home/thinkgamer/document/canopy/output/clusters-0-final/ --pointsDir /home/thinkgamer/document/canopy/output/clusteredPoints/ --output /home/thinkgamer/document/canopy/clusteranalyze.txt

The fix suggested online is to replace --seqFileDir with --input.


HMS Core sign language service wins "Specially Recommended Case" at the 2022 China Internet Conference: helping build a digital society

On November 15, the HMS Core sign language service won "Specially Recommended Case" in the "Internet Powering the Digital Transformation of the Economy and Society" case-selection activity at the 2022 (21st) China Internet Conference.

After more than a year of technical iteration and accumulated experience, the HMS Core sign language service has worked with developers across many industries, applying AI sign-language translation to education, social networking, news, government services, and other scenarios, helping developers efficiently build digital government and a digital society.

Hearing-impaired people face barriers when accessing content on government websites, such as policy updates, public-information disclosures, video news, and explanatory responses. The HMS Core sign language service can translate broadcast content in real time and automatically generate sign-language motions, helping hearing-impaired users obtain government website content quickly and promptly, and helping government services improve efficiency and quality.

In addition, because hearing-impaired people differ in education level and in the rehabilitation training and hearing aids available to them, they also face difficulties in daily communication, daily life, and social integration. With its large vocabulary, accurate sign-language motions, and expressive facial animation, the HMS Core sign language service meets the communication needs of hearing-impaired people at home, at school, in hospitals, and in many other settings, letting them communicate freely even in a silent world.

By bringing the sign language service into the digital society, HMS Core lets accessibility technologies, one after another, clear the way for hearing-impaired people. This is not only a convenience for individuals, but a major step toward closing the digital divide and innovating in how public services are delivered. Going forward, HMS Core will remain open and sharing, working hand in hand with developers to support the digital ecosystem and build a smart digital life everyone can enjoy. This is a vivid interpretation of "the Internet powering the digital transformation of the economy and society", a concrete effort toward "technology that leaves no one behind", and a practical response to the digital-inclusion initiative.

