[NLP Study Notes] (2) Using gensim: Topics and Transformations

2018-12-10

This post is largely a translation of: https://radimrehurek.com/gensim/tut2.html

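This tutorial shows how to transform the vectors that represent documents into another kind of vector. There are two main reasons for doing so:

To uncover hidden structure in the corpus, such as relationships between words, and then use it to describe documents in a new, more semantic way.
To make the document representation more compact, which improves both efficiency (the new representation consumes fewer resources) and efficacy (noise is reduced).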

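I. Review

The earlier post on basic gensim usage showed how to extract features from a corpus and convert it into vectors (based on the bag-of-words model). The results from that chapter were: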

# The cleaned corpus: just nine sentences, each standing for one document
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
# The dictionary built from the corpus above; every word has an id
{'computer': 0,
 'eps': 8,
 'graph': 10,
 'human': 1,
 'interface': 2,
 'minors': 11,
 'response': 3,
 'survey': 4,
 'system': 5,
 'time': 6,
 'trees': 9,
 'user': 7}
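# The corpus above converted to bag-of-words vectors using the dictionary. (0, 1.0) means that the word with id 0, i.e. "computer", appears once in the first document; the other entries read the same way.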
[[(0, 1.0), (1, 1.0), (2, 1.0)],
 [(0, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
 [(2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0)],
 [(1, 1.0), (5, 2.0), (8, 1.0)],
 [(3, 1.0), (6, 1.0), (7, 1.0)],
 [(9, 1.0)],
 [(9, 1.0), (10, 1.0)],
 [(9, 1.0), (10, 1.0), (11, 1.0)],
 [(4, 1.0), (10, 1.0), (11, 1.0)]]
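To make the link between the three listings above concrete, here is a minimal pure-Python sketch of how such a dictionary and the bag-of-words vectors could be produced. The helper names `build_token2id` and `doc2bow` are hypothetical, and assigning ids per document in sorted token order is an assumption chosen because it reproduces the ids shown above; it is not presented as gensim's actual implementation.

```python
from collections import Counter

# The nine cleaned documents from the listing above
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]


def build_token2id(documents):
    """Assign consecutive ids to unseen tokens (assumption: sorted order within each document)."""
    token2id = {}
    for doc in documents:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted((token_id, float(n)) for token_id, n in counts.items())


token2id = build_token2id(texts)
bow_corpus = [doc2bow(doc, token2id) for doc in texts]

print(token2id['computer'], token2id['eps'], token2id['minors'])  # -> 0 8 11
print(bow_corpus[3])  # -> [(1, 1.0), (5, 2.0), (8, 1.0)]
```

The fourth document contains "system" twice, which is why its vector carries the pair (5, 2.0).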

II. Loading the results of the previous chapter (the saved dictionary and corpus vectors)

from gensim import corpora, models, similarities
import os
from pprint import pprint  # only needed for the optional debug output below

if os.path.exists('./gensim_out/deerwester.dict'):
    dictionary = corpora.Dictionary.load('./gensim_out/deerwester.dict')
    corpus = corpora.MmCorpus('./gensim_out/deerwester.mm')
    print("Using the previously saved dictionary and corpus vectors")
else:
    print("Please generate deerwester.dict and deerwester.mm first by following the previous chapter")

# pprint(dictionary.token2id)
# pprint(corpus)

III. Initializing a transformation model (Creating a transformation)

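Transformation models are standard Python objects, usually initialized with a training corpus.
We use the old corpus from tutorial 1, i.e. the corpus loaded above, to initialize (train) the transformation model. Different transformations generally require different initialization parameters. In the case of TfIdf, "training" consists simply of going through the supplied corpus once and collecting term statistics: the document frequency of every term, from which its inverse document frequency is derived. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and therefore takes more time.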

tfidf = models.TfidfModel(corpus)  # initialize (train) the model

doc_bow = [(0, 1), (1, 1)]

print(tfidf[doc_bow])  # output: [(0, 0.70710678), (1, 0.70710678)]

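The tfidf model created above should be treated as a read-only object from now on. It converts any vector from the old representation (the bag-of-words counts from the previous section) to the new representation (tf-idf weights).
Suppose the new text is: "Human computer interaction".
doc_bow is that text after the cleaning, tokenization and bag-of-words conversion from the previous chapter. In (0, 1), the 0 is the word id, i.e. "computer", and the 1 means that "computer" occurs once in the new text; likewise (1, 1) means that the word with id 1, i.e. "human", also occurs once. The word "interaction" is not in the dictionary, so it does not appear in the vector.
The tfidf model converts the new text from its bag-of-words vector ([(0, 1), (1, 1)]) into a term frequency-inverse document frequency weight vector ([(0, 0.70710678), (1, 0.70710678)]); that is, "computer" gets weight 0.70710678 and "human" gets weight 0.70710678.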
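To see where 0.70710678 comes from, the computation can be sketched by hand in pure Python. This is only an illustration under stated assumptions: the weight is taken as tf * log2(N / df) followed by L2 normalization, and the helper names `document_frequencies` and `tfidf_transform` are hypothetical, not part of gensim.

```python
import math

# Bag-of-words corpus from the review section (term_id, count)
corpus = [
    [(0, 1.0), (1, 1.0), (2, 1.0)],
    [(0, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0)],
    [(2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0)],
    [(1, 1.0), (5, 2.0), (8, 1.0)],
    [(3, 1.0), (6, 1.0), (7, 1.0)],
    [(9, 1.0)],
    [(9, 1.0), (10, 1.0)],
    [(9, 1.0), (10, 1.0), (11, 1.0)],
    [(4, 1.0), (10, 1.0), (11, 1.0)],
]


def document_frequencies(corpus):
    """'Training': one pass over the corpus, counting in how many documents each term occurs."""
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def tfidf_transform(bow, df, num_docs):
    """Weight each known term by tf * log2(N / df), then scale the vector to unit length."""
    weights = [(term_id, tf * math.log2(num_docs / df[term_id]))
               for term_id, tf in bow if term_id in df]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []
    return [(term_id, w / norm) for term_id, w in weights]


df = document_frequencies(corpus)
doc_bow = [(0, 1), (1, 1)]
tfidf_vec = tfidf_transform(doc_bow, df, len(corpus))
print([(term_id, round(w, 8)) for term_id, w in tfidf_vec])
# -> [(0, 0.70710678), (1, 0.70710678)]
```

"computer" and "human" each occur in two of the nine documents, so they receive the same raw weight; after normalization to unit length both become 1/sqrt(2), which is about 0.70710678.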

IV. Serializing the transformed results

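Calling model_name[corpus] only creates a wrapper around the old corpus document stream; the actual conversions are done on the fly, during document iteration. The whole corpus cannot be converted at the moment corpus_transformed = model[corpus] is called, because that would mean storing the result in main memory, which contradicts gensim's goal of memory independence. If you will iterate over corpus_transformed multiple times and the transformation is costly, serialize the resulting corpus to disk first and continue working with that copy.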
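The lazy behaviour described above can be illustrated with a small pure-Python sketch. The class `LazyTransformedCorpus` and the toy transformation `double_weights` are hypothetical stand-ins for illustration only; gensim's own wrapper differs in its details.

```python
class LazyTransformedCorpus:
    """Wrap a corpus so that each document is transformed only while iterating."""

    def __init__(self, transform, corpus):
        self.transform = transform
        self.corpus = corpus

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)


call_counter = {'n': 0}


def double_weights(bow):
    """Toy transformation that also counts how often it is invoked."""
    call_counter['n'] += 1
    return [(term_id, 2.0 * weight) for term_id, weight in bow]


corpus = [[(0, 1.0), (1, 1.0)], [(2, 1.0)]]

wrapped = LazyTransformedCorpus(double_weights, corpus)
calls_after_wrapping = call_counter['n']      # nothing has been transformed yet

first_pass = list(wrapped)                    # transformation runs here
calls_after_first_pass = call_counter['n']

second_pass = list(wrapped)                   # and runs again on every new pass
calls_after_second_pass = call_counter['n']

print(calls_after_wrapping, calls_after_first_pass, calls_after_second_pass)
# -> 0 2 4
```

Every pass over the wrapper repeats the work, which is why an expensive transformation is better stored once and reused. With gensim itself, the on-disk equivalent would be something like corpora.MmCorpus.serialize(path, corpus_transformed), followed by loading the file with corpora.MmCorpus(path).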

# transform the corpus with tfidf
corpus_tfidf = tfidf[corpus]

# initialize an LSI transformation
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(2)
# output:
# topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
# topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"

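The output above shows that "trees", "graph" and "minors" are all related words and contribute the most to the first topic, while the second topic is mostly concerned with the other words.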

for doc in corpus_lsi:
    print(doc)

# output (the first five documents are more strongly related to the second topic, the remaining four to the first):
[(0, -0.066), (1, 0.520)] # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)] # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)] # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)] # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)] # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)] # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)] # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)] # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)] # "Graph minors A survey"
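The reading of this output can be checked mechanically. The sketch below copies the printed vectors and picks, for each document, the topic with the largest absolute weight. Comparing absolute values is an assumption made for illustration (the sign of an LSI dimension is arbitrary), and `dominant_topic` is a hypothetical helper, not a gensim function.

```python
# Document vectors copied from the output above: (topic_id, weight)
corpus_lsi_output = [
    [(0, -0.066), (1, 0.520)],
    [(0, -0.197), (1, 0.761)],
    [(0, -0.090), (1, 0.724)],
    [(0, -0.076), (1, 0.632)],
    [(0, -0.102), (1, 0.574)],
    [(0, -0.703), (1, -0.161)],
    [(0, -0.877), (1, -0.168)],
    [(0, -0.910), (1, -0.141)],
    [(0, -0.617), (1, 0.054)],
]


def dominant_topic(doc_vec):
    """Return the id of the topic whose weight has the largest magnitude."""
    return max(doc_vec, key=lambda pair: abs(pair[1]))[0]


dominant = [dominant_topic(vec) for vec in corpus_lsi_output]
print(dominant)  # -> [1, 1, 1, 1, 1, 0, 0, 0, 0]
```

Topic ids are zero-based here, so id 1 is the "second topic" mentioned in the text: the first five documents fall under it and the last four under topic 0.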

Saving and loading the model

lsi.save('./gensim_out/model.lsi')  # save; same for tfidf, lda, ...

lsi = models.LsiModel.load('./gensim_out/model.lsi')  # load from the same path that was used for saving

Transformation models available in gensim

1.Term Frequency * Inverse Document Frequency, Tf-Idf

model = models.TfidfModel(corpus, normalize=True)

2.Latent Semantic Indexing, LSI (or sometimes LSA)

model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
model.add_documents(another_tfidf_corpus)  #now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec]  #convert some new document into the LSI space, without affecting the model
...
model.add_documents(more_documents)  #tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
...

3.Random Projections, RP

model = models.RpModel(tfidf_corpus, num_topics=500)

4.Latent Dirichlet Allocation, LDA

model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

5.Hierarchical Dirichlet Process, HDP

model = models.HdpModel(corpus, id2word=dictionary)

Original link: https://yq.aliyun.com/articles/676059
