sklearn.feature_extraction.text.CountVectorizer: extracting text features by tokenizing documents


2018-01-29

 
sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>)

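Purpose: convert a collection of text documents to a matrix of token counts (that is, count how often each term occurs: the term frequency, tf). The result is stored sparsely; the documentation at the time names scipy.sparse.coo_matrix, while recent scikit-learn releases document scipy.sparse.csr_matrix.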
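As a minimal sketch of that conversion, the snippet below fits a CountVectorizer with default settings on a two-document toy corpus. The corpus and variable names are illustrative assumptions, and the code assumes Python 3 with a reasonably recent scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption for illustration only).
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index.
vocab = vectorizer.vocabulary_
# Dense copy for inspection; fine for toy data, avoid on large corpora.
counts = X.toarray()

print(vocab)
print(counts)
```

Each row of counts is one document and each column is one token; the second document contains "the" twice, so its "the" column holds 2.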

The parameters show what CountVectorizer does while extracting tf:

 
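strip_accents : {'ascii', 'unicode', None}: whether to remove accents (diacritics) during preprocessing. None, the default, leaves the text unchanged. If accents are unfamiliar, see: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==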
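A small sketch of the effect, assuming a toy corpus in which the same two words appear with and without accents. The accented characters are written as escapes so the example does not depend on file encoding.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "caf\u00e9 r\u00e9sum\u00e9" is "café résumé" (toy data, an assumption).
docs = ["caf\u00e9 r\u00e9sum\u00e9", "cafe resume"]

plain = CountVectorizer().fit(docs)
folded = CountVectorizer(strip_accents="unicode").fit(docs)

plain_vocab = sorted(plain.vocabulary_)    # accented and plain forms stay distinct
folded_vocab = sorted(folded.vocabulary_)  # accents removed, forms merge

print(plain_vocab)
print(folded_vocab)
```

Without stripping, the accented and unaccented spellings are separate features; with strip_accents='unicode' they collapse into one feature per word.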

 
lowercase : boolean, True by default: convert all characters to lowercase before tokenizing and counting tf. This parameter is usually left as True.
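To illustrate why True is the usual choice, the sketch below counts the same word in three casings with and without lowercasing. The input string is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Apple apple APPLE"]  # toy input (assumption)

lowered = CountVectorizer()  # lowercase=True by default
cased = CountVectorizer(lowercase=False)

lowered_counts = lowered.fit_transform(docs).toarray()
cased_counts = cased.fit_transform(docs).toarray()

lowered_vocab = sorted(lowered.vocabulary_)
cased_vocab = sorted(cased.vocabulary_)

print(lowered_vocab, lowered_counts)
print(cased_vocab, cased_counts)
```

With lowercasing, the three spellings are counted as one token; without it, each casing becomes its own feature.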

 
preprocessor : callable or None (default): override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can write your own function for this parameter.
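A hedged sketch of a custom preprocessor: strip_html is a hypothetical helper, not part of scikit-learn, that removes HTML-like tags. Because a custom preprocessor replaces the built-in stage (which is where lowercasing and accent stripping happen), the helper lowercases the text itself.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def strip_html(text):
    """Hypothetical preprocessor: drop <...> tags, then lowercase."""
    return re.sub(r"<[^>]+>", " ", text).lower()


docs = ["<b>Hello</b> World", "<i>hello</i> again"]  # toy data (assumption)

vectorizer = CountVectorizer(preprocessor=strip_html)
X = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)
counts = X.toarray()

print(vocab)
print(counts)
```

Tokenization and n-gram generation still run as usual on the string the preprocessor returns.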

 
tokenizer : callable or None (default): override the string tokenization step while preserving the preprocessing and n-grams generation steps. You can write your own function for this parameter.
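As an illustration, the sketch below swaps in a whitespace tokenizer so that single-character and hyphenated tokens survive. whitespace_tokenizer is a hypothetical helper, and recent scikit-learn releases may emit a warning that token_pattern is unused when a tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer


def whitespace_tokenizer(text):
    """Hypothetical tokenizer: split on whitespace only."""
    return text.split()


docs = ["a b-c d", "b-c e"]  # toy data (assumption)

vectorizer = CountVectorizer(tokenizer=whitespace_tokenizer)
X = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)
counts = X.toarray()

print(vocab)
print(counts)
```

The default token pattern would have dropped "a", "d" and "e" and split "b-c" at the hyphen; the custom tokenizer keeps them intact.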

 
stop_words : string {'english'}, list, or None (default): if 'english', a built-in stop word list for English is used. If a list, every stop word in that list is removed from the resulting tokens. If None, stop words are not handled; in that case max_df can be set to a value in [0.7, 1.0) to automatically detect and filter stop words based on the intra-corpus document frequency (df) of terms. Adjust this parameter to your own needs.
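The two explicit options can be sketched as follows; the corpus and the custom stop list are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]  # toy data (assumption)

# Option 1: a custom stop word list.
custom = CountVectorizer(stop_words=["the", "on"]).fit(docs)
custom_vocab = sorted(custom.vocabulary_)

# Option 2: the built-in English list.
english = CountVectorizer(stop_words="english").fit(docs)
english_vocab = sorted(english.vocabulary_)

print(custom_vocab)
print(english_vocab)
```

The built-in list is broader than the two-word custom list, so the second vocabulary may drop more tokens than the first.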

 
token_pattern : string: a regular expression that defines what counts as a token. The default selects tokens of 2 or more alphanumeric characters. It only takes effect when analyzer is set to 'word'.
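A short sketch of the default pattern versus a relaxed one that also accepts single-character tokens. The relaxed pattern is an assumption picked for illustration, not a recommended setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I have a cat"]  # toy input (assumption)

default_vec = CountVectorizer().fit(docs)
# Same pattern as the default, but one word character is enough.
relaxed_vec = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit(docs)

default_vocab = sorted(default_vec.vocabulary_)
relaxed_vocab = sorted(relaxed_vec.vocabulary_)

print(default_vocab)
print(relaxed_vocab)
```

The default silently discards "I" and "a", which matters for languages or domains where one-character tokens carry meaning.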

 
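ngram_range : tuple (min_n, max_n): the lower and upper bounds of the n values. The default is ngram_range=(1, 1); every n-gram with min_n <= n <= max_n is extracted as a feature. Adjust this parameter to your own needs.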
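For instance, a range of (1, 2) yields unigrams and bigrams together. The sketch below uses a one-sentence toy document (an assumption).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["new york city"]  # toy input (assumption)

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

features = sorted(vectorizer.vocabulary_)
counts = X.toarray()

print(features)
print(counts)
```

Widening the range grows the vocabulary quickly, so larger max_n values are usually paired with min_df or max_features.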

 
analyzer : string, {'word', 'char', 'char_wb'} or callable: whether features are built from word n-grams or character n-grams. If a callable is passed, it is your own function that extracts features from the raw, unprocessed input.
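The difference between 'char' and 'char_wb' is easiest to see on character bigrams. In this sketch (toy input, an assumption), 'char' slides over the whole string including the space between words, while 'char_wb' pads each word with spaces and builds n-grams inside word boundaries only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["ab cd"]  # toy input (assumption)

char_vec = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit(docs)
char_wb_vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2)).fit(docs)

char_ngrams = sorted(char_vec.vocabulary_)
char_wb_ngrams = sorted(char_wb_vec.vocabulary_)

print(char_ngrams)
print(char_wb_ngrams)
```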

 
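max_df : float in range [0.0, 1.0] or int, default=1.0: remove word tokens whose document frequency (df) is above max_df. A float is read as a proportion of documents, an int as an absolute document count. It only takes effect when vocabulary is None.

min_df : float in range [0.0, 1.0] or int, default=1: remove word tokens whose df is below min_df, again as a proportion (float) or an absolute count (int). It only takes effect when vocabulary is None.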
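The two thresholds can be sketched on a three-document toy corpus (an assumption) in which "apple" appears in 3 documents, "banana" in 2, and "cherry" and "date" in 1 each.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple cherry", "apple banana date"]  # toy data

# Integer min_df: keep terms that occur in at least 2 documents.
frequent = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

# Float max_df: drop terms in more than 70% of documents (0.7 * 3 = 2.1).
not_too_common = sorted(CountVectorizer(max_df=0.7).fit(docs).vocabulary_)

# Both filters combined.
both = sorted(CountVectorizer(min_df=2, max_df=0.7).fit(docs).vocabulary_)

print(frequent)
print(not_too_common)
print(both)
```

Setting max_df below 1.0 is the automatic stop word filtering mentioned under stop_words: "apple" occurs in every document and is dropped.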

 
max_features : int or None, default=None: keep only the max_features features with the highest term frequency across the corpus. It only takes effect when vocabulary is None.
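A minimal sketch, assuming a toy document in which three words occur 3, 2 and 1 times, so that the ranking by term frequency has no ties:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple apple apple banana banana cherry"]  # toy input (assumption)

vectorizer = CountVectorizer(max_features=2)
X = vectorizer.fit_transform(docs)

kept = sorted(vectorizer.vocabulary_)
counts = X.toarray()

print(kept)
print(counts)
```

The least frequent term, "cherry", is discarded and the matrix has only two columns.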

 
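vocabulary : Mapping or iterable, optional: user-defined word tokens to use as features. If it is not None, tf is computed only for the terms in vocabulary. Leaving it as None is usually the safer choice.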
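For cases where a fixed feature set is wanted anyway, the sketch below passes a two-word list as the vocabulary; the list and the sentence are assumptions. Terms outside the list are ignored, and the columns follow the order of the list.

```python
from sklearn.feature_extraction.text import CountVectorizer

fixed_vocab = ["cat", "dog"]  # illustrative vocabulary (assumption)
docs = ["the cat chased the dog and another dog"]

vectorizer = CountVectorizer(vocabulary=fixed_vocab)
X = vectorizer.fit_transform(docs)

mapping = vectorizer.vocabulary_
counts = X.toarray()

print(mapping)
print(counts)
```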

 
binary : boolean, default=False: if True, all non-zero counts are set to 1, so each value only records whether a term is present or absent. This is useful for discrete probabilistic models that model binary events rather than integer counts.
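A short comparison of counting versus presence flags, on an assumed toy input:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["spam spam spam eggs"]  # toy input (assumption)

count_matrix = CountVectorizer().fit_transform(docs).toarray()
binary_matrix = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns are ordered alphabetically: "eggs", then "spam".
print(count_matrix)
print(binary_matrix)
```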

 
dtype : type, optional: type of the matrix returned by fit_transform() or transform().
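As a small sketch, a smaller or floating-point dtype can be requested when the downstream estimator expects it. The choice of numpy.float32 here is only an assumption for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a small example"]  # toy input (assumption)

X_default = CountVectorizer().fit_transform(docs)
X_float = CountVectorizer(dtype=np.float32).fit_transform(docs)

print(X_default.dtype)
print(X_float.dtype)
```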

Original article: https://yq.aliyun.com/articles/414625