
Featured Articles

Search results for "document processing": 10,000 articles in total

How to Submit a Storm Job on E-MapReduce to Process Kafka Data

0. Preface

This article demonstrates how to deploy a Storm cluster and a Kafka cluster on E-MapReduce, and how to run a Storm job that consumes Kafka data.

1. Preparing the environment

I chose the Hangzhou region for this test and the EMR-3.8.0 release. The component versions involved are:

Kafka: 2.11_1.0.0
Storm: 1.0.1

E-MapReduce cluster management console: https://emr.console.aliyun.com/console#/cn-hangzhou/

1.1 Creating the Hadoop cluster

The Zookeeper and Storm components are not selected by default, so remember to check them when creating the cluster, as shown below. For detailed cluster-creation steps, see the "Clusters" section of the E-MapReduce User Guide.

1.2 Creating the Kafka cluster

Next, create the Kafka cluster, choosing Kafka as the cluster type, as shown below. Note: if you are using the classic network, make sure the Hado…
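The excerpt stops at cluster creation. Once both clusters are up, one quick way to confirm that the Kafka side accepts data is a short producer script; below is a minimal sketch using the kafka-python client (not from the article; the broker address and topic name are placeholders):

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; substitute a real EMR Kafka broker host.
producer = KafkaProducer(bootstrap_servers="emr-header-1:9092")
for i in range(10):
    # send() is asynchronous; records are batched and shipped in the background.
    producer.send("test", ("message-%d" % i).encode("utf-8"))
producer.flush()  # block until every buffered record has been delivered
producer.close()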


Slick Data-Handling Tricks in Python [pandas: groupby & agg]

This article focuses on the groupby, Grouper, and agg functions in pandas. They serve a similar purpose: aggregating over some attribute of a dataset, such as totaling a user's spending for each month, or computing the max, min, sum, or mean of an attribute. agg is a feature newly introduced in pandas 0.20.

groupby && Grouper

First, download the data from the web; all of the operations below work on it:

import pandas as pd
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()

(The screenshots come from a Jupyter notebook, which is highly recommended as an interactive tool for Python.)

Next, let's compute the monthly sum of the 'ext price' attribute. Note that resample can only be used when the index is of date type:

df.set_index('date').resample('M')['ext price'].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

Going one step further, suppose we want each customer's monthly sum; that calls for a groupby:

df.set_index('date').groupby('name')['ext price'].resample("M").sum()

name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
                                 2014-09-30    12693.74
                                 2014-10-31     9324.37
                                 2014-11-30     6021.11
                                 2014-12-31     7640.60
Frami, Hills and Schmidt         2014-01-31     5112.34
                                 2014-02-28     4124.53
                                 2014-03-31    10397.44
                                 2014-04-30     5036.18
                                 2014-05-31     4097.87
                                 2014-06-30    13192.19
                                                   ...
Trantow-Barrows                  2014-07-31    11987.34
                                 2014-08-31    17251.65
                                 2014-09-30     6992.48
                                 2014-10-31    10064.27
                                 2014-11-30     6550.10
                                 2014-12-31    10124.23
White-Trantow                    2014-01-31    13703.77
                                 2014-02-28    11783.98
                                 2014-03-31     8583.05
                                 2014-04-30    19009.20
                                 2014-05-31     5877.29
                                 2014-06-30    14791.32
                                 2014-07-31    10242.62
                                 2014-08-31    12287.21
                                 2014-09-30     5315.16
                                 2014-10-31    19896.85
                                 2014-11-30     9544.61
                                 2014-12-31     4806.93
Will LLC                         2014-01-31    20953.87
                                 2014-02-28    13613.06
                                 2014-03-31     9838.93
                                 2014-04-30     6094.94
                                 2014-05-31    11856.95
                                 2014-06-30     2419.52
                                 2014-07-31    11017.54
                                 2014-08-31     1439.82
                                 2014-09-30     4345.99
                                 2014-10-31     7085.33
                                 2014-11-30     3210.44
                                 2014-12-31    12561.21
Name: ext price, Length: 240, dtype: float64

The result is certainly correct, but not pretty. With Grouper we can write it more cleanly:

# df.set_index('date').groupby('name')['ext price'].resample("M").sum()
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()

The result is identical to the one above, so it is not repeated here.

Clearly this version takes more keystrokes, so what is its benefit?

First, it is logically more direct. When you write the code for this statistic, groupby is the operation that comes to mind first, whereas set_index and resample do not. Having thought of the groupby "action", the next thing you think about is which key to group by, and at that point a string, or a Grouper, is all you need to define the key. Finally you apply an aggregation function and get the result. From the standpoint of human reasoning, the latter form is easier to remember.

In addition, the freq inside Grouper can easily be changed to other period parameters (as it can for resample), for example:

# Yearly sum of 'ext price', with periods ending on the last day of December
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()

name                             date
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

agg

Starting with 0.20.1, pandas introduced the agg function, which provides column-based aggregation, whereas groupby can be seen as aggregation over rows, that is, over the index.

In terms of implementation, groupby returns a DataFrameGroupBy object, and only after an aggregation function (such as sum) is called on it do you get a Series result. agg, on the other hand, is a method directly on DataFrame, and it also returns a DataFrame. Much of this could of course be done with sum, mean, and so on, but agg is more concise, and the functions you pass it can be strings or custom functions, each of which receives the corresponding column as its argument.

An example:

df[["ext price","quantity","unit price"]].agg(['sum','mean'])

Isn't that much more concise than:

df[["ext price", "quantity"]].sum()
df['unit price'].mean()

You can also apply different aggregation functions to different columns:

df.agg({'ext price': ['sum','mean'],'quantity': ['sum','mean'],'unit price': ['mean']})

And custom functions are just as easy. For instance, to find the most frequently purchased product code in the sku column:

# here x is the column for sku
get_max = lambda x: x.value_counts(dropna=False).index[0]
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})

The <lambda> label in the output looks out of place; let's get rid of it:

get_max = lambda x: x.value_counts(dropna=False).index[0]
# Python really is flexible.
get_max.__name__ = "most frequent"
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})

One more small point: if you want the output columns arranged in a particular order, you can use collections.OrderedDict:

get_max = lambda x: x.value_counts(dropna=False).index[0]
get_max.__name__ = "most frequent"
import collections
agg_dict = {
    'ext price': ['sum', 'mean'],
    'quantity': ['sum', 'mean'],
    'unit price': ['mean'],
    'sku': [get_max]}
# Sort by length of column name. An OrderedDict keeps its insertion order.
df.agg(collections.OrderedDict(sorted(agg_dict.items(), key=lambda x: len(x[0]))))

Original: http://pbpython.com/pandas-grouper-agg.html
Via: https://segmentfault.com/a/1190000012394176
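As a closing recap, the two halves of this article combine naturally: Grouper supplies the time bucket and agg the per-column functions. A small sketch (my own addition, not from the original, using the same sample data):

import pandas as pd

df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df["date"])

# One pass: bucket by customer and month with Grouper, then apply
# several aggregations at once with agg.
monthly = df.groupby(["name", pd.Grouper(key="date", freq="M")]).agg(
    {"ext price": ["sum", "mean"], "quantity": "sum"})
print(monthly.head())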


Analysis of the Android Application Keyboard Message-Handling Mechanism (7)

The function first opens the device file by its file name:

fd = open(deviceName, O_RDWR);

Information about every input device file in the system is stored in the member variable mDevicesById, so the function first locates a free slot in mDevicesById to hold the information for the device file just opened:

mDevicesById[devid].seq = (mDevicesById[devid].seq + (1 << SEQ_SHIFT)) & SEQ_MASK;
if (mDevicesById[devid].seq == 0) {
    mDevicesById[devid].seq = 1 << SEQ_SHIFT;
}

Once a free slot is found, the corresponding device_t record is created for this input device file:

mDevicesById[devid].seq = (mDevicesById[devid].seq + (1 << SEQ_SHIFT)) & SEQ_MASK;
if (mDevicesById[devid].seq == 0) {
    mDevicesById[devid].seq = 1 << SEQ_SHIFT;
}
new_mFDs = (pollfd*)realloc(mFDs, sizeof(mFDs[0]) * (mFDCount + 1));
new_devices = (device_t**)realloc(mDevices, sizeof(mDevices[0]) * (mFDCount + 1));
if (new_mFDs == NULL || new_devices == NULL) {
    LOGE("out of memory");
    return -1;
}
mFDs = new_mFDs;
mDevices = new_devices;
......
device_t* device = new device_t(devid | mDevicesById[devid].seq, deviceName, name);
if (device == NULL) {
    LOGE("out of memory");
    return -1;
}
device->fd = fd;

The device file is also recorded in the mFDs array:

mFDs[mFDCount].fd = fd;
mFDs[mFDCount].events = POLLIN;
mFDs[mFDCount].revents = 0;

Next, the function checks whether this device is a keyboard:

// Figure out the kinds of events the device reports.
uint8_t key_bitmask[sizeof_bit_array(KEY_MAX + 1)];
memset(key_bitmask, 0, sizeof(key_bitmask));
LOGV("Getting keys...");
if (ioctl(fd, EVIOCGBIT(EV_KEY, sizeof(key_bitmask)), key_bitmask) >= 0) {
    // See if this is a keyboard. Ignore everything in the button range except for
    // gamepads which are also considered keyboards.
    if (containsNonZeroByte(key_bitmask, 0, sizeof_bit_array(BTN_MISC))
            || containsNonZeroByte(key_bitmask, sizeof_bit_array(BTN_GAMEPAD),
                    sizeof_bit_array(BTN_DIGI))
            || containsNonZeroByte(key_bitmask, sizeof_bit_array(KEY_OK),
                    sizeof_bit_array(KEY_MAX + 1))) {
        device->classes |= INPUT_DEVICE_CLASS_KEYBOARD;
        device->keyBitmask = new uint8_t[sizeof(key_bitmask)];
        if (device->keyBitmask != NULL) {
            memcpy(device->keyBitmask, key_bitmask, sizeof(key_bitmask));
        } else {
            delete device;
            LOGE("out of memory allocating key bitmask");
            return -1;
        }
    }
}

If it is, the device_t structure created earlier for this device file is initialized further; chiefly, the INPUT_DEVICE_CLASS_KEYBOARD bit of the structure's classes member is set to 1 to indicate that this is a keyboard.

For a keyboard device the initialization work is still not done: the keyboard layout and related information must also be set up:

if ((device->classes & INPUT_DEVICE_CLASS_KEYBOARD) != 0) {
    char tmpfn[sizeof(name)];
    char keylayoutFilename[300];

    // a more descriptive name
    device->name = name;

    // replace all the spaces with underscores
    strcpy(tmpfn, name);
    for (char* p = strchr(tmpfn, ' '); p && *p; p = strchr(tmpfn, ' '))
        *p = '_';

    // find the .kl file we need for this device
    const char* root = getenv("ANDROID_ROOT");
    snprintf(keylayoutFilename, sizeof(keylayoutFilename),
            "%s/usr/keylayout/%s.kl", root, tmpfn);
    bool defaultKeymap = false;
    if (access(keylayoutFilename, R_OK)) {
        snprintf(keylayoutFilename, sizeof(keylayoutFilename),
                "%s/usr/keylayout/%s", root, "qwerty.kl");
        defaultKeymap = true;
    }
    status_t status = device->layoutMap->load(keylayoutFilename);
    if (status) {
        LOGE("Error %d loading key layout.", status);
    }

    // tell the world about the devname (the descriptive name)
    if (!mHaveFirstKeyboard && !defaultKeymap && strstr(name, "-keypad")) {
        // the built-in keyboard has a well-known device ID of 0,
        // this device better not go away.
        mHaveFirstKeyboard = true;
        mFirstKeyboardId = device->id;
        property_set("hw.keyboards.0.devname", name);
    } else {
        // ensure mFirstKeyboardId is set to -something-.
        if (mFirstKeyboardId == 0) {
            mFirstKeyboardId = device->id;
        }
    }
    char propName[100];
    sprintf(propName, "hw.keyboards.%u.devname", device->id);
    property_set(propName, name);

    // 'Q' key support = cheap test of whether this is an alpha-capable kbd
    if (hasKeycodeLocked(device, AKEYCODE_Q)) {
        device->classes |= INPUT_DEVICE_CLASS_ALPHAKEY;
    }

    // See if this device has a DPAD.
    if (hasKeycodeLocked(device, AKEYCODE_DPAD_UP) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_DOWN) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_LEFT) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_RIGHT) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_CENTER)) {
        device->classes |= INPUT_DEVICE_CLASS_DPAD;
    }

    // See if this device has a gamepad.
    for (size_t i = 0; i < sizeof(GAMEPAD_KEYCODES) / sizeof(GAMEPAD_KEYCODES[0]); i++) {
        if (hasKeycodeLocked(device, GAMEPAD_KEYCODES[i])) {
            device->classes |= INPUT_DEVICE_CLASS_GAMEPAD;
            break;
        }
    }

    LOGI("New keyboard: device->id=0x%x devname='%s' propName='%s' keylayout='%s'\n",
            device->id, name, propName, keylayoutFilename);
}

At this point, the system's input device file has been opened.

Reposted from Luoshengyang's 51CTO blog, original link: http://blog.51cto.com/shyluo/966619. For republication, please contact the original author.
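As an aside, the same EV_KEY capability probe can be tried from user space. Below is a rough sketch using the python-evdev package (not from the original article; it only approximates EventHub's classification heuristics, and it needs read access to /dev/input/event*):

import evdev  # pip install evdev

# Enumerate input devices and flag the ones that report key events,
# loosely mirroring the EVIOCGBIT(EV_KEY, ...) check above.
for path in evdev.list_devices():
    dev = evdev.InputDevice(path)
    caps = dev.capabilities()  # {event-type code: [event codes]}
    if evdev.ecodes.EV_KEY in caps:
        keys = set(caps[evdev.ecodes.EV_KEY])
        # Rough analogue of the 'Q'-key test for alpha capability.
        alpha = evdev.ecodes.KEY_Q in keys
        print("%s  %s: keyboard-like, alpha=%s" % (path, dev.name, alpha))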


Spark Data Preprocessing: Feature Standardization and Normalization Modules

# We will also standardise our data as we have done so far when performing distance-based clustering.
from time import time
from pyspark.mllib.feature import StandardScaler

standardizer = StandardScaler(True, True)
t0 = time()
standardizer_model = standardizer.fit(parsed_data_values)
tt = time() - t0
standardized_data_values = standardizer_model.transform(parsed_data_values)
print("Data standardized in {} seconds".format(round(tt, 3)))

Data standardized in 9.54 seconds

We can now perform k-means clustering.

from pyspark.mllib.clustering import KMeans

t0 = time()
clusters = KMeans.train(standardized_data_values, 80, maxIterations=10, runs=5, initializationMode="random")
tt = time() - t0
print("Data clustered in {} seconds".format(round(tt, 3)))

Data clustered in 137.496 seconds

k-means demo. The reference below is excerpted from: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.feature

pyspark.mllib.feature module

Python package for feature in MLlib.

class pyspark.mllib.feature.Normalizer(p=2.0)
Bases: pyspark.mllib.feature.VectorTransformer
Normalizes samples individually to unit L^p norm.
For any 1 <= p < float('inf'), normalizes samples using sum(abs(vector)^p)^(1/p) as norm.
For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
Parameters: p – Normalization in L^p space, p = 2 by default.

>>> v = Vectors.dense(range(3))
>>> nor = Normalizer(1)
>>> nor.transform(v)
DenseVector([0.0, 0.3333, 0.6667])
>>> rdd = sc.parallelize([v])
>>> nor.transform(rdd).collect()
[DenseVector([0.0, 0.3333, 0.6667])]
>>> nor2 = Normalizer(float("inf"))
>>> nor2.transform(v)
DenseVector([0.0, 0.5, 1.0])

New in version 1.2.0.

transform(vector)
Applies unit length normalization on a vector.
Parameters: vector – vector or RDD of vector to be normalized.
Returns: normalized vector. If the norm of the input is zero, it will return the input vector.
New in version 1.2.0.

class pyspark.mllib.feature.StandardScalerModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents a StandardScaler model that can transform vectors.
New in version 1.2.0.

mean
Return the column mean values.
New in version 2.0.0.

setWithMean(withMean)
Setter of the boolean which decides whether it uses mean or not.
New in version 1.4.0.

setWithStd(withStd)
Setter of the boolean which decides whether it uses std or not.
New in version 1.4.0.

std
Return the column standard deviation values.
New in version 2.0.0.

transform(vector)
Applies standardization transformation on a vector.
Note: In Python, transform cannot currently be used within an RDD transformation or action. Call transform directly on the RDD instead.
Parameters: vector – Vector or RDD of Vector to be standardized.
Returns: Standardized vector. If the variance of a column is zero, it will return default 0.0 for the column with zero variance.
New in version 1.2.0.

withMean
Returns if the model centers the data before scaling.
New in version 2.0.0.

withStd
Returns if the model scales the data to unit standard deviation.
New in version 2.0.0.

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)
Bases: object
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Parameters:
withMean – False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
withStd – True by default. Scales the data to unit standard deviation.

>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True

New in version 1.2.0.

fit(dataset)
Computes the mean and variance and stores as a model to be used for later scaling.
Parameters: dataset – The data used to compute the mean and variance to build the transformation model.
Returns: a StandardScalerModel
New in version 1.2.0.

class pyspark.mllib.feature.HashingTF(numFeatures=1048576)
Bases: object
Maps a sequence of terms to their term frequencies using the hashing trick.
Note: The terms must be hashable (cannot be dict/set/list...).
Parameters: numFeatures – number of features (default: 2^20)

>>> htf = HashingTF(100)
>>> doc = "a a b b c d".split(" ")
>>> htf.transform(doc)
SparseVector(100, {...})

New in version 1.2.0.

indexOf(term)
Returns the index of the input term.
New in version 1.2.0.

setBinary(value)
If True, term frequency vector will be binary such that non-zero term counts will be set to 1 (default: False).
New in version 2.0.0.

transform(document)
Transforms the input document (list of terms) to term frequency vectors, or transforms the RDD of documents to an RDD of term frequency vectors.
New in version 1.2.0.

class pyspark.mllib.feature.IDFModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents an IDF model that can transform term frequency vectors.
New in version 1.2.0.

idf()
Returns the current IDF vector.
New in version 1.4.0.

transform(x)
Transforms term frequency (TF) vectors to TF-IDF vectors.
If minDocFreq was set for the IDF calculation, the terms which occur in fewer than minDocFreq documents will have an entry of 0.
Note: In Python, transform cannot currently be used within an RDD transformation or action. Call transform directly on the RDD instead.
Parameters: x – an RDD of term frequency vectors or a term frequency vector
Returns: an RDD of TF-IDF vectors or a TF-IDF vector
New in version 1.2.0.

class pyspark.mllib.feature.IDF(minDocFreq=0)
Bases: object
Inverse document frequency (IDF).
The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.
This implementation supports filtering out terms which do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that are not in at least minDocFreq documents, the IDF is found as 0, resulting in TF-IDFs of 0.
Parameters: minDocFreq – minimum number of documents in which a term should appear for filtering

>>> n = 4
>>> freqs = [Vectors.sparse(n, (1, 3), (1.0, 2.0)),
...          Vectors.dense([0.0, 1.0, 2.0, 3.0]),
...          Vectors.sparse(n, [1], [1.0])]
>>> data = sc.parallelize(freqs)
>>> idf = IDF()
>>> model = idf.fit(data)
>>> tfidf = model.transform(data)
>>> for r in tfidf.collect(): r
SparseVector(4, {1: 0.0, 3: 0.5754})
DenseVector([0.0, 0.0, 1.3863, 0.863])
SparseVector(4, {1: 0.0})
>>> model.transform(Vectors.dense([0.0, 1.0, 2.0, 3.0]))
DenseVector([0.0, 0.0, 1.3863, 0.863])
>>> model.transform([0.0, 1.0, 2.0, 3.0])
DenseVector([0.0, 0.0, 1.3863, 0.863])
>>> model.transform(Vectors.sparse(n, (1, 3), (1.0, 2.0)))
SparseVector(4, {1: 0.0, 3: 0.5754})

New in version 1.2.0.

fit(dataset)
Computes the inverse document frequency.
Parameters: dataset – an RDD of term frequency vectors
New in version 1.2.0.

class pyspark.mllib.feature.Word2Vec
Bases: object
Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representations of words in the vocabulary. The vector representations can be used as features in natural language processing and machine learning algorithms.
We used the skip-gram model in our implementation and the hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For the original C implementation, see https://code.google.com/p/word2vec/
For research papers, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

>>> sentence = "a b " * 100 + "a c " * 10
>>> localDoc = [sentence, sentence]
>>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
>>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)

Querying for synonyms of a word will not return that word:

>>> syms = model.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']

But querying for synonyms of a vector may return the word whose representation is that vector:

>>> vec = model.transform("a")
>>> syms = model.findSynonyms(vec, 2)
>>> [s[0] for s in syms]
[u'a', u'b']

>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = Word2VecModel.load(sc, path)
>>> model.transform("a") == sameModel.transform("a")
True
>>> syms = sameModel.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass

New in version 1.2.0.

fit(data)
Computes the vector representation of each word in vocabulary.
Parameters: data – training data. RDD of list of string
Returns: Word2VecModel instance
New in version 1.2.0.

setLearningRate(learningRate)
Sets initial learning rate (default: 0.025).
New in version 1.2.0.

setMinCount(minCount)
Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
New in version 1.4.0.

setNumIterations(numIterations)
Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
New in version 1.2.0.

setNumPartitions(numPartitions)
Sets number of partitions (default: 1). Use a small number for accuracy.
New in version 1.2.0.

setSeed(seed)
Sets random seed.
New in version 1.2.0.

setVectorSize(vectorSize)
Sets vector size (default: 100).
New in version 1.2.0.

setWindowSize(windowSize)
Sets window size (default: 5).
New in version 2.0.0.

class pyspark.mllib.feature.Word2VecModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer, pyspark.mllib.util.JavaSaveable, pyspark.mllib.util.JavaLoader
class for Word2Vec model
New in version 1.2.0.

findSynonyms(word, num)
Find synonyms of a word.
Parameters:
word – a word or a vector representation of word
num – number of synonyms to find
Returns: array of (word, cosineSimilarity)
Note: Local use only.
New in version 1.2.0.

getVectors()
Returns a map of words to their vector representations.
New in version 1.4.0.

classmethod load(sc, path)
Load a model from the given path.
New in version 1.5.0.

transform(word)
Transforms a word to its vector representation.
Note: Local use only.
Parameters: word – a word
Returns: vector representation of word(s)
New in version 1.2.0.

class pyspark.mllib.feature.ChiSqSelector(numTopFeatures=50, selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)
Bases: object
Creates a ChiSquared feature selector. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
numTopFeatures chooses a fixed number of top features according to a chi-squared test.
percentile is similar but chooses a fraction of all features instead of a fixed number.
fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.

>>> data = sc.parallelize([
...     LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
...     LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
...     LabeledPoint(1.0, [0.0, 9.0, 8.0]),
...     LabeledPoint(2.0, [7.0, 9.0, 5.0]),
...     LabeledPoint(2.0, [8.0, 7.0, 3.0])
... ])
>>> model = ChiSqSelector(numTopFeatures=1).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="fpr", fpr=0.2).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="percentile", percentile=0.34).fit(data)
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])

New in version 1.4.0.

fit(data)
Returns a ChiSquared feature selector.
Parameters: data – an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. Apply feature discretizer before using this function.
New in version 1.4.0.

setFdr(fdr)
set FDR [0.0, 1.0] for feature selection by FDR. Only applicable when selectorType = "fdr".
New in version 2.2.0.

setFpr(fpr)
set FPR [0.0, 1.0] for feature selection by FPR. Only applicable when selectorType = "fpr".
New in version 2.1.0.

setFwe(fwe)
set FWE [0.0, 1.0] for feature selection by FWE. Only applicable when selectorType = "fwe".
New in version 2.2.0.

setNumTopFeatures(numTopFeatures)
set numTopFeatures for feature selection by number of top features. Only applicable when selectorType = "numTopFeatures".
New in version 2.1.0.

setPercentile(percentile)
set percentile [0.0, 1.0] for feature selection by percentile. Only applicable when selectorType = "percentile".
New in version 2.1.0.

setSelectorType(selectorType)
set the selector type of the ChiSqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
New in version 2.1.0.

class pyspark.mllib.feature.ChiSqSelectorModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents a Chi Squared selector model.
New in version 1.4.0.

transform(vector)
Applies transformation on a vector.
Parameters: vector – Vector or RDD of Vector to be transformed.
Returns: transformed vector.
New in version 1.4.0.

class pyspark.mllib.feature.ElementwiseProduct(scalingVector)
Bases: pyspark.mllib.feature.VectorTransformer
Scales each column of the vector with the supplied weight vector, i.e. the elementwise product.

>>> weight = Vectors.dense([1.0, 2.0, 3.0])
>>> eprod = ElementwiseProduct(weight)
>>> a = Vectors.dense([2.0, 1.0, 3.0])
>>> eprod.transform(a)
DenseVector([2.0, 2.0, 9.0])
>>> b = Vectors.dense([9.0, 3.0, 4.0])
>>> rdd = sc.parallelize([a, b])
>>> eprod.transform(rdd).collect()
[DenseVector([2.0, 2.0, 9.0]), DenseVector([9.0, 6.0, 12.0])]

New in version 1.5.0.

transform(vector)
Computes the Hadamard product of the vector.
New in version 1.5.0.

Reposted from Zhang Binghua (sky)'s cnblogs blog, original link: http://www.cnblogs.com/bonelee/p/7774142.html. For republication, please contact the original author.
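To make the quoted reference concrete, here is a minimal runnable sketch (my own addition) contrasting the two scaling classes; the input vectors reuse the doctest values above:

from pyspark import SparkContext
from pyspark.mllib.feature import Normalizer, StandardScaler
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="scaling-demo")
data = sc.parallelize([Vectors.dense([-2.0, 2.3, 0.0]),
                       Vectors.dense([3.8, 0.0, 1.9])])

# Normalizer is row-wise: each vector is rescaled to unit L2 norm.
print(Normalizer(p=2.0).transform(data).collect())

# StandardScaler is column-wise: fit() learns per-column mean/stddev,
# transform() centers and scales every vector with those statistics.
model = StandardScaler(withMean=True, withStd=True).fit(data)
print(model.transform(data).collect())
sc.stop()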

