
Featured Articles

Search results for "document processing": 10,000 articles in total

How to Submit a Storm Job on E-MapReduce to Process Kafka Data

0. Preface

This article demonstrates how to deploy a Storm cluster and a Kafka cluster on E-MapReduce, and how to run a Storm job that consumes Kafka data.

1. Preparing the environment

I chose the Hangzhou region for this test and the EMR-3.8.0 release. The component versions involved are:

Kafka: 2.11_1.0.0
Storm: 1.0.1

E-MapReduce cluster management console: https://emr.console.aliyun.com/console#/cn-hangzhou/

1.1 Creating the Hadoop cluster

The Zookeeper and Storm components are not selected by default, so remember to check them when creating the cluster, as shown below. For detailed cluster-creation steps, see the "Clusters" section of the E-MapReduce User Guide.

1.2 Creating the Kafka cluster

Next, create the Kafka cluster, choosing Kafka as the cluster type, as shown below. Note: if you are using the classic network, make sure the Hado…
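The excerpt stops at cluster creation. Once both clusters are up, one quick way to confirm that the Kafka side accepts data is a short producer script; below is a minimal sketch using the kafka-python client (not from the article; the broker address and topic name are placeholders):

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; substitute a real EMR Kafka broker host.
producer = KafkaProducer(bootstrap_servers="emr-header-1:9092")
for i in range(10):
    # send() is asynchronous; records are batched and shipped in the background.
    producer.send("test", ("message-%d" % i).encode("utf-8"))
producer.flush()  # block until every buffered record has been delivered
producer.close()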


Slick Data-Handling Tricks in Python [pandas: groupby & agg]

This article focuses on the groupby, Grouper, and agg functions in pandas. They serve a similar purpose: aggregating over some attribute of a dataset, such as totaling a user's spending for each month, or computing the max, min, sum, or mean of an attribute. agg is a feature newly introduced in pandas 0.20.

groupby && Grouper

First, download the data from the web; all of the operations below work on it:

import pandas as pd
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()

(The screenshots come from a Jupyter notebook, which is highly recommended as an interactive tool for Python.)

Next, let's compute the monthly sum of the 'ext price' attribute. Note that resample can only be used when the index is of date type:

df.set_index('date').resample('M')['ext price'].sum()

date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64

Going one step further, suppose we want each customer's monthly sum; that calls for a groupby:

df.set_index('date').groupby('name')['ext price'].resample("M").sum()

name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
                                 2014-09-30    12693.74
                                 2014-10-31     9324.37
                                 2014-11-30     6021.11
                                 2014-12-31     7640.60
Frami, Hills and Schmidt         2014-01-31     5112.34
                                 2014-02-28     4124.53
                                 2014-03-31    10397.44
                                 2014-04-30     5036.18
                                 2014-05-31     4097.87
                                 2014-06-30    13192.19
                                                   ...
Trantow-Barrows                  2014-07-31    11987.34
                                 2014-08-31    17251.65
                                 2014-09-30     6992.48
                                 2014-10-31    10064.27
                                 2014-11-30     6550.10
                                 2014-12-31    10124.23
White-Trantow                    2014-01-31    13703.77
                                 2014-02-28    11783.98
                                 2014-03-31     8583.05
                                 2014-04-30    19009.20
                                 2014-05-31     5877.29
                                 2014-06-30    14791.32
                                 2014-07-31    10242.62
                                 2014-08-31    12287.21
                                 2014-09-30     5315.16
                                 2014-10-31    19896.85
                                 2014-11-30     9544.61
                                 2014-12-31     4806.93
Will LLC                         2014-01-31    20953.87
                                 2014-02-28    13613.06
                                 2014-03-31     9838.93
                                 2014-04-30     6094.94
                                 2014-05-31    11856.95
                                 2014-06-30     2419.52
                                 2014-07-31    11017.54
                                 2014-08-31     1439.82
                                 2014-09-30     4345.99
                                 2014-10-31     7085.33
                                 2014-11-30     3210.44
                                 2014-12-31    12561.21
Name: ext price, Length: 240, dtype: float64

The result is certainly correct, but not pretty. With Grouper we can write it more cleanly:

# df.set_index('date').groupby('name')['ext price'].resample("M").sum()
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()

The result is identical to the one above, so it is not repeated here.

Clearly this version takes more keystrokes, so what is its benefit?

First, it is logically more direct. When you write the code for this statistic, groupby is the operation that comes to mind first, whereas set_index and resample do not. Having thought of the groupby "action", the next thing you think about is which key to group by, and at that point a string, or a Grouper, is all you need to define the key. Finally you apply an aggregation function and get the result. From the standpoint of human reasoning, the latter form is easier to remember.

In addition, the freq inside Grouper can easily be changed to other period parameters (as it can for resample), for example:

# Yearly sum of 'ext price', with periods ending on the last day of December
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()

name                             date
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

agg

Starting with 0.20.1, pandas introduced the agg function, which provides column-based aggregation, whereas groupby can be seen as aggregation over rows, that is, over the index.

In terms of implementation, groupby returns a DataFrameGroupBy object, and only after an aggregation function (such as sum) is called on it do you get a Series result. agg, on the other hand, is a method directly on DataFrame, and it also returns a DataFrame. Much of this could of course be done with sum, mean, and so on, but agg is more concise, and the functions you pass it can be strings or custom functions, each of which receives the corresponding column as its argument.

An example:

df[["ext price","quantity","unit price"]].agg(['sum','mean'])

Isn't that much more concise than:

df[["ext price", "quantity"]].sum()
df['unit price'].mean()

You can also apply different aggregation functions to different columns:

df.agg({'ext price': ['sum','mean'],'quantity': ['sum','mean'],'unit price': ['mean']})

And custom functions are just as easy. For instance, to find the most frequently purchased product code in the sku column:

# here x is the column for sku
get_max = lambda x: x.value_counts(dropna=False).index[0]
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})

The <lambda> label in the output looks out of place; let's get rid of it:

get_max = lambda x: x.value_counts(dropna=False).index[0]
# Python really is flexible.
get_max.__name__ = "most frequent"
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})

One more small point: if you want the output columns arranged in a particular order, you can use collections.OrderedDict:

get_max = lambda x: x.value_counts(dropna=False).index[0]
get_max.__name__ = "most frequent"
import collections
agg_dict = {
    'ext price': ['sum', 'mean'],
    'quantity': ['sum', 'mean'],
    'unit price': ['mean'],
    'sku': [get_max]}
# Sort by length of column name. An OrderedDict keeps its insertion order.
df.agg(collections.OrderedDict(sorted(agg_dict.items(), key=lambda x: len(x[0]))))

Original: http://pbpython.com/pandas-grouper-agg.html
Via: https://segmentfault.com/a/1190000012394176
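As a closing recap, the two halves of this article combine naturally: Grouper supplies the time bucket and agg the per-column functions. A small sketch (my own addition, not from the original, using the same sample data):

import pandas as pd

df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df["date"])

# One pass: bucket by customer and month with Grouper, then apply
# several aggregations at once with agg.
monthly = df.groupby(["name", pd.Grouper(key="date", freq="M")]).agg(
    {"ext price": ["sum", "mean"], "quantity": "sum"})
print(monthly.head())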


Analysis of the Android Application Keyboard Message-Handling Mechanism (7)

The function first opens the device file by its file name:

fd = open(deviceName, O_RDWR);

Information about every input device file in the system is stored in the member variable mDevicesById, so the function first locates a free slot in mDevicesById to hold the information for the device file just opened:

mDevicesById[devid].seq = (mDevicesById[devid].seq + (1 << SEQ_SHIFT)) & SEQ_MASK;
if (mDevicesById[devid].seq == 0) {
    mDevicesById[devid].seq = 1 << SEQ_SHIFT;
}

Once a free slot is found, the corresponding device_t record is created for this input device file:

mDevicesById[devid].seq = (mDevicesById[devid].seq + (1 << SEQ_SHIFT)) & SEQ_MASK;
if (mDevicesById[devid].seq == 0) {
    mDevicesById[devid].seq = 1 << SEQ_SHIFT;
}
new_mFDs = (pollfd*)realloc(mFDs, sizeof(mFDs[0]) * (mFDCount + 1));
new_devices = (device_t**)realloc(mDevices, sizeof(mDevices[0]) * (mFDCount + 1));
if (new_mFDs == NULL || new_devices == NULL) {
    LOGE("out of memory");
    return -1;
}
mFDs = new_mFDs;
mDevices = new_devices;
......
device_t* device = new device_t(devid | mDevicesById[devid].seq, deviceName, name);
if (device == NULL) {
    LOGE("out of memory");
    return -1;
}
device->fd = fd;

The device file is also recorded in the mFDs array:

mFDs[mFDCount].fd = fd;
mFDs[mFDCount].events = POLLIN;
mFDs[mFDCount].revents = 0;

Next, the function checks whether this device is a keyboard:

// Figure out the kinds of events the device reports.
uint8_t key_bitmask[sizeof_bit_array(KEY_MAX + 1)];
memset(key_bitmask, 0, sizeof(key_bitmask));
LOGV("Getting keys...");
if (ioctl(fd, EVIOCGBIT(EV_KEY, sizeof(key_bitmask)), key_bitmask) >= 0) {
    // See if this is a keyboard. Ignore everything in the button range except for
    // gamepads which are also considered keyboards.
    if (containsNonZeroByte(key_bitmask, 0, sizeof_bit_array(BTN_MISC))
            || containsNonZeroByte(key_bitmask, sizeof_bit_array(BTN_GAMEPAD),
                    sizeof_bit_array(BTN_DIGI))
            || containsNonZeroByte(key_bitmask, sizeof_bit_array(KEY_OK),
                    sizeof_bit_array(KEY_MAX + 1))) {
        device->classes |= INPUT_DEVICE_CLASS_KEYBOARD;
        device->keyBitmask = new uint8_t[sizeof(key_bitmask)];
        if (device->keyBitmask != NULL) {
            memcpy(device->keyBitmask, key_bitmask, sizeof(key_bitmask));
        } else {
            delete device;
            LOGE("out of memory allocating key bitmask");
            return -1;
        }
    }
}

If it is, the device_t structure created earlier for this device file is initialized further; chiefly, the INPUT_DEVICE_CLASS_KEYBOARD bit of the structure's classes member is set to 1 to indicate that this is a keyboard.

For a keyboard device the initialization work is still not done: the keyboard layout and related information must also be set up:

if ((device->classes & INPUT_DEVICE_CLASS_KEYBOARD) != 0) {
    char tmpfn[sizeof(name)];
    char keylayoutFilename[300];

    // a more descriptive name
    device->name = name;

    // replace all the spaces with underscores
    strcpy(tmpfn, name);
    for (char* p = strchr(tmpfn, ' '); p && *p; p = strchr(tmpfn, ' '))
        *p = '_';

    // find the .kl file we need for this device
    const char* root = getenv("ANDROID_ROOT");
    snprintf(keylayoutFilename, sizeof(keylayoutFilename),
            "%s/usr/keylayout/%s.kl", root, tmpfn);
    bool defaultKeymap = false;
    if (access(keylayoutFilename, R_OK)) {
        snprintf(keylayoutFilename, sizeof(keylayoutFilename),
                "%s/usr/keylayout/%s", root, "qwerty.kl");
        defaultKeymap = true;
    }
    status_t status = device->layoutMap->load(keylayoutFilename);
    if (status) {
        LOGE("Error %d loading key layout.", status);
    }

    // tell the world about the devname (the descriptive name)
    if (!mHaveFirstKeyboard && !defaultKeymap && strstr(name, "-keypad")) {
        // the built-in keyboard has a well-known device ID of 0,
        // this device better not go away.
        mHaveFirstKeyboard = true;
        mFirstKeyboardId = device->id;
        property_set("hw.keyboards.0.devname", name);
    } else {
        // ensure mFirstKeyboardId is set to -something-.
        if (mFirstKeyboardId == 0) {
            mFirstKeyboardId = device->id;
        }
    }
    char propName[100];
    sprintf(propName, "hw.keyboards.%u.devname", device->id);
    property_set(propName, name);

    // 'Q' key support = cheap test of whether this is an alpha-capable kbd
    if (hasKeycodeLocked(device, AKEYCODE_Q)) {
        device->classes |= INPUT_DEVICE_CLASS_ALPHAKEY;
    }

    // See if this device has a DPAD.
    if (hasKeycodeLocked(device, AKEYCODE_DPAD_UP) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_DOWN) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_LEFT) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_RIGHT) &&
            hasKeycodeLocked(device, AKEYCODE_DPAD_CENTER)) {
        device->classes |= INPUT_DEVICE_CLASS_DPAD;
    }

    // See if this device has a gamepad.
    for (size_t i = 0; i < sizeof(GAMEPAD_KEYCODES) / sizeof(GAMEPAD_KEYCODES[0]); i++) {
        if (hasKeycodeLocked(device, GAMEPAD_KEYCODES[i])) {
            device->classes |= INPUT_DEVICE_CLASS_GAMEPAD;
            break;
        }
    }

    LOGI("New keyboard: device->id=0x%x devname='%s' propName='%s' keylayout='%s'\n",
            device->id, name, propName, keylayoutFilename);
}

At this point, the system's input device file has been opened.

Reposted from Luoshengyang's 51CTO blog, original link: http://blog.51cto.com/shyluo/966619. For republication, please contact the original author.
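As an aside, the same EV_KEY capability probe can be tried from user space. Below is a rough sketch using the python-evdev package (not from the original article; it only approximates EventHub's classification heuristics, and it needs read access to /dev/input/event*):

import evdev  # pip install evdev

# Enumerate input devices and flag the ones that report key events,
# loosely mirroring the EVIOCGBIT(EV_KEY, ...) check above.
for path in evdev.list_devices():
    dev = evdev.InputDevice(path)
    caps = dev.capabilities()  # {event-type code: [event codes]}
    if evdev.ecodes.EV_KEY in caps:
        keys = set(caps[evdev.ecodes.EV_KEY])
        # Rough analogue of the 'Q'-key test for alpha capability.
        alpha = evdev.ecodes.KEY_Q in keys
        print("%s  %s: keyboard-like, alpha=%s" % (path, dev.name, alpha))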


Spark Data Preprocessing: Feature Standardization and Normalization Modules

# We will also standardise our data as we have done so far when performing distance-based clustering.
from time import time
from pyspark.mllib.feature import StandardScaler

standardizer = StandardScaler(True, True)
t0 = time()
standardizer_model = standardizer.fit(parsed_data_values)
tt = time() - t0
standardized_data_values = standardizer_model.transform(parsed_data_values)
print("Data standardized in {} seconds".format(round(tt, 3)))

Data standardized in 9.54 seconds

We can now perform k-means clustering.

from pyspark.mllib.clustering import KMeans

t0 = time()
clusters = KMeans.train(standardized_data_values, 80, maxIterations=10, runs=5, initializationMode="random")
tt = time() - t0
print("Data clustered in {} seconds".format(round(tt, 3)))

Data clustered in 137.496 seconds

k-means demo. The reference below is excerpted from: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.feature

pyspark.mllib.feature module

Python package for feature in MLlib.

class pyspark.mllib.feature.Normalizer(p=2.0)
Bases: pyspark.mllib.feature.VectorTransformer
Normalizes samples individually to unit L^p norm.
For any 1 <= p < float('inf'), normalizes samples using sum(abs(vector)^p)^(1/p) as norm.
For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
Parameters: p – Normalization in L^p space, p = 2 by default.

>>> v = Vectors.dense(range(3))
>>> nor = Normalizer(1)
>>> nor.transform(v)
DenseVector([0.0, 0.3333, 0.6667])
>>> rdd = sc.parallelize([v])
>>> nor.transform(rdd).collect()
[DenseVector([0.0, 0.3333, 0.6667])]
>>> nor2 = Normalizer(float("inf"))
>>> nor2.transform(v)
DenseVector([0.0, 0.5, 1.0])

New in version 1.2.0.

transform(vector)
Applies unit length normalization on a vector.
Parameters: vector – vector or RDD of vector to be normalized.
Returns: normalized vector. If the norm of the input is zero, it will return the input vector.
New in version 1.2.0.

class pyspark.mllib.feature.StandardScalerModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents a StandardScaler model that can transform vectors.
New in version 1.2.0.

mean
Return the column mean values.
New in version 2.0.0.

setWithMean(withMean)
Setter of the boolean which decides whether it uses mean or not.
New in version 1.4.0.

setWithStd(withStd)
Setter of the boolean which decides whether it uses std or not.
New in version 1.4.0.

std
Return the column standard deviation values.
New in version 2.0.0.

transform(vector)
Applies standardization transformation on a vector.
Note: In Python, transform cannot currently be used within an RDD transformation or action. Call transform directly on the RDD instead.
Parameters: vector – Vector or RDD of Vector to be standardized.
Returns: Standardized vector. If the variance of a column is zero, it will return default 0.0 for the column with zero variance.
New in version 1.2.0.

withMean
Returns if the model centers the data before scaling.
New in version 2.0.0.

withStd
Returns if the model scales the data to unit standard deviation.
New in version 2.0.0.

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)
Bases: object
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Parameters:
withMean – False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
withStd – True by default. Scales the data to unit standard deviation.

>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True

New in version 1.2.0.

fit(dataset)
Computes the mean and variance and stores as a model to be used for later scaling.
Parameters: dataset – The data used to compute the mean and variance to build the transformation model.
Returns: a StandardScalerModel
New in version 1.2.0.

class pyspark.mllib.feature.HashingTF(numFeatures=1048576)
Bases: object
Maps a sequence of terms to their term frequencies using the hashing trick.
Note: The terms must be hashable (cannot be dict/set/list...).
Parameters: numFeatures – number of features (default: 2^20)

>>> htf = HashingTF(100)
>>> doc = "a a b b c d".split(" ")
>>> htf.transform(doc)
SparseVector(100, {...})

New in version 1.2.0.

indexOf(term)
Returns the index of the input term.
New in version 1.2.0.

setBinary(value)
If True, term frequency vector will be binary such that non-zero term counts will be set to 1 (default: False).
New in version 2.0.0.

transform(document)
Transforms the input document (list of terms) to term frequency vectors, or transforms the RDD of documents to an RDD of term frequency vectors.
New in version 1.2.0.

class pyspark.mllib.feature.IDFModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents an IDF model that can transform term frequency vectors.
New in version 1.2.0.

idf()
Returns the current IDF vector.
New in version 1.4.0.

transform(x)
Transforms term frequency (TF) vectors to TF-IDF vectors.
If minDocFreq was set for the IDF calculation, the terms which occur in fewer than minDocFreq documents will have an entry of 0.
Note: In Python, transform cannot currently be used within an RDD transformation or action. Call transform directly on the RDD instead.
Parameters: x – an RDD of term frequency vectors or a term frequency vector
Returns: an RDD of TF-IDF vectors or a TF-IDF vector
New in version 1.2.0.

class pyspark.mllib.feature.IDF(minDocFreq=0)
Bases: object
Inverse document frequency (IDF).
The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.
This implementation supports filtering out terms which do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that are not in at least minDocFreq documents, the IDF is found as 0, resulting in TF-IDFs of 0.
Parameters: minDocFreq – minimum number of documents in which a term should appear for filtering

>>> n = 4
>>> freqs = [Vectors.sparse(n, (1, 3), (1.0, 2.0)),
...          Vectors.dense([0.0, 1.0, 2.0, 3.0]),
...          Vectors.sparse(n, [1], [1.0])]
>>> data = sc.parallelize(freqs)
>>> idf = IDF()
>>> model = idf.fit(data)
>>> tfidf = model.transform(data)
>>> for r in tfidf.collect(): r
SparseVector(4, {1: 0.0, 3: 0.5754})
DenseVector([0.0, 0.0, 1.3863, 0.863])
SparseVector(4, {1: 0.0})
>>> model.transform(Vectors.dense([0.0, 1.0, 2.0, 3.0]))
DenseVector([0.0, 0.0, 1.3863, 0.863])
>>> model.transform([0.0, 1.0, 2.0, 3.0])
DenseVector([0.0, 0.0, 1.3863, 0.863])
>>> model.transform(Vectors.sparse(n, (1, 3), (1.0, 2.0)))
SparseVector(4, {1: 0.0, 3: 0.5754})

New in version 1.2.0.

fit(dataset)
Computes the inverse document frequency.
Parameters: dataset – an RDD of term frequency vectors
New in version 1.2.0.

class pyspark.mllib.feature.Word2Vec
Bases: object
Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representations of words in the vocabulary. The vector representations can be used as features in natural language processing and machine learning algorithms.
We used the skip-gram model in our implementation and the hierarchical softmax method to train the model. The variable names in the implementation match the original C implementation.
For the original C implementation, see https://code.google.com/p/word2vec/
For research papers, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

>>> sentence = "a b " * 100 + "a c " * 10
>>> localDoc = [sentence, sentence]
>>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
>>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)

Querying for synonyms of a word will not return that word:

>>> syms = model.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']

But querying for synonyms of a vector may return the word whose representation is that vector:

>>> vec = model.transform("a")
>>> syms = model.findSynonyms(vec, 2)
>>> [s[0] for s in syms]
[u'a', u'b']

>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = Word2VecModel.load(sc, path)
>>> model.transform("a") == sameModel.transform("a")
True
>>> syms = sameModel.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass

New in version 1.2.0.

fit(data)
Computes the vector representation of each word in vocabulary.
Parameters: data – training data. RDD of list of string
Returns: Word2VecModel instance
New in version 1.2.0.

setLearningRate(learningRate)
Sets initial learning rate (default: 0.025).
New in version 1.2.0.

setMinCount(minCount)
Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
New in version 1.4.0.

setNumIterations(numIterations)
Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
New in version 1.2.0.

setNumPartitions(numPartitions)
Sets number of partitions (default: 1). Use a small number for accuracy.
New in version 1.2.0.

setSeed(seed)
Sets random seed.
New in version 1.2.0.

setVectorSize(vectorSize)
Sets vector size (default: 100).
New in version 1.2.0.

setWindowSize(windowSize)
Sets window size (default: 5).
New in version 2.0.0.

class pyspark.mllib.feature.Word2VecModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer, pyspark.mllib.util.JavaSaveable, pyspark.mllib.util.JavaLoader
class for Word2Vec model
New in version 1.2.0.

findSynonyms(word, num)
Find synonyms of a word.
Parameters:
word – a word or a vector representation of word
num – number of synonyms to find
Returns: array of (word, cosineSimilarity)
Note: Local use only.
New in version 1.2.0.

getVectors()
Returns a map of words to their vector representations.
New in version 1.4.0.

classmethod load(sc, path)
Load a model from the given path.
New in version 1.5.0.

transform(word)
Transforms a word to its vector representation.
Note: Local use only.
Parameters: word – a word
Returns: vector representation of word(s)
New in version 1.2.0.

class pyspark.mllib.feature.ChiSqSelector(numTopFeatures=50, selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)
Bases: object
Creates a ChiSquared feature selector. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
numTopFeatures chooses a fixed number of top features according to a chi-squared test.
percentile is similar but chooses a fraction of all features instead of a fixed number.
fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.

>>> data = sc.parallelize([
...     LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
...     LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
...     LabeledPoint(1.0, [0.0, 9.0, 8.0]),
...     LabeledPoint(2.0, [7.0, 9.0, 5.0]),
...     LabeledPoint(2.0, [8.0, 7.0, 3.0])
... ])
>>> model = ChiSqSelector(numTopFeatures=1).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="fpr", fpr=0.2).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="percentile", percentile=0.34).fit(data)
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])

New in version 1.4.0.

fit(data)
Returns a ChiSquared feature selector.
Parameters: data – an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. Apply feature discretizer before using this function.
New in version 1.4.0.

setFdr(fdr)
set FDR [0.0, 1.0] for feature selection by FDR. Only applicable when selectorType = "fdr".
New in version 2.2.0.

setFpr(fpr)
set FPR [0.0, 1.0] for feature selection by FPR. Only applicable when selectorType = "fpr".
New in version 2.1.0.

setFwe(fwe)
set FWE [0.0, 1.0] for feature selection by FWE. Only applicable when selectorType = "fwe".
New in version 2.2.0.

setNumTopFeatures(numTopFeatures)
set numTopFeatures for feature selection by number of top features. Only applicable when selectorType = "numTopFeatures".
New in version 2.1.0.

setPercentile(percentile)
set percentile [0.0, 1.0] for feature selection by percentile. Only applicable when selectorType = "percentile".
New in version 2.1.0.

setSelectorType(selectorType)
set the selector type of the ChiSqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
New in version 2.1.0.

class pyspark.mllib.feature.ChiSqSelectorModel(java_model)
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents a Chi Squared selector model.
New in version 1.4.0.

transform(vector)
Applies transformation on a vector.
Parameters: vector – Vector or RDD of Vector to be transformed.
Returns: transformed vector.
New in version 1.4.0.

class pyspark.mllib.feature.ElementwiseProduct(scalingVector)
Bases: pyspark.mllib.feature.VectorTransformer
Scales each column of the vector with the supplied weight vector, i.e. the elementwise product.

>>> weight = Vectors.dense([1.0, 2.0, 3.0])
>>> eprod = ElementwiseProduct(weight)
>>> a = Vectors.dense([2.0, 1.0, 3.0])
>>> eprod.transform(a)
DenseVector([2.0, 2.0, 9.0])
>>> b = Vectors.dense([9.0, 3.0, 4.0])
>>> rdd = sc.parallelize([a, b])
>>> eprod.transform(rdd).collect()
[DenseVector([2.0, 2.0, 9.0]), DenseVector([9.0, 6.0, 12.0])]

New in version 1.5.0.

transform(vector)
Computes the Hadamard product of the vector.
New in version 1.5.0.

Reposted from Zhang Binghua (sky)'s cnblogs blog, original link: http://www.cnblogs.com/bonelee/p/7774142.html. For republication, please contact the original author.
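To make the quoted reference concrete, here is a minimal runnable sketch (my own addition) contrasting the two scaling classes; the input vectors reuse the doctest values above:

from pyspark import SparkContext
from pyspark.mllib.feature import Normalizer, StandardScaler
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="scaling-demo")
data = sc.parallelize([Vectors.dense([-2.0, 2.3, 0.0]),
                       Vectors.dense([3.8, 0.0, 1.9])])

# Normalizer is row-wise: each vector is rescaled to unit L2 norm.
print(Normalizer(p=2.0).transform(data).collect())

# StandardScaler is column-wise: fit() learns per-column mean/stddev,
# transform() centers and scales every vector with those statistics.
model = StandardScaler(withMean=True, withStd=True).fit(data)
print(model.transform(data).collect())
sc.stop()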

