Spark data preprocessing: feature standardization and normalization modules

# We will also standardise our data as we have done so far when performing distance-based clustering.

from pyspark.mllib.feature import StandardScaler

standardizer = StandardScaler(True, True)
t0 = time()
standardizer_model = standardizer.fit(parsed_data_values)
tt = time() - t0
standardized_data_values = standardizer_model.transform(parsed_data_values)
print "Data standardized in {} seconds".format(round(tt,3))

Data standardized in 9.54 seconds

We can now perform k-means clustering.

from pyspark.mllib.clustering import KMeans

t0 = time()
clusters = KMeans.train(standardized_data_values, 80, maxIterations=10, runs=5, initializationMode="random")
tt = time() - t0
print "Data clustered in {} seconds".format(round(tt,3))

Data clustered in 137.496 seconds

kmeans demo (a self-contained, runnable sketch of this standardize-then-cluster pipeline appears at the end of this article)

Excerpted from: http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.feature

pyspark.mllib.feature module

Python package for feature in MLlib.

class pyspark.mllib.feature.Normalizer(p=2.0) [source]
Bases: pyspark.mllib.feature.VectorTransformer
Normalizes samples individually to unit L^p norm.
For any 1 <= p < float('inf'), normalizes samples using sum(abs(vector)^p)^(1/p) as norm. For p = float('inf'), max(abs(vector)) will be used as norm for normalization.
Parameters: p – normalization in L^p space, p = 2 by default.
>>> v = Vectors.dense(range(3))
>>> nor = Normalizer(1)
>>> nor.transform(v)
DenseVector([0.0, 0.3333, 0.6667])
>>> rdd = sc.parallelize([v])
>>> nor.transform(rdd).collect()
[DenseVector([0.0, 0.3333, 0.6667])]
>>> nor2 = Normalizer(float("inf"))
>>> nor2.transform(v)
DenseVector([0.0, 0.5, 1.0])
New in version 1.2.0.

transform(vector) [source]
Applies unit length normalization on a vector.
Parameters: vector – vector or RDD of vector to be normalized.
Returns: normalized vector. If the norm of the input is zero, it will return the input vector.
New in version 1.2.0.

class pyspark.mllib.feature.StandardScalerModel(java_model) [source]
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents a StandardScaler model that can transform vectors.
New in version 1.2.0.

mean [source]
Return the column mean values.
New in version 2.0.0.

setWithMean(withMean) [source]
Setter of the boolean which decides whether it uses mean or not.
New in version 1.4.0.

setWithStd(withStd) [source]
Setter of the boolean which decides whether it uses std or not.
New in version 1.4.0.

std [source]
Return the column standard deviation values.
New in version 2.0.0.

transform(vector) [source]
Applies standardization transformation on a vector.
Note: In Python, transform cannot currently be used within an RDD transformation or action. Call transform directly on the RDD instead.
Parameters: vector – Vector or RDD of Vector to be standardized.
Returns: standardized vector. If the variance of a column is zero, it will return default 0.0 for the column with zero variance.
New in version 1.2.0.

withMean [source]
Returns if the model centers the data before scaling.
New in version 2.0.0.

withStd [source]
Returns if the model scales the data to unit standard deviation.
New in version 2.0.0.

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True) [source]
Bases: object
Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Parameters:
withMean – False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
withStd – True by default. Scales the data to unit standard deviation.
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True
New in version 1.2.0.

fit(dataset) [source]
Computes the mean and variance and stores as a model to be used for later scaling.
Parameters: dataset – the data used to compute the mean and variance to build the transformation model.
Returns: a StandardScalerModel
New in version 1.2.0.

class pyspark.mllib.feature.HashingTF(numFeatures=1048576) [source]
Bases: object
Maps a sequence of terms to their term frequencies using the hashing trick.
Note: the terms must be hashable (can not be dict/set/list...).
Parameters: numFeatures – number of features (default: 2^20)
>>> htf = HashingTF(100)
>>> doc = "a a b b c d".split(" ")
>>> htf.transform(doc)
SparseVector(100, {...})
New in version 1.2.0.

indexOf(term) [source]
Returns the index of the input term.
New in version 1.2.0.

setBinary(value) [source]
If True, term frequency vector will be binary such that non-zero term counts will be set to 1 (default: False).
New in version 2.0.0.

transform(document) [source]
Transforms the input document (list of terms) to term frequency vectors, or transforms the RDD of document to RDD of term frequency vectors.
New in version 1.2.0.

class pyspark.mllib.feature.IDFModel(java_model) [source]
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents an IDF model that can transform term frequency vectors.
New in version 1.2.0.

idf() [source]
Returns the current IDF vector.
New in version 1.4.0.

transform(x) [source]
Transforms term frequency (TF) vectors to TF-IDF vectors.
If minDocFreq was set for the IDF calculation, the terms which occur in fewer than minDocFreq documents will have an entry of 0.
Note: In Python, transform cannot currently be used within an RDD transformation or action. Call transform directly on the RDD instead.
Parameters: x – an RDD of term frequency vectors or a term frequency vector
Returns: an RDD of TF-IDF vectors or a TF-IDF vector
New in version 1.2.0.

class pyspark.mllib.feature.IDF(minDocFreq=0) [source]
Bases: object
Inverse document frequency (IDF).
The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.
This implementation supports filtering out terms which do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that are not in at least minDocFreq documents, the IDF is found as 0, resulting in TF-IDFs of 0.
Parameters: minDocFreq – minimum of documents in which a term should appear for filtering
>>> n = 4
>>> freqs = [Vectors.sparse(n, (1, 3), (1.0, 2.0)),
...     Vectors.dense([0.0, 1.0, 2.0, 3.0]),
...     Vectors.sparse(n, [1], [1.0])]
>>> data = sc.parallelize(freqs)
>>> idf = IDF()
>>> model = idf.fit(data)
>>> tfidf = model.transform(data)
>>> for r in tfidf.collect(): r
SparseVector(4, {1: 0.0, 3: 0.5754})
DenseVector([0.0, 0.0, 1.3863, 0.863])
SparseVector(4, {1: 0.0})
>>> model.transform(Vectors.dense([0.0, 1.0, 2.0, 3.0]))
DenseVector([0.0, 0.0, 1.3863, 0.863])
>>> model.transform([0.0, 1.0, 2.0, 3.0])
DenseVector([0.0, 0.0, 1.3863, 0.863])
>>> model.transform(Vectors.sparse(n, (1, 3), (1.0, 2.0)))
SparseVector(4, {1: 0.0, 3: 0.5754})
New in version 1.2.0.

fit(dataset) [source]
Computes the inverse document frequency.
Parameters: dataset – an RDD of term frequency vectors
New in version 1.2.0.

class pyspark.mllib.feature.Word2Vec [source]
Bases: object
Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
We used skip-gram model in our implementation and hierarchical softmax method to train the model. The variable names in the implementation matches the original C implementation.
For original C implementation, see https://code.google.com/p/word2vec/
For research papers, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.
>>> sentence = "a b " * 100 + "a c " * 10
>>> localDoc = [sentence, sentence]
>>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
>>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)

Querying for synonyms of a word will not return that word:

>>> syms = model.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']

But querying for synonyms of a vector may return the word whose representation is that vector:

>>> vec = model.transform("a")
>>> syms = model.findSynonyms(vec, 2)
>>> [s[0] for s in syms]
[u'a', u'b']
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = Word2VecModel.load(sc, path)
>>> model.transform("a") == sameModel.transform("a")
True
>>> syms = sameModel.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass
New in version 1.2.0.

fit(data) [source]
Computes the vector representation of each word in vocabulary.
Parameters: data – training data. RDD of list of string
Returns: Word2VecModel instance
New in version 1.2.0.

setLearningRate(learningRate) [source]
Sets initial learning rate (default: 0.025).
New in version 1.2.0.

setMinCount(minCount) [source]
Sets minCount, the minimum number of times a token must appear to be included in the word2vec model's vocabulary (default: 5).
New in version 1.4.0.

setNumIterations(numIterations) [source]
Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.
New in version 1.2.0.

setNumPartitions(numPartitions) [source]
Sets number of partitions (default: 1). Use a small number for accuracy.
New in version 1.2.0.

setSeed(seed) [source]
Sets random seed.
New in version 1.2.0.

setVectorSize(vectorSize) [source]
Sets vector size (default: 100).
New in version 1.2.0.

setWindowSize(windowSize) [source]
Sets window size (default: 5).
New in version 2.0.0.
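A side note on the HashingTF and IDF classes documented above: in practice they are chained to build TF-IDF features, with HashingTF producing the term-frequency vectors that IDF is then fit on. The following is a minimal sketch of that chaining, not taken from the excerpted docs; the corpus, feature count, and variable names are hypothetical, and it assumes the RDD-based pyspark.mllib API.

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF, IDF

sc = SparkContext(appName="tfidf-sketch")  # reuse an existing SparkContext if one is already running

# Hypothetical in-memory corpus: each document is a list of tokens.
documents = sc.parallelize([
    "spark mllib feature hashing".split(" "),
    "spark standard scaler feature".split(" "),
    "inverse document frequency example".split(" "),
])

# Term frequencies via the hashing trick; a small feature space keeps the demo readable.
htf = HashingTF(numFeatures=1000)
tf = htf.transform(documents)

# IDF is fit on the full TF RDD, then rescales each TF vector into a TF-IDF vector.
tf.cache()
idf_model = IDF(minDocFreq=1).fit(tf)
tfidf = idf_model.transform(tf)

print(tfidf.collect())  # a list of SparseVector(1000, {...}) TF-IDF vectors

Here minDocFreq=1 filters nothing, since every term appears in at least one document; raising it is the usual way to suppress very rare terms, as described in the IDF documentation above.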
class pyspark.mllib.feature.Word2VecModel(java_model) [source]
Bases: pyspark.mllib.feature.JavaVectorTransformer, pyspark.mllib.util.JavaSaveable, pyspark.mllib.util.JavaLoader
Class for Word2Vec model.
New in version 1.2.0.

findSynonyms(word, num) [source]
Find synonyms of a word.
Parameters:
word – a word or a vector representation of word
num – number of synonyms to find
Returns: array of (word, cosineSimilarity)
Note: Local use only.
New in version 1.2.0.

getVectors() [source]
Returns a map of words to their vector representations.
New in version 1.4.0.

classmethod load(sc, path) [source]
Load a model from the given path.
New in version 1.5.0.

transform(word) [source]
Transforms a word to its vector representation.
Note: Local use only.
Parameters: word – a word
Returns: vector representation of word(s)
New in version 1.2.0.

class pyspark.mllib.feature.ChiSqSelector(numTopFeatures=50, selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05) [source]
Bases: object
Creates a ChiSquared feature selector. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
numTopFeatures chooses a fixed number of top features according to a chi-squared test.
percentile is similar but chooses a fraction of all features instead of a fixed number.
fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
>>> data = sc.parallelize([
...     LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
...     LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
...     LabeledPoint(1.0, [0.0, 9.0, 8.0]),
...     LabeledPoint(2.0, [7.0, 9.0, 5.0]),
...     LabeledPoint(2.0, [8.0, 7.0, 3.0])
... ])
>>> model = ChiSqSelector(numTopFeatures=1).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="fpr", fpr=0.2).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="percentile", percentile=0.34).fit(data)
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
New in version 1.4.0.

fit(data) [source]
Returns a ChiSquared feature selector.
Parameters: data – an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value. Apply feature discretizer before using this function.
New in version 1.4.0.

setFdr(fdr) [source]
Set FDR [0.0, 1.0] for feature selection by FDR. Only applicable when selectorType = "fdr".
New in version 2.2.0.

setFpr(fpr) [source]
Set FPR [0.0, 1.0] for feature selection by FPR. Only applicable when selectorType = "fpr".
New in version 2.1.0.

setFwe(fwe) [source]
Set FWE [0.0, 1.0] for feature selection by FWE. Only applicable when selectorType = "fwe".
New in version 2.2.0.

setNumTopFeatures(numTopFeatures) [source]
Set numTopFeatures for feature selection by number of top features. Only applicable when selectorType = "numTopFeatures".
New in version 2.1.0.
setPercentile(percentile) [source]
Set percentile [0.0, 1.0] for feature selection by percentile. Only applicable when selectorType = "percentile".
New in version 2.1.0.

setSelectorType(selectorType) [source]
Set the selector type of the ChiSqSelector. Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
New in version 2.1.0.

class pyspark.mllib.feature.ChiSqSelectorModel(java_model) [source]
Bases: pyspark.mllib.feature.JavaVectorTransformer
Represents a Chi Squared selector model.
New in version 1.4.0.

transform(vector) [source]
Applies transformation on a vector.
Parameters: vector – Vector or RDD of Vector to be transformed.
Returns: transformed vector.
New in version 1.4.0.

class pyspark.mllib.feature.ElementwiseProduct(scalingVector) [source]
Bases: pyspark.mllib.feature.VectorTransformer
Scales each column of the vector with the supplied weight vector, i.e. the elementwise product.
>>> weight = Vectors.dense([1.0, 2.0, 3.0])
>>> eprod = ElementwiseProduct(weight)
>>> a = Vectors.dense([2.0, 1.0, 3.0])
>>> eprod.transform(a)
DenseVector([2.0, 2.0, 9.0])
>>> b = Vectors.dense([9.0, 3.0, 4.0])
>>> rdd = sc.parallelize([a, b])
>>> eprod.transform(rdd).collect()
[DenseVector([2.0, 2.0, 9.0]), DenseVector([9.0, 6.0, 12.0])]
New in version 1.5.0.

transform(vector) [source]
Computes the Hadamard product of the vector.
New in version 1.5.0.

This article is reposted from the 张昺华-sky blog on cnblogs (original: http://www.cnblogs.com/bonelee/p/7774142.html); please contact the original author before reposting.
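For reference, here is the self-contained standardize-then-cluster sketch promised in the demo at the top of this article. It is an illustrative reconstruction rather than the original notebook: the input path data.csv, its comma-separated numeric layout, and the parameter values are hypothetical, and it sticks to the RDD-based pyspark.mllib API documented above.

from time import time

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="standardize-and-cluster")  # reuse an existing SparkContext if one is already running

# Hypothetical input: a text file of comma-separated numeric features, one sample per line.
raw_data = sc.textFile("data.csv")
parsed_data_values = raw_data.map(
    lambda line: Vectors.dense([float(x) for x in line.split(",")])
)

# Standardize to zero mean and unit variance (withMean=True, withStd=True), as in the demo.
standardizer = StandardScaler(True, True)
t0 = time()
standardizer_model = standardizer.fit(parsed_data_values)
standardized_data_values = standardizer_model.transform(parsed_data_values)
print("Data standardized in {} seconds".format(round(time() - t0, 3)))

# Cluster the standardized vectors with k-means. k=80 mirrors the demo; the `runs`
# argument used there is ignored or absent in newer Spark releases, so it is omitted.
t0 = time()
clusters = KMeans.train(standardized_data_values, 80, maxIterations=10,
                        initializationMode="random")
print("Data clustered in {} seconds".format(round(time() - t0, 3)))
print("First cluster center: {}".format(clusters.clusterCenters[0]))

In practice k would not be fixed at 80 but chosen by sweeping several values and comparing the within-set sum of squared errors via computeCost on the resulting KMeansModel.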


Cloud computing / cloud storage: integrating Ceph with the OpenStack Cinder module

1. Create the storage pool

Run the following on the Ceph node:

#ceph osd pool create volumes 128

2. Configure the Ceph client for OpenStack

Run the following on the Ceph node twice, filling in {your-openstack-server} with the controller node IP the first time and the compute node IP the second time. If it reports that the /etc/ceph directory does not exist on the controller or compute node, create it on both nodes first.

#ssh {your-openstack-server} sudo tee /etc/ceph/ceph.conf < /etc/ceph/ceph.conf

3. Install the Ceph client packages

On the controller node, install the Python bindings for librbd:

#yum install python-rbd

On both the compute node and the controller node, install the Python bindings and the client command-line tools:

#yum install ceph-common
#yum install ceph

4. Configure Ceph client authentication

On the Ceph node, create a new user for Cinder:

#ceph auth get-or-create client.cinder mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes'

On the Ceph node, copy the client.cinder keyring to the controller node and change its ownership; fill in the controller node IP for {your-volume-server} and {your-cinder-volume-server}:

#ceph auth get-or-create client.cinder | ssh {your-volume-server} sudo tee /etc/ceph/ceph.client.cinder.keyring
#ssh {your-cinder-volume-server} sudo chown cinder:cinder /etc/ceph/ceph.client.cinder.keyring

On the Ceph node run the following, where {your-nova-compute-server} is the compute node IP:

#ceph auth get-or-create client.cinder | ssh {your-nova-compute-server} sudo tee /etc/ceph/ceph.client.cinder.keyring

On the Ceph node, store the client.cinder user's key in libvirt. The libvirt process needs this key to access the cluster when attaching block devices from Cinder, so first create a temporary copy of the key on the node running nova-compute ({your-compute-node} is the compute node IP):

#ceph auth get-key client.cinder | ssh {your-compute-node} tee /etc/ceph/client.cinder.key

On the compute node, run the following to add the key to libvirt and then delete the temporary copy:

#uuidgen

Record the generated UUID, replace UUIDGEN below with it, and run the following on the compute node:

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>UUIDGEN</uuid>
  <usage type='ceph'>
    <name>client.cinder secret</name>
  </usage>
</secret>
EOF

#sudo virsh secret-define --file secret.xml
#sudo virsh secret-set-value --secret 457eb676-33da-42ec-9a8c-9293d545c337 --base64 $(cat client.cinder.key) && rm client.cinder.key secret.xml

(In the secret-set-value command, 457eb676-33da-42ec-9a8c-9293d545c337 stands for the UUID you just generated; substitute your own value.) Keep this UUID at hand; it is needed again below.

5. Install and configure the controller node

5.1 Prerequisites

Complete the following steps on the controller node to create the database.

Use the database client to connect to the database server as root:

#mysql -u root -p

Create the cinder database:

#CREATE DATABASE cinder;

Grant access to the cinder database, replacing CINDER_DBPASS below with a suitable password:

#GRANT ALL PRIVILEGES ON cinder.* TO 'cinder'@'localhost' IDENTIFIED BY 'CINDER_DBPASS';
#GRANT ALL PRIVILEGES ON cinder.* TO 'cinder'@'%' IDENTIFIED BY 'CINDER_DBPASS';

Exit the database client.

Source the admin credentials to gain access to admin-only commands:

# . admin-openrc
Create the service credentials.

Create a cinder user:

#openstack user create --domain default --password-prompt cinder

Add the admin role to the cinder user:

#openstack role add --project service --user cinder admin

Create the cinder and cinderv2 service entities:

#openstack service create --name cinder --description "OpenStack Block Storage" volume
#openstack service create --name cinderv2 --description "OpenStack Block Storage" volumev2

Create the Block Storage service API endpoints:

#openstack endpoint create --region RegionOne volume public http://controller:8776/v1/%\(tenant_id\)s
#openstack endpoint create --region RegionOne volume internal http://controller:8776/v1/%\(tenant_id\)s
#openstack endpoint create --region RegionOne volume admin http://controller:8776/v1/%\(tenant_id\)s
#openstack endpoint create --region RegionOne volumev2 public http://controller:8776/v2/%\(tenant_id\)s
#openstack endpoint create --region RegionOne volumev2 internal http://controller:8776/v2/%\(tenant_id\)s
#openstack endpoint create --region RegionOne volumev2 admin http://controller:8776/v2/%\(tenant_id\)s

5.2 Install and configure the components

Install the packages:

#yum install openstack-cinder
#yum install openstack-cinder targetcli python-keystone

Edit cinder.conf on the controller node:

#vi /etc/cinder/cinder.conf

Add the following content. Note: 1. If you configure multiple back ends for cinder, the [DEFAULT] section must contain glance_api_version = 2. 2. rbd_secret_uuid in the [ceph] section must be set to the UUID recorded earlier (the value below is only an example).

[DEFAULT]
transport_url = rabbit://openstack:RABBIT_PASS@controller
auth_strategy = keystone
my_ip = <IP of the controller node on the management network>
enabled_backends = ceph
glance_api_servers = http://controller:9292

[database]
connection = mysql+pymysql://cinder:CINDER_PASS@controller/cinder

[keystone_authtoken]
auth_uri = http://controller:5000
auth_url = http://controller:35357
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = cinder
password = CINDER_PASS

[oslo_concurrency]
lock_path = /var/lib/cinder/tmp

[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
rbd_store_chunk_size = 4
rados_connect_timeout = -1
glance_api_version = 2
rbd_user = cinder
rbd_secret_uuid = a852df2b-55e1-4c1b-9fa2-61e77feaf30f

Edit /etc/nova/nova.conf and add the following:

[cinder]
os_region_name = RegionOne

6. Restart the OpenStack services

On the controller node, restart the Compute API service:

#systemctl restart openstack-nova-api.service

Start the Block Storage services and configure them to start when the system boots:

#systemctl enable openstack-cinder-api.service openstack-cinder-scheduler.service
#systemctl start openstack-cinder-api.service openstack-cinder-scheduler.service

Start the Block Storage volume service and its dependencies, and configure them to start when the system boots:

#systemctl enable openstack-cinder-volume.service target.service
#systemctl start openstack-cinder-volume.service target.service

7. Verify

On the controller node, source the admin credentials to gain access to admin-only commands:

# . admin-openrc

List the service components to verify that each process launched successfully:

#cinder service-list

You should also be able to create volumes from the dashboard after logging in.
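Beyond cinder service-list, one quick way to confirm that new volumes really land in the Ceph pool is to create a test volume from the CLI and look for the matching RBD image. The volume name and size below are illustrative, and the rbd command must run on a node that holds a keyring authorized for the volumes pool (for example the Ceph node):

# . admin-openrc
#openstack volume create --size 1 test-ceph-volume
#openstack volume list
#rbd -p volumes ls

If the integration works, the volume reaches the "available" state and an image named volume-<volume UUID> appears in the rbd listing. If the volume ends up in the "error" state instead, check the cinder-volume log on the controller node (typically /var/log/cinder/volume.log) for authentication or pool-permission problems.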
