Search results for [data masking] - 低调大师 Excellent Personal Blog

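Importing data from SQL Server 2008 into Hadoop with Sqoop

Today I finally got hands-on with importing data into Hadoop. The process was fairly bumpy and did not quite match the official documentation. OK, let's go!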

2016-09-08

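Accessing the database in OpenStack through the SQLAlchemy ORM

Contents: Demo; SQLAlchemy; database initialization; implementing database operations; database operation requests (query all, query one, create, update, delete). Demo: Github/JmilkFan/my-code-repertory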

2016-09-08

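Swift Syntax Series, Part 2: Basic Data Types

An example: var boolVale:Bool=true. 5. Tuples. Tuples are a very important feature of the Swift language. They let developers combine any number of values of different types into a single data type, which is one of Swift's strengths.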

2016-09-01

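Behind the "315" Wi-Fi hack: how to keep app data secure

How attacks over Wi-Fi work: attackers targeting Wi-Fi generally use a man-in-the-middle attack. In such an attack, the path that a normal app's traffic takes over Wi-Fi is hijacked by the attacker, who sets up separate connections with each end of the communication and relays the data received from each side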

2016-08-23

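MooseFS storage capacity expansion and metadata recovery

This article covers expanding MooseFS storage capacity and recovering metadata. For MooseFS installation and configuration, see http://hnr520.blog.51cto.com/4484939/1837619. 1. The existing cluster, with one master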

2016-08-13

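Rolling back and deleting data with shell scripts on Linux

Recently the company has been developing some OA systems, and my main role is to help set up and maintain the test environment. Because of the nature of that environment, I need to back up and restore data almost every day, so I put some shell scripts to use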

2016-08-08

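Pre-checks and fixes for data migration

Data Transmission is an Alibaba Cloud service for transferring data between structured storage products built around databases. It provides several data transfer features, including data migration, data subscription, and real-time synchronization.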

2016-07-23

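How can you tap into Alibaba's big data capabilities?

The data development platform offers online queries, ETL processing, scheduled jobs, data transfer, and other features to meet the production needs of day-to-day business data. At the lowest layer is Alibaba Cloud's powerful data computing engine.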

2016-06-19

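Massive-scale data mining: China Mobile's traffic operations system

Of course, ordinary people cannot get hold of carrier data, since it is all confidential. What I want to discuss is how carriers obtain user data in the first place!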

2016-06-15

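Big data in practice: a user traffic analysis system

The source data collection files and the system source code are provided at the end of the article. This case study is well suited to Hadoop beginners and to anyone who wants to get started in big data, cloud computing, data analysis, and related fields.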

2016-06-10

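HDFS fundamentals and hands-on data reads and writes

3. The current block on the data nodes that have already been written is given a new identifier by the metadata node (NameNode), so when the failed node restarts it can detect that its copy of the block is stale, and that copy is deleted.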

2016-06-09

Official site introduction

Multi-purpose Notebook

The Notebook is the place for all your needs:

- Data Ingestion
- Data Discovery
- Data Analytics
- Data Visualization & Collaboration

Multiple language backend

The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently Zeppelin supports many interpreters, such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell. Adding a new language backend is really simple. Learn how to write a Zeppelin interpreter.

Apache Spark integration

Zeppelin provides built-in Apache Spark integration. You don't need to build a separate module, plugin, or library for it. Zeppelin's Spark integration provides:

- Automatic SparkContext and SQLContext injection
- Runtime jar dependency loading from the local filesystem or a Maven repository. Learn more about dependency loader.
- Canceling a job and displaying its progress

Data visualization

Some basic charts are already included in Zeppelin. Visualizations are not limited to SparkSQL queries; any output from any language backend can be recognized and visualized.

Pivot chart

With simple drag and drop, Zeppelin aggregates the values and displays them in a pivot chart. You can easily create a chart with multiple aggregated values, including sum, count, average, min, and max. Learn more about Zeppelin's Display system (text, html, table, angular).

Dynamic forms

Zeppelin can dynamically create input forms in your notebook. Learn more about Dynamic Forms.

Collaboration

A notebook URL can be shared among collaborators. Zeppelin can then broadcast any changes in real time, just like collaboration in Google Docs.

Publish

Zeppelin provides a URL that displays the result only; that page does not include Zeppelin's menu and buttons. This way, you can easily embed it as an iframe inside your website.

100% Opensource

Apache Zeppelin (incubating) is Apache2 Licensed software. Please check out the source repository and How to contribute. Zeppelin has a very active development community. Join the Mailing list and report issues on our Issue tracker.

Undergoing Incubation

Apache Zeppelin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Installation

From binary package: download the latest binary package from Download.

Build from source: check the instructions in README to build from source.

Configure

Configuration can be done with both environment variables (conf/zeppelin-env.sh) and Java properties (conf/zeppelin-site.xml). If both are defined, the environment variable is used.

zeppelin-env.sh | zeppelin-site.xml | Default value | Description
ZEPPELIN_PORT | zeppelin.server.port | 8080 | Zeppelin server port.
ZEPPELIN_MEM | N/A | -Xmx1024m -XX:MaxPermSize=512m | JVM mem options
ZEPPELIN_INTP_MEM | N/A | ZEPPELIN_MEM | JVM mem options for interpreter process
ZEPPELIN_JAVA_OPTS | N/A | | JVM Options
ZEPPELIN_ALLOWED_ORIGINS | zeppelin.server.allowed.origins | * | Allows a way to specify a ',' separated list of allowed origins for rest and websockets, e.g. http://localhost:8080
ZEPPELIN_SERVER_CONTEXT_PATH | zeppelin.server.context.path | / | Context Path of the Web Application
ZEPPELIN_SSL | zeppelin.ssl | false |
ZEPPELIN_SSL_CLIENT_AUTH | zeppelin.ssl.client.auth | false |
ZEPPELIN_SSL_KEYSTORE_PATH | zeppelin.ssl.keystore.path | keystore |
ZEPPELIN_SSL_KEYSTORE_TYPE | zeppelin.ssl.keystore.type | JKS |
ZEPPELIN_SSL_KEYSTORE_PASSWORD | zeppelin.ssl.keystore.password | |
ZEPPELIN_SSL_KEY_MANAGER_PASSWORD | zeppelin.ssl.key.manager.password | |
ZEPPELIN_SSL_TRUSTSTORE_PATH | zeppelin.ssl.truststore.path | |
ZEPPELIN_SSL_TRUSTSTORE_TYPE | zeppelin.ssl.truststore.type | |
ZEPPELIN_SSL_TRUSTSTORE_PASSWORD | zeppelin.ssl.truststore.password | |
ZEPPELIN_NOTEBOOK_HOMESCREEN | zeppelin.notebook.homescreen | | Id of notebook to be displayed in homescreen, e.g. 2A94M5J1Z
ZEPPELIN_NOTEBOOK_HOMESCREEN_HIDE | zeppelin.notebook.homescreen.hide | false | Hide homescreen notebook from list when this value is set to "true"
ZEPPELIN_WAR_TEMPDIR | zeppelin.war.tempdir | webapps | The location of jetty temporary directory.
ZEPPELIN_NOTEBOOK_DIR | zeppelin.notebook.dir | notebook | Where notebook file is saved
ZEPPELIN_NOTEBOOK_S3_BUCKET | zeppelin.notebook.s3.bucket | zeppelin | Bucket where notebook is saved
ZEPPELIN_NOTEBOOK_S3_USER | zeppelin.notebook.s3.user | user | User in bucket where notebook is saved. For example bucket/user/notebook/2A94M5J1Z/note.json
ZEPPELIN_NOTEBOOK_STORAGE | zeppelin.notebook.storage | org.apache.zeppelin.notebook.repo.VFSNotebookRepo | Comma separated list of notebook storage
ZEPPELIN_INTERPRETERS | zeppelin.interpreters | org.apache.zeppelin.spark.SparkInterpreter, org.apache.zeppelin.spark.PySparkInterpreter, org.apache.zeppelin.spark.SparkSqlInterpreter, org.apache.zeppelin.spark.DepInterpreter, org.apache.zeppelin.markdown.Markdown, org.apache.zeppelin.shell.ShellInterpreter, org.apache.zeppelin.hive.HiveInterpreter ... | Comma separated interpreter configurations [Class]. The first interpreter becomes the default
ZEPPELIN_INTERPRETER_DIR | zeppelin.interpreter.dir | interpreter | Zeppelin interpreter directory

You'll also need to configure individual interpreters. Information can be found in the 'Interpreter' section of this documentation. For example, Spark.

Start/Stop

Start Zeppelin:

    bin/zeppelin-daemon.sh start

After a successful start, visit http://localhost:8080 with your web browser.

Stop Zeppelin:

    bin/zeppelin-daemon.sh stop

Hands-on example: Zeppelin Tutorial

We will assume you have Zeppelin installed already. If that's not the case, see Install. Zeppelin's current main backend processing engine is Apache Spark. If you're new to the system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.

Tutorial with Local File

Data Refine

Before you start the Zeppelin tutorial, you will need to download bank.zip. First, to transform the data from csv format into an RDD of Bank objects, run the following script. This will also remove the header using the filter function.

    val bankText = sc.textFile("yourPath/bank/bank-full.csv")

    case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

    // split each line, filter out header (starts with "age"), and map it into Bank case class
    val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
        s=>Bank(s(0).toInt,
                s(1).replaceAll("\"", ""),
                s(2).replaceAll("\"", ""),
                s(3).replaceAll("\"", ""),
                s(5).replaceAll("\"", "").toInt
            )
    )

    // convert to DataFrame and create temporal table
    bank.toDF().registerTempTable("bank")

Data Retrieval

Suppose we want to see the age distribution from bank. To do this, run:

    %sql select age, count(1) from bank where age < 30 group by age order by age

You can make an input box for setting the age condition by replacing 30 with ${maxAge=30}.

    %sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now we want to see the age distribution for a certain marital status and add a combo box to select the marital status. Run:

    %sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age

Tutorial with Streaming Data

Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get API keys, fill in the credential values (apiKey, apiSecret, accessToken, accessTokenSecret) in the following script. This will create an RDD of Tweet objects and register the stream data as a table:

    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source
    import scala.collection.mutable.HashMap
    import java.io.File
    import org.apache.log4j.Logger
    import org.apache.log4j.Level
    import sys.process.stringSeqToProcess

    /** Configures the Oauth Credentials for accessing Twitter */
    def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
      val configs = new HashMap[String, String] ++= Seq(
        "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
      println("Configuring Twitter OAuth")
      configs.foreach{ case(key, value) =>
        if (value.trim.isEmpty) {
          throw new Exception("Error setting authentication - value for " + key + " not set")
        }
        val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
        System.setProperty(fullKey, value.trim)
        println("\tProperty " + fullKey + " set as [" + value.trim + "]")
      }
      println()
    }

    // Configure Twitter credentials
    val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
    val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

    import org.apache.spark.streaming.twitter._
    val ssc = new StreamingContext(sc, Seconds(2))
    val tweets = TwitterUtils.createStream(ssc, None)
    val twt = tweets.window(Seconds(60))

    case class Tweet(createdAt:Long, text:String)
    twt.map(status=>
      Tweet(status.getCreatedAt().getTime()/1000, status.getText())
    ).foreachRDD(rdd=>
      // Below line works only in spark 1.3.0.
      // For spark 1.1.x and spark 1.2.x,
      // use rdd.registerTempTable("tweets") instead.
      rdd.toDF().registerAsTable("tweets")
    )

    twt.print

    ssc.start()

Data Retrieval

For each of the following scripts, every time you click the run button you will see a different result, since it is based on real-time data.

Let's begin by extracting at most 10 tweets that contain the word "girl".

    %sql select * from tweets where text like '%girl%' limit 10

This time, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:

    %sql select createdAt, count(1) from tweets group by createdAt order by createdAt

You can write a user-defined function and use it in Spark SQL. Let's try it by making a function named sentiment. This function returns one of three attitudes (positive, negative, neutral) towards its parameter.

    def sentiment(s:String) : String = {
        val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
        val negative = Array("hate", "bad", "stupid", "is")

        var st = 0;

        val words = s.split(" ")
        // add one point for every word that matches a positive keyword
        positive.foreach(p =>
            words.foreach(w =>
                if(p==w) st = st+1
            )
        )

        // subtract one point for every word that matches a negative keyword
        negative.foreach(p=>
            words.foreach(w=>
                if(p==w) st = st-1
            )
        )
        if(st>0)
            "positive"
        else if(st<0)
            "negative"
        else
            "neutral"
    }

    // Below line works only in spark 1.3.0.
    // For spark 1.1.x and spark 1.2.x,
    // use sqlc.registerFunction("sentiment", sentiment _) instead.
    sqlc.udf.register("sentiment", sentiment _)

To check how people think about girls using the sentiment function we've made above, run this:

    %sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
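
The scoring rule behind the tutorial's sentiment function can be tried outside Spark. The following is a minimal plain-Python sketch of that rule only, not Zeppelin or Spark code: the word lists are copied from the tutorial, while the names POSITIVE, NEGATIVE, and the sample sentences are assumptions made for illustration. It does not register a UDF or touch the tweets table.

```python
# Plain-Python sketch of the keyword-counting rule used by the tutorial's
# sentiment function. Not Spark code; names and samples are illustrative.

# Keyword lists copied from the tutorial.
POSITIVE = ("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
NEGATIVE = ("hate", "bad", "stupid", "is")


def sentiment(text: str) -> str:
    """Return "positive", "negative", or "neutral" for a piece of text."""
    # Split on single spaces, as the tutorial does; matching is exact and
    # case-sensitive, so "Love" or "love!" would not count.
    words = text.split(" ")
    # +1 for each word found in the positive list, -1 for each in the negative list.
    score = sum(1 for w in words if w in POSITIVE)
    score -= sum(1 for w in words if w in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sample in ("I love Spark", "I hate bad traffic", "hello world", "love is"):
        print(sample, "->", sentiment(sample))
```

Because common words such as "the" and "is" are in the lists, ordinary sentences shift the score in ways unrelated to mood, so the rule is best read as a demonstration of user-defined functions rather than as real sentiment analysis.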

2016-05-25

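Call to join the open-source big data technology community

Introduction: the Hadoop ecosystem has become the de facto standard for big data. To give students and friends a place to exchange ideas and learn, and to build up a body of material on big data technology, we are launching this follow campaign.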

2016-05-22

地铁译: Spark for Python Developers: Juggling Data with Spark

The focus is on Twitter data about Apache Spark, which is being prepared for machine learning and stream processing applications.

2016-05-20

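Processing data with SQL using Spark DataFrame

Introduction: DataFrame gives Spark the ability to process large-scale structured data. According to the article, it is easier to use than the original RDD transformations and also computes twice as fast.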

2016-05-12

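Error reading data from a Hadoop URL: unknown protocol: hdfs

While learning how to read data from Hadoop today, I wrote a simple method, but testing it produced the following error. Below is the code that reads a file from Hadoop and writes it to the local disk: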

2016-05-08

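Spark: data analysis and visualization with Zeppelin

Official site introduction: Apache Zeppelin provides a web-based notebook, similar to the IPython notebook, for data analysis and visualization.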

2016-04-25

Network error when mounting a data volume with docker -v

Removing a container and then running it again fails with the following error: docker: Error response from daemon: failed to create endpoint tomcat001 on network bridge: COMMAND_FAILED: '/sbin/iptables -w2 -t nat -A DOCKER -p tcp -d 0/0 --dport 8090 -j DNAT --to-destination 172.17.0.3:8080 ! -i docker0' failed: iptables: No chain/target/match by that name.. Restarting Docker fixes it: systemctl restart docker

2016-04-25

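Strategies for optimizing retrieval performance in HBase

HBase is an open-source, distributed, column-oriented database used mainly for storing unstructured data. Its design is derived from Google's proprietary database "BigTable".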

2016-04-23

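How E-MapReduce processes RDS data

1. Introduction. Some of a website's business data currently lives in a database, and this data often needs further analysis, for example correlating it with log data or applying techniques such as machine learning.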

2016-04-07
