Spark-zeppelin: Big Data Visualization and Analysis

Introduction from the official website

Multi-purpose Notebook

The Notebook is the place for all your needs:

- Data Ingestion
- Data Discovery
- Data Analytics
- Data Visualization & Collaboration

Multiple language backend

The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently Zeppelin supports many interpreters, such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell. Adding a new language backend is really simple. Learn how to write a Zeppelin interpreter.

Apache Spark integration

Zeppelin provides built-in Apache Spark integration. You don't need to build a separate module, plugin or library for it. Zeppelin's Spark integration provides:

- Automatic SparkContext and SQLContext injection
- Runtime jar dependency loading from the local filesystem or a Maven repository. Learn more about the dependency loader.
- Canceling a job and displaying its progress

Data visualization

Some basic charts are already included in Zeppelin. Visualizations are not limited to SparkSQL queries; any output from any language backend can be recognized and visualized.

Pivot chart

With simple drag and drop, Zeppelin aggregates the values and displays them in a pivot chart. You can easily create a chart with multiple aggregated values, including sum, count, average, min and max. Learn more about Zeppelin's Display system (text, html, table, angular).

Dynamic forms

Zeppelin can dynamically create input forms in your notebook. Learn more about Dynamic Forms.

Collaboration

The notebook URL can be shared among collaborators. Zeppelin then broadcasts any changes in real time, just like collaboration in Google Docs.

Publish

Zeppelin provides a URL that displays the result only; that page does not include Zeppelin's menu and buttons. This way, you can easily embed it as an iframe inside your website.

100% Opensource

Apache Zeppelin (incubating) is Apache2 Licensed software.
Please check out the source repository and How to contribute. Zeppelin has a very active development community. Join the Mailing list and report issues on our Issue tracker.

Undergoing Incubation

Apache Zeppelin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Installation

From binary package

Download the latest binary package from Download.

Build from source

Check the instructions in the README to build from source.

Configure

Configuration can be done through both environment variables (conf/zeppelin-env.sh) and Java properties (conf/zeppelin-site.xml). If both are defined, the environment variable is used.

| zeppelin-env.sh | zeppelin-site.xml | Default value | Description |
|---|---|---|---|
| ZEPPELIN_PORT | zeppelin.server.port | 8080 | Zeppelin server port. |
| ZEPPELIN_MEM | N/A | -Xmx1024m -XX:MaxPermSize=512m | JVM mem options |
| ZEPPELIN_INTP_MEM | N/A | ZEPPELIN_MEM | JVM mem options for interpreter process |
| ZEPPELIN_JAVA_OPTS | N/A | | JVM options |
| ZEPPELIN_ALLOWED_ORIGINS | zeppelin.server.allowed.origins | * | A ','-separated list of allowed origins for REST and websockets, e.g. http://localhost:8080 |
| ZEPPELIN_SERVER_CONTEXT_PATH | zeppelin.server.context.path | / | Context path of the web application |
| ZEPPELIN_SSL | zeppelin.ssl | false | |
| ZEPPELIN_SSL_CLIENT_AUTH | zeppelin.ssl.client.auth | false | |
| ZEPPELIN_SSL_KEYSTORE_PATH | zeppelin.ssl.keystore.path | keystore | |
| ZEPPELIN_SSL_KEYSTORE_TYPE | zeppelin.ssl.keystore.type | JKS | |
| ZEPPELIN_SSL_KEYSTORE_PASSWORD | zeppelin.ssl.keystore.password | | |
| ZEPPELIN_SSL_KEY_MANAGER_PASSWORD | zeppelin.ssl.key.manager.password | | |
| ZEPPELIN_SSL_TRUSTSTORE_PATH | zeppelin.ssl.truststore.path | | |
| ZEPPELIN_SSL_TRUSTSTORE_TYPE | zeppelin.ssl.truststore.type | | |
| ZEPPELIN_SSL_TRUSTSTORE_PASSWORD | zeppelin.ssl.truststore.password | | |
| ZEPPELIN_NOTEBOOK_HOMESCREEN | zeppelin.notebook.homescreen | | Id of the notebook to be displayed on the homescreen, e.g. 2A94M5J1Z |
| ZEPPELIN_NOTEBOOK_HOMESCREEN_HIDE | zeppelin.notebook.homescreen.hide | false | Hides the homescreen notebook from the list when set to "true" |
| ZEPPELIN_WAR_TEMPDIR | zeppelin.war.tempdir | webapps | Location of the Jetty temporary directory |
| ZEPPELIN_NOTEBOOK_DIR | zeppelin.notebook.dir | notebook | Where notebook files are saved |
| ZEPPELIN_NOTEBOOK_S3_BUCKET | zeppelin.notebook.s3.bucket | zeppelin | Bucket where notebooks are saved |
| ZEPPELIN_NOTEBOOK_S3_USER | zeppelin.notebook.s3.user | user | User in the bucket where notebooks are saved, for example bucket/user/notebook/2A94M5J1Z/note.json |
| ZEPPELIN_NOTEBOOK_STORAGE | zeppelin.notebook.storage | org.apache.zeppelin.notebook.repo.VFSNotebookRepo | Comma-separated list of notebook storage classes |
| ZEPPELIN_INTERPRETERS | zeppelin.interpreters | org.apache.zeppelin.spark.SparkInterpreter, org.apache.zeppelin.spark.PySparkInterpreter, org.apache.zeppelin.spark.SparkSqlInterpreter, org.apache.zeppelin.spark.DepInterpreter, org.apache.zeppelin.markdown.Markdown, org.apache.zeppelin.shell.ShellInterpreter, org.apache.zeppelin.hive.HiveInterpreter ... | Comma-separated interpreter configurations [Class]. The first interpreter becomes the default. |
| ZEPPELIN_INTERPRETER_DIR | zeppelin.interpreter.dir | interpreter | Zeppelin interpreter directory |

You'll also need to configure the individual interpreters. Information can be found in the 'Interpreter' section of the Zeppelin documentation, for example the page on Spark.

Start/Stop

Start Zeppelin:

    bin/zeppelin-daemon.sh start

After a successful start, visit http://localhost:8080 with your web browser.

Stop Zeppelin:

    bin/zeppelin-daemon.sh stop

Hands-on example: Zeppelin Tutorial

We will assume you have Zeppelin installed already. If that's not the case, see Install.

Zeppelin's current main backend processing engine is Apache Spark. If you're new to the system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.

Tutorial with Local File

Data Refine

Before you start the Zeppelin tutorial, you will need to download bank.zip.

First, to transform the data from CSV format into an RDD of Bank objects, run the following script. It also removes the header line using the filter function.

    val bankText = sc.textFile("yourPath/bank/bank-full.csv")

    case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

    // split each line, filter out the header (starts with "age"), and map it into the Bank case class
    val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
      s => Bank(s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt
      )
    )

    // convert to DataFrame and register it as a temporary table
    bank.toDF().registerTempTable("bank")

Data Retrieval

Suppose we want to see the age distribution in bank. To do this, run:

    %sql select age, count(1) from bank where age < 30 group by age order by age

You can create an input box for setting the age condition by replacing 30 with ${maxAge=30}.

    %sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now we want to see the age distribution for a certain marital status and add a combo box to select the marital status.
Run:

    %sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age

Tutorial with Streaming Data

Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get the API keys, fill in the credential-related values (apiKey, apiSecret, accessToken, accessTokenSecret) in the following script.

This will create an RDD of Tweet objects and register the stream data as a table:

    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source
    import scala.collection.mutable.HashMap
    import java.io.File
    import org.apache.log4j.Logger
    import org.apache.log4j.Level
    import sys.process.stringSeqToProcess

    /** Configures the OAuth credentials for accessing Twitter */
    def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String): Unit = {
      val configs = new HashMap[String, String] ++= Seq(
        "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
      println("Configuring Twitter OAuth")
      configs.foreach { case (key, value) =>
        if (value.trim.isEmpty) {
          throw new Exception("Error setting authentication - value for " + key + " not set")
        }
        // twitter4j expects "consumerKey"/"consumerSecret" rather than "apiKey"/"apiSecret"
        val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
        System.setProperty(fullKey, value.trim)
        println("\tProperty " + fullKey + " set as [" + value.trim + "]")
      }
      println()
    }

    // Configure Twitter credentials
    val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
    val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

    import org.apache.spark.streaming.twitter._
    val ssc = new StreamingContext(sc, Seconds(2))
    val tweets = TwitterUtils.createStream(ssc, None)
    val twt = tweets.window(Seconds(60))

    case class Tweet(createdAt: Long, text: String)
    twt.map(status =>
      Tweet(status.getCreatedAt().getTime() / 1000, status.getText())
    ).foreachRDD(rdd =>
      // Below line works only in spark 1.3.0.
      // For spark 1.1.x and spark 1.2.x,
      // use rdd.registerTempTable("tweets") instead.
      rdd.toDF().registerAsTable("tweets")
    )

    twt.print

    ssc.start()

Data Retrieval

For each of the following scripts, you will see a different result every time you click the run button, since the data is real-time.

Let's begin by extracting at most 10 tweets which contain the word "girl".

    %sql select * from tweets where text like '%girl%' limit 10

This time, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:

    %sql select createdAt, count(1) from tweets group by createdAt order by createdAt

You can define a user-defined function and use it in Spark SQL. Let's try it by making a function named sentiment. This function returns one of three attitudes (positive, negative, neutral) towards its parameter.
    def sentiment(s: String): String = {
      val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
      val negative = Array("hate", "bad", "stupid", "is")

      // +1 for every positive word, -1 for every negative word
      var st = 0
      val words = s.split(" ")
      positive.foreach(p => words.foreach(w => if (p == w) st = st + 1))
      negative.foreach(p => words.foreach(w => if (p == w) st = st - 1))

      if (st > 0) "positive"
      else if (st < 0) "negative"
      else "neutral"
    }

    // Below line works only in spark 1.3.0.
    // For spark 1.1.x and spark 1.2.x,
    // use sqlc.registerFunction("sentiment", sentiment _) instead.
    sqlc.udf.register("sentiment", sentiment _)

To check how people think about girls using the sentiment function we've made above, run this:

    %sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
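
The sentiment function does not depend on Spark, so its scoring can be sanity-checked on a few strings before it is registered as a UDF. The following is a minimal sketch in plain Scala and is not part of the original tutorial: the object name SentimentCheck and the sample sentences are made up for illustration, the word lists are copied from the function above, and the nested loops are replaced by an equivalent count over sets.

```scala
object SentimentCheck {
  // Word lists copied from the tutorial's sentiment function.
  val positive = Set("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
  val negative = Set("hate", "bad", "stupid", "is")

  // Each positive word adds 1 to the score, each negative word subtracts 1.
  def sentiment(s: String): String = {
    val words = s.split(" ")
    val score = words.count(w => positive.contains(w)) - words.count(w => negative.contains(w))
    if (score > 0) "positive"
    else if (score < 0) "negative"
    else "neutral"
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample inputs, chosen only to exercise each branch.
    val samples = Seq("I love this girl", "bad day", "hello world", "this is good")
    samples.foreach(s => println(s + " -> " + sentiment(s)))
  }
}
```

Matching is case-sensitive and splits on single spaces only, so tokens such as "Love" or "good!" are not counted. Because "is" is in the negative list, a sentence like "this is good" scores zero and comes out as neutral.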
