Study Notes on the NLTK Basics Tutorial (NLTK基础教程), Part 2
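Python basics: The dictionary is another of the most commonly used data structures. Other languages call it an associative array or associative memory. A dictionary is a data structure indexed by keys, and a key can be any immutable type; strings and numbers are the usual choices. Python's dictionary is one implementation of a hash table. It is very easy to work with, and its strength is that a few lines of code can build quite complex data structures. The example below uses a dictionary to count how often each word appears in a text:

    mystring="Monty Python! And the holy Grail !\n"
    word_freq={}
    # split() breaks the string on whitespace, so punctuation stays attached to words
    for tok in mystring.split():
        if tok in word_freq:
            word_freq[tok]+=1
        else:
            word_freq[tok]=1
    print(word_freq)

Result:

    {'holy': 1, 'the': 1, 'Python!': 1, '!': 1, 'Grail': 1, 'And': 1, 'Monty': 1}

Getting started with NLTK: The book first presents a simple crawler example that fetches the text of the Python home page:

    import urllib.request
    response=urllib.request.urlopen('http://python.org/')
    html=response.read()
    print(len(html))

This differs from the book: I am using Python 3.5, where the urllib2 package is no longer available, so urllib.request is used instead. Result:

    48907

Next comes some exploratory data analysis (EDA). For a body of text, EDA can mean many things; here it is limited to one simple case: which terms dominate the document and how often they occur. For the text fetched from the Python home page, the first step is to get rid of the HTML tags. The approach is to pick out the tokens, numbers and letters included, and turn them into a list, eventually with a regular expression. Version 1 (plain whitespace split, no regular expression yet):

    import urllib.request
    response=urllib.request.urlopen('http://python.org/')
    html=response.read()
    #print(len(html))
    # html is still bytes here, so every token prints with a b'...' prefix
    tokens=[tok for tok in html.split()]
    print("Total no of tokens:" +str (len(tokens)))
    print(tokens[0:100])

Result:

Total no of tokens:2932
[b'<!doctype', b'html>', b'<!--[if', b'lt', b'IE', b'7]>', b'<html', b'class="no-js', b'ie6', b'lt-ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'7]>', b'<html', b'class="no-js', b'ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'8]>', b'<html', b'class="no-js', b'ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'gt', b'IE', b'8]><!--><html', b'class="no-js"', b'lang="en"', b'dir="ltr">', b'<!--<![endif]-->', b'<head>', b'<meta', b'charset="utf-8">', b'<meta', b'http-equiv="X-UA-Compatible"', b'content="IE=edge">', b'<link', b'rel="prefetch"', b'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', b'<meta', b'name="application-name"', b'content="Python.org">', b'<meta', b'name="msapplication-tooltip"', b'content="The', b'official', b'home', b'of', b'the', b'Python', b'Programming', b'Language">', b'<meta', b'name="apple-mobile-web-app-title"',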
b'content="Python.org">', b'<meta', b'name="apple-mobile-web-app-capable"', b'content="yes">', b'<meta', b'name="apple-mobile-web-app-status-bar-style"', b'content="black">', b'<meta', b'name="viewport"', b'content="width=device-width,', b'initial-scale=1.0">', b'<meta', b'name="HandheldFriendly"', b'content="True">', b'<meta', b'name="format-detection"', b'content="telephone=no">', b'<meta', b'http-equiv="cleartype"', b'content="on">', b'<meta', b'http-equiv="imagetoolbar"', b'content="false">', b'<script', b'src="/static/js/libs/modernizr.js"></script>', b'<link', b'href="/static/stylesheets/style.css"', b'rel="stylesheet"', b'type="text/css"', b'title="default"', b'/>', b'<link', b'href="/static/stylesheets/mq.css"', b'rel="stylesheet"', b'type="text/css"', b'media="not', b'print,', b'braille,']

Version 2 (split on runs of non-word characters with a regular expression):

    import urllib.request
    import re
    response=urllib.request.urlopen('http://python.org/')
    html=response.read()
    # decode the bytes to str so that a str pattern can be applied
    html=html.decode('utf-8')
    # raw string, so that \W reaches the regex engine unchanged
    tokens=re.split(r'\W+',html)
    print(len(tokens))
    print(tokens[0:100])

Result:

    6221
    ['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'html', 'class', 'no', 'js', 'ie8', 'lt', 'ie9', 'endif', 'if', 'gt', 'IE', '8', 'html', 'class', 'no', 'js', 'lang', 'en', 'dir', 'ltr', 'endif', 'head', 'meta', 'charset', 'utf', '8', 'meta', 'http', 'equiv', 'X', 'UA', 'Compatible', 'content', 'IE', 'edge', 'link', 'rel', 'prefetch', 'href', 'ajax', 'googleapis', 'com', 'ajax', 'libs', 'jquery', '1', '8', '2', 'jquery', 'min', 'js', 'meta', 'name', 'application', 'name', 'content', 'Python', 'org', 'meta', 'name', 'msapplication', 'tooltip', 'content', 'The', 'official']

The markup punctuation is gone, but tag and attribute names such as html, class, and meta still show up as tokens.

Note: in Python 3 the line html=html.decode('utf-8') is required. Without it, re.split fails with: TypeError: cannot use a string pattern on a bytes-like object

Next, the tags are cleaned up as part of the NLTK workflow, here with BeautifulSoup's get_text():

    import nltk
    import urllib.request
    from bs4 import BeautifulSoup
    response=urllib.request.urlopen('http://python.org/')
    html=response.read()
    html=html.decode('utf-8')
    # the 'lxml' parser requires the lxml package to be installed
    soup=BeautifulSoup(html,'lxml')
    # get_text() returns only the text content, with the tags removed
    clean=soup.get_text()
    tokens=[tok for tok in clean.split()]
    print(tokens[:100])

Result:

    ['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network']

The tags are gone, although the contents of the page's script blocks (JSON-LD and Google Analytics code) still come through as text.

Next, NLTK is used to count word frequencies:

    import nltk
    import urllib.request
    from bs4 import BeautifulSoup
    response=urllib.request.urlopen('http://python.org/')
    html=response.read()
    html=html.decode('utf-8')
    soup=BeautifulSoup(html,'lxml')
    clean=soup.get_text()
    tokens=[tok for tok in clean.split()]
    #print(tokens[:100])
    # FreqDist maps each distinct token to the number of times it occurs
    Freq_dist_nltk=nltk.FreqDist(tokens)
    print(Freq_dist_nltk)
    for k,v in Freq_dist_nltk.items():
        print(str(k)+':'+str(v))

Result (printed as one token:count pair per line; shown here run together):

<FreqDist with 614 samples and 1117 outcomes>
up:2 ==:1 now:3 document.getElementsByTagName('script')[0];:1 Best:1 "url"::1 -:1 core:1 [];:1
Statements:1 ga.src:1 s.parentNode.insertBefore(ga,:1 tkInter,:1 international:1 Trac,:1 Legal:1 Beginner’s:1 While:1 'Apple'),:1 here.:2 FOSDEM:2 Brochure:2 2018-01-09:1 Windows:3 programmers:1 Stories:3 Essays:2 _gaq.push(['_setAccount',:1 Interpretation:1 Up:1 Chat:1 discussed:1 Logo:2 window.jQuery:1 Search:1 List:2 comprehensions:1 processing:1 ?:1 b,:1 PyCon:2 'Banana'),:1 <:1 Easy:1 Diversity:3 1.:1 Contributing:1 Top:2 Learn:3 go.:1 User:4 even:1 Notice::1 for?:1 Arts:2 sure:1 programs:1 functions:2 it's:1 control:3 document.location.protocol:1 web2py:1 Government:2 Event:2 -,:1 tools:1 /:4 our:2 Industrial:1 A:2 Smaller:1 Scientific:3 compound:1 knows:1 About:2 Tru64,:1 rendering:1 Security:1 classic:1 Sign:3 have:1 Check:1 Solaris,:1 Copyright:1 quickly,:1 Javascript:2 =:14 Launch:1 Runs:1 987:1 Events:11 an:3 Whet:1 Getting:2 hire:1 Unicode):1 join:1 Mailing:2 Foundation:3 will:1 Roundup:1 Web:1 Hi,:1 learn.:1 Developer's:3 Submit:3 Django,:1 built-in:1 (PEPs)::1 GO:1 Types:1 I'm:2 lists.:1 Girls:1 alpha:1 used:2 Not:1 name):1 math:1 Started:3 Latest:1 Proposals:1 Python!"):1 21:1 to:17 55:1 expected;:1 syntax:2 Documentation:3 Latest::1 other:4 6,:1 }:2 programmers.:1 Engineering:2 limited.:1 Website:1 Talks:2 PSF:4 faster:1 can:3 lists:1 Numeric::1 3.6.4,:1 Implementations:2 377:1 Django:1 n::1 list:2 Data:1 pipeline.:1 —:2 Larger:1 relaunched:1 programming:4 PyGObject,:1 appetite:1 (1,:1 website,:1 Intuitive:1 structure:1 Python.:2 Linux,:1 ©2001-2018.:1 argument:1 IPython:1 Forums:2 What:1 a,:2 Kivy,:1 SciPy,:1 Pyramid,:1 for,:1 output:1 For:1 available.:2 own:1 one:1 **:1 "WebSite",:1 //:1 Meetup:1 You’d:1 Skip:1 (with:1 144:1 arithmetic:1 running:1 Conduct:2 turn:1 motion:1 Archive:4 allows:1 ['Banana',:1 is:',:1 Light:1 users:1 community-run:1 Contact:1 Compaq:1 speak:1 indentation:1 Issue:1 fruits:1 0:1 course.:1 operators:1 Upcoming:1 straightforward::1 re-code):1 Tracker:1 3.6.4:5 The:5 General:1 fruit:1 systems:1 Welcome:1 fib(n)::1 8:2 all:1 
Become:1 versions!:1 release:1 job:1 you:1 '.google-analytics.com/ga.js';:1 daily.:1 languages:1 610:1 document.write('<script:1 manipulated:1 together:1 Quick:1 new:1 Powered:1 product):1 ≡:1 print(a,:1 of:17 Calculations:1 Tim:1 language,:1 ...:7 2018:2 Defined:1 input('What:1 essential:1 frames:1 2018-02-02:2 end=':1 Fibonacci:1 Flask,:1 use?:1 Lists:4 %s.':1 Please:1 functions.:2 'Lime']:1 development:1 Initiatives:1 standard:1 "potentialAction"::1 types:1 s:1 languages):1 [2,:1 enumerate:1 Donate:1 picture:1 Looking:1 "@type"::2 ():1 fib(1000):1 protect,:1 Enhancement:1 in:8 more:2 Jobs:2 +:1 Register:1 Menu:1 Code:2 Hello,:1 thousands:1 experienced:1 document.createElement('script');:1 2018-01-23:1 Practices:1 (and:1 future:1 grouping.:1 by:3 [(0,:1 News:11 0,:1 ['BANANA',:1 for…:1 true;:1 day.:1 library,:1 Flow:1 with:7 pick:1 number:2 Ansible,:1 4,:1 Software:6 (function():1 as:2 ga:1 testing.:1 OpenStack:1 easy:2 5.666666666666667:1 machines:1 last:1 find:1 Mac:2 environment:1 production:2 Source:2 print("Hello,:1 Special:2 3:8 'http://www'):1 '):1 fourth:2 keyword:1 docs.python.org:1 ▲:3 'LIME']:1 ga.type:1 per:1 System:1 3.:1 Fortenberry:1 Bug:1 Success:3 Awards:2 Input,:1 Development::3 Pandas,:1 3.7.0a4:1 print('Hi,:1 statements:1 arrays:1 name?:1 the:19 "http://schema.org",:1 twists,:1 parentheses:1 Platforms:2 "@context"::1 community:1 This:1 usual:1 growth:1 lets:1 are:5 Books:2 Alternative:2 ~800:1 Audio/Visual:2 effectively.:1 way:1 Back:2 89:1 Development:2 'Apple',:1 ILM:1 })();:1 arguments,:2 is::1 Our:1 some:1 Reset:1 "query-input"::1 *:2 Functions:1 four:1 "target"::1 Chelyabinsk:1 Rackspace:1 sliced:1 2:3 place:1 source:1 list(enumerate(fruits)):1 content:2 Privacy:1 def:1 License:2 key,:1 not:1 Simple:2 FAQ:2 very:1 PyPI:1 34:1 interaction:1 n:1 installers:1 Python:60 Conferences:2 float:1 numbers:1 Wiki:2 Guide:6 'Lime')]:1 Facebook:1 Buildbot,:1 ||:2 extensible:1 PyQt,:1 ::1 >>>:24 experience.:1 assignment:1 Community:7 Pythology:1 
numbers::1 #:9 compositing:1 facilitate:1 quickly:1 series:1 IRC:3 advance:1 a+b:1 ga.async:1 arbitrary:1 Administration::1 var:3 2017-12-06:1 _gaq:2 for:11 Salt,:1 or:2 Core:1 PEP:2 Magic:1 Education:2 board:1 and:22 Legon:1 your:4 tens:1 language:2 mission:1 support:1 was:1 >_:1 Python,:1 while:2 Site:1 Applications:2 name?\n'):1 17:2 about:3 2017-12-19:2 that:5 version:1 Thousands:1 available:3 returns:1 jobs.python.org:1 position:1 'APPLE',:1 Interactive:1 _gaq.push(['_trackPageview']);:1 Groups:2 modeling,:1 promote,:1 X,:1 233:1 Merchandise:2 understands.:1 Non-English:2 Network:1 batch:1 3.5.5rc1:1 optional:1 Mentorship:1 product:5 floor:1 Expect:1 on:4 8]:1 Interest:2 %:1 which:1 'text/javascript';:1 print(loud_fruits):1 Socialize:1 0.5:1 Member:1 {:3 capable:1 One-Day:1 candidate:1 Whether:1 full:1 ('https:':1 Group:4 Close:1 "https://www.python.org/search/?q={search_term_string}",:1 In:2 developer,:1 fruits]:1 function:1 data:1 loud_fruits:1 Experienced:1 Downloads:2 Status:1 you're:2 expression:1 flow:2 python-dev:1 online.:1 indexed,:1 3::4 Google+:1 (known:1 is:16 src="/static/js/libs/jquery-1.8.2.min.js"><\/script>'):1 Python's:1 Use:1 its:1 3.4.8rc1:1 trying:1 clean:1 use:1 13:1 2018-02-03:3 Policy:1 3.6.4rc1:1 download:1 "SearchAction",:1 Beginner's:2 (2,:1 Conference::1 work:3 +,:1 planned:1 code:4 s);:1 range:1 this:2 overview.:1 loop:1 RSS:1 be:3 All:3 Bottle,:1 Business:2 tutorials:1 Twitter:1 runs:1 beginners:1 defining:2 GUI:1 Speed:1 OS:3 provide:1 learn:1 division:2 IRIX,:1 'https://ssl':1 [fruit.upper():1 if,:1 ▼:1 Start:1 diverse:1 Python.org:1 Shell:1 1:5 first:1 name:1 X:2 Download:1 releases:3 384:1 More:9 wxPython:1 Compound:1 Other:2 any:1 b:2 pipeline:1 along:1 related:1 integrate:1 PySide,:1 print('The:1 5:2 print():1 "required:1 Docs:6 &:3 Tornado,:1 Python!:1 name=search_term_string":1 "https://www.python.org/",:1 'UA-39055973-1']);:1 Get:1 simple:2 Help:3 guides,:1 Quotes:2 a:10 Index:2 mandatory:1 2.7.14:1

Plot:
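As a rough stand-in for a frequency chart, the counting step can be sketched with only the standard library. This is a minimal sketch under stated assumptions, not the method used in the notes: collections.Counter takes the place of both the hand-written dictionary loop and nltk.FreqDist, and a text bar chart takes the place of a real plot. The helper names tokenize and text_bar_chart and the sample sentence are hypothetical, introduced only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split on runs of non-word characters and drop empty strings."""
    return [tok for tok in re.split(r"\W+", text) if tok]


def text_bar_chart(counts, top_n=5, width=20):
    """Return one line per token: the token, a bar scaled to the top count, and the count."""
    most_common = counts.most_common(top_n)
    if not most_common:
        return []
    max_count = most_common[0][1]
    lines = []
    for token, count in most_common:
        # Scale each bar relative to the most frequent token; never draw an empty bar.
        bar = "#" * max(1, round(width * count / max_count))
        lines.append("{:<10} {} {}".format(token, bar, count))
    return lines


# Hypothetical sample text (not the python.org page fetched in the notes).
sample = "Monty Python! And the holy Grail! The Python and the Grail."

# Lower-casing merges "The" and "the" into a single key before counting.
counts = Counter(tok.lower() for tok in tokenize(sample))
chart = text_bar_chart(counts, top_n=3)
print("\n".join(chart))
```

To apply this to the page text, the tokens list produced from soup.get_text() could be passed to Counter in place of the sample tokens. Unlike the whitespace split used in the notes, this tokenizer drops punctuation, so the counts would differ from the FreqDist output shown.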