
[雪峰磁针石博客] Python Scraping Cookbook 1: Getting Started with Web Scraping

Date: 2018-09-09

Chapter 1: Getting Started with Web Scraping

  • Scraping python.org with Requests and Beautiful Soup
  • Scraping python.org with urllib3 and Beautiful Soup
  • Scraping python.org with Scrapy
  • Scraping python.org with Selenium and PhantomJS

First confirm that you can open: https://www.python.org/events/python-events/
Install requests and bs4, and then we can start recipe 1: scraping python.org with Requests and Beautiful Soup.

 # pip3 install requests bs4 

Scraping python.org with Requests and Beautiful Soup

01_events_with_requests.py

 import requests
 from bs4 import BeautifulSoup

 def get_upcoming_events(url):
     req = requests.get(url)
     soup = BeautifulSoup(req.text, 'lxml')
     events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
     for event in events:
         event_details = dict()
         event_details['name'] = event.find('h3').find("a").text
         # attrs must be a dict ({'class': ...}), not a set
         event_details['location'] = event.find('span', {'class': 'event-location'}).text
         event_details['time'] = event.find('time').text
         print(event_details)

 get_upcoming_events('https://www.python.org/events/python-events/')

Output:

 $ python3 01_events_with_requests.py
 {'name': 'PyCon US 2018', 'location': 'Cleveland, Ohio, USA', 'time': '09 May – 18 May 2018'}
 {'name': 'DjangoCon Europe 2018', 'location': 'Heidelberg, Germany', 'time': '23 May – 28 May 2018'}
 {'name': 'PyCon APAC 2018', 'location': 'NUS School of Computing / COM1, 13 Computing Drive, Singapore 117417, Singapore', 'time': '31 May – 03 June 2018'}
 {'name': 'PyCon CZ 2018', 'location': 'Prague, Czech Republic', 'time': '01 June – 04 June 2018'}
 {'name': 'PyConTW 2018', 'location': 'Taipei, Taiwan', 'time': '01 June – 03 June 2018'}
 {'name': 'PyLondinium', 'location': 'London, UK', 'time': '08 June – 11 June 2018'}

Note: since the listed events change over time, the output will differ from run to run.
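The recipe above depends on two third-party packages. If you want to see the same extraction logic without installing anything, the standard library's html.parser can do it over a saved copy of the page. The sketch below runs on a simplified, hypothetical excerpt of the events markup (not the live page), purely to illustrate the idea:

```python
from html.parser import HTMLParser

# Hypothetical, simplified excerpt of the python.org events list.
SNIPPET = """
<ul class="list-recent-events">
  <li><h3 class="event-title"><a href="#">PyCon US 2018</a></h3>
      <p><span class="event-location">Cleveland, Ohio, USA</span>
         <time>09 May - 18 May 2018</time></p></li>
</ul>
"""

class EventParser(HTMLParser):
    """Collect text from <a>, <span class="event-location"> and <time> tags."""
    def __init__(self):
        super().__init__()
        self.events = []
        self._field = None  # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li':
            self.events.append({})
        elif tag == 'a':
            self._field = 'name'
        elif tag == 'span' and attrs.get('class') == 'event-location':
            self._field = 'location'
        elif tag == 'time':
            self._field = 'time'

    def handle_data(self, data):
        if self._field and self.events:
            self.events[-1][self._field] = data.strip()
            self._field = None

parser = EventParser()
parser.feed(SNIPPET)
print(parser.events)
```

This is noticeably more work than the Beautiful Soup one-liners, which is exactly why the recipes use bs4.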

Exercise: use requests to scrape the blog titles (10 in total) from the home page of https://china-testing.github.io/.

Sample answer:

01_blog_title.py

 import requests
 from bs4 import BeautifulSoup

 def get_upcoming_events(url):
     req = requests.get(url)
     soup = BeautifulSoup(req.text, 'lxml')
     events = soup.findAll('article')
     for event in events:
         event_details = {}
         event_details['name'] = event.find('h1').find("a").text
         print(event_details)

 get_upcoming_events('https://china-testing.github.io/')

Output:

 $ python3 01_blog_title.py
 {'name': '10分钟学会API测试'}
 {'name': 'python数据分析快速入门教程4-数据汇聚'}
 {'name': 'python数据分析快速入门教程6-重整'}
 {'name': 'python数据分析快速入门教程5-处理缺失数据'}
 {'name': 'python库介绍-pytesseract: OCR光学字符识别'}
 {'name': '软件自动化测试初学者忠告'}
 {'name': '使用opencv转换3d图片'}
 {'name': 'python opencv3实例(对象识别和增强现实)2-边缘检测和应用图像过滤器'}
 {'name': 'numpy学习指南3rd3:常用函数'}
 {'name': 'numpy学习指南3rd2:NumPy基础'}

Scraping python.org with urllib3 and Beautiful Soup

Code: 02_events_with_urlib3.py

 import urllib3
 from bs4 import BeautifulSoup

 def get_upcoming_events(url):
     req = urllib3.PoolManager()
     res = req.request('GET', url)
     soup = BeautifulSoup(res.data, 'html.parser')
     events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
     for event in events:
         event_details = dict()
         event_details['name'] = event.find('h3').find("a").text
         # attrs must be a dict ({'class': ...}), not a set
         event_details['location'] = event.find('span', {'class': 'event-location'}).text
         event_details['time'] = event.find('time').text
         print(event_details)

 get_upcoming_events('https://www.python.org/events/python-events/')

requests is a higher-level wrapper around urllib3; in everyday use you would normally use requests directly.

Scraping python.org with Scrapy

Scrapy is a very popular open source Python scraping framework for extracting data. It ships with many built-in modules and extensions, and it is also our tool of choice when it comes to mining data with Python.
Scrapy has a number of powerful features worth mentioning:

  • Built-in extensions to make HTTP requests and handle compression, authentication, caching, user agents, and HTTP headers
  • Built-in support for selector languages such as CSS and XPath to select and extract data, plus support for regular expressions to select content and links
  • Encoding support to handle languages and non-standard encoding declarations
  • Flexible APIs to reuse and write custom middlewares and pipelines, which provide a clean, simple way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in file systems, S3, databases, and so on
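As a tiny offline illustration of the XPath selection mentioned above, the standard library's xml.etree also understands a subset of XPath (attribute equality works, but not functions like contains(), which Scrapy's full XPath engine supports). The snippet below runs on a hypothetical, simplified excerpt of the events markup, not the live page:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified excerpt of the events list (well-formed XML).
SNIPPET = """
<div>
  <ul class="list-recent-events">
    <li><h3 class="event-title"><a href="#">PyCon US 2018</a></h3>
        <p><span class="event-location">Cleveland, Ohio, USA</span></p></li>
  </ul>
</div>
"""

root = ET.fromstring(SNIPPET)
# ElementTree's XPath subset: exact attribute match instead of contains().
for li in root.findall(".//ul[@class='list-recent-events']/li"):
    name = li.find('h3/a').text
    location = li.find("p/span[@class='event-location']").text
    print({'name': name, 'location': location})
```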

There are several ways to use Scrapy. One is the programmatic mode, where we create the crawler and spider in code. It is also possible to generate a Scrapy project from templates and run it from the command line. This book follows the programmatic mode, since it keeps the code in a single file.

Code: 03_events_with_scrapy.py

 import scrapy
 from scrapy.crawler import CrawlerProcess

 class PythonEventsSpider(scrapy.Spider):
     name = 'pythoneventsspider'
     start_urls = ['https://www.python.org/events/python-events/',]
     found_events = []

     def parse(self, response):
         for event in response.xpath('//ul[contains(@class, "list-recent-events")]/li'):
             event_details = dict()
             event_details['name'] = event.xpath('h3[@class="event-title"]/a/text()').extract_first()
             event_details['location'] = event.xpath('p/span[@class="event-location"]/text()').extract_first()
             event_details['time'] = event.xpath('p/time/text()').extract_first()
             self.found_events.append(event_details)

 if __name__ == "__main__":
     process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
     process.crawl(PythonEventsSpider)
     spider = next(iter(process.crawlers)).spider
     process.start()
     for event in spider.found_events:
         print(event)

Exercise: use Scrapy to scrape the blog titles (10 in total) from the home page of https://china-testing.github.io/.

Sample answer:

03_blog_with_scrapy.py

 import scrapy
 from scrapy.crawler import CrawlerProcess

 class PythonEventsSpider(scrapy.Spider):
     name = 'pythoneventsspider'
     start_urls = ['https://china-testing.github.io/',]
     found_events = []

     def parse(self, response):
         for event in response.xpath('//article//h1'):
             event_details = dict()
             event_details['name'] = event.xpath('a/text()').extract_first()
             self.found_events.append(event_details)

 if __name__ == "__main__":
     process = CrawlerProcess({'LOG_LEVEL': 'ERROR'})
     process.crawl(PythonEventsSpider)
     spider = next(iter(process.crawlers)).spider
     process.start()
     for event in spider.found_events:
         print(event)

Scraping Python.org with Selenium and PhantomJS

04_events_with_selenium.py

 from selenium import webdriver

 def get_upcoming_events(url):
     driver = webdriver.Chrome()
     driver.get(url)
     events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
     for event in events:
         event_details = dict()
         event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
         event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
         event_details['time'] = event.find_element_by_xpath('p/time').text
         print(event_details)
     driver.close()

 get_upcoming_events('https://www.python.org/events/python-events/')

Switching to driver = webdriver.PhantomJS('phantomjs') runs the scrape without a visible browser window; the code is as follows:

05_events_with_phantomjs.py

 from selenium import webdriver

 def get_upcoming_events(url):
     # PhantomJS runs headlessly, with no browser window
     driver = webdriver.PhantomJS('phantomjs')
     driver.get(url)
     events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
     for event in events:
         event_details = dict()
         event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
         event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
         event_details['time'] = event.find_element_by_xpath('p/time').text
         print(event_details)
     driver.close()

 get_upcoming_events('https://www.python.org/events/python-events/')

That said, Selenium's headless mode is now a better replacement for PhantomJS.

04_events_with_selenium_headless.py

 from selenium import webdriver

 def get_upcoming_events(url):
     options = webdriver.ChromeOptions()
     options.add_argument('headless')
     driver = webdriver.Chrome(chrome_options=options)
     driver.get(url)
     events = driver.find_elements_by_xpath('//ul[contains(@class, "list-recent-events")]/li')
     for event in events:
         event_details = dict()
         event_details['name'] = event.find_element_by_xpath('h3[@class="event-title"]/a').text
         event_details['location'] = event.find_element_by_xpath('p/span[@class="event-location"]').text
         event_details['time'] = event.find_element_by_xpath('p/time').text
         print(event_details)
     driver.close()

 get_upcoming_events('https://www.python.org/events/python-events/')

References

Original article: https://yq.aliyun.com/articles/637658