Table of Contents
11.1. Installing the scrapy development environment
  11.1.1. Mac
  11.1.2. Ubuntu
  11.1.3. Installing scrapy with pip
  11.1.4. Testing scrapy
11.2. scrapy commands
  11.2.1.
  11.2.2. Creating a new spider
  11.2.3. Listing the available spiders
  11.2.4. Running a spider
11.3. Scrapy Shell
  11.3.1. response
    11.3.1.1. Current URL
    11.3.1.2. status: HTTP status
    11.3.1.3. text: response body
    11.3.1.4. css
      11.3.1.4.1. Getting HTML attributes
    11.3.1.5. xpath
    11.3.1.6. headers
11.4. Crawler project
  11.4.1. Creating a project
  11.4.2. Spider
    11.4.2.1. Pagination
    11.4.2.2. Saving scraped content to a file
  11.4.3. settings.py: the crawler configuration file
    11.4.3.1. Ignoring robots.txt rules
  11.4.4. Item
  11.4.5. Pipeline
11.5. Downloading images
  11.5.1. Configuring settings.py
  11.5.2. Modifying pipelines.py
  11.5.3. Editing items.py
  11.5.4. The spider file
11.6. xpath
  11.6.1. Logical operators
    11.6.1.1. and
    11.6.1.2. or
  11.6.2. function
    11.6.2.1. text()
    11.6.2.2. contains()
https://scrapy.org
neo@MacBook-Pro ~ % brew install python3
neo@MacBook-Pro ~ % pip3 install scrapy
Search for the scrapy package. Scrapy 1.x supports both Python 2.7 and Python 3; we only need the Python 3 version.
neo@netkiller ~ % apt-cache search scrapy | grep python3
python3-scrapy - Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python3 version)
python3-w3lib - Collection of web-related functions (Python 3)
On Ubuntu 17.04 the packaged scrapy version is 1.3.0-1. If you need the newer 1.4.0 release, install it with pip instead.
neo@netkiller ~ % apt search python3-scrapy
Sorting... Done
Full Text Search... Done
python3-scrapy/zesty,zesty 1.3.0-1~exp2 all
Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem/zesty,zesty 1.1.1-1 all
Scrapy extension to write scraped items using Django models (Python3 version)
Install scrapy
neo@netkiller ~ % sudo apt install python3-scrapy
[sudo] password for neo:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml python3-mysqldb
python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch
python3-pygments python3-queuelib python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth
python3-webencodings python3-zope.interface
Suggested packages:
python-pexpect-doc python-attr-doc python-cryptography-doc python3-cryptography-vectors python3-genshi python3-lxml-dbg python-lxml-doc default-mysql-server | virtual-mysql-server
python-egenix-mxdatetime python3-mysqldb-dbg python-openssl-doc python3-openssl-dbg python3-pam-dbg python-pil-doc python3-pil-dbg doc-base python-pydispatch-doc ttf-bitstream-vera python-scrapy-doc
python3-wxgtk3.0 | python3-wxgtk python-setuptools-doc python3-tk python3-gtk2 python3-glade2 python3-qt4 python3-wxgtk2.8 python3-twisted-bin-dbg
The following NEW packages will be installed:
ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml python3-mysqldb
python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch
python3-pygments python3-queuelib python3-scrapy python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin python3-w3lib
python3-wcwidth python3-webencodings python3-zope.interface
0 upgraded, 49 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,152 kB of archives.
After this operation, 40.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
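Type an uppercase "Y" and press Enter. The capital letter in [Y/n] marks the default, so pressing Enter alone also confirms.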
neo@netkiller ~ % sudo apt install python3-pip
neo@netkiller ~ % pip3 install scrapy
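Because the apt package and the pip package can differ in version (1.3.0 versus 1.4.0 above), it can help to check which one the current interpreter actually sees. The following is a minimal sketch, assuming Python 3.8 or later for importlib.metadata; the helper names installed_version and version_tuple are hypothetical and not part of scrapy, and the version comparison is deliberately simplistic (first three numeric fields only).

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '1.4.0' into (1, 4, 0); only the first three numeric fields are used."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


if __name__ == "__main__":
    found = installed_version("scrapy")
    if found is None:
        print("scrapy is not installed for this interpreter")
    elif version_tuple(found) < version_tuple("1.4.0"):
        print("scrapy", found, "is older than 1.4.0; consider upgrading with pip")
    else:
        print("scrapy", found, "is installed")
```

Run it with the same python3 that pip3 installs into; a different interpreter may report a different result.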
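Create a test program to verify that scrapy is installed correctly. The example calls response.follow, which was added in Scrapy 1.4, so it will not run on the 1.3.0 package from the Ubuntu 17.04 repository.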
$ cat > myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Emit one item per post title found on the current page
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        # Follow each pagination link and parse that page the same way
        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
EOF
Run the spider
$ scrapy runspider myspider.py
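Source: Netkiller 系列 手札 (the Netkiller notebook series)
Author: 陈景峯
To reproduce this article, please contact the author, and be sure to credit the original source and author and include this notice.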