Scraping haozu data with the Python Scrapy framework

2019-06-30

1. Create the project
Create a project from the console with scrapy startproject <project name>.
Here we run scrapy startproject haozu to create the crawler project.

2. Create the spider file
In the console, change into the spiders folder and run scrapy genspider <spider name> <site domain>.
Here, scrapy genspider haozu_xzl www.haozu.com creates the spider file.

3. Write the code in the spider file haozu_xzl.py (Python version 3.6.0)

# -*- coding: utf-8 -*-

import requests
import scrapy
from lxml import etree

from ..items import HaozuItem


def join_text(values):
    # Join XPath string results into one string. Slicing str(list) instead
    # would leave escape sequences such as a literal "\n" in the output.
    return ", ".join(v.strip() for v in values if v.strip())


class HaozuXzlSpider(scrapy.Spider):
    # Run with: scrapy crawl haozu_xzl
    name = 'haozu_xzl'
    # allowed_domains takes bare domain names, not URLs with a path
    # allowed_domains = ['www.haozu.com']
    # Unused while start_requests() is overridden, but must be a list
    start_urls = ["http://www.haozu.com/sz/zuxiezilou/"]
    # City codes used in the listing URLs
    province_list = ['bj', 'sh', 'gz', 'sz', 'cd', 'cq', 'cs', 'dl', 'fz', 'hz', 'hf', 'nj', 'jian', 'jn', 'km', 'nb', 'sy',
                     'su', 'sjz', 'tj', 'wh', 'wx', 'xa', 'zz']

    def start_requests(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2'
        headers = {'User-Agent': user_agent}
        for s in self.province_list:
            start_url = "http://www.haozu.com/{}/zuxiezilou/".format(s)
            # A function containing yield is a generator: it produces one value,
            # is suspended, and produces the next value when resumed
            yield scrapy.Request(url=start_url, headers=headers, method='GET', callback=self.parse,
                                 meta={"headers": headers, "city": s})

    def parse(self, response):
        # City page: collect the district links
        selector = etree.HTML(response.body.decode('utf-8'))
        elem_list = selector.xpath('/html/body/div[2]/div[2]/div/dl[1]/dd/div[2]/div[1]/a')
        # The first and last anchors are skipped, as in the original slice
        for elem in elem_list[1:-1]:
            try:
                district = join_text(elem.xpath("text()"))
                district_href = join_text(elem.xpath("@href"))
                elem_url = "http://www.haozu.com{}".format(district_href)
                self.logger.debug("district %s -> %s", district, elem_url)
                yield scrapy.Request(url=elem_url, headers=response.meta["headers"], method='GET',
                                     callback=self.detail_url,
                                     meta={"district": district, "url": elem_url,
                                           "headers": response.meta["headers"], "city": response.meta["city"]})
            except Exception as e:
                self.logger.error(e)

    def detail_url(self, response):
        # District page: walk listing pages o1/ .. o49/ and collect detail links
        for i in range(1, 50):
            # Build the URL of listing page i
            re_url = "{}o{}/".format(response.meta["url"], i)
            try:
                # Note: requests.get blocks Scrapy's event loop while it waits
                response_elem = requests.get(re_url, headers=response.meta["headers"], timeout=10)
                seles = etree.HTML(response_elem.content)
                sele_list = seles.xpath("/html/body/div[3]/div[1]/ul[1]/li")
                for sele in sele_list:
                    href = join_text(sele.xpath("./div[2]/h1/a/@href"))
                    href_url = "http://www.haozu.com{}".format(href)
                    yield scrapy.Request(url=href_url, headers=response.meta["headers"], method='GET',
                                         callback=self.final_url,
                                         meta={"district": response.meta["district"], "city": response.meta["city"]})
            except Exception as e:
                self.logger.error(e)

    def final_url(self, response):
        try:
            sele_body = etree.HTML(response.body.decode('utf-8'))
            # Extract the price, title and address
            item = HaozuItem()
            item["city"] = response.meta["city"]
            item['district'] = response.meta["district"]
            item['addr'] = join_text(sele_body.xpath("/html/body/div[2]/div[2]/div/div/div[2]/span[1]/text()[2]"))
            item['title'] = join_text(sele_body.xpath("/html/body/div[2]/div[2]/div/div/div[1]/h1/span/text()"))
            price = join_text(sele_body.xpath("/html/body/div[2]/div[3]/div[2]/div[1]/span/text()"))
            # price_danwei is the unit shown next to the price
            price_danwei = join_text(sele_body.xpath("/html/body/div[2]/div[3]/div[2]/div[1]/div/div/i/text()"))
            item['price'] = price + price_danwei
            yield item
        except Exception as e:
            self.logger.error(e)
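
The spider works through three levels of URLs: a city page, the numbered listing pages under a district, and the detail page of each office. The following is a minimal sketch of that URL scheme using only the standard library. The helper names city_url, page_urls and absolute are made up for illustration and do not appear in the spider; the page range 1 to 49 mirrors range(1, 50) in detail_url.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.haozu.com"  # site root used throughout the spider


def city_url(city):
    """Office-listing entry page for one city code, e.g. 'sz'."""
    return urljoin(BASE_URL, "/{}/zuxiezilou/".format(city))


def page_urls(listing_url, first=1, last=49):
    """Paginated listing URLs of the form <listing_url>o<N>/."""
    return ["{}o{}/".format(listing_url, n) for n in range(first, last + 1)]


def absolute(href):
    """Turn an href taken from a page into a full URL."""
    return urljoin(BASE_URL, href)


if __name__ == "__main__":
    # The city page stands in for a district page here (illustration only)
    print(city_url("sz"))
    for url in page_urls(city_url("sz"), last=3):
        print(url)
```

One reason to prefer urljoin over string formatting is that it also leaves an already absolute href unchanged, whereas "http://www.haozu.com{}".format(href) would produce a broken URL in that case.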

4. Modify items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HaozuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    district = scrapy.Field()
    title = scrapy.Field()
    addr = scrapy.Field()
    price = scrapy.Field()

5. Modify settings.py

Uncomment the ITEM_PIPELINES setting:
ITEM_PIPELINES = {
    'haozu.pipelines.HaozuPipeline': 300,
}
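
The number 300 sets the order in which pipelines run: items pass through lower values first, and values are customarily chosen in the 0 to 1000 range. The excerpt below is a sketch of how this setting might sit alongside a few other standard Scrapy settings. Only ITEM_PIPELINES comes from the article; the other values are illustrative assumptions, not the author's configuration.

```python
# settings.py (excerpt); values other than ITEM_PIPELINES are assumptions

# Enable the CSV pipeline; lower numbers run earlier
ITEM_PIPELINES = {
    "haozu.pipelines.HaozuPipeline": 300,
}

# Wait between requests to the same site (seconds); assumed value
DOWNLOAD_DELAY = 1.0

# Limit parallel requests per domain; assumed value
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Project-wide User-Agent, as an alternative to passing headers on every request
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 "
    "(KHTML, like Gecko) Version/5.1.7 Safari/534.57.2"
)
```

Setting USER_AGENT here would make the headers dictionary that the spider carries around in meta unnecessary, at the cost of using one User-Agent for the whole project.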

6. Modify pipelines.py; this is where you can customize the storage format

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv


class HaozuPipeline(object):

    def process_item(self, item, spider):
        # Append one row per item; the with-block closes the file after each write
        with open('./xiezilou2.csv', 'a+', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            writer.writerow((item['city'], item['district'], item['addr'], item['title'], item['price']))
        print(item)
        return item
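
Opening the CSV file again for every item works but is wasteful on a large crawl. Scrapy calls the optional open_spider and close_spider methods of a pipeline once at the start and end of a run, so a variant could keep a single file handle open. The sketch below assumes that approach; the class name CsvExportPipeline and the path argument are hypothetical and not part of the article's project.

```python
import csv


class CsvExportPipeline:
    """Hypothetical variant of HaozuPipeline that keeps one file handle open."""

    # Column order matches the row written in the article's pipeline
    fields = ("city", "district", "addr", "title", "price")

    def __init__(self, path="./xiezilou2.csv"):
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "a+", encoding="utf-8", newline="")
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        # Called once when the spider finishes; flushes and closes the file
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([item[name] for name in self.fields])
        return item
```

To use such a class, its dotted path would replace haozu.pipelines.HaozuPipeline in ITEM_PIPELINES. Because the default path is supplied in the constructor, Scrapy can still instantiate it without arguments.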

7. Start the crawl

Run scrapy crawl haozu_xzl in the console to start the program.


Original article: https://yq.aliyun.com/articles/707004
