Scraping haozu data with the Python Scrapy framework

1. Create the project. From the console, run scrapy startproject <project name>; here scrapy startproject haozu creates the crawler project.

2. Create the spider file. From the console, change into the spiders folder and run scrapy genspider <spider name> <site domain>; here scrapy genspider haozu_xzl www.haozu.com creates the spider file.

3. Write the spider code in haozu_xzl.py (Python version 3.6.0):

# -*- coding: utf-8 -*-
import scrapy
import requests
from lxml import html
etree = html.etree
from ..items import HaozuItem
import random


class HaozuXzlSpider(scrapy.Spider):
    # run with: scrapy crawl haozu_xzl
    name = 'haozu_xzl'
    # allowed_domains = ['www.haozu.com/sz/zuxiezilou/']
    start_urls = "http://www.haozu.com/sz/zuxiezilou/"
    province_list = ['bj', 'sh', 'gz', 'sz', 'cd', 'cq', 'cs', 'dl', 'fz', 'hz', 'hf', 'nj',
                     'jian', 'jn', 'km', 'nb', 'sy', 'su', 'sjz', 'tj', 'wh', 'wx', 'xa', 'zz']

    def start_requests(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2'
        headers = {'User-Agent': user_agent}
        for s in self.province_list:
            start_url = "http://www.haozu.com/{}/zuxiezilou/".format(s)
            # A function containing yield is a generator: it yields one value,
            # freezes, and yields the next value when resumed.
            yield scrapy.Request(url=start_url, headers=headers, method='GET',
                                 callback=self.parse,
                                 meta={"headers": headers, "city": s})

    def parse(self, response):
        lists = response.body.decode('utf-8')
        selector = etree.HTML(lists)
        elem_list = selector.xpath('/html/body/div[2]/div[2]/div/dl[1]/dd/div[2]/div[1]/a')
        print(elem_list, type(elem_list))
        for elem in elem_list[1:-1]:
            try:
                district = str(elem.xpath("text()"))[1:-1].replace("'", '')
                print(district, type(district))
                district_href = str(elem.xpath("@href"))[1:-1].replace("'", '')
                print(district_href, type(district_href))
                elem_url = "http://www.haozu.com{}".format(district_href)
                print(elem_url)
                yield scrapy.Request(url=elem_url, headers=response.meta["headers"], method='GET',
                                     callback=self.detail_url,
                                     meta={"district": district, "url": elem_url,
                                           "headers": response.meta["headers"],
                                           "city": response.meta["city"]})
            except Exception as e:
                print(e)

    def detail_url(self, response):
        for i in range(1, 50):
            # build the paginated listing URL
            re_url = "{}o{}/".format(response.meta["url"], i)
            print(re_url)
            try:
                response_elem = requests.get(re_url, headers=response.meta["headers"])
                seles = etree.HTML(response_elem.content)
                sele_list = seles.xpath("/html/body/div[3]/div[1]/ul[1]/li")
                for sele in sele_list:
                    href = str(sele.xpath("./div[2]/h1/a/@href"))[1:-1].replace("'", '')
                    href_url = "http://www.haozu.com{}".format(href)
                    print(href_url)
                    yield scrapy.Request(url=href_url, headers=response.meta["headers"], method='GET',
                                         callback=self.final_url,
                                         meta={"district": response.meta["district"],
                                               "city": response.meta["city"]})
            except Exception as e:
                print(e)

    def final_url(self, response):
        try:
            body = response.body.decode('utf-8')
            sele_body = etree.HTML(body)
            # extract the price, title and address
            item = HaozuItem()
            item["city"] = response.meta["city"]
            item['district'] = response.meta["district"]
            item['addr'] = str(sele_body.xpath("/html/body/div[2]/div[2]/div/div/div[2]/span[1]/text()[2]"))[1:-1].replace("'", '')
            item['title'] = str(sele_body.xpath("/html/body/div[2]/div[2]/div/div/div[1]/h1/span/text()"))[1:-1].replace("'", '')
            price = str(sele_body.xpath("/html/body/div[2]/div[3]/div[2]/div[1]/span/text()"))[1:-1].replace("'", '')
            price_danwei = str(sele_body.xpath("/html/body/div[2]/div[3]/div[2]/div[1]/div/div/i/text()"))[1:-1].replace("'", '')
            print(price + price_danwei)
            item['price'] = price + price_danwei
            yield item
        except Exception as e:
            print(e)

4. Edit items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class HaozuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()
    district = scrapy.Field()
    title = scrapy.Field()
    addr = scrapy.Field()
    price = scrapy.Field()

5. Edit settings.py and enable the pipeline:

ITEM_PIPELINES = {
    'haozu.pipelines.HaozuPipeline': 300,
}

6. Edit pipelines.py; the storage format can be customized here:

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import csv


class HaozuPipeline(object):
    def process_item(self, item, spider):
        f = open('./xiezilou2.csv', 'a+', encoding='utf-8', newline='')
        write = csv.writer(f)
        write.writerow((item['city'], item['district'], item['addr'], item['title'], item['price']))
        print(item)
        return item

7. Start the crawler. From the console, run scrapy crawl haozu_xzl.
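One detail worth flagging in the pipeline above: process_item reopens ./xiezilou2.csv for every single item and never closes the handle. A minimal alternative sketch (not from the original post; the class name HaozuCsvPipeline is my own) uses Scrapy's open_spider/close_spider hooks to open the file once and close it cleanly, keeping the same file name and column order:

# pipelines.py -- a sketch, assuming the same HaozuItem fields as above
import csv


class HaozuCsvPipeline(object):
    def open_spider(self, spider):
        # open the CSV once when the spider starts
        self.file = open('./xiezilou2.csv', 'a+', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow((item['city'], item['district'], item['addr'],
                              item['title'], item['price']))
        return item

    def close_spider(self, spider):
        # close the handle when the spider finishes
        self.file.close()

To use it instead, point ITEM_PIPELINES at 'haozu.pipelines.HaozuCsvPipeline' with the same priority of 300.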

Scraping Doutula images with Python

Target site: Doutula. Dependencies: requests, BeautifulSoup4 (with lxml as the parser). Code:

# -*- coding:utf-8 -*-
# pip install requests
import requests
# pip install beautifulsoup4
# pip install lxml  (the parser)
from bs4 import BeautifulSoup
import os


class doutuSpider(object):
    headers = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
    }

    def get_url(self, url):
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.findAll("a", {"class": "list-group-item"})
        for one in totals:
            sub_url = one.get('href')
            global path
            path = 'E:\\img' + '\\' + sub_url.split('/')[-1]
            os.mkdir(path)
            try:
                self.get_img_url(sub_url)
            except:
                pass

    def get_img_url(self, url):
        data = requests.get(url, headers=self.headers)
        soup = BeautifulSoup(data.content, 'lxml')
        totals = soup.findAll('div', {'class': 'artile_des'})
        for one in totals:
            img = one.find('img')
            try:
                sub_url = img.get('src')
            except Exception as e:
                raise e
            finally:
                urls = sub_url
            try:
                self.get_img(urls)
            except:
                print(urls)

    def get_img(self, url):
        filename = url.split('/')[-1]
        global path
        img_path = path + '\\' + filename
        img = requests.get(url, headers=self.headers)
        try:
            with open(img_path, 'wb') as f:
                f.write(img.content)
        except:
            pass

    def create(self):
        for count in range(1, 10):
            url = 'https://www.doutula.com/article/list/?page={}'.format(count)
            print('download {} page'.format(count))
            self.get_url(url)


if __name__ == '__main__':
    doutu = doutuSpider()
    doutu.create()
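One fragile spot in get_url above: os.mkdir raises an OSError when the target folder already exists, and the call is not wrapped in a try block, so re-running the script against an existing E:\img tree aborts. A small guard you could swap in (ensure_dir is a hypothetical helper name, not part of the original post):

import os


def ensure_dir(path):
    # create the album folder only if it is missing, then reuse it
    if not os.path.isdir(path):
        os.mkdir(path)
    return path

In get_url, path = ensure_dir('E:\\img' + '\\' + sub_url.split('/')[-1]) would then replace the bare os.mkdir call.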

Scraping Meizitu with Python and BeautifulSoup

I recently noticed that the Meizitu crawler I wrote a while back no longer works; presumably the site has added new anti-crawling measures, so in the spirit of pursuing the truth I had to crawl it all over again.

The approach:

Site address: http://www.meizitu.com/
Get the category tag URLs from the home page and pass them to the next step.
Get the content-page URLs under each category.
Get the image URLs and the title from each content page, using the page title as the folder name.
Finally, save the images.

Code

Required packages:

import os
import sys
import urllib2
from bs4 import BeautifulSoup
import requests
import lxml
import uuid

Getting the URLs

First of all, BeautifulSoup really is a crawler's best friend; just note that what it returns here is a list, so you still need a for loop to read each address. A bit of the official description:

Beautiful Soup provides simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolkit that parses a document and hands you the data you need to scrape; because it is so simple, a complete application takes very little code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You don't need to think about encodings unless the document doesn't declare one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding.

Beautiful Soup has become an excellent Python parser on a par with lxml and html6lib, giving users the flexibility to pick different parsing strategies or trade them for raw speed.

The code below fetches each category's URL from the home page; fetching the image URLs and the content-page URLs works much the same way, you just keep nesting.

def get_mei_channel(url):
    web_data = requests.get(url)
    web_data.encoding = 'gb2312'
    soup = BeautifulSoup(web_data.text, 'lxml')
    channel = soup.select('body span a')
    return channel
## get the category URLs

Saving the images

Note that you now have to send a header when saving the images; the site has apparently tightened its validation, because last year you could save the Meizitu pictures directly. For file names I pull in the uuid package to generate a unique GUID, so duplicate names don't make a save fail.

def save_pic(url, path):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    request = urllib2.Request(url, None, header)
    response = urllib2.urlopen(request)
    filename = path + '/' + str(uuid.uuid1()) + '.jpg'
    with open(filename, "wb") as f:
        f.write(response.read())
## save an image, using a unique GUID as the file name

Nesting it together

Finally, nest the steps together following the approach above. The full code:

# -*- coding: utf-8 -*-
import os
import sys
import urllib2
from bs4 import BeautifulSoup
import requests
import lxml
import uuid

def judge_folder(path):
    if os.path.isdir(path):
        return False
    else:
        os.mkdir(path)
        return True

def save_pic(url, path):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    request = urllib2.Request(url, None, header)
    response = urllib2.urlopen(request)
    filename = path + '/' + str(uuid.uuid1()) + '.jpg'
    with open(filename, "wb") as f:
        f.write(response.read())

def get_mei_channel(url):
    web_data = requests.get(url)
    web_data.encoding = 'gb2312'
    soup = BeautifulSoup(web_data.text, 'lxml')
    channel = soup.select('body span a')
    return channel

def get_mei_info(url):
    web_data = requests.get(url)
    web_data.encoding = 'gb2312'
    soup = BeautifulSoup(web_data.text, 'lxml')
    info = soup.select('body div.pic a')
    return info

def get_mei_pic(url):
    web_data = requests.get(url)
    web_data.encoding = 'gb2312'
    soup = BeautifulSoup(web_data.text, 'lxml')
    pic = soup.select('body p img')
    titlelist = soup.select('body div h2 a')
    for list in titlelist:
        path_folder = format(list.get_text())
        path = root_folder + path_folder.encode('utf-8')
        print 'Creating folder >>> ' + path_folder.encode('utf-8') + ' >>>'
        if judge_folder(path):
            print '*** Starting the download! ***'
        else:
            pic = []
            print '*** Folder already exists, moving on to the next page ***'
    return pic, path

def MeiZiTuSpider(url):
    channel_list = get_mei_channel(url)
    for channel in channel_list:
        channel_url = channel.get('href')
        channel_title = channel.get('title')
        print '*** Looking for images under the ' + channel_title.encode('utf-8') + ' category ***'
        info_list = get_mei_info(channel_url)
        for info in info_list:
            info_url = info.get('href')
            pic_list, path = get_mei_pic(info_url)
            for pic in pic_list:
                pic_url = pic.get('src')
                save_pic(pic_url, path)

root_folder = 'MEIZITU/'
url = 'http://www.meizitu.com/'

if __name__ == "__main__":
    if os.path.isdir(root_folder):
        pass
    else:
        os.mkdir(root_folder)
    MeiZiTuSpider(url)
    print '****MeiZiTuSpider@Awesome_Tang****'

There is actually one more step you could take: each category currently only pulls the first listing page, so adding one more loop over the page numbers would let you download pretty much everything (a sketch of that loop follows below), but my poor old Mac can't take it; feel free to try it yourself. I've also packaged the code into an exe; leave a comment or send me a private message if you'd like it and I'll send it over ^^ peace~
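A sketch of that extra pagination loop, under stated assumptions: it reuses get_mei_info from the full code above, and the listing-page URL pattern is a placeholder I am guessing at for illustration; replace it with whatever the browser's address bar actually shows for page 2, 3, ... of a category.

def get_mei_info_paged(first_url, page_url_template, max_pages=10):
    # collect content-page links from several listing pages of one category;
    # page_url_template is a hypothetical placeholder, e.g. a pattern like
    # 'http://www.meizitu.com/.../list_{}.html' -- check the real URLs first
    info_list = []
    for page in range(1, max_pages + 1):
        page_url = first_url if page == 1 else page_url_template.format(page)
        info_list.extend(get_mei_info(page_url))
    return info_list

Inside MeiZiTuSpider, calling get_mei_info_paged(channel_url, page_url_template) in place of get_mei_info(channel_url) would then walk every listing page of a category instead of only the first one.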
