Abstract: This article introduces [Cloud Computing] Scraping Dangdang Specialty Goods with Scrapy (a Python crawler), walking through the concrete steps in the hope that it helps readers studying cloud computing.
I. Overall Approach
1. Create the Scrapy project
2. Analyze the URLs of Dangdang's specialty-goods pages
3. Work out the XPath expressions for the fields to extract
4. Write the item
5. Write the spider
6. Write the pipeline that saves the scraped data to a file
II. Implementation
1. Create the Scrapy project
scrapy startproject autopjt
2. Analyze the URLs of Dangdang's specialty-goods pages
Comparing page 2 (http://category.dangdang.com/pg2-cid4011029.html) and page 3 (http://category.dangdang.com/pg3-cid4011029.html), the number after pg is clearly the page number.
If we rewrite the first page's URL in the same form, http://category.dangdang.com/pg1-cid4011029.html, it shows the same content as page 1, which confirms the pattern.
So the URL actually used is http://category.dangdang.com/pg{i}-cid4011029.html, where i is the page number.
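As a quick sanity check of the pattern (the http:// scheme is assumed here, matching the spider below), all 75 page URLs can be generated in one line:

# Generate the 75 listing-page URLs from the pattern above
urls = ["http://category.dangdang.com/pg{}-cid4011029.html".format(i)
        for i in range(1, 76)]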
3. Work out the XPath expressions for the fields to extract
Inspecting the page source, each product's title and link live on an <a class="pic" title="..." href="..."> element, the price is the text of a <span class="price_n"> element (e.g. ¥16.80), and the comment count is the text of an <a name="itemlist-review"> link (e.g. 198条评论).
From this we can derive the XPath expressions:
# price    //span[@class='price_n']/text()
# title    //a[@class='pic']/@title
# link     //a[@class='pic']/@href
# comments //a[@name='itemlist-review']/text()
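Before wiring these into the spider, they are easy to verify interactively with scrapy shell (an optional check, not part of the original write-up):

scrapy shell "http://category.dangdang.com/pg1-cid4011029.html"
>>> response.xpath("//span[@class='price_n']/text()").extract()[:3]
>>> response.xpath("//a[@class='pic']/@title").extract()[:3]

If both return non-empty lists, the expressions match the live page.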
4. Project code
Project structure
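For reference, this is the standard layout scrapy startproject generates, plus the autospd spider created below:

autopjt/
    scrapy.cfg
    autopjt/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            autospd.py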
(1) items.py

# -*- coding: utf-8 -*-
import scrapy


class AutopjtItem(scrapy.Item):
    # name stores the product title
    name = scrapy.Field()
    # price stores the product price
    price = scrapy.Field()
    # link stores the product link
    link = scrapy.Field()
    # comnum stores the product's comment count
    comnum = scrapy.Field()
(2) AutospdSpider
Create the spider file: scrapy genspider -t basic autospd dangdang.com

# -*- coding: utf-8 -*-
import scrapy
from autopjt.items import AutopjtItem
from scrapy.http.request import Request

# price    //span[@class='price_n']/text()
# title    //a[@class='pic']/@title
# link     //a[@class='pic']/@href
# comments //a[@name='itemlist-review']/text()


class AutospdSpider(scrapy.Spider):
    name = 'autospd'
    allowed_domains = ['dangdang.com']
    # The URL needs an explicit scheme; a protocol-relative
    # //category.dangdang.com/... would make Scrapy raise
    # "Missing scheme in request url"
    start_urls = ['http://category.dangdang.com/pg1-cid4011029.html']

    def parse(self, response):
        item = AutopjtItem()
        # Extract the product names, prices, links and comment counts
        # with the XPath expressions worked out above
        item['name'] = response.xpath("//a[@class='pic']/@title").extract()
        item['price'] = response.xpath("//span[@class='price_n']/text()").extract()
        item['link'] = response.xpath("//a[@class='pic']/@href").extract()
        item['comnum'] = response.xpath("//a[@name='itemlist-review']/text()").extract()
        # Hand the extracted item to the pipeline
        yield item
        # The key step: crawl all 75 pages automatically via a loop
        # (page 1 is already covered by start_urls, so start at 2)
        for i in range(2, 76):
            url = "http://category.dangdang.com/pg" + str(i) + "-cid4011029.html"
            # Yield a Request for each page with parse() as the callback,
            # so the crawl proceeds automatically
            yield Request(url, callback=self.parse)
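A design note: parse() yields one item per page whose fields are parallel lists, which is why the pipeline below has to re-align them by index. An alternative sketch (a variation, not the original code) is a drop-in replacement for parse() that zips the lists and yields one complete item per product:

    def parse(self, response):
        # Variation: emit one item per product instead of parallel lists
        names = response.xpath("//a[@class='pic']/@title").extract()
        prices = response.xpath("//span[@class='price_n']/text()").extract()
        links = response.xpath("//a[@class='pic']/@href").extract()
        comnums = response.xpath("//a[@name='itemlist-review']/text()").extract()
        for name, price, link, comnum in zip(names, prices, links, comnums):
            item = AutopjtItem()
            item['name'], item['price'] = name, price
            item['link'], item['comnum'] = link, comnum
            yield item

With this version the pipeline's inner loop collapses to a single json.dumps(dict(item)) per item.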
(3) pipelines.py

# -*- coding: utf-8 -*-
import json
import codecs


class AutopjtPipeline(object):
    def __init__(self):
        # Open the output file once when the pipeline starts
        self.file = codecs.open("C:/Users/Administrator/Desktop/dangdangdata.json",
                                "w", encoding="utf-8")

    def process_item(self, item, spider):
        for j in range(0, len(item['name'])):
            # Take the j-th product's fields from the current page
            name = item["name"][j]
            price = item["price"][j]
            link = item["link"][j]
            comnum = item["comnum"][j]
            # Recombine the j-th product's name, price, link and comnum
            # into a single dict
            goods = {"name": name, "price": price, "link": link, "comnum": comnum}
            # Write the j-th product to the JSON file, one object per line
            line = json.dumps(goods, ensure_ascii=False) + "\n"
            self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close dangdangdata.json when the spider finishes
        self.file.close()
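For a simple dump like this, Scrapy's built-in feed exports can also stand in for the hand-written pipeline; the crawl can write JSON lines directly (each line would still contain the parallel lists, unless the spider yields one item per product as sketched above):

scrapy crawl autospd -o dangdangdata.jl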
(4)settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for autopjt project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'autopjt'

SPIDER_MODULES = ['autopjt.spiders']
NEWSPIDER_MODULE = 'autopjt.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'autopjt (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'autopjt.middlewares.AutopjtSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'autopjt.middlewares.AutopjtDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'autopjt.pipelines.AutopjtPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Note: if a robots.txt error appears, changing ROBOTSTXT_OBEY here from True to False resolves it.
III. Running and Results
scrapy crawl autospd --nolog
Open the dangdangdata.json file; the results are as follows.
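Each line of the file is one product serialized as a JSON object. Using the sample values seen during the page analysis (the title and link here are placeholders), a line looks like:

{"name": "<product title>", "price": "¥16.80", "link": "<product URL>", "comnum": "198条评论"}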