This guide walks you through Scrapy, a highly popular open-source web crawling framework.
Installing Scrapy
Official site: https://scrapy.org/
Installation
On any operating system, Scrapy can be installed with pip, for example:
$ pip install scrapy
To confirm that Scrapy installed successfully, first check in Python that the scrapy module can be imported:
>>> import scrapy
>>> scrapy.version_info
(1, 8, 0)
Then check in the shell that the scrapy command can be run:
(base) λ scrapy
Scrapy 1.8.0 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Passing both checks means Scrapy is installed correctly. As shown above, we installed what was then the latest version, 1.8.0.
Note:
During installation you may hit errors about missing dependencies such as the VC++ build tools; installing offline packages for the missing modules resolves this.
Even after installation, seeing the usage output above when running scrapy in CMD is not a full verification. To truly confirm the installation, run scrapy bench; if it completes without errors, Scrapy is installed correctly.
For the detailed installation procedure, see http://doc.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes, which covers installation on each platform.
Global commands
$ scrapy
Scrapy 1.8.0 - no active project
Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
                ## Benchmarks the machine's crawling performance.
  fetch         Fetch a URL using the Scrapy downloader
                ## Downloads a page's source and prints it.
  genspider     Generate new spider using pre-defined templates
                ## Creates a new spider file.
  runspider     Run a self-contained spider (without creating a project)
                ## Unlike launching a spider with crawl, this runs a standalone spider file: scrapy runspider <spider_file>
  settings      Get settings values
                ## Gets the current configuration values.
  shell         Interactive scraping console
                ## Enters Scrapy's interactive mode.
  startproject  Create new project
                ## Creates a crawler project.
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
                ## Downloads the page's document content and opens it in the browser.

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
Project commands
scrapy startproject projectname
Creates a project.
scrapy genspider spidername domain
Creates a spider. After creating the project, you still need to create spiders inside it.
scrapy crawl spidername
Runs a spider. Note that this command must be run from inside the project directory.