Python Case 01 - [Web Scraping Series]
Problem
Scrape the complete details of the Maoyan Movies TOP100.
URL:
https://www.maoyan.com/board/4?timeStamp=1649210279544&channelId=40011&index=8&signKey=1730d1a56d7142ccf1ffbb0f9b721ff6&sVersion=1&webdriver=false&offset=0
Answer
Scraping analysis
A write-up on working around Maoyan's anti-scraping measures: https://cloud.tencent.com/developer/article/1813581
1. Open the Maoyan movie page and inspect what the server returns
Grab the User-Agent and Cookie from the browser's request, then test whether the page content comes back with them.
Note that this Cookie is only valid for a short time; once it expires, requests are redirected to the verification center again.
import requests

def link(url):
    # Headers copied from the browser's request; the Cookie expires quickly,
    # so paste in a fresh value before running (the value below is truncated).
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36 Edg/100.0.1185.29",
        "Cookie": "__mta=217298601.XXXXXXX"
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def main():
    url = 'https://www.maoyan.com/board/4?timeStamp=1649302074514&channelId=40011&index=6&signKey=0b672485eadff613ec0174c46940372a&sVersion=1&webdriver=false&offset=0'
    html = link(url)
    print(html)

main()
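Since an expired Cookie bounces the request to the verification center, it helps to detect that case instead of silently scraping the wrong page. A minimal sketch, assuming the verification page can be recognized by the text 验证中心 in the returned HTML (an assumption; check the real redirect page for its exact markers):

# Hedged sketch: treat a missing body or verification-page text as an
# expired Cookie. "验证中心" (verification center) is an assumed marker.
def looks_like_verification_page(html):
    return html is None or "验证中心" in html

html = link('https://www.maoyan.com/board/4')
if looks_like_verification_page(html):
    print("Cookie expired - refresh it from the browser and retry")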
2. Extract the data
import re
import time as ti

import requests as req
from bs4 import BeautifulSoup as bs

def link(url):
    # Same approach as above: browser headers plus a (short-lived) Cookie.
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
        "cookie": "__mta=151852934.XXXXXX"
    }
    res = req.get(url, headers=header)
    if res.status_code == 200:
        return bs(res.text, "lxml")
    return None

# The board is paginated 10 movies per page via the offset query parameter.
for offset in range(0, 100, 10):
    url = "https://maoyan.com/board/4?offset=" + str(offset)
    movies = link(url).find_all("dd")
    for dd in movies:
        img = dd.find("img", class_="board-img").get("data-src")
        num = dd.find("i").text
        name = dd.find("a").get("title")
        # The regexes match the page's Chinese labels 主演: (starring) and
        # 上映时间: (release date), so those literals must stay in Chinese.
        actor = re.findall("主演:(.*)", dd.find("p", class_="star").text)[0]
        when = re.findall("上映时间:(.*)", dd.find("p", class_="releasetime").text)[0]
        score = dd.find("i", class_="integer").text + dd.find("i", class_="fraction").text
        url1 = "https://maoyan.com" + dd.find("p", class_="name").a.get("href")
        movie = link(url1)  # fetch the movie's detail page
        ti.sleep(1)         # throttle requests so the anti-bot check is less likely to trip
        about = movie.find("span", class_="dra").text
        word = movie.find("span", class_="name").text + ": " + movie.find("div", class_="comment-content").text.replace("?", "")
        boss = movie.find("a", class_="name").text.replace("\n", "").replace(" ", "")
        a = {
            "title": name,
            "rank": num,
            "score": score,
            "url": url1,
            "actors": actor,
            "release date": when,
            "poster": img,
            "comment": word,
            "director": boss,
            "synopsis": about,
        }
        print(a)  # see the persistence sketch below for writing this to disk
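The loop above assembles one dict per movie but never writes it anywhere. A minimal persistence sketch, placed inside the inner loop right after the dict is built, assuming a JSON Lines file named maoyan_top100.jsonl is acceptable (the filename is arbitrary):

import json

# Append each movie record as one JSON line; ensure_ascii=False keeps
# any Chinese text readable in the output file.
with open("maoyan_top100.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(a, ensure_ascii=False) + "\n")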
3. Extension
The Cookie's lifetime is far too short (it appears to last only about a minute), and copying it out of the browser by hand defeats the point of an automated scraper.
First, sketch the whole interaction:
the client sends a request -> the server responds and sets a Cookie -> the client adds the Cookie to its request headers and resends the request -> the server fulfills the request.
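For ordinary sites this round trip is exactly what requests.Session automates: Set-Cookie values from a response are stored and replayed on later requests. A minimal sketch of that (Maoyan's verification center defeats it, because a usable Cookie is only issued to a real browser):

import requests

s = requests.Session()
# First request: any Set-Cookie headers from the server land in s.cookies.
s.get('https://www.maoyan.com/board/4')
print(s.cookies.get_dict())
# Second request: the stored cookies are sent back automatically.
r = s.get('https://www.maoyan.com/board/4')
print(r.status_code)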
That is why we borrow Python's selenium library:
it drives a real browser to load the page, and all we need from it is the Cookie, which we then reuse to request the data ourselves.
Note: in practice, even consecutive requests made through selenium were still flagged as a bot.
(1) Install geckodriver.exe
After downloading and extracting it, copy geckodriver.exe into the Firefox installation directory, e.g. C:\Program Files\Mozilla Firefox, and add that path to the Path environment variable.
(2) Download chromedriver
A browser driver is required; since Chrome is installed on this machine, chromedriver is the one downloaded here.
Reference: https://blog.csdn.net/weixin_44335092/article/details/109054128
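Before wiring up selenium, it is worth confirming the driver really is reachable on Path; a small sanity check, assuming the executables keep their default names (chromedriver/geckodriver, with .exe implied on Windows):

import shutil

# shutil.which searches the Path environment variable, mirroring how the
# system locates the driver when selenium launches it.
for exe in ("chromedriver", "geckodriver"):
    path = shutil.which(exe)
    print(exe, "->", path if path else "not found on Path")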
(3) The code
# Borrow selenium: let it open a browser and load the page; we only need
# the Cookies it receives, which we then replay in a plain request.
from selenium import webdriver

# browser = webdriver.Firefox()
browser = webdriver.Chrome()
browser.get('https://www.maoyan.com/board/4')
cookies = browser.get_cookies()
print(cookies)

# Flatten the cookie list into a single "name=value;..." header string.
strr = ''
for c in cookies:
    strr += c['name'] + '=' + c['value'] + ';'
headers = {'Cookie': strr}
print(headers)
# r2 = requests.get('https://www.maoyan.com/board/4', headers=headers)
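To close the loop, the harvested header can go straight back into a plain requests call (the commented-out line above). A hedged sketch, noting that the verification center may still reject the request, per the earlier note:

import requests

# Reuse the selenium-harvested Cookie header; match the User-Agent to the
# browser that earned the Cookie so the two requests look consistent.
headers['User-Agent'] = browser.execute_script("return navigator.userAgent")
r2 = requests.get('https://www.maoyan.com/board/4', headers=headers)
print(r2.status_code)
browser.quit()  # release the browser once the Cookie has been captured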