
Python Case 01 - [Crawler Series]

Problem

Scrape all the information for the Maoyan Movie Top 100 list

URL:

https://www.maoyan.com/board/4?timeStamp=1649210279544&channelId=40011&index=8&signKey=1730d1a56d7142ccf1ffbb0f9b721ff6&sVersion=1&webdriver=false&offset=0


Answer

Scraping analysis

A crawler that works around Maoyan's anti-crawler measures: https://cloud.tencent.com/developer/article/1813581

1. Open the Maoyan movie page and inspect what the request returns

Copy the User-Agent and Cookie from the browser, then test whether the page data can be fetched with them.

However, this Cookie is only valid for a limited time; once it expires, requests are redirected to the verification center again.

import requests

def link(url):
  # User-Agent and Cookie copied from the browser's developer tools;
  # the Cookie expires quickly, after which requests are redirected to the verification center
  headers = {
      "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36 Edg/100.0.1185.29",
      "Cookie": "__mta=217298601.XXXXXXX"
  }

  response = requests.get(url, headers=headers)
  if response.status_code == 200:
      return response.text
  return None
def main():
  url = 'https://www.maoyan.com/board/4?timeStamp=1649302074514&channelId=40011&index=6&signKey=0b672485eadff613ec0174c46940372a&sVersion=1&webdriver=false&offset=0'
  html = link(url)
  print(html)

main()
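
Because the copied Cookie expires quickly, it helps to check whether the returned HTML is really the Top 100 board or the verification page. A minimal sketch that reuses link() from the snippet above and assumes the real board page contains the "board-img" class parsed in step 2:

def is_board_page(html):
    # Heuristic marker, not an official flag: the board page contains "board-img" images,
    # the verification page does not.
    return html is not None and "board-img" in html

page = link('https://www.maoyan.com/board/4')
if is_board_page(page):
    print("Got the real Top 100 board page")
else:
    print("Likely redirected to the verification center - copy a fresh Cookie")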

2. Extract the data

import requests as req
import re
from bs4 import BeautifulSoup as bs
import time as ti

def link(url):
  header = {
      "User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
      "cookie" : "__mta=151852934.XXXXXX"
  }
  res = req.get(url,headers = header)
  if res.status_code == 200:
      return bs(res.text,"lxml")
  return None

# The board shows 10 films per page; "offset" pages through the Top 100
for i in range(0, 100, 10):
  url = "https://maoyan.com/board/4?offset=" + str(i)
  movies = link(url).find_all("dd")          # one <dd> per film
  for item in movies:                        # renamed from "i" so it no longer shadows the outer loop variable
      img = item.find("img", class_="board-img").get("data-src")
      num = item.find("i").text                                    # ranking
      name = item.find("a").get("title")                           # film title
      actor = re.findall("主演:(.*)", item.find("p", class_="star").text)[0]
      when = re.findall("上映时间:(.*)", item.find("p", class_="releasetime").text)[0]
      score = item.find("i", class_="integer").text + item.find("i", class_="fraction").text
      url1 = "https://maoyan.com" + item.find("p", class_="name").a.get("href")
      movie = link(url1)                     # fetch the film's detail page
      ti.sleep(1)                            # throttle so the requests look less like a crawler
      about = movie.find("span", class_="dra").text
      word = movie.find("span", class_="name").text + ": " + movie.find("div", class_="comment-content").text.replace("?", "")
      boss = movie.find("a", class_="name").text.replace("\n", "").replace(" ", "")

      a = {
          "片名": name,
          "排名": num,
          "评分": score,
          "网址": url1,
          "演员": actor,
          "上映时间": when,
          "图片": img,
          "评论": word,
          "导演": boss,
          "简介": about
      }
      print(a)                               # one record per film

3. Extension

The Cookie's lifetime is very short, probably only about a minute, and copying it from the browser by hand does not meet the goal of an automated crawler.

First, sketch the whole interaction process:

The client sends a request -> the server responds and returns a Cookie -> the client adds the Cookie to the request headers and resends the request -> the server serves the request.
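
In plain requests terms, this round trip looks like the sketch below: requests.Session keeps a cookie jar, stores whatever Set-Cookie headers the server returns, and resends them automatically on the next request. On its own this is not enough for Maoyan, whose verification cookie is set by JavaScript in the browser, which is why selenium is brought in below.

import requests

session = requests.Session()                   # keeps a cookie jar across requests

# first request: the server may answer with Set-Cookie headers
first = session.get('https://www.maoyan.com/board/4')
print(session.cookies.get_dict())              # cookies handed back by the server

# second request: the stored cookies are attached automatically
second = session.get('https://www.maoyan.com/board/4')
print(second.status_code)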

Use Python's selenium library.

It opens a real browser to visit the page, and we only need the Cookie it collects in order to re-request the data.

Note: in practice, even when using selenium repeatedly, the requests were still identified as coming from a crawler.

(1) Install geckodriver.exe

After downloading and extracting it, copy geckodriver.exe into the Firefox installation directory, e.g. (C:\Program Files\Mozilla Firefox), and add that path to the Path environment variable.

(2) Download chromedriver

A browser driver is required; since Chrome is installed on this machine, chromedriver was downloaded.

Reference: https://blog.csdn.net/weixin_44335092/article/details/109054128
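
If you prefer not to edit the Path variable, the driver location can also be passed explicitly when the browser is started. A minimal sketch, assuming Selenium 4 and placeholder driver paths (with Selenium 3 the older webdriver.Chrome(executable_path=...) form is used instead):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.firefox.service import Service as FirefoxService

# the paths below are placeholders for wherever the drivers were unpacked
browser = webdriver.Chrome(service=ChromeService(r"C:\drivers\chromedriver.exe"))
# browser = webdriver.Firefox(service=FirefoxService(r"C:\drivers\geckodriver.exe"))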

(3) The code is as follows
# Use Python's selenium library: it opens a browser to visit the page, and we only need
# the Cookie it obtains to re-request the data. The code is as follows:
from selenium import webdriver


# browser = webdriver.Firefox()

browser = webdriver.Chrome()
browser.get('https://www.maoyan.com/board/4')

# Collect the cookies the selenium session obtained and join them into a
# single "name=value;name=value;..." string for the Cookie request header
Cookie = browser.get_cookies()
strr = ''
print(Cookie)
for c in Cookie:
  strr += c['name']
  strr += '='
  strr += c['value']
  strr += ';'

headers = {'Cookie': strr}
print(headers)
# r2 = requests.get('https://www.maoyan.com/board/4', headers=headers)
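
To finish the flow sketched earlier, the assembled Cookie string can be sent back with requests, reusing the User-Agent from step 1. A minimal sketch, under the assumption that Maoyan accepts the selenium cookies (as noted above, it may still flag the client as a crawler):

import requests
from bs4 import BeautifulSoup

browser.quit()                                 # the selenium browser is no longer needed once the cookies are harvested

headers = {
    'Cookie': strr,                            # cookie string assembled above
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36'
}
r2 = requests.get('https://www.maoyan.com/board/4', headers=headers)
soup = BeautifulSoup(r2.text, 'lxml')
print(len(soup.find_all('dd')))                # 10 <dd> entries if the real board page came back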