A Case Study on Pages Whose Source Contains No Data (Ajax Dynamic Loading / JS)
不知名菜鸟 (Official Account)
Loves sharing, loves life, loves work.
1. Result Preview
2. Page Analysis
https://pvp.qq.com/web201605/wallpaper.shtml
Under the HD wallpaper section we can see the skin wallpaper entries. Open the page source and check whether the image download URLs can be found there.
As expected, the image download URLs cannot be obtained directly from the page source, so we will have to extract them with a packet-capture tool instead.
Before reaching straight for XPath, bs4, or Selenium, let's pause and consider other options, such as a JSON response or an Ajax request. When an endpoint returns JSON directly, the data is usually much easier to get at, as long as it is not too heavily obfuscated.
Open the Network panel, reload the page, and look for a request whose response contains strings like "sProdImgDown sProdImgL8".
Unlike a plain JSON response, this one carries the prefix "jQuery17104394415749553038_1618321207478" (a JSONP callback). Checking the Headers and the Request URL confirms that a jQueryxxxx callback parameter has been added; removing it yields plain JSON. Closer inspection also reveals how many images each JSON page contains and how the pages are numbered. With these findings we can construct the request URL ourselves.
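Besides removing the callback parameter from the URL, the jQueryxxxx wrapper can also be stripped from the response text itself. A minimal sketch, using a made-up payload shaped like this endpoint's response:

```python
import json

# Hypothetical JSONP response: the JSON payload is wrapped in a callback,
# exactly as the page's jQueryxxxx prefix wraps the real data.
jsonp_text = 'jQuery17104394415749553038_1618321207478({"List": [{"sProdName": "demo"}]})'

# Recover the pure JSON between the first "(" and the last ")"
payload = jsonp_text[jsonp_text.index('(') + 1 : jsonp_text.rindex(')')]
data = json.loads(payload)
print(data['List'][0]['sProdName'])  # demo
```

This works for any JSONP response regardless of the callback name, which is handy because the numeric suffix changes on every page load.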
request_url = f"https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page={page_num}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991&_everyRead=true&iTypeId=1&iFlowId=267733&iActId=2735&iModuleId=2735&_=1618305808724"
image_url_list = get_html(request_url).json()['List']
for url_dic in image_url_list:
    image_url = parse.unquote(url_dic['sProdImgNo_8']).split('/200')[0] + '/0'
    url_list.append(image_url)
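To see what the unquote/split expression does, here it is applied to a hypothetical percent-encoded value shaped like sProdImgNo_8 (the host and path are invented for illustration):

```python
from urllib import parse

# Hypothetical value in the shape of sProdImgNo_8: percent-encoded, ending in
# /200 (a thumbnail size); replacing it with /0 requests the original image.
encoded = 'http%3A%2F%2Fexample.com%2Fimages%2Fskin_8%2F200'
image_url = parse.unquote(encoded).split('/200')[0] + '/0'
print(image_url)  # http://example.com/images/skin_8/0
```

So the transform first decodes the %-escapes back into a normal URL, then swaps the thumbnail size suffix for the full-resolution one.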
3. Project Source Code
from urllib import parse
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
import os

import requests

# Directory the downloaded skin images are saved into
IMAGE_PATH = os.path.join(os.path.dirname(__file__), '哎哟!这是王者荣耀的皮肤啊!')


def get_html(url):
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    try:
        res = requests.get(url, headers=header)
        if res.status_code == 200:
            return res
    except requests.RequestException:  # the built-in ConnectionError does not cover requests' exceptions
        return None


def main(page_num=0):
    request_url = f"https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page={page_num}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991&_everyRead=true&iTypeId=1&iFlowId=267733&iActId=2735&iModuleId=2735&_=1618305808724"
    res = get_html(request_url)
    if res is None:
        return
    for url_dic in res.json()['List']:
        # sProdImgNo_8 is percent-encoded; /200 is a thumbnail size,
        # /0 requests the full-resolution image
        image_url = parse.unquote(url_dic['sProdImgNo_8']).split('/200')[0] + '/0'
        url_list.append(image_url)


def save(url):
    res = get_html(url)
    if res is None:
        return
    image_name = url.split('/')[5]  # use the sixth path segment as the file name
    image_path = os.path.join(IMAGE_PATH, image_name)
    try:
        with open(image_path, 'wb') as f:
            f.write(res.content)
        print(image_name, 'downloaded!')
    except OSError:
        print(image_name, 'failed to save!')


if __name__ == '__main__':
    while True:
        url_list = []
        if not os.path.exists(IMAGE_PATH):
            os.mkdir(IMAGE_PATH)
        num = input("How many skins do you want? >>> ").strip()
        if not num.isdigit():
            print("Please enter a number!")
            continue
        num = int(num)
        page = divmod(num, 20)[0]
        count = 0
        with ThreadPoolExecutor(50) as t:  # 50 threads collect the image URLs
            for page_num in range(page + 1):
                t.submit(main, page_num)
        # the with-block above only exits after every thread has finished,
        # so url_list is fully populated before the downloads start
        with ProcessPoolExecutor(4) as p:
            for url in url_list:
                p.submit(save, url)
                count += 1
                if count >= num:
                    break
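One detail worth noting in the main block: `divmod(num, 20)[0]` plus the `range(page + 1)` loop requests one page more than necessary whenever num is an exact multiple of 20 (the `count >= num` check then discards the surplus URLs). Since each JSON response holds at most 20 wallpapers (iListNum=20), a ceiling division expresses the page count directly. A small sketch (`pages_needed` is a hypothetical helper, not part of the script above):

```python
import math

# Each JSON page returns at most 20 wallpapers, so the number of pages
# needed for `num` images is the ceiling of num / 20.
def pages_needed(num, per_page=20):
    return math.ceil(num / per_page)

print(pages_needed(1), pages_needed(20), pages_needed(21))  # 1 1 2
```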
***************************************************************
End