A Case Study: Handling Pages Whose Data Is Missing from the Page Source (Ajax Dynamic Loading, JS)
不知名菜鸟
Loves sharing, loves life, loves work.
1. Result Preview
2. Page Analysis
https://pvp.qq.com/web201605/wallpaper.shtml
Under the HD Wallpapers section we can see the skin wallpaper listings. Open the page source and check whether the image download URLs can be found there.
As expected, the image download URLs cannot be extracted directly from the page source. That leaves us with the browser's developer tools to capture the network traffic instead.
Before reaching for xpath, bs4, or Selenium to parse the rendered page, it is worth checking for alternatives such as a JSON API or Ajax-loaded data. When a site returns plain JSON, and the response is not heavily obfuscated or encrypted, the data is far easier to extract than by scraping HTML.
Open the Network tab, reload the page, and search the captured requests for a response containing strings such as "sProdImgDown sProdImgL8".
Unlike an ordinary JSON response, this one is wrapped in a "jQuery17104394415749553038_1618321207478" prefix, i.e. a JSONP callback. Checking the Headers tab and analyzing the Request URL, we can see a jQueryxxxx callback parameter has been added; removing it makes the server return plain JSON. Closer inspection also reveals how many images each JSON response carries and how the pages are numbered. With these findings we can construct the request URL ourselves.
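If you keep the callback parameter in the request, you can also strip the JSONP wrapper client-side. Here is a minimal sketch; the callback name and payload below are made-up stand-ins in the same shape as the real response:

```python
import json
import re


def strip_jsonp(text):
    """Return the JSON payload inside a callback(...) JSONP wrapper."""
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
    return match.group(1) if match else text


# Hypothetical response body imitating the wallpaper API's JSONP format
raw = 'jQuery17104394415749553038_1618321207478({"iTotalPages": 26, "List": []});'
data = json.loads(strip_jsonp(raw))
print(data['iTotalPages'])  # -> 26
```

The article's approach is simpler still: drop the callback parameter from the URL entirely, so the server responds with plain JSON and no stripping is needed.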
```python
request_url = (
    "https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi"
    "?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0"
    f"&page={page_num}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991"
    "&_everyRead=true&iTypeId=1&iFlowId=267733&iActId=2735&iModuleId=2735"
    "&_=1618305808724"
)
image_url_list = get_html(request_url).json()['List']
for url_dic in image_url_list:
    # sProdImgNo_8 is percent-encoded; /200 is a thumbnail size, /0 the original
    image_url = parse.unquote(url_dic['sProdImgNo_8']).split('/200')[0] + '/0'
    url_list.append(image_url)
```
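The sProdImgNo_8 transformation can be checked in isolation. The encoded value below is a made-up example in the same shape as the API's real field (a percent-encoded URL ending in the /200 thumbnail suffix):

```python
from urllib import parse

# Hypothetical percent-encoded field value (the real API returns similar strings)
encoded = 'https%3A%2F%2Fshp.qpic.cn%2Fishow%2F2735041612%2Fexample-skin%2F200'
# Decode, drop the /200 thumbnail suffix, and request size /0 (the original)
full_size = parse.unquote(encoded).split('/200')[0] + '/0'
print(full_size)  # -> https://shp.qpic.cn/ishow/2735041612/example-skin/0
```

Note that `split('/200')` assumes "/200" occurs only as the trailing size suffix; if "/200" could appear earlier in the path, `rsplit('/', 1)` on the decoded URL would be a safer way to replace the last segment.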
3. Project Source Code
```python
from urllib import parse
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import os

import requests

IMAGE_PATH = os.path.join(os.path.dirname(__file__), '哎哟!这是王者荣耀的皮肤啊!')


def get_html(url):
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/89.0.4389.114 Safari/537.36"}
    try:
        res = requests.get(url, headers=header)
        res.encoding = res.apparent_encoding  # auto-detect the response encoding
        res.close()
        if res.status_code == 200:
            return res
    except requests.RequestException:  # the builtin ConnectionError would miss requests' own errors
        return None


def main(page_num=0):
    request_url = (
        "https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi"
        "?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0"
        f"&page={page_num}&iOrder=0&iSortNumClose=1&iAMSActivityId=51991"
        "&_everyRead=true&iTypeId=1&iFlowId=267733&iActId=2735&iModuleId=2735"
        "&_=1618305808724"
    )
    image_url_list = get_html(request_url).json()['List']
    for url_dic in image_url_list:
        # sProdImgNo_8 is percent-encoded; swap the /200 thumbnail suffix for /0 (full size)
        image_url = parse.unquote(url_dic['sProdImgNo_8']).split('/200')[0] + '/0'
        url_list.append(image_url)


def save(url):
    try:
        content = get_html(url).content
        image_name = url.split('/')[5]
        image_path = os.path.join(IMAGE_PATH, image_name)
        try:
            with open(image_path, 'wb') as f:
                f.write(content)
            print(image_name, 'downloaded!')
        except OSError:
            print(image_name, 'download failed!')
    except AttributeError:  # get_html returned None
        return


if __name__ == '__main__':
    while True:
        url_list = []  # shared across threads; filled by main()
        if not os.path.exists(IMAGE_PATH):
            os.mkdir(IMAGE_PATH)
        num = input("How many skins do you want? >>> ").strip()
        if not num.isdigit():
            print("Please enter a number!")
            continue
        num = int(num)
        page = divmod(num, 20)[0]  # 20 images per JSON page
        count = 0
        with ProcessPoolExecutor(4) as p:
            with ThreadPoolExecutor(50) as t:  # 50 threads collect the image URLs
                for page_num in range(page + 1):
                    t.submit(main, page_num)
            # leaving the ThreadPoolExecutor block waits for all URL collectors,
            # so url_list is fully populated before the downloads start
            for url in url_list:
                p.submit(save, url)
                count += 1
                if count >= num:
                    break
```
***************************************************************
End
