Python Web Scraping Starter Tutorial: Batch-Downloading HD Watermark-Free Videos from Kuaishou


Preface

This case study uses Python to scrape HD, watermark-free videos from the Kuaishou short-video platform.

Key topics:

  • requests
  • json
  • re
  • pprint

Development environment:

  • Version: Anaconda 5.2.0 (Python 3.6.5)
  • Editor: PyCharm

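Implementation steps:

  1. Find the target URL
  2. Send the request (GET or POST)
  3. Parse the data (video URL and video title)
  4. Send a request to each video's URL
  5. Save the video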
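The five steps above map onto a small pipeline. What follows is a minimal sketch rather than the tutorial's own code: the function name `crawl` and its callable parameters `fetch_page`, `fetch_video`, and `save` are hypothetical, introduced only so the control flow can be read (and tried with stand-ins) without touching the network. The JSON path it walks is the one used later in this tutorial.

```python
def crawl(keyword, pages, fetch_page, fetch_video, save):
    """Run the five steps for each page and return how many videos were saved.

    The three callables are hypothetical placeholders for this sketch:
      fetch_page(keyword, page) -> dict   steps 1-2: request one page of results
      fetch_video(url) -> bytes           step 4: request one video's URL
      save(title, content) -> None        step 5: write the bytes to disk
    """
    saved = 0
    for page in pages:
        json_data = fetch_page(keyword, page)
        # Step 3: pull the title and video URL out of each feed entry
        for feed in json_data['data']['visionSearchPhoto']['feeds']:
            title = feed['photo']['caption']
            url = feed['photo']['photoUrl']
            save(title, fetch_video(url))
            saved += 1
    return saved
```

Passing the network and disk operations in as arguments is one possible design; it keeps the loop logic separate from the parts that can fail for external reasons.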

Implementing the code

1. Import the modules

    import requests
    import pprint
    import json
    import re

2. Request the data

    # keyword (the search term) and page (the page index) are not defined in
    # the original listing; the two values below are placeholders.
    keyword = input('Enter a search keyword: ')
    page = 0

    headers = {
        # Content type of the request body. application/json means the body is
        # sent as JSON, the format the browser uses to talk to the Kuaishou
        # server. The default would be application/x-www-form-urlencoded.
        'content-type': 'application/json',
        # Cookie: identifies the user and whether they are logged in
        'Cookie': 'did=web_53827e0b098c608bc6f42524b1f3211a; didv=1617281516668; kpf=PC_WEB; kpn=KUAISHOU_VISION; clientid=3',
        # User-Agent: browser information (used to pose as a browser)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    }

    # GraphQL query text (whitespace between fields is not significant)
    query = """query visionSearchPhoto($keyword: String, $pcursor: String, $searchSessionId: String, $page: String, $webPageArea: String) {
      visionSearchPhoto(keyword: $keyword, pcursor: $pcursor, searchSessionId: $searchSessionId, page: $page, webPageArea: $webPageArea) {
        result
        llsid
        webPageArea
        feeds {
          type
          author { id name following headerUrl headerUrls { cdn url __typename } __typename }
          tags { type name __typename }
          photo {
            id duration caption likeCount realLikeCount coverUrl photoUrl liked timestamp expTag
            coverUrls { cdn url __typename }
            photoUrls { cdn url __typename }
            animatedCoverUrl stereoType videoRatio __typename
          }
          canAddComment currentPcursor llsid status __typename
        }
        searchSessionId
        pcursor
        aladdinBanner { imgUrl link __typename }
        __typename
      }
    }
    """

    data = {
        'operationName': "visionSearchPhoto",
        'query': query,
        'variables': {
            'keyword': keyword,
            'pcursor': str(page),
            'page': "search",
        },
    }

    # Send the request. The body is serialized with json.dumps so that it
    # matches the application/json content type declared in the headers.
    response = requests.post('https://www.kuaishou.com/graphql', headers=headers, data=json.dumps(data))
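Because the headers declare `application/json`, the request body has to be a JSON string rather than a form-encoded one. The sketch below illustrates the difference using only the standard library. `build_payload` is a hypothetical helper, the keyword `'cats'` is made up, and the query text is replaced by a placeholder.

```python
import json
from urllib.parse import urlencode


def build_payload(keyword, page, query='<GraphQL query text>'):
    """Hypothetical helper: assemble the request body for one results page."""
    return {
        'operationName': 'visionSearchPhoto',
        'query': query,
        'variables': {'keyword': keyword, 'pcursor': str(page), 'page': 'search'},
    }


payload = build_payload('cats', 3)

# JSON body: the nested structure survives and decodes back unchanged.
json_body = json.dumps(payload)
assert json.loads(json_body) == payload

# Form encoding flattens everything to key=value text; the nested
# 'variables' dict is reduced to the quoted string form of a Python dict,
# which a server expecting JSON cannot read.
form_body = urlencode(payload)

print(json_body)
print(form_body)
```

With `requests`, the JSON body can be sent either as `data=json.dumps(payload)` or through the `json=payload` argument of `requests.post`.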

3. Parse the data

    for page in range(0, 11):
        print(f'-----------------------Scraping page {page + 1}----------------------')
        # Re-send the step 2 request for this page so that each iteration
        # parses fresh results instead of the same response
        data['variables']['pcursor'] = str(page)
        response = requests.post('https://www.kuaishou.com/graphql', headers=headers, data=json.dumps(data))
        json_data = response.json()
        data_list = json_data['data']['visionSearchPhoto']['feeds']
        # The loop variable is named feed so it does not overwrite the
        # request payload stored in data
        for feed in data_list:
            title = feed['photo']['caption']
            url_1 = feed['photo']['photoUrl']
            # Replace characters that are not allowed in file names
            new_title = re.sub(r'[\\/:*?"<>|\n]', '_', title)
            # print(title, url_1)
            # .content holds the response body as raw bytes. Use .text for
            # text; images, video, and audio are binary data.
            content = requests.get(url_1).content
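The parsing step indexes straight into the response and fails with `KeyError` or `TypeError` if a key is missing or a value is null. A more defensive variant might look like the sketch below. Both `safe_title` and `extract_videos` are hypothetical helpers, and the fallback name and length limit are arbitrary choices for illustration.

```python
import re

# Characters that Windows does not allow in file names, plus line breaks.
_INVALID = re.compile(r'[\\/:*?"<>|\r\n]')


def safe_title(title, fallback='untitled', max_len=80):
    """Hypothetical helper: turn a caption into a usable file name stem."""
    cleaned = _INVALID.sub('_', title or '')[:max_len].strip(' ._')
    return cleaned or fallback


def extract_videos(json_data):
    """Hypothetical helper: return (file name stem, video URL) pairs.

    Uses .get() so that a response without the expected keys yields an
    empty list instead of raising an exception.
    """
    search = (json_data.get('data') or {}).get('visionSearchPhoto') or {}
    pairs = []
    for feed in search.get('feeds') or []:
        photo = feed.get('photo') or {}
        url = photo.get('photoUrl')
        if url:  # skip entries without a downloadable URL
            pairs.append((safe_title(photo.get('caption')), url))
    return pairs
```

Returning an empty list for an unexpected response lets the page loop continue with the next page rather than stopping the whole run.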

4. Save the data

    # Runs inside the inner loop, once per video.
    # The ./video/ folder must already exist.
    with open('./video/' + new_title + '.mp4', mode='wb') as f:
        f.write(content)
    print(new_title, 'downloaded successfully!')
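`open()` does not create missing folders, so the save step raises `FileNotFoundError` if `./video/` does not exist, and two videos with the same caption overwrite each other. One way to handle both cases is sketched below. `save_video` is a hypothetical helper, and the numeric-suffix scheme is just one possible naming choice.

```python
from pathlib import Path


def save_video(content, title, folder='video'):
    """Hypothetical helper: write video bytes to <folder>/<title>.mp4.

    Creates the folder if needed and adds a numeric suffix instead of
    overwriting an existing file with the same title.
    """
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f'{title}.mp4'
    counter = 1
    while path.exists():
        path = target_dir / f'{title}_{counter}.mp4'
        counter += 1
    path.write_bytes(content)
    return path
```

Inside the inner loop this would be called as `save_video(content, new_title)`, using the variables from the parsing step.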

Written by marketer
