A selenium-based Ctrip flight-ticket scraper (bypassing anti-scraping measures)

草莓牛奶
2022-06-24 / 2 comments / 1 like / 1,079 views / 2,999 characters
Note:
"Posts on this blog are updated promptly when they go out of date; unless stated otherwise they remain valid. Corrections are welcome."

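1. Calling Ctrip's API endpoints directly with requests means dealing not only with IP restrictions but also with reverse-engineering the site's JavaScript, which is very costly.

2. The program below is based on selenium: it captures the results of the page's own API requests and uses a property of the page to get around the IP restriction.

3. To cope with the various problems that can arise and keep running 24/7, the program makes heavy use of nesting and try/except.

4. To keep execution reliable, the program adds sleep calls, and its efficiency has not been optimized yet (10,000 routes a day from a single IP is not a problem, but heavier use is not recommended).

5. If you spot a possible optimization or a bug, please let me know!

!!!If you repost this, please credit the source. Thanks!!!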

The processed data is structured as follows:

image-20220624201931513

Lowest economy-class fare from Shanghai to Beijing, first half of 2022

image-20220624202325094

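I. Overall approach

1. How to switch cities

Changing the departure and destination cities inside the Ctrip page, through the fields marked in red below, has not triggered IP restrictions in extensive testing so far (operations such as refreshing the page frequently do).

image-20220624195838221

So switching the departure and destination cities through selenium is enough to avoid the IP bans that a large number of requests would otherwise cause.

2. How to get the data

Ctrip's flight data mainly comes from the https://flights.ctrip.com/international/search/api/search/batchSearch endpoint.

Requesting it directly with requests is difficult, but the Python library seleniumwire can capture the data of that request.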

image-20220624200732870

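II. Problems to solve

1. Travel notices during the pandemic

The travel notice does not affect data capture, but it gets in the way of switching cities afterwards, so the program removes it directly with JavaScript:

driver.execute_script("$('.notice-box').remove();")

image-20220624195719251

2. IP restrictions

After many refreshes or similar operations, the two verification prompts below appear in turn. They usually clear on their own after 1 to 2 hours (if you have several IPs, you can simply switch to another one).

The program checks whether the verification prompt is present with driver.find_element(By.CLASS_NAME,"basic-alert.alert-giftinfo").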

image-20220624201218476

image-20220624201229906
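The wait-and-retry handling for the verification prompt can also be written as a bounded loop instead of a recursive call. The following is a minimal sketch, not the program's actual code: `wait_until_clear`, `is_blocked`, `max_waits`, and the injectable `sleep` parameter are hypothetical names introduced here for illustration, and the 2-hour default simply mirrors the wait described above.

```python
import time


def wait_until_clear(is_blocked, wait_seconds=7200, max_waits=12, sleep=time.sleep):
    """Wait while is_blocked() returns True; return how many waits were needed.

    is_blocked is a hypothetical callable, for example a function that reloads
    the page and reports whether the verification prompt is still present.
    """
    waits = 0
    while is_blocked():
        if waits >= max_waits:
            # give up instead of waiting forever
            raise RuntimeError('still blocked after %d waits' % waits)
        sleep(wait_seconds)
        waits += 1
    return waits
```

Passing `sleep` as a parameter makes the helper easy to try out without actually waiting two hours, and the `max_waits` cap keeps a permanently blocked IP from stalling the program indefinitely.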

III. Main program

import io
import os
import gzip
import time
import json
import random
import requests
import threading
import pandas as pd
from seleniumwire import webdriver
from datetime import datetime as dt,timedelta
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException,StaleElementReferenceException,ElementNotInteractableException,ElementClickInterceptedException # exceptions handled below

1. Getting the city codes

For example, 上海 (Shanghai) is SHA and 北京 (Beijing) is BJS.

def getcitycode():
    cityname,code=[],[]
    # use Ctrip's own API endpoint
    city_url='https://flights.ctrip.com/online/api/poi/get?v='+str(random.random())
    headers={
        'dnt':'1',
        'referer':'https://verify.ctrip.com/',
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'
        }
    r=requests.get(city_url,headers=headers)
    citys=json.loads(r.text).get('data')
    for city in citys:
        # skip the '热门' (popular cities) group
        if city =='热门':
            continue
        for key in city:
            try:
                for k in citys[city][key]:
                    cityname.append(k['display'])
                    code.append(k['data'])
            except Exception:
                continue
    citycode=dict(zip(cityname,code))
    
    return cityname,citycode
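To make the nested loops in getcitycode easier to follow, here is a minimal offline sketch of the same flattening step. It assumes, based only on how the function indexes the response, that `data` maps group names to dicts of letter to entry lists, and that each entry has `display` and `data` keys. The helper name `flatten_city_codes` and the sample payload (including its code strings) are made up for illustration. Unlike the original, it iterates over the inner dict's values rather than over the characters of the group name.

```python
def flatten_city_codes(data):
    """Map each display name to its raw code string, skipping the '热门' group."""
    mapping = {}
    for group, letters in data.items():
        # '热门' (popular cities) is skipped, as in getcitycode
        if group == '热门' or not isinstance(letters, dict):
            continue
        for entries in letters.values():
            for entry in entries:
                mapping[entry['display']] = entry['data']
    return mapping


# Hypothetical sample in the assumed shape; the code strings are invented
sample = {
    '热门': [{'display': '上海', 'data': 'x|SHA'}],
    'ABCDEF': {'B': [{'display': '北京', 'data': 'x|BJS'}]},
    'PQRSTUVW': {'S': [{'display': '上海', 'data': 'x|SHA'}]},
}
codes = flatten_city_codes(sample)
# the program later takes the last three characters as the city code
print(codes['上海'][-3:], codes['北京'][-3:])  # prints: SHA BJS
```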

2. Defining the FLIGHT class

class FLIGHT(object):
    def __init__(self):
        self.url = 'https://flights.ctrip.com/online/list/oneway' # Ctrip flight search page
        self.chromeDriverPath = 'C:/Program Files/Google/Chrome/Application/chromedriver' # path to chromedriver
        self.options = webdriver.ChromeOptions() # create an options object
        #self.options.add_argument('--incognito')  # incognito mode
        #self.options.add_argument('User-Agent=%s'%UserAgent().random) # replace the User-Agent
        self.options.add_argument("--disable-blink-features")
        self.options.add_argument("--disable-blink-features=AutomationControlled")
        self.options.add_experimental_option("excludeSwitches", ['enable-automation']) # hide the "controlled by automated test software" banner
        # note: executable_path was removed in Selenium 4.10; newer versions take the driver path through a Service object
        self.driver = webdriver.Chrome(executable_path=self.chromeDriverPath,options=self.options)
        self.driver.maximize_window()
        self.err=0 # error retry counter

(1). Loading the page

Main purpose: build a link such as https://flights.ctrip.com/online/list/oneway-SHA-BJS?&depdate=2022-10-01

    def getpage(self): 
        # look up the three-letter city codes
        self.startcode=self.citycode[self.city[0]][-3:]
        self.endcode=self.citycode[self.city[1]][-3:]
        
        # build the search URL
        flights_url=self.url+'-'+self.startcode+'-'+self.endcode+'?&depdate='+self.date    
        print(flights_url)
        # page-load timeout, in seconds
        self.driver.set_page_load_timeout(300)
        try:
            self.driver.get(flights_url)
        except Exception:
            print('Failed to load the page')
            # retry in the same window: closing the only window would end the
            # browser session and make the retry below fail
            time.sleep(5)
            self.getpage()
        else:
            try:
                # check whether the verification prompt is present
                self.driver.find_element(By.CLASS_NAME,"basic-alert.alert-giftinfo")
                print('Waiting 2 hours before retrying')
                time.sleep(7200)
                self.getpage()
            except Exception:
                # no verification prompt, go on to the next step
                self.remove_btn()
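The link construction in getpage can be checked on its own, without a browser. Below is a minimal sketch under that reading. `build_search_url` is a hypothetical helper name, and the date validation is an addition that the original program does not perform.

```python
from datetime import datetime


def build_search_url(start_code, end_code, date,
                     base='https://flights.ctrip.com/online/list/oneway'):
    """Build a one-way search link in the same form as getpage does."""
    # raises ValueError if date is not in YYYY-MM-DD form (added check)
    datetime.strptime(date, '%Y-%m-%d')
    return '{}-{}-{}?&depdate={}'.format(base, start_code, end_code, date)


url = build_search_url('SHA', 'BJS', '2022-10-01')
print(url)  # https://flights.ctrip.com/online/list/oneway-SHA-BJS?&depdate=2022-10-01
```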

(2). Removing the pandemic notice

As long as the page stays open, this only needs to run once.

    def remove_btn(self):
        try:
            js_remove="$('.notice-box').remove();"
            self.driver.execute_script(js_remove)
        except Exception as e:
            print('Failed to remove the pandemic notice',e)
        else:
            self.changecity()

(3). Switching departure and destination

The cities must not be switched too quickly, or the switch fails. The input values also cannot be assigned directly.

    def changecity(self):
        try:
            # locate the departure and destination input elements
            its=self.driver.find_elements(By.CLASS_NAME,'form-input-v3')
            
            # if the departure field does not show the target city, retype it
            while self.city[0] not in its[0].get_attribute('value'):    
                its[0].click()
                time.sleep(0.5)
                its[0].send_keys(Keys.CONTROL + 'a')
                time.sleep(0.5)
                its[0].send_keys(self.city[0])

            time.sleep(0.5)

            # if the destination field does not show the target city, retype it
            while self.city[1] not in its[1].get_attribute('value'):
                its[1].click()
                time.sleep(0.5)
                its[1].send_keys(Keys.CONTROL + 'a')
                time.sleep(0.5)
                its[1].send_keys(self.city[1])
            
            time.sleep(0.5)
            try:
                # click the low-price alert button, which has the same effect as pressing Enter and loads the new results
                self.driver.implicitly_wait(5) # seconds
                self.driver.find_elements(By.CLASS_NAME,'low-price-remind')[0].click()
            except IndexError as e:
                print('\nCity switch error: element not found',e)
                # fallback: press Enter
                its[1].send_keys(Keys.ENTER)
            
            print('\nCities switched',self.city[0]+'-'+self.city[1])
        except (ElementNotInteractableException,StaleElementReferenceException,ElementClickInterceptedException) as e:
            print('\nCity switch error: element error',e)
            self.err+=1
            if self.err<=5:
                # the original called self.click_btn(), which this listing never defines;
                # retrying changecity() is assumed to be the intent
                self.changecity()
            else:
                self.err=0
                del self.driver.requests
                self.getpage()
        except Exception as e:
            print('\nCity switch error',e)
            # discard the captured requests
            del self.driver.requests
            # start over from the beginning
            self.getpage()
        else:
            # no error, go on to the next step
            self.getdata()

(4). Getting the raw data

    def getdata(self):
        try:
            # wait until the batchSearch request has been captured
            self.predata = self.driver.wait_for_request('/international/search/api/search/batchSearch?.*', timeout=60)
        
            rb=dict(json.loads(self.predata.body).get('flightSegments')[0])
        
        except TimeoutException as e:
            print('\nFailed to get data',e)
            # discard the captured requests
            del self.driver.requests
            # start over from the beginning
            self.getpage()
        else:
            # check that the captured request is for the intended cities
            if rb['departureCityName'] == self.city[0] and rb['arrivalCityName'] == self.city[1]:
                print('Cities match')
                # discard the captured requests
                del self.driver.requests
                # no error, go on to the next step
                self.decode_data()
            else:
                # discard the captured requests
                del self.driver.requests
                # switch cities again
                self.changecity()

(5). Decoding the data

    def decode_data(self):
        try:
            # the response body is gzip-compressed JSON
            buf = io.BytesIO(self.predata.response.body)
            gf = gzip.GzipFile(fileobj = buf)
            self.dedata = gf.read().decode('UTF-8')
            self.dedata=json.loads(self.dedata)
        except Exception:
            print('Fetching the data again')
            self.getpage()
        else:
            # no error, go on to the next step
            self.check_data()
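decode_data treats the body as gzip-compressed and falls back to reloading the page on any failure. As a minimal sketch of a slightly more tolerant decoder, the hypothetical helper `decode_body` below checks for the gzip magic number first and parses the body as plain JSON otherwise. Whether Ctrip ever sends this response uncompressed is not something the original states, so treat that branch as an assumption.

```python
import gzip
import json


def decode_body(raw):
    """Parse a JSON response body that may or may not be gzip-compressed."""
    if raw[:2] == b'\x1f\x8b':  # gzip streams start with these two bytes
        raw = gzip.decompress(raw)
    return json.loads(raw.decode('utf-8'))


# Made-up payload that only mirrors the keys used in check_data
payload = {'data': {'flightItineraryList': []}}
plain = json.dumps(payload).encode('utf-8')
compressed = gzip.compress(plain)
print(decode_body(compressed) == decode_body(plain) == payload)  # True
```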

(6). Checking for direct flights

Iterate in reverse order and delete the connecting flights.

    def check_data(self):
        try:
            self.flightItineraryList=self.dedata['data']['flightItineraryList']
            # iterate in reverse so that pop() does not shift the items still to be visited
            for i in range(len(self.flightItineraryList)-1, -1, -1):
                if self.flightItineraryList[i]['flightSegments'][0]['transferCount'] !=0:
                    self.flightItineraryList.pop(i)
            if len(self.flightItineraryList):
                # direct flights exist, go on to the next step
                self.muti_process()
            else:
                print('No direct flights')
                return 0
        except Exception:
            print('No direct flights')
            return 0 
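The reverse-order deletion above can also be expressed as a filter that builds a new list. This is a minimal sketch rather than the program's own code: `direct_only` is a hypothetical helper, and the sample itineraries are invented and contain only the fields the check reads.

```python
def direct_only(itineraries):
    """Keep only itineraries whose first segment has no transfers."""
    return [it for it in itineraries
            if it['flightSegments'][0]['transferCount'] == 0]


# Invented sample data with the field names used in check_data
sample = [
    {'itineraryId': 'AA0001_x', 'flightSegments': [{'transferCount': 0}]},
    {'itineraryId': 'AA0002_x', 'flightSegments': [{'transferCount': 1}]},
    {'itineraryId': 'AA0003_x', 'flightSegments': [{'transferCount': 0}]},
]
direct = direct_only(sample)
print([it['itineraryId'] for it in direct])  # ['AA0001_x', 'AA0003_x']
```

Building a new list avoids mutating the list while walking over it, at the cost of a second list in memory.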

(7). Processing the flight data in two threads

    def muti_process(self):
        processes = []

        self.flights = pd.DataFrame()
        self.prices = pd.DataFrame()
        # thread 1: flight information
        processes.append(threading.Thread(target=self.proc_flightSegments))
        # thread 2: fare information
        processes.append(threading.Thread(target=self.proc_priceList))

        for pro in processes:
            pro.start()
        for pro in processes:
            pro.join()
        
        # no error, go on to the next step
        self.mergedata()

i. Processing flight information

    def proc_flightSegments(self):
        for flightlist in self.flightItineraryList:
            flightlist=flightlist['flightSegments'][0]['flightList']
            flightUnitList=dict(flightlist[0])

            # split 'YYYY-MM-DD HH:MM:SS' style values into date and time
            departureday=flightUnitList['departureDateTime'].split(' ')[0]
            departuretime=flightUnitList['departureDateTime'].split(' ')[1]
            
            arrivalday=flightUnitList['arrivalDateTime'].split(' ')[0]
            arrivaltime=flightUnitList['arrivalDateTime'].split(' ')[1]            
            
            # drop fields that are not needed
            dellist=['sequenceNo', 'marketAirlineCode',
             'departureProvinceId','departureCityId','departureCityCode','departureAirportShortName','departureTerminal',
             'arrivalProvinceId','arrivalCityId','arrivalCityCode','arrivalAirportShortName','arrivalTerminal',
             'transferDuration','stopList','leakedVisaTagSwitch','trafficType','highLightPlaneNo','mealType',
             'operateAirlineCode','arrivalDateTime','departureDateTime','operateFlightNo','operateAirlineName']
            for value in dellist:
                # a missing key is simply ignored
                flightUnitList.pop(value, None)
            
            # store date and time as separate fields
            flightUnitList.update({'departureday': departureday, 'departuretime': departuretime,
                                   'arrivalday': arrivalday, 'arrivaltime': arrivaltime}) 
            
            self.flights=pd.concat([self.flights,pd.DataFrame(flightUnitList,index=[0])],ignore_index=True)

ii. Processing fare information

The raw data returned by Ctrip contains only the discounted fares and discount rates, so the full fare has to be calculated.

Adjust this part as needed.

    def proc_priceList(self):
        for flightlist in self.flightItineraryList:
            flightNo=flightlist['itineraryId'].split('_')[0]
            priceList=flightlist['priceList']
            
            # economy-class fares and discount rates
            economy,economy_discount=[],[]
            # business-class fares and discount rates
            bussiness,bussiness_discount=[],[]
            
            for price in priceList:
                adultPrice=price['adultPrice']
                cabin=price['cabin']
                priceUnitList=dict(price['priceUnitList'][0]['flightSeatList'][0])
                discountRate=priceUnitList['discountRate']
                # economy class
                if cabin=='Y':
                    economy.append(adultPrice)
                    economy_discount.append(discountRate)
                # business class
                elif cabin=='C':
                    bussiness.append(adultPrice)
                    bussiness_discount.append(discountRate)
            
            if economy !=[]:
                try:
                    economy_origin=economy[economy_discount.index(1)]
                except:
                    economy_origin=int(max(economy)/max(economy_discount))
            
                if min(economy_discount) !=1:
                    economy_low=min(economy)
                    economy_cut=min(economy_discount)
                else:
                    economy_low=''
                    economy_cut=''
                
            else:
                economy_origin=''
                economy_low=''
                economy_cut=''
            

            if bussiness !=[]: 
                try:
                    bussiness_origin=bussiness[bussiness_discount.index(1)]
                except:
                    bussiness_origin=int(max(bussiness)/max(bussiness_discount))
            
                if min(bussiness_discount) !=1:
                    bussiness_low=min(bussiness)
                    bussiness_cut=min(bussiness_discount)
                else:
                    bussiness_low=''
                    bussiness_cut=''
                
            else:
                bussiness_origin=''
                bussiness_low=''
                bussiness_cut=''        
        
            price_info={'flightNo':flightNo,
                    'economy_origin':economy_origin,'economy_low':economy_low,'economy_cut':economy_cut,
                    'bussiness_origin':bussiness_origin,'bussiness_low':bussiness_low,'bussiness_cut':bussiness_cut}

            self.prices=pd.concat([self.prices,pd.DataFrame(price_info,index=[0])],ignore_index=True)
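The full-fare rule used for both cabins can be isolated into one small function. The sketch below restates that rule under the same assumptions as proc_priceList (a discount rate of 1 marks the undiscounted fare; otherwise the highest fare is divided by the highest rate). `full_fare` is a hypothetical helper name and the numbers are invented.

```python
def full_fare(prices, discounts):
    """Return the undiscounted fare for one cabin, or '' if there are no prices."""
    if not prices:
        return ''
    if 1 in discounts:
        # a fare sold at rate 1 is the full fare itself
        return prices[discounts.index(1)]
    # otherwise estimate it from the most expensive fare and its rate
    return int(max(prices) / max(discounts))


print(full_fare([1000, 500], [1, 0.5]))    # 1000
print(full_fare([500, 750], [0.5, 0.75]))  # 1000
print(repr(full_fare([], [])))             # ''
```

One caveat of the rule itself: int() truncates, so a quotient that lands just below a whole number because of floating-point rounding comes out one lower. Using round() instead is one possible alternative.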

(8). Merging and saving the data

    def mergedata(self):
        try:
            self.df = self.flights.merge(self.prices,on=['flightNo'])
            
            # '数据获取日期' is the date on which the data was collected
            self.df['数据获取日期']=dt.now().strftime('%Y-%m-%d')
            
            # rename the DataFrame columns
            # output column names (kept in Chinese), in their final order
            order=['数据获取日期','航班号','航空公司',
                   '出发日期','出发时间','到达日期','到达时间','飞行时长','出发国家','出发城市','出发机场','出发机场三字码',
                   '到达国家','到达城市','到达机场','到达机场三字码','飞机型号','飞机尺寸','飞机型号三字码',
                   '经济舱原价','经济舱最低价','经济舱折扣','商务舱原价','商务舱最低价','商务舱折扣',
                   '到达准点率','停留次数']
            
            # matching keys in the scraped data, in the same order
            origin=['数据获取日期','flightNo','marketAirlineName',
                    'departureday','departuretime','arrivalday','arrivaltime','duration',
                    'departureCountryName','departureCityName','departureAirportName','departureAirportCode',
                    'arrivalCountryName','arrivalCityName','arrivalAirportName','arrivalAirportCode',
                    'aircraftName','aircraftSize','aircraftCode',
                    'economy_origin','economy_low','economy_cut',
                    'bussiness_origin','bussiness_low','bussiness_cut',
                    'arrivalPunctuality','stopCount']
            
            columns=dict(zip(origin,order))

            self.df=self.df.rename(columns=columns)
              
            self.df = self.df[order]
            
            # one output folder per departure date
            if not os.path.exists(self.date):
                os.makedirs(self.date)      

            filename=os.path.join(os.getcwd(),self.date,self.date+'-'+self.city[0]+'-'+self.city[1]+'.csv')

            self.df.to_csv(filename,encoding='GB18030',index=False)
            
            print('\nScraping finished',filename) 
        except Exception as e:
            print('Failed to merge the data',e)

(9). Entry point called from outside

    def demain(self,citys,citycode):
        self.citycode=citycode
        # set the departure date (tomorrow)
        self.date=dt.now()+timedelta(days=1)
        self.date=self.date.strftime('%Y-%m-%d')
        
        for city in citys:
            self.city=city
            
            if citys.index(city)==0:
                # first run: load the page
                self.getpage()
            else:
                # later runs only need to switch departure and destination
                self.changecity()
        
        # quit the browser when finished
        self.driver.quit()

IV. Main function

if __name__ == '__main__':
    citys=[]
    cityname,citycode=getcitycode()
    city=['上海','广州','深圳','北京']
    ytic=list(reversed(city))
    for m in city:
        for n in ytic:
            if m==n:
                continue
            else:
                citys.append([m,n])
    
    fly = FLIGHT()
    fly.demain(citys,citycode)
    print('\nProgram finished!!!!')
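The nested loops in the main function build every ordered pair of distinct cities. As a minimal sketch, the same set of pairs can be produced with itertools.permutations from the standard library. The pairs come out in a different order than in the original loops, which iterate the inner list in reverse.

```python
from itertools import permutations

city = ['上海', '广州', '深圳', '北京']
# every ordered (departure, destination) pair of two different cities
citys = [list(pair) for pair in permutations(city, 2)]
print(len(citys))  # 12
print(citys[0])    # ['上海', '广州']
```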