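How to Implement a Baidu Index Crawler

This article shows how to implement a Baidu Index crawler and what to watch out for along the way. What follows is a hands-on example, so let's walk through it.

I once read an imaginative article describing the front-end anti-crawler tricks used by the big internet companies. As that article itself admits, no anti-crawler method is 100% effective. This article presents one simple approach that gets around all of those front-end anti-crawler measures.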

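The code below uses Baidu Index as its example and has already been packaged as a Baidu Index crawler Node library. The crawler relies on Puppeteer, which can be installed through a package manager, or by first switching the download host to the Taobao mirror and then installing: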

npm config set PUPPETEER_DOWNLOAD_HOST=https://npm.taobao.org/mirrors

npm install --save puppeteer

You can also skip the Chromium download at install time and point Puppeteer at a local Chrome binary in code:

// npm

npm install --save puppeteer --ignore-scripts

// node

puppeteer.launch({ executablePath: '/path/to/Chrome' });
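
One way to keep the Chrome path out of the source is to read it from the environment. The following is a minimal sketch of that idea: `CHROME_PATH` and `SHOW_BROWSER` are hypothetical variable names chosen for illustration, while `executablePath` and `headless` are the launch options already used in this article.

```javascript
// Build Puppeteer launch options from environment variables.
// CHROME_PATH and SHOW_BROWSER are hypothetical names used only in this sketch.
function buildLaunchOptions(env) {
  // Run headless unless the caller explicitly asks to see the browser.
  const options = { headless: env.SHOW_BROWSER !== '1' };
  // Only set executablePath when a local Chrome was provided.
  if (env.CHROME_PATH) {
    options.executablePath = env.CHROME_PATH;
  }
  return options;
}

console.log(buildLaunchOptions({ CHROME_PATH: '/path/to/Chrome' }));
```

The result could then be passed along as `puppeteer.launch(buildLaunchOptions(process.env))`.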

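Implementation

To keep the layout tidy, only the main parts are listed below, and every selector in the code has been replaced with '...'. For the complete code, see the GitHub repository mentioned at the top of the article.

Open the Baidu Index page and simulate login

This step simply simulates what a user does, clicking and typing one step at a time. It does not handle the login CAPTCHA, since dealing with CAPTCHAs is a separate topic. If you have logged in to Baidu on this machine before, a CAPTCHA is usually not required.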

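const puppeteer = require('puppeteer');

// Launch the browser.
// With headless set to true, Puppeteer drives Chromium in the background, so you will not see what the browser is doing.
// With false, a browser window opens on your machine and shows every step.
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();

// Open Baidu Index
await page.goto(BAIDU_INDEX_URL);

// Simulate logging in
await page.click('...');
await page.waitForSelector('...');

// Type the Baidu account name and password, then log in
await page.type('...', 'username');
await page.type('...', 'password');

// Start waiting before clicking so the navigation cannot be missed
await Promise.all([
  page.waitForNavigation(),
  page.click('...')
]);

console.log('✅ Login succeeded');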

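Simulate mouse movement and read the data

Scroll the page to the trend chart, move the mouse onto a date, wait for the request to finish and the tooltip to show the value, and then take a screenshot and save the image.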

// Get the coordinates of the first day on the chart
const position = await page.evaluate(() => {
  const $image = document.querySelector('...');
  const $area = document.querySelector('...');
  const areaRect = $area.getBoundingClientRect();
  const imageRect = $image.getBoundingClientRect();

  // Scroll the chart into the visible area
  window.scrollBy(0, areaRect.top);

  return { x: imageRect.x, y: 200 };
});

// Move the mouse to trigger the tooltip
await page.mouse.move(position.x, position.y);
await page.waitForSelector('...');

// Read the tooltip information
const tooltipInfo = await page.evaluate(() => {
  const $tooltip = document.querySelector('...');
  const $title = $tooltip.querySelector('...');
  const $value = $tooltip.querySelector('...');
  const valueRect = $value.getBoundingClientRect();
  const padding = 5;

  return {
    title: $title.textContent.split(' ')[0],
    x: valueRect.x - padding,
    y: valueRect.y,
    width: valueRect.width + padding * 2,
    height: valueRect.height
  };
});
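
The snippet above reads only the first day. To cover a range of days, the hover could be repeated at one x coordinate per day. The sketch below computes those coordinates under the assumption that the chart spaces its data points evenly from its left edge to its right edge. `dayPositions` and its parameters are hypothetical names, and the real chart may lay its points out differently.

```javascript
// Compute one hover position per day across a chart.
// Assumption (not confirmed by the article): data points are evenly spaced,
// with the first day at the chart's left edge and the last at its right edge.
function dayPositions(left, width, days, y) {
  if (days < 1) return [];
  if (days === 1) return [{ x: left, y }];
  const step = width / (days - 1);
  return Array.from({ length: days }, (_, i) => ({ x: left + step * i, y }));
}

// A 600px-wide chart starting at x = 100, sampled for 7 days.
console.log(dayPositions(100, 600, 7, 200));
```

Each position could then be passed to `page.mouse.move(p.x, p.y)`, followed by the same tooltip wait and screenshot used for the first day.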

Screenshot

Work out the coordinates of the value, take a screenshot, and crop the image with jimp.

await page.screenshot({ path: imgPath });

// Crop the image so that only the digits remain
const img = await jimp.read(imgPath);
await img.crop(tooltipInfo.x, tooltipInfo.y, tooltipInfo.width, tooltipInfo.height);

// Enlarging the image a little improves recognition accuracy
await img.scale(5);
await img.write(imgPath);
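
A tooltip close to the edge of the viewport can produce a rectangle that extends beyond the screenshot, and the padding can push `x` below zero. As a hedged sketch, the helper below rounds a rectangle such as `tooltipInfo` to whole pixels and clamps it to the viewport. `toClip` and the `viewport` argument are hypothetical names introduced for this example.

```javascript
// Clamp a rectangle to the viewport and round it to whole pixels.
// Returns null when nothing of the rectangle is visible.
function toClip(rect, viewport) {
  const x = Math.max(0, Math.floor(rect.x));
  const y = Math.max(0, Math.floor(rect.y));
  const right = Math.min(viewport.width, Math.ceil(rect.x + rect.width));
  const bottom = Math.min(viewport.height, Math.ceil(rect.y + rect.height));
  if (right <= x || bottom <= y) return null;
  return { x, y, width: right - x, height: bottom - y };
}

console.log(toClip({ x: -3.5, y: 10.2, width: 50, height: 20.5 }, { width: 800, height: 600 }));
```

The clamped values can be fed to `img.crop`. Puppeteer's `page.screenshot` also accepts a `clip` option of the same shape, which would capture only that region and make the separate cropping step optional.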

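Image recognition

Here we use Tesseract for image recognition. Tesseract is an open-source OCR tool, developed for many years with Google's sponsorship, that recognizes text in images and can be trained to improve its accuracy. A simple Node wrapper already exists on GitHub: node-tesseract. It requires you to install Tesseract first and add it to your PATH.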

const Tesseract = require('node-tesseract');

Tesseract.process(imgPath, (err, val) => {
  if (err || val == null) {
    console.error('❌ Recognition failed: ' + imgPath);
    return;
  }
  console.log(val);
});

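In practice, an untrained Tesseract makes a handful of mistakes, such as reading a number that starts with 9 as "`3". Training is the way to raise Tesseract's accuracy here, but if recognition always fails in the same way, the errors can also be patched with simple regular expressions.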
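
The regular-expression repair could look roughly like the sketch below. The correction table is an assumption: the first rule mirrors the "`3" misreading mentioned above, the second is a made-up example of another rule, and `normalizeOcrNumber` is a hypothetical helper name. A real table would be built from the errors actually observed.

```javascript
// Each entry maps a recurring misread pattern to its replacement.
const CORRECTIONS = [
  [/^`3/, '9'],   // leading "`3" read in place of 9 (the case described in the text)
  [/[oO]/g, '0'], // assumed example: the letter O read in place of zero
];

// Turn raw OCR output into a number, or null if no digits survive.
function normalizeOcrNumber(raw) {
  let text = raw.trim();
  for (const [pattern, replacement] of CORRECTIONS) {
    text = text.replace(pattern, replacement);
  }
  // Drop thousands separators and any other non-digit noise.
  const digits = text.replace(/[^0-9]/g, '');
  return digits === '' ? null : Number(digits);
}

console.log(normalizeOcrNumber('`3,876')); // 9876
```

Keeping the rules in a table makes it easy to add a new correction whenever another systematic error turns up.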

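Packaging

With the pieces above in place, combining them is all it takes to package a Baidu Index crawler Node library. Many improvements are possible, such as crawling in batches or crawling a specified number of days, and none of them is hard to build on this foundation.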

const fs = require('fs');
const path = require('path');
const recognition = require('./src/recognition');
const Spider = require('./src/spider');

// Assumed location for saved screenshots; the excerpt does not show where imgDir is defined.
const imgDir = path.resolve(__dirname, 'img');

module.exports = {
  async run (word, options, puppeteerOptions = { headless: true }) {
    const spider = new Spider({
      imgDir,
      ...options
    }, puppeteerOptions);

    // Crawl the data
    await spider.run(word);

    // Read the captured screenshots and run image recognition on them
    const wordDir = path.resolve(imgDir, word);
    // Keep only the PNG screenshots
    const imgNames = fs.readdirSync(wordDir)
      .filter(item => path.extname(item) === '.png');
    const result = [];

    for (let i = 0; i < imgNames.length; i++) {
      const imgPath = path.resolve(wordDir, imgNames[i]);
      const val = await recognition.run(imgPath);
      result.push(val);
    }

    return result;
  }
};

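Anti-crawling

Finally, how can a site defend against this kind of crawler? In my view, analyzing the mouse movement trajectory may be one way. Of course, the front end has no 100% effective anti-crawler measure. All we can do is make the crawler's job a little harder.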
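
To make the trajectory idea concrete, here is a rough heuristic sketch, not a proven defense. It assumes the page records `mousemove` coordinates leading up to a hover. By default Puppeteer's `page.mouse.move` dispatches a single event, and with a larger `steps` option it interpolates along a straight line, so the sketch flags paths that are very short or perfectly straight. `looksScripted` and `minSamples` are hypothetical names, and a real user who happens to move in a perfectly straight line would be flagged as well.

```javascript
// points: [{ x, y }] mousemove samples recorded before the tooltip appeared.
function looksScripted(points, minSamples = 5) {
  // A jump straight to the target produces very few mousemove events.
  if (points.length < minSamples) return true;
  // Perfectly collinear samples suggest linear interpolation rather than a hand.
  const a = points[0];
  const b = points[points.length - 1];
  return points.every(p =>
    Math.abs((b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x)) < 1e-6
  );
}

const straight = [0, 10, 20, 30, 40, 50].map(v => ({ x: v, y: v }));
const wobbly = [
  { x: 0, y: 0 }, { x: 10, y: 12 }, { x: 20, y: 18 },
  { x: 30, y: 33 }, { x: 40, y: 39 }, { x: 50, y: 50 }
];
console.log(looksScripted(straight), looksScripted(wobbly)); // true false
```

A crawler can defeat this check by adding jitter to its own movements, which is consistent with the point above that such measures only raise the cost of crawling.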
