Scraping Douban Reviews of "The Shawshank Redemption" with Python

First, a look at the result:

URL: ( /subject/1292052/comments?sort=time&status=P)

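Scrape the first 10,000 comments

Save them to a txt file

Preprocess the data

Segment the Chinese text into words

Count the top 10 most frequent words

Visualize the high-frequency words

Generate a word cloud from the word frequencies

Moderate the comments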
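
The preprocessing and counting steps in the list above can be sketched with the standard library alone. This is a minimal sketch, not the author's code: the names `clean_comment` and `top_words` are hypothetical, the token list is assumed to come from a segmenter such as `jieba.lcut`, and dropping single-character tokens is a choice made here for illustration.

```python
import re
from collections import Counter

# Matches runs of anything that is not a common CJK ideographic character
# (punctuation, digits, Latin letters, emoji, whitespace).
_NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]+")


def clean_comment(text):
    """Strip every character that is not a common Chinese character."""
    return _NON_CHINESE.sub("", text)


def top_words(tokens, stopwords, n=10):
    """Return the n most frequent tokens as (word, count) pairs.

    Stop words and single-character tokens are skipped (an assumption of
    this sketch; adjust to taste).
    """
    counts = Counter(
        tok for tok in tokens
        if len(tok) > 1 and tok not in stopwords
    )
    return counts.most_common(n)
```

With jieba installed, the tokens would typically be produced by `jieba.lcut(clean_comment(text))` and the stop words read from the stop word file into a set.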

================================================================

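Setup

Chinese word segmentation requires jieba

Drawing the word cloud requires wordcloud

A Chinese font is needed for the visualizations

Find a Chinese stop word list among publicly available resources online

Build your own list of additional words based on the segmentation results

Prepare a background image for the word cloud (optional, not required)

paddlehub setup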

# Install jieba (word segmentation) and wordcloud

pip install jieba

pip install wordcloud

# Install paddle

pip install --upgrade PaddlePaddle

# Install the model

# hub install porn_detection_lstm==1.1.0

pip install --upgrade paddlehub

pip install numpy

# Install BeautifulSoup

pip install BeautifulSoup4
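
After installing, a quick check that the packages are importable can save debugging time later. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the list assumes the usual import names (`paddle` for PaddlePaddle, `bs4` for BeautifulSoup4).

```python
import importlib.util

# Import names differ from some pip package names:
# PaddlePaddle is imported as "paddle", BeautifulSoup4 as "bs4".
REQUIRED_MODULES = ["jieba", "wordcloud", "paddle", "paddlehub", "numpy", "bs4"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(REQUIRED_MODULES)` returns an empty list when everything above is installed.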

GitHub repository: /mikite/python_sp_shawshank

Problems you may run into:

1. UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 1: invalid continuation byte

Solution:

1. Use requests instead of urllib

2. Remove 'Accept-Encoding': 'gzip, deflate, br' from the request headers

3. When converting the returned response to a string, specify utf-8 as the encoding explicitly

# 'Accept-Encoding': 'gzip, deflate, br',
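
The three fixes above can be combined roughly as follows. This is a sketch under assumptions, not the code from the repository: the function names `without_accept_encoding` and `fetch_html` and the 10-second timeout are made up for illustration, and `requests` must be installed separately.

```python
def without_accept_encoding(headers):
    """Return a copy of the headers with any Accept-Encoding entry removed."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


def fetch_html(url, headers):
    """Download a page with requests and decode the body explicitly as UTF-8."""
    # Third-party dependency, imported here so the helper above works without it.
    import requests

    response = requests.get(
        url,
        headers=without_accept_encoding(headers),
        timeout=10,  # hypothetical value; not taken from the original post
    )
    response.raise_for_status()
    # Decode the raw bytes ourselves instead of relying on a guessed encoding.
    return response.content.decode("utf-8")
```

Dropping the hand-written `Accept-Encoding` header leaves content negotiation to requests, so the body you decode is no longer in a compression format the client cannot unpack.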

2. Cookies

Solution:

1. Copy the cookie from the request headers of a Douban request in your browser and set it in your own request headers

'Cookie': 'bid=WD6_t6hVqgM'

3. Requests return status code 418

Solution: imitate a browser by setting the request headers, in particular user-agent

'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
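
Putting the cookie and user-agent settings together, the request headers might be assembled as in the sketch below. The helper names `build_headers` and `is_rejected` are hypothetical; the header values are the ones quoted in this post and would normally be replaced by values copied from your own browser.

```python
# User-agent string quoted in this post; any current browser string should do.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
)


def build_headers(cookie=None, user_agent=BROWSER_USER_AGENT):
    """Build browser-like request headers, adding a Cookie only when one is given."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return headers


def is_rejected(status_code):
    """True when the server answered with the 418 status described above."""
    return status_code == 418
```

For example, `build_headers(cookie="bid=WD6_t6hVqgM")` yields a dictionary that can be passed as the `headers` argument of `requests.get`, and `is_rejected(response.status_code)` can be used to stop or slow down the scraping loop.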

4. BeautifulSoup does not find the comments

Solution:

Step 1: specify 'lxml' as the parser

soupComment = BeautifulSoup(html, 'lxml')

Step 2:

Pass the CSS class name of the comment element to the findAll method

print('Page content:', soupComment.prettify())

comments = soupComment.findAll(class_='short')
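
The two steps can be wrapped in a small function that returns the comment texts directly. This is a minimal sketch: the function name `extract_short_comments` is hypothetical, the `parser` argument defaults to 'lxml' as recommended above (which requires the lxml package), and `find_all` is the current spelling of the `findAll` method used in this post.

```python
from bs4 import BeautifulSoup


def extract_short_comments(html, parser="lxml"):
    """Return the text of every element whose CSS class is 'short'."""
    soup = BeautifulSoup(html, parser)
    # class_ (with a trailing underscore) filters on the CSS class attribute.
    return [tag.get_text(strip=True) for tag in soup.find_all(class_="short")]
```

If lxml is not installed, passing `parser="html.parser"` falls back to the parser in the Python standard library.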
