2012年11月17日

BeautifulSoup Encoding

I recently ran into an encoding problem while reading web pages with BeautifulSoup. BeautifulSoup works by guessing the content's encoding according to a set of priority rules, then uses that encoding to decode the content into Unicode. First, the priority rules:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

So the order is: the fromEncoding argument first, then any encoding declared in the document itself (e.g. a meta tag), then sniffing the first few bytes of the file, then a guess from the chardet library, then UTF-8, and finally Windows-1252.
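The byte-sniffing step in the list above can be sketched with the standard library's BOM constants. This is a simplified illustration of the idea, not BeautifulSoup's actual detection code, and `sniff_bom` is a hypothetical helper name:

```python
import codecs

def sniff_bom(data: bytes):
    """Guess an encoding from a leading byte-order mark, if any."""
    # Check the longer UTF-32 BOMs first, because the UTF-32-LE BOM
    # (\xff\xfe\x00\x00) begins with the UTF-16-LE BOM (\xff\xfe).
    boms = [
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF8, 'utf-8'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None  # no BOM: fall through to the next rule (chardet, etc.)

print(sniff_bom(codecs.BOM_UTF8 + b'<html></html>'))  # utf-8
print(sniff_bom(b'<html></html>'))                    # None
```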

soup = BeautifulSoup(content, fromEncoding='UTF-8') 

Given that priority order, the call above should in theory use UTF-8. In practice, though, you should check soup.originalEncoding to see which encoding was actually used. The main reason is that the page may contain illegal characters that make decoding with fromEncoding fail, and those unexpected bytes can also make chardet guess the wrong encoding.
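To see why a requested encoding can fail, here is a minimal Python 3 sketch with a made-up byte string: a single stray byte is enough to make strict UTF-8 decoding raise, which is what pushes BeautifulSoup past the fromEncoding rule and on to the later guesses:

```python
# A hypothetical page fragment: valid UTF-8 plus one illegal stray byte.
content = '<p>中文</p>'.encode('utf-8') + b'\xff'

try:
    content.decode('utf-8')  # strict decoding, as fromEncoding would attempt
except UnicodeDecodeError as e:
    print('decode failed:', e.reason)
```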

One workaround is to decode the document yourself with the 'ignore' or 'replace' error handler first, then pass the result to BeautifulSoup:

decoded_document = document.decode('UTF-8', 'ignore')
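For instance (a Python 3 sketch with a hypothetical byte string): 'ignore' silently drops the offending byte, while 'replace' substitutes U+FFFD so you can still see where the damage was. Either way the result is already Unicode, so BeautifulSoup no longer has to guess the encoding:

```python
# Hypothetical document: valid UTF-8 plus one illegal byte at the end.
document = '<p>編碼</p>'.encode('utf-8') + b'\xff'

cleaned = document.decode('UTF-8', 'ignore')   # bad byte dropped
marked = document.decode('UTF-8', 'replace')   # bad byte becomes U+FFFD

print(cleaned)  # <p>編碼</p>
print(marked)   # <p>編碼</p>�
```

Choosing between the two is a judgment call: 'ignore' gives clean output, 'replace' preserves the fact that something was lost.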