2012年11月17日

BeautifulSoup Encoding

I recently ran into an encoding problem while reading web pages with BeautifulSoup. BeautifulSoup works by guessing the content's encoding according to a set of priority rules, then uses that encoding to decode the content into Unicode. First, the priority rules:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

So the order is: the fromEncoding argument first, then any encoding declared in the document itself (e.g. a meta tag), then sniffing the first few bytes of the file, then a guess from the chardet library, then UTF-8, and finally Windows-1252.
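The byte-sniffing step in the list above can be sketched with the standard library's BOM constants. This is a simplified illustration of the idea, not BeautifulSoup's actual detection code, and `sniff_bom` is a hypothetical helper name:

```python
import codecs

def sniff_bom(data: bytes):
    """Guess an encoding from a leading byte-order mark, if any."""
    # Check the longer UTF-32 BOMs first, because the UTF-32-LE BOM
    # (\xff\xfe\x00\x00) begins with the UTF-16-LE BOM (\xff\xfe).
    boms = [
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF8, 'utf-8'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None  # no BOM: fall through to the next rule (chardet, etc.)

print(sniff_bom(codecs.BOM_UTF8 + b'<html></html>'))  # utf-8
print(sniff_bom(b'<html></html>'))                    # None
```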

soup = BeautifulSoup(content, fromEncoding='UTF-8') 

Given that priority order, the call above should in theory use UTF-8. In practice, though, you should check soup.originalEncoding to see which encoding was actually used. The main reason is that the page may contain illegal characters that make decoding with fromEncoding fail, and those unexpected bytes can also make chardet guess the wrong encoding.
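To see why a requested encoding can fail, here is a minimal Python 3 sketch with a made-up byte string: a single stray byte is enough to make strict UTF-8 decoding raise, which is what pushes BeautifulSoup past the fromEncoding rule and on to the later guesses:

```python
# A hypothetical page fragment: valid UTF-8 plus one illegal stray byte.
content = '<p>中文</p>'.encode('utf-8') + b'\xff'

try:
    content.decode('utf-8')  # strict decoding, as fromEncoding would attempt
except UnicodeDecodeError as e:
    print('decode failed:', e.reason)
```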

One workaround is to decode the document yourself with the 'ignore' or 'replace' error handler first, then pass the result to BeautifulSoup:

decoded_document = document.decode('UTF-8', 'ignore')
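For instance (a Python 3 sketch with a hypothetical byte string): 'ignore' silently drops the offending byte, while 'replace' substitutes U+FFFD so you can still see where the damage was. Either way the result is already Unicode, so BeautifulSoup no longer has to guess the encoding:

```python
# Hypothetical document: valid UTF-8 plus one illegal byte at the end.
document = '<p>編碼</p>'.encode('utf-8') + b'\xff'

cleaned = document.decode('UTF-8', 'ignore')   # bad byte dropped
marked = document.decode('UTF-8', 'replace')   # bad byte becomes U+FFFD

print(cleaned)  # <p>編碼</p>
print(marked)   # <p>編碼</p>�
```

Choosing between the two is a judgment call: 'ignore' gives clean output, 'replace' preserves the fact that something was lost.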