- An encoding you pass in as the fromEncoding argument to the soup constructor.
- An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
- An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252
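In other words, the order is: first the fromEncoding argument, then the encoding declared in the document's meta tag, then sniffing the first few bytes, then a guess from chardet, then UTF-8, and finally Windows-1252.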
soup = BeautifulSoup(content, fromEncoding='UTF-8')
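Going by this order, the call above should in theory use UTF-8, but in practice you need to check soup.originalEncoding to see which encoding was actually used. The main reason is that the page may contain invalid characters that make decoding with the fromEncoding value fail, and chardet can also guess the wrong encoding because of these unexpected characters.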
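The fallback behavior can be illustrated with a simplified stand-in. This is not Beautiful Soup's actual implementation: `first_working_encoding` and the sample bytes are hypothetical, and the sketch assumes Python 3 `bytes`. It only shows why the encoding you asked for is not necessarily the one that ends up being used.

```python
def first_working_encoding(data, candidates):
    """Return (text, encoding) for the first candidate that decodes data cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Decoding failed or the codec name is unknown: try the next candidate
            continue
    raise ValueError("no candidate encoding could decode the data")


# Hypothetical sample: 0x93 and 0x94 are curly quotes in Windows-1252,
# but they are not valid UTF-8 on their own
raw = b"\x93quoted\x94"

# UTF-8 is requested first, yet the fallback is what actually succeeds
text, used = first_working_encoding(raw, ["utf-8", "windows-1252"])
print(used)
```

Here a couple of stray bytes are enough to make the requested UTF-8 fail, which is the same reason to inspect soup.originalEncoding instead of trusting the argument you passed in.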
To work around this, decode the document yourself with the ignore or replace error handler, then pass the result to BeautifulSoup:
decoded_document = document.decode('UTF-8', 'ignore')
soup = BeautifulSoup(decoded_document)
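The difference between the two error handlers can be seen with plain `bytes.decode`, without Beautiful Soup. This is a minimal sketch assuming Python 3; the sample bytes are made up for illustration.

```python
# Hypothetical sample: valid UTF-8 for "café", followed by a stray 0xFF byte
document = b"caf\xc3\xa9 \xff ok"

# The default (strict) handler raises on the invalid byte
try:
    document.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = document.decode("utf-8", "ignore")    # invalid byte is dropped
replaced = document.decode("utf-8", "replace")  # invalid byte becomes U+FFFD

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

`ignore` silently loses the bad bytes, while `replace` leaves a visible U+FFFD marker where they were, which may be easier to debug later.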