[Python syntax and semantics] Error when decoding the content of a UTF-8 web page fetched with httplib

ewong 2011-05-24
# -*- coding: utf-8 -*-
import codecs
import httplib
import sys

#reload(sys)
#sys.setdefaultencoding('utf8')
#print sys.getdefaultencoding()

conn=httplib.HTTPConnection('www.douban.com',80)
conn.request('GET', '/')
resp = conn.getresponse()

f = codecs.open('C:\\tmp\\web.log', 'w', 'utf8')
f.write(resp.read().decode('utf8'))
f.close()
conn.close()


I tried three sites, www.douban.com, www.google.com.hk and www.iteye.com, all of which are UTF-8 encoded.
Only Douban saves successfully; the other two fail with the following error:

Traceback (most recent call last):
  File "test.py", line 15, in <module>
    f.write(resp.read().decode('utf8'))
  File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 16960-16961: invalid data


Could someone help me figure out what is causing this?
suxianbaozi 2011-08-31
There is no need to decode; just save the response directly.
乖睡觉 2012-07-02
Removing .decode('utf8') should be enough.
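
A minimal sketch of what the two replies above suggest, assuming the goal is only to keep an exact copy of the page: open the file in binary mode and write the bytes from resp.read() untouched. The helper name save_raw is made up for illustration. This avoids the UnicodeDecodeError, but it does not tell you which encoding the page actually uses.

```python
def save_raw(body, path):
    """Write the response body to path exactly as received (hypothetical helper)."""
    # 'wb' means no codec is involved, so undecodable bytes cannot raise an error.
    f = open(path, 'wb')
    try:
        f.write(body)
    finally:
        f.close()
```

In the original script this would replace the codecs.open / f.write / f.close lines with a single call such as save_raw(resp.read(), 'C:\\tmp\\web.log').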
edisonlz 2012-07-16
Use a larger character set: gb18030.
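
If decoded text is needed, one possible way to apply the gb18030 suggestion is to try UTF-8 first and fall back only when that fails. This is a sketch, not a verified fix for these particular sites: decode_with_fallback and its default encoding list are assumptions made for illustration, and gb18030 can accept bytes that are not really GB18030 and return garbled text, so the result should be checked by eye.

```python
def decode_with_fallback(body, encodings=('utf-8', 'gb18030')):
    """Return (text, encoding_used) for the first encoding that decodes body.

    Hypothetical helper; the candidate list is an assumption, not something
    the servers are known to use.
    """
    for name in encodings:
        try:
            return body.decode(name), name
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep the text, marking bad bytes with U+FFFD.
    return body.decode(encodings[0], 'replace'), encodings[0]
```

The returned encoding name shows which candidate was used. If the server declares a charset, it appears in the value of resp.getheader('Content-Type') and is a better first candidate than a guess.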