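I am not fond of titles like "Pokémon in their own juice for dummies/pots/frying pans", but this really does seem to be that kind of article: we will talk about basic things, the kind that keep raising the question "Why doesn't it work?" If you are still afraid of Unicode and/or don't understand it, read on.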
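Why?
This is the main question of a beginner who runs into an impressive number of encodings and, at first glance, convoluted mechanisms for working with them (in Python 2.x, for example). The short answer: because that is how it turned out :)
An encoding, for those who don't know, is a way of representing digits, letters and all other characters in computer memory (read: as zeros and ones, that is, numbers). For example, a space is represented as 0b100000 (binary), 32 (decimal) or 0x20 (hexadecimal).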
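To make those numbers concrete, here is a minimal sketch. Two assumptions to note: it is written for Python 3 rather than the Python 2.x this article discusses, and the variable names are made up for illustration.

```python
# Python 3 sketch: one character, one number, three notations.
space = ' '
code = ord(space)                # the number behind the character

assert code == 32                # decimal
assert bin(code) == '0b100000'   # binary
assert hex(code) == '0x20'       # hexadecimal

# Encoded as ASCII, the character becomes exactly one byte with that value.
encoded = space.encode('ascii')
assert encoded == b'\x20'

print(code, bin(code), hex(code))
```

The three notations are just different spellings of the same value; the encoding decides which value a character gets.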
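Once upon a time memory was very scarce, and 7 bits were enough for every computer to represent all the characters it needed (digits, lowercase and uppercase Latin letters, a bunch of punctuation and the so-called control characters; all 128 possible values were handed out to something). There was one encoding at the time: ASCII. Time passed, everyone was happy, and those who were not (read: those who missed some symbol or a letter of their native alphabet) used the remaining 128 values at their own discretion, that is, they created new encodings. That is how ISO-8859-1 and our own (that is, Cyrillic) cp1251 and KOI8 came about. Along with them came the problem of interpreting bytes of the form 0b1******* (that is, characters/numbers from 128 to 255): for example, 0b11011111 is our native "Я" in cp1251, while in ISO-8859-1 the same byte is "ß".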
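The ambiguity of a high byte is easy to reproduce. The sketch below assumes Python 3 (not the Python 2.x of this article), where `bytes.decode` plays the role of Python 2's `str.decode`; the names `raw`, `as_cp1251` and `as_latin1` are illustrative only.

```python
# Python 3 sketch: the same byte read under two single-byte encodings.
raw = bytes([0b11011111])              # one byte, value 0xDF (223)

as_cp1251 = raw.decode('cp1251')       # CYRILLIC CAPITAL LETTER YA
as_latin1 = raw.decode('iso-8859-1')   # LATIN SMALL LETTER SHARP S

assert as_cp1251 == '\u042f'
assert as_latin1 == '\u00df'
assert as_cp1251 != as_latin1

# ASCII only defines values 0..127, so it cannot interpret this byte at all.
ascii_error = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error = exc

print(ascii(as_cp1251), ascii(as_latin1))
```

Nothing in the byte itself says which reading is the right one; that information has to travel separately.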
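At this point the bright minds got together and proposed a new standard: Unicode. It is a standard, not an encoding: Unicode by itself does not determine how characters are stored on a hard drive or sent over a network. It only defines the relationship between a character and a certain number, while the format in which those numbers are turned into bytes is defined by the Unicode encodings (UTF-8 or UTF-16, for example). At the time of writing, the Unicode standard contained a little over 100 thousand characters, while UTF-16 can address more than a million code points (and the original UTF-8 design even more).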
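The split between "character = number" and "number = bytes" can be shown in a few lines. This is a sketch under the assumption of Python 3 (the article itself targets Python 2.x); the variable names are hypothetical.

```python
# Python 3 sketch: one code point, several byte representations.
letter = '\u044f'                     # CYRILLIC SMALL LETTER YA
assert ord(letter) == 0x44F           # the number Unicode assigns to it

utf8 = letter.encode('utf-8')         # two bytes
utf16 = letter.encode('utf-16-be')    # big-endian, no byte order mark
cp1251 = letter.encode('cp1251')      # a legacy single-byte encoding

assert utf8 == b'\xd1\x8f'
assert utf16 == b'\x04\x4f'
assert cp1251 == b'\xff'

# Every encoding decodes back to the same character.
assert utf8.decode('utf-8') == utf16.decode('utf-16-be') == letter
```

The code point stays the same throughout; only the byte layout chosen by each encoding differs.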
For a more detailed and more entertaining take on the topic, I recommend the magnificent Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets".
Getting to the point!
Naturally, Python supports Unicode too. Unfortunately, only in Python 3 did all strings become Unicode, so beginners get to bang their heads against errors like these:
>>> with open('1.txt') as fh:
        s = fh.read()
>>> print s
кощей
>>> parser_result = u'баба-яга'  # assigned by hand here; imagine it is the output of some parser
>>> parser_result + s
Traceback (most recent call last):
  File "<pyshell#43>", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
or like this:
>>> str(parser_result)
Traceback (most recent call last):
  File "<pyshell#52>", line 1, in <module>
    str(parser_result)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
Let's sort this out, one step at a time.
Why does anyone use Unicode?
Why does my favorite HTML parser return Unicode? Let it return an ordinary string, and I'll deal with it from there! Right? Not quite. Although every character that exists in Unicode can (probably) be represented in some single-byte encoding (ISO-8859-1, cp1251 and the like are called single-byte because they encode any character in exactly one byte), what do you do when a string has to contain characters from different encodings? Assign a separate encoding to each character? No, of course you have to use Unicode.
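A tiny experiment illustrates the point about mixing characters. The sketch assumes Python 3 (unlike the Python 2.x sessions in this article), and `fits` is a hypothetical helper written just for this example.

```python
# Python 3 sketch: a text mixing a Cyrillic letter and a German sharp s.
text = '\u044f \u00df'

def fits(value, encoding):
    """Return True if every character of `value` exists in `encoding`."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

assert not fits(text, 'cp1251')       # cp1251 has no sharp s
assert not fits(text, 'iso-8859-1')   # ISO-8859-1 has no Cyrillic
assert fits(text, 'utf-8')            # a Unicode encoding covers both

# Round trip through UTF-8 keeps the text intact.
assert text.encode('utf-8').decode('utf-8') == text
```

Neither single-byte encoding can hold the whole string, while a Unicode encoding handles it without any per-character bookkeeping.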
Why do we need a new "unicode" type?
So we have reached the most interesting part. What is a string in Python 2.x? It is just bytes. Just binary data that can be anything at all. In fact, when we write something like:
>>> x = 'abcd'
>>> x
'abcd'
the interpreter does not create a variable holding the first four letters of the Latin alphabet, only the sequence
('a', 'b', 'c', 'd')
of four bytes, and the Latin letters are used here purely to denote those particular byte values. In other words, 'a' here is simply a synonym for writing '\x61', and not one bit more. For example:
>>> import struct
>>> '\x61'
'a'
>>> struct.unpack('>4b', x)  # 'x' is just four signed/unsigned chars
(97, 98, 99, 100)
>>> struct.unpack('>2h', x)  # or two shorts
(24930, 25444)
>>> struct.unpack('>l', x)  # or one long
(1633837924,)
>>> struct.unpack('>f', x)  # or a float
(2.6100787562286154e+20,)
>>> struct.unpack('>d', x * 2)  # or half of a double (x * 2 makes a whole one)
(1.2926117739473244e+161,)
And that's it!
And the answer to the question of why we need "unicode" is now more obvious: we need a type that is represented by characters, not bytes.
OK, I get what a string is. Then what is Unicode in Python?
"type unicode" is first of all an abstraction that implements the idea of Unicode (a set of characters and the numbers associated with them). An object of type "unicode" is no longer a sequence of bytes but a sequence of characters proper, with no notion whatsoever of how those characters might be efficiently stored in computer memory. If you like, it is a higher level of abstraction than byte strings (which is what Python 3 calls the ordinary strings used in Python 2.6).
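The difference between "a sequence of characters" and "a sequence of bytes" shows up as soon as you measure or index them. The sketch assumes Python 3, where `str` corresponds to Python 2's "unicode" and `bytes` to Python 2's "str"; the word used is the one from the article's own examples.

```python
# Python 3 sketch: characters versus bytes.
word = '\u043a\u043e\u0449\u0435\u0439'     # "кощей", five characters

assert len(word) == 5                        # length in characters
assert len(word.encode('cp1251')) == 5       # one byte per character
assert len(word.encode('utf-8')) == 10       # two bytes per character here
assert len(word.encode('utf-16-le')) == 10

# Indexing text yields a character; indexing bytes yields a number.
first_char = word[0]
first_byte = word.encode('cp1251')[0]
assert first_char == '\u043a'
assert first_byte == 0xEA
```

The character count never changes, while the byte count depends entirely on the encoding chosen for storage.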
How do I use Unicode?
A Unicode string in Python 2.6 can be created in three ways (at least, naturally):
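- the u"" literal:
>>> u'abc'
u'abc'
- the "decode" method of a byte string:
>>> 'abc'.decode('ascii')
u'abc'
- the "unicode" function:
>>> unicode('abc', 'ascii')
u'abc'
Here 'ascii' names the encoding used to turn the bytes into characters. The conversion goes roughly like this:
'\x61' -> ascii encoding -> lowercase Latin "a" -> u'\u0061' (the Unicode code point for this letter)
'\xe0' -> cp1251 encoding -> lowercase Cyrillic "а" -> u'\u0430'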
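The two conversion chains above can be checked directly. This is a sketch assuming Python 3, where byte strings are written with a `b` prefix and text literals need no `u` prefix; the variable names are illustrative.

```python
# Python 3 sketch of the two conversions shown above.
latin_a = b'\x61'.decode('ascii')
cyrillic_a = b'\xe0'.decode('cp1251')

assert latin_a == 'a' == '\u0061'
assert cyrillic_a == '\u0430'          # CYRILLIC SMALL LETTER A

# Same byte, different encoding, different character:
other = b'\xe0'.decode('latin1')
assert other == '\u00e0'               # LATIN SMALL LETTER A WITH GRAVE
assert other != cyrillic_a
```

The last two lines are the whole source of trouble in miniature: the byte 0xE0 only becomes a letter once an encoding is named.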
How do you get an ordinary string from a Unicode string? Encode it:
>>> u'abc'.encode('ascii')
'abc'
The encoding algorithm is, naturally, the reverse of the one above.
Remember and don't mix them up: unicode == characters, string == bytes; bytes -> something meaningful (characters) is decoding (decode), and characters -> bytes is encoding (encode).
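As a memory aid, here is the rule as a round trip. The sketch assumes Python 3, which enforces the direction for you: `bytes` only has `decode` and `str` only has `encode`. The byte values are the cp1251 codes of the word used in the article's examples.

```python
# Python 3 sketch: decode goes bytes -> characters, encode goes back.
data = b'\xea\xee\xf9\xe5\xe9'            # "кощей" as cp1251 bytes
text = data.decode('cp1251')              # bytes -> characters

assert text == '\u043a\u043e\u0449\u0435\u0439'
assert text.encode('cp1251') == data      # characters -> bytes

# Python 3 removes the wrong-direction methods altogether.
assert not hasattr(data, 'encode')
assert not hasattr(text, 'decode')
```

In Python 2 both methods exist on both types, which is exactly why mixing them up produces the confusing errors discussed in this article.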
It won't encode :(
Let's go through the examples from the start of the article. How does concatenating a string and a Unicode string work? The plain string has to be converted to a Unicode string, and since the interpreter does not know its encoding, it uses the default encoding, ascii. If that encoding fails to decode the string, we get an ugly error. In that case we have to convert the string to a Unicode string ourselves, using the right encoding:
>>> print type(parser_result), parser_result
<type 'unicode'> баба-яга
>>> s = 'кощей'
>>> parser_result + s
Traceback (most recent call last):
  File "<pyshell#67>", line 1, in <module>
    parser_result + s
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 0: ordinal not in range(128)
>>> parser_result + s.decode('cp1251')
u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0\u043a\u043e\u0449\u0435\u0439'
>>> print parser_result + s.decode('cp1251')
баба-ягакощей
>>> print '&'.join((parser_result, s.decode('cp1251')))
баба-яга&кощей  # :)
"UnicodeDecodeError" is usually a sign that a string needs to be decoded to Unicode using the right encoding.
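For comparison, here is a sketch of the same situation under Python 3 (an assumption; the sessions in this article are Python 2.x). There the interpreter never decodes implicitly, so the failure has to be provoked with an explicit `decode('ascii')`. The variable names mirror the article's, and `bad_byte` is added for illustration.

```python
# Python 3 sketch: decoding with the wrong and with the right codec.
s = b'\xea\xee\xf9\xe5\xe9'                                    # bytes from a cp1251 source
parser_result = '\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'  # already text

bad_byte = None
try:
    s.decode('ascii')
except UnicodeDecodeError as exc:
    bad_byte = exc.object[exc.start]   # first byte the codec rejected

assert bad_byte == 0xEA                # the same 0xea the traceback reports

combined = parser_result + s.decode('cp1251')
assert combined.endswith('\u043a\u043e\u0449\u0435\u0439')

# Python 3 refuses to guess: text + bytes is a TypeError, not a decode attempt.
try:
    parser_result + s
    mixed_ok = True
except TypeError:
    mixed_ok = False
assert not mixed_ok
```

The exception object carries the offending data (`exc.object`) and position (`exc.start`), which helps when you need to find out where the non-ASCII bytes came from.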
Now for "str" and Unicode strings. Don't use "str" on Unicode strings :) "str" gives you no way to specify an encoding, so the default encoding is always used and any character above 127 causes an error. Use the "encode" method instead:
>>> print type(s), s
<type 'unicode'> кощей
>>> str(s)
Traceback (most recent call last):
  File "<pyshell#90>", line 1, in <module>
    str(s)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
>>> s = s.encode('cp1251')
>>> print type(s), s
<type 'str'> кощей
"UnicodeEncodeError" is a sign that we need to specify the right encoding when converting a Unicode string to an ordinary one (or pass 'ignore', 'replace' or 'xmlcharrefreplace' as the second argument of the "encode" method).
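What the three error handlers actually produce is easiest to see side by side. The sketch assumes Python 3 (the handlers behave the same way in the Python 2.x "encode" method); the variable names are illustrative.

```python
# Python 3 sketch: the error handlers accepted by encode().
word = '\u043a\u043e\u0449\u0435\u0439'      # "кощей"

try:
    word.encode('ascii')                      # default handler is 'strict'
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

ignored = word.encode('ascii', 'ignore')              # drop what does not fit
replaced = word.encode('ascii', 'replace')            # substitute '?'
xml = word.encode('ascii', 'xmlcharrefreplace')       # numeric character references

assert not strict_ok
assert ignored == b''
assert replaced == b'?????'
assert xml == b'&#1082;&#1086;&#1097;&#1077;&#1081;'
```

Only 'xmlcharrefreplace' keeps the information, and only in a form that an HTML or XML consumer can turn back into characters; 'ignore' and 'replace' lose it for good.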
I want more!
OK, let's use Baba Yaga from the example above once more:
>>> parser_result = u'баба-яга'  #1
>>> parser_result
u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'  #2
>>> print parser_result
áàáà-ÿãà  #3
>>> print parser_result.encode('latin1')  #4
баба-яга
>>> print parser_result.encode('latin1').decode('cp1251')  #5
баба-яга
>>> print unicode('баба-яга', 'cp1251')  #6
баба-яга
The example is not entirely simple, but everything (well, almost everything) is in it. What is going on here:
- #1: What do we have at the input? Bytes that IDLE passes to the interpreter. What do we need at the output? Unicode, that is, characters. All that remains is to turn the bytes into characters, but that takes an encoding, right? Which encoding will be used? Read on.
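- #2: An important point here:
>>> 'баба-яга'
'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
>>> u'\u00e1\u00e0\u00e1\u00e0-\u00ff\u00e3\u00e0' == u'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
True
Python does not bother choosing an encoding here; each byte simply becomes the Unicode code point with the same number:
>>> ord('а')
224
>>> ord(u'а')
224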
- #3: There is just one problem: character 224 in cp1251 (the encoding the interpreter uses) is not at all the same as character 224 in Unicode. That is exactly why we get gibberish (krakozyabry) when we try to print our Unicode string.
- #4: How do we help the old lady? It turns out that the first 256 Unicode characters are the same as in the ISO-8859-1 (latin1) encoding, so if we encode the Unicode string with it, we get back the very bytes we typed in (for the curious: Objects/unicodeobject.c, look for the definition of the function "unicode_encode_ucs1"):
>>> parser_result.encode('latin1')
'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
- #5: So how do we get Baba Yaga in Unicode? We have to say which encoding to use:
>>> parser_result.encode('latin1').decode('cp1251')
u'\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'
- #6: The method from point #5 is certainly nothing to write home about; using the built-in unicode is much more convenient.
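The repair trick from steps #4 and #5 works the same way outside of IDLE. Here is a sketch assuming Python 3 (not the Python 2.x of the session above); `garbled`, `original_bytes` and `fixed` are illustrative names.

```python
# Python 3 sketch of steps #4 and #5: undoing a wrong decode.
garbled = '\u00e1\u00e0\u00e1\u00e0-\u00ff\u00e3\u00e0'   # shows up as "áàáà-ÿãà"

# latin1 maps code points 0..255 to the byte with the same value,
# so encoding with it recovers the bytes that were originally typed.
original_bytes = garbled.encode('latin1')
assert original_bytes == b'\xe1\xe0\xe1\xe0-\xff\xe3\xe0'
assert [ord(ch) for ch in garbled] == list(original_bytes)

# Decoding those bytes with the encoding that was really used gives the text.
fixed = original_bytes.decode('cp1251')
assert fixed == '\u0431\u0430\u0431\u0430-\u044f\u0433\u0430'   # "баба-яга"
```

This only works because no information was lost along the way: every byte survived as a code point with the same number.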
There is also a way to use "u''" to represent, say, Cyrillic without specifying an encoding or writing unreadable Unicode code points (that is, "u'\u1234'"). The method is not exactly convenient, but it is interesting: spell out the Unicode character names with the \N{...} escape:
>>> s = u'\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}\N{CYRILLIC SMALL LETTER SHCHA}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER SHORT I}'
>>> print s
кощей
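The names used in \N{...} come from the Unicode character database, which the standard `unicodedata` module exposes in both directions. The sketch below assumes Python 3; the variable names are illustrative.

```python
# Python 3 sketch: character names, forwards and backwards.
import unicodedata

word = ('\N{CYRILLIC SMALL LETTER KA}\N{CYRILLIC SMALL LETTER O}'
        '\N{CYRILLIC SMALL LETTER SHCHA}\N{CYRILLIC SMALL LETTER IE}'
        '\N{CYRILLIC SMALL LETTER SHORT I}')
assert word == '\u043a\u043e\u0449\u0435\u0439'

# name() goes from a character to its name, lookup() from a name to the character.
names = [unicodedata.name(ch) for ch in word]
assert names[2] == 'CYRILLIC SMALL LETTER SHCHA'
assert unicodedata.lookup('CYRILLIC SMALL LETTER SHORT I') == '\u0439'

for ch, name in zip(word, names):
    print('U+%04X' % ord(ch), name)
```

This is handy for identifying a suspicious character in a string without having to print it on a terminal that may not be able to display it.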
Well, that seems to be all. The main advice: don't confuse "encode" with "decode", and understand the difference between bytes and characters.
Python 3
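No code here, since I have no experience with it. Witnesses claim that everything there is much simpler and more fun. Whoever takes it upon themselves to demonstrate, on cats, the difference between here (Python 2.x) and there (Python 3.x) has my respect.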
Useful links
Since we are on the subject of encodings, I'll recommend a resource that helps defeat krakozyabry from time to time: http://2cyr.com/decode/?lang=en
Once again, the link to Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Unicode HOWTO is the official document on where, how and why Unicode is used in Python 2.x.
Thank you for your attention. I would be grateful for comments by private message.
P.S. I was sent a link to a translation of Spolsky's article, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.