Merged revisions 68133-68134,68141-68142,68145-68146,68148-68149,68159-68162,68166,68171-68174,68179,68195-68196,68210,68214-68215,68217-68222 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk
........
r68133 | antoine.pitrou | 2009-01-01 16:38:03 +0100 (Thu, 01 Jan 2009) | 1 line
fill in actual issue number in tests
........
r68134 | hirokazu.yamamoto | 2009-01-01 16:45:39 +0100 (Thu, 01 Jan 2009) | 2 lines
Issue #4797: IOError.filename was not set when _fileio.FileIO failed to open
file with `str' filename on Windows.
........
r68141 | benjamin.peterson | 2009-01-01 17:43:12 +0100 (Thu, 01 Jan 2009) | 1 line
fix highlighting
........
r68142 | benjamin.peterson | 2009-01-01 18:29:49 +0100 (Thu, 01 Jan 2009) | 2 lines
welcome to 2009, Python!
........
r68145 | amaury.forgeotdarc | 2009-01-02 01:03:54 +0100 (Fri, 02 Jan 2009) | 5 lines
#4801 _collections module fails to build on cygwin.
_PyObject_GC_TRACK is the macro version of PyObject_GC_Track,
and according to documentation it should not be used for extension modules.
........
r68146 | ronald.oussoren | 2009-01-02 11:44:46 +0100 (Fri, 02 Jan 2009) | 2 lines
Fix for issue4472: "configure --enable-shared doesn't work on OSX"
........
r68148 | ronald.oussoren | 2009-01-02 11:48:31 +0100 (Fri, 02 Jan 2009) | 2 lines
Forgot to add a NEWS item in my previous checkin
........
r68149 | ronald.oussoren | 2009-01-02 11:50:48 +0100 (Fri, 02 Jan 2009) | 2 lines
Fix for issue4780
........
r68159 | ronald.oussoren | 2009-01-02 15:48:17 +0100 (Fri, 02 Jan 2009) | 2 lines
Fix for issue 1627952
........
r68160 | ronald.oussoren | 2009-01-02 15:52:09 +0100 (Fri, 02 Jan 2009) | 2 lines
Fix for issue 1737832
........
r68161 | ronald.oussoren | 2009-01-02 16:00:05 +0100 (Fri, 02 Jan 2009) | 3 lines
Fix for issue 1149804
........
r68162 | ronald.oussoren | 2009-01-02 16:06:00 +0100 (Fri, 02 Jan 2009) | 3 lines
The fix for issue 4472 is incompatible with Cygwin; this patch
should fix that.
........
r68166 | benjamin.peterson | 2009-01-02 19:26:23 +0100 (Fri, 02 Jan 2009) | 1 line
document PyMemberDef
........
r68171 | georg.brandl | 2009-01-02 21:25:14 +0100 (Fri, 02 Jan 2009) | 3 lines
#4811: fix markup glitches (mostly remains of the conversion),
found by Gabriel Genellina.
........
r68172 | martin.v.loewis | 2009-01-02 21:32:55 +0100 (Fri, 02 Jan 2009) | 2 lines
Issue #4075: Use OutputDebugStringW in Py_FatalError.
........
r68173 | martin.v.loewis | 2009-01-02 21:40:14 +0100 (Fri, 02 Jan 2009) | 2 lines
Issue #4051: Prevent conflict of UNICODE macros in cPickle.
........
r68174 | benjamin.peterson | 2009-01-02 21:47:27 +0100 (Fri, 02 Jan 2009) | 1 line
fix compilation on non-Windows platforms
........
r68179 | raymond.hettinger | 2009-01-02 22:26:45 +0100 (Fri, 02 Jan 2009) | 1 line
Issue #4615. Document how to use itertools for de-duping.
........
r68195 | georg.brandl | 2009-01-03 14:45:15 +0100 (Sat, 03 Jan 2009) | 2 lines
Remove useless string literal.
........
r68196 | georg.brandl | 2009-01-03 15:29:53 +0100 (Sat, 03 Jan 2009) | 2 lines
Fix indentation.
........
r68210 | georg.brandl | 2009-01-03 20:10:12 +0100 (Sat, 03 Jan 2009) | 2 lines
Set eol-style correctly for mp_distributing.py.
........
r68214 | georg.brandl | 2009-01-03 20:44:48 +0100 (Sat, 03 Jan 2009) | 2 lines
Make indentation consistent.
........
r68215 | georg.brandl | 2009-01-03 21:15:14 +0100 (Sat, 03 Jan 2009) | 2 lines
Fix role name.
........
r68217 | georg.brandl | 2009-01-03 21:30:15 +0100 (Sat, 03 Jan 2009) | 2 lines
Add rstlint, a little tool to find subtle markup problems and inconsistencies in the Doc sources.
........
r68218 | georg.brandl | 2009-01-03 21:38:59 +0100 (Sat, 03 Jan 2009) | 2 lines
Recognize usage of the default role.
........
r68219 | georg.brandl | 2009-01-03 21:47:01 +0100 (Sat, 03 Jan 2009) | 2 lines
Fix uses of the default role.
........
r68220 | georg.brandl | 2009-01-03 21:55:06 +0100 (Sat, 03 Jan 2009) | 2 lines
Remove trailing whitespace.
........
r68221 | georg.brandl | 2009-01-03 22:04:55 +0100 (Sat, 03 Jan 2009) | 2 lines
Remove tabs from the documentation.
........
r68222 | georg.brandl | 2009-01-03 22:11:58 +0100 (Sat, 03 Jan 2009) | 2 lines
Disable the line length checker by default.
........
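Note: the trailing-whitespace and tab cleanups listed above (r68220, r68221)
are the kind of problem a lint pass can flag mechanically. As a rough
illustration only, and not rstlint's actual code, such a check might look like
the Python sketch below. The function name check_whitespace and the
max_line_length parameter are hypothetical; the length check stays off unless a
limit is passed, in the spirit of r68222.

```python
# Hypothetical sketch of a whitespace lint pass; NOT rstlint's real code.
def check_whitespace(lines, max_line_length=None):
    """Yield (line_number, message) for each problem found.

    max_line_length=None (the default) skips the line length check.
    """
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")  # ignore the line terminator itself
        if "\t" in text:
            yield lineno, "tab character"
        if text != text.rstrip():
            yield lineno, "trailing whitespace"
        if max_line_length is not None and len(text) > max_line_length:
            yield lineno, "line longer than %d characters" % max_line_length


# Made-up sample input: line 2 is whitespace only, line 3 starts with a tab.
sample = ["Unicode HOWTO\n", "    \n", "\tPRINT \"FICHIER\"\n", "clean line\n"]
problems = list(check_whitespace(sample))
for lineno, message in problems:
    print("%d: %s" % (lineno, message))
```

Run on the made-up sample, this reports trailing whitespace on line 2 and a
tab character on line 3. Passing max_line_length=80 would additionally flag
over-long lines.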
diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst
index d5dec63..c09a72d 100644
--- a/Doc/howto/unicode.rst
+++ b/Doc/howto/unicode.rst
@@ -30,8 +30,8 @@
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::
- PRINT "FICHIER EST COMPLETE."
- PRINT "CARACTERE NON ACCEPTE."
+ PRINT "FICHIER EST COMPLETE."
+ PRINT "CARACTERE NON ACCEPTE."
Those messages should contain accents, and they just look wrong to someone who
can read French.
@@ -89,11 +89,11 @@
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
of tables listing characters and their corresponding code points::
- 0061 'a'; LATIN SMALL LETTER A
- 0062 'b'; LATIN SMALL LETTER B
- 0063 'c'; LATIN SMALL LETTER C
- ...
- 007B '{'; LEFT CURLY BRACKET
+ 0061 'a'; LATIN SMALL LETTER A
+ 0062 'b'; LATIN SMALL LETTER B
+ 0063 'c'; LATIN SMALL LETTER C
+ ...
+ 007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
@@ -122,8 +122,8 @@
representation, the string "Python" would look like this::
P y t h o n
- 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
- 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
+ 0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
+ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward but using it presents a number of
problems.
@@ -181,7 +181,7 @@
between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
byte of the sequence is between 128 and 255.
-
+
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
@@ -252,7 +252,7 @@
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
- UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
+ UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
The ``errors`` argument specifies the response when the input string can't be
@@ -264,7 +264,7 @@
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
- UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
+ UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
@@ -350,7 +350,7 @@
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
-
+
The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module. However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
@@ -362,8 +362,8 @@
The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.
-
-
+
+
Unicode Literals in Python Source Code
--------------------------------------
@@ -381,10 +381,10 @@
>>> s = u"a\xac\u1234\u20ac\U00008000"
^^^^ two-digit hex escape
- ^^^^^^ four-digit Unicode escape
+ ^^^^^^ four-digit Unicode escape
^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
- ...
+ ...
97 172 4660 8364 32768
Using escape sequences for code points greater than 127 is fine in small doses,
@@ -404,10 +404,10 @@
#!/usr/bin/env python
# -*- coding: latin-1 -*-
-
+
u = u'abcdé'
print ord(u[-1])
-
+
The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
@@ -427,10 +427,10 @@
When you run it with Python 2.4, it will output the following warning::
amk:~$ python p263.py
- sys:1: DeprecationWarning: Non-ASCII character '\xe9'
- in file p263.py on line 2, but no encoding declared;
+ sys:1: DeprecationWarning: Non-ASCII character '\xe9'
+ in file p263.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
-
+
Unicode Properties
------------------
@@ -446,13 +446,13 @@
prints the numeric value of one particular character::
import unicodedata
-
+
u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
-
+
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
-
+
# Get numeric value of second character
print unicodedata.numeric(u[1])
@@ -597,25 +597,25 @@
path will return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::
- fn = u'filename\u4500abc'
- f = open(fn, 'w')
- f.close()
+ fn = u'filename\u4500abc'
+ f = open(fn, 'w')
+ f.close()
- import os
- print os.listdir('.')
- print os.listdir(u'.')
+ import os
+ print os.listdir('.')
+ print os.listdir(u'.')
will produce the following output::
- amk:~$ python t.py
- ['.svn', 'filename\xe4\x94\x80abc', ...]
- [u'.svn', u'filename\u4500abc', ...]
+ amk:~$ python t.py
+ ['.svn', 'filename\xe4\x94\x80abc', ...]
+ [u'.svn', u'filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
-
+
Tips for Writing Unicode-aware Programs
---------------------------------------
@@ -661,7 +661,7 @@
unicode_name = filename.decode(encoding)
f = open(unicode_name, 'r')
# ... return contents of file ...
-
+
However, if an attacker could specify the ``'base64'`` encoding, they could pass
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
@@ -697,32 +697,32 @@
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
-.. comment
+.. comment
Original outline:
- [ ] Unicode introduction
- [ ] ASCII
- [ ] Terms
- - [ ] Character
- - [ ] Code point
- - [ ] Encodings
- - [ ] Common encodings: ASCII, Latin-1, UTF-8
+ - [ ] Character
+ - [ ] Code point
+ - [ ] Encodings
+ - [ ] Common encodings: ASCII, Latin-1, UTF-8
- [ ] Unicode Python type
- - [ ] Writing unicode literals
- - [ ] Obscurity: -U switch
- - [ ] Built-ins
- - [ ] unichr()
- - [ ] ord()
- - [ ] unicode() constructor
- - [ ] Unicode type
- - [ ] encode(), decode() methods
+ - [ ] Writing unicode literals
+ - [ ] Obscurity: -U switch
+ - [ ] Built-ins
+ - [ ] unichr()
+ - [ ] ord()
+ - [ ] unicode() constructor
+ - [ ] Unicode type
+ - [ ] encode(), decode() methods
- [ ] Unicodedata module for character properties
- [ ] I/O
- - [ ] Reading/writing Unicode data into files
- - [ ] Byte-order marks
- - [ ] Unicode filenames
+ - [ ] Reading/writing Unicode data into files
+ - [ ] Byte-order marks
+ - [ ] Unicode filenames
- [ ] Writing Unicode programs
- - [ ] Do everything in Unicode
- - [ ] Declaring source code encodings (PEP 263)
+ - [ ] Do everything in Unicode
+ - [ ] Declaring source code encodings (PEP 263)
- [ ] Other issues
- - [ ] Building Python (UCS2, UCS4)
+ - [ ] Building Python (UCS2, UCS4)