| ============================================================================= |
| Python Unicode Integration Proposal Version: 1.7 |
| ----------------------------------------------------------------------------- |
| |
| |
| Introduction: |
| ------------- |
| |
| The idea of this proposal is to add native Unicode 3.0 support to |
| Python in a way that makes use of Unicode strings as simple as |
| possible without introducing too many pitfalls along the way. |
| |
| Since this goal is not easy to achieve (strings being one of the |
| most fundamental objects in Python), we expect this proposal to |
| undergo some significant refinements. |
| |
| Note that the current version of this proposal is still a bit unsorted |
| due to the many different aspects of the Unicode-Python integration. |
| |
| The latest version of this document is always available at: |
| |
| http://starship.python.net/~lemburg/unicode-proposal.txt |
| |
| Older versions are available as: |
| |
| http://starship.python.net/~lemburg/unicode-proposal-X.X.txt |
| |
| |
| Conventions: |
| ------------ |
| |
| · In examples we use u = Unicode object and s = Python string |
| |
| · 'XXX' markings indicate points of discussion (PODs) |
| |
| |
| General Remarks: |
| ---------------- |
| |
| · Unicode encoding names should be lower case on output and |
| case-insensitive on input (they will be converted to lower case |
| by all APIs taking an encoding name as input). |
| |
| · Encoding names should follow the name conventions as used by the |
| Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is |
| written as 'utf-16'. |
| |
| · Codec modules should use the same names, but with hyphens converted |
| to underscores, e.g. utf_8, utf_16, iso_8859_1. |
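| |
| The naming rules above can be sketched as two small helpers. This is |
| only an illustration; the function names normalize_encoding_name() |
| and codec_module_name() are hypothetical and not part of the proposal: |
| |

```python
def normalize_encoding_name(name):
    """Lower-case the name and convert runs of spaces to single hyphens."""
    return "-".join(name.lower().split())


def codec_module_name(name):
    """Derive the codec module name: hyphens become underscores."""
    return normalize_encoding_name(name).replace("-", "_")


print(normalize_encoding_name("UTF 16"))   # utf-16
print(codec_module_name("ISO-8859-1"))     # iso_8859_1
```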
| |
| |
| Unicode Default Encoding: |
| ------------------------- |
| |
| The Unicode implementation has to make some assumption about the |
| encoding of 8-bit strings passed to it for coercion and about the |
| encoding to use as default for conversion of Unicode to strings when no |
| specific encoding is given. This encoding is called <default encoding> |
| throughout this text. |
| |
| For this, the implementation maintains a global which can be set in |
| the site.py Python startup script. Subsequent changes are not |
| possible. The <default encoding> can be set and queried using the |
| two sys module APIs: |
| |
| sys.setdefaultencoding(encoding) |
| --> Sets the <default encoding> used by the Unicode implementation. |
| encoding has to be an encoding which is supported by the Python |
| installation, otherwise, a LookupError is raised. |
| |
| Note: This API is only available in site.py ! It is removed |
| from the sys module by site.py after usage. |
| |
| sys.getdefaultencoding() |
| --> Returns the current <default encoding>. |
| |
| If not otherwise defined or set, the <default encoding> defaults to |
| 'ascii'. This encoding is also the startup default of Python (and in |
| effect before site.py is executed). |
| |
| Note that the default site.py startup module contains disabled |
| optional code which can set the <default encoding> according to the |
| encoding defined by the current locale. The locale module is used to |
| extract the encoding from the locale default settings defined by the |
| OS environment (see locale.py). If the encoding cannot be determined, |
| is unknown or unsupported, the code defaults to setting the <default |
| encoding> to 'ascii'. To enable this code, edit the site.py file or |
| place the appropriate code into the sitecustomize.py module of your |
| Python installation. |
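| |
| The locale-based selection with its 'ascii' fallback could look roughly |
| like the sketch below. It is written for a current Python 3 and only |
| computes a candidate name; it does not call sys.setdefaultencoding(). |
| The helper name guess_default_encoding() is hypothetical: |
| |

```python
import codecs
import locale


def guess_default_encoding(fallback="ascii"):
    """Return the locale's encoding if Python supports it, else the fallback."""
    try:
        name = locale.getpreferredencoding(False)
    except Exception:
        return fallback          # encoding could not be determined
    if not name:
        return fallback
    try:
        codecs.lookup(name)      # raises LookupError for unsupported names
    except LookupError:
        return fallback
    return name.lower()


print(guess_default_encoding())
```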
| |
| |
| Unicode Constructors: |
| --------------------- |
| |
| Python should provide a built-in constructor for Unicode strings which |
| is available through __builtins__: |
| |
| u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"]) |
| |
| u = u'<unicode-escape encoded Python string>' |
| |
| u = ur'<raw-unicode-escape encoded Python string>' |
| |
| With the 'unicode-escape' encoding being defined as: |
| |
| · all non-escape characters represent themselves as Unicode ordinal |
| (e.g. 'a' -> U+0061). |
| |
| · all existing defined Python escape sequences are interpreted as |
| Unicode ordinals; note that \xXXXX can represent all Unicode |
| ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF. |
| |
| · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax |
| error to have fewer than 4 digits after \u. |
| |
| For an explanation of possible values for errors see the Codec section |
| below. |
| |
| Examples: |
| |
| u'abc' -> U+0061 U+0062 U+0063 |
| u'\u1234' -> U+1234 |
| u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A |
| |
| The 'raw-unicode-escape' encoding is defined as follows: |
| |
| · \uXXXX sequences represent the U+XXXX Unicode character if and |
| only if the number of leading backslashes is odd |
| |
| · all other characters represent themselves as Unicode ordinal |
| (e.g. 'b' -> U+0062) |
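| |
| The two escape encodings can be tried out with the codecs of the same |
| names in a current Python 3 (an assumption of this sketch: there the |
| escaped source text is a bytes object and the result is a str): |
| |

```python
# Escaped text as bytes: a, b, c, backslash-u-1-2-3-4, backslash-n.
escaped = b"abc\\u1234\\n"
decoded = escaped.decode("unicode_escape")
# The \n escape decodes to LINE FEED, i.e. U+000A.
print(["U+%04X" % ord(ch) for ch in decoded])

# Raw variant: one leading backslash (odd) starts an escape,
# two leading backslashes (even) are kept as literal characters.
odd = b"\\u1234".decode("raw_unicode_escape")
even = b"\\\\u1234".decode("raw_unicode_escape")
print(len(odd), len(even))   # 1 7
```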
| |
| |
| Note that you should provide some hint to the encoding you used to |
| write your programs as a pragma line in one of the first few comment lines |
| of the source file (e.g. '# source file encoding: latin-1'). If you |
| only use 7-bit ASCII then everything is fine and no such notice is |
| needed, but if you include Latin-1 characters not defined in ASCII, it |
| may well be worthwhile including a hint since people in other |
| countries will want to be able to read your source strings too. |
| |
| |
| Unicode Type Object: |
| -------------------- |
| |
| Unicode objects should have the type UnicodeType with type name |
| 'unicode', made available through the standard types module. |
| |
| |
| Unicode Output: |
| --------------- |
| |
| Unicode objects have a method .encode([encoding=<default encoding>]) |
| which returns a Python string encoding the Unicode string using the |
| given scheme (see Codecs). |
| |
| print u := print u.encode() # using the <default encoding> |
| |
| str(u) := u.encode() # using the <default encoding> |
| |
| repr(u) := "u%s" % repr(u.encode('unicode-escape')) |
| |
| Also see Internal Argument Parsing and Buffer Interface for details on |
| how other APIs written in C will treat Unicode objects. |
| |
| |
| Unicode Ordinals: |
| ----------------- |
| |
| Since Unicode 3.0 has a 32-bit ordinal character set, the implementation |
| should provide 32-bit aware ordinal conversion APIs: |
| |
| ord(u[:1]) (this is the standard ord() extended to work with Unicode |
| objects) |
| --> Unicode ordinal number (32-bit) |
| |
| unichr(i) |
| --> Unicode object for character i (provided it is 32-bit); |
| ValueError otherwise |
| |
| Both APIs should go into __builtins__ just like their string |
| counterparts ord() and chr(). |
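| |
| A short illustration of the two conversions. It assumes a current |
| Python 3, where the built-in chr() takes the role of the proposed |
| unichr(): |
| |

```python
code_point = ord("\u1234")
print(hex(code_point))              # 0x1234
print(chr(0x1234) == "\u1234")      # True

try:
    chr(0x110000)                   # beyond the last Unicode code point
except ValueError as exc:
    print("ValueError:", exc)
```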
| |
| Note that Unicode provides space for private encodings. Usage of these |
| can cause different output representations on different machines. This |
| problem is not a Python or Unicode problem, but a machine setup and |
| maintenance one. |
| |
| |
| Comparison & Hash Value: |
| ------------------------ |
| |
| Unicode objects should compare equal to other objects after these |
| other objects have been coerced to Unicode. For strings this means |
| that they are interpreted as Unicode string using the <default |
| encoding>. |
| |
| Unicode objects should return the same hash value as their ASCII |
| equivalent strings. Unicode strings holding non-ASCII values are not |
| guaranteed to return the same hash values as the default encoded |
| equivalent string representation. |
| |
| When compared using cmp() (or PyObject_Compare()) the implementation |
| should mask TypeErrors raised during the conversion to remain in synch |
| with the string behavior. All other errors such as ValueErrors raised |
| during coercion of strings to Unicode should not be masked and passed |
| through to the user. |
| |
| In containment tests ('a' in u'abc' and u'a' in 'abc') both sides |
| should be coerced to Unicode before applying the test. Errors occurring |
| during coercion (e.g. None in u'abc') should not be masked. |
| |
| |
| Coercion: |
| --------- |
| |
| Using Python strings and Unicode objects to form new objects should |
| always coerce to the more precise format, i.e. Unicode objects. |
| |
| u + s := u + unicode(s) |
| |
| s + u := unicode(s) + u |
| |
| All string methods should delegate the call to an equivalent Unicode |
| object method call by converting all involved strings to Unicode and |
| then applying the arguments to the Unicode method of the same name, |
| e.g. |
| |
| string.join((s,u),sep) := (s + sep) + u |
| |
| sep.join((s,u)) := (s + sep) + u |
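| |
| The coercion rule can be sketched as follows. The sketch targets a |
| current Python 3, where bytes stands in for the 8-bit Python string and |
| str for the Unicode object; to_unicode() and join_mixed() are |
| hypothetical helper names and 'ascii' is assumed as <default encoding>: |
| |

```python
def to_unicode(obj, encoding="ascii"):
    """Coerce an 8-bit string (bytes) to text; pass text through unchanged."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)   # may raise UnicodeDecodeError
    if isinstance(obj, str):
        return obj
    raise TypeError("cannot coerce %s to Unicode" % type(obj).__name__)


def join_mixed(sep, items, encoding="ascii"):
    """Join a mix of bytes and text; the result is always text."""
    return to_unicode(sep, encoding).join(
        to_unicode(item, encoding) for item in items)


print(join_mixed(b", ", [b"abc", "d\u00e9f"]))
```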
| |
| For a discussion of %-formatting with respect to Unicode objects, see |
| Formatting Markers. |
| |
| |
| Exceptions: |
| ----------- |
| |
| UnicodeError is defined in the exceptions module as a subclass of |
| ValueError. It is available at the C level via PyExc_UnicodeError. |
| All exceptions related to Unicode encoding/decoding should be |
| subclasses of UnicodeError. |
| |
| |
| Codecs (Coder/Decoders) Lookup: |
| ------------------------------- |
| |
| A Codec (see Codec Interface Definition) search registry should be |
| implemented by a module "codecs": |
| |
| codecs.register(search_function) |
| |
| Search functions are expected to take one argument, the encoding name |
| in all lower case letters and with hyphens and spaces converted to |
| underscores, and return a tuple of functions (encoder, decoder, |
| stream_reader, stream_writer) taking the following arguments: |
| |
| encoder and decoder: |
| These must be functions or methods which have the same |
| interface as the .encode/.decode methods of Codec instances |
| (see Codec Interface). The functions/methods are expected to |
| work in a stateless mode. |
| |
| stream_reader and stream_writer: |
| These need to be factory functions with the following |
| interface: |
| |
| factory(stream,errors='strict') |
| |
| The factory functions must return objects providing |
| the interfaces defined by StreamWriter/StreamReader resp. |
| (see Codec Interface). Stream codecs can maintain state. |
| |
| Possible values for errors are defined in the Codec |
| section below. |
| |
| In case a search function cannot find a given encoding, it should |
| return None. |
| |
| Aliasing support for encodings is left to the search functions |
| to implement. |
| |
| The codecs module will maintain an encoding cache for performance |
| reasons. Encodings are first looked up in the cache. If not found, the |
| list of registered search functions is scanned. If no codecs tuple is |
| found, a LookupError is raised. Otherwise, the codecs tuple is stored |
| in the cache and returned to the caller. |
| |
| To query the Codec instance the following API should be used: |
| |
| codecs.lookup(encoding) |
| |
| This will either return the found codecs tuple or raise a LookupError. |
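| |
| A minimal sketch of registering a search function, written against the |
| codecs module of a current Python 3. The encoding name 'xdemolatin' and |
| the function demo_search() are made up for this example; the search |
| function simply hands back the existing Latin-1 codec entry: |
| |

```python
import codecs


def demo_search(name):
    """Search function: map the made-up name 'xdemolatin' to Latin-1."""
    if name == "xdemolatin":
        return codecs.lookup("iso-8859-1")
    return None   # unknown here; the registry tries the next search function


codecs.register(demo_search)

info = codecs.lookup("XDemoLatin")     # the name is lower-cased before the search
encoder, decoder, stream_reader, stream_writer = info
print("\u00e9".encode("xdemolatin"))   # b'\xe9'

try:
    codecs.lookup("no-such-encoding-xyz")
except LookupError as exc:
    print("LookupError:", exc)
```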
| |
| |
| Standard Codecs: |
| ---------------- |
| |
| Standard codecs should live inside an encodings/ package directory in the |
| Standard Python Code Library. The __init__.py file of that directory should |
| include a Codec Lookup compatible search function implementing a lazy module |
| based codec lookup. |
| |
| Python should provide a few standard codecs for the most relevant |
| encodings, e.g. |
| |
| 'utf-8': 8-bit variable length encoding |
| 'utf-16': 16-bit variable length encoding (little/big endian) |
| 'utf-16-le': utf-16 but explicitly little endian |
| 'utf-16-be': utf-16 but explicitly big endian |
| 'ascii': 7-bit ASCII codepage |
| 'iso-8859-1': ISO 8859-1 (Latin 1) codepage |
| 'unicode-escape': See Unicode Constructors for a definition |
| 'raw-unicode-escape': See Unicode Constructors for a definition |
| 'native': Dump of the Internal Format used by Python |
| |
| Common aliases should also be provided per default, e.g. 'latin-1' |
| for 'iso-8859-1'. |
| |
| Note: 'utf-16' should be implemented by using and requiring byte order |
| marks (BOM) for file input/output. |
| |
| All other encodings such as the CJK ones to support Asian scripts |
| should be implemented in separate packages which do not get included |
| in the core Python distribution and are not a part of this proposal. |
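| |
| The byte layouts produced by some of these codecs can be compared with |
| a few lines of Python. The sketch assumes a current Python 3, in which |
| these codec names and the 'latin-1' alias are available: |
| |

```python
import codecs

text = "a\u20ac"   # LATIN SMALL LETTER A followed by EURO SIGN

print(text.encode("utf-8").hex())       # 61e282ac
print(text.encode("utf-16-le").hex())   # 6100ac20
print(text.encode("utf-16-be").hex())   # 006120ac

# Plain 'utf-16' output starts with a byte order mark.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))   # True

# 'latin-1' resolves to the same codec as 'iso-8859-1'.
print(codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name)
```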
| |
| |
| Codecs Interface Definition: |
| ---------------------------- |
| |
| The following base classes should be defined in the module |
| "codecs". They provide not only templates for use by encoding module |
| implementors, but also define the interface which is expected by the |
| Unicode implementation. |
| |
| Note that the Codec Interface defined here is well suited for a |
| larger range of applications. The Unicode implementation expects |
| Unicode objects on input for .encode() and .write() and character |
| buffer compatible objects on input for .decode(). Output of .encode() |
| and .read() should be a Python string and .decode() must return a |
| Unicode object. |
| |
| First, we have the stateless encoders/decoders. These do not work in |
| chunks as the stream codecs (see below) do, because all components are |
| expected to be available in memory. |
| |
| class Codec: |
| |
|     """ Defines the interface for stateless encoders/decoders. |
| |
|         The .encode()/.decode() methods may implement different error |
|         handling schemes by providing the errors argument. These |
|         string values are defined: |
| |
|         'strict'  - raise a ValueError (or a subclass) |
|         'ignore'  - ignore the character and continue with the next |
|         'replace' - replace with a suitable replacement character; |
|                     Python will use the official U+FFFD REPLACEMENT |
|                     CHARACTER for the builtin Unicode codecs. |
| |
|     """ |
|     def encode(self, input, errors='strict'): |
| |
|         """ Encodes the object input and returns a tuple (output |
|             object, length consumed). |
| |
|             errors defines the error handling to apply. It defaults to |
|             'strict' handling. |
| |
|             The method may not store state in the Codec instance. Use |
|             StreamWriter/StreamReader for codecs which have to keep |
|             state in order to make encoding/decoding efficient. |
| |
|         """ |
|         ... |
| |
|     def decode(self, input, errors='strict'): |
| |
|         """ Decodes the object input and returns a tuple (output |
|             object, length consumed). |
| |
|             input must be an object which provides the bf_getreadbuf |
|             buffer slot. Python strings, buffer objects and memory |
|             mapped files are examples of objects providing this slot. |
| |
|             errors defines the error handling to apply. It defaults to |
|             'strict' handling. |
| |
|             The method may not store state in the Codec instance. Use |
|             StreamWriter/StreamReader for codecs which have to keep |
|             state in order to make encoding/decoding efficient. |
| |
|         """ |
|         ... |
| |
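| The (output object, length consumed) return value and the three error |
| handling schemes can be observed with the stateless encoder/decoder |
| functions of a current Python 3 codecs module. This is an illustration |
| only; bytes stands in for the 8-bit Python string here: |
| |

```python
import codecs

encode = codecs.getencoder("ascii")
decode = codecs.getdecoder("ascii")

print(encode("abc"))                      # (b'abc', 3)
print(decode(b"abc"))                     # ('abc', 3)

# The byte 0xFF is not valid ASCII, so the errors argument decides.
print(decode(b"a\xffb", "ignore")[0])     # ab
print(decode(b"a\xffb", "replace")[0] == "a\ufffdb")   # True
try:
    decode(b"a\xffb", "strict")
except ValueError as exc:                 # UnicodeDecodeError is a ValueError
    print(type(exc).__name__)
```

| |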
| StreamWriter and StreamReader define the interface for stateful |
| encoders/decoders which work on streams. These allow processing of the |
| data in chunks to efficiently use memory. If you have large strings in |
| memory, you may want to wrap them with cStringIO objects and then use |
| these codecs on them to be able to do chunk processing as well, |
| e.g. to provide progress information to the user. |
| |
| class StreamWriter(Codec): |
| |
|     def __init__(self, stream, errors='strict'): |
| |
|         """ Creates a StreamWriter instance. |
| |
|             stream must be a file-like object open for writing |
|             (binary) data. |
| |
|             The StreamWriter may implement different error handling |
|             schemes by providing the errors keyword argument. These |
|             parameters are defined: |
| |
|             'strict'  - raise a ValueError (or a subclass) |
|             'ignore'  - ignore the character and continue with the next |
|             'replace' - replace with a suitable replacement character |
| |
|         """ |
|         self.stream = stream |
|         self.errors = errors |
| |
|     def write(self, object): |
| |
|         """ Writes the object's contents encoded to self.stream. |
|         """ |
|         data, consumed = self.encode(object, self.errors) |
|         self.stream.write(data) |
| |
|     def writelines(self, list): |
| |
|         """ Writes the concatenated list of strings to the stream |
|             using .write(). |
|         """ |
|         self.write(''.join(list)) |
| |
|     def reset(self): |
| |
|         """ Flushes and resets the codec buffers used for keeping state. |
| |
|             Calling this method should ensure that the data on the |
|             output is put into a clean state, that allows appending |
|             of new fresh data without having to rescan the whole |
|             stream to recover state. |
| |
|         """ |
|         pass |
| |
|     def __getattr__(self, name, getattr=getattr): |
| |
|         """ Inherit all other methods from the underlying stream. |
|         """ |
|         return getattr(self.stream, name) |
| |
| class StreamReader(Codec): |
| |
|     def __init__(self, stream, errors='strict'): |
| |
|         """ Creates a StreamReader instance. |
| |
|             stream must be a file-like object open for reading |
|             (binary) data. |
| |
|             The StreamReader may implement different error handling |
|             schemes by providing the errors keyword argument. These |
|             parameters are defined: |
| |
|             'strict'  - raise a ValueError (or a subclass) |
|             'ignore'  - ignore the character and continue with the next |
|             'replace' - replace with a suitable replacement character |
| |
|         """ |
|         self.stream = stream |
|         self.errors = errors |
| |
|     def read(self, size=-1): |
| |
|         """ Decodes data from the stream self.stream and returns the |
|             resulting object. |
| |
|             size indicates the approximate maximum number of bytes to |
|             read from the stream for decoding purposes. The decoder |
|             can modify this setting as appropriate. The default value |
|             -1 indicates to read and decode as much as possible. size |
|             is intended to prevent having to decode huge files in one |
|             step. |
| |
|             The method should use a greedy read strategy meaning that |
|             it should read as much data as is allowed within the |
|             definition of the encoding and the given size, e.g. if |
|             optional encoding endings or state markers are available |
|             on the stream, these should be read too. |
| |
|         """ |
|         # Unsliced reading: |
|         if size < 0: |
|             return self.decode(self.stream.read(), self.errors)[0] |
| |
|         # Sliced reading: |
|         read = self.stream.read |
|         decode = self.decode |
|         data = read(size) |
|         i = 0 |
|         while 1: |
|             try: |
|                 object, decodedbytes = decode(data, self.errors) |
|             except ValueError, why: |
|                 # This method is slow but should work under pretty much |
|                 # all conditions; at most 10 tries are made |
|                 i = i + 1 |
|                 newdata = read(1) |
|                 if not newdata or i > 10: |
|                     raise |
|                 data = data + newdata |
|             else: |
|                 return object |
| |
|     def readline(self, size=None): |
| |
|         """ Read one line from the input stream and return the |
|             decoded data. |
| |
|             Note: Unlike the .readlines() method, this method inherits |
|             the line breaking knowledge from the underlying stream's |
|             .readline() method -- there is currently no support for |
|             line breaking using the codec decoder due to lack of line |
|             buffering. Subclasses should however, if possible, try to |
|             implement this method using their own knowledge of line |
|             breaking. |
| |
|             size, if given, is passed as size argument to the stream's |
|             .readline() method. |
| |
|         """ |
|         if size is None: |
|             line = self.stream.readline() |
|         else: |
|             line = self.stream.readline(size) |
|         return self.decode(line, self.errors)[0] |
| |
|     def readlines(self, sizehint=None): |
| |
|         """ Read all lines available on the input stream |
|             and return them as list of lines. |
| |
|             Line breaks are implemented using the codec's decoder |
|             method and are included in the list entries. |
| |
|             sizehint, if given, is passed as size argument to the |
|             stream's .read() method. |
| |
|         """ |
|         if sizehint is None: |
|             data = self.stream.read() |
|         else: |
|             data = self.stream.read(sizehint) |
|         return self.decode(data, self.errors)[0].splitlines(1) |
| |
|     def reset(self): |
| |
|         """ Resets the codec buffers used for keeping state. |
| |
|             Note that no stream repositioning should take place. |
|             This method is primarily intended to be able to recover |
|             from decoding errors. |
| |
|         """ |
|         pass |
| |
|     def __getattr__(self, name, getattr=getattr): |
| |
|         """ Inherit all other methods from the underlying stream. |
|         """ |
|         return getattr(self.stream, name) |
| |
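| The stream interfaces above can be exercised with the factory functions |
| of a current Python 3 codecs module. In this sketch an io.BytesIO |
| object stands in for a binary file, and codecs.getwriter() and |
| codecs.getreader() return the stream_writer and stream_reader |
| factories for the named encoding: |
| |

```python
import codecs
import io

raw = io.BytesIO()                            # stands in for a binary file
writer = codecs.getwriter("utf-8")(raw, errors="strict")
writer.write("line 1 \u00e9\n")
writer.writelines(["line 2\n", "line 3\n"])

raw.seek(0)
reader = codecs.getreader("utf-8")(raw, errors="strict")
first = reader.readline()                     # decoded text, line end kept
rest = reader.readlines()                     # remaining lines as a list
print(repr(first), rest)
```

| |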
| |
| Stream codec implementors are free to combine the StreamWriter and |
| StreamReader interfaces into one class. Even combining all these with |
| the Codec class should be possible. |
| |
| Implementors are free to add additional methods to enhance the codec |
| functionality or provide extra state information needed for them to |
| work. The internal codec implementation will only use the above |
| interfaces, though. |
| |
| It is not required by the Unicode implementation to use these base |
| classes, only the interfaces must match; this allows writing Codecs as |
| extension types. |
| |
| As a guideline, large mapping tables should be implemented using static |
| C data in separate (shared) extension modules. That way multiple |
| processes can share the same data. |
| |
| A tool to auto-convert Unicode mapping files to mapping modules should be |
| provided to simplify support for additional mappings (see References). |
| |
| |
| Whitespace: |
| ----------- |
| |
| The .split() method will have to know about what is considered |
| whitespace in Unicode. |
| |
| |
| Case Conversion: |
| ---------------- |
| |
| Case conversion is rather complicated with Unicode data, since there |
| are many different conditions to respect. See |
| |
| http://www.unicode.org/unicode/reports/tr13/ |
| |
| for some guidelines on implementing case conversion. |
| |
| For Python, we should only implement the 1-1 conversions included in |
| Unicode. Locale dependent and other special case conversions (see the |
| Unicode standard file SpecialCasing.txt) should be left to user land |
| routines and not go into the core interpreter. |
| |
| The methods .capitalize() and .iscapitalized() should follow the case |
| mapping algorithm defined in the above technical report as closely as |
| possible. |
| |
| |
| Line Breaks: |
| ------------ |
| |
| Line breaking should be done for all Unicode characters having the B |
| property as well as the combinations CRLF, CR, LF (interpreted in that |
| order) and other special line separators defined by the standard. |
| |
| The Unicode type should provide a .splitlines() method which returns a |
| list of lines according to the above specification. See Unicode |
| Methods. |
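| |
| For illustration, the .splitlines() method of text strings in a current |
| Python 3 behaves along these lines (the second call passes a true value |
| to keep the line ends in the result): |
| |

```python
text = "one\r\ntwo\rthree\nfour\u2028five"   # U+2028 is LINE SEPARATOR

print(text.splitlines())
# ['one', 'two', 'three', 'four', 'five']

kept = text.splitlines(True)
print(kept[0] == "one\r\n")                  # True: CRLF counts as one break
```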
| |
| |
| Unicode Character Properties: |
| ----------------------------- |
| |
| A separate module "unicodedata" should provide a compact interface to |
| all Unicode character properties defined in the standard's |
| UnicodeData.txt file. |
| |
| Among other things, these properties provide ways to recognize |
| numbers, digits, spaces, whitespace, etc. |
| |
| Since this module will have to provide access to all Unicode |
| characters, it will eventually have to contain the data from |
| UnicodeData.txt which takes up around 600kB. For this reason, the data |
| should be stored in static C data. This enables compilation as shared |
| module which the underlying OS can share between processes (unlike |
| normal Python code modules). |
| |
| There should be a standard Python interface for accessing this information |
| so that other implementors can plug in their own possibly enhanced versions, |
| e.g. ones that do decompressing of the data on-the-fly. |
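| |
| As an illustration of such an interface, the unicodedata module of a |
| current Python 3 exposes the UnicodeData.txt properties through |
| functions like these: |
| |

```python
import unicodedata

print(unicodedata.name("a"))            # LATIN SMALL LETTER A
print(unicodedata.category("A"))        # Lu (uppercase letter)
print(unicodedata.decimal("\u0663"))    # 3  (ARABIC-INDIC DIGIT THREE)
print(unicodedata.category(" "))        # Zs (space separator)
print(unicodedata.bidirectional("\n"))  # B  (paragraph separator)
```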
| |
| |
| Private Code Point Areas: |
| ------------------------- |
| |
| Support for these is left to user land Codecs and not explicitly |
| integrated into the core. Note that due to the Internal Format being |
| implemented, only the area between \uE000 and \uF8FF is usable for |
| private encodings. |
| |
| |
| Internal Format: |
| ---------------- |
| |
| The internal format for Unicode objects should use a Python specific |
| fixed format <PythonUnicode> implemented as 'unsigned short' (or |
| another unsigned numeric type having 16 bits). Byte order is platform |
| dependent. |
| |
| This format will hold UTF-16 encodings of the corresponding Unicode |
| ordinals. The Python Unicode implementation will address these values |
| as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all |
| currently defined Unicode character points. UTF-16 without surrogates |
| provides access to about 64k characters and covers all characters in |
| the Basic Multilingual Plane (BMP) of Unicode. |
| |
| It is the Codec's responsibility to ensure that the data they pass to |
| the Unicode object constructor respects this assumption. The |
| constructor does not check the data for Unicode compliance or use of |
| surrogates. |
| |
| Future implementations can extend the 16-bit restriction to the full |
| set of all UTF-16 addressable characters (around 1M characters). |
| |
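| As a worked example of what UTF-16 surrogates add on top of the 16-bit |
| values: a code point above U+FFFF is split into two 16-bit units. The |
| helper to_surrogate_pair() below is a hypothetical name used only for |
| this sketch: |
| |

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the surrogate-encoded range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x10400)
print("%04X %04X" % (high, low))        # D801 DC00
```

| |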
| The Unicode API should provide interface routines from <PythonUnicode> |
| to the compiler's wchar_t which can be 16 or 32 bit depending on the |
| compiler/libc/platform being used. |
| |
| Unicode objects should have a pointer to a cached Python string object |
| <defenc> holding the object's value using the <default encoding>. |
| This is needed for performance and internal parsing (see Internal |
| Argument Parsing) reasons. The buffer is filled when the first |
| conversion request to the <default encoding> is issued on the object. |
| |
| Interning is not needed (for now), since Python identifiers are |
| defined as being ASCII only. |
| |
| codecs.BOM should return the byte order mark (BOM) for the format |
| used internally. The codecs module should provide the following |
| additional constants for convenience and reference (codecs.BOM will |
| either be BOM_BE or BOM_LE depending on the platform): |
| |
| BOM_BE: '\376\377' |
| (corresponds to Unicode U+0000FEFF in UTF-16 on big endian |
| platforms == ZERO WIDTH NO-BREAK SPACE) |
| |
| BOM_LE: '\377\376' |
| (corresponds to Unicode U+0000FFFE in UTF-16 on little endian |
| platforms == defined as being an illegal Unicode character) |
| |
| BOM4_BE: '\000\000\376\377' |
| (corresponds to Unicode U+0000FEFF in UCS-4) |
| |
| BOM4_LE: '\377\376\000\000' |
| (corresponds to Unicode U+0000FFFE in UCS-4) |
| |
| Note that Unicode sees big endian byte order as being "correct". The |
| swapped order is taken to be an indicator for a "wrong" format, hence |
| the illegal character definition. |
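| |
| The 16-bit constants can be checked against the codecs module of a |
| current Python 3, where they are bytes objects (a sketch, not part of |
| the proposal): |
| |

```python
import codecs
import sys

print(codecs.BOM_BE == b"\xfe\xff")    # True
print(codecs.BOM_LE == b"\xff\xfe")    # True

# codecs.BOM is whichever of the two matches the platform byte order.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
print(codecs.BOM == native)            # True

# Reading the little endian BOM as big endian yields U+FFFE, not U+FEFF.
print(hex(int.from_bytes(codecs.BOM_LE, "big")))   # 0xfffe
```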
| |
| The configure script should provide aid in deciding whether Python can |
| use the native wchar_t type or not (it has to be a 16-bit unsigned |
| type). |
| |
| |
| Buffer Interface: |
| ----------------- |
| |
| Implement the buffer interface using the <defenc> Python string object |
| as basis for bf_getcharbuf and the internal buffer for |
| bf_getreadbuf. If bf_getcharbuf is requested and the <defenc> object |
| does not yet exist, it is created first. |
| |
| Note that as a special case, the parser marker "s#" will not return raw |
| Unicode UTF-16 data (which the bf_getreadbuf returns), but instead |
| tries to encode the Unicode object using the default encoding and then |
| returns a pointer to the resulting string object (or raises an |
| exception in case the conversion fails). This was done in order to |
| prevent accidentally writing binary data to an output stream which the |
| other end might not recognize. |
| |
| This has the advantage of being able to write to output streams (which |
| typically use this interface) without additional specification of the |
| encoding to use. |
| |
| If you need to access the read buffer interface of Unicode objects, |
| use the PyObject_AsReadBuffer() interface. |
| |
| The internal format can also be accessed using the 'unicode-internal' |
| codec, e.g. via u.encode('unicode-internal'). |
| |
| |
| Pickle/Marshalling: |
| ------------------- |
| |
| Should have native Unicode object support. The objects should be |
| encoded using platform independent encodings. |
| |
| Marshal should use UTF-8 and Pickle should either choose |
| Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as |
| encoding. Using UTF-8 instead of UTF-16 has the advantage of |
| eliminating the need to store a BOM mark. |
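| |
| The encodings chosen here can be observed in the pickle and marshal |
| modules of a current Python 3. The sketch assumes that protocol 0 |
| corresponds to the text mode and protocol 2 to a binary mode: |
| |

```python
import marshal
import pickle

text = "abc\u1234"

text_pickle = pickle.dumps(text, protocol=0)     # text mode pickle
binary_pickle = pickle.dumps(text, protocol=2)   # binary mode pickle

print(b"\\u1234" in text_pickle)                 # True: raw-unicode-escape
print(text.encode("utf-8") in binary_pickle)     # True: UTF-8 payload
print(pickle.loads(text_pickle) == text)         # True
print(marshal.loads(marshal.dumps(text)) == text)   # True
```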
| |
| |
| Regular Expressions: |
| -------------------- |
| |
| Secret Labs AB is working on a Unicode-aware regular expression |
| machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4 |
| internal character buffers. |
| |
| Also see |
| |
| http://www.unicode.org/unicode/reports/tr18/ |
| |
| for some remarks on how to treat Unicode REs. |
| |
| |
| Formatting Markers: |
| ------------------- |
| |
| Format markers are used in Python format strings. If Python strings |
| are used as format strings, the following interpretations should be in |
| effect: |
| |
| '%s': For Unicode objects this will cause coercion of the |
| whole format string to Unicode. Note that |
| you should use a Unicode format string to start |
| with for performance reasons. |
| |
| In case the format string is a Unicode object, all parameters are coerced |
| to Unicode first and then put together and formatted according to the format |
| string. Numbers are first converted to strings and then to Unicode. |
| |
| '%s': Python strings are interpreted as Unicode |
| string using the <default encoding>. Unicode |
| objects are taken as is. |
| |
| All other string formatters should work accordingly. |
| |
| Example: |
| |
| u"%s %s" % (u"abc", "abc") == u"abc abc" |
| |
| |
| Internal Argument Parsing: |
| -------------------------- |
| |
| These markers are used by the PyArg_ParseTuple() APIs: |
| |
| "U": Check for Unicode object and return a pointer to it |
| |
| "s": For Unicode objects: return a pointer to the object's |
| <defenc> buffer (which uses the <default encoding>). |
| |
| "s#": Access to the default encoded version of the Unicode object |
| (see Buffer Interface); note that the length relates to the length |
| of the default encoded string rather than the Unicode object length. |
| |
| "t#": Same as "s#". |
| |
| "es": |
| Takes two parameters: encoding (const char *) and |
| buffer (char **). |
| |
| The input object is first coerced to Unicode in the usual way |
| and then encoded into a string using the given encoding. |
| |
| On output, a buffer of the needed size is allocated and |
| returned through *buffer as NULL-terminated string. |
| The encoded string may not contain embedded NULL characters. |
| The caller is responsible for calling PyMem_Free() |
| to free the allocated *buffer after usage. |
| |
| "es#": |
| Takes three parameters: encoding (const char *), |
| buffer (char **) and buffer_len (int *). |
| |
| The input object is first coerced to Unicode in the usual way |
| and then encoded into a string using the given encoding. |
| |
| If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer) |
| on input. Output is then copied to *buffer. |
| |
| If *buffer is NULL, a buffer of the needed size is |
| allocated and output copied into it. *buffer is then |
| updated to point to the allocated memory area. |
| The caller is responsible for calling PyMem_Free() |
| to free the allocated *buffer after usage. |
| |
| In both cases *buffer_len is updated to the number of |
| characters written (excluding the trailing NULL-byte). |
| The output buffer is assured to be NULL-terminated. |
| |
| Examples: |
| |
| Using "es#" with auto-allocation: |
| |
| static PyObject * |
| test_parser(PyObject *self, |
|             PyObject *args) |
| { |
|     PyObject *str; |
|     const char *encoding = "latin-1"; |
|     char *buffer = NULL; |
|     int buffer_len = 0; |
| |
|     if (!PyArg_ParseTuple(args, "es#:test_parser", |
|                           encoding, &buffer, &buffer_len)) |
|         return NULL; |
|     if (!buffer) { |
|         PyErr_SetString(PyExc_SystemError, |
|                         "buffer is NULL"); |
|         return NULL; |
|     } |
|     str = PyString_FromStringAndSize(buffer, buffer_len); |
|     PyMem_Free(buffer); |
|     return str; |
| } |
| |
| Using "es" with auto-allocation returning a NULL-terminated string: |
| |
| static PyObject * |
| test_parser(PyObject *self, |
|             PyObject *args) |
| { |
|     PyObject *str; |
|     const char *encoding = "latin-1"; |
|     char *buffer = NULL; |
| |
|     if (!PyArg_ParseTuple(args, "es:test_parser", |
|                           encoding, &buffer)) |
|         return NULL; |
|     if (!buffer) { |
|         PyErr_SetString(PyExc_SystemError, |
|                         "buffer is NULL"); |
|         return NULL; |
|     } |
|     str = PyString_FromString(buffer); |
|     PyMem_Free(buffer); |
|     return str; |
| } |
| |
| Using "es#" with a pre-allocated buffer: |
| |
| static PyObject * |
| test_parser(PyObject *self, |
| PyObject *args) |
| { |
| PyObject *str; |
| const char *encoding = "latin-1"; |
| char _buffer[10]; |
| char *buffer = _buffer; |
| int buffer_len = sizeof(_buffer); |
| |
| if (!PyArg_ParseTuple(args, "es#:test_parser", |
| encoding, &buffer, &buffer_len)) |
| return NULL; |
| if (!buffer) { |
| PyErr_SetString(PyExc_SystemError, |
| "buffer is NULL"); |
| return NULL; |
| } |
| str = PyString_FromStringAndSize(buffer, buffer_len); |
| return str; |
| } |
| |
| |
| File/Stream Output: |
| ------------------- |
| |
| Since file.write(object) and most other stream writers use the "s#" or |
| "t#" argument parsing marker for querying the data to write, the |
| default encoded string version of the Unicode object will be written |
| to the streams (see Buffer Interface). |
| |
| For explicit handling of files using Unicode, the standard stream |
| codecs as available through the codecs module should be used. |
| |
| The codecs module should provide a short-cut open(filename,mode,encoding) |
| which also ensures that mode contains the 'b' character when |
| needed. |
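| |
| A minimal sketch of the short-cut described above, assuming a modern |
| Python in which codecs.open(filename, mode, encoding) is available. The |
| file name and the sample text are made up for illustration only: |
| |

```python
import codecs
import os
import tempfile

text = "Gr\u00fc\u00dfe\n"  # sample text containing non-ASCII characters

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# codecs.open() wraps the file in a stream codec; text written to it
# is encoded as latin-1 on its way to disk.
with codecs.open(path, "w", "latin-1") as f:
    f.write(text)

# Reading the file back in binary mode shows the encoded bytes.
with open(path, "rb") as f:
    raw = f.read()
```

| |
| The caller passes "w" rather than "wb"; the wrapper opens the underlying |
| file in binary mode itself, so no newline translation is applied. |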
| |
| |
| File/Stream Input: |
| ------------------ |
| |
| Only the user knows what encoding the input data uses, so no special |
| magic is applied. The user will have to explicitly convert the string |
| data to Unicode objects as needed or use the file wrappers defined in |
| the codecs module (see File/Stream Output). |
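| |
| Both options can be sketched as follows, assuming a modern Python. The |
| sample bytes and the choice of UTF-8 are assumptions made for the |
| illustration; in practice the encoding has to come from the user: |
| |

```python
import codecs
import io

# Bytes as they might arrive from a file or socket; the producer is
# assumed (for this illustration) to have used UTF-8.
raw = "na\u00efve caf\u00e9\n".encode("utf-8")

# Option 1: read the raw bytes and convert them explicitly.
text = raw.decode("utf-8")

# Option 2: wrap the byte stream in a stream reader from the codecs
# module, which decodes while reading.
reader = codecs.getreader("utf-8")(io.BytesIO(raw))
wrapped_text = reader.read()
```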
| |
| |
| Unicode Methods & Attributes: |
| ----------------------------- |
| |
| All Python string methods, plus: |
| |
| .encode([encoding=<default encoding>][,errors="strict"]) |
| --> see Unicode Output |
| |
| .splitlines([include_breaks=0]) |
| --> breaks the Unicode string into a list of (Unicode) lines; |
| returns the lines with line breaks included, if include_breaks |
| is true. See Line Breaks for a specification of how line breaking |
| is done. |
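| |
| A short sketch of the two methods, assuming a modern Python where every |
| str is a Unicode string. The flag is passed positionally because released |
| Python versions name it keepends rather than include_breaks: |
| |

```python
u = "first\nsecond\r\nthird"

# .encode() turns the Unicode string into an encoded byte string.
encoded = u.encode("utf-8")

# Without the flag, the line breaks are stripped from the result.
lines = u.splitlines()

# With a true flag, each line keeps its original line break.
lines_with_breaks = u.splitlines(True)

# The errors argument controls what happens to unencodable characters;
# "replace" substitutes a question mark instead of raising an error.
replaced = "\u20ac".encode("latin-1", "replace")
```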
| |
| |
| Code Base: |
| ---------- |
| |
| We should use Fredrik Lundh's Unicode object implementation as the basis. |
| It already implements most of the string methods needed and provides a |
| well-written code base which we can build upon. |
| |
| The object sharing implemented in Fredrik's implementation should |
| be dropped. |
| |
| |
| Test Cases: |
| ----------- |
| |
| Test cases should follow those in Lib/test/test_string.py and include |
| additional checks for the Codec Registry and the Standard Codecs. |
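| |
| One possible shape for such a check, sketched under the assumption of a |
| modern Python with codecs.lookup(). The helper name round_trips, the |
| sample text and the list of codecs are hypothetical choices for this |
| illustration, not part of the proposal: |
| |

```python
import codecs

# Sample text mixing ASCII, Latin-1 and a character outside Latin-1.
SAMPLE = "abc \u00e4\u00f6\u00fc \u20ac"

# A few standard codecs that can represent every character in SAMPLE.
CODEC_NAMES = ["utf-8", "utf-16", "unicode-escape"]


def round_trips(name, text):
    """Return True if encoding and then decoding text gives it back."""
    info = codecs.lookup(name)          # codec registry lookup
    encoded, consumed = info.encode(text)
    decoded, _ = info.decode(encoded)
    return decoded == text and consumed == len(text)


results = {name: round_trips(name, SAMPLE) for name in CODEC_NAMES}

# Encoding names are case-insensitive on input and lower case on output.
normalized_name = codecs.lookup("UTF-8").name
```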
| |
| |
| References: |
| ----------- |
| |
| Unicode Consortium: |
| http://www.unicode.org/ |
| |
| Unicode FAQ: |
| http://www.unicode.org/unicode/faq/ |
| |
| Unicode 3.0: |
| http://www.unicode.org/unicode/standard/versions/Unicode3.0.html |
| |
| Unicode-TechReports: |
| http://www.unicode.org/unicode/reports/techreports.html |
| |
| Unicode-Mappings: |
| ftp://ftp.unicode.org/Public/MAPPINGS/ |
| |
| Introduction to Unicode (a little outdated but still nice to read): |
| http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html |
| |
| For comparison: |
| Introducing Unicode to ECMAScript (aka JavaScript) -- |
| http://www-4.ibm.com/software/developer/library/internationalization-support.html |
| |
| IANA Character Set Names: |
| ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets |
| |
| Discussion of UTF-8 and Unicode support for POSIX and Linux: |
| http://www.cl.cam.ac.uk/~mgk25/unicode.html |
| |
| Encodings: |
| |
| Overview: |
| http://czyborra.com/utf/ |
| |
| UCS-2: |
| http://www.uazone.com/multiling/unicode/ucs2.html |
| |
| UTF-7: |
| Defined in RFC2152, e.g. |
| http://www.uazone.com/multiling/ml-docs/rfc2152.txt |
| |
| UTF-8: |
| Defined in RFC2279, e.g. |
| http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt |
| |
| UTF-16: |
| http://www.uazone.com/multiling/unicode/wg2n1035.html |
| |
| |
| History of this Proposal: |
| ------------------------- |
| 1.7: Added note about the changed behaviour of "s#". |
| 1.6: Changed <defencstr> to <defenc> since this is the name used in the |
| implementation. Added notes about the usage of <defenc> in the |
| buffer protocol implementation. |
| 1.5: Added notes about setting the <default encoding>. Fixed some |
| typos (thanks to Andrew Kuchling). Changed <defencstr> to <utf8str>. |
| 1.4: Added note about mixed type comparisons and contains tests. |
| Changed treating of Unicode objects in format strings (if used |
| with '%s' % u they will now cause the format string to be |
| coerced to Unicode, thus producing a Unicode object on return). |
| Added link to IANA charset names (thanks to Lars Marius Garshol). |
| Added new codec methods .readline(), .readlines() and .writelines(). |
| 1.3: Added new "es" and "es#" parser markers |
| 1.2: Removed POD about codecs.open() |
| 1.1: Added note about comparisons and hash values. Added note about |
| case mapping algorithms. Changed stream codecs .read() and |
| .write() method to match the standard file-like object methods |
| (bytes consumed information is no longer returned by the methods) |
| 1.0: changed encode Codec method to be symmetric to the decode method |
| (they both return (object, data consumed) now and thus become |
| interchangeable); removed __init__ method of Codec class (the |
| methods are stateless) and moved the errors argument down to the |
| methods; made the Codec design more generic with respect to the type of input |
| and output objects; changed StreamWriter.flush to StreamWriter.reset |
| in order to avoid overriding the stream's .flush() method; |
| renamed .breaklines() to .splitlines(); renamed the module unicodec |
| to codecs; modified the File I/O section to refer to the stream codecs. |
| 0.9: changed errors keyword argument definition; added 'replace' error |
| handling; changed the codec APIs to accept buffer like objects on |
| input; some minor typo fixes; added Whitespace section and |
| included references for Unicode characters that have the whitespace |
| and the line break characteristic; added note that search functions |
| can expect lower-case encoding names; dropped slicing and offsets |
| in the codec APIs |
| 0.8: added encodings package and raw unicode escape encoding; untabified |
| the proposal; added notes on Unicode format strings; added |
| .breaklines() method |
| 0.7: added a whole new set of codec APIs; added a different encoder |
| lookup scheme; fixed some names |
| 0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding |
| a real Python string object; changed Buffer Interface to delegate |
| requests to <defencstr>'s buffer interface; removed the explicit |
| reference to the unicodec.codecs dictionary (the module can implement |
| this in a way fit for the purpose); removed the settable default |
| encoding; moved UnicodeError from unicodec to exceptions; "s#" |
| now returns the internal data; passed the UCS-2/UTF-16 checking |
| from the Unicode constructor to the Codecs |
| 0.5: moved sys.bom to unicodec.BOM; added sections on case mapping, |
| private use encodings and Unicode character properties |
| 0.4: added Codec interface, notes on %-formatting, changed some encoding |
| details, added comments on stream wrappers, fixed some discussion |
| points (most important: Internal Format), clarified the |
| 'unicode-escape' encoding, added encoding references |
| 0.3: added references, comments on codec modules, the internal format, |
| bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding |
| proposed by Tim Peters and fixed repr(u) accordingly |
| 0.2: integrated Guido's suggestions, added stream codecs and file |
| wrapping |
| 0.1: first version |
| |
| |
| ----------------------------------------------------------------------------- |
| Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com |
| ----------------------------------------------------------------------------- |