Replace the text with a link to the PEP-ified version.
diff --git a/Misc/unicode.txt b/Misc/unicode.txt
index b71e4ca..a252ebe 100644
--- a/Misc/unicode.txt
+++ b/Misc/unicode.txt
@@ -1,1115 +1,5 @@
-=============================================================================
- Python Unicode Integration                            Proposal Version: 1.7
------------------------------------------------------------------------------
+This document has been PEP-ified.  Please see PEP 100 at:
 
+    http://www.python.org/peps/pep-0100.html
 
-Introduction:
--------------
-
-The idea of this proposal is to add native Unicode 3.0 support to
-Python in a way that makes use of Unicode strings as simple as
-possible without introducing too many pitfalls along the way.
-
-Since this goal is not easy to achieve -- strings being one of the
-most fundamental objects in Python -- we expect this proposal to
-undergo some significant refinements.
-
-Note that the current version of this proposal is still a bit unsorted
-due to the many different aspects of the Unicode-Python integration.
-
-The latest version of this document is always available at:
-
-        http://starship.python.net/~lemburg/unicode-proposal.txt
-
-Older versions are available as:
-
-        http://starship.python.net/~lemburg/unicode-proposal-X.X.txt
-
-
-Conventions:
-------------
-
-· In examples we use u = Unicode object and s = Python string
-
-· 'XXX' markings indicate points of discussion (PODs)
-
-
-General Remarks:
-----------------
-
-· Unicode encoding names should be lower case on output and
-  case-insensitive on input (they will be converted to lower case
-  by all APIs taking an encoding name as input).
-
-· Encoding names should follow the name conventions as used by the
-  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
-  written as 'utf-16'.
-
-· Codec modules should use the same names, but with hyphens converted
-  to underscores, e.g. utf_8, utf_16, iso_8859_1.
-
-
-Unicode Default Encoding:
--------------------------
-
-The Unicode implementation has to make some assumptions about the
-encoding of 8-bit strings passed to it for coercion and about the
-encoding to use as default for conversion of Unicode to strings when
-no specific encoding is given. This encoding is called <default
-encoding> throughout this text.
-
-For this, the implementation maintains a global which can be set in
-the site.py Python startup script. Subsequent changes are not
-possible. The <default encoding> can be set and queried using the
-two sys module APIs:
-
-  sys.setdefaultencoding(encoding)
-     --> Sets the <default encoding> used by the Unicode implementation.
-	 encoding has to be an encoding which is supported by the Python
-	 installation, otherwise, a LookupError is raised.
-
-	 Note: This API is only available in site.py! It is removed
-	 from the sys module by site.py after usage.
-
-  sys.getdefaultencoding()
-     --> Returns the current <default encoding>.
-
-If not otherwise defined or set, the <default encoding> defaults to
-'ascii'. This encoding is also the startup default of Python (and in
-effect before site.py is executed).
-
-Note that the default site.py startup module contains disabled
-optional code which can set the <default encoding> according to the
-encoding defined by the current locale. The locale module is used to
-extract the encoding from the locale default settings defined by the
-OS environment (see locale.py). If the encoding cannot be determined,
-is unknown or unsupported, the code defaults to setting the <default
-encoding> to 'ascii'. To enable this code, edit the site.py file or
-place the appropriate code into the sitecustomize.py module of your
-Python installation.
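-
-As an illustration, such locale-aware startup code could look roughly
-like the following sketch (placed in sitecustomize.py; hypothetical,
-assuming locale.getdefaultlocale() is available for querying the OS
-settings):
-
-    # sitecustomize.py -- imported by site.py before it removes
-    # sys.setdefaultencoding() from the sys module
-    import sys, locale
-
-    # Query the encoding defined by the OS locale settings
-    language, encoding = locale.getdefaultlocale()
-    if encoding is not None:
-        try:
-            sys.setdefaultencoding(encoding)
-        except LookupError:
-            # Unknown or unsupported encoding: keep the 'ascii' default
-            pass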
-
-
-Unicode Constructors:
----------------------
-
-Python should provide a built-in constructor for Unicode strings which
-is available through __builtins__:
-
-  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])
-
-  u = u'<unicode-escape encoded Python string>'
-
-  u = ur'<raw-unicode-escape encoded Python string>'
-
-With the 'unicode-escape' encoding being defined as:
-
-· all non-escape characters represent themselves as Unicode ordinal
-  (e.g. 'a' -> U+0061).
-
-· all existing defined Python escape sequences are interpreted as
-  Unicode ordinals; note that \xXXXX can represent all Unicode
-  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.
-
-· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
-  error to have fewer than 4 digits after \u.
-
-For an explanation of possible values for errors see the Codec section
-below.
-
-Examples:
-
-u'abc'          -> U+0061 U+0062 U+0063
-u'\u1234'       -> U+1234
-u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A
-
-The 'raw-unicode-escape' encoding is defined as follows:
-
-· \uXXXX sequences represent the U+XXXX Unicode character if and
-  only if the number of leading backslashes is odd
-
-· all other characters represent themselves as Unicode ordinal
-  (e.g. 'b' -> U+0062)
-
-
-Note that you should provide some hint about the encoding you used to
-write your programs as a pragma line in one of the first few comment
-lines of the source file (e.g. '# source file encoding: latin-1'). If you
-only use 7-bit ASCII then everything is fine and no such notice is
-needed, but if you include Latin-1 characters not defined in ASCII, it
-may well be worthwhile including a hint since people in other
-countries will want to be able to read your source strings too.
-
-
-Unicode Type Object:
---------------------
-
-Unicode objects should have the type UnicodeType with type name
-'unicode', made available through the standard types module.
-
-
-Unicode Output:
----------------
-
-Unicode objects have a method .encode([encoding=<default encoding>])
-which returns a Python string encoding the Unicode string using the
-given scheme (see Codecs).
-
-  print u := print u.encode()   # using the <default encoding>
- 
-  str(u)  := u.encode()         # using the <default encoding>
-
-  repr(u) := "u%s" % repr(u.encode('unicode-escape'))
-
-Also see Internal Argument Parsing and Buffer Interface for details on
-how other APIs written in C will treat Unicode objects.
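-
-Example (a sketch, assuming the <default encoding> is 'ascii'):
-
-    u = u'abc'
-    u.encode()           # -> 'abc', using the <default encoding>
-    u.encode('utf-8')    # -> 'abc' encoded as UTF-8
-    str(u)               # -> 'abc'
-    repr(u)              # -> "u'abc'"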
-
-
-Unicode Ordinals:
------------------
-
-Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
-should provide 32-bit aware ordinal conversion APIs:
-
-  ord(u[:1]) (this is the standard ord() extended to work with Unicode
-              objects)
-        --> Unicode ordinal number (32-bit)
-
-  unichr(i) 
-        --> Unicode object for character i (provided it is 32-bit);
-            ValueError otherwise
-
-Both APIs should go into __builtins__ just like their string
-counterparts ord() and chr().
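-
-Example (a sketch following the above definitions):
-
-    ord(u'a')            # -> 97 (U+0061)
-    ord(u'\u1234')       # -> 4660 (U+1234)
-    unichr(97)           # -> u'a'
-    unichr(-1)           # raises ValueError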
-
-Note that Unicode provides space for private encodings. Usage of these
-can cause different output representations on different machines. This
-problem is not a Python or Unicode problem, but a machine setup and
-maintenance one.
-
-
-Comparison & Hash Value:
-------------------------
-
-Unicode objects should compare equal to other objects after these
-other objects have been coerced to Unicode. For strings this means
-that they are interpreted as Unicode strings using the <default
-encoding>.
-
-Unicode objects should return the same hash value as their ASCII
-equivalent strings. Unicode strings holding non-ASCII values are not
-guaranteed to return the same hash values as the default encoded
-equivalent string representation.
-
-When compared using cmp() (or PyObject_Compare()) the implementation
-should mask TypeErrors raised during the conversion to remain in sync
-with the string behavior. All other errors, such as ValueErrors raised
-during coercion of strings to Unicode, should not be masked and should
-be passed through to the user.
-
-In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
-should be coerced to Unicode before applying the test. Errors occurring
-during coercion (e.g. None in u'abc') should not be masked.
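-
-A few illustrative cases (a sketch, assuming an 'ascii' <default
-encoding>):
-
-    u'abc' == 'abc'               # -> true; the string is coerced first
-    hash(u'abc') == hash('abc')   # -> true for ASCII-only values
-    'a' in u'abc'                 # -> true; both sides become Unicode
-    None in u'abc'                # raises an error (not masked)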
-
-
-Coercion:
----------
-
-Using Python strings and Unicode objects to form new objects should
-always coerce to the more precise format, i.e. Unicode objects.
-
-  u + s := u + unicode(s)
-
-  s + u := unicode(s) + u
-
-All string methods should delegate the call to an equivalent Unicode
-object method call by converting all involved strings to Unicode and
-then applying the arguments to the Unicode method of the same name,
-e.g.
-
-  string.join((s,u),sep) := (s + sep) + u
-
-  sep.join((s,u)) := (s + sep) + u
-
-For a discussion of %-formatting w/r to Unicode objects, see
-Formatting Markers.
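-
-Example (a sketch of the intended coercion behavior):
-
-    u'abc' + 'def'         # -> u'abcdef'; the string operand is coerced
-    'abc' + u'def'         # -> u'abcdef'
-    '-'.join((u'a', 'b'))  # -> u'a-b'; delegated to the Unicode method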
-
-
-Exceptions:
------------
-
-UnicodeError is defined in the exceptions module as a subclass of
-ValueError. It is available at the C level via PyExc_UnicodeError.
-All exceptions related to Unicode encoding/decoding should be
-subclasses of UnicodeError.
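-
-Since UnicodeError subclasses ValueError, existing code catching
-ValueError will also catch Unicode conversion errors, e.g. (a sketch;
-data stands for some encoded 8-bit string):
-
-    try:
-        u = unicode(data, 'utf-8')
-    except UnicodeError, why:
-        print 'UTF-8 decoding failed:', why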
-
-
-Codecs (Coder/Decoders) Lookup:
--------------------------------
-
-A Codec (see Codec Interface Definition) search registry should be
-implemented by a module "codecs":
-
-  codecs.register(search_function)
-
-Search functions are expected to take one argument, the encoding name
-in all lower case letters and with hyphens and spaces converted to
-underscores, and return a tuple of functions (encoder, decoder,
-stream_reader, stream_writer) taking the following arguments:
-
-  encoder and decoder:
-	These must be functions or methods which have the same
-	interface as the .encode/.decode methods of Codec instances
-	(see Codec Interface). The functions/methods are expected to
-	work in a stateless mode.
-
-  stream_reader and stream_writer:
-	These need to be factory functions with the following
-	interface:
-
-	        factory(stream,errors='strict')
-
-        The factory functions must return objects providing
-        the interfaces defined by StreamWriter/StreamReader resp.
-        (see Codec Interface). Stream codecs can maintain state.
-
-	Possible values for errors are defined in the Codec
-	section below.
-
-In case a search function cannot find a given encoding, it should
-return None.
-
-Aliasing support for encodings is left to the search functions
-to implement.
-
-The codecs module will maintain an encoding cache for performance
-reasons. Encodings are first looked up in the cache. If not found, the
-list of registered search functions is scanned. If no codecs tuple is
-found, a LookupError is raised. Otherwise, the codecs tuple is stored
-in the cache and returned to the caller.
-
-To query the codecs tuple for a given encoding, the following API
-should be used:
-
-  codecs.lookup(encoding)
-
-This will either return the found codecs tuple or raise a LookupError.
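-
-A minimal sketch of a search function (hypothetical names; for
-illustration it implements 'myencoding' as an alias for the builtin
-'latin-1' codec, since aliasing is left to the search functions):
-
-    import codecs
-
-    def mycodec_search(encoding):
-        if encoding != 'myencoding':
-            return None          # not ours: let other search functions try
-        return codecs.lookup('latin-1')
-
-    codecs.register(mycodec_search)
-
-    # The codec can now be found via the registry:
-    encoder, decoder, reader, writer = codecs.lookup('myencoding')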
-
-
-Standard Codecs:
-----------------
-
-Standard codecs should live inside an encodings/ package directory in the
-Standard Python Code Library. The __init__.py file of that directory should
-include a Codec Lookup compatible search function implementing a lazy module
-based codec lookup.
-
-Python should provide a few standard codecs for the most relevant
-encodings, e.g. 
-
-  'utf-8':              8-bit variable length encoding
-  'utf-16':             16-bit variable length encoding (little/big endian)
-  'utf-16-le':          utf-16 but explicitly little endian
-  'utf-16-be':          utf-16 but explicitly big endian
-  'ascii':              7-bit ASCII codepage
-  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
-  'unicode-escape':     See Unicode Constructors for a definition
-  'raw-unicode-escape': See Unicode Constructors for a definition
-  'native':             Dump of the Internal Format used by Python
-
-Common aliases should also be provided per default, e.g.  'latin-1'
-for 'iso-8859-1'.
-
-Note: 'utf-16' should be implemented by using and requiring byte order
-marks (BOM) for file input/output.
-
-All other encodings such as the CJK ones to support Asian scripts
-should be implemented in separate packages which do not get included
-in the core Python distribution and are not a part of this proposal.
-
-
-Codecs Interface Definition:
-----------------------------
-
-The following base classes should be defined in the module
-"codecs". They provide not only templates for use by encoding module
-implementors, but also define the interface which is expected by the
-Unicode implementation.
-
-Note that the Codec Interface defined here is well suited for a
-large range of applications. The Unicode implementation expects
-Unicode objects on input for .encode() and .write() and character
-buffer compatible objects on input for .decode(). Output of .encode()
-and .read() should be a Python string and .decode() must return a
-Unicode object.
-
-First, we have the stateless encoders/decoders. These do not work in
-chunks as the stream codecs (see below) do, because all components are
-expected to be available in memory.
-
-class Codec:
-
-    """ Defines the interface for stateless encoders/decoders.
-
-        The .encode()/.decode() methods may implement different error
-        handling schemes by providing the errors argument. These
-        string values are defined:
-
-         'strict' - raise an error (or a subclass)
-         'ignore' - ignore the character and continue with the next
-         'replace' - replace with a suitable replacement character;
-                    Python will use the official U+FFFD REPLACEMENT
-                    CHARACTER for the builtin Unicode codecs.
-
-    """
-    def encode(self,input,errors='strict'):
-        
-        """ Encodes the object input and returns a tuple (output
-            object, length consumed).
-
-            errors defines the error handling to apply. It defaults to
-            'strict' handling.
-
-            The method may not store state in the Codec instance. Use
-            StreamCodec for codecs which have to keep state in order to
-            make encoding/decoding efficient.
-
-        """
-	...
-
-    def decode(self,input,errors='strict'):
-
-        """ Decodes the object input and returns a tuple (output
-            object, length consumed).
-
-            input must be an object which provides the bf_getreadbuf
-            buffer slot. Python strings, buffer objects and memory
-            mapped files are examples of objects providing this slot.
-        
-            errors defines the error handling to apply. It defaults to
-            'strict' handling.
-
-            The method may not store state in the Codec instance. Use
-            StreamCodec for codecs which have to keep state in order to
-            make encoding/decoding efficient.
-
-        """ 
-        ...
-
-StreamWriter and StreamReader define the interface for stateful
-encoders/decoders which work on streams. These allow processing of the
-data in chunks to efficiently use memory. If you have large strings in
-memory, you may want to wrap them with cStringIO objects and then use
-these codecs on them to be able to do chunk processing as well,
-e.g. to provide progress information to the user.
-
-class StreamWriter(Codec):
-
-    def __init__(self,stream,errors='strict'):
-
-        """ Creates a StreamWriter instance.
-
-            stream must be a file-like object open for writing
-            (binary) data.
-
-            The StreamWriter may implement different error handling
-            schemes by providing the errors keyword argument. These
-            parameters are defined:
-
-             'strict' - raise a ValueError (or a subclass)
-             'ignore' - ignore the character and continue with the next
-             'replace' - replace with a suitable replacement character
-
-        """
-        self.stream = stream
-        self.errors = errors
-
-    def write(self,object):
-
-        """ Writes the object's contents encoded to self.stream.
-        """
-        data, consumed = self.encode(object,self.errors)
-        self.stream.write(data)
-        
-    def writelines(self, list):
-
-        """ Writes the concatenated list of strings to the stream
-            using .write().
-        """
-        self.write(''.join(list))
-        
-    def reset(self):
-
-        """ Flushes and resets the codec buffers used for keeping state.
-
-            Calling this method should ensure that the data on the
-            output is put into a clean state, that allows appending
-            of new fresh data without having to rescan the whole
-            stream to recover state.
-
-        """
-        pass
-
-    def __getattr__(self, name, getattr=getattr):
-
-        """ Inherit all other methods from the underlying stream.
-        """
-        return getattr(self.stream,name)
-
-class StreamReader(Codec):
-
-    def __init__(self,stream,errors='strict'):
-
-        """ Creates a StreamReader instance.
-
-            stream must be a file-like object open for reading
-            (binary) data.
-
-            The StreamReader may implement different error handling
-            schemes by providing the errors keyword argument. These
-            parameters are defined:
-
-             'strict' - raise a ValueError (or a subclass)
-             'ignore' - ignore the character and continue with the next
-             'replace' - replace with a suitable replacement character
-
-        """
-        self.stream = stream
-        self.errors = errors
-
-    def read(self,size=-1):
-
-        """ Decodes data from the stream self.stream and returns the
-            resulting object.
-
-            size indicates the approximate maximum number of bytes to
-            read from the stream for decoding purposes. The decoder
-            can modify this setting as appropriate. The default value
-            -1 indicates to read and decode as much as possible.  size
-            is intended to prevent having to decode huge files in one
-            step.
-
-            The method should use a greedy read strategy meaning that
-            it should read as much data as is allowed within the
-            definition of the encoding and the given size, e.g.  if
-            optional encoding endings or state markers are available
-            on the stream, these should be read too.
-
-        """
-        # Unsliced reading:
-        if size < 0:
-            return self.decode(self.stream.read())[0]
-        
-        # Sliced reading:
-        read = self.stream.read
-        decode = self.decode
-        data = read(size)
-        i = 0
-        while 1:
-            try:
-                object, decodedbytes = decode(data)
-            except ValueError, why:
-                # This method is slow but should work under pretty much
-                # all conditions; at most 10 tries are made
-                i = i + 1
-                newdata = read(1)
-                if not newdata or i > 10:
-                    raise
-                data = data + newdata
-            else:
-                return object
-
-    def readline(self, size=None):
-
-        """ Read one line from the input stream and return the
-            decoded data.
-
-            Note: Unlike the .readlines() method, this method inherits
-            the line breaking knowledge from the underlying stream's
-            .readline() method -- there is currently no support for
-            line breaking using the codec decoder due to lack of line
-            buffering. Subclasses should however, if possible, try to
-            implement this method using their own knowledge of line
-            breaking.
-
-            size, if given, is passed as size argument to the stream's
-            .readline() method.
-            
-        """
-        if size is None:
-            line = self.stream.readline()
-        else:
-            line = self.stream.readline(size)
-        return self.decode(line)[0]
-
-    def readlines(self, sizehint=None):
-
-        """ Read all lines available on the input stream
-            and return them as list of lines.
-
-            Line breaks are implemented using the codec's decoder
-            method and are included in the list entries.
-            
-            sizehint, if given, is passed as size argument to the
-            stream's .read() method.
-
-        """
-        if sizehint is None:
-            data = self.stream.read()
-        else:
-            data = self.stream.read(sizehint)
-        return self.decode(data)[0].splitlines(1)
-
-    def reset(self):
-
-        """ Resets the codec buffers used for keeping state.
-
-            Note that no stream repositioning should take place.
-            This method is primarily intended to be able to recover
-            from decoding errors.
-
-        """
-        pass
-
-    def __getattr__(self, name, getattr=getattr):
-
-        """ Inherit all other methods from the underlying stream.
-        """
-        return getattr(self.stream,name)
-
-
-Stream codec implementors are free to combine the StreamWriter and
-StreamReader interfaces into one class. Even combining all these with
-the Codec class should be possible.
-
-Implementors are free to add additional methods to enhance the codec
-functionality or provide extra state information needed for them to
-work. The internal codec implementation will only use the above
-interfaces, though.
-
-It is not required by the Unicode implementation to use these base
-classes, only the interfaces must match; this allows writing Codecs as
-extension types.
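-
-A minimal sketch of a codec module following these interfaces
-(hypothetical; for brevity it delegates the actual conversion to the
-builtin 'latin-1' codec):
-
-    import codecs
-
-    class MyCodec(codecs.Codec):
-        def encode(self, input, errors='strict'):
-            # Return a tuple (output object, length consumed)
-            return (input.encode('latin-1', errors), len(input))
-        def decode(self, input, errors='strict'):
-            return (unicode(input, 'latin-1', errors), len(input))
-
-    # Stream codecs can simply mix in the stateless implementation:
-    class MyStreamWriter(MyCodec, codecs.StreamWriter):
-        pass
-
-    class MyStreamReader(MyCodec, codecs.StreamReader):
-        pass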
-
-As guideline, large mapping tables should be implemented using static
-C data in separate (shared) extension modules. That way multiple
-processes can share the same data.
-
-A tool to auto-convert Unicode mapping files to mapping modules should be
-provided to simplify support for additional mappings (see References).
-
-
-Whitespace:
------------
-
-The .split() method will have to know what is considered whitespace
-in Unicode.
-
-
-Case Conversion:
-----------------
-
-Case conversion is rather complicated with Unicode data, since there
-are many different conditions to respect. See
-
-  http://www.unicode.org/unicode/reports/tr13/ 
-
-for some guidelines on implementing case conversion.
-
-For Python, we should only implement the 1-1 conversions included in
-Unicode. Locale dependent and other special case conversions (see the
-Unicode standard file SpecialCasing.txt) should be left to user land
-routines and not go into the core interpreter.
-
-The methods .capitalize() and .iscapitalized() should follow the case
-mapping algorithm defined in the above technical report as closely as
-possible.
-
-
-Line Breaks:
-------------
-
-Line breaking should be done for all Unicode characters having the B
-property as well as the combinations CRLF, CR, LF (interpreted in that
-order) and other special line separators defined by the standard.
-
-The Unicode type should provide a .splitlines() method which returns a
-list of lines according to the above specification. See Unicode
-Methods.
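-
-Example (a sketch):
-
-    u'one\ntwo\r\nthree'.splitlines()   # -> [u'one', u'two', u'three']
-    u'one\ntwo\r\nthree'.splitlines(1)  # -> [u'one\n', u'two\r\n', u'three']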
-
-
-Unicode Character Properties:
------------------------------
-
-A separate module "unicodedata" should provide a compact interface to
-all Unicode character properties defined in the standard's
-UnicodeData.txt file.
-
-Among other things, these properties provide ways to recognize
-numbers, digits, spaces, whitespace, etc.
-
-Since this module will have to provide access to all Unicode
-characters, it will eventually have to contain the data from
-UnicodeData.txt which takes up around 600kB. For this reason, the data
-should be stored in static C data. This enables compilation as a
-shared module which the underlying OS can share between processes
-(unlike normal Python code modules).
-
-There should be a standard Python interface for accessing this information
-so that other implementors can plug in their own possibly enhanced versions,
-e.g. ones that do decompressing of the data on-the-fly.
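-
-Usage could then look like this (a sketch based on the fields of
-UnicodeData.txt):
-
-    import unicodedata
-
-    unicodedata.category(u'A')       # -> 'Lu' (letter, uppercase)
-    unicodedata.decimal(u'3')        # -> 3
-    unicodedata.bidirectional(u' ')  # -> 'WS' (whitespace)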
-
-
-Private Code Point Areas:
--------------------------
-
-Support for these is left to user land Codecs and not explicitly
-integrated into the core. Note that due to the Internal Format being
-implemented, only the area between \uE000 and \uF8FF is usable for
-private encodings.
-
-
-Internal Format:
-----------------
-
-The internal format for Unicode objects should use a Python specific
-fixed format <PythonUnicode> implemented as 'unsigned short' (or
-another unsigned numeric type having 16 bits). Byte order is platform
-dependent.
-
-This format will hold UTF-16 encodings of the corresponding Unicode
-ordinals. The Python Unicode implementation will address these values
-as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
-currently defined Unicode character points. UTF-16 without surrogates
-provides access to about 64k characters and covers all characters in
-the Basic Multilingual Plane (BMP) of Unicode.
-
-It is the Codec's responsibility to ensure that the data they pass to
-the Unicode object constructor respects this assumption. The
-constructor does not check the data for Unicode compliance or use of
-surrogates.
-
-Future implementations can extend the 16 bit restriction to the full
-set of all UTF-16 addressable characters (around 1M characters).
-
-The Unicode API should provide interface routines from <PythonUnicode>
-to the compiler's wchar_t which can be 16 or 32 bit depending on the
-compiler/libc/platform being used.
-
-Unicode objects should have a pointer to a cached Python string object
-<defenc> holding the object's value using the <default encoding>.
-This is needed for performance and internal parsing (see Internal
-Argument Parsing) reasons. The buffer is filled when the first
-conversion request to the <default encoding> is issued on the object.
-
-Interning is not needed (for now), since Python identifiers are
-defined as being ASCII only.
-
-codecs.BOM should return the byte order mark (BOM) for the format
-used internally. The codecs module should provide the following
-additional constants for convenience and reference (codecs.BOM will
-either be BOM_BE or BOM_LE depending on the platform):
-
-  BOM_BE: '\376\377' 
-    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
-     platforms == ZERO WIDTH NO-BREAK SPACE)
-
-  BOM_LE: '\377\376' 
-    (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
-     platforms == defined as being an illegal Unicode character)
-
-  BOM4_BE: '\000\000\376\377'
-    (corresponds to Unicode U+0000FEFF in UCS-4)
-
-  BOM4_LE: '\377\376\000\000'
-    (corresponds to Unicode U+0000FFFE in UCS-4)
-
-Note that Unicode sees big endian byte order as being "correct". The
-swapped order is taken to be an indicator for a "wrong" format, hence
-the illegal character definition.
-
-The configure script should provide aid in deciding whether Python can
-use the native wchar_t type or not (it has to be a 16-bit unsigned
-type).
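-
-For instance, a reader could use these constants to detect the byte
-order of UTF-16 data (a sketch; 'somefile' is hypothetical):
-
-    import codecs
-
-    data = open('somefile', 'rb').read()
-    if data[:2] == codecs.BOM_BE:
-        u = unicode(data[2:], 'utf-16-be')
-    elif data[:2] == codecs.BOM_LE:
-        u = unicode(data[2:], 'utf-16-le')
-    else:
-        raise ValueError, 'no UTF-16 BOM found'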
-
-
-Buffer Interface:
------------------
-
-Implement the buffer interface using the <defenc> Python string object
-as basis for bf_getcharbuf and the internal buffer for
-bf_getreadbuf. If bf_getcharbuf is requested and the <defenc> object
-does not yet exist, it is created first.
-
-Note that as a special case, the parser marker "s#" will not return
-raw Unicode UTF-16 data (which bf_getreadbuf returns), but instead
-tries to encode the Unicode object using the default encoding and then
-returns a pointer to the resulting string object (or raises an
-exception in case the conversion fails). This was done in order to
-prevent accidentally writing binary data to an output stream which the
-other end might not recognize.
-
-This has the advantage of being able to write to output streams (which
-typically use this interface) without additional specification of the
-encoding to use.
-
-If you need to access the read buffer interface of Unicode objects,
-use the PyObject_AsReadBuffer() interface.
-
-The internal format can also be accessed using the 'unicode-internal'
-codec, e.g. via u.encode('unicode-internal').
-
-
-Pickle/Marshalling:
--------------------
-
-Should have native Unicode object support. The objects should be
-encoded using platform independent encodings.
-
-Marshal should use UTF-8 and Pickle should either choose
-Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as the
-encoding. Using UTF-8 instead of UTF-16 has the advantage of
-eliminating the need to store a BOM.
-
-
-Regular Expressions:
---------------------
-
-Secret Labs AB is working on a Unicode-aware regular expression
-machinery.  It works on plain 8-bit, UCS-2, and (optionally) UCS-4
-internal character buffers.
-
-Also see
-
-        http://www.unicode.org/unicode/reports/tr18/
-
-for some remarks on how to treat Unicode REs.
-
-
-Formatting Markers:
--------------------
-
-Format markers are used in Python format strings. If Python strings
-are used as format strings, the following interpretations should be in
-effect:
-
-  '%s':                 For Unicode objects this will cause coercion of the
-			whole format string to Unicode. Note that
-			you should use a Unicode format string to start
-			with for performance reasons.
-
-In case the format string is a Unicode object, all parameters are coerced
-to Unicode first and then put together and formatted according to the format
-string. Numbers are first converted to strings and then to Unicode.
-
-  '%s':			Python strings are interpreted as Unicode
-			string using the <default encoding>. Unicode
-			objects are taken as is.
-
-All other string formatters should work accordingly.
-
-Example:
-
-u"%s %s" % (u"abc", "abc")  ==  u"abc abc"
-
-
-Internal Argument Parsing:
---------------------------
-
-These markers are used by the PyArg_ParseTuple() APIs:
-
-  "U":  Check for Unicode object and return a pointer to it
-
-  "s":  For Unicode objects: return a pointer to the object's
-	<defenc> buffer (which uses the <default encoding>).
-
-  "s#": Access to the default encoded version of the Unicode object
-        (see Buffer Interface); note that the length relates to the length
-	of the default encoded string rather than the Unicode object length.
-
-  "t#": Same as "s#".
-
-  "es": 
-	Takes two parameters: encoding (const char *) and
-	buffer (char **). 
-
-	The input object is first coerced to Unicode in the usual way
-	and then encoded into a string using the given encoding.
-
-	On output, a buffer of the needed size is allocated and
-	returned through *buffer as NULL-terminated string.
-	The encoded string may not contain embedded NULL characters.
-	The caller is responsible for calling PyMem_Free()
-	to free the allocated *buffer after usage.
-
-  "es#":
-	Takes three parameters: encoding (const char *),
-	buffer (char **) and buffer_len (int *).
-	
-	The input object is first coerced to Unicode in the usual way
-	and then encoded into a string using the given encoding.
-
-	If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer)
-	on input. Output is then copied to *buffer.
-
-	If *buffer is NULL, a buffer of the needed size is
-	allocated and output copied into it. *buffer is then
-	updated to point to the allocated memory area.
-	The caller is responsible for calling PyMem_Free()
-	to free the allocated *buffer after usage.
-
-	In both cases *buffer_len is updated to the number of
-	characters written (excluding the trailing NULL-byte).
-	The output buffer is assured to be NULL-terminated.
-
-Examples:
-
-Using "es#" with auto-allocation:
-
-    static PyObject *
-    test_parser(PyObject *self,
-		PyObject *args)
-    {
-	PyObject *str;
-	const char *encoding = "latin-1";
-	char *buffer = NULL;
-	int buffer_len = 0;
-
-	if (!PyArg_ParseTuple(args, "es#:test_parser",
-			      encoding, &buffer, &buffer_len))
-	    return NULL;
-	if (!buffer) {
-	    PyErr_SetString(PyExc_SystemError,
-			    "buffer is NULL");
-	    return NULL;
-	}
-	str = PyString_FromStringAndSize(buffer, buffer_len);
-	PyMem_Free(buffer);
-	return str;
-    }
-
-Using "es" with auto-allocation returning a NULL-terminated string:    
-    
-    static PyObject *
-    test_parser(PyObject *self,
-		PyObject *args)
-    {
-	PyObject *str;
-	const char *encoding = "latin-1";
-	char *buffer = NULL;
-
-	if (!PyArg_ParseTuple(args, "es:test_parser",
-			      encoding, &buffer))
-	    return NULL;
-	if (!buffer) {
-	    PyErr_SetString(PyExc_SystemError,
-			    "buffer is NULL");
-	    return NULL;
-	}
-	str = PyString_FromString(buffer);
-	PyMem_Free(buffer);
-	return str;
-    }
-
-Using "es#" with a pre-allocated buffer:
-    
-    static PyObject *
-    test_parser(PyObject *self,
-		PyObject *args)
-    {
-	PyObject *str;
-	const char *encoding = "latin-1";
-	char _buffer[10];
-	char *buffer = _buffer;
-	int buffer_len = sizeof(_buffer);
-
-	if (!PyArg_ParseTuple(args, "es#:test_parser",
-			      encoding, &buffer, &buffer_len))
-	    return NULL;
-	if (!buffer) {
-	    PyErr_SetString(PyExc_SystemError,
-			    "buffer is NULL");
-	    return NULL;
-	}
-	str = PyString_FromStringAndSize(buffer, buffer_len);
-	return str;
-    }
-
-
-File/Stream Output:
--------------------
-
-Since file.write(object) and most other stream writers use the "s#" or
-"t#" argument parsing marker for querying the data to write, the
-default encoded string version of the Unicode object will be written
-to the stream (see Buffer Interface).
-
-For explicit handling of files using Unicode, the standard stream
-codecs as available through the codecs module should be used.
-
-The codecs module should provide a short-cut open(filename,mode,encoding)
-which also assures that mode contains the 'b' character when needed.
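-
-Example (a sketch; 'somefile' is hypothetical):
-
-    import codecs
-
-    f = codecs.open('somefile', 'wb', 'utf-8')
-    f.write(u'abc\u1234\n')   # encoded to UTF-8 on the way out
-    f.close()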
-
-
-File/Stream Input:
-------------------
-
-Only the user knows what encoding the input data uses, so no special
-magic is applied. The user will have to explicitly convert the string
-data to Unicode objects as needed or use the file wrappers defined in
-the codecs module (see File/Stream Output).
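-
-Example (a sketch; 'somefile' is hypothetical):
-
-    import codecs
-
-    # Using a stream codec wrapper:
-    f = codecs.open('somefile', 'rb', 'utf-8')
-    u = f.read()
-    f.close()
-
-    # Or converting explicitly:
-    data = open('somefile', 'rb').read()
-    u = unicode(data, 'utf-8')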
-
-
-Unicode Methods & Attributes:
------------------------------
-
-All Python string methods, plus:
-
-  .encode([encoding=<default encoding>][,errors="strict"]) 
-     --> see Unicode Output
-
-  .splitlines([include_breaks=0])
-     --> breaks the Unicode string into a list of (Unicode) lines;
-         returns the lines with line breaks included, if include_breaks
-         is true. See Line Breaks for a specification of how line breaking
-         is done.
-
-
-Code Base:
-----------
-
-We should use Fredrik Lundh's Unicode object implementation as the
-basis. It already implements most of the string methods needed and
-provides a well-written code base which we can build upon.
-
-The object sharing implemented in Fredrik's implementation should
-be dropped.
-
-
-Test Cases:
------------
-
-Test cases should follow those in Lib/test/test_string.py and include
-additional checks for the Codec Registry and the Standard Codecs.
-
-
-References:
------------
-
-Unicode Consortium:
-        http://www.unicode.org/
-
-Unicode FAQ:
-        http://www.unicode.org/unicode/faq/
-
-Unicode 3.0:
-        http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
-
-Unicode-TechReports:
-        http://www.unicode.org/unicode/reports/techreports.html
-
-Unicode-Mappings:
-        ftp://ftp.unicode.org/Public/MAPPINGS/
-
-Introduction to Unicode (a little outdated but still nice to read):
-        http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
-
-For comparison:
-	Introducing Unicode to ECMAScript (aka JavaScript) --
-	http://www-4.ibm.com/software/developer/library/internationalization-support.html
-
-IANA Character Set Names:
-	ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
-
-Discussion of UTF-8 and Unicode support for POSIX and Linux:
-	http://www.cl.cam.ac.uk/~mgk25/unicode.html
-
-Encodings:
-
-    Overview:
-            http://czyborra.com/utf/
-
-    UCS-2:
-            http://www.uazone.com/multiling/unicode/ucs2.html
-
-    UTF-7:
-            Defined in RFC2152, e.g.
-            http://www.uazone.com/multiling/ml-docs/rfc2152.txt
-
-    UTF-8:
-            Defined in RFC2279, e.g.
-            http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt
-
-    UTF-16:
-            http://www.uazone.com/multiling/unicode/wg2n1035.html
-
-
-History of this Proposal:
--------------------------
-1.7: Added note about the changed behaviour of "s#".
-1.6: Changed <defencstr> to <defenc> since this is the name used in the
-     implementation. Added notes about the usage of <defenc> in the
-     buffer protocol implementation.
-1.5: Added notes about setting the <default encoding>. Fixed some
-     typos (thanks to Andrew Kuchling). Changed <defencstr> to <utf8str>.
-1.4: Added note about mixed type comparisons and contains tests.
-     Changed treating of Unicode objects in format strings (if used
-     with '%s' % u they will now cause the format string to be
-     coerced to Unicode, thus producing a Unicode object on return).
-     Added link to IANA charset names (thanks to Lars Marius Garshol).
-     Added new codec methods .readline(), .readlines() and .writelines().
-1.3: Added new "es" and "es#" parser markers
-1.2: Removed POD about codecs.open()
-1.1: Added note about comparisons and hash values. Added note about
-     case mapping algorithms. Changed stream codecs .read() and
-     .write() method to match the standard file-like object methods
-     (bytes consumed information is no longer returned by the methods)
-1.0: changed encode Codec method to be symmetric to the decode method
-     (they both return (object, data consumed) now and thus become
-     interchangeable); removed __init__ method of Codec class (the
-     methods are stateless) and moved the errors argument down to the
-     methods; made the Codec design more generic w/r to type of input
-     and output objects; changed StreamWriter.flush to StreamWriter.reset
-     in order to avoid overriding the stream's .flush() method;
-     renamed .breaklines() to .splitlines(); renamed the module unicodec
-     to codecs; modified the File I/O section to refer to the stream codecs.
-0.9: changed errors keyword argument definition; added 'replace' error
-     handling; changed the codec APIs to accept buffer like objects on
-     input; some minor typo fixes; added Whitespace section and
-     included references for Unicode characters that have the whitespace
-     and the line break characteristic; added note that search functions
-     can expect lower-case encoding names; dropped slicing and offsets
-     in the codec APIs
-0.8: added encodings package and raw unicode escape encoding; untabified
-     the proposal; added notes on Unicode format strings; added
-     .breaklines() method
-0.7: added a whole new set of codec APIs; added a different encoder
-     lookup scheme; fixed some names
-0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
-     a real Python string object; changed Buffer Interface to delegate
-     requests to <defencstr>'s buffer interface; removed the explicit
-     reference to the unicodec.codecs dictionary (the module can implement
-     this in a way fit for the purpose); removed the settable default
-     encoding; moved UnicodeError from unicodec to exceptions; "s#"
-     now returns the internal data; passed the UCS-2/UTF-16 checking
-     from the Unicode constructor to the Codecs
-0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
-     private use encodings and Unicode character properties
-0.4: added Codec interface, notes on %-formatting, changed some encoding
-     details, added comments on stream wrappers, fixed some discussion
-     points (most important: Internal Format), clarified the 
-     'unicode-escape' encoding, added encoding references
-0.3: added references, comments on codec modules, the internal format,
-     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
-     proposed by Tim Peters and fixed repr(u) accordingly
-0.2: integrated Guido's suggestions, added stream codecs and file
-     wrapping
-0.1: first version
-
-
------------------------------------------------------------------------------
-Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
------------------------------------------------------------------------------
+-Barry