=============================================================================
 Python Unicode Integration                            Proposal Version: 1.6
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve (strings being one of the
most fundamental objects in Python), we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.python.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.python.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples we use u = Unicode object and s = Python string

· 'XXX' markings indicate points of discussion (PODs)


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

· Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.

· Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.

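The naming rules above can be sketched in a few lines. This is only an
illustration; the helper names normalize_encoding_name and
codec_module_name are hypothetical and not part of the proposal.

```python
def normalize_encoding_name(name):
    """Lower-case an encoding name and write spaces as hyphens."""
    return name.strip().lower().replace(' ', '-')


def codec_module_name(name):
    """Derive the codec module name: same name, hyphens as underscores."""
    return normalize_encoding_name(name).replace('-', '_')


print(normalize_encoding_name('UTF 16'))   # utf-16
print(codec_module_name('ISO-8859-1'))     # iso_8859_1
```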

Unicode Default Encoding:
-------------------------

The Unicode implementation has to make some assumption about the
encoding of 8-bit strings passed to it for coercion and about the
encoding to use as default for conversion of Unicode to strings when
no specific encoding is given. This encoding is called <default
encoding> throughout this text.

For this, the implementation maintains a global which can be set in
the site.py Python startup script. Subsequent changes are not
possible. The <default encoding> can be set and queried using the
two sys module APIs:

  sys.setdefaultencoding(encoding)
     --> Sets the <default encoding> used by the Unicode implementation.
         encoding has to be an encoding which is supported by the Python
         installation, otherwise, a LookupError is raised.

         Note: This API is only available in site.py! It is removed
         from the sys module by site.py after usage.

  sys.getdefaultencoding()
     --> Returns the current <default encoding>.

If not otherwise defined or set, the <default encoding> defaults to
'ascii'. This encoding is also the startup default of Python (and in
effect before site.py is executed).

Note that the default site.py startup module contains disabled
optional code which can set the <default encoding> according to the
encoding defined by the current locale. The locale module is used to
extract the encoding from the locale default settings defined by the
OS environment (see locale.py). If the encoding cannot be determined,
is unknown or unsupported, the code defaults to setting the <default
encoding> to 'ascii'. To enable this code, edit the site.py file or
place the appropriate code into the sitecustomize.py module of your
Python installation.

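The locale-based fallback described above might look roughly like the
following. This is a sketch under assumptions, written for present-day
Python 3: the function name choose_default_encoding is hypothetical,
and the sketch only computes an encoding name without calling
sys.setdefaultencoding().

```python
import codecs
import locale


def choose_default_encoding(candidate):
    """Return candidate if Python supports it, otherwise 'ascii'."""
    if not candidate:
        return 'ascii'            # encoding could not be determined
    try:
        codecs.lookup(candidate)  # raises LookupError if unsupported
    except LookupError:
        return 'ascii'
    return candidate.lower()


# Ask the locale module which encoding the OS environment suggests.
locale_encoding = locale.getpreferredencoding(False)
print(choose_default_encoding(locale_encoding))
```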

Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

The 'unicode-escape' encoding is defined as follows:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section
below.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)

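Both escape encodings can be tried out with the codecs of the same
name. The sketch below assumes present-day Python 3, where the encoded
form is a bytes object and the decoded form is a str; the variable
names are illustrative only.

```python
# unicode-escape: \uXXXX and the usual Python escapes are interpreted.
text = b'abc\\u1234\\n'.decode('unicode-escape')
print(['U+%04X' % ord(ch) for ch in text])

# raw-unicode-escape: \uXXXX counts only after an odd number of
# backslashes; with an even number everything is taken literally.
odd = b'\\u1234'.decode('raw-unicode-escape')      # one backslash
even = b'\\\\u1234'.decode('raw-unicode-escape')   # two backslashes
print(len(odd), len(even))
```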

Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.

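As an illustration of the .encode() method, the following sketch uses
present-day Python 3, where the result of .encode() is a bytes object
rather than a Python string as in the proposal. The way the repr-like
text is assembled here is an assumption made for the example.

```python
u = 'abc\u1234'

# .encode() with an explicit encoding returns the encoded byte string.
print(u.encode('utf-8'))

# The repr() definition above builds on the 'unicode-escape' encoding.
escaped = u.encode('unicode-escape')
print("u'%s'" % escaped.decode('ascii'))
```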

Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1]) (this is the standard ord() extended to work with Unicode
              objects)
        --> Unicode ordinal number (32-bit)

  unichr(i)
        --> Unicode object for character i (provided it is 32-bit);
            ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().

Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.

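A short sketch of the ordinal APIs, assuming present-day Python 3,
where the built-in chr() takes over the role of unichr():

```python
print(hex(ord('\u1234')))        # ordinal of a single character
print(chr(0x1234) == '\u1234')   # and back again

try:
    chr(0x110000)                # outside the Unicode code space
except ValueError as exc:
    print('ValueError:', exc)
```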

Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as a Unicode string using the <default
encoding>.

Unicode objects should return the same hash value as their ASCII
equivalent strings. Unicode strings holding non-ASCII values are not
guaranteed to return the same hash values as the default encoded
equivalent string representation.

When compared using cmp() (or PyObject_Compare()) the implementation
should mask TypeErrors raised during the conversion to remain in sync
with the string behavior. All other errors such as ValueErrors raised
during coercion of strings to Unicode should not be masked and should
be passed through to the user.

In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
should be coerced to Unicode before applying the test. Errors occurring
during coercion (e.g. None in u'abc') should not be masked.

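The coercion rules for comparison and containment can be modelled as
follows. This is a sketch in present-day Python 3, with bytes standing
in for the 8-bit Python string; DEFAULT_ENCODING, to_unicode,
unicode_equal and unicode_contains are hypothetical names chosen for
the example.

```python
DEFAULT_ENCODING = 'ascii'   # the <default encoding> of the proposal


def to_unicode(obj):
    """Coerce an 8-bit string to Unicode; pass Unicode through."""
    if isinstance(obj, bytes):
        return obj.decode(DEFAULT_ENCODING)   # may raise UnicodeError
    if isinstance(obj, str):
        return obj
    raise TypeError('cannot coerce %r to Unicode' % (obj,))


def unicode_equal(a, b):
    """Compare after coercing both sides to Unicode."""
    return to_unicode(a) == to_unicode(b)


def unicode_contains(container, item):
    """Containment test with both sides coerced; errors propagate."""
    return to_unicode(item) in to_unicode(container)


print(unicode_equal(b'abc', 'abc'))
print(unicode_contains('abc', b'a'))
```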

Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting with respect to Unicode objects, see
Formatting Markers.

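The coercion rule can be spelled out explicitly. The sketch below uses
present-day Python 3, where bytes and str are never mixed implicitly,
so the conversion is written by hand; as_unicode, concat and join are
hypothetical helpers, and 'ascii' stands in for the <default encoding>.

```python
def as_unicode(s, encoding='ascii'):
    """Decode 8-bit strings; leave Unicode objects untouched."""
    return s.decode(encoding) if isinstance(s, bytes) else s


def concat(left, right):
    """Model of u + s and s + u: the result is always Unicode."""
    return as_unicode(left) + as_unicode(right)


def join(sep, items):
    """Model of sep.join((s, u)) with every operand coerced first."""
    return as_unicode(sep).join([as_unicode(item) for item in items])


print(concat('ab', b'cd'))
print(join(b', ', (b'x', '\u1234')))
```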

Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.

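The exception hierarchy can be checked directly. This sketch assumes
present-day Python 3, where UnicodeError is a built-in name:

```python
print(issubclass(UnicodeError, ValueError))

try:
    b'\xff'.decode('ascii')
except ValueError as exc:        # also catches UnicodeDecodeError
    print(type(exc).__name__)
```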

Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters and with hyphens and spaces converted to
underscores, and return a tuple of functions (encoder, decoder,
stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing
        the interfaces defined by StreamReader/StreamWriter,
        respectively (see Codec Interface). Stream codecs can
        maintain state.

        Possible values for errors are defined in the Codec
        section below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions
to implement.

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To look up the codecs tuple for an encoding, the following API should
be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.

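A search function that implements a single alias might look like this.
The sketch assumes the codecs module of present-day Python 3, where
codecs.lookup() returns a CodecInfo object (a tuple subclass); the
encoding name 'x-demo-latin' and the function demo_search are made up
for the example.

```python
import codecs


def demo_search(name):
    """Hypothetical search function providing one alias."""
    # Depending on the Python version, hyphens may or may not already
    # have been converted to underscores, so normalise here as well.
    if name.replace('-', '_') == 'x_demo_latin':
        return codecs.lookup('iso-8859-1')   # reuse an existing codec
    return None                              # encoding not found here


codecs.register(demo_search)

info = codecs.lookup('X-Demo-Latin')         # names are case-insensitive
print(info.encode('\xe9'))
```

Returning None for unknown names lets the registry continue with the
next registered search function and finally raise LookupError.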

Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided by default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.

All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get included
in the core Python distribution and are not a part of this proposal.

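The difference between the three UTF-16 variants and the aliasing of
'latin-1' can be observed as follows. The sketch assumes the standard
codecs shipped with present-day Python 3:

```python
import codecs

text = 'a'
print(text.encode('utf-16-le'))   # explicit little endian, no BOM
print(text.encode('utf-16-be'))   # explicit big endian, no BOM

# Plain 'utf-16' prepends a byte order mark in native byte order.
data = text.encode('utf-16')
print(data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# 'latin-1' resolves to the same codec as 'iso-8859-1'.
print(codecs.lookup('latin-1').name == codecs.lookup('iso-8859-1').name)
```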

Codecs Interface Definition:
----------------------------

The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited for a
larger range of applications. The Unicode implementation expects
Unicode objects on input for .encode() and .write() and character
buffer compatible objects on input for .decode(). Output of .encode()
and .read() should be a Python string and .decode() must return a
Unicode object.

First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

         'strict' - raise a ValueError (or a subclass)
         'ignore' - ignore the character and continue with the next
         'replace' - replace with a suitable replacement character;
                    Python will use the official U+FFFD REPLACEMENT
                    CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamWriter/StreamReader for codecs which have to keep
            state in order to make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamWriter/StreamReader for codecs which have to keep
            state in order to make encoding/decoding efficient.

        """
        ...

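A concrete stateless codec only has to fill in the two methods. The
sketch below is written for present-day Python 3 and delegates to the
existing codecs.ascii_encode() and codecs.ascii_decode() functions; the
class name AsciiCodec is chosen for the example.

```python
import codecs


class AsciiCodec(codecs.Codec):
    """Stateless codec that delegates to the built-in ASCII functions."""

    def encode(self, input, errors='strict'):
        return codecs.ascii_encode(input, errors)   # (bytes, consumed)

    def decode(self, input, errors='strict'):
        return codecs.ascii_decode(input, errors)   # (str, consumed)


codec = AsciiCodec()
print(codec.encode('abc'))
print(codec.encode('a\xe9', 'replace'))    # unencodable char replaced
print(codec.decode(b'a\xff', 'replace'))   # undecodable byte -> U+FFFD
```
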
StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to efficiently use memory. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict' - raise a ValueError (or a subclass)
             'ignore' - ignore the character and continue with the next
             'replace'- replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def writelines(self, list):

        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state, that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict' - raise a ValueError (or a subclass)
             'ignore' - ignore the character and continue with the next
             'replace'- replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible.  size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g.  if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read(),self.errors)[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data,self.errors)
            except ValueError:
                # The chunk may end in the middle of a character: read
                # one more byte and retry. This method is slow but
                # should work under pretty much all conditions; at
                # most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def readline(self, size=None):

        """ Read one line from the input stream and return the
            decoded data.

            Note: Unlike the .readlines() method, this method inherits
            the line breaking knowledge from the underlying stream's
            .readline() method -- there is currently no support for
            line breaking using the codec decoder due to lack of line
            buffering. Subclasses should however, if possible, try to
            implement this method using their own knowledge of line
            breaking.

            size, if given, is passed as size argument to the stream's
            .readline() method.

        """
        if size is None:
            line = self.stream.readline()
        else:
            line = self.stream.readline(size)
        return self.decode(line,self.errors)[0]

    def readlines(self, sizehint=None):

        """ Read all lines available on the input stream
            and return them as list of lines.

            Line breaks are implemented using the codec's decoder
            method and are included in the list entries.

            sizehint, if given, is passed as size argument to the
            stream's .read() method.

        """
        # sizehint defaults to None so that the whole stream is read
        if sizehint is None:
            data = self.stream.read()
        else:
            data = self.stream.read(sizehint)
        return self.decode(data,self.errors)[0].splitlines(1)

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

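The stream interfaces are typically used through the factory functions
of a codec. The sketch below assumes present-day Python 3, where
codecs.getwriter() and codecs.getreader() return these factories and
io.BytesIO plays the part of the cStringIO object mentioned above:

```python
import codecs
import io

raw = io.BytesIO()                          # stands in for a binary file
writer = codecs.getwriter('utf-8')(raw)     # StreamWriter factory
writer.write('line one\n')
writer.writelines(['line \u1234\n', 'line three\n'])

raw.seek(0)
reader = codecs.getreader('utf-8')(raw)     # StreamReader factory
lines = reader.readlines()
print(lines)
```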

Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using static
C data in separate (shared) extension modules. That way multiple
processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should be
provided to simplify support for additional mappings (see References).


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.

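For illustration, in present-day Python 3 the str.split() method
already treats Unicode space characters as separators; the sample
string is made up for the example:

```python
# U+2003 EM SPACE and U+3000 IDEOGRAPHIC SPACE are Unicode whitespace.
text = 'one\u2003two\u3000three\tfour'
print(text.split())
```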
 | 610 |  | 
 | 611 | Case Conversion: | 
 | 612 | ---------------- | 
 | 613 |  | 
 | 614 | Case conversion is rather complicated with Unicode data, since there | 
 | 615 | are many different conditions to respect. See | 
 | 616 |  | 
 | 617 |   http://www.unicode.org/unicode/reports/tr13/  | 
 | 618 |  | 
 | 619 | for some guidelines on implementing case conversion. | 
 | 620 |  | 
 | 621 | For Python, we should only implement the 1-1 conversions included in | 
 | 622 | Unicode. Locale dependent and other special case conversions (see the | 
 | 623 | Unicode standard file SpecialCasing.txt) should be left to user land | 
 | 624 | routines and not go into the core interpreter. | 
 | 625 |  | 
 | 626 | The methods .capitalize() and .iscapitalized() should follow the case | 
 | 627 | mapping algorithm defined in the above technical report as closely as | 
 | 628 | possible. | 
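
A minimal sketch of the 1-1 conversions meant here, assuming a current
Python interpreter. The character pairs are chosen for illustration.
Current Python releases also apply some multi-character special mappings
that this proposal leaves to user land, so the sketch avoids such
characters.

```python
# Sketch: simple one-to-one case mappings round-trip cleanly.
pairs = [
    (u"a", u"A"),
    (u"\u00e9", u"\u00c9"),   # LATIN SMALL/CAPITAL LETTER E WITH ACUTE
    (u"\u03b1", u"\u0391"),   # GREEK SMALL/CAPITAL LETTER ALPHA
    (u"\u0436", u"\u0416"),   # CYRILLIC SMALL/CAPITAL LETTER ZHE
]

# Each pair maps to the other with exactly one code point on each side.
results = [(lo.upper() == up, up.lower() == lo) for lo, up in pairs]
print(results)
```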
 | 629 |  | 
 | 630 |  | 
 | 631 | Line Breaks: | 
 | 632 | ------------ | 
 | 633 |  | 
 | 634 | Line breaking should be done for all Unicode characters having the B | 
 | 635 | property as well as the combinations CRLF, CR, LF (interpreted in that | 
 | 636 | order) and other special line separators defined by the standard. | 
 | 637 |  | 
 | 638 | The Unicode type should provide a .splitlines() method which returns a | 
 | 639 | list of lines according to the above specification. See Unicode | 
 | 640 | Methods. | 
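
The rules above can be sketched as follows, assuming a current Python
interpreter. Note that the optional argument is passed positionally here,
since its name in the implementation may differ from the include_breaks
name used in this proposal.

```python
# Sketch: CRLF counts as one break; CR, LF and U+2028 (LINE SEPARATOR)
# each end a line on their own.
text = u"one\r\ntwo\rthree\nfour\u2028five"

lines = text.splitlines()                 # line breaks stripped
lines_with_breaks = text.splitlines(True) # line breaks kept
print(lines)
print(lines_with_breaks)
```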
 | 641 |  | 
 | 642 |  | 
 | 643 | Unicode Character Properties: | 
 | 644 | ----------------------------- | 
 | 645 |  | 
 | 646 | A separate module "unicodedata" should provide a compact interface to | 
 | 647 | all Unicode character properties defined in the standard's | 
 | 648 | UnicodeData.txt file. | 
 | 649 |  | 
 | 650 | Among other things, these properties provide ways to recognize | 
 | 651 | numbers, digits, spaces, whitespace, etc. | 
 | 652 |  | 
 | 653 | Since this module will have to provide access to all Unicode | 
 | 654 | characters, it will eventually have to contain the data from | 
 | 655 | UnicodeData.txt which takes up around 600kB. For this reason, the data | 
 | 656 | should be stored in static C data. This enables compilation as a shared | 
 | 657 | module which the underlying OS can share between processes (unlike | 
 | 658 | normal Python code modules). | 
 | 659 |  | 
 | 660 | There should be a standard Python interface for accessing this information | 
 | 661 | so that other implementors can plug in their own possibly enhanced versions, | 
 | 662 | e.g. ones that do decompressing of the data on-the-fly. | 
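
For illustration, a short sketch of what such an interface can look like.
The function names are those of the unicodedata module shipped with
current Python releases; this proposal itself does not fix them, so treat
them as an assumption.

```python
import unicodedata

# Sketch: querying a few properties taken from UnicodeData.txt.
category = unicodedata.category(u"A")       # general category
decimal = unicodedata.decimal(u"7")         # decimal digit value
numeric = unicodedata.numeric(u"\u00bd")    # VULGAR FRACTION ONE HALF
bidi = unicodedata.bidirectional(u" ")      # bidirectional class
name = unicodedata.name(u"\u20ac")          # character name

print(category, decimal, numeric, bidi, name)
```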
 | 663 |  | 
 | 664 |  | 
 | 665 | Private Code Point Areas: | 
 | 666 | ------------------------- | 
 | 667 |  | 
 | 668 | Support for these is left to user land Codecs and not explicitly | 
| Marc-André Lemburg | bfa36f5 | 2000-06-08 17:51:33 +0000 | [diff] [blame] | 669 | integrated into the core. Note that due to the Internal Format being | 
 | 670 | implemented, only the area between \uE000 and \uF8FF is usable for | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 671 | private encodings. | 
 | 672 |  | 
 | 673 |  | 
 | 674 | Internal Format: | 
 | 675 | ---------------- | 
 | 676 |  | 
 | 677 | The internal format for Unicode objects should use a Python specific | 
 | 678 | fixed format <PythonUnicode> implemented as 'unsigned short' (or | 
 | 679 | another unsigned numeric type having 16 bits). Byte order is platform | 
 | 680 | dependent. | 
 | 681 |  | 
 | 682 | This format will hold UTF-16 encodings of the corresponding Unicode | 
 | 683 | ordinals. The Python Unicode implementation will address these values | 
 | 684 | as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all | 
 | 685 | currently defined Unicode character points. UTF-16 without surrogates | 
 | 686 | provides access to about 64k characters and covers all characters in | 
 | 687 | the Basic Multilingual Plane (BMP) of Unicode. | 
 | 688 |  | 
 | 689 | It is the Codecs' responsibility to ensure that the data they pass to | 
| Marc-André Lemburg | bfa36f5 | 2000-06-08 17:51:33 +0000 | [diff] [blame] | 690 | the Unicode object constructor respects this assumption. The | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 691 | constructor does not check the data for Unicode compliance or use of | 
 | 692 | surrogates. | 
 | 693 |  | 
 | 694 | Future implementations can extend the 16 bit restriction to the full | 
 | 695 | set of all UTF-16 addressable characters (around 1M characters). | 
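
The difference between BMP characters and surrogate pairs can be sketched
as follows. This assumes a current Python interpreter and uses the
'utf-16-be' codec only to inspect the 16-bit code units; utf16_units is a
hypothetical helper written for this example.

```python
def utf16_units(ch):
    """Return the 16-bit code units UTF-16 uses for the given string."""
    data = ch.encode("utf-16-be")
    # Combine each big endian byte pair into one 16-bit value.
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]

bmp = utf16_units(u"\u20ac")             # BMP character: one unit
non_bmp = utf16_units(u"\U00010000")     # beyond the BMP: surrogate pair

# Characters reachable only through surrogates (the "around 1M" above).
supplementary_count = 0x10FFFF - 0x10000 + 1
print(bmp, non_bmp, supplementary_count)
```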
 | 696 |  | 
| Marc-André Lemburg | bfa36f5 | 2000-06-08 17:51:33 +0000 | [diff] [blame] | 697 | The Unicode API should provide interface routines from <PythonUnicode> | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 698 | to the compiler's wchar_t which can be 16 or 32 bit depending on the | 
 | 699 | compiler/libc/platform being used. | 
 | 700 |  | 
 | 701 | Unicode objects should have a pointer to a cached Python string object | 
| Marc-André Lemburg | bff879c | 2000-08-03 18:46:08 +0000 | [diff] [blame] | 702 | <defenc> holding the object's value using the <default encoding>. | 
 | 703 | This is needed for performance and internal parsing (see Internal | 
 | 704 | Argument Parsing) reasons. The buffer is filled when the first | 
 | 705 | conversion request to the <default encoding> is issued on the object. | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 706 |  | 
 | 707 | Interning is not needed (for now), since Python identifiers are | 
 | 708 | defined as being ASCII only. | 
 | 709 |  | 
 | 710 | codecs.BOM should return the byte order mark (BOM) for the format | 
 | 711 | used internally. The codecs module should provide the following | 
 | 712 | additional constants for convenience and reference (codecs.BOM will | 
 | 713 | either be BOM_BE or BOM_LE depending on the platform): | 
 | 714 |  | 
 | 715 |   BOM_BE: '\376\377'  | 
 | 716 |     (corresponds to Unicode U+0000FEFF in UTF-16 on big endian | 
 | 717 |      platforms == ZERO WIDTH NO-BREAK SPACE) | 
 | 718 |  | 
 | 719 |   BOM_LE: '\377\376'  | 
 | 720 |     (corresponds to Unicode U+0000FFFE in UTF-16 on little endian | 
 | 721 |      platforms == defined as being an illegal Unicode character) | 
 | 722 |  | 
 | 723 |   BOM4_BE: '\000\000\376\377' | 
 | 724 |     (corresponds to Unicode U+0000FEFF in UCS-4) | 
 | 725 |  | 
 | 726 |   BOM4_LE: '\377\376\000\000' | 
 | 727 |     (corresponds to Unicode U+0000FFFE in UCS-4) | 
 | 728 |  | 
 | 729 | Note that Unicode sees big endian byte order as being "correct". The | 
 | 730 | swapped order is taken to be an indicator for a "wrong" format, hence | 
 | 731 | the illegal character definition. | 
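
A small sketch of these constants, assuming the codecs module of a current
Python interpreter. BOM_UTF32_BE and BOM_UTF32_LE are the names that module
uses for the 4-byte marks; the BOM4_* names above are not assumed to exist.

```python
import codecs
import sys

# The 2-byte marks match the octal values listed above.
bom_be_ok = codecs.BOM_BE == b"\376\377"
bom_le_ok = codecs.BOM_LE == b"\377\376"

# codecs.BOM follows the byte order of the platform.
native = codecs.BOM_BE if sys.byteorder == "big" else codecs.BOM_LE
bom_native_ok = codecs.BOM == native

# Read in the "correct" (big endian) order, the swapped mark decodes
# to U+FFFE rather than U+FEFF.
straight = codecs.BOM_BE.decode("utf-16-be")
swapped = codecs.BOM_LE.decode("utf-16-be")

bom4_be = codecs.BOM_UTF32_BE
bom4_le = codecs.BOM_UTF32_LE
print(bom_be_ok, bom_le_ok, bom_native_ok)
```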
 | 732 |  | 
 | 733 | The configure script should provide aid in deciding whether Python can | 
 | 734 | use the native wchar_t type or not (it has to be a 16-bit unsigned | 
 | 735 | type). | 
 | 736 |  | 
 | 737 |  | 
 | 738 | Buffer Interface: | 
 | 739 | ----------------- | 
 | 740 |  | 
| Marc-André Lemburg | bff879c | 2000-08-03 18:46:08 +0000 | [diff] [blame] | 741 | Implement the buffer interface using the <defenc> Python string | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 742 | object as basis for bf_getcharbuf (corresponds to the "t#" argument | 
 | 743 | parsing marker) and the internal buffer for bf_getreadbuf (corresponds | 
 | 744 | to the "s#" argument parsing marker). If bf_getcharbuf is requested | 
| Marc-André Lemburg | bff879c | 2000-08-03 18:46:08 +0000 | [diff] [blame] | 745 | and the <defenc> object does not yet exist, it is created first. | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 746 |  | 
 | 747 | This has the advantage of being able to write to output streams (which | 
 | 748 | typically use this interface) without additional specification of the | 
 | 749 | encoding to use. | 
 | 750 |  | 
 | 751 | The internal format can also be accessed using the 'unicode-internal' | 
 | 752 | codec, e.g. via u.encode('unicode-internal'). | 
 | 753 |  | 
 | 754 |  | 
 | 755 | Pickle/Marshalling: | 
 | 756 | ------------------- | 
 | 757 |  | 
 | 758 | Should have native Unicode object support. The objects should be | 
 | 759 | encoded using platform independent encodings. | 
 | 760 |  | 
 | 761 | Marshal should use UTF-8 and Pickle should either choose | 
 | 762 | Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as | 
 | 763 | encoding. Using UTF-8 instead of UTF-16 has the advantage of | 
 | 764 | eliminating the need to store a BOM. | 
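
As a hedged illustration, the sketch below round-trips a Unicode string
through pickle and marshal in a current Python interpreter. Protocol 0
stands in for the text mode and protocol 2 for the binary mode mentioned
above; the exact byte layout is an implementation detail and may differ
between versions.

```python
import marshal
import pickle

u = u"price: \u20ac 10"

text_pickle = pickle.dumps(u, protocol=0)    # text mode
binary_pickle = pickle.dumps(u, protocol=2)  # binary mode
marshalled = marshal.dumps(u)

# All three forms are platform independent and restore the same value.
restored = [
    pickle.loads(text_pickle),
    pickle.loads(binary_pickle),
    marshal.loads(marshalled),
]
print(restored)
```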
 | 765 |  | 
 | 766 |  | 
 | 767 | Regular Expressions: | 
 | 768 | -------------------- | 
 | 769 |  | 
 | 770 | Secret Labs AB is working on a Unicode-aware regular expression | 
 | 771 | machinery.  It works on plain 8-bit, UCS-2, and (optionally) UCS-4 | 
 | 772 | internal character buffers. | 
 | 773 |  | 
 | 774 | Also see | 
 | 775 |  | 
 | 776 |         http://www.unicode.org/unicode/reports/tr18/ | 
 | 777 |  | 
 | 778 | for some remarks on how to treat Unicode REs. | 
 | 779 |  | 
 | 780 |  | 
 | 781 | Formatting Markers: | 
 | 782 | ------------------- | 
 | 783 |  | 
 | 784 | Format markers are used in Python format strings. If Python strings | 
 | 785 | are used as format strings, the following interpretations should be in | 
 | 786 | effect: | 
 | 787 |  | 
| Fred Drake | 10dfd4c | 2000-04-13 14:12:38 +0000 | [diff] [blame] | 788 |   '%s':                 For Unicode objects this will cause coercion of the | 
 | 789 | 			whole format string to Unicode. Note that | 
 | 790 | 			you should use a Unicode format string to start | 
 | 791 | 			with for performance reasons. | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 792 |  | 
 | 793 | If the format string is a Unicode object, all parameters are coerced | 
 | 794 | to Unicode first and then put together and formatted according to the format | 
 | 795 | string. Numbers are first converted to strings and then to Unicode. | 
 | 796 |  | 
 | 797 |   '%s':			Python strings are interpreted as Unicode | 
 | 798 | 			strings using the <default encoding>. Unicode | 
 | 799 | 			objects are taken as is. | 
 | 800 |  | 
 | 801 | All other string formatters should work accordingly. | 
 | 802 |  | 
 | 803 | Example: | 
 | 804 |  | 
 | 805 | u"%s %s" % (u"abc", "abc")  ==  u"abc abc" | 
 | 806 |  | 
 | 807 |  | 
 | 808 | Internal Argument Parsing: | 
 | 809 | -------------------------- | 
 | 810 |  | 
 | 811 | These markers are used by the PyArg_ParseTuple() APIs: | 
 | 812 |  | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 813 |   "U":  Check for Unicode object and return a pointer to it | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 814 |  | 
| Marc-André Lemburg | bff879c | 2000-08-03 18:46:08 +0000 | [diff] [blame] | 815 |   "s":  For Unicode objects: return a pointer to the object's | 
 | 816 | 	<defenc> buffer (which uses the <default encoding>). | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 817 |  | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 818 |   "s#": Access to the Unicode object via the bf_getreadbuf buffer interface  | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 819 |         (see Buffer Interface); note that the length relates to the buffer | 
 | 820 |         length, not the Unicode string length (this may be different | 
 | 821 |         depending on the Internal Format). | 
 | 822 |  | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 823 |   "t#": Access to the Unicode object via the bf_getcharbuf buffer interface | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 824 |         (see Buffer Interface); note that the length relates to the buffer | 
| Marc-André Lemburg | bff879c | 2000-08-03 18:46:08 +0000 | [diff] [blame] | 825 |         length, not necessarily to the Unicode string length. | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 826 |  | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 827 |   "es":  | 
 | 828 | 	Takes two parameters: encoding (const char *) and | 
 | 829 | 	buffer (char **).  | 
 | 830 |  | 
 | 831 | 	The input object is first coerced to Unicode in the usual way | 
 | 832 | 	and then encoded into a string using the given encoding. | 
 | 833 |  | 
 | 834 | 	On output, a buffer of the needed size is allocated and | 
 | 835 | 	returned through *buffer as NULL-terminated string. | 
 | 836 | 	The encoded string may not contain embedded NULL characters. | 
| Guido van Rossum | 24bdb04 | 2000-03-28 20:29:59 +0000 | [diff] [blame] | 837 | 	The caller is responsible for calling PyMem_Free() | 
 | 838 | 	to free the allocated *buffer after usage. | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 839 |  | 
 | 840 |   "es#": | 
 | 841 | 	Takes three parameters: encoding (const char *), | 
 | 842 | 	buffer (char **) and buffer_len (int *). | 
 | 843 | 	 | 
 | 844 | 	The input object is first coerced to Unicode in the usual way | 
 | 845 | 	and then encoded into a string using the given encoding. | 
 | 846 |  | 
 | 847 | 	If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer) | 
 | 848 | 	on input. Output is then copied to *buffer. | 
 | 849 |  | 
 | 850 | 	If *buffer is NULL, a buffer of the needed size is | 
 | 851 | 	allocated and output copied into it. *buffer is then | 
| Guido van Rossum | 24bdb04 | 2000-03-28 20:29:59 +0000 | [diff] [blame] | 852 | 	updated to point to the allocated memory area. | 
 | 853 | 	The caller is responsible for calling PyMem_Free() | 
 | 854 | 	to free the allocated *buffer after usage. | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 855 |  | 
 | 856 | 	In both cases *buffer_len is updated to the number of | 
 | 857 | 	characters written (excluding the trailing NULL-byte). | 
 | 858 | 	The output buffer is assured to be NULL-terminated. | 
 | 859 |  | 
 | 860 | Examples: | 
 | 861 |  | 
 | 862 | Using "es#" with auto-allocation: | 
 | 863 |  | 
 | 864 |     static PyObject * | 
 | 865 |     test_parser(PyObject *self, | 
 | 866 | 		PyObject *args) | 
 | 867 |     { | 
 | 868 | 	PyObject *str; | 
 | 869 | 	const char *encoding = "latin-1"; | 
 | 870 | 	char *buffer = NULL; | 
 | 871 | 	int buffer_len = 0; | 
 | 872 |  | 
 | 873 | 	if (!PyArg_ParseTuple(args, "es#:test_parser", | 
 | 874 | 			      encoding, &buffer, &buffer_len)) | 
 | 875 | 	    return NULL; | 
 | 876 | 	if (!buffer) { | 
 | 877 | 	    PyErr_SetString(PyExc_SystemError, | 
 | 878 | 			    "buffer is NULL"); | 
 | 879 | 	    return NULL; | 
 | 880 | 	} | 
 | 881 | 	str = PyString_FromStringAndSize(buffer, buffer_len); | 
| Guido van Rossum | 24bdb04 | 2000-03-28 20:29:59 +0000 | [diff] [blame] | 882 | 	PyMem_Free(buffer); | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 883 | 	return str; | 
 | 884 |     } | 
 | 885 |  | 
 | 886 | Using "es" with auto-allocation returning a NULL-terminated string:     | 
 | 887 |      | 
 | 888 |     static PyObject * | 
 | 889 |     test_parser(PyObject *self, | 
 | 890 | 		PyObject *args) | 
 | 891 |     { | 
 | 892 | 	PyObject *str; | 
 | 893 | 	const char *encoding = "latin-1"; | 
 | 894 | 	char *buffer = NULL; | 
 | 895 |  | 
 | 896 | 	if (!PyArg_ParseTuple(args, "es:test_parser", | 
 | 897 | 			      encoding, &buffer)) | 
 | 898 | 	    return NULL; | 
 | 899 | 	if (!buffer) { | 
 | 900 | 	    PyErr_SetString(PyExc_SystemError, | 
 | 901 | 			    "buffer is NULL"); | 
 | 902 | 	    return NULL; | 
 | 903 | 	} | 
 | 904 | 	str = PyString_FromString(buffer); | 
| Guido van Rossum | 24bdb04 | 2000-03-28 20:29:59 +0000 | [diff] [blame] | 905 | 	PyMem_Free(buffer); | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 906 | 	return str; | 
 | 907 |     } | 
 | 908 |  | 
 | 909 | Using "es#" with a pre-allocated buffer: | 
 | 910 |      | 
 | 911 |     static PyObject * | 
 | 912 |     test_parser(PyObject *self, | 
 | 913 | 		PyObject *args) | 
 | 914 |     { | 
 | 915 | 	PyObject *str; | 
 | 916 | 	const char *encoding = "latin-1"; | 
 | 917 | 	char _buffer[10]; | 
 | 918 | 	char *buffer = _buffer; | 
 | 919 | 	int buffer_len = sizeof(_buffer); | 
 | 920 |  | 
 | 921 | 	if (!PyArg_ParseTuple(args, "es#:test_parser", | 
 | 922 | 			      encoding, &buffer, &buffer_len)) | 
 | 923 | 	    return NULL; | 
 | 924 | 	if (!buffer) { | 
 | 925 | 	    PyErr_SetString(PyExc_SystemError, | 
 | 926 | 			    "buffer is NULL"); | 
 | 927 | 	    return NULL; | 
 | 928 | 	} | 
 | 929 | 	str = PyString_FromStringAndSize(buffer, buffer_len); | 
 | 930 | 	return str; | 
 | 931 |     } | 
 | 932 |  | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 933 |  | 
 | 934 | File/Stream Output: | 
 | 935 | ------------------- | 
 | 936 |  | 
 | 937 | Since file.write(object) and most other stream writers use the "s#" | 
 | 938 | argument parsing marker for binary files and "t#" for text files, the | 
 | 939 | buffer interface implementation determines the encoding to use (see | 
 | 940 | Buffer Interface). | 
 | 941 |  | 
 | 942 | For explicit handling of files using Unicode, the standard | 
 | 943 | stream codecs as available through the codecs module should  | 
 | 944 | be used. | 
 | 945 |  | 
| Barry Warsaw | 51ac580 | 2000-03-20 16:36:48 +0000 | [diff] [blame] | 946 | The codecs module should provide a short-cut open(filename,mode,encoding) | 
 | 947 | which also ensures that mode contains the 'b' character when | 
 | 948 | needed. | 
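
Explicit handling through the stream codecs can be sketched as follows,
assuming the codecs module of a current Python interpreter. An in-memory
byte stream stands in for a file opened in binary mode.

```python
import codecs
import io

# Wrap a binary stream so that Unicode written to it is encoded as UTF-8.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"gr\u00fc\u00df dich\n")
writer.write(u"\u20ac 10\n")
encoded = raw.getvalue()

# Wrap the encoded bytes again so that reading yields Unicode lines.
reader = codecs.getreader("utf-8")(io.BytesIO(encoded))
lines = reader.readlines()
print(lines)
```

The same wrappers apply to real files, provided the underlying file is
opened in binary mode so that no newline translation touches the encoded
data.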
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 949 |  | 
 | 950 |  | 
 | 951 | File/Stream Input: | 
 | 952 | ------------------ | 
 | 953 |  | 
 | 954 | Only the user knows what encoding the input data uses, so no special | 
 | 955 | magic is applied. The user will have to explicitly convert the string | 
 | 956 | data to Unicode objects as needed or use the file wrappers defined in | 
 | 957 | the codecs module (see File/Stream Output). | 
 | 958 |  | 
 | 959 |  | 
 | 960 | Unicode Methods & Attributes: | 
 | 961 | ----------------------------- | 
 | 962 |  | 
 | 963 | All Python string methods, plus: | 
 | 964 |  | 
 | 965 |   .encode([encoding=<default encoding>][,errors="strict"])  | 
 | 966 |      --> see Unicode Output | 
 | 967 |  | 
 | 968 |   .splitlines([include_breaks=0]) | 
 | 969 |      --> breaks the Unicode string into a list of (Unicode) lines; | 
 | 970 |          returns the lines with line breaks included, if include_breaks | 
 | 971 |          is true. See Line Breaks for a specification of how line breaking | 
 | 972 |          is done. | 
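
A minimal sketch of .encode() with the two error modes, assuming a current
Python interpreter. The encodings are named explicitly because the
<default encoding> depends on the installation.

```python
u = u"caf\u00e9"

utf8 = u.encode("utf-8")
latin1 = u.encode("latin-1")

# errors="replace" substitutes a replacement character for anything
# the target encoding cannot represent.
replaced = u.encode("ascii", "replace")

# errors="strict" (the default) raises an error instead.
try:
    u.encode("ascii")
    strict_failed = False
except UnicodeError:
    strict_failed = True

print(utf8, latin1, replaced, strict_failed)
```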
 | 973 |  | 
 | 974 |  | 
 | 975 | Code Base: | 
 | 976 | ---------- | 
 | 977 |  | 
 | 978 | We should use Fredrik Lundh's Unicode object implementation as basis. | 
 | 979 | It already implements most of the string methods needed and provides a | 
 | 980 | well written code base which we can build upon. | 
 | 981 |  | 
 | 982 | The object sharing implemented in Fredrik's implementation should | 
 | 983 | be dropped. | 
 | 984 |  | 
 | 985 |  | 
 | 986 | Test Cases: | 
 | 987 | ----------- | 
 | 988 |  | 
 | 989 | Test cases should follow those in Lib/test/test_string.py and include | 
 | 990 | additional checks for the Codec Registry and the Standard Codecs. | 
 | 991 |  | 
 | 992 |  | 
 | 993 | References: | 
 | 994 | ----------- | 
 | 995 |  | 
 | 996 | Unicode Consortium: | 
 | 997 |         http://www.unicode.org/ | 
 | 998 |  | 
 | 999 | Unicode FAQ: | 
 | 1000 |         http://www.unicode.org/unicode/faq/ | 
 | 1001 |  | 
 | 1002 | Unicode 3.0: | 
 | 1003 |         http://www.unicode.org/unicode/standard/versions/Unicode3.0.html | 
 | 1004 |  | 
 | 1005 | Unicode-TechReports: | 
 | 1006 |         http://www.unicode.org/unicode/reports/techreports.html | 
 | 1007 |  | 
 | 1008 | Unicode-Mappings: | 
 | 1009 |         ftp://ftp.unicode.org/Public/MAPPINGS/ | 
 | 1010 |  | 
 | 1011 | Introduction to Unicode (a little outdated but still nice to read): | 
 | 1012 |         http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html | 
 | 1013 |  | 
| Barry Warsaw | 51ac580 | 2000-03-20 16:36:48 +0000 | [diff] [blame] | 1014 | For comparison: | 
| Fred Drake | a69ef82 | 2000-05-09 19:58:19 +0000 | [diff] [blame] | 1015 | 	Introducing Unicode to ECMAScript (aka JavaScript) -- | 
| Barry Warsaw | 51ac580 | 2000-03-20 16:36:48 +0000 | [diff] [blame] | 1016 | 	http://www-4.ibm.com/software/developer/library/internationalization-support.html | 
 | 1017 |  | 
| Fred Drake | 10dfd4c | 2000-04-13 14:12:38 +0000 | [diff] [blame] | 1018 | IANA Character Set Names: | 
 | 1019 | 	ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets | 
 | 1020 |  | 
| Fred Drake | a69ef82 | 2000-05-09 19:58:19 +0000 | [diff] [blame] | 1021 | Discussion of UTF-8 and Unicode support for POSIX and Linux: | 
 | 1022 | 	http://www.cl.cam.ac.uk/~mgk25/unicode.html | 
 | 1023 |  | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 1024 | Encodings: | 
 | 1025 |  | 
 | 1026 |     Overview: | 
 | 1027 |             http://czyborra.com/utf/ | 
 | 1028 |  | 
 | 1029 |     UCS-2: | 
 | 1030 |             http://www.uazone.com/multiling/unicode/ucs2.html | 
 | 1031 |  | 
 | 1032 |     UTF-7: | 
 | 1033 |             Defined in RFC2152, e.g. | 
 | 1034 |             http://www.uazone.com/multiling/ml-docs/rfc2152.txt | 
 | 1035 |  | 
 | 1036 |     UTF-8: | 
 | 1037 |             Defined in RFC2279, e.g. | 
 | 1038 |             http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt | 
 | 1039 |  | 
 | 1040 |     UTF-16: | 
 | 1041 |             http://www.uazone.com/multiling/unicode/wg2n1035.html | 
 | 1042 |  | 
 | 1043 |  | 
 | 1044 | History of this Proposal: | 
 | 1045 | ------------------------- | 
| Marc-André Lemburg | bff879c | 2000-08-03 18:46:08 +0000 | [diff] [blame] | 1046 | 1.6: Changed <defencstr> to <defenc> since this is the name used in the | 
 | 1047 |      implementation. Added notes about the usage of <defenc> in the | 
 | 1048 |      buffer protocol implementation. | 
 | 1049 | 1.5: Added notes about setting the <default encoding>. Fixed some | 
 | 1050 |      typos (thanks to Andrew Kuchling). Changed <defencstr> to <utf8str>. | 
| Fred Drake | 10dfd4c | 2000-04-13 14:12:38 +0000 | [diff] [blame] | 1051 | 1.4: Added note about mixed type comparisons and contains tests. | 
 | 1052 |      Changed treatment of Unicode objects in format strings (if used | 
 | 1053 |      with '%s' % u they will now cause the format string to be | 
 | 1054 |      coerced to Unicode, thus producing a Unicode object on return). | 
 | 1055 |      Added link to IANA charset names (thanks to Lars Marius Garshol). | 
 | 1056 |      Added new codec methods .readline(), .readlines() and .writelines(). | 
| Guido van Rossum | d8855fd | 2000-03-24 22:14:19 +0000 | [diff] [blame] | 1057 | 1.3: Added new "es" and "es#" parser markers | 
| Barry Warsaw | 51ac580 | 2000-03-20 16:36:48 +0000 | [diff] [blame] | 1058 | 1.2: Removed POD about codecs.open() | 
| Guido van Rossum | 9ed0d1e | 2000-03-10 23:14:11 +0000 | [diff] [blame] | 1059 | 1.1: Added note about comparisons and hash values. Added note about | 
 | 1060 |      case mapping algorithms. Changed stream codecs .read() and | 
 | 1061 |      .write() method to match the standard file-like object methods | 
 | 1062 |      (bytes consumed information is no longer returned by the methods) | 
 | 1063 | 1.0: changed encode Codec method to be symmetric to the decode method | 
 | 1064 |      (they both return (object, data consumed) now and thus become | 
 | 1065 |      interchangeable); removed __init__ method of Codec class (the | 
 | 1066 |      methods are stateless) and moved the errors argument down to the | 
 | 1067 |      methods; made the Codec design more generic w/r to type of input | 
 | 1068 |      and output objects; changed StreamWriter.flush to StreamWriter.reset | 
 | 1069 |      in order to avoid overriding the stream's .flush() method; | 
 | 1070 |      renamed .breaklines() to .splitlines(); renamed the module unicodec | 
 | 1071 |      to codecs; modified the File I/O section to refer to the stream codecs. | 
 | 1072 | 0.9: changed errors keyword argument definition; added 'replace' error | 
 | 1073 |      handling; changed the codec APIs to accept buffer like objects on | 
 | 1074 |      input; some minor typo fixes; added Whitespace section and | 
 | 1075 |      included references for Unicode characters that have the whitespace | 
 | 1076 |      and the line break characteristic; added note that search functions | 
 | 1077 |      can expect lower-case encoding names; dropped slicing and offsets | 
 | 1078 |      in the codec APIs | 
 | 1079 | 0.8: added encodings package and raw unicode escape encoding; untabified | 
 | 1080 |      the proposal; added notes on Unicode format strings; added | 
 | 1081 |      .breaklines() method | 
 | 1082 | 0.7: added a whole new set of codec APIs; added a different encoder | 
 | 1083 |      lookup scheme; fixed some names | 
 | 1084 | 0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding | 
 | 1085 |      a real Python string object; changed Buffer Interface to delegate | 
 | 1086 |      requests to <defencstr>'s buffer interface; removed the explicit | 
 | 1087 |      reference to the unicodec.codecs dictionary (the module can implement | 
 | 1088 |      this in a way fit for the purpose); removed the settable default | 
 | 1089 |      encoding; moved UnicodeError from unicodec to exceptions; "s#" | 
 | 1090 |      now returns the internal data; passed the UCS-2/UTF-16 checking | 
 | 1091 |      from the Unicode constructor to the Codecs | 
 | 1092 | 0.5: moved sys.bom to unicodec.BOM; added sections on case mapping, | 
 | 1093 |      private use encodings and Unicode character properties | 
 | 1094 | 0.4: added Codec interface, notes on %-formatting, changed some encoding | 
 | 1095 |      details, added comments on stream wrappers, fixed some discussion | 
 | 1096 |      points (most important: Internal Format), clarified the  | 
 | 1097 |      'unicode-escape' encoding, added encoding references | 
 | 1098 | 0.3: added references, comments on codec modules, the internal format, | 
 | 1099 |      bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding | 
 | 1100 |      proposed by Tim Peters and fixed repr(u) accordingly | 
 | 1101 | 0.2: integrated Guido's suggestions, added stream codecs and file | 
 | 1102 |      wrapping | 
 | 1103 | 0.1: first version | 
 | 1104 |  | 
 | 1105 |  | 
 | 1106 | ----------------------------------------------------------------------------- | 
 | 1107 | Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com | 
 | 1108 | ----------------------------------------------------------------------------- |