=============================================================================
Python Unicode Integration                          Proposal Version: 1.4
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python --, we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.python.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.python.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples we use u = Unicode object and s = Python string

· 'XXX' markings indicate points of discussion (PODs)


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.

  Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.
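The two normalization rules above can be sketched as a pair of small
helper functions (the function names are made up for this example;
they are not part of the proposal):

```python
def normalize_encoding_name(name):
    # Lowercase and convert spaces to hyphens, following the Unicode
    # Consortium naming conventions ('UTF 16' -> 'utf-16').
    return name.lower().replace(' ', '-')

def codec_module_name(name):
    # Codec modules use the same name with hyphens turned into
    # underscores ('iso-8859-1' -> 'iso_8859_1').
    return normalize_encoding_name(name).replace('-', '_')

assert normalize_encoding_name('UTF 16') == 'utf-16'
assert codec_module_name('ISO-8859-1') == 'iso_8859_1'
```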

· The <default encoding> should be the widely used 'utf-8' format. This
  is very close to the standard 7-bit ASCII format and thus resembles
  the standard used in programming nowadays in most aspects.
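The ASCII compatibility mentioned here is easy to check in any modern
Python, which kept UTF-8 as a built-in codec: pure-ASCII text encodes
to the identical byte sequence under 'utf-8' and 'ascii', while
non-ASCII characters become multi-byte sequences:

```python
text = "Hello, world!"
# ASCII text is byte-for-byte identical in UTF-8 and ASCII:
assert text.encode('utf-8') == text.encode('ascii')
# Non-ASCII characters use multi-byte sequences instead:
assert 'é'.encode('utf-8') == b'\xc3\xa9'
```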


Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section
below.

Examples:

  u'abc'          -> U+0061 U+0062 U+0063
  u'\u1234'       -> U+1234
  u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A
The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)
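Both encodings survive as codecs in modern Python, which makes the
difference easy to demonstrate: 'unicode-escape' interprets all Python
escape sequences, while 'raw-unicode-escape' only interprets \uXXXX:

```python
# 'unicode-escape' decodes \uXXXX *and* classic escapes such as \n:
assert b'abc\\u1234\\n'.decode('unicode-escape') == 'abc\u1234\n'

# 'raw-unicode-escape' decodes \uXXXX but leaves \n as the two
# characters backslash + 'n':
assert b'\\u1234\\n'.decode('raw-unicode-escape') == '\u1234\\n'
```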


Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.


Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1])  (this is the standard ord() extended to work with Unicode
               objects)
      --> Unicode ordinal number (32-bit)

  unichr(i)
      --> Unicode object for character i (provided it is 32-bit);
          ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().

Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.


Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as a Unicode string using the <default
encoding>.

For the same reason, Unicode objects should return the same hash value
as their UTF-8 equivalent strings.

When compared using cmp() (or PyObject_Compare()) the implementation
should mask TypeErrors raised during the conversion to remain in synch
with the string behavior. All other errors, such as ValueErrors raised
during coercion of strings to Unicode, should not be masked but passed
through to the user.

In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
should be coerced to Unicode before applying the test. Errors occurring
during coercion (e.g. None in u'abc') should not be masked.


Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting w/r to Unicode objects, see
Formatting Markers.


Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters and with hyphens and spaces converted to
underscores, and return a tuple of functions (encoder, decoder,
stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing the
        interfaces defined by StreamWriter/StreamReader resp.
        (see Codec Interface). Stream codecs can maintain state.

        Possible values for errors are defined in the Codec
        section below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions
to implement.
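The registry protocol described above still exists in modern Python's
codecs module (the returned tuple grew into a CodecInfo object). A
search function implementing a private alias might look like this;
the 'mylatin' name is made up for the example:

```python
import codecs

def search(name):
    # The registry hands us the normalized name (lower case, with
    # spaces converted to underscores).  Return None for anything
    # we do not recognize, as the proposal requires.
    if name == 'mylatin':
        # Alias onto the built-in Latin-1 codec by returning its
        # codecs tuple / CodecInfo.
        return codecs.lookup('iso-8859-1')
    return None

codecs.register(search)
assert 'é'.encode('mylatin') == b'\xe9'
```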

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To query the Codec instance the following API should be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.
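This lookup API exists unchanged in modern Python, including the
case/space normalization described under General Remarks and the
LookupError for unknown encodings:

```python
import codecs

info = codecs.lookup('UTF 8')   # case and spaces are normalized
assert info.name == 'utf-8'

try:
    codecs.lookup('no-such-encoding')
except LookupError:
    pass
else:
    raise AssertionError('expected LookupError')
```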


Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.
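Modern Python implements exactly this BOM behaviour for the 'utf-16'
codec, which can be verified directly:

```python
import codecs

data = 'abc'.encode('utf-16')
# The plain 'utf-16' codec writes a BOM first ...
assert data[:2] in (codecs.BOM_LE, codecs.BOM_BE)
# ... and consumes it again when decoding:
assert data.decode('utf-16') == 'abc'
# The endian-specific variants write no BOM:
assert 'abc'.encode('utf-16-be') == b'\x00a\x00b\x00c'
```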

All other encodings, such as the CJK ones needed to support Asian
scripts, should be implemented in separate packages which do not get
included in the core Python distribution and are not a part of this
proposal.


Codecs Interface Definition:
----------------------------

The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited to a
larger range of applications. The Unicode implementation expects
Unicode objects on input for .encode() and .write() and character
buffer compatible objects on input for .decode(). Output of .encode()
and .read() should be a Python string and .decode() must return a
Unicode object.
First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

          'strict'  - raise an error (or a subclass)
          'ignore'  - ignore the character and continue with the next
          'replace' - replace with a suitable replacement character;
                      Python will use the official U+FFFD REPLACEMENT
                      CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...
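The (output object, length consumed) convention described above is
still what modern Python's stateless codec functions return, e.g. when
obtained via codecs.getencoder()/codecs.getdecoder():

```python
import codecs

encode = codecs.getencoder('utf-8')
data, consumed = encode('abcé')
assert data == b'abc\xc3\xa9'   # the encoded output object
assert consumed == 4            # number of characters consumed
```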

StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to efficiently use memory. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

              'strict'  - raise a ValueError (or a subclass)
              'ignore'  - ignore the character and continue with the next
              'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def writelines(self, list):

        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)
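A minimal usage sketch of this writer interface, using the modern
codecs module and an in-memory binary stream in place of a file:

```python
import codecs
import io

stream = io.BytesIO()                       # binary, file-like, open for writing
writer = codecs.getwriter('utf-8')(stream)  # factory(stream, errors='strict')
writer.write('héllo ')
writer.writelines(['wörld', '\n'])          # concatenated and written via .write()
assert stream.getvalue() == b'h\xc3\xa9llo w\xc3\xb6rld\n'
```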

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

              'strict'  - raise a ValueError (or a subclass)
              'ignore'  - ignore the character and continue with the next
              'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible. size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g. if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError,why:
                # This method is slow but should work under pretty much
                # all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def readline(self, size=None):

        """ Read one line from the input stream and return the
            decoded data.

            Note: Unlike the .readlines() method, this method inherits
            the line breaking knowledge from the underlying stream's
            .readline() method -- there is currently no support for
            line breaking using the codec decoder due to lack of line
            buffering. Subclasses should however, if possible, try to
            implement this method using their own knowledge of line
            breaking.

            size, if given, is passed as size argument to the stream's
            .readline() method.

        """
        if size is None:
            line = self.stream.readline()
        else:
            line = self.stream.readline(size)
        return self.decode(line)[0]

    def readlines(self, sizehint=None):

        """ Read all lines available on the input stream
            and return them as list of lines.

            Line breaks are implemented using the codec's decoder
            method and are included in the list entries.

            sizehint, if given, is passed as size argument to the
            stream's .read() method.

        """
        if sizehint is None:
            data = self.stream.read()
        else:
            data = self.stream.read(sizehint)
        return self.decode(data)[0].splitlines(1)

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)
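The reading side can be exercised the same way with the modern codecs
module; here the UTF-16 stream codec consumes the BOM and keeps the
necessary decoding state between calls:

```python
import codecs
import io

raw = 'line1\nline2\n'.encode('utf-16')     # BOM + 16-bit data

reader = codecs.getreader('utf-16')(io.BytesIO(raw))
assert reader.readlines() == ['line1\n', 'line2\n']

reader = codecs.getreader('utf-16')(io.BytesIO(raw))
assert reader.read() == 'line1\nline2\n'
```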


Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using static
C data in separate (shared) extension modules. That way multiple
processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should be
provided to simplify support for additional mappings (see References).


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.
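This is indeed what modern Python's str.split() ended up doing: it
treats Unicode whitespace characters such as U+2000 EN QUAD and
U+00A0 NO-BREAK SPACE as separators, not just ASCII whitespace:

```python
# U+2000 EN QUAD and U+00A0 NO-BREAK SPACE both count as whitespace:
assert 'a\u2000b\u00a0c'.split() == ['a', 'b', 'c']
# U+3000 IDEOGRAPHIC SPACE is recognized as well:
assert '\u3000'.isspace()
```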


Case Conversion:
----------------

Case conversion is rather complicated with Unicode data, since there
are many different conditions to respect. See

        http://www.unicode.org/unicode/reports/tr13/

for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in
Unicode. Locale dependent and other special case conversions (see the
Unicode standard file SpecialCasing.txt) should be left to user land
routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case
mapping algorithm defined in the above technical report as closely as
possible.


Line Breaks:
------------

Line breaking should be done for all Unicode characters having the B
property as well as the combinations CRLF, CR, LF (interpreted in that
order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a
list of lines according to the above specification. See Unicode
Methods.
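Modern str.splitlines() implements this specification, including the
CRLF/CR/LF ordering and the special Unicode line separators:

```python
text = 'one\r\ntwo\rthree\nfour\u2028five'
# \r\n counts as a single break; U+2028 LINE SEPARATOR also breaks:
assert text.splitlines() == ['one', 'two', 'three', 'four', 'five']
# With keepends=True the break characters are included in the entries:
assert text.splitlines(True) == ['one\r\n', 'two\r', 'three\n',
                                 'four\u2028', 'five']
```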


Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the data
should be stored in static C data. This enables compilation as a
shared module which the underlying OS can share between processes
(unlike normal Python code modules).

There should be a standard Python interface for accessing this information
so that other implementors can plug in their own possibly enhanced versions,
e.g. ones that decompress the data on the fly.
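The unicodedata module exists as proposed; a few representative
property lookups from UnicodeData.txt:

```python
import unicodedata

assert unicodedata.name('é') == 'LATIN SMALL LETTER E WITH ACUTE'
assert unicodedata.category('A') == 'Lu'   # uppercase letter
assert unicodedata.category(' ') == 'Zs'   # space separator
assert unicodedata.decimal('7') == 7       # decimal digit value
assert unicodedata.numeric('½') == 0.5     # works for non-digit numbers too
```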


Private Code Point Areas:
-------------------------

Support for these is left to user land Codecs and not explicitly
integrated into the core. Note that, due to the Internal Format being
used, only the area between \uE000 and \uF8FF is usable for private
encodings.


Internal Format:
----------------

The internal format for Unicode objects should use a Python specific
fixed format <PythonUnicode> implemented as 'unsigned short' (or
another unsigned numeric type having 16 bits). Byte order is platform
dependent.

This format will hold UTF-16 encodings of the corresponding Unicode
ordinals. The Python Unicode implementation will address these values
as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
currently defined Unicode character points. UTF-16 without surrogates
provides access to about 64k characters and covers all characters in
the Basic Multilingual Plane (BMP) of Unicode.

It is the responsibility of the Codecs to ensure that the data they
pass to the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use of
surrogates.
Future implementations can lift the BMP restriction and support the
full set of UTF-16 addressable characters (around 1M characters).

The Unicode API should provide interface routines from <PythonUnicode>
to the compiler's wchar_t which can be 16 or 32 bit depending on the
compiler/libc/platform being used.

Unicode objects should have a pointer to a cached Python string object
<defencstr> holding the object's value using the current <default
encoding>. This is needed for performance and internal parsing (see
Internal Argument Parsing) reasons. The buffer is filled when the
first conversion request to the <default encoding> is issued on the
object.

Interning is not needed (for now), since Python identifiers are
defined as being ASCII only.

codecs.BOM should return the byte order mark (BOM) for the format
used internally. The codecs module should provide the following
additional constants for convenience and reference (codecs.BOM will
either be BOM_BE or BOM_LE depending on the platform):

  BOM_BE: '\376\377'
    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
     platforms == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376'
    (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
     platforms == defined as being an illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode U+0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (corresponds to Unicode U+0000FFFE in UCS-4)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.

The configure script should provide aid in deciding whether Python can
use the native wchar_t type or not (it has to be a 16-bit unsigned
type).
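The platform dependence of wchar_t is easy to observe from Python itself;
a purely illustrative sketch using the ctypes module (not part of this
proposal):

```python
import ctypes

# sizeof(wchar_t) for the C library Python was built against:
# 2 bytes on Windows, typically 4 on Linux/glibc.
width = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is %d bits wide here" % (width * 8))
```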


Buffer Interface:
-----------------

Implement the buffer interface using the <defencstr> Python string
object as the basis for bf_getcharbuf (corresponds to the "t#" argument
parsing marker) and the internal buffer for bf_getreadbuf (corresponds
to the "s#" argument parsing marker). If bf_getcharbuf is requested
and the <defencstr> object does not yet exist, it is created first.

This has the advantage of being able to write to output streams (which
typically use this interface) without additional specification of the
encoding to use.

The internal format can also be accessed using the 'unicode-internal'
codec, e.g. via u.encode('unicode-internal').


Pickle/Marshalling:
-------------------

Should have native Unicode object support. The objects should be
encoded using platform independent encodings.

Marshal should use UTF-8 and Pickle should either choose
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as the
encoding. Using UTF-8 instead of UTF-16 has the advantage of
eliminating the need to store a BOM.
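Both modules did grow native Unicode support; a round-trip sketch with
today's modules (their internal encodings may differ from the ones
proposed here, but the platform independence holds):

```python
import marshal
import pickle

# A Unicode string containing a character outside Latin-1.
s = u"abc\u20ac"

# Both serializations round-trip losslessly and are platform independent.
assert marshal.loads(marshal.dumps(s)) == s
assert pickle.loads(pickle.dumps(s)) == s
```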


Regular Expressions:
--------------------

Secret Labs AB is working on Unicode-aware regular expression
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
internal character buffers.

Also see

   http://www.unicode.org/unicode/reports/tr18/

for some remarks on how to treat Unicode REs.
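That machinery became the sre engine behind the re module; a brief
illustrative sketch of Unicode-aware matching (not part of the proposal
itself):

```python
import re

# \w matches Unicode word characters, so accented letters are included
# rather than being treated as word boundaries.
words = re.findall(r"\w+", u"h\u00e9llo w\u00f6rld")
print(words)
```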


Formatting Markers:
-------------------

Format markers are used in Python format strings. If Python strings
are used as format strings, the following interpretations should be in
effect:

  '%s': For Unicode objects this will cause coercion of the
        whole format string to Unicode. Note that
        you should use a Unicode format string to start
        with for performance reasons.

In case the format string is a Unicode object, all parameters are
coerced to Unicode first, then put together and formatted according to
the format string. Numbers are first converted to strings and then to
Unicode.

  '%s': Python strings are interpreted as Unicode
        strings using the <default encoding>. Unicode
        objects are taken as is.

All other string formatters should work accordingly.

Example:

u"%s %s" % (u"abc", "abc") == u"abc abc"

Internal Argument Parsing:
--------------------------

These markers are used by the PyArg_ParseTuple() APIs:

"U": Check for a Unicode object and return a pointer to it

"s": For Unicode objects: auto convert them to the <default encoding>
     and return a pointer to the object's <defencstr> buffer.

"s#": Access to the Unicode object via the bf_getreadbuf buffer interface
      (see Buffer Interface); note that the length relates to the buffer
      length, not the Unicode string length (this may be different
      depending on the Internal Format).

"t#": Access to the Unicode object via the bf_getcharbuf buffer interface
      (see Buffer Interface); note that the length relates to the buffer
      length, not necessarily to the Unicode string length (this may
      be different depending on the <default encoding>).
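The buffer-length versus string-length distinction is easy to see from
Python with an explicit encode (UTF-8 stands in for the <default
encoding> here):

```python
# One character: LATIN SMALL LETTER E WITH ACUTE.
u = u"\u00e9"

# One Unicode character, but two bytes in the encoded buffer --
# exactly the mismatch the "t#" note above warns about.
assert len(u) == 1
assert len(u.encode("utf-8")) == 2
```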

"es":
     Takes two parameters: encoding (const char *) and
     buffer (char **).

     The input object is first coerced to Unicode in the usual way
     and then encoded into a string using the given encoding.

     On output, a buffer of the needed size is allocated and
     returned through *buffer as a NULL-terminated string.
     The encoded data may not contain embedded NULL characters.
     The caller is responsible for calling PyMem_Free()
     to free the allocated *buffer after usage.

"es#":
     Takes three parameters: encoding (const char *),
     buffer (char **) and buffer_len (int *).

     The input object is first coerced to Unicode in the usual way
     and then encoded into a string using the given encoding.

     If *buffer is non-NULL, *buffer_len must be set to the size of
     the buffer on input. Output is then copied to *buffer.

     If *buffer is NULL, a buffer of the needed size is
     allocated and the output copied into it. *buffer is then
     updated to point to the allocated memory area.
     The caller is responsible for calling PyMem_Free()
     to free the allocated *buffer after usage.

     In both cases *buffer_len is updated to the number of
     characters written (excluding the trailing NULL-byte).
     The output buffer is assured to be NULL-terminated.

Examples:

Using "es#" with auto-allocation:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;
        int buffer_len = 0;

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        PyMem_Free(buffer);
        return str;
    }

Using "es" with auto-allocation returning a NULL-terminated string:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;

        if (!PyArg_ParseTuple(args, "es:test_parser",
                              encoding, &buffer))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromString(buffer);
        PyMem_Free(buffer);
        return str;
    }

Using "es#" with a pre-allocated buffer:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }


File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the "s#"
argument parsing marker for binary files and "t#" for text files, the
buffer interface implementation determines the encoding to use (see
Buffer Interface).

For explicit handling of files using Unicode, the standard
stream codecs as available through the codecs module should
be used.

The codecs module should provide a short-cut open(filename, mode,
encoding) which also ensures that mode contains the 'b' character
when needed.


File/Stream Input:
------------------

Only the user knows what encoding the input data uses, so no special
magic is applied. The user will have to explicitly convert the string
data to Unicode objects as needed or use the file wrappers defined in
the codecs module (see File/Stream Output).
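The explicit-conversion route is a one-liner; a sketch using the modern
spelling, bytes.decode() (UTF-8 assumed for illustration only):

```python
# Bytes as read from a binary stream: UTF-8 encoded "abc" plus EURO SIGN.
raw = b"abc\xe2\x82\xac"

# The caller supplies the encoding; nothing is guessed.
text = raw.decode("utf-8")
assert text == u"abc\u20ac"
```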


Unicode Methods & Attributes:
-----------------------------

All Python string methods, plus:

.encode([encoding=<default encoding>][,errors="strict"])
   --> see Unicode Output

.splitlines([include_breaks=0])
   --> breaks the Unicode string into a list of (Unicode) lines;
       returns the lines with line breaks included, if include_breaks
       is true. See Line Breaks for a specification of how line breaking
       is done.
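A sketch of the two methods as they shipped (the keyword argument was
ultimately named keepends rather than include_breaks):

```python
u = u"one\ntwo\r\nthree"

# Without break characters:
assert u.splitlines() == [u"one", u"two", u"three"]

# With the breaks kept (include_breaks / keepends true):
assert u.splitlines(True) == [u"one\n", u"two\r\n", u"three"]

# .encode() with explicit encoding and error handling:
assert u"abc".encode("ascii", "strict") == b"abc"
```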


Code Base:
----------

We should use Fredrik Lundh's Unicode object implementation as the
basis. It already implements most of the string methods needed and
provides a well written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should
be dropped.


Test Cases:
-----------

Test cases should follow those in Lib/test/test_string.py and include
additional checks for the Codec Registry and the Standard Codecs.


References:
-----------

Unicode Consortium:
   http://www.unicode.org/

Unicode FAQ:
   http://www.unicode.org/unicode/faq/

Unicode 3.0:
   http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

Unicode-TechReports:
   http://www.unicode.org/unicode/reports/techreports.html

Unicode-Mappings:
   ftp://ftp.unicode.org/Public/MAPPINGS/

Introduction to Unicode (a little outdated but still nice to read):
   http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

For comparison:
   Introducing Unicode to ECMAScript (aka JavaScript) --
   http://www-4.ibm.com/software/developer/library/internationalization-support.html

IANA Character Set Names:
   ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

Discussion of UTF-8 and Unicode support for POSIX and Linux:
   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Encodings:

   Overview:
      http://czyborra.com/utf/

   UCS-2:
      http://www.uazone.com/multiling/unicode/ucs2.html

   UTF-7:
      Defined in RFC 2152, e.g.
      http://www.uazone.com/multiling/ml-docs/rfc2152.txt

   UTF-8:
      Defined in RFC 2279, e.g.
      http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

   UTF-16:
      http://www.uazone.com/multiling/unicode/wg2n1035.html

History of this Proposal:
-------------------------
1.4: Added note about mixed type comparisons and contains tests.
     Changed treating of Unicode objects in format strings (if used
     with '%s' % u they will now cause the format string to be
     coerced to Unicode, thus producing a Unicode object on return).
     Added link to IANA charset names (thanks to Lars Marius Garshol).
     Added new codec methods .readline(), .readlines() and .writelines().
1.3: Added new "es" and "es#" parser markers
1.2: Removed POD about codecs.open()
1.1: Added note about comparisons and hash values. Added note about
     case mapping algorithms. Changed stream codecs .read() and
     .write() method to match the standard file-like object methods
     (bytes consumed information is no longer returned by the methods)
1.0: changed encode Codec method to be symmetric to the decode method
     (they both return (object, data consumed) now and thus become
     interchangeable); removed __init__ method of Codec class (the
     methods are stateless) and moved the errors argument down to the
     methods; made the Codec design more generic w/r to type of input
     and output objects; changed StreamWriter.flush to StreamWriter.reset
     in order to avoid overriding the stream's .flush() method;
     renamed .breaklines() to .splitlines(); renamed the module unicodec
     to codecs; modified the File I/O section to refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
     handling; changed the codec APIs to accept buffer like objects on
     input; some minor typo fixes; added Whitespace section and
     included references for Unicode characters that have the whitespace
     and the line break characteristic; added note that search functions
     can expect lower-case encoding names; dropped slicing and offsets
     in the codec APIs
0.8: added encodings package and raw unicode escape encoding; untabified
     the proposal; added notes on Unicode format strings; added
     .breaklines() method
0.7: added a whole new set of codec APIs; added a different encoder
     lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
     a real Python string object; changed Buffer Interface to delegate
     requests to <defencstr>'s buffer interface; removed the explicit
     reference to the unicodec.codecs dictionary (the module can implement
     this in any way fit for the purpose); removed the settable default
     encoding; moved UnicodeError from unicodec to exceptions; "s#"
     now returns the internal data; passed the UCS-2/UTF-16 checking
     from the Unicode constructor to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
     private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some encoding
     details, added comments on stream wrappers, fixed some discussion
     points (most important: Internal Format), clarified the
     'unicode-escape' encoding, added encoding references
0.3: added references, comments on codec modules, the internal format,
     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
     proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
     wrapping
0.1: first version


-----------------------------------------------------------------------------
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
-----------------------------------------------------------------------------