=============================================================================
 Python Unicode Integration                            Proposal Version: 1.4
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python -- we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples we use u = Unicode object and s = Python string

· 'XXX' markings indicate points of discussion (PODs)


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.

  Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.

· The <default encoding> should be the widely used 'utf-8' format. This
  is very close to the standard 7-bit ASCII format and thus resembles
  the standard used in programming nowadays in most respects.

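The lower-casing and normalization rules above are visible in today's
codecs registry; a small sketch in modern Python (an anachronism
relative to this proposal, shown only to illustrate the intent):

```python
import codecs

# Input is case-insensitive; the canonical name comes back lower case.
assert codecs.lookup('UTF-8').name == 'utf-8'

# Aliases resolve to the same codec under its canonical name.
assert codecs.lookup('UTF8').name == 'utf-8'
```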

Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section
below.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)

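Both escape encodings survive as codecs; a brief sketch using the
modern bytes/str API (an assumption relative to the u'...' literal
syntax proposed above):

```python
# Decoding with 'unicode-escape' turns escape sequences in the raw
# byte stream into the corresponding Unicode characters.
assert b'abc\\u1234\\n'.decode('unicode-escape') == 'abc\u1234\n'

# Encoding goes the other way: non-ASCII characters are written out
# as \uXXXX escape sequences.
assert 'abc\u1234'.encode('unicode-escape') == b'abc\\u1234'
```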

Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.

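In modern Python 3, where str plays the role of the proposed unicode
type, the .encode() behaviour sketched above looks as follows (the
<default encoding> did end up being 'utf-8'):

```python
u = '\u00e4bc'   # "äbc"

# .encode() without arguments uses the <default encoding> (utf-8)
assert u.encode() == u.encode('utf-8') == b'\xc3\xa4bc'

# an explicit encoding name selects a different codec
assert u.encode('latin-1') == b'\xe4bc'
```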

Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the
implementation should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1])   (this is the standard ord() extended to work with
                Unicode objects)
  --> Unicode ordinal number (32-bit)

  unichr(i)
  --> Unicode object for character i (provided it is 32-bit);
      ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().

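These builtins did materialize (unichr() was later folded into chr()
in Python 3); a quick sketch of the intended round trip, in modern
Python:

```python
# ord() maps a one-character string to its Unicode ordinal,
# chr() (the proposal's unichr()) maps the ordinal back.
assert ord('\u1234') == 0x1234
assert chr(0x1234) == '\u1234'

# ordinals well beyond the 8-bit range round-trip too
assert ord(chr(0x10FFFF)) == 0x10FFFF
```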
Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.


Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as Unicode strings using the <default
encoding>.

For the same reason, Unicode objects should return the same hash value
as their UTF-8 equivalent strings.

When compared using cmp() (or PyObject_Compare()) the implementation
should mask TypeErrors raised during the conversion to remain in sync
with the string behavior. All other errors, such as ValueErrors raised
during coercion of strings to Unicode, should not be masked and should
be passed through to the user.

In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
should be coerced to Unicode before applying the test. Errors occurring
during coercion (e.g. None in u'abc') should not be masked.


Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting w/r to Unicode objects, see
Formatting Markers.

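The coercion rule can be sketched in modern Python 3 terms, with bytes
standing in for the old 8-bit strings; to_unicode() and concat() are
hypothetical helpers invented for illustration:

```python
def to_unicode(obj, encoding='utf-8'):
    # 8-bit strings (bytes) are decoded using the <default encoding>;
    # Unicode strings pass through unchanged
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj

def concat(a, b):
    # u + s := u + unicode(s)   /   s + u := unicode(s) + u
    return to_unicode(a) + to_unicode(b)

assert concat('abc', b'def') == 'abcdef'
assert concat(b'abc', '\u1234') == 'abc\u1234'
```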

Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters and with hyphens and spaces converted to
underscores, and return a tuple of functions (encoder, decoder,
stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing the
        interfaces defined by StreamWriter/StreamReader resp.
        (see Codec Interface). Stream codecs can maintain state.

        Possible values for errors are defined in the Codec
        section below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions
to implement.

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To query the Codec instance the following API should be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.

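The register/lookup machinery described here is essentially what was
implemented; a sketch using the modern codecs module, with a toy codec
name 'toy_rev' invented for illustration:

```python
import codecs

def toy_search(name):
    # search functions receive the normalized name (lower case,
    # spaces/hyphens converted to underscores)
    if name != 'toy_rev':
        return None            # not ours: let other search functions try

    def encode(s, errors='strict'):
        # stateless encoder: returns (output object, length consumed)
        return s[::-1].encode('ascii', errors), len(s)

    def decode(b, errors='strict'):
        # stateless decoder: returns (output object, length consumed)
        return bytes(b)[::-1].decode('ascii', errors), len(b)

    # modern codecs return a CodecInfo (a 4-tuple work-alike) rather
    # than the plain tuple proposed above
    return codecs.CodecInfo(encode, decode, name='toy_rev')

codecs.register(toy_search)

assert 'abc'.encode('toy_rev') == b'cba'
assert b'cba'.decode('toy_rev') == 'abc'
```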

Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in
the Standard Python Code Library. The __init__.py file of that
directory should include a Codec Lookup compatible search function
implementing a lazy module based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.

All other encodings, such as the CJK ones needed to support Asian
scripts, should be implemented in separate packages which do not get
included in the core Python distribution and are not a part of this
proposal.

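The utf-16 BOM requirement can be seen in the codec that eventually
shipped; a short sketch in modern Python:

```python
import codecs

data = 'abc'.encode('utf-16')

# the byte-order-dependent 'utf-16' codec prepends a BOM ...
assert data[:2] in (codecs.BOM_BE, codecs.BOM_LE)

# ... while the explicit-endian variants do not
assert 'abc'.encode('utf-16-le') == b'a\x00b\x00c\x00'
assert b'a\x00b\x00c\x00'.decode('utf-16-le') == 'abc'
```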

Codec Interface Definition:
---------------------------

The following base class should be defined in the module
"codecs". It provides not only a template for use by encoding module
implementors, but also defines the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited for a
larger range of applications. The Unicode implementation expects
Unicode objects on input for .encode() and .write() and character
buffer compatible objects on input for .decode(). Output of .encode()
and .read() should be a Python string and .decode() must return a
Unicode object.

First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

          'strict'  - raise an error (or a subclass)
          'ignore'  - ignore the character and continue with the next
          'replace' - replace with a suitable replacement character;
                      Python will use the official U+FFFD REPLACEMENT
                      CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...

StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to efficiently use memory. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.

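The chunked stream interfaces survive as codecs.getreader() and
codecs.getwriter(); a sketch wrapping an in-memory stream, with
io.BytesIO standing in for the cStringIO objects mentioned above:

```python
import codecs
import io

raw = io.BytesIO()

# StreamWriter factory: wraps a byte stream and encodes on write
writer = codecs.getwriter('utf-8')(raw, errors='strict')
writer.write('abc\u1234')

# StreamReader factory: wraps the same bytes and decodes on read
raw.seek(0)
reader = codecs.getreader('utf-8')(raw, errors='strict')
assert reader.read() == 'abc\u1234'
```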
class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

              'strict'  - raise a ValueError (or a subclass)
              'ignore'  - ignore the character and continue with the next
              'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def writelines(self, list):

        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state, that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

              'strict'  - raise a ValueError (or a subclass)
              'ignore'  - ignore the character and continue with the next
              'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible. size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g. if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError, why:
                # This method is slow but should work under pretty much
                # all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def readline(self, size=None):

        """ Read one line from the input stream and return the
            decoded data.

            Note: Unlike the .readlines() method, this method inherits
            the line breaking knowledge from the underlying stream's
            .readline() method -- there is currently no support for
            line breaking using the codec decoder due to lack of line
            buffering. Subclasses should however, if possible, try to
            implement this method using their own knowledge of line
            breaking.

            size, if given, is passed as size argument to the stream's
            .readline() method.

        """
        if size is None:
            line = self.stream.readline()
        else:
            line = self.stream.readline(size)
        return self.decode(line)[0]

    def readlines(self, sizehint=None):

        """ Read all lines available on the input stream
            and return them as a list of lines.

            Line breaks are implemented using the codec's decoder
            method and are included in the list entries.

            sizehint, if given, is passed as size argument to the
            stream's .read() method.

        """
        if sizehint is None:
            data = self.stream.read()
        else:
            data = self.stream.read(sizehint)
        return self.decode(data)[0].splitlines(1)

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)


Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using
static C data in separate (shared) extension modules. That way
multiple processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should
be provided to simplify support for additional mappings (see
References).


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.


Case Conversion:
----------------

Case conversion is rather complicated with Unicode data, since there
are many different conditions to respect. See

        http://www.unicode.org/unicode/reports/tr13/

for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in
Unicode. Locale dependent and other special case conversions (see the
Unicode standard file SpecialCasing.txt) should be left to user land
routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case
mapping algorithm defined in the above technical report as closely as
possible.


Line Breaks:
------------

Line breaking should be done for all Unicode characters having the B
property as well as the combinations CRLF, CR, LF (interpreted in that
order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a
list of lines according to the above specification. See Unicode
Methods.

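str.splitlines() in today's Python implements this specification: it
breaks on the full Unicode line-break set, not just '\n' (a sketch):

```python
# CRLF, CR and LF are all recognized as line breaks
assert 'a\r\nb\rc\nd'.splitlines() == ['a', 'b', 'c', 'd']

# other Unicode line separators are honoured too,
# e.g. U+2028 LINE SEPARATOR
assert 'a\u2028b'.splitlines() == ['a', 'b']

# a true argument keeps the line breaks in the result
assert 'a\nb'.splitlines(True) == ['a\n', 'b']
```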

Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the data
should be stored in static C data. This enables compilation as a
shared module which the underlying OS can share between processes
(unlike normal Python code modules).

There should be a standard Python interface for accessing this
information so that other implementors can plug in their own possibly
enhanced versions, e.g. ones that do decompressing of the data
on-the-fly.

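The unicodedata module did materialize in this form; a small sketch of
the property interface:

```python
import unicodedata

# general category codes from UnicodeData.txt
assert unicodedata.category('A') == 'Lu'     # Letter, uppercase
assert unicodedata.category(' ') == 'Zs'     # Separator, space

# numeric properties: decimal digits and more general numeric values
assert unicodedata.digit('3') == 3
assert unicodedata.numeric('\u00bd') == 0.5  # VULGAR FRACTION ONE HALF
```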
626
627Private Code Point Areas:
628-------------------------
629
630Support for these is left to user land Codecs and not explicitly
631intergrated into the core. Note that due to the Internal Format being
632implemented, only the area between \uE000 and \uF8FF is useable for
633private encodings.
634
635
636Internal Format:
637----------------
638
639The internal format for Unicode objects should use a Python specific
640fixed format <PythonUnicode> implemented as 'unsigned short' (or
641another unsigned numeric type having 16 bits). Byte order is platform
642dependent.
643
644This format will hold UTF-16 encodings of the corresponding Unicode
645ordinals. The Python Unicode implementation will address these values
646as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
647currently defined Unicode character points. UTF-16 without surrogates
648provides access to about 64k characters and covers all characters in
649the Basic Multilingual Plane (BMP) of Unicode.
650
651It is the Codec's responsibility to ensure that the data they pass to
652the Unicode object constructor repects this assumption. The
653constructor does not check the data for Unicode compliance or use of
654surrogates.
655
656Future implementations can extend the 32 bit restriction to the full
657set of all UTF-16 addressable characters (around 1M characters).
658
659The Unicode API should provide inteface routines from <PythonUnicode>
660to the compiler's wchar_t which can be 16 or 32 bit depending on the
661compiler/libc/platform being used.
662
663Unicode objects should have a pointer to a cached Python string object
664<defencstr> holding the object's value using the current <default
665encoding>. This is needed for performance and internal parsing (see
666Internal Argument Parsing) reasons. The buffer is filled when the
667first conversion request to the <default encoding> is issued on the
668object.
669
670Interning is not needed (for now), since Python identifiers are
671defined as being ASCII only.
672
673codecs.BOM should return the byte order mark (BOM) for the format
674used internally. The codecs module should provide the following
675additional constants for convenience and reference (codecs.BOM will
676either be BOM_BE or BOM_LE depending on the platform):
677
678 BOM_BE: '\376\377'
679 (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
680 platforms == ZERO WIDTH NO-BREAK SPACE)
681
682 BOM_LE: '\377\376'
683 (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
684 platforms == defined as being an illegal Unicode character)
685
686 BOM4_BE: '\000\000\376\377'
687 (corresponds to Unicode U+0000FEFF in UCS-4)
688
689 BOM4_LE: '\377\376\000\000'
690 (corresponds to Unicode U+0000FFFE in UCS-4)
691
692Note that Unicode sees big endian byte order as being "correct". The
693swapped order is taken to be an indicator for a "wrong" format, hence
694the illegal character definition.
695
696The configure script should provide aid in deciding whether Python can
697use the native wchar_t type or not (it has to be a 16-bit unsigned
698type).
699
700
701Buffer Interface:
702-----------------
703
704Implement the buffer interface using the <defencstr> Python string
705object as basis for bf_getcharbuf (corresponds to the "t#" argument
706parsing marker) and the internal buffer for bf_getreadbuf (corresponds
707to the "s#" argument parsing marker). If bf_getcharbuf is requested
708and the <defencstr> object does not yet exist, it is created first.
709
710This has the advantage of being able to write to output streams (which
711typically use this interface) without additional specification of the
712encoding to use.
713
714The internal format can also be accessed using the 'unicode-internal'
715codec, e.g. via u.encode('unicode-internal').
716
717
718Pickle/Marshalling:
719-------------------
720
721Should have native Unicode object support. The objects should be
722encoded using platform independent encodings.
723
724Marshal should use UTF-8 and Pickle should either choose
725Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
726encoding. Using UTF-8 instead of UTF-16 has the advantage of
727eliminating the need to store a BOM mark.
728
729
730Regular Expressions:
731--------------------
732
733Secret Labs AB is working on a Unicode-aware regular expression
734machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
735internal character buffers.
736
737Also see
738
739 http://www.unicode.org/unicode/reports/tr18/
740
741for some remarks on how to treat Unicode REs.
742
743
744Formatting Markers:
745-------------------
746
747Format markers are used in Python format strings. If Python strings
748are used as format strings, the following interpretations should be in
749effect:
750
Fred Drake10dfd4c2000-04-13 14:12:38 +0000751 '%s': For Unicode objects this will cause coercion of the
752 whole format string to Unicode. Note that
753 you should use a Unicode format string to start
754 with for performance reasons.
Guido van Rossum9ed0d1e2000-03-10 23:14:11 +0000755
756In case the format string is an Unicode object, all parameters are coerced
757to Unicode first and then put together and formatted according to the format
758string. Numbers are first converted to strings and then to Unicode.
759
760 '%s': Python strings are interpreted as Unicode
761 string using the <default encoding>. Unicode
762 objects are taken as is.
763
764All other string formatters should work accordingly.
765
766Example:
767
768u"%s %s" % (u"abc", "abc") == u"abc abc"
769
770
771Internal Argument Parsing:
772--------------------------
773
774These markers are used by the PyArg_ParseTuple() APIs:
775
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000776 "U": Check for Unicode object and return a pointer to it
Guido van Rossum9ed0d1e2000-03-10 23:14:11 +0000777
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000778 "s": For Unicode objects: auto convert them to the <default encoding>
Guido van Rossum9ed0d1e2000-03-10 23:14:11 +0000779 and return a pointer to the object's <defencstr> buffer.
780
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000781 "s#": Access to the Unicode object via the bf_getreadbuf buffer interface
Guido van Rossum9ed0d1e2000-03-10 23:14:11 +0000782 (see Buffer Interface); note that the length relates to the buffer
783 length, not the Unicode string length (this may be different
784 depending on the Internal Format).
785
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000786 "t#": Access to the Unicode object via the bf_getcharbuf buffer interface
Guido van Rossum9ed0d1e2000-03-10 23:14:11 +0000787 (see Buffer Interface); note that the length relates to the buffer
788 length, not necessarily to the Unicode string length (this may
789 be different depending on the <default encoding>).
790
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000791 "es":
792 Takes two parameters: encoding (const char *) and
793 buffer (char **).
794
795 The input object is first coerced to Unicode in the usual way
796 and then encoded into a string using the given encoding.
797
798 On output, a buffer of the needed size is allocated and
799 returned through *buffer as NULL-terminated string.
800 The encoded may not contain embedded NULL characters.
Guido van Rossum24bdb042000-03-28 20:29:59 +0000801 The caller is responsible for calling PyMem_Free()
802 to free the allocated *buffer after usage.
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000803
804 "es#":
805 Takes three parameters: encoding (const char *),
806 buffer (char **) and buffer_len (int *).
807
808 The input object is first coerced to Unicode in the usual way
809 and then encoded into a string using the given encoding.
810
811 If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer)
812 on input. Output is then copied to *buffer.
813
814 If *buffer is NULL, a buffer of the needed size is
815 allocated and output copied into it. *buffer is then
Guido van Rossum24bdb042000-03-28 20:29:59 +0000816 updated to point to the allocated memory area.
817 The caller is responsible for calling PyMem_Free()
818 to free the allocated *buffer after usage.
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000819
820 In both cases *buffer_len is updated to the number of
821 characters written (excluding the trailing NULL-byte).
822 The output buffer is assured to be NULL-terminated.
823
824Examples:
825
826Using "es#" with auto-allocation:
827
828 static PyObject *
829 test_parser(PyObject *self,
830 PyObject *args)
831 {
832 PyObject *str;
833 const char *encoding = "latin-1";
834 char *buffer = NULL;
835 int buffer_len = 0;
836
837 if (!PyArg_ParseTuple(args, "es#:test_parser",
838 encoding, &buffer, &buffer_len))
839 return NULL;
840 if (!buffer) {
841 PyErr_SetString(PyExc_SystemError,
842 "buffer is NULL");
843 return NULL;
844 }
845 str = PyString_FromStringAndSize(buffer, buffer_len);
Guido van Rossum24bdb042000-03-28 20:29:59 +0000846 PyMem_Free(buffer);
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000847 return str;
848 }
849
850Using "es" with auto-allocation returning a NULL-terminated string:
851
852 static PyObject *
853 test_parser(PyObject *self,
854 PyObject *args)
855 {
856 PyObject *str;
857 const char *encoding = "latin-1";
858 char *buffer = NULL;
859
860 if (!PyArg_ParseTuple(args, "es:test_parser",
861 encoding, &buffer))
862 return NULL;
863 if (!buffer) {
864 PyErr_SetString(PyExc_SystemError,
865 "buffer is NULL");
866 return NULL;
867 }
868 str = PyString_FromString(buffer);
Guido van Rossum24bdb042000-03-28 20:29:59 +0000869 PyMem_Free(buffer);
Guido van Rossumd8855fd2000-03-24 22:14:19 +0000870 return str;
871 }
872
Using "es#" with a pre-allocated buffer:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }


File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the "s#"
argument parsing marker for binary files and "t#" for text files, the
buffer interface implementation determines the encoding to use (see
Buffer Interface).

For explicit handling of files using Unicode, the standard
stream codecs as available through the codecs module should
be used.

The codecs module should provide a short-cut open(filename,mode,encoding)
which also assures that mode contains the 'b' character when needed.

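As an illustration, here is a minimal sketch of such a wrapper in use,
written against the codecs.open() signature proposed above (the file
location and the sample text are arbitrary choices for the example):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# The wrapper encodes Unicode to Latin-1 on the way out; the underlying
# file is opened in binary mode, so no newline translation interferes
# with the encoded bytes.
f = codecs.open(path, 'w', encoding='latin-1')
f.write(u'K\xe9ln')
f.close()

# Reading through the same wrapper decodes the bytes back to Unicode.
f = codecs.open(path, 'r', encoding='latin-1')
data = f.read()
f.close()
```

The caller never touches encoded byte strings directly; both directions
of the conversion are hidden behind the stream codec.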

File/Stream Input:
------------------

Only the user knows what encoding the input data uses, so no special
magic is applied. The user will have to explicitly convert the string
data to Unicode objects as needed or use the file wrappers defined in
the codecs module (see File/Stream Output).
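
A minimal sketch of such an explicit conversion (the file name, sample
data and the choice of UTF-8 are illustrative assumptions only):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'input.txt')

# Create some UTF-8 encoded sample data to read back.
f = open(path, 'wb')
f.write(u'K\xe9ln'.encode('utf-8'))
f.close()

# The file itself yields plain byte strings; only the caller knows
# that they are UTF-8, so the conversion to Unicode is done explicitly.
f = open(path, 'rb')
raw = f.read()
f.close()
text = raw.decode('utf-8')
```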


Unicode Methods & Attributes:
-----------------------------

All Python string methods, plus:

    .encode([encoding=<default encoding>][,errors="strict"])
        --> see Unicode Output

    .splitlines([include_breaks=0])
        --> breaks the Unicode string into a list of (Unicode) lines;
            returns the lines with line breaks included, if include_breaks
            is true. See Line Breaks for a specification of how line breaking
            is done.
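
A short sketch of both methods in action; note that in the example the
include_breaks argument described above is passed positionally, and the
sample strings are arbitrary:

```python
u = u'first line\nsecond line\r\nthird'

# Break into (Unicode) lines, with and without the line breaks.
parts = u.splitlines()       # breaks stripped
kept = u.splitlines(True)    # breaks included (include_breaks true)

# Encode to a given encoding; errors defaults to "strict".
data = u'K\xe9ln'.encode('utf-8')
```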


Code Base:
----------

We should use Fredrik Lundh's Unicode object implementation as a basis.
It already implements most of the string methods needed and provides a
well-written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should
be dropped.


Test Cases:
-----------

Test cases should follow those in Lib/test/test_string.py and include
additional checks for the Codec Registry and the Standard Codecs.


References:
-----------

Unicode Consortium:
        http://www.unicode.org/

Unicode FAQ:
        http://www.unicode.org/unicode/faq/

Unicode 3.0:
        http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

Unicode-TechReports:
        http://www.unicode.org/unicode/reports/techreports.html

Unicode-Mappings:
        ftp://ftp.unicode.org/Public/MAPPINGS/

Introduction to Unicode (a little outdated but still nice to read):
        http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

For comparison:
        Introducing Unicode to ECMAScript --
        http://www-4.ibm.com/software/developer/library/internationalization-support.html

IANA Character Set Names:
        ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

Encodings:

    Overview:
        http://czyborra.com/utf/

    UCS-2:
        http://www.uazone.com/multiling/unicode/ucs2.html

    UTF-7:
        Defined in RFC2152, e.g.
        http://www.uazone.com/multiling/ml-docs/rfc2152.txt

    UTF-8:
        Defined in RFC2279, e.g.
        http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

    UTF-16:
        http://www.uazone.com/multiling/unicode/wg2n1035.html


History of this Proposal:
-------------------------
1.4: Added note about mixed type comparisons and contains tests.
     Changed treating of Unicode objects in format strings (if used
     with '%s' % u they will now cause the format string to be
     coerced to Unicode, thus producing a Unicode object on return).
     Added link to IANA charset names (thanks to Lars Marius Garshol).
     Added new codec methods .readline(), .readlines() and .writelines().
1.3: Added new "es" and "es#" parser markers
1.2: Removed POD about codecs.open()
1.1: Added note about comparisons and hash values. Added note about
     case mapping algorithms. Changed stream codecs .read() and
     .write() method to match the standard file-like object methods
     (bytes consumed information is no longer returned by the methods)
1.0: changed encode Codec method to be symmetric to the decode method
     (they both return (object, data consumed) now and thus become
     interchangeable); removed __init__ method of Codec class (the
     methods are stateless) and moved the errors argument down to the
     methods; made the Codec design more generic w/r to type of input
     and output objects; changed StreamWriter.flush to StreamWriter.reset
     in order to avoid overriding the stream's .flush() method;
     renamed .breaklines() to .splitlines(); renamed the module unicodec
     to codecs; modified the File I/O section to refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
     handling; changed the codec APIs to accept buffer like objects on
     input; some minor typo fixes; added Whitespace section and
     included references for Unicode characters that have the whitespace
     and the line break characteristic; added note that search functions
     can expect lower-case encoding names; dropped slicing and offsets
     in the codec APIs
0.8: added encodings package and raw unicode escape encoding; untabified
     the proposal; added notes on Unicode format strings; added
     .breaklines() method
0.7: added a whole new set of codec APIs; added a different encoder
     lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
     a real Python string object; changed Buffer Interface to delegate
     requests to <defencstr>'s buffer interface; removed the explicit
     reference to the unicodec.codecs dictionary (the module can implement
     this in a way fit for the purpose); removed the settable default
     encoding; moved UnicodeError from unicodec to exceptions; "s#"
     now returns the internal data; passed the UCS-2/UTF-16 checking
     from the Unicode constructor to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
     private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some encoding
     details, added comments on stream wrappers, fixed some discussion
     points (most important: Internal Format), clarified the
     'unicode-escape' encoding, added encoding references
0.3: added references, comments on codec modules, the internal format,
     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
     proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
     wrapping
0.1: first version


-----------------------------------------------------------------------------
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
-----------------------------------------------------------------------------