=============================================================================
 Python Unicode Integration                          Proposal Version: 1.2
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python -- we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples we use u = Unicode object and s = Python string

· 'XXX' markings indicate points of discussion (PODs)


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16'
  is written as 'utf-16'.

  Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.
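The naming rules above can be sketched as a pair of small helpers (the
function names here are hypothetical, not part of the proposed API):

```python
def normalize_encoding(name):
    """Lower-case an encoding name and turn spaces into hyphens."""
    return "-".join(name.lower().split())

def codec_module_name(name):
    """Derive the codec module name: hyphens become underscores."""
    return normalize_encoding(name).replace("-", "_")
```

For example, 'UTF 16' normalizes to 'utf-16', whose codec module would
be named utf_16.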

· The <default encoding> should be the widely used 'utf-8' format. This
  is very close to the standard 7-bit ASCII format and thus resembles
  the standard used in programming nowadays in most aspects.


Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section
below.

Examples:

u'abc'         -> U+0061 U+0062 U+0063
u'\u1234'      -> U+1234
u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
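Today's Python still ships a codec implementing this scheme; decoding a
byte string through it looks as follows (modern spelling shown; the
proposal's unicode(s, 'unicode-escape') is equivalent):

```python
# Raw bytes containing the escapes \u1234 and \n, decoded with the
# 'unicode-escape' codec described above.
encoded = b"abc\\u1234\\n"
decoded = encoded.decode("unicode-escape")

assert decoded == "abc\u1234\n"   # \n becomes U+000A
assert ord(decoded[3]) == 0x1234  # \u1234 becomes U+1234
```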

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)


Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.


Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1])   (this is the standard ord() extended to work with
               Unicode objects)
      --> Unicode ordinal number (32-bit)

  unichr(i)
      --> Unicode object for character i (provided it is 32-bit);
          ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().
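Both APIs exist in Python today (unichr() was merged into chr() in
Python 3); their behaviour matches this specification:

```python
assert ord("a") == 0x61          # standard ord(), Unicode-aware
assert chr(0x1234) == "\u1234"   # the proposal's unichr()

# Out-of-range ordinals raise ValueError, as specified.
try:
    chr(0x110000)                # one past the top of the Unicode range
except ValueError:
    print("ValueError, as specified")
```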

Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.


Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as Unicode string using the <default
encoding>.

For the same reason, Unicode objects should return the same hash value
as their UTF-8 equivalent strings.

Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting w/r to Unicode objects, see
Formatting Markers.


Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters, and return a tuple of functions (encoder,
decoder, stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing
        the interfaces defined by StreamWriter/StreamReader resp.
        (see Codec Interface). Stream codecs can maintain state.

        Possible values for errors are defined in the Codec
        section below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions
to implement.

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To query the Codec instance the following API should be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.
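The codecs module in current Python still follows this design;
codecs.lookup() now returns a CodecInfo object, a tuple subclass
carrying the same four entries:

```python
import codecs

info = codecs.lookup("UTF 8".lower().replace(" ", "-"))  # normalized name
data, consumed = info.encode("abc\u1234")  # stateless encoder: (bytes, consumed)

# Unknown encodings raise LookupError, exactly as specified above.
try:
    codecs.lookup("no-such-encoding")
except LookupError:
    print("LookupError raised")
```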


Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.
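All of these codecs and the 'latin-1' alias exist in today's Python; a
quick check of the alias and BOM behaviour:

```python
import codecs

# 'latin-1' is an alias for 'iso-8859-1': identical output.
assert "é".encode("latin-1") == "é".encode("iso-8859-1") == b"\xe9"

# The generic 'utf-16' codec prepends a byte order mark ...
encoded = "abc".encode("utf-16")
assert encoded[:2] in (codecs.BOM_LE, codecs.BOM_BE)

# ... while the endian-specific variants do not.
assert "abc".encode("utf-16-le") == b"a\x00b\x00c\x00"
```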

All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get included
in the core Python distribution and are not a part of this proposal.


Codecs Interface Definition:
----------------------------

The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited to a wide
range of applications. The Unicode implementation expects Unicode
objects on input for .encode() and .write() and character buffer
compatible objects on input for .decode(). Output of .encode() and
.read() should be a Python string and .decode() must return a Unicode
object.

First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

         'strict'  - raise an error (or a subclass)
         'ignore'  - ignore the character and continue with the next
         'replace' - replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...
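A minimal stateless codec following this interface could look like the
sketch below (an illustrative Latin-1 codec; the real implementations
would live in the encodings package):

```python
class Latin1Codec:
    """Stateless codec sketch: Latin-1 maps ordinals 0-255 to single bytes."""

    def encode(self, input, errors="strict"):
        # Returns (output object, length consumed), as specified.
        return bytes(ord(c) for c in input), len(input)

    def decode(self, input, errors="strict"):
        return "".join(chr(b) for b in input), len(input)

codec = Latin1Codec()
data, consumed = codec.encode("héllo")
text, _ = codec.decode(data)
assert text == "héllo"
```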

StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to efficiently use memory. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.
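With today's codecs module (and io.BytesIO standing in for cStringIO),
wrapping a stream for chunked processing looks like this:

```python
import codecs
import io

raw = io.BytesIO()                        # in-memory byte stream
writer = codecs.getwriter("utf-8")(raw)   # StreamWriter factory, as specified
writer.write("abc\u1234")
writer.reset()

raw.seek(0)
reader = codecs.getreader("utf-8")(raw)   # StreamReader factory
part = reader.read(2)    # decode roughly the next 2 bytes' worth
rest = reader.read()     # decode whatever remains
assert part + rest == "abc\u1234"
```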

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state, that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible. size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g. if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError,why:
                # This method is slow but should work under pretty much
                # all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

XXX What about .readline(), .readlines() ? These could be implemented
    using .read() as generic functions instead of requiring their
    implementation by all codecs. Also see Line Breaks.

Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using static
C data in separate (shared) extension modules. That way multiple
processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should be
provided to simplify support for additional mappings (see References).


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.
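str.split() in current Python does exactly that: any character with
the Unicode whitespace property acts as a separator:

```python
# U+00A0 NO-BREAK SPACE and U+2003 EM SPACE both count as whitespace.
assert "a\u00a0b\u2003c".split() == ["a", "b", "c"]

# A non-whitespace symbol such as U+00B7 MIDDLE DOT does not.
assert "a\u00b7b".split() == ["a\u00b7b"]
```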


Case Conversion:
----------------

Case conversion is rather complicated with Unicode data, since there
are many different conditions to respect. See

  http://www.unicode.org/unicode/reports/tr13/

for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in
Unicode. Locale dependent and other special case conversions (see the
Unicode standard file SpecialCasing.txt) should be left to user land
routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case
mapping algorithm defined in the above technical report as closely as
possible.


Line Breaks:
------------

Line breaking should be done for all Unicode characters having the B
property as well as the combinations CRLF, CR, LF (interpreted in that
order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a
list of lines according to the above specification. See Unicode
Methods.
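.splitlines() behaves this way in current Python, breaking on CRLF,
CR, LF and the other Unicode line separators:

```python
text = "one\r\ntwo\rthree\nfour\u2028five"   # U+2028 is LINE SEPARATOR
assert text.splitlines() == ["one", "two", "three", "four", "five"]

# With the include_breaks flag of the proposal (named keepends today):
assert "a\r\nb".splitlines(True) == ["a\r\n", "b"]
```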


Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the data
should be stored in static C data. This enables compilation as a shared
module which the underlying OS can share between processes (unlike
normal Python code modules).

There should be a standard Python interface for accessing this information
so that other implementors can plug in their own possibly enhanced versions,
e.g. ones that do decompressing of the data on-the-fly.
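The unicodedata module was added to Python as proposed; a few of the
properties it exposes:

```python
import unicodedata

assert unicodedata.category("A") == "Lu"     # Letter, uppercase
assert unicodedata.category(" ") == "Zs"     # Separator, space
assert unicodedata.digit("3") == 3           # digit value
assert unicodedata.numeric("\u00bd") == 0.5  # VULGAR FRACTION ONE HALF
print(unicodedata.name("\u00e9"))            # character name lookup
```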


Private Code Point Areas:
-------------------------

Support for these is left to user land Codecs and not explicitly
integrated into the core. Note that due to the Internal Format being
implemented, only the area between \uE000 and \uF8FF is usable for
private encodings.


Internal Format:
----------------

The internal format for Unicode objects should use a Python specific
fixed format <PythonUnicode> implemented as 'unsigned short' (or
another unsigned numeric type having 16 bits). Byte order is platform
dependent.

This format will hold UTF-16 encodings of the corresponding Unicode
ordinals. The Python Unicode implementation will address these values
as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
currently defined Unicode character points. UTF-16 without surrogates
provides access to about 64k characters and covers all characters in
the Basic Multilingual Plane (BMP) of Unicode.

It is the Codec's responsibility to ensure that the data they pass to
the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use of
surrogates.

Future implementations can extend the 16 bit restriction to the full
set of all UTF-16 addressable characters (around 1M characters).

The Unicode API should provide interface routines from <PythonUnicode>
to the compiler's wchar_t which can be 16 or 32 bit depending on the
compiler/libc/platform being used.

Unicode objects should have a pointer to a cached Python string object
<defencstr> holding the object's value using the current <default
encoding>. This is needed for performance and internal parsing (see
Internal Argument Parsing) reasons. The buffer is filled when the
first conversion request to the <default encoding> is issued on the
object.

Interning is not needed (for now), since Python identifiers are
defined as being ASCII only.

codecs.BOM should return the byte order mark (BOM) for the format
used internally. The codecs module should provide the following
additional constants for convenience and reference (codecs.BOM will
either be BOM_BE or BOM_LE depending on the platform):

  BOM_BE: '\376\377'
    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
     platforms == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376'
    (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
     platforms == defined as being an illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode U+0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (corresponds to Unicode U+0000FFFE in UCS-4)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.
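The codecs module in current Python exposes these constants as bytes
objects (the BOM4_* names became BOM_UTF32_*):

```python
import codecs

assert codecs.BOM_BE == b"\xfe\xff"                   # U+FEFF, big endian
assert codecs.BOM_LE == b"\xff\xfe"                   # swapped order
assert codecs.BOM in (codecs.BOM_BE, codecs.BOM_LE)   # platform dependent
assert codecs.BOM_UTF32_BE == b"\x00\x00\xfe\xff"     # four-byte form
```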

The configure script should provide aid in deciding whether Python can
use the native wchar_t type or not (it has to be a 16-bit unsigned
type).


Buffer Interface:
-----------------

Implement the buffer interface using the <defencstr> Python string
object as basis for bf_getcharbuf (corresponds to the "t#" argument
parsing marker) and the internal buffer for bf_getreadbuf (corresponds
to the "s#" argument parsing marker). If bf_getcharbuf is requested
and the <defencstr> object does not yet exist, it is created first.

This has the advantage of being able to write to output streams (which
typically use this interface) without additional specification of the
encoding to use.

The internal format can also be accessed using the 'unicode-internal'
codec, e.g. via u.encode('unicode-internal').


Pickle/Marshalling:
-------------------

Should have native Unicode object support. The objects should be
encoded using platform independent encodings.

Marshal should use UTF-8 and Pickle should either choose
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
encoding. Using UTF-8 instead of UTF-16 has the advantage of
eliminating the need to store a BOM mark.


Regular Expressions:
--------------------

Secret Labs AB is working on a Unicode-aware regular expression
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
internal character buffers.

Also see

  http://www.unicode.org/unicode/reports/tr18/

for some remarks on how to treat Unicode REs.


Formatting Markers:
-------------------

Format markers are used in Python format strings. If Python strings
are used as format strings, the following interpretations should be in
effect:

  '%s': '%s' does str(u) for Unicode objects embedded
        in Python strings, so the output will be
        u.encode(<default encoding>)

In case the format string is a Unicode object, all parameters are coerced
to Unicode first and then put together and formatted according to the
format string. Numbers are first converted to strings and then to Unicode.

  '%s': Python strings are interpreted as Unicode
        string using the <default encoding>. Unicode
        objects are taken as is.

All other string formatters should work accordingly.

Example:

u"%s %s" % (u"abc", "abc") == u"abc abc"


Internal Argument Parsing:
--------------------------

These markers are used by the PyArg_ParseTuple() APIs:

  "U":  Check for Unicode object and return a pointer to it

  "s":  For Unicode objects: auto convert them to the <default encoding>
        and return a pointer to the object's <defencstr> buffer.

  "s#": Access to the Unicode object via the bf_getreadbuf buffer interface
        (see Buffer Interface); note that the length relates to the buffer
        length, not the Unicode string length (this may be different
        depending on the Internal Format).

  "t#": Access to the Unicode object via the bf_getcharbuf buffer interface
        (see Buffer Interface); note that the length relates to the buffer
        length, not necessarily to the Unicode string length (this may
        be different depending on the <default encoding>).

  "es":
        Takes two parameters: encoding (const char *) and
        buffer (char **).

        The input object is first coerced to Unicode in the usual way
        and then encoded into a string using the given encoding.

        On output, a buffer of the needed size is allocated and
        returned through *buffer as a NULL-terminated string.
        The encoded string may not contain embedded NULL characters.
        The caller is responsible for calling PyMem_Free()
        to free the allocated *buffer after usage.

  "es#":
        Takes three parameters: encoding (const char *),
        buffer (char **) and buffer_len (int *).

        The input object is first coerced to Unicode in the usual way
        and then encoded into a string using the given encoding.

        If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer)
        on input. Output is then copied to *buffer.

        If *buffer is NULL, a buffer of the needed size is
        allocated and output copied into it. *buffer is then
        updated to point to the allocated memory area.
        The caller is responsible for calling PyMem_Free()
        to free the allocated *buffer after usage.

        In both cases *buffer_len is updated to the number of
        characters written (excluding the trailing NULL-byte).
        The output buffer is assured to be NULL-terminated.

Examples:

Using "es#" with auto-allocation:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;
        int buffer_len = 0;

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        PyMem_Free(buffer);
        return str;
    }

Using "es" with auto-allocation returning a NULL-terminated string:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;

        if (!PyArg_ParseTuple(args, "es:test_parser",
                              encoding, &buffer))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromString(buffer);
        PyMem_Free(buffer);
        return str;
    }

Using "es#" with a pre-allocated buffer:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }


File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the "s#"
argument parsing marker for binary files and "t#" for text files, the
buffer interface implementation determines the encoding to use (see
Buffer Interface).

For explicit handling of files using Unicode, the standard
stream codecs as available through the codecs module should
be used.

The codecs module should provide a short-cut open(filename,mode,encoding)
which also assures that mode contains the 'b' character when needed.
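codecs.open() was added as described and still works today (the
built-in open(..., encoding=...) is its modern successor); a sketch
against a temporary file:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# codecs.open() forces binary mode underneath and wraps the file
# with the utf-8 stream reader/writer.
f = codecs.open(path, "w", encoding="utf-8")
f.write("abc\u1234")
f.close()

f = codecs.open(path, "r", encoding="utf-8")
assert f.read() == "abc\u1234"
f.close()
```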


File/Stream Input:
------------------

Only the user knows what encoding the input data uses, so no special
magic is applied. The user will have to explicitly convert the string
data to Unicode objects as needed or use the file wrappers defined in
the codecs module (see File/Stream Output).


Unicode Methods & Attributes:
-----------------------------

All Python string methods, plus:

  .encode([encoding=<default encoding>][,errors="strict"])
     --> see Unicode Output

  .splitlines([include_breaks=0])
     --> breaks the Unicode string into a list of (Unicode) lines;
         returns the lines with line breaks included, if include_breaks
         is true. See Line Breaks for a specification of how line
         breaking is done.


Code Base:
----------

We should use Fredrik Lundh's Unicode object implementation as basis.
It already implements most of the string methods needed and provides a
well written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should
be dropped.


Test Cases:
-----------

Test cases should follow those in Lib/test/test_string.py and include
additional checks for the Codec Registry and the Standard Codecs.


References:
-----------

Unicode Consortium:
  http://www.unicode.org/

Unicode FAQ:
  http://www.unicode.org/unicode/faq/

Unicode 3.0:
  http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

Unicode-TechReports:
  http://www.unicode.org/unicode/reports/techreports.html

Unicode-Mappings:
  ftp://ftp.unicode.org/Public/MAPPINGS/

Introduction to Unicode (a little outdated but still nice to read):
  http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

For comparison:
  Introducing Unicode to ECMAScript --
  http://www-4.ibm.com/software/developer/library/internationalization-support.html

Encodings:

  Overview:
    http://czyborra.com/utf/

  UCS-2:
    http://www.uazone.com/multiling/unicode/ucs2.html

  UTF-7:
    Defined in RFC 2152, e.g.
    http://www.uazone.com/multiling/ml-docs/rfc2152.txt

  UTF-8:
    Defined in RFC 2279, e.g.
    http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

  UTF-16:
    http://www.uazone.com/multiling/unicode/wg2n1035.html

943
History of this Proposal:
-------------------------
1.3: Added new "es" and "es#" parser markers
1.2: Removed POD about codecs.open()
1.1: Added note about comparisons and hash values. Added note about
     case mapping algorithms. Changed stream codecs' .read() and
     .write() methods to match the standard file-like object methods
     (bytes consumed information is no longer returned by the methods)
1.0: changed encode Codec method to be symmetric to the decode method
     (they both return (object, data consumed) now and thus become
     interchangeable); removed __init__ method of Codec class (the
     methods are stateless) and moved the errors argument down to the
     methods; made the Codec design more generic w/r to type of input
     and output objects; changed StreamWriter.flush to StreamWriter.reset
     in order to avoid overriding the stream's .flush() method;
     renamed .breaklines() to .splitlines(); renamed the module unicodec
     to codecs; modified the File I/O section to refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
     handling; changed the codec APIs to accept buffer-like objects on
     input; some minor typo fixes; added Whitespace section and
     included references for Unicode characters that have the whitespace
     and the line break characteristic; added note that search functions
     can expect lower-case encoding names; dropped slicing and offsets
     in the codec APIs
0.8: added encodings package and raw unicode escape encoding; untabified
     the proposal; added notes on Unicode format strings; added
     .breaklines() method
0.7: added a whole new set of codec APIs; added a different encoder
     lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
     a real Python string object; changed Buffer Interface to delegate
     requests to <defencstr>'s buffer interface; removed the explicit
     reference to the unicodec.codecs dictionary (the module can implement
     this in any way fit for the purpose); removed the settable default
     encoding; moved UnicodeError from unicodec to exceptions; "s#"
     now returns the internal data; passed the UCS-2/UTF-16 checking
     from the Unicode constructor to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
     private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some encoding
     details, added comments on stream wrappers, fixed some discussion
     points (most important: Internal Format), clarified the
     'unicode-escape' encoding, added encoding references
0.3: added references, comments on codec modules, the internal format,
     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
     proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
     wrapping
0.1: first version


-----------------------------------------------------------------------------
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
-----------------------------------------------------------------------------