=============================================================================
 Python Unicode Integration                          Proposal Version: 1.2
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python -- we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples we use u = Unicode object and s = Python string

· 'XXX' markings indicate points of discussion (PODs)


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16'
  is written as 'utf-16'.

  Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.
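The naming rules above can be sketched as a pair of small helpers (the
function names here are hypothetical, not part of the proposed API):

```python
def normalize_encoding(name):
    """Lower-case an encoding name and turn spaces into hyphens."""
    return "-".join(name.lower().split())

def codec_module_name(name):
    """Derive the codec module name: hyphens become underscores."""
    return normalize_encoding(name).replace("-", "_")
```

For example, 'UTF 16' normalizes to 'utf-16', whose codec module would
be named utf_16.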

· The <default encoding> should be the widely used 'utf-8' format. This
  is very close to the standard 7-bit ASCII format and thus resembles
  the standard used in programming nowadays in most aspects.


Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section
below.

Examples:

u'abc'         -> U+0061 U+0062 U+0063
u'\u1234'      -> U+1234
u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
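Today's Python still ships a codec implementing this scheme; decoding a
byte string through it looks as follows (modern spelling shown; the
proposal's unicode(s, 'unicode-escape') is equivalent):

```python
# Raw bytes containing the escapes \u1234 and \n, decoded with the
# 'unicode-escape' codec described above.
encoded = b"abc\\u1234\\n"
decoded = encoded.decode("unicode-escape")

assert decoded == "abc\u1234\n"   # \n becomes U+000A
assert ord(decoded[3]) == 0x1234  # \u1234 becomes U+1234
```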

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)


Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.


Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1])   (this is the standard ord() extended to work with
               Unicode objects)
      --> Unicode ordinal number (32-bit)

  unichr(i)
      --> Unicode object for character i (provided it is 32-bit);
          ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().
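Both APIs exist in Python today (unichr() was merged into chr() in
Python 3); their behaviour matches this specification:

```python
assert ord("a") == 0x61          # standard ord(), Unicode-aware
assert chr(0x1234) == "\u1234"   # the proposal's unichr()

# Out-of-range ordinals raise ValueError, as specified.
try:
    chr(0x110000)                # one past the top of the Unicode range
except ValueError:
    print("ValueError, as specified")
```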

Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.


Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as Unicode string using the <default
encoding>.

For the same reason, Unicode objects should return the same hash value
as their UTF-8 equivalent strings.

Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting w/r to Unicode objects, see
Formatting Markers.


Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters, and return a tuple of functions (encoder,
decoder, stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing
        the interfaces defined by StreamWriter/StreamReader resp.
        (see Codec Interface). Stream codecs can maintain state.

        Possible values for errors are defined in the Codec
        section below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions
to implement.

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To query the Codec instance the following API should be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.
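The codecs module in current Python still follows this design;
codecs.lookup() now returns a CodecInfo object, a tuple subclass
carrying the same four entries:

```python
import codecs

info = codecs.lookup("UTF 8".lower().replace(" ", "-"))  # normalized name
data, consumed = info.encode("abc\u1234")  # stateless encoder: (bytes, consumed)

# Unknown encodings raise LookupError, exactly as specified above.
try:
    codecs.lookup("no-such-encoding")
except LookupError:
    print("LookupError raised")
```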


Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.
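All of these codecs and the 'latin-1' alias exist in today's Python; a
quick check of the alias and BOM behaviour:

```python
import codecs

# 'latin-1' is an alias for 'iso-8859-1': identical output.
assert "é".encode("latin-1") == "é".encode("iso-8859-1") == b"\xe9"

# The generic 'utf-16' codec prepends a byte order mark ...
encoded = "abc".encode("utf-16")
assert encoded[:2] in (codecs.BOM_LE, codecs.BOM_BE)

# ... while the endian-specific variants do not.
assert "abc".encode("utf-16-le") == b"a\x00b\x00c\x00"
```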

All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get included
in the core Python distribution and are not a part of this proposal.


Codecs Interface Definition:
----------------------------

The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited to a wide
range of applications. The Unicode implementation expects Unicode
objects on input for .encode() and .write() and character buffer
compatible objects on input for .decode(). Output of .encode() and
.read() should be a Python string and .decode() must return a Unicode
object.

First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

         'strict'  - raise an error (or a subclass)
         'ignore'  - ignore the character and continue with the next
         'replace' - replace with a suitable replacement character;
                     Python will use the official U+FFFD REPLACEMENT
                     CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...
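A minimal stateless codec following this interface could look like the
sketch below (an illustrative Latin-1 codec; the real implementations
would live in the encodings package):

```python
class Latin1Codec:
    """Stateless codec sketch: Latin-1 maps ordinals 0-255 to single bytes."""

    def encode(self, input, errors="strict"):
        # Returns (output object, length consumed), as specified.
        return bytes(ord(c) for c in input), len(input)

    def decode(self, input, errors="strict"):
        return "".join(chr(b) for b in input), len(input)

codec = Latin1Codec()
data, consumed = codec.encode("héllo")
text, _ = codec.decode(data)
assert text == "héllo"
```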

StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to efficiently use memory. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.
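With today's codecs module (and io.BytesIO standing in for cStringIO),
wrapping a stream for chunked processing looks like this:

```python
import codecs
import io

raw = io.BytesIO()                        # in-memory byte stream
writer = codecs.getwriter("utf-8")(raw)   # StreamWriter factory, as specified
writer.write("abc\u1234")
writer.reset()

raw.seek(0)
reader = codecs.getreader("utf-8")(raw)   # StreamReader factory
part = reader.read(2)    # decode roughly the next 2 bytes' worth
rest = reader.read()     # decode whatever remains
assert part + rest == "abc\u1234"
```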

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state, that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible. size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g. if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError,why:
                # This method is slow but should work under pretty much
                # all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)

XXX What about .readline(), .readlines() ? These could be implemented
    using .read() as generic functions instead of requiring their
    implementation by all codecs. Also see Line Breaks.

Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using static
C data in separate (shared) extension modules. That way multiple
processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should be
provided to simplify support for additional mappings (see References).


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.
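str.split() in current Python does exactly that: any character with
the Unicode whitespace property acts as a separator:

```python
# U+00A0 NO-BREAK SPACE and U+2003 EM SPACE both count as whitespace.
assert "a\u00a0b\u2003c".split() == ["a", "b", "c"]

# A non-whitespace symbol such as U+00B7 MIDDLE DOT does not.
assert "a\u00b7b".split() == ["a\u00b7b"]
```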


Case Conversion:
----------------

Case conversion is rather complicated with Unicode data, since there
are many different conditions to respect. See

  http://www.unicode.org/unicode/reports/tr13/

for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in
Unicode. Locale dependent and other special case conversions (see the
Unicode standard file SpecialCasing.txt) should be left to user land
routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case
mapping algorithm defined in the above technical report as closely as
possible.


Line Breaks:
------------

Line breaking should be done for all Unicode characters having the B
property as well as the combinations CRLF, CR, LF (interpreted in that
order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a
list of lines according to the above specification. See Unicode
Methods.
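.splitlines() behaves this way in current Python, breaking on CRLF,
CR, LF and the other Unicode line separators:

```python
text = "one\r\ntwo\rthree\nfour\u2028five"   # U+2028 is LINE SEPARATOR
assert text.splitlines() == ["one", "two", "three", "four", "five"]

# With the include_breaks flag of the proposal (named keepends today):
assert "a\r\nb".splitlines(True) == ["a\r\n", "b"]
```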


Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the data
should be stored in static C data. This enables compilation as a shared
module which the underlying OS can share between processes (unlike
normal Python code modules).

There should be a standard Python interface for accessing this information
so that other implementors can plug in their own possibly enhanced versions,
e.g. ones that do decompressing of the data on-the-fly.
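The unicodedata module was added to Python as proposed; a few of the
properties it exposes:

```python
import unicodedata

assert unicodedata.category("A") == "Lu"     # Letter, uppercase
assert unicodedata.category(" ") == "Zs"     # Separator, space
assert unicodedata.digit("3") == 3           # digit value
assert unicodedata.numeric("\u00bd") == 0.5  # VULGAR FRACTION ONE HALF
print(unicodedata.name("\u00e9"))            # character name lookup
```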


Private Code Point Areas:
-------------------------

Support for these is left to user land Codecs and not explicitly
integrated into the core. Note that due to the Internal Format being
implemented, only the area between \uE000 and \uF8FF is usable for
private encodings.


Internal Format:
----------------

The internal format for Unicode objects should use a Python specific
fixed format <PythonUnicode> implemented as 'unsigned short' (or
another unsigned numeric type having 16 bits). Byte order is platform
dependent.

This format will hold UTF-16 encodings of the corresponding Unicode
ordinals. The Python Unicode implementation will address these values
as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
currently defined Unicode character points. UTF-16 without surrogates
provides access to about 64k characters and covers all characters in
the Basic Multilingual Plane (BMP) of Unicode.

It is the Codec's responsibility to ensure that the data they pass to
the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use of
surrogates.

Future implementations can extend the 16 bit restriction to the full
set of all UTF-16 addressable characters (around 1M characters).

The Unicode API should provide interface routines from <PythonUnicode>
to the compiler's wchar_t which can be 16 or 32 bit depending on the
compiler/libc/platform being used.

Unicode objects should have a pointer to a cached Python string object
<defencstr> holding the object's value using the current <default
encoding>. This is needed for performance and internal parsing (see
Internal Argument Parsing) reasons. The buffer is filled when the
first conversion request to the <default encoding> is issued on the
object.

Interning is not needed (for now), since Python identifiers are
defined as being ASCII only.

codecs.BOM should return the byte order mark (BOM) for the format
used internally. The codecs module should provide the following
additional constants for convenience and reference (codecs.BOM will
either be BOM_BE or BOM_LE depending on the platform):

  BOM_BE: '\376\377'
    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
     platforms == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376'
    (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
     platforms == defined as being an illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode U+0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (corresponds to Unicode U+0000FFFE in UCS-4)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.
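The codecs module in current Python exposes these constants as bytes
objects (the BOM4_* names became BOM_UTF32_*):

```python
import codecs

assert codecs.BOM_BE == b"\xfe\xff"                   # U+FEFF, big endian
assert codecs.BOM_LE == b"\xff\xfe"                   # swapped order
assert codecs.BOM in (codecs.BOM_BE, codecs.BOM_LE)   # platform dependent
assert codecs.BOM_UTF32_BE == b"\x00\x00\xfe\xff"     # four-byte form
```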

The configure script should provide aid in deciding whether Python can
use the native wchar_t type or not (it has to be a 16-bit unsigned
type).


Buffer Interface:
-----------------

Implement the buffer interface using the <defencstr> Python string
object as basis for bf_getcharbuf (corresponds to the "t#" argument
parsing marker) and the internal buffer for bf_getreadbuf (corresponds
to the "s#" argument parsing marker). If bf_getcharbuf is requested
and the <defencstr> object does not yet exist, it is created first.

This has the advantage of being able to write to output streams (which
typically use this interface) without additional specification of the
encoding to use.

The internal format can also be accessed using the 'unicode-internal'
codec, e.g. via u.encode('unicode-internal').


Pickle/Marshalling:
-------------------

Should have native Unicode object support. The objects should be
encoded using platform independent encodings.

Marshal should use UTF-8 and Pickle should either choose
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
encoding. Using UTF-8 instead of UTF-16 has the advantage of
eliminating the need to store a BOM mark.


Regular Expressions:
--------------------

Secret Labs AB is working on a Unicode-aware regular expression
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
internal character buffers.

Also see

  http://www.unicode.org/unicode/reports/tr18/

for some remarks on how to treat Unicode REs.


Formatting Markers:
-------------------

Format markers are used in Python format strings. If Python strings
are used as format strings, the following interpretations should be in
effect:

  '%s': '%s' does str(u) for Unicode objects embedded
        in Python strings, so the output will be
        u.encode(<default encoding>)

In case the format string is a Unicode object, all parameters are coerced
to Unicode first and then put together and formatted according to the
format string. Numbers are first converted to strings and then to Unicode.

  '%s': Python strings are interpreted as Unicode
        string using the <default encoding>. Unicode
        objects are taken as is.

All other string formatters should work accordingly.

Example:

u"%s %s" % (u"abc", "abc") == u"abc abc"


Internal Argument Parsing:
--------------------------

These markers are used by the PyArg_ParseTuple() APIs:

  "U":  Check for Unicode object and return a pointer to it

  "s":  For Unicode objects: auto convert them to the <default encoding>
        and return a pointer to the object's <defencstr> buffer.

  "s#": Access to the Unicode object via the bf_getreadbuf buffer interface
        (see Buffer Interface); note that the length relates to the buffer
        length, not the Unicode string length (this may be different
        depending on the Internal Format).

  "t#": Access to the Unicode object via the bf_getcharbuf buffer interface
        (see Buffer Interface); note that the length relates to the buffer
        length, not necessarily to the Unicode string length (this may
        be different depending on the <default encoding>).

  "es":
        Takes two parameters: encoding (const char *) and
        buffer (char **).

        The input object is first coerced to Unicode in the usual way
        and then encoded into a string using the given encoding.

        On output, a buffer of the needed size is allocated and
        returned through *buffer as a NULL-terminated string.
        The encoded string may not contain embedded NULL characters.
        The caller is responsible for calling PyMem_Free()
        to free the allocated *buffer after usage.

  "es#":
        Takes three parameters: encoding (const char *),
        buffer (char **) and buffer_len (int *).

        The input object is first coerced to Unicode in the usual way
        and then encoded into a string using the given encoding.

        If *buffer is non-NULL, *buffer_len must be set to sizeof(buffer)
        on input. Output is then copied to *buffer.

        If *buffer is NULL, a buffer of the needed size is
        allocated and output copied into it. *buffer is then
        updated to point to the allocated memory area.
        The caller is responsible for calling PyMem_Free()
        to free the allocated *buffer after usage.

        In both cases *buffer_len is updated to the number of
        characters written (excluding the trailing NULL-byte).
        The output buffer is assured to be NULL-terminated.

Examples:

Using "es#" with auto-allocation:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;
        int buffer_len = 0;

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        PyMem_Free(buffer);
        return str;
    }

Using "es" with auto-allocation returning a NULL-terminated string:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;

        if (!PyArg_ParseTuple(args, "es:test_parser",
                              encoding, &buffer))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromString(buffer);
        PyMem_Free(buffer);
        return str;
    }

Using "es#" with a pre-allocated buffer:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }


File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the "s#"
argument parsing marker for binary files and "t#" for text files, the
buffer interface implementation determines the encoding to use (see
Buffer Interface).

For explicit handling of files using Unicode, the standard
stream codecs as available through the codecs module should
be used.

The codecs module should provide a short-cut open(filename,mode,encoding)
which also assures that mode contains the 'b' character when needed.
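codecs.open() was added as described and still works today (the
built-in open(..., encoding=...) is its modern successor); a sketch
against a temporary file:

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# codecs.open() forces binary mode underneath and wraps the file
# with the utf-8 stream reader/writer.
f = codecs.open(path, "w", encoding="utf-8")
f.write("abc\u1234")
f.close()

f = codecs.open(path, "r", encoding="utf-8")
assert f.read() == "abc\u1234"
f.close()
```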


File/Stream Input:
------------------

Only the user knows what encoding the input data uses, so no special
magic is applied. The user will have to explicitly convert the string
data to Unicode objects as needed or use the file wrappers defined in
the codecs module (see File/Stream Output).


Unicode Methods & Attributes:
-----------------------------

All Python string methods, plus:

  .encode([encoding=<default encoding>][,errors="strict"])
     --> see Unicode Output

  .splitlines([include_breaks=0])
     --> breaks the Unicode string into a list of (Unicode) lines;
         returns the lines with line breaks included, if include_breaks
         is true. See Line Breaks for a specification of how line
         breaking is done.


Code Base:
----------

We should use Fredrik Lundh's Unicode object implementation as basis.
It already implements most of the string methods needed and provides a
well written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should
be dropped.


Test Cases:
-----------

Test cases should follow those in Lib/test/test_string.py and include
additional checks for the Codec Registry and the Standard Codecs.


References:
-----------

Unicode Consortium:
  http://www.unicode.org/

Unicode FAQ:
  http://www.unicode.org/unicode/faq/

Unicode 3.0:
  http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

Unicode-TechReports:
  http://www.unicode.org/unicode/reports/techreports.html

Unicode-Mappings:
  ftp://ftp.unicode.org/Public/MAPPINGS/

Introduction to Unicode (a little outdated but still nice to read):
  http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

For comparison:
  Introducing Unicode to ECMAScript --
  http://www-4.ibm.com/software/developer/library/internationalization-support.html

Encodings:

  Overview:
    http://czyborra.com/utf/

  UCS-2:
    http://www.uazone.com/multiling/unicode/ucs2.html

  UTF-7:
    Defined in RFC 2152, e.g.
    http://www.uazone.com/multiling/ml-docs/rfc2152.txt

  UTF-8:
    Defined in RFC 2279, e.g.
    http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

  UTF-16:
    http://www.uazone.com/multiling/unicode/wg2n1035.html

943
History of this Proposal:
-------------------------
1.3: Added new "es" and "es#" parser markers
1.2: Removed POD about codecs.open()
1.1: Added note about comparisons and hash values. Added note about
     case mapping algorithms. Changed stream codecs' .read() and
     .write() methods to match the standard file-like object methods
     (bytes consumed information is no longer returned by the methods)
1.0: changed encode Codec method to be symmetric to the decode method
     (they both return (object, data consumed) now and thus become
     interchangeable); removed __init__ method of Codec class (the
     methods are stateless) and moved the errors argument down to the
     methods; made the Codec design more generic w/r to type of input
     and output objects; changed StreamWriter.flush to StreamWriter.reset
     in order to avoid overriding the stream's .flush() method;
     renamed .breaklines() to .splitlines(); renamed the module unicodec
     to codecs; modified the File I/O section to refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
     handling; changed the codec APIs to accept buffer-like objects on
     input; some minor typo fixes; added Whitespace section and
     included references for Unicode characters that have the whitespace
     and the line break characteristic; added note that search functions
     can expect lower-case encoding names; dropped slicing and offsets
     in the codec APIs
0.8: added encodings package and raw unicode escape encoding; untabified
     the proposal; added notes on Unicode format strings; added
     .breaklines() method
0.7: added a whole new set of codec APIs; added a different encoder
     lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
     a real Python string object; changed Buffer Interface to delegate
     requests to <defencstr>'s buffer interface; removed the explicit
     reference to the unicodec.codecs dictionary (the module can implement
     this in any way fit for the purpose); removed the settable default
     encoding; moved UnicodeError from unicodec to exceptions; "s#"
     now returns the internal data; passed the UCS-2/UTF-16 checking
     from the Unicode constructor to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
     private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some encoding
     details, added comments on stream wrappers, fixed some discussion
     points (most important: Internal Format), clarified the
     'unicode-escape' encoding, added encoding references
0.3: added references, comments on codec modules, the internal format,
     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
     proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
     wrapping
0.1: first version


-----------------------------------------------------------------------------
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
-----------------------------------------------------------------------------