=============================================================================
Python Unicode Integration                          Proposal Version: 1.4
-----------------------------------------------------------------------------


Introduction:
-------------

The idea of this proposal is to add native Unicode 3.0 support to
Python in a way that makes use of Unicode strings as simple as
possible without introducing too many pitfalls along the way.

Since this goal is not easy to achieve -- strings being one of the
most fundamental objects in Python --, we expect this proposal to
undergo some significant refinements.

Note that the current version of this proposal is still a bit unsorted
due to the many different aspects of the Unicode-Python integration.

The latest version of this document is always available at:

        http://starship.python.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.python.net/~lemburg/unicode-proposal-X.X.txt


Conventions:
------------

· In examples we use u = Unicode object and s = Python string

· 'XXX' markings indicate points of discussion (PODs)


General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.

  Codec modules should use the same names, but with hyphens converted
  to underscores, e.g. utf_8, utf_16, iso_8859_1.
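The two normalization rules above can be sketched as a pair of small
helper functions (the function names are made up for this example;
they are not part of the proposal):

```python
def normalize_encoding_name(name):
    # Lowercase and convert spaces to hyphens, following the Unicode
    # Consortium naming conventions ('UTF 16' -> 'utf-16').
    return name.lower().replace(' ', '-')

def codec_module_name(name):
    # Codec modules use the same name with hyphens turned into
    # underscores ('iso-8859-1' -> 'iso_8859_1').
    return normalize_encoding_name(name).replace('-', '_')

assert normalize_encoding_name('UTF 16') == 'utf-16'
assert codec_module_name('ISO-8859-1') == 'iso_8859_1'
```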

· The <default encoding> should be the widely used 'utf-8' format. This
  is very close to the standard 7-bit ASCII format and thus resembles
  the standard used in programming nowadays in most aspects.
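The ASCII compatibility mentioned here is easy to check in any modern
Python, which kept UTF-8 as a built-in codec: pure-ASCII text encodes
to the identical byte sequence under 'utf-8' and 'ascii', while
non-ASCII characters become multi-byte sequences:

```python
text = "Hello, world!"
# ASCII text is byte-for-byte identical in UTF-8 and ASCII:
assert text.encode('utf-8') == text.encode('ascii')
# Non-ASCII characters use multi-byte sequences instead:
assert 'é'.encode('utf-8') == b'\xc3\xa9'
```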


Unicode Constructors:
---------------------

Python should provide a built-in constructor for Unicode strings which
is available through __builtins__:

  u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])

  u = u'<unicode-escape encoded Python string>'

  u = ur'<raw-unicode-escape encoded Python string>'

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

For an explanation of possible values for errors see the Codec section
below.

Examples:

  u'abc'          -> U+0061 U+0062 U+0063
  u'\u1234'       -> U+1234
  u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A
The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinal
  (e.g. 'b' -> U+0062)
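Both encodings survive as codecs in modern Python, which makes the
difference easy to demonstrate: 'unicode-escape' interprets all Python
escape sequences, while 'raw-unicode-escape' only interprets \uXXXX:

```python
# 'unicode-escape' decodes \uXXXX *and* classic escapes such as \n:
assert b'abc\\u1234\\n'.decode('unicode-escape') == 'abc\u1234\n'

# 'raw-unicode-escape' decodes \uXXXX but leaves \n as the two
# characters backslash + 'n':
assert b'\\u1234\\n'.decode('raw-unicode-escape') == '\u1234\\n'
```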


Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.


Unicode Type Object:
--------------------

Unicode objects should have the type UnicodeType with type name
'unicode', made available through the standard types module.


Unicode Output:
---------------

Unicode objects have a method .encode([encoding=<default encoding>])
which returns a Python string encoding the Unicode string using the
given scheme (see Codecs).

  print u := print u.encode()   # using the <default encoding>

  str(u)  := u.encode()         # using the <default encoding>

  repr(u) := "u%s" % repr(u.encode('unicode-escape'))

Also see Internal Argument Parsing and Buffer Interface for details on
how other APIs written in C will treat Unicode objects.


Unicode Ordinals:
-----------------

Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
should provide 32-bit aware ordinal conversion APIs:

  ord(u[:1])  (this is the standard ord() extended to work with Unicode
               objects)
      --> Unicode ordinal number (32-bit)

  unichr(i)
      --> Unicode object for character i (provided it is 32-bit);
          ValueError otherwise

Both APIs should go into __builtins__ just like their string
counterparts ord() and chr().

Note that Unicode provides space for private encodings. Usage of these
can cause different output representations on different machines. This
problem is not a Python or Unicode problem, but a machine setup and
maintenance one.


Comparison & Hash Value:
------------------------

Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as a Unicode string using the <default
encoding>.

For the same reason, Unicode objects should return the same hash value
as their UTF-8 equivalent strings.

When compared using cmp() (or PyObject_Compare()) the implementation
should mask TypeErrors raised during the conversion to remain in synch
with the string behavior. All other errors, such as ValueErrors raised
during coercion of strings to Unicode, should not be masked but passed
through to the user.

In containment tests ('a' in u'abc' and u'a' in 'abc') both sides
should be coerced to Unicode before applying the test. Errors occurring
during coercion (e.g. None in u'abc') should not be masked.


Coercion:
---------

Using Python strings and Unicode objects to form new objects should
always coerce to the more precise format, i.e. Unicode objects.

  u + s := u + unicode(s)

  s + u := unicode(s) + u

All string methods should delegate the call to an equivalent Unicode
object method call by converting all involved strings to Unicode and
then applying the arguments to the Unicode method of the same name,
e.g.

  string.join((s,u),sep) := (s + sep) + u

  sep.join((s,u)) := (s + sep) + u

For a discussion of %-formatting w/r to Unicode objects, see
Formatting Markers.


Exceptions:
-----------

UnicodeError is defined in the exceptions module as a subclass of
ValueError. It is available at the C level via PyExc_UnicodeError.
All exceptions related to Unicode encoding/decoding should be
subclasses of UnicodeError.


Codecs (Coder/Decoders) Lookup:
-------------------------------

A Codec (see Codec Interface Definition) search registry should be
implemented by a module "codecs":

  codecs.register(search_function)

Search functions are expected to take one argument, the encoding name
in all lower case letters and with hyphens and spaces converted to
underscores, and return a tuple of functions (encoder, decoder,
stream_reader, stream_writer) taking the following arguments:

  encoder and decoder:
        These must be functions or methods which have the same
        interface as the .encode/.decode methods of Codec instances
        (see Codec Interface). The functions/methods are expected to
        work in a stateless mode.

  stream_reader and stream_writer:
        These need to be factory functions with the following
        interface:

                factory(stream,errors='strict')

        The factory functions must return objects providing the
        interfaces defined by StreamWriter/StreamReader resp.
        (see Codec Interface). Stream codecs can maintain state.

        Possible values for errors are defined in the Codec
        section below.

In case a search function cannot find a given encoding, it should
return None.

Aliasing support for encodings is left to the search functions
to implement.
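The registry protocol described above still exists in modern Python's
codecs module (the returned tuple grew into a CodecInfo object). A
search function implementing a private alias might look like this;
the 'mylatin' name is made up for the example:

```python
import codecs

def search(name):
    # The registry hands us the normalized name (lower case, with
    # spaces converted to underscores).  Return None for anything
    # we do not recognize, as the proposal requires.
    if name == 'mylatin':
        # Alias onto the built-in Latin-1 codec by returning its
        # codecs tuple / CodecInfo.
        return codecs.lookup('iso-8859-1')
    return None

codecs.register(search)
assert 'é'.encode('mylatin') == b'\xe9'
```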

The codecs module will maintain an encoding cache for performance
reasons. Encodings are first looked up in the cache. If not found, the
list of registered search functions is scanned. If no codecs tuple is
found, a LookupError is raised. Otherwise, the codecs tuple is stored
in the cache and returned to the caller.

To query the Codec instance the following API should be used:

  codecs.lookup(encoding)

This will either return the found codecs tuple or raise a LookupError.
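This lookup API exists unchanged in modern Python, including the
case/space normalization described under General Remarks and the
LookupError for unknown encodings:

```python
import codecs

info = codecs.lookup('UTF 8')   # case and spaces are normalized
assert info.name == 'utf-8'

try:
    codecs.lookup('no-such-encoding')
except LookupError:
    pass
else:
    raise AssertionError('expected LookupError')
```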


Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

  'utf-8':              8-bit variable length encoding
  'utf-16':             16-bit variable length encoding (little/big endian)
  'utf-16-le':          utf-16 but explicitly little endian
  'utf-16-be':          utf-16 but explicitly big endian
  'ascii':              7-bit ASCII codepage
  'iso-8859-1':         ISO 8859-1 (Latin 1) codepage
  'unicode-escape':     See Unicode Constructors for a definition
  'raw-unicode-escape': See Unicode Constructors for a definition
  'native':             Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.
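Modern Python implements exactly this BOM behaviour for the 'utf-16'
codec, which can be verified directly:

```python
import codecs

data = 'abc'.encode('utf-16')
# The plain 'utf-16' codec writes a BOM first ...
assert data[:2] in (codecs.BOM_LE, codecs.BOM_BE)
# ... and consumes it again when decoding:
assert data.decode('utf-16') == 'abc'
# The endian-specific variants write no BOM:
assert 'abc'.encode('utf-16-be') == b'\x00a\x00b\x00c'
```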

All other encodings, such as the CJK ones needed to support Asian
scripts, should be implemented in separate packages which do not get
included in the core Python distribution and are not a part of this
proposal.


Codecs Interface Definition:
----------------------------

The following base classes should be defined in the module
"codecs". They provide not only templates for use by encoding module
implementors, but also define the interface which is expected by the
Unicode implementation.

Note that the Codec Interface defined here is well suited to a
larger range of applications. The Unicode implementation expects
Unicode objects on input for .encode() and .write() and character
buffer compatible objects on input for .decode(). Output of .encode()
and .read() should be a Python string and .decode() must return a
Unicode object.
First, we have the stateless encoders/decoders. These do not work in
chunks as the stream codecs (see below) do, because all components are
expected to be available in memory.

class Codec:

    """ Defines the interface for stateless encoders/decoders.

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

          'strict'  - raise an error (or a subclass)
          'ignore'  - ignore the character and continue with the next
          'replace' - replace with a suitable replacement character;
                      Python will use the official U+FFFD REPLACEMENT
                      CHARACTER for the builtin Unicode codecs.

    """
    def encode(self,input,errors='strict'):

        """ Encodes the object input and returns a tuple (output
            object, length consumed).

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...

    def decode(self,input,errors='strict'):

        """ Decodes the object input and returns a tuple (output
            object, length consumed).

            input must be an object which provides the bf_getreadbuf
            buffer slot. Python strings, buffer objects and memory
            mapped files are examples of objects providing this slot.

            errors defines the error handling to apply. It defaults to
            'strict' handling.

            The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
            make encoding/decoding efficient.

        """
        ...
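The (output object, length consumed) convention described above is
still what modern Python's stateless codec functions return, e.g. when
obtained via codecs.getencoder()/codecs.getdecoder():

```python
import codecs

encode = codecs.getencoder('utf-8')
data, consumed = encode('abcé')
assert data == b'abc\xc3\xa9'   # the encoded output object
assert consumed == 4            # number of characters consumed
```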

StreamWriter and StreamReader define the interface for stateful
encoders/decoders which work on streams. These allow processing of the
data in chunks to efficiently use memory. If you have large strings in
memory, you may want to wrap them with cStringIO objects and then use
these codecs on them to be able to do chunk processing as well,
e.g. to provide progress information to the user.

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamWriter instance.

            stream must be a file-like object open for writing
            (binary) data.

            The StreamWriter may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

              'strict'  - raise a ValueError (or a subclass)
              'ignore'  - ignore the character and continue with the next
              'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def write(self,object):

        """ Writes the object's contents encoded to self.stream.
        """
        data, consumed = self.encode(object,self.errors)
        self.stream.write(data)

    def writelines(self, list):

        """ Writes the concatenated list of strings to the stream
            using .write().
        """
        self.write(''.join(list))

    def reset(self):

        """ Flushes and resets the codec buffers used for keeping state.

            Calling this method should ensure that the data on the
            output is put into a clean state that allows appending
            of new fresh data without having to rescan the whole
            stream to recover state.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)
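A minimal usage sketch of this writer interface, using the modern
codecs module and an in-memory binary stream in place of a file:

```python
import codecs
import io

stream = io.BytesIO()                       # binary, file-like, open for writing
writer = codecs.getwriter('utf-8')(stream)  # factory(stream, errors='strict')
writer.write('héllo ')
writer.writelines(['wörld', '\n'])          # concatenated and written via .write()
assert stream.getvalue() == b'h\xc3\xa9llo w\xc3\xb6rld\n'
```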

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

        """ Creates a StreamReader instance.

            stream must be a file-like object open for reading
            (binary) data.

            The StreamReader may implement different error handling
            schemes by providing the errors keyword argument. These
            parameters are defined:

              'strict'  - raise a ValueError (or a subclass)
              'ignore'  - ignore the character and continue with the next
              'replace' - replace with a suitable replacement character

        """
        self.stream = stream
        self.errors = errors

    def read(self,size=-1):

        """ Decodes data from the stream self.stream and returns the
            resulting object.

            size indicates the approximate maximum number of bytes to
            read from the stream for decoding purposes. The decoder
            can modify this setting as appropriate. The default value
            -1 indicates to read and decode as much as possible. size
            is intended to prevent having to decode huge files in one
            step.

            The method should use a greedy read strategy, meaning that
            it should read as much data as is allowed within the
            definition of the encoding and the given size, e.g. if
            optional encoding endings or state markers are available
            on the stream, these should be read too.

        """
        # Unsliced reading:
        if size < 0:
            return self.decode(self.stream.read())[0]

        # Sliced reading:
        read = self.stream.read
        decode = self.decode
        data = read(size)
        i = 0
        while 1:
            try:
                object, decodedbytes = decode(data)
            except ValueError,why:
                # This method is slow but should work under pretty much
                # all conditions; at most 10 tries are made
                i = i + 1
                newdata = read(1)
                if not newdata or i > 10:
                    raise
                data = data + newdata
            else:
                return object

    def readline(self, size=None):

        """ Read one line from the input stream and return the
            decoded data.

            Note: Unlike the .readlines() method, this method inherits
            the line breaking knowledge from the underlying stream's
            .readline() method -- there is currently no support for
            line breaking using the codec decoder due to lack of line
            buffering. Subclasses should however, if possible, try to
            implement this method using their own knowledge of line
            breaking.

            size, if given, is passed as size argument to the stream's
            .readline() method.

        """
        if size is None:
            line = self.stream.readline()
        else:
            line = self.stream.readline(size)
        return self.decode(line)[0]

    def readlines(self, sizehint=None):

        """ Read all lines available on the input stream
            and return them as list of lines.

            Line breaks are implemented using the codec's decoder
            method and are included in the list entries.

            sizehint, if given, is passed as size argument to the
            stream's .read() method.

        """
        if sizehint is None:
            data = self.stream.read()
        else:
            data = self.stream.read(sizehint)
        return self.decode(data)[0].splitlines(1)

    def reset(self):

        """ Resets the codec buffers used for keeping state.

            Note that no stream repositioning should take place.
            This method is primarily intended to be able to recover
            from decoding errors.

        """
        pass

    def __getattr__(self,name,
                    getattr=getattr):

        """ Inherit all other methods from the underlying stream.
        """
        return getattr(self.stream,name)
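The reading side can be exercised the same way with the modern codecs
module; here the UTF-16 stream codec consumes the BOM and keeps the
necessary decoding state between calls:

```python
import codecs
import io

raw = 'line1\nline2\n'.encode('utf-16')     # BOM + 16-bit data

reader = codecs.getreader('utf-16')(io.BytesIO(raw))
assert reader.readlines() == ['line1\n', 'line2\n']

reader = codecs.getreader('utf-16')(io.BytesIO(raw))
assert reader.read() == 'line1\nline2\n'
```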


Stream codec implementors are free to combine the StreamWriter and
StreamReader interfaces into one class. Even combining all these with
the Codec class should be possible.

Implementors are free to add additional methods to enhance the codec
functionality or provide extra state information needed for them to
work. The internal codec implementation will only use the above
interfaces, though.

It is not required by the Unicode implementation to use these base
classes, only the interfaces must match; this allows writing Codecs as
extension types.

As a guideline, large mapping tables should be implemented using static
C data in separate (shared) extension modules. That way multiple
processes can share the same data.

A tool to auto-convert Unicode mapping files to mapping modules should be
provided to simplify support for additional mappings (see References).


Whitespace:
-----------

The .split() method will have to know about what is considered
whitespace in Unicode.
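This is indeed what modern Python's str.split() ended up doing: it
treats Unicode whitespace characters such as U+2000 EN QUAD and
U+00A0 NO-BREAK SPACE as separators, not just ASCII whitespace:

```python
# U+2000 EN QUAD and U+00A0 NO-BREAK SPACE both count as whitespace:
assert 'a\u2000b\u00a0c'.split() == ['a', 'b', 'c']
# U+3000 IDEOGRAPHIC SPACE is recognized as well:
assert '\u3000'.isspace()
```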


Case Conversion:
----------------

Case conversion is rather complicated with Unicode data, since there
are many different conditions to respect. See

        http://www.unicode.org/unicode/reports/tr13/

for some guidelines on implementing case conversion.

For Python, we should only implement the 1-1 conversions included in
Unicode. Locale dependent and other special case conversions (see the
Unicode standard file SpecialCasing.txt) should be left to user land
routines and not go into the core interpreter.

The methods .capitalize() and .iscapitalized() should follow the case
mapping algorithm defined in the above technical report as closely as
possible.


Line Breaks:
------------

Line breaking should be done for all Unicode characters having the B
property as well as the combinations CRLF, CR, LF (interpreted in that
order) and other special line separators defined by the standard.

The Unicode type should provide a .splitlines() method which returns a
list of lines according to the above specification. See Unicode
Methods.
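Modern str.splitlines() implements this specification, including the
CRLF/CR/LF ordering and the special Unicode line separators:

```python
text = 'one\r\ntwo\rthree\nfour\u2028five'
# \r\n counts as a single break; U+2028 LINE SEPARATOR also breaks:
assert text.splitlines() == ['one', 'two', 'three', 'four', 'five']
# With keepends=True the break characters are included in the entries:
assert text.splitlines(True) == ['one\r\n', 'two\r', 'three\n',
                                 'four\u2028', 'five']
```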


Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 600kB. For this reason, the data
should be stored in static C data. This enables compilation as a
shared module which the underlying OS can share between processes
(unlike normal Python code modules).

There should be a standard Python interface for accessing this information
so that other implementors can plug in their own possibly enhanced versions,
e.g. ones that decompress the data on the fly.
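The unicodedata module exists as proposed; a few representative
property lookups from UnicodeData.txt:

```python
import unicodedata

assert unicodedata.name('é') == 'LATIN SMALL LETTER E WITH ACUTE'
assert unicodedata.category('A') == 'Lu'   # uppercase letter
assert unicodedata.category(' ') == 'Zs'   # space separator
assert unicodedata.decimal('7') == 7       # decimal digit value
assert unicodedata.numeric('½') == 0.5     # works for non-digit numbers too
```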


Private Code Point Areas:
-------------------------

Support for these is left to user land Codecs and not explicitly
integrated into the core. Note that, due to the Internal Format being
used, only the area between \uE000 and \uF8FF is usable for private
encodings.


Internal Format:
----------------

The internal format for Unicode objects should use a Python specific
fixed format <PythonUnicode> implemented as 'unsigned short' (or
another unsigned numeric type having 16 bits). Byte order is platform
dependent.

This format will hold UTF-16 encodings of the corresponding Unicode
ordinals. The Python Unicode implementation will address these values
as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
currently defined Unicode character points. UTF-16 without surrogates
provides access to about 64k characters and covers all characters in
the Basic Multilingual Plane (BMP) of Unicode.

It is the responsibility of the Codecs to ensure that the data they
pass to the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use of
surrogates.
Future implementations can lift the BMP restriction and support the
full set of UTF-16 addressable characters (around 1M characters).

The Unicode API should provide interface routines from <PythonUnicode>
to the compiler's wchar_t which can be 16 or 32 bit depending on the
compiler/libc/platform being used.

Unicode objects should have a pointer to a cached Python string object
<defencstr> holding the object's value using the current <default
encoding>. This is needed for performance and internal parsing (see
Internal Argument Parsing) reasons. The buffer is filled when the
first conversion request to the <default encoding> is issued on the
object.

Interning is not needed (for now), since Python identifiers are
defined as being ASCII only.

codecs.BOM should return the byte order mark (BOM) for the format
used internally. The codecs module should provide the following
additional constants for convenience and reference (codecs.BOM will
either be BOM_BE or BOM_LE depending on the platform):

  BOM_BE: '\376\377'
    (corresponds to Unicode U+0000FEFF in UTF-16 on big endian
     platforms == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376'
    (corresponds to Unicode U+0000FFFE in UTF-16 on little endian
     platforms == defined as being an illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode U+0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (corresponds to Unicode U+0000FFFE in UCS-4)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.

The configure script should provide aid in deciding whether Python can
use the native wchar_t type or not (it has to be a 16-bit unsigned
type).
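The platform dependence of wchar_t is easy to observe from Python itself;
a purely illustrative sketch using the ctypes module (not part of this
proposal):

```python
import ctypes

# sizeof(wchar_t) for the C library Python was built against:
# 2 bytes on Windows, typically 4 on Linux/glibc.
width = ctypes.sizeof(ctypes.c_wchar)
print("wchar_t is %d bits wide here" % (width * 8))
```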


Buffer Interface:
-----------------

Implement the buffer interface using the <defencstr> Python string
object as the basis for bf_getcharbuf (corresponds to the "t#" argument
parsing marker) and the internal buffer for bf_getreadbuf (corresponds
to the "s#" argument parsing marker). If bf_getcharbuf is requested
and the <defencstr> object does not yet exist, it is created first.

This has the advantage of being able to write to output streams (which
typically use this interface) without additional specification of the
encoding to use.

The internal format can also be accessed using the 'unicode-internal'
codec, e.g. via u.encode('unicode-internal').


Pickle/Marshalling:
-------------------

Should have native Unicode object support. The objects should be
encoded using platform independent encodings.

Marshal should use UTF-8 and Pickle should either choose
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as the
encoding. Using UTF-8 instead of UTF-16 has the advantage of
eliminating the need to store a BOM.
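Both modules did grow native Unicode support; a round-trip sketch with
today's modules (their internal encodings may differ from the ones
proposed here, but the platform independence holds):

```python
import marshal
import pickle

# A Unicode string containing a character outside Latin-1.
s = u"abc\u20ac"

# Both serializations round-trip losslessly and are platform independent.
assert marshal.loads(marshal.dumps(s)) == s
assert pickle.loads(pickle.dumps(s)) == s
```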


Regular Expressions:
--------------------

Secret Labs AB is working on Unicode-aware regular expression
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
internal character buffers.

Also see

   http://www.unicode.org/unicode/reports/tr18/

for some remarks on how to treat Unicode REs.
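That machinery became the sre engine behind the re module; a brief
illustrative sketch of Unicode-aware matching (not part of the proposal
itself):

```python
import re

# \w matches Unicode word characters, so accented letters are included
# rather than being treated as word boundaries.
words = re.findall(r"\w+", u"h\u00e9llo w\u00f6rld")
print(words)
```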


Formatting Markers:
-------------------

Format markers are used in Python format strings. If Python strings
are used as format strings, the following interpretations should be in
effect:

  '%s': For Unicode objects this will cause coercion of the
        whole format string to Unicode. Note that
        you should use a Unicode format string to start
        with for performance reasons.

In case the format string is a Unicode object, all parameters are
coerced to Unicode first, then put together and formatted according to
the format string. Numbers are first converted to strings and then to
Unicode.

  '%s': Python strings are interpreted as Unicode
        strings using the <default encoding>. Unicode
        objects are taken as is.

All other string formatters should work accordingly.

Example:

u"%s %s" % (u"abc", "abc") == u"abc abc"

Internal Argument Parsing:
--------------------------

These markers are used by the PyArg_ParseTuple() APIs:

"U": Check for a Unicode object and return a pointer to it

"s": For Unicode objects: auto convert them to the <default encoding>
     and return a pointer to the object's <defencstr> buffer.

"s#": Access to the Unicode object via the bf_getreadbuf buffer interface
      (see Buffer Interface); note that the length relates to the buffer
      length, not the Unicode string length (this may be different
      depending on the Internal Format).

"t#": Access to the Unicode object via the bf_getcharbuf buffer interface
      (see Buffer Interface); note that the length relates to the buffer
      length, not necessarily to the Unicode string length (this may
      be different depending on the <default encoding>).
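The buffer-length versus string-length distinction is easy to see from
Python with an explicit encode (UTF-8 stands in for the <default
encoding> here):

```python
# One character: LATIN SMALL LETTER E WITH ACUTE.
u = u"\u00e9"

# One Unicode character, but two bytes in the encoded buffer --
# exactly the mismatch the "t#" note above warns about.
assert len(u) == 1
assert len(u.encode("utf-8")) == 2
```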

"es":
     Takes two parameters: encoding (const char *) and
     buffer (char **).

     The input object is first coerced to Unicode in the usual way
     and then encoded into a string using the given encoding.

     On output, a buffer of the needed size is allocated and
     returned through *buffer as a NULL-terminated string.
     The encoded data may not contain embedded NULL characters.
     The caller is responsible for calling PyMem_Free()
     to free the allocated *buffer after usage.

"es#":
     Takes three parameters: encoding (const char *),
     buffer (char **) and buffer_len (int *).

     The input object is first coerced to Unicode in the usual way
     and then encoded into a string using the given encoding.

     If *buffer is non-NULL, *buffer_len must be set to the size of
     the buffer on input. Output is then copied to *buffer.

     If *buffer is NULL, a buffer of the needed size is
     allocated and the output copied into it. *buffer is then
     updated to point to the allocated memory area.
     The caller is responsible for calling PyMem_Free()
     to free the allocated *buffer after usage.

     In both cases *buffer_len is updated to the number of
     characters written (excluding the trailing NULL-byte).
     The output buffer is assured to be NULL-terminated.

Examples:

Using "es#" with auto-allocation:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;
        int buffer_len = 0;

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        PyMem_Free(buffer);
        return str;
    }

Using "es" with auto-allocation returning a NULL-terminated string:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char *buffer = NULL;

        if (!PyArg_ParseTuple(args, "es:test_parser",
                              encoding, &buffer))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromString(buffer);
        PyMem_Free(buffer);
        return str;
    }

Using "es#" with a pre-allocated buffer:

    static PyObject *
    test_parser(PyObject *self,
                PyObject *args)
    {
        PyObject *str;
        const char *encoding = "latin-1";
        char _buffer[10];
        char *buffer = _buffer;
        int buffer_len = sizeof(_buffer);

        if (!PyArg_ParseTuple(args, "es#:test_parser",
                              encoding, &buffer, &buffer_len))
            return NULL;
        if (!buffer) {
            PyErr_SetString(PyExc_SystemError,
                            "buffer is NULL");
            return NULL;
        }
        str = PyString_FromStringAndSize(buffer, buffer_len);
        return str;
    }


File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the "s#"
argument parsing marker for binary files and "t#" for text files, the
buffer interface implementation determines the encoding to use (see
Buffer Interface).

For explicit handling of files using Unicode, the standard
stream codecs as available through the codecs module should
be used.

The codecs module should provide a short-cut open(filename, mode,
encoding) which also ensures that mode contains the 'b' character
when needed.


File/Stream Input:
------------------

Only the user knows what encoding the input data uses, so no special
magic is applied. The user will have to explicitly convert the string
data to Unicode objects as needed or use the file wrappers defined in
the codecs module (see File/Stream Output).
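The explicit-conversion route is a one-liner; a sketch using the modern
spelling, bytes.decode() (UTF-8 assumed for illustration only):

```python
# Bytes as read from a binary stream: UTF-8 encoded "abc" plus EURO SIGN.
raw = b"abc\xe2\x82\xac"

# The caller supplies the encoding; nothing is guessed.
text = raw.decode("utf-8")
assert text == u"abc\u20ac"
```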


Unicode Methods & Attributes:
-----------------------------

All Python string methods, plus:

.encode([encoding=<default encoding>][,errors="strict"])
   --> see Unicode Output

.splitlines([include_breaks=0])
   --> breaks the Unicode string into a list of (Unicode) lines;
       returns the lines with line breaks included, if include_breaks
       is true. See Line Breaks for a specification of how line breaking
       is done.
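A sketch of the two methods as they shipped (the keyword argument was
ultimately named keepends rather than include_breaks):

```python
u = u"one\ntwo\r\nthree"

# Without break characters:
assert u.splitlines() == [u"one", u"two", u"three"]

# With the breaks kept (include_breaks / keepends true):
assert u.splitlines(True) == [u"one\n", u"two\r\n", u"three"]

# .encode() with explicit encoding and error handling:
assert u"abc".encode("ascii", "strict") == b"abc"
```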


Code Base:
----------

We should use Fredrik Lundh's Unicode object implementation as the
basis. It already implements most of the string methods needed and
provides a well written code base which we can build upon.

The object sharing implemented in Fredrik's implementation should
be dropped.


Test Cases:
-----------

Test cases should follow those in Lib/test/test_string.py and include
additional checks for the Codec Registry and the Standard Codecs.


References:
-----------

Unicode Consortium:
   http://www.unicode.org/

Unicode FAQ:
   http://www.unicode.org/unicode/faq/

Unicode 3.0:
   http://www.unicode.org/unicode/standard/versions/Unicode3.0.html

Unicode-TechReports:
   http://www.unicode.org/unicode/reports/techreports.html

Unicode-Mappings:
   ftp://ftp.unicode.org/Public/MAPPINGS/

Introduction to Unicode (a little outdated but still nice to read):
   http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html

For comparison:
   Introducing Unicode to ECMAScript (aka JavaScript) --
   http://www-4.ibm.com/software/developer/library/internationalization-support.html

IANA Character Set Names:
   ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

Discussion of UTF-8 and Unicode support for POSIX and Linux:
   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Encodings:

   Overview:
      http://czyborra.com/utf/

   UCS-2:
      http://www.uazone.com/multiling/unicode/ucs2.html

   UTF-7:
      Defined in RFC 2152, e.g.
      http://www.uazone.com/multiling/ml-docs/rfc2152.txt

   UTF-8:
      Defined in RFC 2279, e.g.
      http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt

   UTF-16:
      http://www.uazone.com/multiling/unicode/wg2n1035.html

History of this Proposal:
-------------------------
1.4: Added note about mixed type comparisons and contains tests.
     Changed treating of Unicode objects in format strings (if used
     with '%s' % u they will now cause the format string to be
     coerced to Unicode, thus producing a Unicode object on return).
     Added link to IANA charset names (thanks to Lars Marius Garshol).
     Added new codec methods .readline(), .readlines() and .writelines().
1.3: Added new "es" and "es#" parser markers
1.2: Removed POD about codecs.open()
1.1: Added note about comparisons and hash values. Added note about
     case mapping algorithms. Changed stream codecs .read() and
     .write() method to match the standard file-like object methods
     (bytes consumed information is no longer returned by the methods)
1.0: changed encode Codec method to be symmetric to the decode method
     (they both return (object, data consumed) now and thus become
     interchangeable); removed __init__ method of Codec class (the
     methods are stateless) and moved the errors argument down to the
     methods; made the Codec design more generic w/r to type of input
     and output objects; changed StreamWriter.flush to StreamWriter.reset
     in order to avoid overriding the stream's .flush() method;
     renamed .breaklines() to .splitlines(); renamed the module unicodec
     to codecs; modified the File I/O section to refer to the stream codecs.
0.9: changed errors keyword argument definition; added 'replace' error
     handling; changed the codec APIs to accept buffer like objects on
     input; some minor typo fixes; added Whitespace section and
     included references for Unicode characters that have the whitespace
     and the line break characteristic; added note that search functions
     can expect lower-case encoding names; dropped slicing and offsets
     in the codec APIs
0.8: added encodings package and raw unicode escape encoding; untabified
     the proposal; added notes on Unicode format strings; added
     .breaklines() method
0.7: added a whole new set of codec APIs; added a different encoder
     lookup scheme; fixed some names
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
     a real Python string object; changed Buffer Interface to delegate
     requests to <defencstr>'s buffer interface; removed the explicit
     reference to the unicodec.codecs dictionary (the module can implement
     this in any way fit for the purpose); removed the settable default
     encoding; moved UnicodeError from unicodec to exceptions; "s#"
     now returns the internal data; passed the UCS-2/UTF-16 checking
     from the Unicode constructor to the Codecs
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
     private use encodings and Unicode character properties
0.4: added Codec interface, notes on %-formatting, changed some encoding
     details, added comments on stream wrappers, fixed some discussion
     points (most important: Internal Format), clarified the
     'unicode-escape' encoding, added encoding references
0.3: added references, comments on codec modules, the internal format,
     bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
     proposed by Tim Peters and fixed repr(u) accordingly
0.2: integrated Guido's suggestions, added stream codecs and file
     wrapping
0.1: first version


-----------------------------------------------------------------------------
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
-----------------------------------------------------------------------------