.. Source: Doc/c-api/unicode.rst, platform/external/python/cpython2,
   blob 8308304e7cdcaa89f4f15a9773eade20e2af6590

.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^


Unicode Type
""""""""""""

These are the basic Unicode object types used for the Unicode implementation in
Python:


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a
   32-bit type for :ctype:`Py_UNICODE` and store Unicode data internally as
   UCS4. On platforms where :ctype:`wchar_t` is available and compatible with
   the chosen Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef
   alias for :ctype:`wchar_t` to enhance native platform compatibility. On all
   other platforms, :ctype:`Py_UNICODE` is a typedef alias for either
   :ctype:`unsigned short` (UCS2) or :ctype:`unsigned long` (UCS4).

   Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
   this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode`` and ``types.UnicodeType``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.

   .. versionchanged:: 2.2
      Allowed subtypes to be accepted.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.

   .. versionadded:: 2.2


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object, counted in :ctype:`Py_UNICODE` units. *o* has
   to be a :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function previously returned an :ctype:`int`. Code that stores the
      result may need changes to properly support 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).

   .. versionchanged:: 2.5
      This function previously returned an :ctype:`int`. Code that stores the
      result may need changes to properly support 64-bit systems.


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList()

   Clear the free list. Return the total number of freed items.

   .. versionadded:: 2.6


Unicode Character Properties
""""""""""""""""""""""""""""

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

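Because the conversion macros signal failure with ``-1`` instead of raising an
exception, a caller can scan a buffer with a plain loop. The following is a
minimal sketch of that pattern in plain C. ``ascii_to_decimal`` and
``parse_decimal_prefix`` are hypothetical names, and ``ascii_to_decimal`` is
only an ASCII stand-in: in an extension module, ``Py_UNICODE_TODECIMAL`` would
take its place and would also recognize non-ASCII decimal digits.

```c
#include <stddef.h>

/* Hypothetical stand-in for Py_UNICODE_TODECIMAL: ASCII digits only. */
int ascii_to_decimal(unsigned int ch)
{
    return (ch >= '0' && ch <= '9') ? (int)(ch - '0') : -1;
}

/* Accumulate the leading run of decimal digits in buf[0..size).
 * Stores the value in *value and returns the number of units consumed.
 * Overflow is not checked in this sketch. */
size_t parse_decimal_prefix(const unsigned int *buf, size_t size,
                            unsigned long *value)
{
    size_t i = 0;

    *value = 0;
    while (i < size) {
        int d = ascii_to_decimal(buf[i]);
        if (d < 0)      /* -1 means "not a decimal"; no exception is set */
            break;
        *value = *value * 10 + (unsigned long)d;
        i++;
    }
    return i;
}
```

The loop stops at the first character that is not a decimal, so the return
value doubles as the offset at which parsing should resume.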

Plain Py_UNICODE
""""""""""""""""

To create Unicode objects and access their basic sequence properties, use these
APIs:


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL*, which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL*, which causes the contents to be
   undefined. It is the user's responsibility to fill in the needed data. The
   buffer is copied into the new object. If the buffer is not *NULL*, the return
   value might be a shared object. Therefore, modification of the resulting
   Unicode object is only allowed when *u* is *NULL*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style *format* string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | n/a                 | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.

   .. versionadded:: 2.6


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal
   :ctype:`Py_UNICODE` buffer, or *NULL* if *unicode* is not a Unicode object.
   Note that the resulting :ctype:`Py_UNICODE*` string may contain embedded
   null characters, which would cause the string to be truncated when used in
   most C functions.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.

   .. versionchanged:: 2.5
      This function previously returned an :ctype:`int`. Code that stores the
      result may need changes to properly support 64-bit systems.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given encoding and using the error handling defined by errors. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.


wchar_t Support
"""""""""""""""

:ctype:`wchar_t` support for platforms which support it:

.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given *size*.
   Return *NULL* on failure.

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application. Also, note that the :ctype:`wchar_t*` string
   might contain null characters, which would cause the string to be truncated
   when used with most C functions.

   .. versionchanged:: 2.5
      This function previously returned an :ctype:`int` and used an
      :ctype:`int` type for *size*. Callers may need changes to properly
      support 64-bit systems.

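Since the copy may fill the buffer without writing a terminator, a common
caller-side pattern is to reserve one slot and terminate explicitly. The sketch
below shows this in plain C. ``copy_wide`` is a hypothetical stand-in that only
mimics the copy semantics described above (it is not the CPython
implementation), and ``copy_wide_terminated`` is an equally hypothetical
wrapper name.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in with the copy semantics described above: copies at
 * most `size` characters and adds no terminator when the destination is
 * filled completely. Returns the number of characters copied. */
ptrdiff_t copy_wide(const wchar_t *src, size_t srclen,
                    wchar_t *dst, size_t size)
{
    size_t n = srclen < size ? srclen : size;
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
    if (n < size)
        dst[n] = L'\0';     /* room left over: terminate */
    return (ptrdiff_t)n;
}

/* Caller-side pattern: reserve one slot, then terminate unconditionally. */
ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t srclen,
                               wchar_t *dst, size_t capacity)
{
    ptrdiff_t n;

    if (capacity == 0)
        return -1;
    n = copy_wide(src, srclen, dst, capacity - 1);
    dst[n] = L'\0';
    return n;
}
```

The explicit terminator does not help with embedded null characters; code that
must preserve them should carry the returned length alongside the buffer.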

.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of built-in codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*, which
have the same semantics as the parameters of the same name of the built-in
:func:`unicode` Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string, on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to
use the default handling defined for the codec. Default error handling for all
built-in codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.


Generic Codecs
""""""""""""""

These are the generic codec APIs:


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` built-in function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer *s* of the given *size* and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.


UTF-8 Codecs
""""""""""""

These are the UTF-8 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.4

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.

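To make the role of *consumed* concrete, the following plain C sketch computes
how many bytes of a chunk form complete UTF-8 sequences, which is the kind of
value a stateful decoder reports when the chunk ends in the middle of a
character. ``utf8_complete_prefix`` is a hypothetical helper written for this
illustration, not the CPython implementation; it assumes the data before the
final sequence is well formed and leaves malformed input for a real decoder to
report.

```c
#include <stddef.h>

/* Return the number of leading bytes of s[0..size) that belong to complete
 * UTF-8 sequences. Only the final sequence is examined. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = size;        /* index one past the candidate lead byte */
    size_t trailing = 0;    /* continuation bytes seen at the end */
    size_t need;
    unsigned char lead;

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && trailing < 3 && (s[i - 1] & 0xC0) == 0x80) {
        i--;
        trailing++;
    }
    if (i == 0)
        return size;        /* no lead byte: leave it to the decoder */

    lead = s[i - 1];
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return size;        /* invalid lead byte: leave it to the decoder */

    /* If the last sequence is short, stop just before its lead byte. */
    return (trailing + 1 < need) ? i - 1 : size;
}
```

A caller reading from a stream would keep the bytes after the returned offset
and prepend them to the next chunk before decoding again.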

.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer *s* of the given *size* using UTF-8 and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.

   .. versionchanged:: 2.5
      This function previously used an :ctype:`int` type for *size*. Callers
      may need changes to properly support 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as a Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.


UTF-32 Codecs
"""""""""""""

These are the UTF-32 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first four bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output.

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   In a narrow build, code points outside the BMP will be decoded as surrogate
   pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

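The BOM rule above can be sketched in a few lines of plain C. The helper
``utf32_sniff_bom`` is a hypothetical name used only for this illustration; it
covers just the BOM check, not the decoding itself, and it leaves
``*byteorder`` untouched when no BOM is found.

```c
#include <stddef.h>

/* Inspect the first four bytes for a UTF-32 BOM, but only when the caller
 * asked for native order (*byteorder == 0). Returns the number of bytes to
 * skip (4 if a BOM was recognized and should be dropped, otherwise 0) and
 * sets *byteorder to -1 (little endian) or 1 (big endian) on a match. */
size_t utf32_sniff_bom(const unsigned char *s, size_t size, int *byteorder)
{
    if (*byteorder != 0 || size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;
        return 4;
    }
    return 0;
}
```

When the caller passes ``-1`` or ``1``, the function returns 0 and the four
BOM bytes stay in the input, which matches the statement that the mark is then
copied to the output.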

.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.

   .. versionadded:: 2.6


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If byteorder is ``0``, the output string will always start with the Unicode
   BOM (U+FEFF). In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

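Two details of the encoder can be illustrated in plain C: how a surrogate pair
from a narrow build maps to a single code point, and how the explicit byte
orders differ on output. The function names ``combine_surrogates`` and
``put_utf32`` are hypothetical and chosen for this sketch; native order
(``0``) and BOM writing are left out to keep it short.

```c
#include <stdint.h>

/* Combine a UTF-16 surrogate pair (high: D800..DBFF, low: DC00..DFFF)
 * into the single code point it represents. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10)
                    + ((uint32_t)low - 0xDC00u);
}

/* Write one code point as four bytes.
 * byteorder is -1 (little endian) or 1 (big endian). */
void put_utf32(unsigned char out[4], uint32_t cp, int byteorder)
{
    int i;

    for (i = 0; i < 4; i++) {
        int shift = (byteorder == -1) ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)((cp >> shift) & 0xFF);
    }
}
```

For example, the pair ``0xD83D``, ``0xDE00`` combines to U+1F600, which is
written as ``00 F6 01 00`` in little endian order and ``00 01 F6 00`` in big
endian order.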

.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

   .. versionadded:: 2.6

				596
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	597	UTF-16 Codecs
				598	"""""""""""""
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	599
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	600	These are the UTF-16 codec APIs:
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	601
				602
				603	.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char s, Py_ssize_t size, const char errors, int *byteorder)
				604
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	605	Decode size bytes from a UTF-16 encoded buffer string and return the
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	606	corresponding Unicode object. errors (if non-NULL) defines the error
				607	handling. It defaults to "strict".
				608
				609	If byteorder is non-NULL, the decoder starts decoding using the given byte
				610	order::
				611
				612	*byteorder == -1: little endian
				613	*byteorder == 0: native order
				614	*byteorder == 1: big endian
				615
Georg Brandl	579a358	2009-09-18 21:35:59 +0000	[diff] [blame]	616	If ``*byteorder`` is zero, and the first two bytes of the input data are a
				617	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				618	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				619	``1``, any byte order mark is copied to the output (where it will result in
				620	either a ``\ufeff`` or a ``\ufffe`` character).
				621
After completion, ``*byteorder`` is set to the current byte order at the end
of input data.
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	624
				625	If byteorder is NULL, the codec starts in native order mode.
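
The byte order rules above can be illustrated with a small standalone sketch. This is plain C and not CPython's implementation; the helper name ``utf16_pick_byteorder`` is hypothetical and only models the documented BOM handling for a caller-supplied ``*byteorder``.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Models the documented rule: when *byteorder is 0 and the input starts
 * with a BOM, switch to the BOM's byte order and skip the BOM. When
 * *byteorder is -1 or 1, any BOM is left in place as ordinary data.
 * Returns the number of leading bytes to skip (0 or 2). */
size_t utf16_pick_byteorder(const unsigned char *s, size_t size,
                            int *byteorder)
{
    assert(s != NULL && byteorder != NULL);
    if (*byteorder == 0 && size >= 2) {
        if (s[0] == 0xFF && s[1] == 0xFE) {  /* U+FEFF stored little endian */
            *byteorder = -1;
            return 2;
        }
        if (s[0] == 0xFE && s[1] == 0xFF) {  /* U+FEFF stored big endian */
            *byteorder = 1;
            return 2;
        }
    }
    return 0;  /* nothing skipped; the requested order stays in effect */
}
```

With ``*byteorder`` preset to ``1`` the same input is left untouched, so a leading ``FF FE`` would be decoded as the character ``\ufffe`` rather than consumed as a BOM.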
				626
				627	Return NULL if an exception was raised by the codec.
				628
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	629	.. versionchanged:: 2.5
				630	This function used an :ctype:`int` type for size. This might require
				631	changes in your code for properly supporting 64-bit systems.
				632
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	633
.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				635
				636	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
				637	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
				638	trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
				639	split surrogate pair) as an error. Those bytes will not be decoded and the
				640	number of bytes that have been decoded will be stored in consumed.
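
As a rough illustration of what "trailing incomplete sequences" means, the following standalone sketch computes how many leading bytes of a buffer could be decoded now. It is an assumption-laden model, not the codec's actual code: the name ``utf16_complete_prefix`` is hypothetical, and it only inspects the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Returns how many leading bytes form complete UTF-16 code units, holding
 * back a dangling odd byte and a trailing high surrogate whose partner has
 * not arrived yet. byteorder: -1 = little endian, 1 = big endian. */
size_t utf16_complete_prefix(const unsigned char *s, size_t size,
                             int byteorder)
{
    size_t usable = size - (size % 2);  /* drop a dangling odd byte */

    assert(s != NULL || size == 0);
    assert(byteorder == -1 || byteorder == 1);
    if (usable >= 2) {
        const unsigned char *p = s + usable - 2;
        unsigned int unit = (byteorder == -1)
            ? ((unsigned int)p[0] | ((unsigned int)p[1] << 8))
            : (((unsigned int)p[0] << 8) | (unsigned int)p[1]);
        if (unit >= 0xD800u && unit <= 0xDBFFu)
            usable -= 2;  /* split surrogate pair: wait for the low half */
    }
    return usable;
}
```

A streaming caller would decode that many bytes and prepend the remainder to the next chunk, which is the role the consumed output plays in the stateful API.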
				641
				642	.. versionadded:: 2.4
				643
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	644	.. versionchanged:: 2.5
				645	This function used an :ctype:`int` type for size and an :ctype:`int *`
				646	type for consumed. This might require changes in your code for
				647	properly supporting 64-bit systems.
				648
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	649
.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				651
				652	Return a Python string object holding the UTF-16 encoded value of the Unicode
Georg Brandl	579a358	2009-09-18 21:35:59 +0000	[diff] [blame]	653	data in s. Output is written according to the following byte order::
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	654
				655	byteorder == -1: little endian
				656	byteorder == 0: native byte order (writes a BOM mark)
				657	byteorder == 1: big endian
				658
				659	If byteorder is ``0``, the output string will always start with the Unicode BOM
				660	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				661
If Py_UNICODE_WIDE is defined, a single :ctype:`Py_UNICODE` value may get
represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
value is interpreted as a UCS-2 character.
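
To make the surrogate pair case concrete, here is a minimal standalone sketch of how one code point becomes one or two UTF-16 code units in a given byte order. The function ``utf16_put`` is a hypothetical name used for illustration; it follows the standard UTF-16 arithmetic and is not taken from the CPython sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Writes one code point as UTF-16 bytes into out (room for 4 bytes).
 * byteorder: -1 = little endian, 1 = big endian.
 * Returns the number of bytes written (2 or 4). */
size_t utf16_put(unsigned long cp, int byteorder, unsigned char *out)
{
    unsigned int units[2];
    size_t n, i;

    assert(out != NULL);
    assert(byteorder == -1 || byteorder == 1);
    assert(cp <= 0x10FFFFUL);

    if (cp >= 0x10000UL) {
        cp -= 0x10000UL;
        units[0] = 0xD800u | (unsigned int)(cp >> 10);     /* high surrogate */
        units[1] = 0xDC00u | (unsigned int)(cp & 0x3FFu);  /* low surrogate */
        n = 2;
    } else {
        units[0] = (unsigned int)cp;
        n = 1;
    }
    for (i = 0; i < n; i++) {
        unsigned char hi = (unsigned char)(units[i] >> 8);
        unsigned char lo = (unsigned char)(units[i] & 0xFFu);
        out[2 * i]     = (byteorder == -1) ? lo : hi;
        out[2 * i + 1] = (byteorder == -1) ? hi : lo;
    }
    return 2 * n;
}
```

For example, U+1F600 yields the units ``0xD83D 0xDE00``, stored as ``3D D8 00 DE`` in little endian order. The sketch writes no BOM, which corresponds to the ``-1`` and ``1`` modes described above.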
				665
				666	Return NULL if an exception was raised by the codec.
				667
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	668	.. versionchanged:: 2.5
				669	This function used an :ctype:`int` type for size. This might require
				670	changes in your code for properly supporting 64-bit systems.
				671
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	672
				673	.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
				674
				675	Return a Python string using the UTF-16 encoding in native byte order. The
				676	string always starts with a BOM mark. Error handling is "strict". Return
				677	NULL if an exception was raised by the codec.
				678
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	679
Georg Brandl	7d4bfb3	2010-08-02 21:44:25 +0000	[diff] [blame]	680	UTF-7 Codecs
				681	""""""""""""
				682
				683	These are the UTF-7 codec APIs:
				684
				685
.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
				687
				688	Create a Unicode object by decoding size bytes of the UTF-7 encoded string
				689	s. Return NULL if an exception was raised by the codec.
				690
				691
.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
Georg Brandl	7d4bfb3	2010-08-02 21:44:25 +0000	[diff] [blame]	693
				694	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF7`. If
				695	consumed is not NULL, trailing incomplete UTF-7 base-64 sections will not
				696	be treated as an error. Those bytes will not be decoded and the number of
				697	bytes that have been decoded will be stored in consumed.
				698
				699
.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)
				701
				702	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-7 and
				703	return a Python bytes object. Return NULL if an exception was raised by
				704	the codec.
				705
				706	If base64SetO is nonzero, "Set O" (punctuation that has no otherwise
				707	special meaning) will be encoded in base-64. If base64WhiteSpace is
				708	nonzero, whitespace will be encoded in base-64. Both are set to zero for the
				709	Python "utf-7" codec.
				710
				711
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	712	Unicode-Escape Codecs
				713	"""""""""""""""""""""
				714
				715	These are the "Unicode Escape" codec APIs:
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	716
				717
.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				719
				720	Create a Unicode object by decoding size bytes of the Unicode-Escape encoded
				721	string s. Return NULL if an exception was raised by the codec.
				722
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	723	.. versionchanged:: 2.5
				724	This function used an :ctype:`int` type for size. This might require
				725	changes in your code for properly supporting 64-bit systems.
				726
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	727
				728	.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
				729
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	730	Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	731	return a Python string object. Return NULL if an exception was raised by the
				732	codec.
				733
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	734	.. versionchanged:: 2.5
				735	This function used an :ctype:`int` type for size. This might require
				736	changes in your code for properly supporting 64-bit systems.
				737
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	738
				739	.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
				740
				741	Encode a Unicode object using Unicode-Escape and return the result as Python
				742	string object. Error handling is "strict". Return NULL if an exception was
				743	raised by the codec.
				744
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	745
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	746	Raw-Unicode-Escape Codecs
				747	"""""""""""""""""""""""""
				748
				749	These are the "Raw Unicode Escape" codec APIs:
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	750
				751
.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				753
				754	Create a Unicode object by decoding size bytes of the Raw-Unicode-Escape
				755	encoded string s. Return NULL if an exception was raised by the codec.
				756
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	757	.. versionchanged:: 2.5
				758	This function used an :ctype:`int` type for size. This might require
				759	changes in your code for properly supporting 64-bit systems.
				760
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	761
.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				763
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	764	Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	765	and return a Python string object. Return NULL if an exception was raised by
				766	the codec.
				767
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	768	.. versionchanged:: 2.5
				769	This function used an :ctype:`int` type for size. This might require
				770	changes in your code for properly supporting 64-bit systems.
				771
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	772
				773	.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
				774
				775	Encode a Unicode object using Raw-Unicode-Escape and return the result as
				776	Python string object. Error handling is "strict". Return NULL if an exception
				777	was raised by the codec.
				778
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	779
				780	Latin-1 Codecs
				781	""""""""""""""
				782
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	783	These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
				784	ordinals and only these are accepted by the codecs during encoding.
				785
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	786
.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
				788
				789	Create a Unicode object by decoding size bytes of the Latin-1 encoded string
				790	s. Return NULL if an exception was raised by the codec.
				791
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	792	.. versionchanged:: 2.5
				793	This function used an :ctype:`int` type for size. This might require
				794	changes in your code for properly supporting 64-bit systems.
				795
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	796
.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				798
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	799	Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	800	a Python string object. Return NULL if an exception was raised by the codec.
				801
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	802	.. versionchanged:: 2.5
				803	This function used an :ctype:`int` type for size. This might require
				804	changes in your code for properly supporting 64-bit systems.
				805
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	806
				807	.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
				808
				809	Encode a Unicode object using Latin-1 and return the result as Python string
				810	object. Error handling is "strict". Return NULL if an exception was raised
				811	by the codec.
				812
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	813
				814	ASCII Codecs
				815	""""""""""""
				816
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	817	These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
				818	codes generate errors.
				819
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	820
.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
				822
				823	Create a Unicode object by decoding size bytes of the ASCII encoded string
				824	s. Return NULL if an exception was raised by the codec.
				825
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	826	.. versionchanged:: 2.5
				827	This function used an :ctype:`int` type for size. This might require
				828	changes in your code for properly supporting 64-bit systems.
				829
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	830
.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				832
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	833	Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	834	Python string object. Return NULL if an exception was raised by the codec.
				835
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	836	.. versionchanged:: 2.5
				837	This function used an :ctype:`int` type for size. This might require
				838	changes in your code for properly supporting 64-bit systems.
				839
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	840
				841	.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
				842
				843	Encode a Unicode object using ASCII and return the result as Python string
				844	object. Error handling is "strict". Return NULL if an exception was raised
				845	by the codec.
				846
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	847
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	848	Character Map Codecs
				849	""""""""""""""""""""
				850
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	851	This codec is special in that it can be used to implement many different codecs
				852	(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses a mapping to encode and
decode characters.
				855
				856	Decoding mappings must map single string characters to single Unicode
				857	characters, integers (which are then interpreted as Unicode ordinals) or None
				858	(meaning "undefined mapping" and causing an error).
				859
				860	Encoding mappings must map single Unicode characters to single string
				861	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				862	(meaning "undefined mapping" and causing an error).
				863
				864	The mapping objects provided must only support the __getitem__ mapping
				865	interface.
				866
If a character lookup fails with a LookupError, the character is copied as-is,
meaning that its ordinal value is interpreted as a Unicode ordinal (when
decoding) or a Latin-1 ordinal (when encoding). Because of this, mappings only
need to contain entries for characters that map to different code points.
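
The "copy as-is on a failed lookup" rule can be sketched in standalone C as follows. This is only a model of the decoding direction: ``struct map_entry`` and ``charmap_decode`` are hypothetical names, the mapping is a plain array rather than a Python object, and the None ("undefined mapping") case is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sparse mapping entry: one byte value -> one Unicode ordinal. */
struct map_entry {
    unsigned char from;
    unsigned long to;
};

/* Hypothetical helper, not part of the Python C API.
 * Decodes size bytes into out (room for size ordinals). A byte with no
 * entry in the mapping is copied as-is, i.e. its value is taken as the
 * Unicode ordinal. Returns the number of ordinals written. */
size_t charmap_decode(const unsigned char *s, size_t size,
                      const struct map_entry *map, size_t map_len,
                      unsigned long *out)
{
    size_t i, j;

    assert(out != NULL);
    for (i = 0; i < size; i++) {
        out[i] = s[i];  /* lookup miss: ordinal copied unchanged */
        for (j = 0; j < map_len; j++) {
            if (map[j].from == s[i]) {
                out[i] = map[j].to;
                break;
            }
        }
    }
    return size;
}
```

A mapping with the single entry ``0x80 -> 0x20AC`` is therefore enough to decode a buffer that is otherwise Latin-1 compatible; all other bytes pass through unchanged.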
				871
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	872	These are the mapping codec APIs:
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	873
.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				875
				876	Create a Unicode object by decoding size bytes of the encoded string s using
				877	the given mapping object. Return NULL if an exception was raised by the
codec. If mapping is NULL, Latin-1 decoding will be done. Otherwise it can be a
dictionary mapping byte values or a Unicode string, which is treated as a lookup
table. Byte values greater than the length of the string and U+FFFE "characters"
are treated as "undefined mapping".
				882
				883	.. versionchanged:: 2.4
				884	Allowed unicode string as mapping argument.
				885
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	886	.. versionchanged:: 2.5
				887	This function used an :ctype:`int` type for size. This might require
				888	changes in your code for properly supporting 64-bit systems.
				889
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	890
.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				892
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	893	Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	894	mapping object and return a Python string object. Return NULL if an
				895	exception was raised by the codec.
				896
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	897	.. versionchanged:: 2.5
				898	This function used an :ctype:`int` type for size. This might require
				899	changes in your code for properly supporting 64-bit systems.
				900
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	901
.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				903
				904	Encode a Unicode object using the given mapping object and return the result
				905	as Python string object. Error handling is "strict". Return NULL if an
				906	exception was raised by the codec.
				907
The following codec API is special in that it maps Unicode to Unicode.
				909
				910
.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				912
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	913	Translate a :ctype:`Py_UNICODE` buffer of the given size by applying a
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	914	character mapping table to it and return the resulting Unicode object. Return
				915	NULL when an exception was raised by the codec.
				916
				917	The mapping table must map Unicode ordinal integers to Unicode ordinal
				918	integers or None (causing deletion of the character).
				919
				920	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				921	and sequences work well. Unmapped character ordinals (ones which cause a
				922	:exc:`LookupError`) are left untouched and are copied as-is.
				923
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	924	.. versionchanged:: 2.5
				925	This function used an :ctype:`int` type for size. This might require
				926	changes in your code for properly supporting 64-bit systems.
				927
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	928
				929	MBCS codecs for Windows
				930	"""""""""""""""""""""""
				931
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	932	These are the MBCS codec APIs. They are currently only available on Windows and
				933	use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
				934	DBCS) is a class of encodings, not just one. The target encoding is defined by
				935	the user settings on the machine running the codec.
				936
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	937
.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				939
				940	Create a Unicode object by decoding size bytes of the MBCS encoded string s.
				941	Return NULL if an exception was raised by the codec.
				942
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	943	.. versionchanged:: 2.5
				944	This function used an :ctype:`int` type for size. This might require
				945	changes in your code for properly supporting 64-bit systems.
				946
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	947
.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				949
				950	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
consumed is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
a trailing lead byte, and the number of bytes that have been decoded will be
stored in consumed.
				954
				955	.. versionadded:: 2.5
				956
				957
.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				959
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	960	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	961	Python string object. Return NULL if an exception was raised by the codec.
				962
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	963	.. versionchanged:: 2.5
				964	This function used an :ctype:`int` type for size. This might require
				965	changes in your code for properly supporting 64-bit systems.
				966
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	967
				968	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				969
				970	Encode a Unicode object using MBCS and return the result as Python string
				971	object. Error handling is "strict". Return NULL if an exception was raised
				972	by the codec.
				973
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	974
Victor Stinner	5f8aae0	2010-05-14 15:53:20 +0000	[diff] [blame]	975	Methods & Slots
				976	"""""""""""""""
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	977
				978	.. _unicodemethodsandslots:
				979
				980	Methods and Slot Functions
				981	^^^^^^^^^^^^^^^^^^^^^^^^^^
				982
				983	The following APIs are capable of handling Unicode objects and strings on input
				984	(we refer to them as strings in the descriptions) and return Unicode objects or
				985	integers as appropriate.
				986
				987	They all return NULL or ``-1`` if an exception occurs.
				988
				989
.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

Concatenate two strings, giving a new Unicode string.
				993
				994
.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				996
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	997	Split a string giving a list of Unicode strings. If sep is NULL, splitting
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	998	will be done at all whitespace substrings. Otherwise, splits occur at the given
				999	separator. At most maxsplit splits will be done. If negative, no limit is
				1000	set. Separators are not included in the resulting list.
				1001
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	1002	.. versionchanged:: 2.5
				1003	This function used an :ctype:`int` type for maxsplit. This might require
				1004	changes in your code for properly supporting 64-bit systems.
				1005
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1006
				1007	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				1008
				1009	Split a Unicode string at line breaks, returning a list of Unicode strings.
CRLF is considered to be one line break. If keepend is 0, the line break
characters are not included in the resulting strings.
				1012
				1013
.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				1015
				1016	Translate a string by applying a character mapping table to it and return the
				1017	resulting Unicode object.
				1018
				1019	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				1020	or None (causing deletion of the character).
				1021
				1022	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				1023	and sequences work well. Unmapped character ordinals (ones which cause a
				1024	:exc:`LookupError`) are left untouched and are copied as-is.
				1025
				1026	errors has the usual meaning for codecs. It may be NULL which indicates to
				1027	use the default error handling.
				1028
				1029
.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				1031
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	1032	Join a sequence of strings using the given separator and return the resulting
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1033	Unicode string.
				1034
				1035
.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				1037
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	1038	Return 1 if substr matches ``str[start:end]`` at the given tail end
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1039	(direction == -1 means to do a prefix match, direction == 1 a suffix match),
				1040	0 otherwise. Return ``-1`` if an error occurred.
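
The prefix and suffix semantics can be modelled on plain C strings. The sketch below is a simplified illustration, not the Unicode implementation: ``tailmatch`` is a hypothetical name, it works on bytes rather than :ctype:`Py_UNICODE`, it assumes non-negative indices, and it has no error return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * Returns 1 if substr matches the start (direction <= 0) or the end
 * (direction > 0) of str[start:end], and 0 otherwise. */
int tailmatch(const char *str, const char *substr,
              size_t start, size_t end, int direction)
{
    size_t len, sublen;

    assert(str != NULL && substr != NULL);
    len = strlen(str);
    sublen = strlen(substr);
    if (end > len)
        end = len;  /* clamp the slice to the string */
    if (start > end || sublen > end - start)
        return 0;   /* slice too short to contain substr */
    if (direction > 0)  /* suffix match */
        return memcmp(str + end - sublen, substr, sublen) == 0;
    return memcmp(str + start, substr, sublen) == 0;  /* prefix match */
}
```

For instance, ``tailmatch("hello.py", ".py", 0, 8, 1)`` is a suffix test on the whole string and yields 1, while narrowing the slice to ``[0:5]`` yields 0.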
				1041
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	1042	.. versionchanged:: 2.5
				1043	This function used an :ctype:`int` type for start and end. This
				1044	might require changes in your code for properly supporting 64-bit
				1045	systems.
				1046
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1047
.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				1049
Ezio Melotti	020f650	2011-04-14 07:39:06 +0300	[diff] [blame]	1050	Return the first position of substr in ``str[start:end]`` using the given
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1051	direction (direction == 1 means to do a forward search, direction == -1 a
				1052	backward search). The return value is the index of the first match; a value of
				1053	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				1054	occurred and an exception has been set.
				1055
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	1056	.. versionchanged:: 2.5
				1057	This function used an :ctype:`int` type for start and end. This
				1058	might require changes in your code for properly supporting 64-bit
				1059	systems.
				1060
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1061
.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				1063
				1064	Return the number of non-overlapping occurrences of substr in
				1065	``str[start:end]``. Return ``-1`` if an error occurred.
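
"Non-overlapping" means that a match consumes its characters before the search resumes. A minimal standalone sketch on C strings, under the assumptions that indices are non-negative and substr is non-empty (``count_nonoverlapping`` is a hypothetical name, not an API function):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * Counts non-overlapping occurrences of substr in str[start:end].
 * Assumes substr is not empty. */
size_t count_nonoverlapping(const char *str, const char *substr,
                            size_t start, size_t end)
{
    size_t len, sublen, i, count = 0;

    assert(str != NULL && substr != NULL);
    len = strlen(str);
    sublen = strlen(substr);
    assert(sublen > 0);
    if (end > len)
        end = len;  /* clamp the slice to the string */
    i = start;
    while (i <= end && sublen <= end - i) {
        if (memcmp(str + i, substr, sublen) == 0) {
            count++;
            i += sublen;  /* skip past the match so matches cannot overlap */
        } else {
            i++;
        }
    }
    return count;
}
```

Counting ``"aa"`` in ``"aaaa"`` therefore gives 2, not 3.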
				1066
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	1067	.. versionchanged:: 2.5
				1068	This function returned an :ctype:`int` type and used an :ctype:`int`
				1069	type for start and end. This might require changes in your code for
				1070	properly supporting 64-bit systems.
				1071
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1072
.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				1074
				1075	Replace at most maxcount occurrences of substr in str with replstr and
				1076	return the resulting Unicode object. maxcount == -1 means replace all
				1077	occurrences.
				1078
Jeroen Ruigrok van der Werven	dfcffd4	2009-04-25 21:16:05 +0000	[diff] [blame]	1079	.. versionchanged:: 2.5
				1080	This function used an :ctype:`int` type for maxcount. This might
				1081	require changes in your code for properly supporting 64-bit systems.
				1082
Georg Brandl	f684272	2008-01-19 22:08:21 +0000	[diff] [blame]	1083
.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				1085
				1086	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				1087	respectively.
				1088
				1089
.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				1091
				1092	Rich compare two unicode strings and return one of the following:
				1093
				1094	* ``NULL`` in case an exception was raised
				1095	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				1096	* :const:`Py_NotImplemented` in case the type combination is unknown
				1097
				1098	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				1099	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				1100	with a :exc:`UnicodeDecodeError`.
				1101
				1102	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				1103	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				1104
				1105
.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				1107
				1108	Return a new string object from format and args; this is analogous to
				1109	``format % args``. The args argument must be a tuple.
				1110
				1111
.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				1113
				1114	Check whether element is contained in container and return true or false
				1115	accordingly.
				1116
				1117	element has to coerce to a one element Unicode string. ``-1`` is returned if
				1118	there was an error.