Blame - Doc/c-api/unicode.rst - platform/external/python/cpython2


Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1	.. highlightlang:: c
				2
				3	.. _unicodeobjects:
				4
				5	Unicode Objects and Codecs
				6	--------------------------
				7
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9
				10	Unicode Objects
				11	^^^^^^^^^^^^^^^
				12
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	13	Unicode Type
				14	""""""""""""
				15
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	16	These are the basic Unicode object types used for the Unicode implementation in
				17	Python:
				18
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	19
				20	.. ctype:: Py_UNICODE
				21
				22	This type represents the storage type which is used by Python internally as
				23	basis for holding Unicode ordinals. Python's default builds use a 16-bit type
				24	for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
				25	possible to build a UCS4 version of Python (most recent Linux distributions come
				26	with UCS4 builds of Python). These builds then use a 32-bit type for
				27	:ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
				28	where :ctype:`wchar_t` is available and compatible with the chosen Python
				29	Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
				30	:ctype:`wchar_t` to enhance native platform compatibility. On all other
				31	platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
				32	short` (UCS2) or :ctype:`unsigned long` (UCS4).
				33
				34	Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
				35	this in mind when writing extensions or interfaces.
				36
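The difference between the two build variants can be illustrated without any
CPython headers. The following is a minimal sketch, not CPython code: the
helper names ``units16_needed`` and ``to_surrogates`` are hypothetical and only
show how a code point above U+FFFF occupies two 16-bit storage units (a
surrogate pair) where a 32-bit storage type needs one.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical helper: number of 16-bit units needed for one code point. */
   size_t units16_needed(uint32_t cp)
   {
       return cp > 0xFFFF ? 2 : 1;
   }

   /* Hypothetical helper: split a code point above U+FFFF into a
      high/low surrogate pair, as 16-bit storage requires. */
   void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
   {
       assert(cp > 0xFFFF && cp <= 0x10FFFF);
       cp -= 0x10000;
       *high = (uint16_t)(0xD800 + (cp >> 10));
       *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));
   }

Because the same text can have a different number of storage units in each
build, code that indexes or sizes buffers by unit count cannot be shared
between the two variants without recompilation.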
				37
				38	.. ctype:: PyUnicodeObject
				39
				40	This subtype of :ctype:`PyObject` represents a Python Unicode object.
				41
				42
				43	.. cvar:: PyTypeObject PyUnicode_Type
				44
				45	This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
				46	is exposed to Python code as ``str``.
				47
				48	The following APIs are really C macros and can be used to do fast checks and to
				49	access internal read-only data of Unicode objects:
				50
				51
				52	.. cfunction:: int PyUnicode_Check(PyObject *o)
				53
				54	Return true if the object o is a Unicode object or an instance of a Unicode
				55	subtype.
				56
				57
				58	.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
				59
				60	Return true if the object o is a Unicode object, but not an instance of a
				61	subtype.
				62
				63
				64	.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
				65
				66	Return the size of the object. o has to be a :ctype:`PyUnicodeObject` (not
				67	checked).
				68
				69
				70	.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
				71
				72	Return the size of the object's internal buffer in bytes. o has to be a
				73	:ctype:`PyUnicodeObject` (not checked).
				74
				75
				76	.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
				77
				78	Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. o
				79	has to be a :ctype:`PyUnicodeObject` (not checked).
				80
				81
				82	.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
				83
				84	Return a pointer to the internal buffer of the object. o has to be a
				85	:ctype:`PyUnicodeObject` (not checked).
				86
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	87
Alexandre Vassalotti	6d3dfc3	2009-07-29 19:54:39 +0000	[diff] [blame]	88	.. cfunction:: int PyUnicode_ClearFreeList()
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	89
				90	Clear the free list. Return the total number of freed items.
				91
Alexandre Vassalotti	6d3dfc3	2009-07-29 19:54:39 +0000	[diff] [blame]	92
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	93	Unicode Character Properties
				94	""""""""""""""""""""""""""""
				95
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	96	Unicode provides many different character properties. The most often needed ones
				97	are available through these macros which are mapped to C functions depending on
				98	the Python configuration.
				99
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	100
				101	.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
				102
				103	Return 1 or 0 depending on whether ch is a whitespace character.
				104
				105
				106	.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
				107
				108	Return 1 or 0 depending on whether ch is a lowercase character.
				109
				110
				111	.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
				112
				113	Return 1 or 0 depending on whether ch is an uppercase character.
				114
				115
				116	.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
				117
				118	Return 1 or 0 depending on whether ch is a titlecase character.
				119
				120
				121	.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
				122
				123	Return 1 or 0 depending on whether ch is a linebreak character.
				124
				125
				126	.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
				127
				128	Return 1 or 0 depending on whether ch is a decimal character.
				129
				130
				131	.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
				132
				133	Return 1 or 0 depending on whether ch is a digit character.
				134
				135
				136	.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
				137
				138	Return 1 or 0 depending on whether ch is a numeric character.
				139
				140
				141	.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
				142
				143	Return 1 or 0 depending on whether ch is an alphabetic character.
				144
				145
				146	.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
				147
				148	Return 1 or 0 depending on whether ch is an alphanumeric character.
				149
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	150
				151	.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
				152
				153	Return 1 or 0 depending on whether ch is a printable character.
				154	Nonprintable characters are those characters defined in the Unicode character
				155	database as "Other" or "Separator", excepting the ASCII space (0x20) which is
				156	considered printable. (Note that printable characters in this context are
				157	those which should not be escaped when :func:`repr` is invoked on a string.
				158	It has no bearing on the handling of strings written to :data:`sys.stdout` or
				159	:data:`sys.stderr`.)
				160
				161
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	162	These APIs can be used for fast direct character conversions:
				163
				164
				165	.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
				166
				167	Return the character ch converted to lower case.
				168
				169
				170	.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
				171
				172	Return the character ch converted to upper case.
				173
				174
				175	.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
				176
				177	Return the character ch converted to title case.
				178
				179
				180	.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
				181
				182	Return the character ch converted to a decimal positive integer. Return
				183	``-1`` if this is not possible. This macro does not raise exceptions.
				184
				185
				186	.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
				187
				188	Return the character ch converted to a single digit integer. Return ``-1`` if
				189	this is not possible. This macro does not raise exceptions.
				190
				191
				192	.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
				193
				194	Return the character ch converted to a double. Return ``-1.0`` if this is not
				195	possible. This macro does not raise exceptions.
				196
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	197
				198	Plain Py_UNICODE
				199	""""""""""""""""
				200
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	201	To create Unicode objects and access their basic sequence properties, use these
				202	APIs:
				203
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	204
				205	.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
				206
				207	Create a Unicode Object from the Py_UNICODE buffer u of the given size. u
				208	may be NULL which causes the contents to be undefined. It is the user's
				209	responsibility to fill in the needed data. The buffer is copied into the new
				210	object. If the buffer is not NULL, the return value might be a shared object.
				211	Therefore, modification of the resulting Unicode object is only allowed when u
				212	is NULL.
				213
				214
				215	.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
				216
				217	Create a Unicode Object from the char buffer u. The bytes will be interpreted
				218	as being UTF-8 encoded. u may also be NULL which
				219	causes the contents to be undefined. It is the user's responsibility to fill in
				220	the needed data. The buffer is copied into the new object. If the buffer is not
				221	NULL, the return value might be a shared object. Therefore, modification of
				222	the resulting Unicode object is only allowed when u is NULL.
				223
				224
.. cfunction:: PyObject* PyUnicode_FromString(const char *u)

Create a Unicode object from a UTF-8 encoded null-terminated char buffer
u.
				229
				230
				231	.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
				232
				233	Take a C :cfunc:`printf`\ -style format string and a variable number of
				234	arguments, calculate the size of the resulting Python unicode string and return
				235	a string with the values formatted into it. The variable arguments must be C
				236	types and must correspond exactly to the format characters in the format
				237	string. The following format characters are allowed:
				238
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	239	.. % This should be exactly the same as the table in PyErr_Format.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	240	.. % The descriptions for %zd and %zu are wrong, but the truth is complicated
				241	.. % because not all compilers support the %z width modifier -- we fake it
				242	.. % when necessary via interpolating PY_FORMAT_SIZE_T.
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	243	.. % Similar comments apply to the %ll width modifier and
				244	.. % PY_FORMAT_LONG_LONG.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	245
+-------------------+---------------------+--------------------------------+
| Format Characters | Type                | Comment                        |
+===================+=====================+================================+
| :attr:`%%`        | n/a                 | The literal % character.       |
+-------------------+---------------------+--------------------------------+
| :attr:`%c`        | int                 | A single character,            |
|                   |                     | represented as a C int.        |
+-------------------+---------------------+--------------------------------+
| :attr:`%d`        | int                 | Exactly equivalent to          |
|                   |                     | ``printf("%d")``.              |
+-------------------+---------------------+--------------------------------+
| :attr:`%u`        | unsigned int        | Exactly equivalent to          |
|                   |                     | ``printf("%u")``.              |
+-------------------+---------------------+--------------------------------+
| :attr:`%ld`       | long                | Exactly equivalent to          |
|                   |                     | ``printf("%ld")``.             |
+-------------------+---------------------+--------------------------------+
| :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
|                   |                     | ``printf("%lu")``.             |
+-------------------+---------------------+--------------------------------+
| :attr:`%lld`      | long long           | Exactly equivalent to          |
|                   |                     | ``printf("%lld")``.            |
+-------------------+---------------------+--------------------------------+
| :attr:`%llu`      | unsigned long long  | Exactly equivalent to          |
|                   |                     | ``printf("%llu")``.            |
+-------------------+---------------------+--------------------------------+
| :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
|                   |                     | ``printf("%zd")``.             |
+-------------------+---------------------+--------------------------------+
| :attr:`%zu`       | size_t              | Exactly equivalent to          |
|                   |                     | ``printf("%zu")``.             |
+-------------------+---------------------+--------------------------------+
| :attr:`%i`        | int                 | Exactly equivalent to          |
|                   |                     | ``printf("%i")``.              |
+-------------------+---------------------+--------------------------------+
| :attr:`%x`        | int                 | Exactly equivalent to          |
|                   |                     | ``printf("%x")``.              |
+-------------------+---------------------+--------------------------------+
| :attr:`%s`        | char\*              | A null-terminated C character  |
|                   |                     | array.                         |
+-------------------+---------------------+--------------------------------+
| :attr:`%p`        | void\*              | The hex representation of a C  |
|                   |                     | pointer. Mostly equivalent to  |
|                   |                     | ``printf("%p")`` except that   |
|                   |                     | it is guaranteed to start with |
|                   |                     | the literal ``0x`` regardless  |
|                   |                     | of what the platform's         |
|                   |                     | ``printf`` yields.             |
+-------------------+---------------------+--------------------------------+
| :attr:`%A`        | PyObject\*          | The result of calling          |
|                   |                     | :func:`ascii`.                 |
+-------------------+---------------------+--------------------------------+
| :attr:`%U`        | PyObject\*          | A unicode object.              |
+-------------------+---------------------+--------------------------------+
| :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
|                   |                     | NULL) and a null-terminated    |
|                   |                     | C character array as a second  |
|                   |                     | parameter (which will be used, |
|                   |                     | if the first parameter is      |
|                   |                     | NULL).                         |
+-------------------+---------------------+--------------------------------+
| :attr:`%S`        | PyObject\*          | The result of calling          |
|                   |                     | :cfunc:`PyObject_Str`.         |
+-------------------+---------------------+--------------------------------+
| :attr:`%R`        | PyObject\*          | The result of calling          |
|                   |                     | :cfunc:`PyObject_Repr`.        |
+-------------------+---------------------+--------------------------------+
				313
				314	An unrecognized format character causes all the rest of the format string to be
				315	copied as-is to the result string, and any extra arguments discarded.
				316
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	317	.. note::
				318
The ``"%lld"`` and ``"%llu"`` format specifiers are only available
when :const:`HAVE_LONG_LONG` is defined.
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	321
				322	.. versionchanged:: 3.2
Georg Brandl	67b21b7	2010-08-17 15:07:14 +0000	[diff] [blame]	323	Support for ``"%lld"`` and ``"%llu"`` added.
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	324
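As an illustration of :cfunc:`PyUnicode_FromFormat`, here is a minimal sketch
of a helper that formats a C string, a :ctype:`Py_ssize_t` and a pointer into
one Unicode object. The helper name ``describe_buffer`` and its message layout
are hypothetical; only the format characters come from the table for
:cfunc:`PyUnicode_FromFormat`.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical helper: build "<name>: <count> item(s) at <pointer>". */
   static PyObject *
   describe_buffer(const char *name, Py_ssize_t count, void *addr)
   {
       /* %s takes a null-terminated char*, %zd a Py_ssize_t, %p a void*.
          The argument types must match the format characters exactly. */
       return PyUnicode_FromFormat("%s: %zd item(s) at %p",
                                   name, count, addr);
   }

The function returns a new reference, or NULL with an exception set, so the
caller is expected to check the result and release it with ``Py_DECREF`` when
done.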
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	325
				326	.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
				327
Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
arguments: the format string and a :ctype:`va_list`.
				330
				331
				332	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
				333
				334	Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
				335	buffer, NULL if unicode is not a Unicode object.
				336
				337
Victor Stinner	e4ea994	2010-09-03 16:23:29 +0000	[diff] [blame]	338	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)
				339
				340	Create a copy of a unicode string ending with a nul character. Return NULL
				341	and raise a :exc:`MemoryError` exception on memory allocation failure,
				342	otherwise return a new allocated buffer (use :cfunc:`PyMem_Free` to free the
				343	buffer).
				344
				345
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	346	.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
				347
				348	Return the length of the Unicode object.
				349
				350
.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

Coerce an encoded object obj to a Unicode object and return a reference with
incremented refcount.
				355
Georg Brandl	952867a	2010-06-27 10:17:12 +0000	[diff] [blame]	356	:class:`bytes`, :class:`bytearray` and other char buffer compatible objects
				357	are decoded according to the given encoding and using the error handling
				358	defined by errors. Both can be NULL to have the interface use the default
				359	values (see the next section for details).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	360
				361	All other objects, including Unicode objects, cause a :exc:`TypeError` to be
				362	set.
				363
				364	The API returns NULL if there was an error. The caller is responsible for
				365	decref'ing the returned objects.
				366
				367
				368	.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
				369
				370	Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
				371	throughout the interpreter whenever coercion to Unicode is needed.
				372
				373	If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
				374	Python can interface directly to this type using the following functions.
				375	Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
				376	the system's :ctype:`wchar_t`.
				377
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	378
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	379	File System Encoding
				380	""""""""""""""""""""
				381
To encode and decode file names and other environment strings,
:cdata:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
``"surrogateescape"`` should be used as the error handler (:pep:`383`). To
encode file names during argument parsing, the ``"O&"`` converter should be
used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	387
				388	.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
				389
Victor Stinner	47fcb5b	2010-08-13 23:59:58 +0000	[diff] [blame]	390	ParseTuple converter: encode :class:`str` objects to :class:`bytes` using
				391	:cfunc:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
				392	result must be a :ctype:`PyBytesObject*` which must be released when it is
				393	no longer used.
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	394
				395	.. versionadded:: 3.1
				396
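A minimal sketch of the ``"O&"`` pattern for :cfunc:`PyUnicode_FSConverter`
follows. The extension function ``path_byte_length`` is hypothetical and exists
only to show where the converter is passed and where its result is released.

.. code-block:: c

   #include <Python.h>

   /* Hypothetical extension function: return the number of bytes in the
      file-system-encoded form of the path argument. */
   static PyObject *
   path_byte_length(PyObject *self, PyObject *args)
   {
       PyObject *path_bytes = NULL;   /* filled in by the converter */
       Py_ssize_t n;

       if (!PyArg_ParseTuple(args, "O&", PyUnicode_FSConverter, &path_bytes))
           return NULL;               /* exception already set */

       n = PyBytes_GET_SIZE(path_bytes);
       Py_DECREF(path_bytes);         /* release the converter's result */
       return PyLong_FromSsize_t(n);
   }

The same shape applies to decoding with a different converter function and a
Unicode result in place of the bytes result.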
Georg Brandl	67b21b7	2010-08-17 15:07:14 +0000	[diff] [blame]	397
To decode file names during argument parsing, the ``"O&"`` converter should be
used, passing :cfunc:`PyUnicode_FSDecoder` as the conversion function:
Victor Stinner	47fcb5b	2010-08-13 23:59:58 +0000	[diff] [blame]	400
				401	.. cfunction:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
				402
				403	ParseTuple converter: decode :class:`bytes` objects to :class:`str` using
				404	:cfunc:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str` objects are output
				405	as-is. result must be a :ctype:`PyUnicodeObject*` which must be released
				406	when it is no longer used.
				407
				408	.. versionadded:: 3.2
				409
Georg Brandl	67b21b7	2010-08-17 15:07:14 +0000	[diff] [blame]	410
.. cfunction:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)

Decode a string of size bytes using :cdata:`Py_FileSystemDefaultEncoding`
and the ``"surrogateescape"`` error handler.

If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.


.. cfunction:: PyObject* PyUnicode_DecodeFSDefault(const char *s)

Decode a null-terminated string using :cdata:`Py_FileSystemDefaultEncoding`
and the ``"surrogateescape"`` error handler.

If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.

Use :cfunc:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
				427
				428
Victor Stinner	ae6265f	2010-05-15 16:27:27 +0000	[diff] [blame]	429	.. cfunction:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
				430
				431	Encode a Unicode object to :cdata:`Py_FileSystemDefaultEncoding` with the
Benjamin Peterson	b432451	2010-05-15 17:42:02 +0000	[diff] [blame]	432	``'surrogateescape'`` error handler, and return :class:`bytes`.
Victor Stinner	ae6265f	2010-05-15 16:27:27 +0000	[diff] [blame]	433
				434	If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.
				435
				436	.. versionadded:: 3.2
				437
				438
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	439	wchar_t Support
				440	"""""""""""""""
				441
				442	wchar_t support for platforms which support it:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	443
				444	.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
				445
				446	Create a Unicode object from the :ctype:`wchar_t` buffer w of the given size.
Martin v. Löwis	790465f	2008-04-05 20:41:37 +0000	[diff] [blame]	447	Passing -1 as the size indicates that the function must itself compute the length,
				448	using wcslen.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	449	Return NULL on failure.
				450
				451
.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
				453
				454	Copy the Unicode object contents into the :ctype:`wchar_t` buffer w. At most
				455	size :ctype:`wchar_t` characters are copied (excluding a possibly trailing
				456	0-termination character). Return the number of :ctype:`wchar_t` characters
				457	copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
				458	string may or may not be 0-terminated. It is the responsibility of the caller
				459	to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
				460	required by the application.
				461
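Since the result of :cfunc:`PyUnicode_AsWideChar` may or may not be
0-terminated, callers usually reserve one slot and terminate the buffer
themselves. The sketch below shows that caller-side pattern in plain C.
``copy_wide_model`` is a hypothetical stand-in that only models the documented
contract (it is not the CPython function), and ``copy_wide_terminated`` is an
equally hypothetical wrapper.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>
   #include <wchar.h>

   /* Stand-in for the documented contract (NOT the CPython function):
      copy at most `size` characters and add a terminator only if there
      is room left in the buffer. */
   static ptrdiff_t copy_wide_model(const wchar_t *src, size_t len,
                                    wchar_t *w, size_t size)
   {
       size_t n = len < size ? len : size;
       wmemcpy(w, src, n);
       if (n < size)
           w[n] = L'\0';
       return (ptrdiff_t)n;
   }

   /* Caller-side pattern: reserve one slot and terminate explicitly. */
   ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t len,
                                  wchar_t *w, size_t capacity)
   {
       ptrdiff_t n;
       assert(capacity > 0);
       n = copy_wide_model(src, len, w, capacity - 1);
       w[n] = L'\0';   /* always in bounds: n <= capacity - 1 */
       return n;
   }

With the real API the same idea applies: pass the buffer capacity minus one as
size, check for ``-1``, then write the terminator at the returned index.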
				462
				463	.. _builtincodecs:
				464
				465	Built-in Codecs
				466	^^^^^^^^^^^^^^^
				467
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	468	Python provides a set of built-in codecs which are written in C for speed. All of
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	469	these codecs are directly usable via the following functions.
				470
Many of the following APIs take two arguments encoding and errors. These
parameters have the same semantics as the ones of the built-in :func:`str`
Unicode object constructor.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	474
Setting encoding to NULL causes the default encoding to be used,
which is UTF-8. File system calls should use
:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
variable should be treated as read-only: on some systems it will be a
pointer to a static string, on others it will change at run-time
(such as when the application invokes setlocale).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	482
				483	Error handling is set by errors which may also be set to NULL meaning to use
				484	the default handling defined for the codec. Default error handling for all
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	485	built-in codecs is "strict" (:exc:`ValueError` is raised).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	486
The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.
				489
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	490
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	491	Generic Codecs
				492	""""""""""""""
				493
				494	These are the generic codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	495
				496
.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

Create a Unicode object by decoding size bytes of the encoded string s.
encoding and errors have the same meaning as the parameters of the same name
in the :func:`str` built-in function. The codec to be used is looked up
using the Python codec registry. Return NULL if an exception was raised by
the codec.
				504
				505
.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
				507
				508	Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	509	bytes object. encoding and errors have the same meaning as the
				510	parameters of the same name in the Unicode :meth:`encode` method. The codec
				511	to be used is looked up using the Python codec registry. Return NULL if an
				512	exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	513
				514
.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

Encode a Unicode object and return the result as a Python bytes object.
				518	encoding and errors have the same meaning as the parameters of the same
				519	name in the Unicode :meth:`encode` method. The codec to be used is looked up
				520	using the Python codec registry. Return NULL if an exception was raised by
				521	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	522
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	523
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	524	UTF-8 Codecs
				525	""""""""""""
				526
				527	These are the UTF-8 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	528
				529
.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
				531
				532	Create a Unicode object by decoding size bytes of the UTF-8 encoded string
				533	s. Return NULL if an exception was raised by the codec.
				534
				535
.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
				537
				538	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
				539	consumed is not NULL, trailing incomplete UTF-8 byte sequences will not be
				540	treated as an error. Those bytes will not be decoded and the number of bytes
				541	that have been decoded will be stored in consumed.
				542
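The idea behind the consumed count can be sketched in plain C. The function
below is a hypothetical illustration, not the CPython decoder: it only inspects
lead bytes and reports how many leading bytes form complete UTF-8 sequences, so
that a trailing incomplete sequence can be carried over to the next chunk. It
does not validate continuation bytes.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Length of a UTF-8 sequence judged by its lead byte; 0 if the byte
      cannot start a sequence. */
   static size_t utf8_seq_len(unsigned char lead)
   {
       if (lead < 0x80) return 1;
       if ((lead & 0xE0) == 0xC0) return 2;
       if ((lead & 0xF0) == 0xE0) return 3;
       if ((lead & 0xF8) == 0xF0) return 4;
       return 0;
   }

   /* Hypothetical helper: number of leading bytes of s[0..size) that
      form complete sequences; the remainder is an incomplete tail. */
   size_t utf8_complete_prefix(const unsigned char *s, size_t size)
   {
       size_t i = 0;
       assert(s != NULL);
       while (i < size) {
           size_t n = utf8_seq_len(s[i]);
           if (n == 0 || i + n > size)
               break;          /* invalid lead byte or truncated tail */
           i += n;
       }
       return i;
   }

A streaming caller would decode the reported prefix, keep the remaining bytes,
and prepend them to the next chunk of input.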
				543
.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				545
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	546	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
				547	return a Python bytes object. Return NULL if an exception was raised by
				548	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	549
				550
				551	.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
				552
Encode a Unicode object using UTF-8 and return the result as a Python bytes
object. Error handling is "strict". Return NULL if an exception was
raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	556
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	557
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	558	UTF-32 Codecs
				559	"""""""""""""
				560
				561	These are the UTF-32 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	562
				563
.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

Decode size bytes from a UTF-32 encoded buffer string and return the
corresponding Unicode object. errors (if non-NULL) defines the error
handling. It defaults to "strict".
				569
				570	If byteorder is non-NULL, the decoder starts decoding using the given byte
				571	order::
				572
				573	*byteorder == -1: little endian
				574	*byteorder == 0: native order
				575	*byteorder == 1: big endian
				576
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	577	If ``*byteorder`` is zero, and the first four bytes of the input data are a
				578	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				579	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				580	``1``, any byte order mark is copied to the output.
				581
After completion, ``*byteorder`` is set to the current byte order at the end
of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	584
				585	In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
				586
				587	If byteorder is NULL, the codec starts in native order mode.
				588
				589	Return NULL if an exception was raised by the codec.
				590
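The byte order rules for the UTF-32 decoder can be sketched in plain C. The
function ``utf32_check_bom`` below is hypothetical and is not part of the
CPython API. It only models the documented behaviour at the start of the
input: a BOM is recognised and skipped when ``*byteorder`` is zero, and left
in place when the caller forces a byte order.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical helper: inspect the first four bytes of a UTF-32
      buffer.  If *byteorder is 0 and a BOM is present, set *byteorder
      to -1 (little endian) or 1 (big endian) and return 4, the number
      of bytes to skip.  Otherwise leave *byteorder alone and return 0. */
   size_t utf32_check_bom(const unsigned char *s, size_t size, int *byteorder)
   {
       assert(s != NULL && byteorder != NULL);
       if (*byteorder != 0 || size < 4)
           return 0;
       if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
           *byteorder = -1;
           return 4;
       }
       if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
           *byteorder = 1;
           return 4;
       }
       return 0;
   }

When the caller passes ``-1`` or ``1``, the function returns 0, which matches
the rule that a BOM is then treated as ordinary data and copied to the output.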
				591
.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				593
				594	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
				595	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
				596	trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
				597	by four) as an error. Those bytes will not be decoded and the number of bytes
				598	that have been decoded will be stored in consumed.


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-32 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   (U+FEFF).  In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.
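
The three *byteorder* modes of the encoder can be sketched in plain C.  This is
an illustration under stated assumptions, not the codec's implementation:
``put_utf32`` and ``encode_utf32`` are hypothetical names, code points are
passed as ``unsigned long`` values, and the caller supplies an output buffer
with room for ``4 * (n + 1)`` bytes::

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical helper: store one code point as four bytes. */
   static size_t
   put_utf32(unsigned char *out, unsigned long ch, int little)
   {
       int i;
       for (i = 0; i < 4; i++) {
           int shift = little ? 8 * i : 8 * (3 - i);
           out[i] = (unsigned char)((ch >> shift) & 0xFF);
       }
       return 4;
   }

   /* Hypothetical helper, not part of the Python C API.
    * Returns the number of bytes written to out. */
   static size_t
   encode_utf32(const unsigned long *cp, size_t n, int byteorder,
                unsigned char *out)
   {
       const unsigned int probe = 1;
       /* Native order: the low byte comes first on little-endian machines. */
       int little = (*(const unsigned char *)&probe == 1);
       size_t i, pos = 0;

       assert(cp != NULL && out != NULL);
       if (byteorder == -1)
           little = 1;
       else if (byteorder == 1)
           little = 0;
       else
           pos += put_utf32(out, 0xFEFFUL, little);  /* byteorder == 0: BOM first */

       for (i = 0; i < n; i++)
           pos += put_utf32(out + pos, cp[i], little);
       return pos;
   }

Only the native mode produces the four extra BOM bytes, so the output for *n*
code points is ``4 * n`` bytes in the explicit modes and ``4 * (n + 1)`` bytes
otherwise.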


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python byte string using the UTF-32 encoding in native byte
   order.  The string always starts with a BOM.  Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.


UTF-16 Codecs
"""""""""""""

These are the UTF-16 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
   handling.  It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   If ``*byteorder`` is zero, and the first two bytes of the input data are a
   byte order mark (BOM), the decoder switches to this byte order and the BOM is
   not copied into the resulting Unicode string.  If ``*byteorder`` is ``-1`` or
   ``1``, any byte order mark is copied to the output (where it will result in
   either a ``\ufeff`` or a ``\ufffe`` character).

   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.
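
The two kinds of incomplete tail named above (an odd byte count and a split
surrogate pair) can be made concrete with a small plain C sketch.  It is an
illustration only, not the decoder's code: ``utf16le_complete_prefix`` is a
hypothetical helper, and the input is assumed to be little endian without a
BOM::

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical helper, not part of the Python C API.
    * Returns how many leading bytes of a little-endian UTF-16 buffer form
    * complete sequences.  The remaining bytes are the ones a stateful
    * decoder would leave undecoded. */
   static size_t
   utf16le_complete_prefix(const unsigned char *s, size_t size)
   {
       size_t usable = size - (size % 2);      /* drop an odd trailing byte */

       assert(s != NULL || size == 0);
       if (usable >= 2) {
           /* Read the last complete 16-bit unit, low byte first. */
           unsigned int unit = s[usable - 2] | ((unsigned int)s[usable - 1] << 8);
           if (unit >= 0xD800 && unit <= 0xDBFF)
               usable -= 2;                    /* high surrogate awaiting its pair */
       }
       return usable;
   }

A caller feeding data in chunks would keep the bytes after the returned offset
and prepend them to the next chunk.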


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python bytes object holding the UTF-16 encoded value of the Unicode
   data in *s*.  Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   (U+FEFF).  In the other two modes, no BOM is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair.  If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.
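
How a single wide value turns into a surrogate pair follows from the UTF-16
encoding form itself.  A minimal plain C sketch of that arithmetic, with the
hypothetical helper name ``to_utf16_units`` and code points passed as
``unsigned long`` (both assumptions made for illustration)::

   #include <assert.h>

   /* Hypothetical helper, not part of the Python C API.
    * Splits one code point into UTF-16 code units; returns 1 or 2. */
   static int
   to_utf16_units(unsigned long ch, unsigned short units[2])
   {
       assert(ch <= 0x10FFFFUL);
       if (ch < 0x10000UL) {
           units[0] = (unsigned short)ch;                   /* BMP: one unit */
           return 1;
       }
       ch -= 0x10000UL;                                     /* 20 bits remain */
       units[0] = (unsigned short)(0xD800 | (ch >> 10));    /* high surrogate */
       units[1] = (unsigned short)(0xDC00 | (ch & 0x3FF));  /* low surrogate */
       return 2;
   }

For example, U+1F600 becomes the pair ``0xD83D 0xDE00``, while any value below
U+10000 is emitted unchanged as one unit.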


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python byte string using the UTF-16 encoding in native byte
   order.  The string always starts with a BOM.  Error handling is "strict".
   Return *NULL* if an exception was raised by the codec.


UTF-7 Codecs
""""""""""""

These are the UTF-7 codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-7 encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF7`.  If
   *consumed* is not *NULL*, trailing incomplete UTF-7 base-64 sections will not
   be treated as an error.  Those bytes will not be decoded and the number of
   bytes that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-7 and
   return a Python bytes object.  Return *NULL* if an exception was raised by
   the codec.

   If *base64SetO* is nonzero, "Set O" (punctuation that has no otherwise
   special meaning) will be encoded in base-64.  If *base64WhiteSpace* is
   nonzero, whitespace will be encoded in base-64.  Both are set to zero for the
   Python "utf-7" codec.


Unicode-Escape Codecs
"""""""""""""""""""""

These are the "Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object.  Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as a Python
   string object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.


Raw-Unicode-Escape Codecs
"""""""""""""""""""""""""

These are the "Raw Unicode Escape" codec APIs:


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object.  Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object.  Error handling is "strict".  Return *NULL* if an exception
   was raised by the codec.


Latin-1 Codecs
""""""""""""""

These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
ordinals, and only these are accepted by the codecs during encoding.
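
Because Latin-1 covers exactly the ordinals 0 through 255, encoding amounts to a
range check followed by a narrowing copy.  The following plain C sketch shows
that idea with "strict" handling; ``encode_latin1_strict`` is a hypothetical
helper used for illustration, not the codec's implementation::

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical helper, not part of the Python C API.
    * Copies n code points into out as Latin-1 bytes.  Returns n on success
    * or -1 at the first code point above 0xFF. */
   static long
   encode_latin1_strict(const unsigned long *cp, size_t n, unsigned char *out)
   {
       size_t i;

       assert(cp != NULL && out != NULL);
       for (i = 0; i < n; i++) {
           if (cp[i] > 0xFF)
               return -1;          /* outside Latin-1: strict handling fails */
           out[i] = (unsigned char)cp[i];
       }
       return (long)n;
   }

Decoding needs no such check, since every byte value is a valid Latin-1
ordinal.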


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
   return a Python bytes object.  Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python bytes
   object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.


ASCII Codecs
""""""""""""

These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted.  All other
codes generate errors.


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
   return a Python bytes object.  Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python bytes
   object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.


Character Map Codecs
""""""""""""""""""""

These are the mapping codec APIs.

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package).  The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or ``None``
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively.  Because of this, mappings only need to contain those
mappings which map characters to different code points.


.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object.  Return *NULL* if an exception was raised by the
   codec.  If *mapping* is *NULL*, Latin-1 decoding will be done.  Otherwise it can
   be a dictionary that maps byte values, or a Unicode string, which is treated as
   a lookup table.  Byte values at or beyond the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
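
The lookup-table case can be sketched in plain C.  This is an illustration of
the rule for undefined mappings, not the codec's implementation;
``decode_with_table`` is a hypothetical helper, the table holds 16-bit code
units, and any undefined mapping is simply reported as failure ("strict"
handling)::

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical helper, not part of the Python C API.
    * Decodes size bytes through a lookup table of table_len entries.
    * Returns the number of characters written, or -1 at the first byte
    * whose mapping is undefined. */
   static long
   decode_with_table(const unsigned char *s, size_t size,
                     const unsigned short *table, size_t table_len,
                     unsigned short *out)
   {
       size_t i;

       assert(s != NULL && table != NULL && out != NULL);
       for (i = 0; i < size; i++) {
           if (s[i] >= table_len || table[s[i]] == 0xFFFE)
               return -1;          /* outside the table, or marked undefined */
           out[i] = table[s[i]];
       }
       return (long)size;
   }

A table shorter than 256 entries therefore leaves all higher byte values
undefined, and U+FFFE can be used to punch holes inside the table.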


.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object.  Return *NULL* if an
   exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given *size* by applying a
   character mapping *table* to it and return the resulting Unicode object.  Return
   *NULL* when an exception was raised by the codec.

   The mapping table must map Unicode ordinal integers to Unicode ordinal
   integers or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.


MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs.  They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`.  If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
   a Python bytes object.  Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python bytes
   object.  Error handling is "strict".  Return *NULL* if an exception was
   raised by the codec.


Methods & Slots
"""""""""""""""


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
   set.  Separators are not included in the resulting list.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is 0, the line break
   characters are not included in the resulting strings.


.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or ``None`` (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs.  It may be *NULL*, which selects the
   default error handling.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given *separator* and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise.  Return ``-1`` if an error occurred.
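
The meaning of *direction* in a tail match can be shown on plain C strings.
The sketch below is an illustration only: ``tail_match`` is a hypothetical
helper, it works on ``char`` data rather than Unicode objects, and it assumes
``0 <= start <= end <= strlen(str)`` instead of handling out-of-range slice
indices::

   #include <assert.h>
   #include <stddef.h>
   #include <string.h>

   /* Hypothetical helper, not part of the Python C API.
    * Returns 1 if substr matches the start (direction <= 0) or the end
    * (direction > 0) of str[start:end], and 0 otherwise. */
   static int
   tail_match(const char *str, const char *substr,
              size_t start, size_t end, int direction)
   {
       size_t sublen;

       assert(str != NULL && substr != NULL);
       assert(start <= end && end <= strlen(str));
       sublen = strlen(substr);
       if (sublen > end - start)
           return 0;                   /* substr is longer than the slice */
       if (direction > 0)
           start = end - sublen;       /* suffix match: compare at the tail */
       return memcmp(str + start, substr, sublen) == 0;
   }

With the slice ``[0:8]`` of ``"filename.txt"``, for instance, a suffix match
for ``".txt"`` fails because the slice ends before the extension.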


.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.


.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.
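
"Non-overlapping" means that the search resumes after the end of each match
rather than one character later.  A minimal plain C sketch of that counting
rule, using the hypothetical helper ``count_occurrences`` on whole ``char``
strings (slicing and the empty-pattern case are left out for brevity)::

   #include <assert.h>
   #include <stddef.h>
   #include <string.h>

   /* Hypothetical helper, not part of the Python C API.
    * Counts non-overlapping occurrences of a non-empty substr in str. */
   static long
   count_occurrences(const char *str, const char *substr)
   {
       size_t sublen;
       long count = 0;
       const char *p;

       assert(str != NULL && substr != NULL);
       sublen = strlen(substr);
       assert(sublen > 0);
       /* After each hit, continue past the whole match, not past one char. */
       for (p = strstr(str, substr); p != NULL; p = strstr(p + sublen, substr))
           count++;
       return count;
   }

Counting ``"aa"`` in ``"aaaa"`` with this rule gives 2, not 3.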


.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object.  *maxcount* == -1 means replace all
   occurrences.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)

   Compare a unicode object, *uni*, with *string* and return -1, 0, 1 for less
   than, equal, and greater than, respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.  The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one element Unicode string.  ``-1`` is returned if
   there was an error.


.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)

   Intern the argument ``*string`` in place.  The argument must be the address of a
   pointer variable pointing to a Python unicode string object.  If there is an
   existing interned string that is the same as ``*string``, it sets ``*string`` to
   it (decrementing the reference count of the old string object and incrementing
   the reference count of the interned string object), otherwise it leaves
   ``*string`` alone and interns it (incrementing its reference count).
   (Clarification: even though there is a lot of talk about reference counts, think
   of this function as reference-count-neutral; you own the object after the call
   if and only if you owned it before the call.)


.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)

   A combination of :cfunc:`PyUnicode_FromString` and
   :cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
   that has been interned, or a new ("owned") reference to an earlier interned
   string object with the same value.
