Blame - Doc/c-api/unicode.rst - platform/external/python/cpython2

Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1	.. highlightlang:: c
				2
				3	.. _unicodeobjects:
				4
				5	Unicode Objects and Codecs
				6	--------------------------
				7
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9
				10	Unicode Objects
				11	^^^^^^^^^^^^^^^
				12
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	13	Unicode Type
				14	""""""""""""
				15
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	16	These are the basic Unicode object types used for the Unicode implementation in
				17	Python:
				18
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	19
				20	.. ctype:: Py_UNICODE
				21
				22	This type represents the storage type which is used by Python internally as
				23	basis for holding Unicode ordinals. Python's default builds use a 16-bit type
				24	for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
				25	possible to build a UCS4 version of Python (most recent Linux distributions come
				26	with UCS4 builds of Python). These builds then use a 32-bit type for
				27	:ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
				28	where :ctype:`wchar_t` is available and compatible with the chosen Python
				29	Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
				30	:ctype:`wchar_t` to enhance native platform compatibility. On all other
				31	platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
				32	short` (UCS2) or :ctype:`unsigned long` (UCS4).
				33
				34	Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
				35	this in mind when writing extensions or interfaces.
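The difference between the two build variants can be sketched in plain C. The helper below is a hypothetical illustration, not part of the Python API: it shows how a single code point occupies one or two 16-bit units in a UCS2 build, whereas a UCS4 build would store every code point in a single 32-bit unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Python API): store one code point as 16-bit
 * units, the way a UCS2 ("narrow") build has to. Returns the number of
 * units written (1 or 2), or 0 if cp is above U+10FFFF. */
size_t narrow_units(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {
        /* Code points in the BMP fit into a single 16-bit unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Everything else is split into a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
    return 2;
}
```

Extension code that indexes into a buffer by character position therefore sees different lengths for the same text depending on the build, which is one reason the two variants are not binary compatible.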
				36
				37
				38	.. ctype:: PyUnicodeObject
				39
				40	This subtype of :ctype:`PyObject` represents a Python Unicode object.
				41
				42
				43	.. cvar:: PyTypeObject PyUnicode_Type
				44
				45	This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
				46	is exposed to Python code as ``str``.
				47
				48	The following APIs are really C macros and can be used to do fast checks and to
				49	access internal read-only data of Unicode objects:
				50
				51
				52	.. cfunction:: int PyUnicode_Check(PyObject *o)
				53
				54	Return true if the object o is a Unicode object or an instance of a Unicode
				55	subtype.
				56
				57
				58	.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
				59
				60	Return true if the object o is a Unicode object, but not an instance of a
				61	subtype.
				62
				63
				64	.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
				65
				66	Return the size of the object. o has to be a :ctype:`PyUnicodeObject` (not
				67	checked).
				68
				69
				70	.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
				71
				72	Return the size of the object's internal buffer in bytes. o has to be a
				73	:ctype:`PyUnicodeObject` (not checked).
				74
				75
				76	.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
				77
				78	Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. o
				79	has to be a :ctype:`PyUnicodeObject` (not checked).
				80
				81
				82	.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
				83
				84	Return a pointer to the internal buffer of the object. o has to be a
				85	:ctype:`PyUnicodeObject` (not checked).
				86
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	87
Alexandre Vassalotti	6d3dfc3	2009-07-29 19:54:39 +0000	[diff] [blame]	88	.. cfunction:: int PyUnicode_ClearFreeList()
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	89
				90	Clear the free list. Return the total number of freed items.
				91
Alexandre Vassalotti	6d3dfc3	2009-07-29 19:54:39 +0000	[diff] [blame]	92
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	93	Unicode Character Properties
				94	""""""""""""""""""""""""""""
				95
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	96	Unicode provides many different character properties. The most often needed ones
				97	are available through these macros which are mapped to C functions depending on
				98	the Python configuration.
				99
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	100
				101	.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
				102
				103	Return 1 or 0 depending on whether ch is a whitespace character.
				104
				105
				106	.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
				107
				108	Return 1 or 0 depending on whether ch is a lowercase character.
				109
				110
				111	.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
				112
				113	Return 1 or 0 depending on whether ch is an uppercase character.
				114
				115
				116	.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
				117
				118	Return 1 or 0 depending on whether ch is a titlecase character.
				119
				120
				121	.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
				122
				123	Return 1 or 0 depending on whether ch is a linebreak character.
				124
				125
				126	.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
				127
				128	Return 1 or 0 depending on whether ch is a decimal character.
				129
				130
				131	.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
				132
				133	Return 1 or 0 depending on whether ch is a digit character.
				134
				135
				136	.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
				137
				138	Return 1 or 0 depending on whether ch is a numeric character.
				139
				140
				141	.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
				142
				143	Return 1 or 0 depending on whether ch is an alphabetic character.
				144
				145
				146	.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
				147
				148	Return 1 or 0 depending on whether ch is an alphanumeric character.
				149
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	150
				151	.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
				152
				153	Return 1 or 0 depending on whether ch is a printable character.
				154	Nonprintable characters are those characters defined in the Unicode character
				155	database as "Other" or "Separator", excepting the ASCII space (0x20) which is
				156	considered printable. (Note that printable characters in this context are
				157	those which should not be escaped when :func:`repr` is invoked on a string.
				158	It has no bearing on the handling of strings written to :data:`sys.stdout` or
				159	:data:`sys.stderr`.)
				160
				161
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	162	These APIs can be used for fast direct character conversions:
				163
				164
				165	.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
				166
				167	Return the character ch converted to lower case.
				168
				169
				170	.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
				171
				172	Return the character ch converted to upper case.
				173
				174
				175	.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
				176
				177	Return the character ch converted to title case.
				178
				179
				180	.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
				181
				182	Return the character ch converted to a decimal positive integer. Return
				183	``-1`` if this is not possible. This macro does not raise exceptions.
				184
				185
				186	.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
				187
				188	Return the character ch converted to a single digit integer. Return ``-1`` if
				189	this is not possible. This macro does not raise exceptions.
				190
				191
				192	.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
				193
				194	Return the character ch converted to a double. Return ``-1.0`` if this is not
				195	possible. This macro does not raise exceptions.
				196
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	197
				198	Plain Py_UNICODE
				199	""""""""""""""""
				200
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	201	To create Unicode objects and access their basic sequence properties, use these
				202	APIs:
				203
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	204
				205	.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
				206
				207	Create a Unicode Object from the Py_UNICODE buffer u of the given size. u
				208	may be NULL which causes the contents to be undefined. It is the user's
				209	responsibility to fill in the needed data. The buffer is copied into the new
				210	object. If the buffer is not NULL, the return value might be a shared object.
				211	Therefore, modification of the resulting Unicode object is only allowed when u
				212	is NULL.
				213
				214
				215	.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
				216
				217	Create a Unicode Object from the char buffer u. The bytes will be interpreted
				218	as being UTF-8 encoded. u may also be NULL which
				219	causes the contents to be undefined. It is the user's responsibility to fill in
				220	the needed data. The buffer is copied into the new object. If the buffer is not
				221	NULL, the return value might be a shared object. Therefore, modification of
				222	the resulting Unicode object is only allowed when u is NULL.
				223
				224
.. cfunction:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.
				229
				230
				231	.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
				232
				233	Take a C :cfunc:`printf`\ -style format string and a variable number of
				234	arguments, calculate the size of the resulting Python unicode string and return
				235	a string with the values formatted into it. The variable arguments must be C
				236	types and must correspond exactly to the format characters in the format
				237	string. The following format characters are allowed:
				238
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	239	.. % This should be exactly the same as the table in PyErr_Format.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	240	.. % The descriptions for %zd and %zu are wrong, but the truth is complicated
				241	.. % because not all compilers support the %z width modifier -- we fake it
				242	.. % when necessary via interpolating PY_FORMAT_SIZE_T.
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	243	.. % Similar comments apply to the %ll width modifier and
				244	.. % PY_FORMAT_LONG_LONG.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	245
   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | n/a                 | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lld`      | long long           | Exactly equivalent to          |
   |                   |                     | ``printf("%lld")``.            |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%llu`      | unsigned long long  | Exactly equivalent to          |
   |                   |                     | ``printf("%llu")``.            |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%A`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`ascii`.                 |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | NULL) and a null-terminated    |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | NULL).                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Str`.         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :cfunc:`PyObject_Repr`.        |
   +-------------------+---------------------+--------------------------------+
				313
				314	An unrecognized format character causes all the rest of the format string to be
				315	copied as-is to the result string, and any extra arguments discarded.
				316
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	317	.. note::
				318
      The ``"%lld"`` and ``"%llu"`` format specifiers are only available
Georg Brandl	ef871f6	2010-03-12 10:06:40 +0000	[diff] [blame]	320	when :const:`HAVE_LONG_LONG` is defined.
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	321
				322	.. versionchanged:: 3.2
Georg Brandl	67b21b7	2010-08-17 15:07:14 +0000	[diff] [blame]	323	Support for ``"%lld"`` and ``"%llu"`` added.
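The "calculate the size, then format" approach described for :cfunc:`PyUnicode_FromFormat` can be illustrated in plain C. The sketch below is not CPython's implementation and handles only the standard ``printf`` conversions, not ``%U``, ``%S`` and the other object conversions; ``format_alloc`` is a hypothetical name chosen for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: measure the formatted length in a first pass,
 * allocate exactly that much, then format into the new buffer.
 * Returns NULL on error; the caller frees the result with free(). */
char *format_alloc(const char *format, ...)
{
    va_list args, measure;
    int needed;
    char *buf;

    va_start(args, format);
    va_copy(measure, args);

    /* First pass: a NULL buffer of size 0 only reports the length. */
    needed = vsnprintf(NULL, 0, format, measure);
    va_end(measure);
    if (needed < 0) {
        va_end(args);
        return NULL;
    }

    /* Second pass: format into a buffer of the measured size. */
    buf = malloc((size_t)needed + 1);
    if (buf != NULL)
        vsnprintf(buf, (size_t)needed + 1, format, args);
    va_end(args);
    return buf;
}
```

As with the real function, the variable arguments must match the conversions in the format string exactly, since neither pass can check their types.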
Mark Dickinson	6ce4a9a	2009-11-16 17:00:11 +0000	[diff] [blame]	324
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	325
				326	.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
				327
Victor Stinner	6009ece	2010-08-17 22:01:02 +0000	[diff] [blame]	328	Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	329	arguments.
				330
				331
				332	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
				333
				334	Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
				335	buffer, NULL if unicode is not a Unicode object.
				336
				337
Victor Stinner	e4ea994	2010-09-03 16:23:29 +0000	[diff] [blame]	338	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicodeCopy(PyObject *unicode)
				339
   Create a copy of a Unicode string ending with a null character. Return NULL
   and raise a :exc:`MemoryError` exception on memory allocation failure,
   otherwise return a newly allocated buffer (use :cfunc:`PyMem_Free` to free the
   buffer).
				344
Victor Stinner	2b19f35	2010-09-03 22:13:42 +0000	[diff] [blame^]	345	.. versionadded:: 3.2
				346
Victor Stinner	e4ea994	2010-09-03 16:23:29 +0000	[diff] [blame]	347
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	348	.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
				349
				350	Return the length of the Unicode object.
				351
				352
.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.
				357
Georg Brandl	952867a	2010-06-27 10:17:12 +0000	[diff] [blame]	358	:class:`bytes`, :class:`bytearray` and other char buffer compatible objects
				359	are decoded according to the given encoding and using the error handling
				360	defined by errors. Both can be NULL to have the interface use the default
				361	values (see the next section for details).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	362
				363	All other objects, including Unicode objects, cause a :exc:`TypeError` to be
				364	set.
				365
				366	The API returns NULL if there was an error. The caller is responsible for
				367	decref'ing the returned objects.
				368
				369
				370	.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
				371
				372	Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
				373	throughout the interpreter whenever coercion to Unicode is needed.
				374
				375	If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
				376	Python can interface directly to this type using the following functions.
				377	Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
				378	the system's :ctype:`wchar_t`.
				379
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	380
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	381	File System Encoding
				382	""""""""""""""""""""
				383
To encode and decode file names and other environment strings,
:cdata:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
``"surrogateescape"`` should be used as the error handler (:pep:`383`). To
encode file names during argument parsing, the ``"O&"`` converter should be
used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	389
				390	.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
				391
Victor Stinner	47fcb5b	2010-08-13 23:59:58 +0000	[diff] [blame]	392	ParseTuple converter: encode :class:`str` objects to :class:`bytes` using
				393	:cfunc:`PyUnicode_EncodeFSDefault`; :class:`bytes` objects are output as-is.
				394	result must be a :ctype:`PyBytesObject*` which must be released when it is
				395	no longer used.
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	396
				397	.. versionadded:: 3.1
				398
Georg Brandl	67b21b7	2010-08-17 15:07:14 +0000	[diff] [blame]	399
To decode file names during argument parsing, the ``"O&"`` converter should be
used, passing :cfunc:`PyUnicode_FSDecoder` as the conversion function:
Victor Stinner	47fcb5b	2010-08-13 23:59:58 +0000	[diff] [blame]	402
				403	.. cfunction:: int PyUnicode_FSDecoder(PyObject* obj, void* result)
				404
				405	ParseTuple converter: decode :class:`bytes` objects to :class:`str` using
				406	:cfunc:`PyUnicode_DecodeFSDefaultAndSize`; :class:`str` objects are output
				407	as-is. result must be a :ctype:`PyUnicodeObject*` which must be released
				408	when it is no longer used.
				409
				410	.. versionadded:: 3.2
				411
Georg Brandl	67b21b7	2010-08-17 15:07:14 +0000	[diff] [blame]	412
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	413	.. cfunction:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
				414
   Decode a string of *size* bytes using :cdata:`Py_FileSystemDefaultEncoding`
   and the ``"surrogateescape"`` error handler.

   If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.

   Use :cfunc:`PyUnicode_DecodeFSDefault` if the string is null-terminated and
   you do not know its length.
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	421
Victor Stinner	ae6265f	2010-05-15 16:27:27 +0000	[diff] [blame]	422
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	423	.. cfunction:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
				424
   Decode a null-terminated string using :cdata:`Py_FileSystemDefaultEncoding`
   and the ``"surrogateescape"`` error handler.

   If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.

   Use :cfunc:`PyUnicode_DecodeFSDefaultAndSize` if you know the string length.
				429
				430
Victor Stinner	ae6265f	2010-05-15 16:27:27 +0000	[diff] [blame]	431	.. cfunction:: PyObject* PyUnicode_EncodeFSDefault(PyObject *unicode)
				432
   Encode a Unicode object to :cdata:`Py_FileSystemDefaultEncoding` with the
   ``"surrogateescape"`` error handler, and return :class:`bytes`.
Victor Stinner	ae6265f	2010-05-15 16:27:27 +0000	[diff] [blame]	435
				436	If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.
				437
				438	.. versionadded:: 3.2
				439
				440
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	441	wchar_t Support
				442	"""""""""""""""
				443
				444	wchar_t support for platforms which support it:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	445
				446	.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
				447
				448	Create a Unicode object from the :ctype:`wchar_t` buffer w of the given size.
Martin v. Löwis	790465f	2008-04-05 20:41:37 +0000	[diff] [blame]	449	Passing -1 as the size indicates that the function must itself compute the length,
				450	using wcslen.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	451	Return NULL on failure.
				452
				453
.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
				455
				456	Copy the Unicode object contents into the :ctype:`wchar_t` buffer w. At most
				457	size :ctype:`wchar_t` characters are copied (excluding a possibly trailing
				458	0-termination character). Return the number of :ctype:`wchar_t` characters
				459	copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
				460	string may or may not be 0-terminated. It is the responsibility of the caller
				461	to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
				462	required by the application.
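Because the copied string may not be 0-terminated, a common caller-side pattern is to hold back one slot of the buffer and write the terminator explicitly. The sketch below shows that pattern in plain C; ``copy_wide`` is a hypothetical stand-in that only mimics the copying behaviour described above and is not the Python API.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the copying behaviour described above:
 * copies at most `size` characters and never writes a terminator. */
static ptrdiff_t copy_wide(const wchar_t *src, size_t len,
                           wchar_t *w, size_t size)
{
    size_t n = len < size ? len : size;
    wmemcpy(w, src, n);
    return (ptrdiff_t)n;
}

/* Caller-side pattern: reserve the last slot of a buffer holding
 * `capacity` characters, then terminate the result explicitly.
 * Returns the number of characters copied, or -1 if capacity is 0. */
ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t len,
                               wchar_t *w, size_t capacity)
{
    ptrdiff_t n;

    if (capacity == 0)
        return -1;
    n = copy_wide(src, len, w, capacity - 1);
    w[n] = L'\0';
    return n;
}
```

With the real function, the same idea applies: pass one less than the buffer capacity as the size and store ``L'\0'`` at the index given by the returned count.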
				463
				464
				465	.. _builtincodecs:
				466
				467	Built-in Codecs
				468	^^^^^^^^^^^^^^^
				469
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	470	Python provides a set of built-in codecs which are written in C for speed. All of
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	471	these codecs are directly usable via the following functions.
				472
				473	Many of the following APIs take two arguments encoding and errors. These
				474	parameters encoding and errors have the same semantics as the ones of the
Daniel Stutzbach	98c07bd	2010-09-03 18:31:07 +0000	[diff] [blame]	475	built-in :func:`str` string object constructor.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	476
Setting encoding to NULL causes the default encoding to be used,
which is UTF-8. The file system calls should use
				479	:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
				480	variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
				481	variable should be treated as read-only: On some systems, it will be a
				482	pointer to a static string, on others, it will change at run-time
				483	(such as when the application invokes setlocale).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	484
				485	Error handling is set by errors which may also be set to NULL meaning to use
				486	the default handling defined for the codec. Default error handling for all
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	487	built-in codecs is "strict" (:exc:`ValueError` is raised).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	488
The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.
				491
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	492
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	493	Generic Codecs
				494	""""""""""""""
				495
				496	These are the generic codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	497
				498
.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`str` built-in function. The codec to be used is looked up
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	504	using the Python codec registry. Return NULL if an exception was raised by
				505	the codec.
				506
				507
.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
				509
				510	Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	511	bytes object. encoding and errors have the same meaning as the
				512	parameters of the same name in the Unicode :meth:`encode` method. The codec
				513	to be used is looked up using the Python codec registry. Return NULL if an
				514	exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	515
				516
.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as a Python bytes object.
				520	encoding and errors have the same meaning as the parameters of the same
				521	name in the Unicode :meth:`encode` method. The codec to be used is looked up
				522	using the Python codec registry. Return NULL if an exception was raised by
				523	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	524
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	525
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	526	UTF-8 Codecs
				527	""""""""""""
				528
				529	These are the UTF-8 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	530
				531
.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
				533
				534	Create a Unicode object by decoding size bytes of the UTF-8 encoded string
				535	s. Return NULL if an exception was raised by the codec.
				536
				537
.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
				539
				540	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
				541	consumed is not NULL, trailing incomplete UTF-8 byte sequences will not be
				542	treated as an error. Those bytes will not be decoded and the number of bytes
				543	that have been decoded will be stored in consumed.
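The effect of a non-NULL *consumed* can be illustrated without the Python API. The hypothetical helper below computes how many leading bytes of a buffer form complete UTF-8 sequences, which is the kind of value a stateful decoder reports so that the caller can keep the remaining bytes for the next chunk. It assumes the input is well-formed apart from a possibly truncated final sequence and performs no validation.

```c
#include <stddef.h>

/* Hypothetical helper (not a Python API): return the number of leading
 * bytes in s[0..size) that make up complete UTF-8 sequences. A truncated
 * sequence at the end is left out of the count. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;

    while (i < size) {
        unsigned char lead = s[i];
        size_t need;

        /* The lead byte determines the length of the sequence. */
        if (lead < 0x80)
            need = 1;
        else if ((lead & 0xE0) == 0xC0)
            need = 2;
        else if ((lead & 0xF0) == 0xE0)
            need = 3;
        else if ((lead & 0xF8) == 0xF0)
            need = 4;
        else
            need = 1;   /* invalid lead byte: left for a real decoder to report */

        if (size - i < need)
            break;      /* incomplete trailing sequence: stop before it */
        i += need;
    }
    return i;
}
```

A streaming caller would decode the first ``utf8_complete_prefix(s, size)`` bytes and prepend the rest to the next chunk it reads.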
				544
				545
.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				547
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	548	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
				549	return a Python bytes object. Return NULL if an exception was raised by
				550	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	551
				552
				553	.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
				554
   Encode a Unicode object using UTF-8 and return the result as a Python bytes
   object. Error handling is "strict". Return NULL if an exception was
				557	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	558
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	559
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	560	UTF-32 Codecs
				561	"""""""""""""
				562
				563	These are the UTF-32 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	564
				565
.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from the UTF-32 encoded buffer string *s* and return the
   corresponding Unicode object. *errors* (if non-NULL) defines the error
   handling. It defaults to "strict".
				571
				572	If byteorder is non-NULL, the decoder starts decoding using the given byte
				573	order::
				574
				575	*byteorder == -1: little endian
				576	*byteorder == 0: native order
				577	*byteorder == 1: big endian
				578
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	579	If ``*byteorder`` is zero, and the first four bytes of the input data are a
				580	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				581	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				582	``1``, any byte order mark is copied to the output.
				583
   After completion, ``*byteorder`` is set to the current byte order at the end
   of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	586
				587	In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
				588
				589	If byteorder is NULL, the codec starts in native order mode.
				590
				591	Return NULL if an exception was raised by the codec.
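The byte order mark rule for a non-NULL *byteorder* can be sketched in plain C. ``utf32_bom_skip`` is a hypothetical helper, not part of the Python API, and it covers only the BOM handling, not the decoding itself. It assumes the caller passes a non-NULL pointer.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the BOM rule described above.
 * On entry *byteorder is -1, 0 or 1. If it is 0 and the data starts
 * with a UTF-32 BOM, *byteorder is set to -1 (little endian) or
 * 1 (big endian) and 4 is returned: the number of BOM bytes to skip.
 * Otherwise 0 is returned and *byteorder is left unchanged, so any
 * BOM stays part of the data. */
size_t utf32_bom_skip(const unsigned char *s, size_t size, int *byteorder)
{
    if (*byteorder != 0 || size < 4)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
        *byteorder = -1;    /* FF FE 00 00: little endian */
        return 4;
    }
    if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
        *byteorder = 1;     /* 00 00 FE FF: big endian */
        return 4;
    }
    return 0;
}
```

Starting with ``*byteorder`` set to ``-1`` or ``1`` makes the helper return 0 even when the data begins with a BOM, matching the statement that the mark is then copied to the output.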
				592
				593
.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				595
				596	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
				597	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
				598	trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
				599	by four) as an error. Those bytes will not be decoded and the number of bytes
				600	that have been decoded will be stored in consumed.
				601
				602
				603	.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				604
				605	Return a Python bytes object holding the UTF-32 encoded value of the Unicode
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	606	data in s. Output is written according to the following byte order::
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	607
				608	byteorder == -1: little endian
				609	byteorder == 0: native byte order (writes a BOM mark)
				610	byteorder == 1: big endian
				611
				612	If byteorder is ``0``, the output string will always start with the Unicode BOM
				613	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				614
				615	If Py_UNICODE_WIDE is not defined, surrogate pairs will be output
				616	as a single codepoint.
				617
				618	Return NULL if an exception was raised by the codec.
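As a rough model of the encoding rules above, the following sketch writes 16-bit code units as UTF-32, emits a BOM only in native mode, and merges a surrogate pair into one codepoint. The names ``put32`` and ``utf32_encode_units`` are hypothetical, and ``uint16_t`` stands in for a narrow-build :ctype:`Py_UNICODE`; this is not the codec's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store one code point as four bytes. */
static size_t put32(unsigned char *out, uint32_t cp, int little)
{
    for (int k = 0; k < 4; k++) {
        int shift = little ? 8 * k : 8 * (3 - k);
        out[k] = (unsigned char)(cp >> shift);
    }
    return 4;
}

/* Hypothetical helper, not part of the Python C API.
 * byteorder: -1 little endian, 1 big endian, 0 native order plus a BOM.
 * out must have room for 4 * (n + 1) bytes. Returns the bytes written. */
static size_t utf32_encode_units(const uint16_t *s, size_t n, int byteorder,
                                 unsigned char *out)
{
    const uint16_t probe = 1;  /* used to detect the native order at run time */
    int little = byteorder < 0 ||
                 (byteorder == 0 && *(const unsigned char *)&probe == 1);
    size_t len = 0;

    assert(s != NULL && out != NULL);
    if (byteorder == 0)
        len += put32(out + len, 0xFEFF, little);  /* BOM only in native mode */
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = s[i];
        /* Combine a high/low surrogate pair into a single code point. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            i++;
        }
        len += put32(out + len, cp, little);
    }
    return len;
}
```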
				619
				620
				621	.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
				622
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	623	Return a Python byte string using the UTF-32 encoding in native byte
				624	order. The string always starts with a BOM mark. Error handling is "strict".
				625	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	626
				627
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	628	UTF-16 Codecs
				629	"""""""""""""
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	630
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	631	These are the UTF-16 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	632
				633
				634	.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				635
				636	Decode size bytes from a UTF-16 encoded buffer string and return the
				637	corresponding Unicode object. errors (if non-NULL) defines the error
				638	handling. It defaults to "strict".
				639
				640	If byteorder is non-NULL, the decoder starts decoding using the given byte
				641	order::
				642
				643	*byteorder == -1: little endian
				644	*byteorder == 0: native order
				645	*byteorder == 1: big endian
				646
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	647	If ``*byteorder`` is zero, and the first two bytes of the input data are a
				648	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				649	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				650	``1``, any byte order mark is copied to the output (where it will result in
				651	either a ``\ufeff`` or a ``\ufffe`` character).
				652
				653	After completion, ``*byteorder`` is set to the current byte order at the end
				654	of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	655
				656	If byteorder is NULL, the codec starts in native order mode.
				657
				658	Return NULL if an exception was raised by the codec.
				659
				660
				661	.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				662
				663	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
				664	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
				665	trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
				666	split surrogate pair) as an error. Those bytes will not be decoded and the
				667	number of bytes that have been decoded will be stored in consumed.
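To make the meaning of consumed concrete, here is a minimal sketch that computes how many bytes of a little-endian UTF-16 buffer form complete characters. The helper ``utf16le_complete_prefix`` is hypothetical; it only models the two incomplete cases named above and does not reproduce the decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Returns the number of leading bytes that can be decoded now; the rest
 * (an odd trailing byte or a high surrogate still waiting for its low
 * surrogate) would be left for the next call. Assumes little endian. */
static size_t utf16le_complete_prefix(const unsigned char *s, size_t size)
{
    size_t usable = size - (size % 2);  /* drop a trailing odd byte */

    assert(s != NULL || size == 0);
    if (usable >= 2) {
        unsigned int last = s[usable - 2] | ((unsigned int)s[usable - 1] << 8);
        /* A final high surrogate is one half of a split pair. */
        if (last >= 0xD800 && last <= 0xDBFF)
            usable -= 2;
    }
    return usable;
}
```

A caller that feeds data in chunks would keep the bytes after the returned offset and prepend them to the next chunk.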
				668
				669
				670	.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				671
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	672	Return a Python bytes object holding the UTF-16 encoded value of the Unicode
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	673	data in s. Output is written according to the following byte order::
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	674
				675	byteorder == -1: little endian
				676	byteorder == 0: native byte order (writes a BOM mark)
				677	byteorder == 1: big endian
				678
				679	If byteorder is ``0``, the output string will always start with the Unicode BOM
				680	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				681
				682	If Py_UNICODE_WIDE is defined, a single :ctype:`Py_UNICODE` value may get
				683	represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
				684	value is interpreted as a UCS-2 character.
				685
				686	Return NULL if an exception was raised by the codec.
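The surrogate pair representation mentioned above follows the standard UTF-16 arithmetic. The sketch below shows that arithmetic in isolation; ``utf16_units_for`` is a hypothetical name, not a function of the Python C API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
 * Writes the UTF-16 code units for cp into out and returns how many
 * were written: 1 for the BMP, 2 (a surrogate pair) above U+FFFF. */
static int utf16_units_for(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```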
				687
				688
				689	.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
				690
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	691	Return a Python byte string using the UTF-16 encoding in native byte
				692	order. The string always starts with a BOM mark. Error handling is "strict".
				693	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	694
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	695
Georg Brandl	8477f82	2010-08-02 20:05:19 +0000	[diff] [blame]	696	UTF-7 Codecs
				697	""""""""""""
				698
				699	These are the UTF-7 codec APIs:
				700
				701
				702	.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
				703
				704	Create a Unicode object by decoding size bytes of the UTF-7 encoded string
				705	s. Return NULL if an exception was raised by the codec.
				706
				707
Georg Brandl	4d22409	2010-08-13 15:10:49 +0000	[diff] [blame]	708	.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
Georg Brandl	8477f82	2010-08-02 20:05:19 +0000	[diff] [blame]	709
				710	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF7`. If
				711	consumed is not NULL, trailing incomplete UTF-7 base-64 sections will not
				712	be treated as an error. Those bytes will not be decoded and the number of
				713	bytes that have been decoded will be stored in consumed.
				714
				715
				716	.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)
				717
				718	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-7 and
				719	return a Python bytes object. Return NULL if an exception was raised by
				720	the codec.
				721
				722	If base64SetO is nonzero, "Set O" (punctuation that has no otherwise
				723	special meaning) will be encoded in base-64. If base64WhiteSpace is
				724	nonzero, whitespace will be encoded in base-64. Both are set to zero for the
				725	Python "utf-7" codec.
				726
				727
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	728	Unicode-Escape Codecs
				729	"""""""""""""""""""""
				730
				731	These are the "Unicode Escape" codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	732
				733
				734	.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				735
				736	Create a Unicode object by decoding size bytes of the Unicode-Escape encoded
				737	string s. Return NULL if an exception was raised by the codec.
				738
				739
				740	.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
				741
				742	Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
				743	return a Python string object. Return NULL if an exception was raised by the
				744	codec.
				745
				746
				747	.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
				748
				749	Encode a Unicode object using Unicode-Escape and return the result as a Python
				750	string object. Error handling is "strict". Return NULL if an exception was
				751	raised by the codec.
				752
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	753
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	754	Raw-Unicode-Escape Codecs
				755	"""""""""""""""""""""""""
				756
				757	These are the "Raw Unicode Escape" codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	758
				759
				760	.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				761
				762	Create a Unicode object by decoding size bytes of the Raw-Unicode-Escape
				763	encoded string s. Return NULL if an exception was raised by the codec.
				764
				765
				766	.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				767
				768	Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
				769	and return a Python string object. Return NULL if an exception was raised by
				770	the codec.
				771
				772
				773	.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
				774
				775	Encode a Unicode object using Raw-Unicode-Escape and return the result as a
				776	Python string object. Error handling is "strict". Return NULL if an exception
				777	was raised by the codec.
				778
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	779
				780	Latin-1 Codecs
				781	""""""""""""""
				782
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	783	These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
				784	ordinals and only these are accepted by the codecs during encoding.
				785
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	786
				787	.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
				788
				789	Create a Unicode object by decoding size bytes of the Latin-1 encoded string
				790	s. Return NULL if an exception was raised by the codec.
				791
				792
				793	.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				794
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	795	Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
				796	return a Python bytes object. Return NULL if an exception was raised by
				797	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	798
				799
				800	.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
				801
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	802	Encode a Unicode object using Latin-1 and return the result as a Python bytes
				803	object. Error handling is "strict". Return NULL if an exception was
				804	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	805
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	806
				807	ASCII Codecs
				808	""""""""""""
				809
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	810	These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
				811	codes generate errors.
				812
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	813
				814	.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
				815
				816	Create a Unicode object by decoding size bytes of the ASCII encoded string
				817	s. Return NULL if an exception was raised by the codec.
				818
				819
				820	.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				821
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	822	Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
				823	return a Python bytes object. Return NULL if an exception was raised by
				824	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	825
				826
				827	.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
				828
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	829	Encode a Unicode object using ASCII and return the result as a Python bytes
				830	object. Error handling is "strict". Return NULL if an exception was
				831	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	832
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	833
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	834	Character Map Codecs
				835	""""""""""""""""""""
				836
				837	These are the mapping codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	838
				839	This codec is special in that it can be used to implement many different codecs
				840	(and this is in fact what was done to obtain most of the standard codecs
				841	included in the :mod:`encodings` package). The codec uses mappings to encode and
				842	decode characters.
				843
				844	Decoding mappings must map single string characters to single Unicode
				845	characters, integers (which are then interpreted as Unicode ordinals) or None
				846	(meaning "undefined mapping" and causing an error).
				847
				848	Encoding mappings must map single Unicode characters to single string
				849	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				850	(meaning "undefined mapping" and causing an error).
				851
				852	The mapping objects provided need only support the :meth:`__getitem__` mapping
				853	interface.
				854
				855	If a character lookup fails with a LookupError, the character is copied as-is,
				856	meaning that its ordinal value is interpreted as a Unicode ordinal when decoding
				857	or as a Latin-1 ordinal when encoding. Because of this, mappings only need to
				858	contain entries for characters that map to different code points.
				859
				860
				861	.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				862
				863	Create a Unicode object by decoding size bytes of the encoded string s using
				864	the given mapping object. Return NULL if an exception was raised by the
				865	codec. If mapping is NULL, Latin-1 decoding will be done. Otherwise it can be a
				866	dictionary mapping byte values, or a unicode string that is treated as a lookup
				867	table. Byte values greater than the length of the string and U+FFFE
				868	"characters" are treated as "undefined mapping".
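The lookup-table case can be pictured with a small standalone sketch. It assumes that a byte value at or beyond the end of the table is undefined and that an undefined mapping aborts decoding, as "strict" error handling would; ``charmap_decode`` and ``UNDEFINED_MAPPING`` are hypothetical names, not part of the Python C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDEFINED_MAPPING 0xFFFE  /* U+FFFE marks "undefined mapping" */

/* Hypothetical helper, not part of the Python C API.
 * Decodes size bytes through a lookup table of table_len code units.
 * Returns the number of characters written to out, or -1 as soon as a
 * byte has no defined mapping. */
static long charmap_decode(const unsigned char *s, size_t size,
                           const uint16_t *table, size_t table_len,
                           uint16_t *out)
{
    assert(table != NULL && out != NULL);
    for (size_t i = 0; i < size; i++) {
        if (s[i] >= table_len || table[s[i]] == UNDEFINED_MAPPING)
            return -1;
        out[i] = table[s[i]];
    }
    return (long)size;
}
```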
				869
				870
				871	.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				872
				873	Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
				874	mapping object and return a Python string object. Return NULL if an
				875	exception was raised by the codec.
				876
				877
				878	.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				879
				880	Encode a Unicode object using the given mapping object and return the result
				881	as a Python string object. Error handling is "strict". Return NULL if an
				882	exception was raised by the codec.
				883
				884	The following codec API is special in that it maps Unicode to Unicode.
				885
				886
				887	.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				888
				889	Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
				890	character mapping table to it and return the resulting Unicode object. Return
				891	NULL when an exception was raised by the codec.
				892
				893	The mapping table must map Unicode ordinal integers to Unicode ordinal
				894	integers or None (causing deletion of the character).
				895
				896	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				897	and sequences work well. Unmapped character ordinals (ones which cause a
				898	:exc:`LookupError`) are left untouched and are copied as-is.
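The three outcomes of a table lookup (replace, delete, copy as-is) can be sketched as follows. The table is modelled as a plain array rather than a Python mapping, and ``translate_entry``, ``TRANSLATE_DELETE`` and ``translate_ordinals`` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRANSLATE_DELETE (-1L)  /* plays the role of None in the table */

struct translate_entry {
    uint32_t from;  /* source ordinal */
    long to;        /* target ordinal, or TRANSLATE_DELETE */
};

/* Hypothetical helper, not part of the Python C API.
 * Applies table to size ordinals and returns the output length. */
static size_t translate_ordinals(const uint32_t *s, size_t size,
                                 const struct translate_entry *table,
                                 size_t table_len, uint32_t *out)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < size; i++) {
        long mapped = (long)s[i];  /* unmapped ordinals are copied as-is */
        for (size_t k = 0; k < table_len; k++) {
            if (table[k].from == s[i]) {
                mapped = table[k].to;
                break;
            }
        }
        if (mapped != TRANSLATE_DELETE)
            out[n++] = (uint32_t)mapped;
    }
    return n;
}
```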
				899
Jeroen Ruigrok van der Werven	47a7d70	2009-04-27 05:43:17 +0000	[diff] [blame]	900
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	901	MBCS codecs for Windows
				902	"""""""""""""""""""""""
				903
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	904	These are the MBCS codec APIs. They are currently only available on Windows and
				905	use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
				906	DBCS) is a class of encodings, not just one. The target encoding is defined by
				907	the user settings on the machine running the codec.
				908
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	909
				910
				911	.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				912
				913	Create a Unicode object by decoding size bytes of the MBCS encoded string s.
				914	Return NULL if an exception was raised by the codec.
				915
				916
				917	.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				918
				919	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
				920	consumed is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
				921	a trailing lead byte, and the number of bytes that have been decoded will be stored
				922	in consumed.
				923
				924
				925	.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				926
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	927	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
				928	a Python bytes object. Return NULL if an exception was raised by the
				929	codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	930
				931
				932	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				933
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	934	Encode a Unicode object using MBCS and return the result as a Python bytes
				935	object. Error handling is "strict". Return NULL if an exception was
				936	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	937
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	938
Victor Stinner	77c3862	2010-05-14 15:58:55 +0000	[diff] [blame]	939	Methods & Slots
				940	"""""""""""""""
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	941
				942
				943	.. _unicodemethodsandslots:
				944
				945	Methods and Slot Functions
				946	^^^^^^^^^^^^^^^^^^^^^^^^^^
				947
				948	The following APIs are capable of handling Unicode objects and strings on input
				949	(we refer to them as strings in the descriptions) and return Unicode objects or
				950	integers as appropriate.
				951
				952	They all return NULL or ``-1`` if an exception occurs.
				953
				954
				955	.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
				956
				957	Concatenate two strings, giving a new Unicode string.
				958
				959
				960	.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				961
				962	Split a string giving a list of Unicode strings. If sep is NULL, splitting
				963	will be done at all whitespace substrings. Otherwise, splits occur at the given
				964	separator. At most maxsplit splits will be done. If negative, no limit is
				965	set. Separators are not included in the resulting list.
				966
				967
				968	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				969
				970	Split a Unicode string at line breaks, returning a list of Unicode strings.
				971	CRLF is considered to be one line break. If keepend is 0, the line break
				972	characters are not included in the resulting strings.
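The CRLF rule and the effect of keepend can be illustrated with a simplified scanner over a byte string. It recognises only ``\r`` and ``\n``, which is an assumption of this sketch and narrower than what the real function treats as a line break; ``next_line`` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Scans one line starting at s. Stores the length of the text before the
 * line break in *text_len and returns the bytes consumed, break included. */
static size_t next_line(const char *s, size_t size, size_t *text_len)
{
    size_t i = 0;

    assert(s != NULL && text_len != NULL);
    while (i < size && s[i] != '\n' && s[i] != '\r')
        i++;
    *text_len = i;
    if (i < size) {
        /* "\r\n" is consumed as a single line break. */
        if (s[i] == '\r' && i + 1 < size && s[i + 1] == '\n')
            i++;
        i++;
    }
    return i;
}
```

With a nonzero keepend a caller would copy the returned number of bytes for each line; with keepend equal to 0 it would copy only ``*text_len`` bytes.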
				973
				974
				975	.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				976
				977	Translate a string by applying a character mapping table to it and return the
				978	resulting Unicode object.
				979
				980	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				981	or None (causing deletion of the character).
				982
				983	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				984	and sequences work well. Unmapped character ordinals (ones which cause a
				985	:exc:`LookupError`) are left untouched and are copied as-is.
				986
				987	errors has the usual meaning for codecs. It may be NULL, which indicates that
				988	the default error handling should be used.
				989
				990
				991	.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				992
				993	Join a sequence of strings using the given separator and return the resulting
				994	Unicode string.
				995
				996
				997	.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				998
				999	Return 1 if substr matches str[start:end] at the given tail end
				1000	(direction == -1 means to do a prefix match, direction == 1 a suffix match),
				1001	0 otherwise. Return ``-1`` if an error occurred.
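The direction argument only decides which end of the slice is compared. A minimal sketch on C byte strings, assuming ``start <= end <= strlen(str)`` and using the hypothetical name ``tailmatch``:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * direction == -1 checks for a prefix of str[start:end],
 * direction == 1 checks for a suffix. Returns 1 on a match, else 0. */
static int tailmatch(const char *str, const char *substr,
                     size_t start, size_t end, int direction)
{
    size_t sub_len = strlen(substr);

    assert(start <= end && end <= strlen(str));
    if (sub_len > end - start)
        return 0;                 /* substr cannot fit in the slice */
    if (direction > 0)
        start = end - sub_len;    /* compare against the end of the slice */
    return memcmp(str + start, substr, sub_len) == 0;
}
```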
				1002
				1003
				1004	.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				1005
				1006	Return the first position of substr in str[start:end] using the given
				1007	direction (direction == 1 means to do a forward search, direction == -1 a
				1008	backward search). The return value is the index of the first match; a value of
				1009	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				1010	occurred and an exception has been set.
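The search direction determines which match is reported first. The sketch below models that on C byte strings; it has no error path, so it only ever returns an index or ``-1``. The name ``find_in_slice`` is hypothetical, and ``start <= end <= strlen(str)`` is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * direction == 1 searches forward, direction == -1 backward.
 * Returns the index of the first match found, or -1 if there is none. */
static long find_in_slice(const char *str, const char *substr,
                          size_t start, size_t end, int direction)
{
    size_t sub_len = strlen(substr);

    assert(start <= end && end <= strlen(str));
    if (sub_len > end - start)
        return -1;
    size_t last = end - sub_len;  /* last index where a match can begin */
    for (size_t k = 0; k <= last - start; k++) {
        size_t pos = direction > 0 ? start + k : last - k;
        if (memcmp(str + pos, substr, sub_len) == 0)
            return (long)pos;
    }
    return -1;
}
```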
				1011
				1012
				1013	.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				1014
				1015	Return the number of non-overlapping occurrences of substr in
				1016	``str[start:end]``. Return ``-1`` if an error occurred.
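"Non-overlapping" means that the scan resumes after each match rather than one position later. A small sketch on C byte strings, with the hypothetical name ``count_in_slice`` and the empty-substring case deliberately left out:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * Counts non-overlapping occurrences of substr in str[start:end]. */
static long count_in_slice(const char *str, const char *substr,
                           size_t start, size_t end)
{
    size_t sub_len = strlen(substr);
    long count = 0;

    assert(start <= end && end <= strlen(str));
    assert(sub_len > 0);  /* empty substrings are not handled in this sketch */
    while (sub_len <= end - start) {
        if (memcmp(str + start, substr, sub_len) == 0) {
            count++;
            start += sub_len;  /* skip the whole match so matches cannot overlap */
        } else {
            start++;
        }
    }
    return count;
}
```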
				1017
				1018
				1019	.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				1020
				1021	Replace at most maxcount occurrences of substr in str with replstr and
				1022	return the resulting Unicode object. maxcount == -1 means replace all
				1023	occurrences.
				1024
				1025
				1026	.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				1027
				1028	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				1029	respectively.
				1030
				1031
Benjamin Peterson	c22ed14	2008-07-01 19:12:34 +0000	[diff] [blame]	1032	.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)
				1033
				1034	Compare a unicode object, uni, with string and return -1, 0, 1 for less
				1035	than, equal, and greater than, respectively.
				1036
				1037
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1038	.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				1039
				1040	Rich compare two unicode strings and return one of the following:
				1041
				1042	* ``NULL`` in case an exception was raised
				1043	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				1044	* :const:`Py_NotImplemented` in case the type combination is unknown
				1045
				1046	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				1047	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				1048	with a :exc:`UnicodeDecodeError`.
				1049
				1050	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				1051	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				1052
				1053
				1054	.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				1055
				1056	Return a new string object from format and args; this is analogous to
				1057	``format % args``. The args argument must be a tuple.
				1058
				1059
				1060	.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				1061
				1062	Check whether element is contained in container and return true or false
				1063	accordingly.
				1064
				1065	element has to coerce to a one element Unicode string. ``-1`` is returned if
				1066	there was an error.
				1067
				1068
				1069	.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
				1070
				1071	Intern the argument ``*string`` in place. The argument must be the address of a
				1072	pointer variable pointing to a Python unicode string object. If there is an
				1073	existing interned string that is the same as ``*string``, it sets ``*string`` to
				1074	it (decrementing the reference count of the old string object and incrementing
				1075	the reference count of the interned string object), otherwise it leaves
				1076	``*string`` alone and interns it (incrementing its reference count).
				1077	(Clarification: even though there is a lot of talk about reference counts, think
				1078	of this function as reference-count-neutral; you own the object after the call
				1079	if and only if you owned it before the call.)
				1080
				1081
				1082	.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
				1083
				1084	A combination of :cfunc:`PyUnicode_FromString` and
				1085	:cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
				1086	that has been interned, or a new ("owned") reference to an earlier interned
				1087	string object with the same value.
				1088