Blame - Doc/c-api/unicode.rst - platform/external/python/cpython3

Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1	.. highlightlang:: c
				2
				3	.. _unicodeobjects:
				4
				5	Unicode Objects and Codecs
				6	--------------------------
				7
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9
				10	Unicode Objects
				11	^^^^^^^^^^^^^^^
				12
				13	These are the basic Unicode object types used for the Unicode implementation in
				14	Python:
				15
				16	.. % --- Unicode Type -------------------------------------------------------
				17
				18
				19	.. ctype:: Py_UNICODE
				20
				21	This type represents the storage type which is used by Python internally as
				22	the basis for holding Unicode ordinals. Python's default builds use a 16-bit type
				23	for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
				24	possible to build a UCS4 version of Python (most recent Linux distributions come
				25	with UCS4 builds of Python). These builds then use a 32-bit type for
				26	:ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
				27	where :ctype:`wchar_t` is available and compatible with the chosen Python
				28	Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
				29	:ctype:`wchar_t` to enhance native platform compatibility. On all other
				30	platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
				31	short` (UCS2) or :ctype:`unsigned long` (UCS4).
				32
				33	Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
				34	this in mind when writing extensions or interfaces.
				35
				36
				37	.. ctype:: PyUnicodeObject
				38
				39	This subtype of :ctype:`PyObject` represents a Python Unicode object.
				40
				41
				42	.. cvar:: PyTypeObject PyUnicode_Type
				43
				44	This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
				45	is exposed to Python code as ``str``.
				46
				47	The following APIs are really C macros and can be used to do fast checks and to
				48	access internal read-only data of Unicode objects:
				49
				50
				51	.. cfunction:: int PyUnicode_Check(PyObject *o)
				52
				53	Return true if the object o is a Unicode object or an instance of a Unicode
				54	subtype.
				55
				56
				57	.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
				58
				59	Return true if the object o is a Unicode object, but not an instance of a
				60	subtype.
				61
				62
				63	.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
				64
				65	Return the size of the object. o has to be a :ctype:`PyUnicodeObject` (not
				66	checked).
				67
				68
				69	.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
				70
				71	Return the size of the object's internal buffer in bytes. o has to be a
				72	:ctype:`PyUnicodeObject` (not checked).
				73
				74
				75	.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
				76
				77	Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. o
				78	has to be a :ctype:`PyUnicodeObject` (not checked).
				79
				80
				81	.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
				82
				83	Return a pointer to the internal buffer of the object. o has to be a
				84	:ctype:`PyUnicodeObject` (not checked).
				85
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	86
Alexandre Vassalotti	6d3dfc3	2009-07-29 19:54:39 +0000	[diff] [blame]	87	.. cfunction:: int PyUnicode_ClearFreeList()
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	88
				89	Clear the free list. Return the total number of freed items.
				90
Alexandre Vassalotti	6d3dfc3	2009-07-29 19:54:39 +0000	[diff] [blame]	91
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	92	Unicode provides many different character properties. The most often needed ones
				93	are available through these macros which are mapped to C functions depending on
				94	the Python configuration.
				95
				96	.. % --- Unicode character properties ---------------------------------------
				97
				98
				99	.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
				100
				101	Return 1 or 0 depending on whether ch is a whitespace character.
				102
				103
				104	.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
				105
				106	Return 1 or 0 depending on whether ch is a lowercase character.
				107
				108
				109	.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
				110
				111	Return 1 or 0 depending on whether ch is an uppercase character.
				112
				113
				114	.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
				115
				116	Return 1 or 0 depending on whether ch is a titlecase character.
				117
				118
				119	.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
				120
				121	Return 1 or 0 depending on whether ch is a linebreak character.
				122
				123
				124	.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
				125
				126	Return 1 or 0 depending on whether ch is a decimal character.
				127
				128
				129	.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
				130
				131	Return 1 or 0 depending on whether ch is a digit character.
				132
				133
				134	.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
				135
				136	Return 1 or 0 depending on whether ch is a numeric character.
				137
				138
				139	.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
				140
				141	Return 1 or 0 depending on whether ch is an alphabetic character.
				142
				143
				144	.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
				145
				146	Return 1 or 0 depending on whether ch is an alphanumeric character.
				147
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	148
				149	.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
				150
				151	Return 1 or 0 depending on whether ch is a printable character.
				152	Nonprintable characters are those characters defined in the Unicode character
				153	database as "Other" or "Separator", excepting the ASCII space (0x20) which is
				154	considered printable. (Note that printable characters in this context are
				155	those which should not be escaped when :func:`repr` is invoked on a string.
				156	It has no bearing on the handling of strings written to :data:`sys.stdout` or
				157	:data:`sys.stderr`.)
				158
				159
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	160	These APIs can be used for fast direct character conversions:
				161
				162
				163	.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
				164
				165	Return the character ch converted to lower case.
				166
				167
				168	.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
				169
				170	Return the character ch converted to upper case.
				171
				172
				173	.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
				174
				175	Return the character ch converted to title case.
				176
				177
				178	.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
				179
				180	Return the character ch converted to a decimal positive integer. Return
				181	``-1`` if this is not possible. This macro does not raise exceptions.
				182
				183
				184	.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
				185
				186	Return the character ch converted to a single digit integer. Return ``-1`` if
				187	this is not possible. This macro does not raise exceptions.
				188
				189
				190	.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
				191
				192	Return the character ch converted to a double. Return ``-1.0`` if this is not
				193	possible. This macro does not raise exceptions.
				194
				195	To create Unicode objects and access their basic sequence properties, use these
				196	APIs:
				197
				198	.. % --- Plain Py_UNICODE ---------------------------------------------------
				199
				200
				201	.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
				202
				203	Create a Unicode Object from the Py_UNICODE buffer u of the given size. u
				204	may be NULL which causes the contents to be undefined. It is the user's
				205	responsibility to fill in the needed data. The buffer is copied into the new
				206	object. If the buffer is not NULL, the return value might be a shared object.
				207	Therefore, modification of the resulting Unicode object is only allowed when u
				208	is NULL.
				209
				210
				211	.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
				212
				213	Create a Unicode Object from the char buffer u. The bytes will be interpreted
				214	as being UTF-8 encoded. u may also be NULL which
				215	causes the contents to be undefined. It is the user's responsibility to fill in
				216	the needed data. The buffer is copied into the new object. If the buffer is not
				217	NULL, the return value might be a shared object. Therefore, modification of
				218	the resulting Unicode object is only allowed when u is NULL.
				219
				220
				221	.. cfunction:: PyObject* PyUnicode_FromString(const char *u)
				222
				223	Create a Unicode object from a UTF-8 encoded null-terminated char buffer
				224	u.
				225
				226
				227	.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
				228
				229	Take a C :cfunc:`printf`\ -style format string and a variable number of
				230	arguments, calculate the size of the resulting Python unicode string and return
				231	a string with the values formatted into it. The variable arguments must be C
				232	types and must correspond exactly to the format characters in the format
				233	string. The following format characters are allowed:
				234
				235	.. % The descriptions for %zd and %zu are wrong, but the truth is complicated
				236	.. % because not all compilers support the %z width modifier -- we fake it
				237	.. % when necessary via interpolating PY_FORMAT_SIZE_T.
				238
				239	+-------------------+---------------------+--------------------------------+
				240	| Format Characters | Type                | Comment                        |
				241	+===================+=====================+================================+
				242	| :attr:`%%`        | n/a                 | The literal % character.       |
				243	+-------------------+---------------------+--------------------------------+
				244	| :attr:`%c`        | int                 | A single character,            |
				245	|                   |                     | represented as a C int.        |
				246	+-------------------+---------------------+--------------------------------+
				247	| :attr:`%d`        | int                 | Exactly equivalent to          |
				248	|                   |                     | ``printf("%d")``.              |
				249	+-------------------+---------------------+--------------------------------+
				250	| :attr:`%u`        | unsigned int        | Exactly equivalent to          |
				251	|                   |                     | ``printf("%u")``.              |
				252	+-------------------+---------------------+--------------------------------+
				253	| :attr:`%ld`       | long                | Exactly equivalent to          |
				254	|                   |                     | ``printf("%ld")``.             |
				255	+-------------------+---------------------+--------------------------------+
				256	| :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
				257	|                   |                     | ``printf("%lu")``.             |
				258	+-------------------+---------------------+--------------------------------+
				259	| :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
				260	|                   |                     | ``printf("%zd")``.             |
				261	+-------------------+---------------------+--------------------------------+
				262	| :attr:`%zu`       | size_t              | Exactly equivalent to          |
				263	|                   |                     | ``printf("%zu")``.             |
				264	+-------------------+---------------------+--------------------------------+
				265	| :attr:`%i`        | int                 | Exactly equivalent to          |
				266	|                   |                     | ``printf("%i")``.              |
				267	+-------------------+---------------------+--------------------------------+
				268	| :attr:`%x`        | int                 | Exactly equivalent to          |
				269	|                   |                     | ``printf("%x")``.              |
				270	+-------------------+---------------------+--------------------------------+
				271	| :attr:`%s`        | char\*              | A null-terminated C character  |
				272	|                   |                     | array.                         |
				273	+-------------------+---------------------+--------------------------------+
				274	| :attr:`%p`        | void\*              | The hex representation of a C  |
				275	|                   |                     | pointer. Mostly equivalent to  |
				276	|                   |                     | ``printf("%p")`` except that   |
				277	|                   |                     | it is guaranteed to start with |
				278	|                   |                     | the literal ``0x`` regardless  |
				279	|                   |                     | of what the platform's         |
				280	|                   |                     | ``printf`` yields.             |
				281	+-------------------+---------------------+--------------------------------+
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	282	| :attr:`%A`        | PyObject\*          | The result of calling          |
				283	|                   |                     | :func:`ascii`.                 |
				284	+-------------------+---------------------+--------------------------------+
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	285	| :attr:`%U`        | PyObject\*          | A unicode object.              |
				286	+-------------------+---------------------+--------------------------------+
				287	| :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
				288	|                   |                     | NULL) and a null-terminated    |
				289	|                   |                     | C character array as a second  |
				290	|                   |                     | parameter (which will be used, |
				291	|                   |                     | if the first parameter is      |
				292	|                   |                     | NULL).                         |
				293	+-------------------+---------------------+--------------------------------+
				294	| :attr:`%S`        | PyObject\*          | The result of calling          |
Benjamin Peterson	e866206	2009-03-08 23:51:13 +0000	[diff] [blame]	295	|                   |                     | :func:`PyObject_Str`.          |
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	296	+-------------------+---------------------+--------------------------------+
				297	| :attr:`%R`        | PyObject\*          | The result of calling          |
				298	|                   |                     | :func:`PyObject_Repr`.         |
				299	+-------------------+---------------------+--------------------------------+
				300
				301	An unrecognized format character causes all the rest of the format string to be
				302	copied as-is to the result string, and any extra arguments discarded.
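
The ``%p`` row promises a leading ``0x`` whatever the platform's ``printf`` produces. As a rough illustration in plain C (not CPython's actual implementation), the same guarantee can be obtained by formatting the pointer value as an integer. The helper name ``format_pointer`` is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write "0x" followed by the pointer value in hex.
   Returns the number of characters snprintf wanted to write. */
int format_pointer(char *buf, size_t bufsize, const void *p)
{
    assert(buf != NULL && bufsize > 0);
    /* Going through uintptr_t sidesteps the implementation-defined
       output of printf("%p"). */
    return snprintf(buf, bufsize, "0x%" PRIxPTR, (uintptr_t)p);
}
```

A caller would pass a buffer of a few dozen characters, which is enough for a 64-bit pointer plus the prefix.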
				303
				304
				305	.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
				306
				307	Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
				308	arguments.
				309
				310
				311	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
				312
				313	Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
				314	buffer, or NULL if unicode is not a Unicode object.
				315
				316
				317	.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
				318
				319	Return the length of the Unicode object.
				320
				321
				322	.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
				323
				324	Coerce an encoded object obj to a Unicode object and return a reference with
				325	incremented refcount.
				326
				327	String and other char buffer compatible objects are decoded according to the
				328	given encoding and using the error handling defined by errors. Both can be
				329	NULL to have the interface use the default values (see the next section for
				330	details).
				331
				332	All other objects, including Unicode objects, cause a :exc:`TypeError` to be
				333	set.
				334
				335	The API returns NULL if there was an error. The caller is responsible for
				336	decref'ing the returned objects.
				337
				338
				339	.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
				340
				341	Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
				342	throughout the interpreter whenever coercion to Unicode is needed.
				343
				344	If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
				345	Python can interface directly to this type using the following functions.
				346	Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
				347	the system's :ctype:`wchar_t`.
				348
				349	.. % --- wchar_t support for platforms which support it ---------------------
				350
				351
				352	.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
				353
				354	Create a Unicode object from the :ctype:`wchar_t` buffer w of the given size.
Martin v. Löwis	790465f	2008-04-05 20:41:37 +0000	[diff] [blame]	355	Passing -1 as the size indicates that the function must itself compute the length,
				356	using wcslen.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	357	Return NULL on failure.
				358
				359
				360	.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
				361
				362	Copy the Unicode object contents into the :ctype:`wchar_t` buffer w. At most
				363	size :ctype:`wchar_t` characters are copied (excluding a possibly trailing
				364	0-termination character). Return the number of :ctype:`wchar_t` characters
				365	copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
				366	string may or may not be 0-terminated. It is the responsibility of the caller
				367	to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
				368	required by the application.
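
Because the copied string may lack a terminator, callers usually reserve one extra slot and terminate the buffer themselves. The sketch below shows that pattern in plain C. ``copy_wide`` is a hypothetical stand-in for a copy routine with the same "at most size characters, no guaranteed terminator" contract, and ``copy_wide_terminated`` is an assumed wrapper name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for an API that copies at most `size` wide
   characters and does NOT promise to write a terminator. */
size_t copy_wide(const wchar_t *src, size_t src_len, wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    wmemcpy(dst, src, n);
    return n;
}

/* Caller-side pattern: keep one slot in reserve and terminate explicitly. */
size_t copy_wide_terminated(const wchar_t *src, size_t src_len,
                            wchar_t *dst, size_t capacity)
{
    assert(capacity > 0);
    size_t n = copy_wide(src, src_len, dst, capacity - 1);
    dst[n] = L'\0';
    return n;
}
```

With a four-element buffer, at most three characters are copied and the fourth slot always holds the terminator.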
				369
				370
				371	.. _builtincodecs:
				372
				373	Built-in Codecs
				374	^^^^^^^^^^^^^^^
				375
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	376	Python provides a set of built-in codecs which are written in C for speed. All of
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	377	these codecs are directly usable via the following functions.
				378
				379	Many of the following APIs take two arguments encoding and errors. These
				380	parameters encoding and errors have the same semantics as the ones of the
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	381	built-in :func:`str` string object constructor.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	382
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	383	Setting encoding to NULL causes the default encoding to be used
				384	which is UTF-8. The file system calls should use
				385	:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
				386	variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
				387	variable should be treated as read-only: On some systems, it will be a
				388	pointer to a static string, on others, it will change at run-time
				389	(such as when the application invokes setlocale).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	390
				391	Error handling is set by errors which may also be set to NULL meaning to use
				392	the default handling defined for the codec. Default error handling for all
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	393	built-in codecs is "strict" (:exc:`ValueError` is raised).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	394
				395	The codecs all use a similar interface. For simplicity, only deviations from the
				396	following generic ones are documented.
				397
				398	These are the generic codec APIs:
				399
				400	.. % --- Generic Codecs -----------------------------------------------------
				401
				402
				403	.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
				404
				405	Create a Unicode object by decoding size bytes of the encoded string s.
				406	encoding and errors have the same meaning as the parameters of the same name
Georg Brandl	22b3431	2009-07-26 14:54:51 +0000	[diff] [blame]	407	in the :func:`str` built-in function. The codec to be used is looked up
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	408	using the Python codec registry. Return NULL if an exception was raised by
				409	the codec.
				410
				411
				412	.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
				413
				414	Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	415	bytes object. encoding and errors have the same meaning as the
				416	parameters of the same name in the Unicode :meth:`encode` method. The codec
				417	to be used is looked up using the Python codec registry. Return NULL if an
				418	exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	419
				420
				421	.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
				422
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	423	Encode a Unicode object and return the result as a Python bytes object.
				424	encoding and errors have the same meaning as the parameters of the same
				425	name in the Unicode :meth:`encode` method. The codec to be used is looked up
				426	using the Python codec registry. Return NULL if an exception was raised by
				427	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	428
				429	These are the UTF-8 codec APIs:
				430
				431	.. % --- UTF-8 Codecs -------------------------------------------------------
				432
				433
				434	.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
				435
				436	Create a Unicode object by decoding size bytes of the UTF-8 encoded string
				437	s. Return NULL if an exception was raised by the codec.
				438
				439
				440	.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
				441
				442	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
				443	consumed is not NULL, trailing incomplete UTF-8 byte sequences will not be
				444	treated as an error. Those bytes will not be decoded and the number of bytes
				445	that have been decoded will be stored in consumed.
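
To make the idea of "consumed" concrete, the following plain C sketch computes how many leading bytes of a buffer form complete UTF-8 sequences, leaving a truncated trailing sequence for the next chunk. It illustrates the concept only and is not the decoder CPython uses; ``utf8_complete_prefix`` is a hypothetical name, and invalid (as opposed to incomplete) input is deliberately left for a real decoder to report.

```c
#include <assert.h>
#include <stddef.h>

/* Return the number of leading bytes that do not end in a truncated
   UTF-8 sequence. Bytes past that point should be kept and retried
   once more input has arrived. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    assert(s != NULL || size == 0);
    size_t back = 0;
    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (back < size && back < 3 && (s[size - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == size)
        return size;              /* empty, or continuation bytes only */

    unsigned char lead = s[size - 1 - back];
    size_t need;
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return size;              /* not a lead byte: invalid, not incomplete */

    if (back + 1 < need)
        return size - 1 - back;   /* final sequence is cut short */
    return size;
}
```

A streaming reader would decode the returned prefix and carry the remaining bytes over to the front of the next chunk.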
				446
				447
				448	.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				449
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	450	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
				451	return a Python bytes object. Return NULL if an exception was raised by
				452	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	453
				454
				455	.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
				456
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	457	Encode a Unicode object using UTF-8 and return the result as a Python bytes
				458	object. Error handling is "strict". Return NULL if an exception was
				459	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	460
				461	These are the UTF-32 codec APIs:
				462
				463	.. % --- UTF-32 Codecs ------------------------------------------------------ */
				464
				465
				466	.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				467
				468	Decode size bytes from a UTF-32 encoded buffer string and return the
				469	corresponding Unicode object. errors (if non-NULL) defines the error
				470	handling. It defaults to "strict".
				471
				472	If byteorder is non-NULL, the decoder starts decoding using the given byte
				473	order::
				474
				475	*byteorder == -1: little endian
				476	*byteorder == 0: native order
				477	*byteorder == 1: big endian
				478
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	479	If ``*byteorder`` is zero, and the first four bytes of the input data are a
				480	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				481	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				482	``1``, any byte order mark is copied to the output.
				483
				484	After completion, ``*byteorder`` is set to the current byte order at the end
				485	of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	486
				487	In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
				488
				489	If byteorder is NULL, the codec starts in native order mode.
				490
				491	Return NULL if an exception was raised by the codec.
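
The remark about narrow builds means that a code point above U+FFFF occupies two 16-bit storage units. The arithmetic is the standard UTF-16 surrogate split, sketched here in plain C. The function name ``to_utf16_units`` is hypothetical, and this is an illustration rather than the code CPython runs.

```c
#include <assert.h>
#include <stdint.h>

/* Split one code point into 16-bit storage units.
   Returns the number of units written to out (1 or 2). */
int to_utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                    /* fits in one unit */
        return 1;
    }
    cp -= 0x10000;                                /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00, while U+20AC stays a single unit.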
				492
				493
				494	.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				495
				496	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
				497	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
				498	trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
				499	by four) as an error. Those bytes will not be decoded and the number of bytes
				500	that have been decoded will be stored in consumed.
				501
				502
				503	.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				504
				505	Return a Python bytes object holding the UTF-32 encoded value of the Unicode
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	506	data in s. Output is written according to the following byte order::
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	507
				508	byteorder == -1: little endian
				509	byteorder == 0: native byte order (writes a BOM mark)
				510	byteorder == 1: big endian
				511
				512	If byteorder is ``0``, the output string will always start with the Unicode BOM
				513	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				514
				515	If Py_UNICODE_WIDE is not defined, surrogate pairs will be output
				516	as a single codepoint.
				517
				518	Return NULL if an exception was raised by the codec.
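
The effect of the byteorder argument on the output bytes can be shown with a small stand-alone encoder. This is a sketch under the byte order rules listed above, not CPython's implementation; ``host_is_little_endian``, ``put_u32`` and ``utf32_encode`` are hypothetical names, and the input is assumed to already be an array of full code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Write one 32-bit value in the requested byte order. */
void put_u32(unsigned char *out, uint32_t v, int little)
{
    for (int i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)(v >> shift);
    }
}

/* Encode n code points; out must have room for 4 * (n + 1) bytes.
   byteorder: -1 little endian, 1 big endian, 0 native order with BOM.
   Returns the number of bytes written. */
size_t utf32_encode(const uint32_t *cps, size_t n, int byteorder,
                    unsigned char *out)
{
    assert(byteorder >= -1 && byteorder <= 1);
    int little = byteorder == 0 ? host_is_little_endian() : byteorder == -1;
    size_t pos = 0;
    if (byteorder == 0) {            /* only native mode prepends U+FEFF */
        put_u32(out, 0xFEFF, little);
        pos = 4;
    }
    for (size_t i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, cps[i], little);
    return pos;
}
```

Encoding the single code point U+0041 yields four bytes in the two explicit modes and eight bytes, BOM first, in native mode.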
				519
				520
				521	.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
				522
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	523	Return a Python byte string using the UTF-32 encoding in native byte
				524	order. The string always starts with a BOM mark. Error handling is "strict".
				525	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	526
				527
				528	These are the UTF-16 codec APIs:
				529
				530	.. % --- UTF-16 Codecs ------------------------------------------------------ */
				531
				532
				533	.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				534
				535	Decode size bytes from a UTF-16 encoded buffer string and return the
				536	corresponding Unicode object. errors (if non-NULL) defines the error
				537	handling. It defaults to "strict".
				538
				539	If byteorder is non-NULL, the decoder starts decoding using the given byte
				540	order::
				541
				542	*byteorder == -1: little endian
				543	*byteorder == 0: native order
				544	*byteorder == 1: big endian
				545
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	546	If ``*byteorder`` is zero, and the first two bytes of the input data are a
				547	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				548	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				549	``1``, any byte order mark is copied to the output (where it will result in
				550	either a ``\ufeff`` or a ``\ufffe`` character).
				551
				552	After completion, ``*byteorder`` is set to the current byte order at the end
				553	of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	554
				555	If byteorder is NULL, the codec starts in native order mode.
				556
				557	Return NULL if an exception was raised by the codec.
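
The byte order handling described for decoding can be pictured as a small BOM check at the start of the buffer. The sketch below is plain C written for illustration, assuming the -1/0/1 convention given above; ``utf16_skip_bom`` is a hypothetical name and the real decoder does considerably more.

```c
#include <assert.h>
#include <stddef.h>

/* Inspect the start of a UTF-16 buffer.
   If *byteorder is 0 and the buffer begins with a BOM, store the detected
   order (-1 little endian, 1 big endian) in *byteorder and return 2, the
   number of bytes to skip. Otherwise leave *byteorder alone and return 0. */
size_t utf16_skip_bom(const unsigned char *s, size_t size, int *byteorder)
{
    assert(byteorder != NULL);
    if (*byteorder != 0 || size < 2)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE) {   /* U+FEFF stored little endian */
        *byteorder = -1;
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {   /* U+FEFF stored big endian */
        *byteorder = 1;
        return 2;
    }
    return 0;
}
```

When the caller has already fixed the byte order (``-1`` or ``1``), nothing is skipped, which matches the statement that the BOM is then copied to the output.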
				558
				559
				560	.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				561
				562	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
				563	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
				564	trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
				565	split surrogate pair) as an error. Those bytes will not be decoded and the
				566	number of bytes that have been decoded will be stored in consumed.
				567
				568
				569	.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				570
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	571	Return a Python bytes object holding the UTF-16 encoded value of the Unicode
Benjamin Peterson	4ac9ce4	2009-10-04 14:49:41 +0000	[diff] [blame]	572	data in s. Output is written according to the following byte order::
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	573
				574	byteorder == -1: little endian
				575	byteorder == 0: native byte order (writes a BOM mark)
				576	byteorder == 1: big endian
				577
				578	If byteorder is ``0``, the output string will always start with the Unicode BOM
				579	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				580
				581	If Py_UNICODE_WIDE is defined, a single :ctype:`Py_UNICODE` value may get
				582	represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
				583	value is interpreted as a UCS-2 character.
				584
				585	Return NULL if an exception was raised by the codec.
				586
				587
				588	.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
				589
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	590	Return a Python byte string using the UTF-16 encoding in native byte
				591	order. The string always starts with a BOM mark. Error handling is "strict".
				592	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	593
				594	These are the "Unicode Escape" codec APIs:
				595
				596	.. % --- Unicode-Escape Codecs ----------------------------------------------
				597
				598
				599	.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				600
				601	Create a Unicode object by decoding size bytes of the Unicode-Escape encoded
				602	string s. Return NULL if an exception was raised by the codec.
				603
				604
				605	.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
				606
				607	Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
				608	return a Python bytes object. Return NULL if an exception was raised by the
				609	codec.
				610
				611
				612	.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
				613
				614	Encode a Unicode object using Unicode-Escape and return the result as a Python
				615	bytes object. Error handling is "strict". Return NULL if an exception was
				616	raised by the codec.
				617
				618	These are the "Raw Unicode Escape" codec APIs:
				619
				620	.. % --- Raw-Unicode-Escape Codecs ------------------------------------------
				621
				622
				623	.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				624
				625	Create a Unicode object by decoding size bytes of the Raw-Unicode-Escape
				626	encoded string s. Return NULL if an exception was raised by the codec.
				627
				628
				629	.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				630
				631	Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
				632	and return a Python bytes object. Return NULL if an exception was raised by
				633	the codec.
				634
				635
				636	.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
				637
				638	Encode a Unicode object using Raw-Unicode-Escape and return the result as a
				639	Python bytes object. Error handling is "strict". Return NULL if an exception
				640	was raised by the codec.
				641
				642	These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
				643	ordinals and only these are accepted by the codecs during encoding.
				644
				645	.. % --- Latin-1 Codecs -----------------------------------------------------
				646
				647
				648	.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
				649
				650	Create a Unicode object by decoding size bytes of the Latin-1 encoded string
				651	s. Return NULL if an exception was raised by the codec.
				652
				653
				654	.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				655
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	656	Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
				657	return a Python bytes object. Return NULL if an exception was raised by
				658	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	659
				660
				661	.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
				662
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	663	Encode a Unicode object using Latin-1 and return the result as a Python bytes
				664	object. Error handling is "strict". Return NULL if an exception was
				665	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	666
				667	These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
				668	codes generate errors.
				669
				670	.. % --- ASCII Codecs -------------------------------------------------------
				671
				672
				673	.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
				674
				675	Create a Unicode object by decoding size bytes of the ASCII encoded string
				676	s. Return NULL if an exception was raised by the codec.
				677
				678
				679	.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				680
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	681	Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
				682	return a Python bytes object. Return NULL if an exception was raised by
				683	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	684
				685
				686	.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
				687
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	688	Encode a Unicode object using ASCII and return the result as a Python bytes
				689	object. Error handling is "strict". Return NULL if an exception was
				690	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	691
				692	These are the mapping codec APIs:
				693
				694	.. % --- Character Map Codecs -----------------------------------------------
				695
				696	This codec is special in that it can be used to implement many different codecs
				697	(and this is in fact what was done to obtain most of the standard codecs
				698	included in the :mod:`encodings` package). The codec uses mapping to encode and
				699	decode characters.
				700
				701	Decoding mappings must map single string characters to single Unicode
				702	characters, integers (which are then interpreted as Unicode ordinals) or None
				703	(meaning "undefined mapping" and causing an error).
				704
				705	Encoding mappings must map single Unicode characters to single string
				706	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				707	(meaning "undefined mapping" and causing an error).
				708
				709	The mapping objects provided need only support the :meth:`__getitem__` mapping
				710	interface.
				711
				712	If a character lookup fails with a LookupError, the character is copied as-is,
				713	meaning that its ordinal value will be interpreted as a Unicode or Latin-1
				714	ordinal, respectively. Because of this, mappings only need to contain those
				715	entries which map characters to different code points.
				716
				717
				718	.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				719
				720	Create a Unicode object by decoding size bytes of the encoded string s using
				721	the given mapping object. Return NULL if an exception was raised by the
				722	codec. If mapping is NULL, Latin-1 decoding will be done. Otherwise it can be a
				723	dictionary keyed by byte values or a Unicode string, which is treated as a lookup
				724	table. Byte values greater than or equal to the length of the string and U+FFFE
				725	"characters" are treated as "undefined mapping".
				726
				727
				728	.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				729
				730	Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
				731	mapping object and return a Python bytes object. Return NULL if an
				732	exception was raised by the codec.
				733
				734
				735	.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				736
				737	Encode a Unicode object using the given mapping object and return the result
				738	as a Python bytes object. Error handling is "strict". Return NULL if an
				739	exception was raised by the codec.
				740
				741	The following codec API is special in that it maps Unicode to Unicode.
				742
				743
				744	.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				745
				746	Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
				747	character mapping table to it and return the resulting Unicode object. Return
				748	NULL when an exception was raised by the codec.
				749
				750	The mapping table must map Unicode ordinal integers to Unicode ordinal
				751	integers or None (causing deletion of the character).
				752
				753	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				754	and sequences work well. Unmapped character ordinals (ones which cause a
				755	:exc:`LookupError`) are left untouched and are copied as-is.
				756
Jeroen Ruigrok van der Werven	47a7d70	2009-04-27 05:43:17 +0000	[diff] [blame]	757
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	758	These are the MBCS codec APIs. They are currently only available on Windows and
				759	use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
				760	DBCS) is a class of encodings, not just one. The target encoding is defined by
				761	the user settings on the machine running the codec.
				762
				763	.. % --- MBCS codecs for Windows --------------------------------------------
				764
				765
				766	.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				767
				768	Create a Unicode object by decoding size bytes of the MBCS encoded string s.
				769	Return NULL if an exception was raised by the codec.
				770
				771
				772	.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				773
				774	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
				775	consumed is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
				776	a trailing lead byte, and the number of bytes that have been decoded will be
				777	stored in consumed.
				778
				779
				780	.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				781
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	782	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
				783	a Python bytes object. Return NULL if an exception was raised by the
				784	codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	785
				786
				787	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				788
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	789	Encode a Unicode object using MBCS and return the result as a Python bytes
				790	object. Error handling is "strict". Return NULL if an exception was
				791	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	792
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	793	For decoding file names and other environment strings, :cdata:`Py_FileSystemDefaultEncoding`
				794	should be used as the encoding, and ``"surrogateescape"`` should be used as the error
				795	handler. For encoding file names during argument parsing, the ``O&`` converter should
				796	be used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
				797
				798	.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
				799
				800	Convert obj into result, using the file system encoding and the ``surrogateescape``
				801	error handler. result must be the address of a ``PyObject*`` variable; it receives a
				802	bytes or bytearray object, which must be released when it is no longer used.
				803
				804	.. versionadded:: 3.1
				805
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	806	.. % --- Methods & Slots ----------------------------------------------------
				807
				808
				809	.. _unicodemethodsandslots:
				810
				811	Methods and Slot Functions
				812	^^^^^^^^^^^^^^^^^^^^^^^^^^
				813
				814	The following APIs are capable of handling Unicode objects and strings on input
				815	(we refer to them as strings in the descriptions) and return Unicode objects or
				816	integers as appropriate.
				817
				818	They all return NULL or ``-1`` if an exception occurs.
				819
				820
				821	.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
				822
				823	Concatenate two strings, giving a new Unicode string.
				824
				825
				826	.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				827
				828	Split a string giving a list of Unicode strings. If sep is NULL, splitting
				829	will be done at all whitespace substrings. Otherwise, splits occur at the given
				830	separator. At most maxsplit splits will be done. If negative, no limit is
				831	set. Separators are not included in the resulting list.
				832
				833
				834	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				835
				836	Split a Unicode string at line breaks, returning a list of Unicode strings.
				837	CRLF is considered to be one line break. If keepend is 0, the line break
				838	characters are not included in the resulting strings.
				839
				840
				841	.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				842
				843	Translate a string by applying a character mapping table to it and return the
				844	resulting Unicode object.
				845
				846	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				847	or None (causing deletion of the character).
				848
				849	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				850	and sequences work well. Unmapped character ordinals (ones which cause a
				851	:exc:`LookupError`) are left untouched and are copied as-is.
				852
				853	errors has the usual meaning for codecs. It may be NULL which indicates to
				854	use the default error handling.
				855
				856
				857	.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				858
				859	Join a sequence of strings using the given separator and return the resulting
				860	Unicode string.
				861
				862
				863	.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				864
				865	Return 1 if substr matches str[start:end] at the given tail end
				866	(direction == -1 means to do a prefix match, direction == 1 a suffix match),
				867	0 otherwise. Return ``-1`` if an error occurred.
				868
				869
				870	.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				871
				872	Return the first position of substr in str[start:end] using the given
				873	direction (direction == 1 means to do a forward search, direction == -1 a
				874	backward search). The return value is the index of the first match; a value of
				875	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				876	occurred and an exception has been set.
				877
				878
				879	.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				880
				881	Return the number of non-overlapping occurrences of substr in
				882	``str[start:end]``. Return ``-1`` if an error occurred.
				883
				884
				885	.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				886
				887	Replace at most maxcount occurrences of substr in str with replstr and
				888	return the resulting Unicode object. maxcount == -1 means replace all
				889	occurrences.
				890
				891
				892	.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				893
				894	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				895	respectively.
				896
				897
Benjamin Peterson	c22ed14	2008-07-01 19:12:34 +0000	[diff] [blame]	898	.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)
				899
				900	Compare a unicode object, uni, with string and return -1, 0, 1 for less
				901	than, equal, and greater than, respectively.
				902
				903
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	904	.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				905
				906	Rich compare two unicode strings and return one of the following:
				907
				908	* ``NULL`` in case an exception was raised
				909	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				910	* :const:`Py_NotImplemented` in case the type combination is unknown
				911
				912	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				913	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				914	with a :exc:`UnicodeDecodeError`.
				915
				916	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				917	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				918
				919
				920	.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				921
				922	Return a new string object from format and args; this is analogous to
				923	``format % args``. The args argument must be a tuple.
				924
				925
				926	.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				927
				928	Check whether element is contained in container and return true or false
				929	accordingly.
				930
				931	element has to coerce to a one element Unicode string. ``-1`` is returned if
				932	there was an error.
				933
				934
				935	.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
				936
				937	Intern the argument ``*string`` in place. The argument must be the address of a
				938	pointer variable pointing to a Python unicode string object. If there is an
				939	existing interned string that is the same as ``*string``, it sets ``*string`` to
				940	it (decrementing the reference count of the old string object and incrementing
				941	the reference count of the interned string object), otherwise it leaves
				942	``*string`` alone and interns it (incrementing its reference count).
				943	(Clarification: even though there is a lot of talk about reference counts, think
				944	of this function as reference-count-neutral; you own the object after the call
				945	if and only if you owned it before the call.)
				946
				947
				948	.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
				949
				950	A combination of :cfunc:`PyUnicode_FromString` and
				951	:cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
				952	that has been interned, or a new ("owned") reference to an earlier interned
				953	string object with the same value.
				954