Blame - Doc/c-api/unicode.rst - platform/external/python/cpython2

.. highlightlang:: c

.. _unicodeobjects:

Unicode Objects and Codecs
--------------------------

.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>

Unicode Objects
^^^^^^^^^^^^^^^

These are the basic Unicode object types used for the Unicode implementation in
Python:

.. % --- Unicode Type -------------------------------------------------------


.. ctype:: Py_UNICODE

   This type represents the storage type which is used by Python internally as
   the basis for holding Unicode ordinals. Python's default builds use a 16-bit
   type for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It
   is also possible to build a UCS4 version of Python (most recent Linux
   distributions come with UCS4 builds of Python). These builds then use a
   32-bit type for :ctype:`Py_UNICODE` and store Unicode data internally as
   UCS4. On platforms where :ctype:`wchar_t` is available and compatible with
   the chosen Python Unicode build variant, :ctype:`Py_UNICODE` is a typedef
   alias for :ctype:`wchar_t` to enhance native platform compatibility. On all
   other platforms, :ctype:`Py_UNICODE` is a typedef alias for either
   :ctype:`unsigned short` (UCS2) or :ctype:`unsigned long` (UCS4).

   Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
   this in mind when writing extensions or interfaces.


.. ctype:: PyUnicodeObject

   This subtype of :ctype:`PyObject` represents a Python Unicode object.


.. cvar:: PyTypeObject PyUnicode_Type

   This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
   is exposed to Python code as ``unicode``.

The following APIs are really C macros and can be used to do fast checks and to
access internal read-only data of Unicode objects:


.. cfunction:: int PyUnicode_Check(PyObject *o)

   Return true if the object *o* is a Unicode object or an instance of a Unicode
   subtype.


.. cfunction:: int PyUnicode_CheckExact(PyObject *o)

   Return true if the object *o* is a Unicode object, but not an instance of a
   subtype.


.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)

   Return the size of the object. *o* has to be a :ctype:`PyUnicodeObject` (not
   checked).


.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)

   Return the size of the object's internal buffer in bytes. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)

   Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. *o*
   has to be a :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)

   Return a pointer to the internal buffer of the object. *o* has to be a
   :ctype:`PyUnicodeObject` (not checked).


.. cfunction:: int PyUnicode_ClearFreeList(void)

   Clear the free list. Return the total number of freed items.

Unicode provides many different character properties. The most often needed ones
are available through these macros which are mapped to C functions depending on
the Python configuration.

.. % --- Unicode character properties ---------------------------------------


.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a whitespace character.


.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a lowercase character.


.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an uppercase character.


.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a titlecase character.


.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a linebreak character.


.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a decimal character.


.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a digit character.


.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is a numeric character.


.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphabetic character.


.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)

   Return 1 or 0 depending on whether *ch* is an alphanumeric character.

These APIs can be used for fast direct character conversions:


.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)

   Return the character *ch* converted to lower case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)

   Return the character *ch* converted to upper case.


.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)

   Return the character *ch* converted to title case.


.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)

   Return the character *ch* converted to a decimal positive integer. Return
   ``-1`` if this is not possible. This macro does not raise exceptions.


.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)

   Return the character *ch* converted to a single digit integer. Return ``-1`` if
   this is not possible. This macro does not raise exceptions.


.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)

   Return the character *ch* converted to a double. Return ``-1.0`` if this is not
   possible. This macro does not raise exceptions.

To create Unicode objects and access their basic sequence properties, use these
APIs:

.. % --- Plain Py_UNICODE ---------------------------------------------------


.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)

   Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u*
   may be *NULL* which causes the contents to be undefined. It is the user's
   responsibility to fill in the needed data. The buffer is copied into the new
   object. If the buffer is not *NULL*, the return value might be a shared object.
   Therefore, modification of the resulting Unicode object is only allowed when *u*
   is *NULL*.


.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)

   Create a Unicode Object from the char buffer *u*. The bytes will be interpreted
   as being UTF-8 encoded. *u* may also be *NULL* which
   causes the contents to be undefined. It is the user's responsibility to fill in
   the needed data. The buffer is copied into the new object. If the buffer is not
   *NULL*, the return value might be a shared object. Therefore, modification of
   the resulting Unicode object is only allowed when *u* is *NULL*.


.. cfunction:: PyObject* PyUnicode_FromString(const char *u)

   Create a Unicode object from a UTF-8 encoded null-terminated char buffer
   *u*.


.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)

   Take a C :cfunc:`printf`\ -style format string and a variable number of
   arguments, calculate the size of the resulting Python unicode string and return
   a string with the values formatted into it. The variable arguments must be C
   types and must correspond exactly to the format characters in the *format*
   string. The following format characters are allowed:

   .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
   .. % because not all compilers support the %z width modifier -- we fake it
   .. % when necessary via interpolating PY_FORMAT_SIZE_T.

   +-------------------+---------------------+--------------------------------+
   | Format Characters | Type                | Comment                        |
   +===================+=====================+================================+
   | :attr:`%%`        | *n/a*               | The literal % character.       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%c`        | int                 | A single character,            |
   |                   |                     | represented as a C int.        |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%d`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%d")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
   |                   |                     | ``printf("%u")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%ld`       | long                | Exactly equivalent to          |
   |                   |                     | ``printf("%ld")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
   |                   |                     | ``printf("%lu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
   |                   |                     | ``printf("%zd")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%zu`       | size_t              | Exactly equivalent to          |
   |                   |                     | ``printf("%zu")``.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%i`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%i")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%x`        | int                 | Exactly equivalent to          |
   |                   |                     | ``printf("%x")``.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%s`        | char\*              | A null-terminated C character  |
   |                   |                     | array.                         |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%p`        | void\*              | The hex representation of a C  |
   |                   |                     | pointer. Mostly equivalent to  |
   |                   |                     | ``printf("%p")`` except that   |
   |                   |                     | it is guaranteed to start with |
   |                   |                     | the literal ``0x`` regardless  |
   |                   |                     | of what the platform's         |
   |                   |                     | ``printf`` yields.             |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%U`        | PyObject\*          | A unicode object.              |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
   |                   |                     | *NULL*) and a null-terminated  |
   |                   |                     | C character array as a second  |
   |                   |                     | parameter (which will be used, |
   |                   |                     | if the first parameter is      |
   |                   |                     | *NULL*).                       |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%S`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Unicode`.      |
   +-------------------+---------------------+--------------------------------+
   | :attr:`%R`        | PyObject\*          | The result of calling          |
   |                   |                     | :func:`PyObject_Repr`.         |
   +-------------------+---------------------+--------------------------------+

   An unrecognized format character causes all the rest of the format string to be
   copied as-is to the result string, and any extra arguments discarded.


.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)

   Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
   arguments.


.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)

   Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
   buffer, *NULL* if *unicode* is not a Unicode object.


.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)

   Return the length of the Unicode object.


.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)

   Coerce an encoded object *obj* to a Unicode object and return a reference with
   incremented refcount.

   String and other char buffer compatible objects are decoded according to the
   given *encoding* and using the error handling defined by *errors*. Both can be
   *NULL* to have the interface use the default values (see the next section for
   details).

   All other objects, including Unicode objects, cause a :exc:`TypeError` to be
   set.

   The API returns *NULL* if there was an error. The caller is responsible for
   decref'ing the returned objects.


.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)

   Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
   throughout the interpreter whenever coercion to Unicode is needed.

If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
Python can interface directly to this type using the following functions.
Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
the system's :ctype:`wchar_t`.

.. % --- wchar_t support for platforms which support it ---------------------


.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)

   Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
   Return *NULL* on failure.


.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)

   Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*. At most
   *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
   0-termination character). Return the number of :ctype:`wchar_t` characters
   copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
   string may or may not be 0-terminated. It is the responsibility of the caller
   to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
   required by the application.
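
Because the copy may fill the buffer without writing a terminator, a caller that
needs a C string can reserve one extra slot and terminate it explicitly. The
sketch below illustrates that pattern in plain C. ``copy_wide`` is a hypothetical
stand-in that only mimics the contract described above (copy at most *size*
characters, return the count); it is not the CPython function.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in with the contract described above: copies at
 * most `size` characters and never writes a terminator itself. */
static ptrdiff_t copy_wide(const wchar_t *src, size_t src_len,
                           wchar_t *dst, size_t size)
{
    size_t n = src_len < size ? src_len : size;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    return (ptrdiff_t)n;
}

/* Fill `buf` (capacity `cap` >= 1) and guarantee 0-termination by
 * offering only cap - 1 slots to the copy routine. */
static ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t src_len,
                                      wchar_t *buf, size_t cap)
{
    ptrdiff_t n = copy_wide(src, src_len, buf, cap - 1);
    if (n >= 0)
        buf[n] = L'\0';
    return n;
}
```

With a four-element buffer, at most three characters are copied and the fourth
slot always holds the terminator.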


.. _builtincodecs:

Built-in Codecs
^^^^^^^^^^^^^^^

Python provides a set of builtin codecs which are written in C for speed. All of
these codecs are directly usable via the following functions.

Many of the following APIs take two arguments, *encoding* and *errors*. These
parameters have the same semantics as the ones of the builtin :func:`unicode`
Unicode object constructor.

Setting *encoding* to *NULL* causes the default encoding to be used, which is
ASCII. The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
as the encoding for file names. This variable should be treated as read-only: on
some systems, it will be a pointer to a static string; on others, it will change
at run-time (such as when the application invokes setlocale).

Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
the default handling defined for the codec. Default error handling for all
builtin codecs is "strict" (:exc:`ValueError` is raised).

The codecs all use a similar interface. For simplicity, only deviations from the
following generic ones are documented.

These are the generic codec APIs:

.. % --- Generic Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s*.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the :func:`unicode` builtin function. The codec to be used is looked up
   using the Python codec registry. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
   string object. *encoding* and *errors* have the same meaning as the parameters
   of the same name in the Unicode :meth:`encode` method. The codec to be used is
   looked up using the Python codec registry. Return *NULL* if an exception was
   raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)

   Encode a Unicode object and return the result as Python string object.
   *encoding* and *errors* have the same meaning as the parameters of the same name
   in the Unicode :meth:`encode` method. The codec to be used is looked up using
   the Python codec registry. Return *NULL* if an exception was raised by the
   codec.

These are the UTF-8 codec APIs:

.. % --- UTF-8 Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
   *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
   treated as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.
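
To make the idea of a "trailing incomplete sequence" concrete, the following
standalone sketch computes how many leading bytes of a buffer form complete UTF-8
sequences, which is the kind of value a stateful decoder would report through
*consumed*. The function name ``utf8_complete_prefix`` is an assumption made for
this illustration, and the sketch only inspects lead bytes; it is not the
interpreter's decoder and does not validate continuation bytes.

```c
#include <stddef.h>

/* Return how many leading bytes of s[0..size) form complete UTF-8
 * sequences. Only the length implied by each lead byte is checked;
 * continuation bytes are not validated in this sketch. */
static size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;
    while (i < size) {
        unsigned char c = s[i];
        size_t need;
        if (c < 0x80)
            need = 1;                 /* ASCII */
        else if ((c & 0xE0) == 0xC0)
            need = 2;
        else if ((c & 0xF0) == 0xE0)
            need = 3;
        else if ((c & 0xF8) == 0xF0)
            need = 4;
        else
            need = 1;                 /* invalid lead byte: left to error handling */
        if (size - i < need)
            break;                    /* incomplete tail: stop before it */
        i += need;
    }
    return i;
}
```

A caller that reads data in chunks would keep the undecoded tail and prepend it
to the next chunk.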


.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
   Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)

   Encode a Unicode object using UTF-8 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the UTF-32 codec APIs:

.. % --- UTF-32 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-32 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first four bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   In a narrow build code points outside the BMP will be decoded as surrogate pairs.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.
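
The surrogate pair representation mentioned for narrow builds follows the
standard UTF-16 arithmetic. The sketch below is a standalone illustration of that
arithmetic, not code taken from the interpreter; the function name
``to_utf16_units`` is hypothetical and the input is assumed to be a valid code
point no greater than U+10FFFF.

```c
#include <stdint.h>

/* Store code point cp as one or two 16-bit units. Code points above
 * U+FFFF are split into a surrogate pair.
 * Returns the number of units written to out (1 or 2). */
static int to_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the two units 0xD83D and 0xDE00, so a decoded
string can contain more 16-bit units than the input had code points.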


.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
   trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
   by four) as an error. Those bytes will not be decoded and the number of bytes
   that have been decoded will be stored in *consumed*.


.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-32 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
   as a single code point.

   Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)

   Return a Python string using the UTF-32 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.


These are the UTF-16 codec APIs:

.. % --- UTF-16 Codecs ------------------------------------------------------ */


.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)

   Decode *size* bytes from a UTF-16 encoded buffer string and return the
   corresponding Unicode object. *errors* (if non-*NULL*) defines the error
   handling. It defaults to "strict".

   If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
   order::

      *byteorder == -1: little endian
      *byteorder == 0:  native order
      *byteorder == 1:  big endian

   and then switches if the first two bytes of the input data are a byte order mark
   (BOM) and the specified byte order is native order. This BOM is not copied into
   the resulting Unicode string. After completion, *\*byteorder* is set to the
   current byte order at the end of input data.

   If *byteorder* is *NULL*, the codec starts in native order mode.

   Return *NULL* if an exception was raised by the codec.
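
The BOM handling described above can be pictured with a small standalone helper.
This is only a sketch of the documented rule (a BOM is honoured only when the
caller asked for native order); the name ``utf16_check_bom`` and its return
convention are assumptions made for the illustration.

```c
#include <stddef.h>

/* Inspect the first two bytes of a UTF-16 buffer. If *byteorder is 0
 * (native order) and a BOM is present, set *byteorder to -1 (little
 * endian) or 1 (big endian) and return 2, the number of BOM bytes to
 * skip. Otherwise leave *byteorder unchanged and return 0. */
static size_t utf16_check_bom(const unsigned char *s, size_t size,
                              int *byteorder)
{
    if (*byteorder != 0 || size < 2)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE) {
        *byteorder = -1;
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {
        *byteorder = 1;
        return 2;
    }
    return 0;
}
```

When the caller passes an explicit byte order of -1 or 1, the first two bytes
are treated as ordinary data and the byte order is left as given.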


.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
   trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
   split surrogate pair) as an error. Those bytes will not be decoded and the
   number of bytes that have been decoded will be stored in *consumed*.
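
The two kinds of incomplete tail named above, an odd trailing byte and a high
surrogate whose partner has not arrived yet, can be detected as in the following
standalone sketch. The helper name ``utf16_complete_prefix`` and the
``big_endian`` flag are assumptions for this illustration; the real decoder also
handles the BOM and error callbacks, which are left out here.

```c
#include <stddef.h>

/* Return how many leading bytes of a UTF-16 buffer can be decoded
 * without running into an incomplete trailing sequence.
 * big_endian selects the byte order of each 16-bit unit. */
static size_t utf16_complete_prefix(const unsigned char *s, size_t size,
                                    int big_endian)
{
    size_t n = size & ~(size_t)1;     /* drop a trailing odd byte */
    if (n >= 2) {
        /* Read the last complete 16-bit unit. */
        unsigned hi = big_endian ? s[n - 2] : s[n - 1];
        unsigned lo = big_endian ? s[n - 1] : s[n - 2];
        unsigned unit = (hi << 8) | lo;
        if (unit >= 0xD800 && unit <= 0xDBFF)
            n -= 2;                   /* high surrogate without its partner */
    }
    return n;
}
```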


.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)

   Return a Python string object holding the UTF-16 encoded value of the Unicode
   data in *s*. Output is written according to the following byte order::

      byteorder == -1: little endian
      byteorder == 0:  native byte order (writes a BOM mark)
      byteorder == 1:  big endian

   If *byteorder* is ``0``, the output string will always start with the Unicode BOM
   mark (U+FEFF). In the other two modes, no BOM mark is prepended.

   If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
   represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
   value is interpreted as a UCS-2 character.

   Return *NULL* if an exception was raised by the codec.
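
The effect of the three *byteorder* values on the output bytes can be shown with
a standalone sketch that serializes 16-bit units. It is written for illustration
only: the function name ``utf16_encode`` is hypothetical, the input is assumed to
already consist of 16-bit units (as in a narrow build), and no error handling is
modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Write 16-bit units to out in the requested byte order and return the
 * number of bytes written. byteorder: -1 little endian, 1 big endian,
 * 0 native order with a leading BOM.
 * out must have room for 2 * (n + 1) bytes. */
static size_t utf16_encode(const uint16_t *units, size_t n, int byteorder,
                           unsigned char *out)
{
    const uint16_t probe = 1;
    int native_little = *(const unsigned char *)&probe == 1;
    int little = byteorder == -1 || (byteorder == 0 && native_little);
    size_t pos = 0;

    if (byteorder == 0) {             /* BOM only in native mode */
        out[pos++] = little ? 0xFF : 0xFE;
        out[pos++] = little ? 0xFE : 0xFF;
    }
    for (size_t i = 0; i < n; i++) {
        unsigned char lo = (unsigned char)(units[i] & 0xFF);
        unsigned char hi = (unsigned char)(units[i] >> 8);
        out[pos++] = little ? lo : hi;
        out[pos++] = little ? hi : lo;
    }
    return pos;
}
```

Encoding the single unit U+0041 yields two bytes for *byteorder* -1 or 1, and
four bytes (BOM plus data) for *byteorder* 0.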


.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)

   Return a Python string using the UTF-16 encoding in native byte order. The
   string always starts with a BOM mark. Error handling is "strict". Return
   *NULL* if an exception was raised by the codec.

These are the "Unicode Escape" codec APIs:

.. % --- Unicode-Escape Codecs ----------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
   string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
   return a Python string object. Return *NULL* if an exception was raised by the
   codec.


.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Unicode-Escape and return the result as Python
   string object. Error handling is "strict". Return *NULL* if an exception was
   raised by the codec.

These are the "Raw Unicode Escape" codec APIs:

.. % --- Raw-Unicode-Escape Codecs ------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object. Return *NULL* if an exception was raised by
   the codec.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
ordinals and only these are accepted by the codecs during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.

These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object. Return *NULL* if an exception was raised by the codec.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as Python string
   object. Error handling is "strict". Return *NULL* if an exception was raised
   by the codec.
				663
				664	These are the mapping codec APIs:
				665
				666	.. % --- Character Map Codecs -----------------------------------------------
				667
				668	This codec is special in that it can be used to implement many different codecs
				669	(and this is in fact what was done to obtain most of the standard codecs
				670	included in the :mod:`encodings` package). The codec uses mapping to encode and
				671	decode characters.
				672
				673	Decoding mappings must map single string characters to single Unicode
				674	characters, integers (which are then interpreted as Unicode ordinals) or None
				675	(meaning "undefined mapping" and causing an error).
				676
				677	Encoding mappings must map single Unicode characters to single string
				678	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				679	(meaning "undefined mapping" and causing an error).
				680
				681	The mapping objects provided must only support the __getitem__ mapping
				682	interface.
				683
				684	If a character lookup fails with a LookupError, the character is copied as-is
				685	meaning that its ordinal value will be interpreted as Unicode or Latin-1 ordinal
				686	resp. Because of this, mappings only need to contain those mappings which map
				687	characters to different code points.
				688
				689
				690	.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char s, Py_ssize_t size, PyObject mapping, const char *errors)
				691
				692	Create a Unicode object by decoding *size* bytes of the encoded string *s* using
				693	the given *mapping* object. Return NULL if an exception was raised by the
				694	codec. If *mapping* is NULL, Latin-1 decoding will be done. Otherwise it can be
				695	a dictionary mapping byte values or a Unicode string, which is treated as a
				696	lookup table. Byte values greater than or equal to the length of the string and
				697	U+FFFE "characters" are treated as "undefined mapping".
				698
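The lookup-table form of the mapping can be tried out from Python through ``codecs.charmap_decode``, which wraps this C function. The following is a minimal sketch under assumptions: the four-entry table and the helper names ``decode_with_table`` and ``is_undefined`` are made up for illustration, and the code is written to run on Python 3.

```python
import codecs

# Hypothetical lookup table: byte 0 -> "A", byte 1 -> "B",
# byte 2 -> undefined (U+FFFE), byte 3 -> "D".
TABLE = u"AB\ufffeD"

def decode_with_table(data, errors="strict"):
    """Decode bytes through TABLE and return only the decoded text."""
    text, _consumed = codecs.charmap_decode(data, errors, TABLE)
    return text

def is_undefined(data):
    """Return True if strict decoding reports an undefined mapping."""
    try:
        decode_with_table(data)
    except UnicodeDecodeError:
        return True
    return False

# Bytes that have a table entry are mapped one to one.
mapped = decode_with_table(b"\x00\x01\x03")

# With the "replace" handler an undefined byte becomes U+FFFD.
substituted = decode_with_table(b"\x00\x02", "replace")

# A mapping of None selects Latin-1 decoding.
latin1_text, _ = codecs.charmap_decode(b"\xe9", "strict", None)
```

Here ``is_undefined(b"\x02")`` is true because the table entry is U+FFFE, and ``is_undefined(b"\x04")`` is true because the byte value lies beyond the end of the table.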
				699
				700	.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				701
				702	Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
				703	mapping object and return a Python string object. Return NULL if an
				704	exception was raised by the codec.
				705
				706
				707	.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				708
				709	Encode a Unicode object using the given mapping object and return the result
				710	as a Python string object. Error handling is "strict". Return NULL if an
				711	exception was raised by the codec.
				712
				713	The following codec API is special in that it maps Unicode to Unicode.
				714
				715
				716	.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				717
				718	Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
				719	character mapping table to it and return the resulting Unicode object. Return
				720	NULL when an exception was raised by the codec.
				721
				722	The mapping table must map Unicode ordinal integers to Unicode ordinal
				723	integers or None (causing deletion of the character).
				724
				725	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				726	and sequences work well. Unmapped character ordinals (ones which cause a
				727	:exc:`LookupError`) are left untouched and are copied as-is.
				728
				729	These are the MBCS codec APIs. They are currently only available on Windows and
				730	use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
				731	DBCS) is a class of encodings, not just one. The target encoding is defined by
				732	the user settings on the machine running the codec.
				733
				734	.. % --- MBCS codecs for Windows --------------------------------------------
				735
				736
				737	.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				738
				739	Create a Unicode object by decoding size bytes of the MBCS encoded string s.
				740	Return NULL if an exception was raised by the codec.
				741
				742
				743	.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				744
				745	If *consumed* is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
				746	*consumed* is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode a
				747	trailing lead byte, and the number of bytes that have been decoded will be stored
				748	in *consumed*.
				749
				750
				751	.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				752
				753	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
				754	Python string object. Return NULL if an exception was raised by the codec.
				755
				756
				757	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				758
				759	Encode a Unicode object using MBCS and return the result as a Python string
				760	object. Error handling is "strict". Return NULL if an exception was raised
				761	by the codec.
				762
				763	.. % --- Methods & Slots ----------------------------------------------------
				764
				765
				766	.. _unicodemethodsandslots:
				767
				768	Methods and Slot Functions
				769	^^^^^^^^^^^^^^^^^^^^^^^^^^
				770
				771	The following APIs are capable of handling Unicode objects and strings on input
				772	(we refer to them as strings in the descriptions) and return Unicode objects or
				773	integers as appropriate.
				774
				775	They all return NULL or ``-1`` if an exception occurs.
				776
				777
				778	.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
				779
				780	Concatenate two strings, giving a new Unicode string.
				781
				782
				783	.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				784
				785	Split a string, giving a list of Unicode strings. If *sep* is NULL, splitting
				786	will be done at all whitespace substrings. Otherwise, splits occur at the given
				787	separator. At most *maxsplit* splits will be done. If *maxsplit* is negative, no
				788	limit is set. Separators are not included in the resulting list.
				789
				790
				791	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				792
				793	Split a Unicode string at line breaks, returning a list of Unicode strings.
				794	CRLF is considered to be one line break. If *keepend* is 0, the line break
				795	characters are not included in the resulting strings.
				796
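The splitting rules of :cfunc:`PyUnicode_Split` and :cfunc:`PyUnicode_Splitlines` mirror the Python-level ``split`` and ``splitlines`` methods, so they can be explored interactively. The sample strings below are invented for illustration, and the snippet is written to run on Python 3.

```python
text = u"alpha  beta\tgamma"

# No separator: split on runs of whitespace (sep == NULL in C).
on_whitespace = text.split()

# Explicit separator, limited to one split (maxsplit == 1).
limited = u"a,b,c".split(u",", 1)

# A negative maxsplit means no limit.
unlimited = u"a,b,c".split(u",", -1)

lines = u"one\r\ntwo\nthree"
without_ends = lines.splitlines()       # corresponds to keepend == 0
with_ends = lines.splitlines(True)      # corresponds to keepend != 0
```

Note that the CRLF pair in ``lines`` produces a single break, not an empty line between two breaks.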
				797
				798	.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				799
				800	Translate a string by applying a character mapping table to it and return the
				801	resulting Unicode object.
				802
				803	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				804	or None (causing deletion of the character).
				805
				806	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				807	and sequences work well. Unmapped character ordinals (ones which cause a
				808	:exc:`LookupError`) are left untouched and are copied as-is.
				809
				810	*errors* has the usual meaning for codecs. It may be NULL, which indicates that
				811	the default error handling should be used.
				812
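The table semantics described above (ordinal to ordinal, None to delete, lookup failure to copy unchanged) are the same as those of the Python-level ``translate`` method. A small sketch, using made-up tables and written to run on Python 3:

```python
# Hypothetical table: "a" -> "b", delete "x"; everything else is unmapped.
table = {ord(u"a"): ord(u"b"), ord(u"x"): None}

translated = u"abxc".translate(table)

# A sequence works as a table too: index i holds the entry for ordinal i,
# and ordinals past the end raise IndexError (a LookupError subclass).
seq_table = [None] * 66   # delete ordinals 0..65, i.e. up to and including "A"
seq_translated = u"ABC".translate(seq_table)
```

In the second case "A" (ordinal 65) is deleted, while "B" and "C" fall outside the list and are copied as-is.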
				813
				814	.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				815
				816	Join a sequence of strings using the given separator and return the resulting
				817	Unicode string.
				818
				819
				820	.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				821
				822	Return 1 if *substr* matches ``str[start:end]`` at the given tail end
				823	(*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
				824	0 otherwise. Return ``-1`` if an error occurred.
				825
				826
				827	.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				828
				829	Return the first position of *substr* in ``str[start:end]`` using the given
				830	*direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
				831	backward search). The return value is the index of the first match; a value of
				832	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				833	occurred and an exception has been set.
				834
				835
				836	.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				837
				838	Return the number of non-overlapping occurrences of substr in
				839	``str[start:end]``. Return ``-1`` if an error occurred.
				840
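The matching, searching and counting functions above have close Python-level counterparts, which makes the meaning of *start*, *end* and *direction* easy to check. The snippet is an illustration only, with an invented sample string, written to run on Python 3.

```python
s = u"spam and eggs and spam"

# PyUnicode_Tailmatch: direction -1 is a prefix test, +1 a suffix test,
# both restricted to s[start:end].
prefix_ok = s.startswith(u"spam", 0, len(s))
suffix_ok = s.endswith(u"eggs", 0, 13)

# PyUnicode_Find: direction +1 searches forward, -1 searches backward.
first = s.find(u"and", 0, len(s))
last = s.rfind(u"and", 0, len(s))
missing = s.find(u"ham", 0, len(s))

# PyUnicode_Count: occurrences do not overlap.
non_overlapping = u"aaaa".count(u"aa", 0, 4)
```

Where these Python methods raise an exception, the C functions report the error through their return value instead: ``-1`` for :cfunc:`PyUnicode_Tailmatch` and :cfunc:`PyUnicode_Count`, and ``-2`` for :cfunc:`PyUnicode_Find`.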
				841
				842	.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				843
				844	Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
				845	return the resulting Unicode object. *maxcount* == -1 means replace all
				846	occurrences.
				847
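The effect of *maxcount* matches the optional count argument of the Python-level ``replace`` method. A brief sketch with an invented input, written to run on Python 3:

```python
s = u"a-b-c-d"

# Replace only the first two occurrences (maxcount == 2).
first_two = s.replace(u"-", u"+", 2)

# A count of -1 replaces every occurrence (maxcount == -1).
everything = s.replace(u"-", u"+", -1)
```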
				848
				849	.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				850
				851	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				852	respectively.
				853
				854
				855	.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				856
				857	Rich compare two unicode strings and return one of the following:
				858
				859	* ``NULL`` in case an exception was raised
				860	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				861	* :const:`Py_NotImplemented` in case the type combination is unknown
				862
				863	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				864	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				865	with a :exc:`UnicodeDecodeError`.
				866
				867	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				868	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				869
				870
				871	.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				872
				873	Return a new string object from format and args; this is analogous to
				874	``format % args``. The args argument must be a tuple.
				875
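Since the function is analogous to the ``%`` operator, the expected result can be checked from Python first. The format string and values below are invented for illustration; the snippet runs on Python 3.

```python
fmt = u"%s has %d items (%.1f%% full)"

# The arguments are passed as a tuple, as the C function requires.
args = (u"cart", 3, 37.5)
result = fmt % args

# A single value still needs a one-element tuple.
single = u"hello %s" % (u"world",)
```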
				876
				877	.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				878
				879	Check whether element is contained in container and return true or false
				880	accordingly.
				881
				882	*element* has to coerce to a one element Unicode string. ``-1`` is returned if
				883	there was an error.
				884
				885
				886	.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
				887
				888	Intern the argument *\*string* in place. The argument must be the address of a
				889	pointer variable pointing to a Python unicode string object. If there is an
				890	existing interned string that is the same as *\*string*, it sets *\*string* to
				891	it (decrementing the reference count of the old string object and incrementing
				892	the reference count of the interned string object), otherwise it leaves
				893	*\*string* alone and interns it (incrementing its reference count).
				894	(Clarification: even though there is a lot of talk about reference counts, think
				895	of this function as reference-count-neutral; you own the object after the call
				896	if and only if you owned it before the call.)
				897
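The visible effect of interning, namely that equal strings end up sharing one object, can be observed from Python 3 with ``sys.intern``. This is only a sketch of the behaviour, not of the C calling convention, and the sample strings are invented.

```python
import sys

# Build two equal strings at run time, so that in practice they start out
# as separate objects (an assumption about the interpreter, not a guarantee).
a = u"".join([u"inter", u"ned-key"])
b = u"".join([u"intern", u"ed-key"])

# After interning, both names refer to the same string object.
ia = sys.intern(a)
ib = sys.intern(b)
```

In C the same replacement happens through the pointer passed to :cfunc:`PyUnicode_InternInPlace`, with the reference count adjustments described above.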
				898
				899	.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
				900
				901	A combination of :cfunc:`PyUnicode_FromString` and
				902	:cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
				903	that has been interned, or a new ("owned") reference to an earlier interned
				904	string object with the same value.
				905