Blame - Doc/c-api/unicode.rst - platform/external/python/cpython3

blob: e348ee7c8721504055c44c84b83ad769c2cb002d

Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1	.. highlightlang:: c
				2
				3	.. _unicodeobjects:
				4
				5	Unicode Objects and Codecs
				6	--------------------------
				7
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9
				10	Unicode Objects
				11	^^^^^^^^^^^^^^^
				12
				13	These are the basic Unicode object types used for the Unicode implementation in
				14	Python:
				15
				16	.. % --- Unicode Type -------------------------------------------------------
				17
				18
				19	.. ctype:: Py_UNICODE
				20
				21	This type represents the storage type which is used by Python internally as
				22	basis for holding Unicode ordinals. Python's default builds use a 16-bit type
				23	for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
				24	possible to build a UCS4 version of Python (most recent Linux distributions come
				25	with UCS4 builds of Python). These builds then use a 32-bit type for
				26	:ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
				27	where :ctype:`wchar_t` is available and compatible with the chosen Python
				28	Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
				29	:ctype:`wchar_t` to enhance native platform compatibility. On all other
				30	platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
				31	short` (UCS2) or :ctype:`unsigned long` (UCS4).
				32
				33	Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
				34	this in mind when writing extensions or interfaces.
				35
				36
				37	.. ctype:: PyUnicodeObject
				38
				39	This subtype of :ctype:`PyObject` represents a Python Unicode object.
				40
				41
				42	.. cvar:: PyTypeObject PyUnicode_Type
				43
				44	This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
				45	is exposed to Python code as ``str``.
				46
				47	The following APIs are really C macros and can be used to do fast checks and to
				48	access internal read-only data of Unicode objects:
				49
				50
				51	.. cfunction:: int PyUnicode_Check(PyObject *o)
				52
				53	Return true if the object o is a Unicode object or an instance of a Unicode
				54	subtype.
				55
				56
				57	.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
				58
				59	Return true if the object o is a Unicode object, but not an instance of a
				60	subtype.
				61
				62
				63	.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
				64
				65	Return the size of the object. o has to be a :ctype:`PyUnicodeObject` (not
				66	checked).
				67
				68
				69	.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
				70
				71	Return the size of the object's internal buffer in bytes. o has to be a
				72	:ctype:`PyUnicodeObject` (not checked).
				73
				74
				75	.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
				76
				77	Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. o
				78	has to be a :ctype:`PyUnicodeObject` (not checked).
				79
				80
				81	.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
				82
				83	Return a pointer to the internal buffer of the object. o has to be a
				84	:ctype:`PyUnicodeObject` (not checked).
				85
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	86
				87	.. cfunction:: int PyUnicode_ClearFreeList(void)
				88
				89	Clear the free list. Return the total number of freed items.
				90
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	91	Unicode provides many different character properties. The most often needed ones
				92	are available through these macros which are mapped to C functions depending on
				93	the Python configuration.
				94
				95	.. % --- Unicode character properties ---------------------------------------
				96
				97
				98	.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
				99
				100	Return 1 or 0 depending on whether ch is a whitespace character.
				101
				102
				103	.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
				104
				105	Return 1 or 0 depending on whether ch is a lowercase character.
				106
				107
				108	.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
				109
				110	Return 1 or 0 depending on whether ch is an uppercase character.
				111
				112
				113	.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
				114
				115	Return 1 or 0 depending on whether ch is a titlecase character.
				116
				117
				118	.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
				119
				120	Return 1 or 0 depending on whether ch is a linebreak character.
				121
				122
				123	.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
				124
				125	Return 1 or 0 depending on whether ch is a decimal character.
				126
				127
				128	.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
				129
				130	Return 1 or 0 depending on whether ch is a digit character.
				131
				132
				133	.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
				134
				135	Return 1 or 0 depending on whether ch is a numeric character.
				136
				137
				138	.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
				139
				140	Return 1 or 0 depending on whether ch is an alphabetic character.
				141
				142
				143	.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
				144
				145	Return 1 or 0 depending on whether ch is an alphanumeric character.
				146
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	147
				148	.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
				149
				150	Return 1 or 0 depending on whether ch is a printable character.
				151	Nonprintable characters are those characters defined in the Unicode character
				152	database as "Other" or "Separator", excepting the ASCII space (0x20) which is
				153	considered printable. (Note that printable characters in this context are
				154	those which should not be escaped when :func:`repr` is invoked on a string.
				155	It has no bearing on the handling of strings written to :data:`sys.stdout` or
				156	:data:`sys.stderr`.)
				157
				158
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	159	These APIs can be used for fast direct character conversions:
				160
				161
				162	.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
				163
				164	Return the character ch converted to lower case.
				165
				166
				167	.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
				168
				169	Return the character ch converted to upper case.
				170
				171
				172	.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
				173
				174	Return the character ch converted to title case.
				175
				176
				177	.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
				178
				179	Return the character ch converted to a decimal positive integer. Return
				180	``-1`` if this is not possible. This macro does not raise exceptions.
				181
				182
				183	.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
				184
				185	Return the character ch converted to a single digit integer. Return ``-1`` if
				186	this is not possible. This macro does not raise exceptions.
				187
				188
				189	.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
				190
				191	Return the character ch converted to a double. Return ``-1.0`` if this is not
				192	possible. This macro does not raise exceptions.
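The ``-1`` convention shared by the conversion macros above can be illustrated with a plain C sketch that is restricted to ASCII. ``ascii_to_decimal`` is a hypothetical helper and not part of the Python API; the real macros consult the Unicode character database and also handle non-ASCII digits.

```c
/* Hypothetical ASCII-only stand-in for the return convention of
   Py_UNICODE_TODECIMAL: a value 0..9 on success, -1 otherwise. */
int ascii_to_decimal(unsigned int ch)
{
    if (ch >= '0' && ch <= '9')
        return (int)(ch - '0');
    return -1;   /* not a decimal digit; no exception is raised */
}
```

A caller only has to compare the result against ``-1``; no exception state needs to be checked.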
				193
				194	To create Unicode objects and access their basic sequence properties, use these
				195	APIs:
				196
				197	.. % --- Plain Py_UNICODE ---------------------------------------------------
				198
				199
				200	.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
				201
				202	Create a Unicode Object from the Py_UNICODE buffer u of the given size. u
				203	may be NULL which causes the contents to be undefined. It is the user's
				204	responsibility to fill in the needed data. The buffer is copied into the new
				205	object. If the buffer is not NULL, the return value might be a shared object.
				206	Therefore, modification of the resulting Unicode object is only allowed when u
				207	is NULL.
				208
				209
				210	.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
				211
				212	Create a Unicode Object from the char buffer u. The bytes will be interpreted
				213	as being UTF-8 encoded. u may also be NULL which
				214	causes the contents to be undefined. It is the user's responsibility to fill in
				215	the needed data. The buffer is copied into the new object. If the buffer is not
				216	NULL, the return value might be a shared object. Therefore, modification of
				217	the resulting Unicode object is only allowed when u is NULL.
				218
				219
				220	.. cfunction:: PyObject* PyUnicode_FromString(const char *u)
				221
				222	Create a Unicode object from a UTF-8 encoded null-terminated char buffer
				223	u.
				224
				225
				226	.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
				227
				228	Take a C :cfunc:`printf`\ -style format string and a variable number of
				229	arguments, calculate the size of the resulting Python unicode string and return
				230	a string with the values formatted into it. The variable arguments must be C
				231	types and must correspond exactly to the format characters in the format
				232	string. The following format characters are allowed:
				233
				234	.. % The descriptions for %zd and %zu are wrong, but the truth is complicated
				235	.. % because not all compilers support the %z width modifier -- we fake it
				236	.. % when necessary via interpolating PY_FORMAT_SIZE_T.
				237
				238	+-------------------+---------------------+--------------------------------+
				239	| Format Characters | Type                | Comment                        |
				240	+===================+=====================+================================+
				241	| :attr:`%%`        | n/a                 | The literal % character.       |
				242	+-------------------+---------------------+--------------------------------+
				243	| :attr:`%c`        | int                 | A single character,            |
				244	|                   |                     | represented as a C int.        |
				245	+-------------------+---------------------+--------------------------------+
				246	| :attr:`%d`        | int                 | Exactly equivalent to          |
				247	|                   |                     | ``printf("%d")``.              |
				248	+-------------------+---------------------+--------------------------------+
				249	| :attr:`%u`        | unsigned int        | Exactly equivalent to          |
				250	|                   |                     | ``printf("%u")``.              |
				251	+-------------------+---------------------+--------------------------------+
				252	| :attr:`%ld`       | long                | Exactly equivalent to          |
				253	|                   |                     | ``printf("%ld")``.             |
				254	+-------------------+---------------------+--------------------------------+
				255	| :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
				256	|                   |                     | ``printf("%lu")``.             |
				257	+-------------------+---------------------+--------------------------------+
				258	| :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
				259	|                   |                     | ``printf("%zd")``.             |
				260	+-------------------+---------------------+--------------------------------+
				261	| :attr:`%zu`       | size_t              | Exactly equivalent to          |
				262	|                   |                     | ``printf("%zu")``.             |
				263	+-------------------+---------------------+--------------------------------+
				264	| :attr:`%i`        | int                 | Exactly equivalent to          |
				265	|                   |                     | ``printf("%i")``.              |
				266	+-------------------+---------------------+--------------------------------+
				267	| :attr:`%x`        | int                 | Exactly equivalent to          |
				268	|                   |                     | ``printf("%x")``.              |
				269	+-------------------+---------------------+--------------------------------+
				270	| :attr:`%s`        | char\*              | A null-terminated C character  |
				271	|                   |                     | array.                         |
				272	+-------------------+---------------------+--------------------------------+
				273	| :attr:`%p`        | void\*              | The hex representation of a C  |
				274	|                   |                     | pointer. Mostly equivalent to  |
				275	|                   |                     | ``printf("%p")`` except that   |
				276	|                   |                     | it is guaranteed to start with |
				277	|                   |                     | the literal ``0x`` regardless  |
				278	|                   |                     | of what the platform's         |
				279	|                   |                     | ``printf`` yields.             |
				280	+-------------------+---------------------+--------------------------------+
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	281	| :attr:`%A`        | PyObject\*          | The result of calling          |
				282	|                   |                     | :func:`ascii`.                 |
				283	+-------------------+---------------------+--------------------------------+
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	284	| :attr:`%U`        | PyObject\*          | A unicode object.              |
				285	+-------------------+---------------------+--------------------------------+
				286	| :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
				287	|                   |                     | NULL) and a null-terminated    |
				288	|                   |                     | C character array as a second  |
				289	|                   |                     | parameter (which will be used, |
				290	|                   |                     | if the first parameter is      |
				291	|                   |                     | NULL).                         |
				292	+-------------------+---------------------+--------------------------------+
				293	| :attr:`%S`        | PyObject\*          | The result of calling          |
Benjamin Peterson	e866206	2009-03-08 23:51:13 +0000	[diff] [blame]	294	|                   |                     | :func:`PyObject_Str`.          |
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	295	+-------------------+---------------------+--------------------------------+
				296	| :attr:`%R`        | PyObject\*          | The result of calling          |
				297	|                   |                     | :func:`PyObject_Repr`.         |
				298	+-------------------+---------------------+--------------------------------+
				299
				300	An unrecognized format character causes all the rest of the format string to be
				301	copied as-is to the result string, and any extra arguments discarded.
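The "calculate the size, then format" approach described for ``PyUnicode_FromFormat`` can be sketched in plain C with the standard ``vsnprintf``. This is only an analogy under stated assumptions: ``format_alloc`` is a hypothetical helper that returns a ``malloc``-ed C string, it is not how the Python function is implemented, and it supports the C library's format characters rather than the table above.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format into a freshly allocated C string.
   Pass 1 measures the result, pass 2 writes it. Returns NULL on failure;
   the caller frees the result. */
char *format_alloc(const char *format, ...)
{
    va_list args, count_args;
    char *result = NULL;
    int needed;

    va_start(args, format);
    va_copy(count_args, args);          /* each pass needs its own va_list */
    needed = vsnprintf(NULL, 0, format, count_args);
    va_end(count_args);

    if (needed >= 0) {
        result = malloc((size_t)needed + 1);
        if (result != NULL)
            vsnprintf(result, (size_t)needed + 1, format, args);
    }
    va_end(args);
    return result;
}
```

As with the Python function, the variable arguments must match the format characters exactly, because nothing checks them at run time.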
				302
				303
				304	.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
				305
				306	Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
				307	arguments.
				308
				309
				310	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
				311
				312	Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
				313	buffer, NULL if unicode is not a Unicode object.
				314
				315
				316	.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
				317
				318	Return the length of the Unicode object.
				319
				320
				321	.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
				322
				323	Coerce an encoded object obj to a Unicode object and return a reference with
				324	incremented refcount.
				325
				326	String and other char buffer compatible objects are decoded according to the
				327	given encoding and using the error handling defined by errors. Both can be
				328	NULL to have the interface use the default values (see the next section for
				329	details).
				330
				331	All other objects, including Unicode objects, cause a :exc:`TypeError` to be
				332	set.
				333
				334	The API returns NULL if there was an error. The caller is responsible for
				335	decref'ing the returned objects.
				336
				337
				338	.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
				339
				340	Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
				341	throughout the interpreter whenever coercion to Unicode is needed.
				342
				343	If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
				344	Python can interface directly to this type using the following functions.
				345	Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
				346	the system's :ctype:`wchar_t`.
				347
				348	.. % --- wchar_t support for platforms which support it ---------------------
				349
				350
				351	.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
				352
				353	Create a Unicode object from the :ctype:`wchar_t` buffer w of the given size.
Martin v. Löwis	790465f	2008-04-05 20:41:37 +0000	[diff] [blame]	354	Passing -1 as the size indicates that the function must itself compute the length,
				355	using wcslen.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	356	Return NULL on failure.
				357
				358
				359	.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
				360
				361	Copy the Unicode object contents into the :ctype:`wchar_t` buffer w. At most
				362	size :ctype:`wchar_t` characters are copied (excluding a possibly trailing
				363	0-termination character). Return the number of :ctype:`wchar_t` characters
				364	copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
				365	string may or may not be 0-terminated. It is the responsibility of the caller
				366	to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
				367	required by the application.
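Because the copied :ctype:`wchar_t` string may not be 0-terminated, a common caller-side pattern is to reserve one slot of the buffer and write the terminator explicitly. The plain C sketch below shows that pattern. ``copy_wide`` is a hypothetical stand-in that mimics the documented contract (copy at most size characters, return the count, add no terminator); it is not the Python function.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the documented contract: copies at most
   `size` characters of `src` (length `len`) and returns the count.
   It never writes a terminator. */
static ptrdiff_t copy_wide(const wchar_t *src, size_t len,
                           wchar_t *w, size_t size)
{
    size_t n = len < size ? len : size;
    wmemcpy(w, src, n);
    return (ptrdiff_t)n;
}

/* Caller-side pattern: keep the last slot free for the terminator. */
ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t len,
                               wchar_t *buf, size_t bufsize)
{
    ptrdiff_t n;
    if (bufsize == 0)
        return -1;                      /* no room even for the terminator */
    n = copy_wide(src, len, buf, bufsize - 1);
    buf[n] = L'\0';
    return n;
}
```

With a four-element buffer, a five-character source is truncated to three characters plus the terminator.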
				368
				369
				370	.. _builtincodecs:
				371
				372	Built-in Codecs
				373	^^^^^^^^^^^^^^^
				374
				375	Python provides a set of builtin codecs which are written in C for speed. All of
				376	these codecs are directly usable via the following functions.
				377
				378	Many of the following APIs take two arguments, encoding and errors. These
				379	parameters have the same semantics as the ones of the built-in :func:`str`
				380	string object constructor.
				381
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	382	Setting encoding to NULL causes the default encoding to be used,
				383	which is UTF-8. File system calls should use
				384	:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
				385	variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
				386	variable should be treated as read-only: On some systems, it will be a
				387	pointer to a static string, on others, it will change at run-time
				388	(such as when the application invokes setlocale).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	389
				390	Error handling is set by errors which may also be set to NULL meaning to use
				391	the default handling defined for the codec. Default error handling for all
				392	builtin codecs is "strict" (:exc:`ValueError` is raised).
				393
				394	The codecs all use a similar interface. For simplicity, only deviations from
				395	the following generic ones are documented.
				396
				397	These are the generic codec APIs:
				398
				399	.. % --- Generic Codecs -----------------------------------------------------
				400
				401
				402	.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
				403
				404	Create a Unicode object by decoding size bytes of the encoded string s.
				405	encoding and errors have the same meaning as the parameters of the same name
				406	in the :func:`str` built-in function. The codec to be used is looked up
				407	using the Python codec registry. Return NULL if an exception was raised by
				408	the codec.
				409
				410
				411	.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
				412
				413	Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	414	bytes object. encoding and errors have the same meaning as the
				415	parameters of the same name in the Unicode :meth:`encode` method. The codec
				416	to be used is looked up using the Python codec registry. Return NULL if an
				417	exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	418
				419
				420	.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
				421
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	422	Encode a Unicode object and return the result as a Python bytes object.
				423	encoding and errors have the same meaning as the parameters of the same
				424	name in the Unicode :meth:`encode` method. The codec to be used is looked up
				425	using the Python codec registry. Return NULL if an exception was raised by
				426	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	427
				428	These are the UTF-8 codec APIs:
				429
				430	.. % --- UTF-8 Codecs -------------------------------------------------------
				431
				432
				433	.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
				434
				435	Create a Unicode object by decoding size bytes of the UTF-8 encoded string
				436	s. Return NULL if an exception was raised by the codec.
				437
				438
				439	.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
				440
				441	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
				442	consumed is not NULL, trailing incomplete UTF-8 byte sequences will not be
				443	treated as an error. Those bytes will not be decoded and the number of bytes
				444	that have been decoded will be stored in consumed.
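What "trailing incomplete UTF-8 byte sequences" means can be shown with a plain C sketch that computes how many leading bytes of a buffer form complete sequences. ``utf8_complete_prefix`` is a hypothetical helper, not part of the Python API, and it is deliberately simplified: it trusts lead bytes and does not validate continuation bytes, which a real decoder must do.

```c
#include <stddef.h>

/* Return how many leading bytes of s[0..size) belong to complete UTF-8
   sequences. A sequence that is cut off at the end of the buffer is left
   out, which corresponds to the value a stateful decoder would report
   in `consumed`. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t i = 0;
    while (i < size) {
        unsigned char lead = s[i];
        size_t need;
        if (lead < 0x80)
            need = 1;                   /* ASCII */
        else if (lead >= 0xF0)
            need = 4;
        else if (lead >= 0xE0)
            need = 3;
        else if (lead >= 0xC0)
            need = 2;
        else
            need = 1;                   /* stray continuation byte; a real
                                           decoder would report an error */
        if (size - i < need)
            break;                      /* incomplete tail: stop before it */
        i += need;
    }
    return i;
}
```

A streaming caller would decode the reported prefix and keep the remaining bytes until more input arrives.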
				445
				446
				447	.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				448
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	449	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
				450	return a Python bytes object. Return NULL if an exception was raised by
				451	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	452
				453
				454	.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
				455
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	456	Encode a Unicode object using UTF-8 and return the result as a Python bytes
				457	object. Error handling is "strict". Return NULL if an exception was
				458	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	459
				460	These are the UTF-32 codec APIs:
				461
				462	.. % --- UTF-32 Codecs ------------------------------------------------------ */
				463
				464
				465	.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				466
				467	Decode size bytes from a UTF-32 encoded buffer string and return the
				468	corresponding Unicode object. errors (if non-NULL) defines the error
				469	handling. It defaults to "strict".
				470
				471	If byteorder is non-NULL, the decoder starts decoding using the given byte
				472	order::
				473
				474	   *byteorder == -1: little endian
				475	   *byteorder == 0: native order
				476	   *byteorder == 1: big endian
				477
				478	and then switches if the first four bytes of the input data are a byte order mark
				479	(BOM) and the specified byte order is native order. This BOM is not copied into
				480	the resulting Unicode string. After completion, ``*byteorder`` is set to the
				481	current byte order at the end of input data.
				482
				483	In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
				484
				485	If byteorder is NULL, the codec starts in native order mode.
				486
				487	Return NULL if an exception was raised by the codec.
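The two ideas in this description, reading a UTF-32 code unit in a given byte order and representing a code point outside the BMP as a surrogate pair on a narrow build, can be sketched in plain C. ``utf32_read`` and ``to_utf16_units`` are hypothetical helpers, not Python APIs, and the native-order case (``0``) is left out for brevity.

```c
#include <stdint.h>

/* Read one UTF-32 code unit from 4 bytes. byteorder -1 means little
   endian; any other value is treated as big endian in this sketch. */
uint32_t utf32_read(const unsigned char *p, int byteorder)
{
    if (byteorder < 0)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Store a code point in 16-bit units, as a narrow build would.
   Returns the number of units written (1 or 2). */
int to_utf16_units(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

For example, the little-endian bytes ``00 F6 01 00`` read as U+1F600, which a narrow build would hold as the pair ``0xD83D``, ``0xDE00``.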
				488
				489
				490	.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				491
				492	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
				493	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
				494	trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
				495	by four) as an error. Those bytes will not be decoded and the number of bytes
				496	that have been decoded will be stored in consumed.
				497
				498
				499	.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				500
				501	Return a Python bytes object holding the UTF-32 encoded value of the Unicode
				502	data in s. If byteorder is not ``0``, output is written according to the
				503	following byte order::
				504
				505	   byteorder == -1: little endian
				506	   byteorder == 0: native byte order (writes a BOM mark)
				507	   byteorder == 1: big endian
				508
				509	If byteorder is ``0``, the output string will always start with the Unicode BOM
				510	(U+FEFF). In the other two modes, no BOM is prepended.
				511
				512	If Py_UNICODE_WIDE is not defined, surrogate pairs will be output
				513	as a single codepoint.
				514
				515	Return NULL if an exception was raised by the codec.
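The byte order rules for encoding can be summarised in a plain C sketch: a BOM is written only when byteorder is ``0``, and the data then follows in native order. ``utf32_encode`` and its helpers are hypothetical and work on ``uint32_t`` code points rather than on :ctype:`Py_UNICODE` buffers, so surrogate handling is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Detect the byte order of the machine at run time. */
static int native_is_little(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Write one 32-bit value in the requested byte order. */
static void put32(unsigned char *out, uint32_t v, int little)
{
    int i;
    for (i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)(v >> shift);
    }
}

/* Encode n code points into out, which must hold 4 * (n + 1) bytes.
   byteorder: -1 little endian, 1 big endian, 0 native order with BOM.
   Returns the number of bytes written. */
size_t utf32_encode(const uint32_t *cps, size_t n, int byteorder,
                    unsigned char *out)
{
    int little = byteorder < 0 || (byteorder == 0 && native_is_little());
    size_t pos = 0, i;

    if (byteorder == 0) {
        put32(out, 0xFEFF, little);     /* BOM only in native-order mode */
        pos = 4;
    }
    for (i = 0; i < n; i++, pos += 4)
        put32(out + pos, cps[i], little);
    return pos;
}
```

Encoding a single code point therefore yields four bytes with an explicit byte order and eight bytes (BOM plus data) with byteorder ``0``.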
				516
				517
				518	.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
				519
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	520	Return a Python byte string using the UTF-32 encoding in native byte
				521	order. The string always starts with a BOM. Error handling is "strict".
				522	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	523
				524
				525	These are the UTF-16 codec APIs:
				526
				527	.. % --- UTF-16 Codecs ------------------------------------------------------ */
				528
				529
				530	.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				531
				532	Decode size bytes from a UTF-16 encoded buffer string and return the
				533	corresponding Unicode object. errors (if non-NULL) defines the error
				534	handling. It defaults to "strict".
				535
				536	If byteorder is non-NULL, the decoder starts decoding using the given byte
				537	order::
				538
				539	   *byteorder == -1: little endian
				540	   *byteorder == 0: native order
				541	   *byteorder == 1: big endian
				542
				543	and then switches if the first two bytes of the input data are a byte order mark
				544	(BOM) and the specified byte order is native order. This BOM is not copied into
				545	the resulting Unicode string. After completion, ``*byteorder`` is set to the
				546	current byte order at the end of input data.
				547
				548	If byteorder is NULL, the codec starts in native order mode.
				549
				550	Return NULL if an exception was raised by the codec.
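The BOM rule described above (switch only when the requested order is native order, and do not copy the BOM into the result) can be sketched in plain C. ``utf16_check_bom`` is a hypothetical helper, not a Python API; it only inspects the first two bytes and updates ``*byteorder`` the way the description states.

```c
#include <stddef.h>

/* If *byteorder is 0 and the input starts with a UTF-16 BOM, set
   *byteorder to -1 (little endian) or 1 (big endian) and return 2,
   the number of bytes to skip. Otherwise return 0 and leave
   *byteorder unchanged. */
size_t utf16_check_bom(const unsigned char *s, size_t size, int *byteorder)
{
    if (*byteorder != 0 || size < 2)
        return 0;
    if (s[0] == 0xFF && s[1] == 0xFE) {   /* U+FEFF stored little endian */
        *byteorder = -1;
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {   /* U+FEFF stored big endian */
        *byteorder = 1;
        return 2;
    }
    return 0;
}
```

When the caller asks for an explicit byte order, leading BOM bytes are left in place and are decoded like any other data.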
				551
				552
				553	.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				554
				555	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
				556	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
				557	trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
				558	split surrogate pair) as an error. Those bytes will not be decoded and the
				559	number of bytes that have been decoded will be stored in consumed.
				560
				561
				562	.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				563
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	564	Return a Python bytes object holding the UTF-16 encoded value of the Unicode
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	565	data in s. If byteorder is not ``0``, output is written according to the
				566	following byte order::
				567
				568	   byteorder == -1: little endian
				569	   byteorder == 0: native byte order (writes a BOM mark)
				570	   byteorder == 1: big endian
				571
				572	If byteorder is ``0``, the output string will always start with the Unicode BOM
				573	(U+FEFF). In the other two modes, no BOM is prepended.
				574
				575	If Py_UNICODE_WIDE is defined, a single :ctype:`Py_UNICODE` value may get
				576	represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
				577	value is interpreted as a UCS-2 character.
				578
				579	Return NULL if an exception was raised by the codec.
				580
				581
				582	.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
				583
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	584	Return a Python byte string using the UTF-16 encoding in native byte
				585	order. The string always starts with a BOM. Error handling is "strict".
				586	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	587
				588	These are the "Unicode Escape" codec APIs:
				589
				590	.. % --- Unicode-Escape Codecs ----------------------------------------------
				591
				592
				593	.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				594
				595	Create a Unicode object by decoding size bytes of the Unicode-Escape encoded
				596	string s. Return NULL if an exception was raised by the codec.
				597
				598
				599	.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
				600
				601	Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
				602	return a Python string object. Return NULL if an exception was raised by the
				603	codec.
				604
				605
				606	.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
				607
				608	Encode a Unicode object using Unicode-Escape and return the result as a Python
				609	string object. Error handling is "strict". Return NULL if an exception was
				610	raised by the codec.
				611
				612	These are the "Raw Unicode Escape" codec APIs:
				613
				614	.. % --- Raw-Unicode-Escape Codecs ------------------------------------------
				615
				616
				617	.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				618
				619	Create a Unicode object by decoding size bytes of the Raw-Unicode-Escape
				620	encoded string s. Return NULL if an exception was raised by the codec.
				621
				622
				623	.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				624
				625	Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
				626	and return a Python string object. Return NULL if an exception was raised by
				627	the codec.
				628
				629
				630	.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
				631
				632	Encode a Unicode object using Raw-Unicode-Escape and return the result as a
				633	Python string object. Error handling is "strict". Return NULL if an exception
				634	was raised by the codec.
				635
				636	These are the Latin-1 codec APIs. Latin-1 corresponds to the first 256 Unicode
				637	ordinals, and only these are accepted by the codecs during encoding.
				638
				639	.. % --- Latin-1 Codecs -----------------------------------------------------
				640
				641
				642	.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
				643
				644	Create a Unicode object by decoding size bytes of the Latin-1 encoded string
				645	s. Return NULL if an exception was raised by the codec.
				646
				647
				648	.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				649
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	650	Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
				651	return a Python bytes object. Return NULL if an exception was raised by
				652	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	653
				654
				655	.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
				656
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	657	Encode a Unicode object using Latin-1 and return the result as a Python bytes
				658	object. Error handling is "strict". Return NULL if an exception was
				659	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	660
				661	These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
				662	codes generate errors.
				663
				664	.. % --- ASCII Codecs -------------------------------------------------------
				665
				666
				667	.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
				668
				669	Create a Unicode object by decoding size bytes of the ASCII encoded string
				670	s. Return NULL if an exception was raised by the codec.
				671
				672
				673	.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				674
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	675	Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
				676	return a Python bytes object. Return NULL if an exception was raised by
				677	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	678
				679
				680	.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
				681
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	682	Encode a Unicode object using ASCII and return the result as a Python bytes
				683	object. Error handling is "strict". Return NULL if an exception was
				684	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	685
				686	These are the mapping codec APIs:
				687
				688	.. % --- Character Map Codecs -----------------------------------------------
				689
				690	This codec is special in that it can be used to implement many different codecs
				691	(and this is in fact what was done to obtain most of the standard codecs
				692	included in the :mod:`encodings` package). The codec uses mappings to encode and
				693	decode characters.
				694
				695	Decoding mappings must map single string characters to single Unicode
				696	characters, integers (which are then interpreted as Unicode ordinals) or None
				697	(meaning "undefined mapping" and causing an error).
				698
				699	Encoding mappings must map single Unicode characters to single string
				700	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				701	(meaning "undefined mapping" and causing an error).
				702
				703	The mapping objects provided need only support the :meth:`__getitem__` mapping
				704	interface.
				705
				706	If a character lookup fails with a LookupError, the character is copied as-is,
				707	meaning that its ordinal value is interpreted as a Unicode or Latin-1 ordinal,
				708	respectively. Because of this, mappings need only contain entries that map
				709	characters to different code points.
				710
				711
				712	.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				713
				714	Create a Unicode object by decoding size bytes of the encoded string s using
				715	the given mapping object. Return NULL if an exception was raised by the
				716	codec. If mapping is NULL, Latin-1 decoding will be done. Otherwise it can be a
				717	dictionary that maps byte values, or a Unicode string that is treated as a lookup table.
				718	Byte values greater than the length of the string and U+FFFE "characters" are
				719	treated as "undefined mapping".
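
The lookup-table case can be modelled in plain C as follows. This is a sketch of the rule described above under "strict" error handling, not the CPython implementation; ``charmap_decode_table`` and ``UNDEFINED_MAPPING`` are hypothetical names introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define UNDEFINED_MAPPING 0xFFFEu   /* table entry meaning "no mapping" */

/* Hypothetical helper, NOT part of the Python C API: decode `size` bytes from
 * `s` through `table`, which holds `table_len` code points. Returns the number
 * of code points written to `out`, or -1 on an undefined mapping. */
static ptrdiff_t charmap_decode_table(const unsigned char *s, size_t size,
                                      const uint32_t *table, size_t table_len,
                                      uint32_t *out)
{
    for (size_t i = 0; i < size; i++) {
        if (s[i] >= table_len)                  /* byte lies past the table */
            return -1;
        if (table[s[i]] == UNDEFINED_MAPPING)   /* explicit "undefined" slot */
            return -1;
        out[i] = table[s[i]];
    }
    return (ptrdiff_t)size;
}
```

Each byte value is used as an index into the table, so a table shorter than 256 entries leaves the higher byte values undefined.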
				720
				721
				722	.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				723
				724	Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
				725	mapping object and return a Python bytes object. Return NULL if an
				726	exception was raised by the codec.
				727
				728
				729	.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				730
				731	Encode a Unicode object using the given mapping object and return the result
				732	as a Python bytes object. Error handling is "strict". Return NULL if an
				733	exception was raised by the codec.
				734
				735	The following codec API is special in that it maps Unicode to Unicode.
				736
				737
				738	.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				739
				740	Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
				741	character mapping table to it and return the resulting Unicode object. Return
				742	NULL when an exception was raised by the codec.
				743
				744	The mapping table must map Unicode ordinal integers to Unicode ordinal
				745	integers or None (causing deletion of the character).
				746
				747	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				748	and sequences work well. Unmapped character ordinals (ones which cause a
				749	:exc:`LookupError`) are left untouched and are copied as-is.
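
The three outcomes of a table lookup (replace, delete, copy as-is) can be sketched in plain C. This models the semantics only and is not how CPython implements translation; ``translate_code_points``, ``struct translate_entry`` and ``DELETE_CHAR`` are hypothetical, with ``DELETE_CHAR`` standing in for a None entry.

```c
#include <stddef.h>
#include <stdint.h>

#define DELETE_CHAR UINT32_MAX   /* stands in for a None entry in the table */

/* Hypothetical table entry: map code point `from` to `to`. */
struct translate_entry { uint32_t from; uint32_t to; };

/* Hypothetical helper, NOT part of the Python C API: translate `len` code
 * points from `src` into `dst` and return how many were written. */
static size_t translate_code_points(const uint32_t *src, size_t len,
                                    const struct translate_entry *table,
                                    size_t n, uint32_t *dst)
{
    size_t written = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = src[i];   /* default: unmapped, copied as-is */
        int drop = 0;
        for (size_t j = 0; j < n; j++) {
            if (table[j].from == src[i]) {
                if (table[j].to == DELETE_CHAR)
                    drop = 1;           /* None: delete the character */
                else
                    ch = table[j].to;   /* mapped to another ordinal */
                break;
            }
        }
        if (!drop)
            dst[written++] = ch;
    }
    return written;
}
```

The output can be shorter than the input, because entries that stand for None remove characters.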
				750
Jeroen Ruigrok van der Werven	47a7d70	2009-04-27 05:43:17 +0000	[diff] [blame]	751
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	752	These are the MBCS codec APIs. They are currently only available on Windows and
				753	use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
				754	DBCS) is a class of encodings, not just one. The target encoding is defined by
				755	the user settings on the machine running the codec.
				756
				757	.. % --- MBCS codecs for Windows --------------------------------------------
				758
				759
				760	.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				761
				762	Create a Unicode object by decoding size bytes of the MBCS encoded string s.
				763	Return NULL if an exception was raised by the codec.
				764
				765
				766	.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				767
				768	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
				769	consumed is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode a
				770	trailing lead byte, and the number of bytes that have been decoded will be stored
				771	in consumed.
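
The idea of holding back a trailing lead byte can be sketched in plain C. This is an illustration only: the lead-byte range used by ``is_lead_byte`` is an assumption made for the example (real code pages differ and the actual codec relies on the Win32 converters), and ``complete_prefix_length`` is a hypothetical helper.

```c
#include <stddef.h>

/* ASSUMPTION for illustration: bytes 0x81..0x9F start a two-byte character.
 * Real MBCS code pages define their own lead-byte ranges. */
static int is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0x9F;
}

/* Hypothetical helper, NOT part of the Python C API: return how many leading
 * bytes of `s` form complete characters. A lead byte at the very end is left
 * unconsumed so it can be decoded together with the next chunk. */
static size_t complete_prefix_length(const unsigned char *s, size_t size)
{
    size_t i = 0;
    while (i < size) {
        if (is_lead_byte(s[i])) {
            if (i + 1 >= size)
                break;      /* trailing lead byte: wait for its trail byte */
            i += 2;         /* complete two-byte character */
        } else {
            i += 1;         /* single-byte character */
        }
    }
    return i;
}
```

A caller that reads data in chunks would keep the unconsumed tail and prepend it to the next chunk before decoding again.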
				772
				773
				774	.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				775
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	776	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
				777	a Python bytes object. Return NULL if an exception was raised by the
				778	codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	779
				780
				781	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				782
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	783	Encode a Unicode object using MBCS and return the result as a Python bytes
				784	object. Error handling is "strict". Return NULL if an exception was
				785	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	786
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	787	For decoding file names and other environment strings, :cdata:`Py_FileSystemEncoding`
				788	should be used as the encoding, and ``"surrogateescape"`` should be used as the error
				789	handler. For encoding file names during argument parsing, the ``O&`` converter should
				790	be used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
				791
				792	.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
				793
				794	Convert obj into result, using the file system encoding, and the ``surrogateescape``
				795	error handler. result must be the address of a ``PyObject*`` variable; it receives a bytes or bytearray object
				796	which must be released when it is no longer used.
				797
				798	.. versionadded:: 3.1
				799
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	800	.. % --- Methods & Slots ----------------------------------------------------
				801
				802
				803	.. _unicodemethodsandslots:
				804
				805	Methods and Slot Functions
				806	^^^^^^^^^^^^^^^^^^^^^^^^^^
				807
				808	The following APIs are capable of handling Unicode objects and strings on input
				809	(we refer to them as strings in the descriptions) and return Unicode objects or
				810	integers as appropriate.
				811
				812	They all return NULL or ``-1`` if an exception occurs.
				813
				814
				815	.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
				816
				817	Concatenate two strings, giving a new Unicode string.
				818
				819
				820	.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				821
				822	Split a string giving a list of Unicode strings. If sep is NULL, splitting
				823	will be done at all whitespace substrings. Otherwise, splits occur at the given
				824	separator. At most maxsplit splits will be done. If negative, no limit is
				825	set. Separators are not included in the resulting list.
				826
				827
				828	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				829
				830	Split a Unicode string at line breaks, returning a list of Unicode strings.
				831	CRLF is considered to be one line break. If keepend is 0, the line break
				832	characters are not included in the resulting strings.
				833
				834
				835	.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				836
				837	Translate a string by applying a character mapping table to it and return the
				838	resulting Unicode object.
				839
				840	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				841	or None (causing deletion of the character).
				842
				843	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				844	and sequences work well. Unmapped character ordinals (ones which cause a
				845	:exc:`LookupError`) are left untouched and are copied as-is.
				846
				847	errors has the usual meaning for codecs. It may be NULL, which indicates that
				848	the default error handling should be used.
				849
				850
				851	.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				852
				853	Join a sequence of strings using the given separator and return the resulting
				854	Unicode string.
				855
				856
				857	.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				858
				859	Return 1 if substr matches str[start:end] at the given tail end
				860	(direction == -1 means to do a prefix match, direction == 1 a suffix match),
				861	0 otherwise. Return ``-1`` if an error occurred.
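
The meaning of direction can be modelled on plain byte strings. This is a simplified sketch, not the CPython implementation: ``tailmatch_ascii`` is a hypothetical helper, it assumes ``start <= end <= strlen(str)``, and it has no error return.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of the Python C API: return 1 if `substr`
 * matches the slice str[start:end] at its head (direction == -1) or at its
 * tail (direction == 1), and 0 otherwise.
 * ASSUMPTION: start <= end <= strlen(str). */
static int tailmatch_ascii(const char *str, const char *substr,
                           size_t start, size_t end, int direction)
{
    size_t sub_len = strlen(substr);
    const char *at;

    if (end < start || end - start < sub_len)
        return 0;                       /* slice is too short to match */
    if (direction > 0)
        at = str + end - sub_len;       /* suffix match: align to slice end */
    else
        at = str + start;               /* prefix match: align to slice start */
    return memcmp(at, substr, sub_len) == 0;
}
```

The slice bounds matter: with ``end`` set to 5, the string ``"hello world"`` ends in ``"hello"`` as far as a suffix match is concerned.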
				862
				863
				864	.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				865
				866	Return the first position of substr in str[start:end] using the given
				867	direction (direction == 1 means to do a forward search, direction == -1 a
				868	backward search). The return value is the index of the first match; a value of
				869	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				870	occurred and an exception has been set.
				871
				872
				873	.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				874
				875	Return the number of non-overlapping occurrences of substr in
				876	``str[start:end]``. Return ``-1`` if an error occurred.
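
"Non-overlapping" means that the search resumes after the end of each match rather than one character later. A plain-C sketch on byte strings follows; ``count_nonoverlapping`` is a hypothetical helper, it ignores the start and end bounds, and it simply returns 0 for an empty ``substr``, which is a simplification of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of the Python C API: count non-overlapping
 * occurrences of `substr` in `str`.
 * SIMPLIFICATION: an empty `substr` yields 0 in this sketch. */
static size_t count_nonoverlapping(const char *str, const char *substr)
{
    size_t sub_len = strlen(substr);
    size_t count = 0;

    if (sub_len == 0)
        return 0;
    /* After each hit, continue searching just past the matched text. */
    for (const char *p = strstr(str, substr); p != NULL;
         p = strstr(p + sub_len, substr))
        count++;
    return count;
}
```

For example, ``"aaaa"`` contains ``"aa"`` twice under this rule, not three times.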
				877
				878
				879	.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				880
				881	Replace at most maxcount occurrences of substr in str with replstr and
				882	return the resulting Unicode object. maxcount == -1 means replace all
				883	occurrences.
				884
				885
				886	.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				887
				888	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				889	respectively.
				890
				891
Benjamin Peterson	c22ed14	2008-07-01 19:12:34 +0000	[diff] [blame]	892	.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)
				893
				894	Compare a unicode object, uni, with string and return -1, 0, 1 for less
				895	than, equal, and greater than, respectively.
				896
				897
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	898	.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				899
				900	Rich compare two unicode strings and return one of the following:
				901
				902	* ``NULL`` in case an exception was raised
				903	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				904	* :const:`Py_NotImplemented` in case the type combination is unknown
				905
				906	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				907	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				908	with a :exc:`UnicodeDecodeError`.
				909
				910	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				911	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				912
				913
				914	.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				915
				916	Return a new string object from format and args; this is analogous to
				917	``format % args``. The args argument must be a tuple.
				918
				919
				920	.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				921
				922	Check whether element is contained in container and return true or false
				923	accordingly.
				924
				925	element has to coerce to a one element Unicode string. ``-1`` is returned if
				926	there was an error.
				927
				928
				929	.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
				930
				931	Intern the argument ``*string`` in place. The argument must be the address of a
				932	pointer variable pointing to a Python unicode string object. If there is an
				933	existing interned string that is the same as ``*string``, it sets ``*string`` to
				934	it (decrementing the reference count of the old string object and incrementing
				935	the reference count of the interned string object), otherwise it leaves
				936	``*string`` alone and interns it (incrementing its reference count).
				937	(Clarification: even though there is a lot of talk about reference counts, think
				938	of this function as reference-count-neutral; you own the object after the call
				939	if and only if you owned it before the call.)
				940
				941
				942	.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
				943
				944	A combination of :cfunc:`PyUnicode_FromString` and
				945	:cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
				946	that has been interned, or a new ("owned") reference to an earlier interned
				947	string object with the same value.
				948