Blame - Doc/c-api/unicode.rst - platform/external/python/cpython3

Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	1	.. highlightlang:: c
				2
				3	.. _unicodeobjects:
				4
				5	Unicode Objects and Codecs
				6	--------------------------
				7
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9
				10	Unicode Objects
				11	^^^^^^^^^^^^^^^
				12
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	13	Unicode Type
				14	""""""""""""
				15
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	16	These are the basic Unicode object types used for the Unicode implementation in
				17	Python:
				18
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	19
				20	.. ctype:: Py_UNICODE
				21
				22	This type represents the storage type which is used by Python internally as
				23	basis for holding Unicode ordinals. Python's default builds use a 16-bit type
				24	for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
				25	possible to build a UCS4 version of Python (most recent Linux distributions come
				26	with UCS4 builds of Python). These builds then use a 32-bit type for
				27	:ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
				28	where :ctype:`wchar_t` is available and compatible with the chosen Python
				29	Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
				30	:ctype:`wchar_t` to enhance native platform compatibility. On all other
				31	platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
				32	short` (UCS2) or :ctype:`unsigned long` (UCS4).
				33
				34	Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
				35	this in mind when writing extensions or interfaces.
				36
				37
				38	.. ctype:: PyUnicodeObject
				39
				40	This subtype of :ctype:`PyObject` represents a Python Unicode object.
				41
				42
				43	.. cvar:: PyTypeObject PyUnicode_Type
				44
				45	This instance of :ctype:`PyTypeObject` represents the Python Unicode type. It
				46	is exposed to Python code as ``str``.
				47
				48	The following APIs are really C macros and can be used to do fast checks and to
				49	access internal read-only data of Unicode objects:
				50
				51
				52	.. cfunction:: int PyUnicode_Check(PyObject *o)
				53
				54	Return true if the object o is a Unicode object or an instance of a Unicode
				55	subtype.
				56
				57
				58	.. cfunction:: int PyUnicode_CheckExact(PyObject *o)
				59
				60	Return true if the object o is a Unicode object, but not an instance of a
				61	subtype.
				62
				63
				64	.. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
				65
				66	Return the size of the object. o has to be a :ctype:`PyUnicodeObject` (not
				67	checked).
				68
				69
				70	.. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
				71
				72	Return the size of the object's internal buffer in bytes. o has to be a
				73	:ctype:`PyUnicodeObject` (not checked).
				74
				75
				76	.. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
				77
				78	Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object. o
				79	has to be a :ctype:`PyUnicodeObject` (not checked).
				80
				81
				82	.. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
				83
				84	Return a pointer to the internal buffer of the object. o has to be a
				85	:ctype:`PyUnicodeObject` (not checked).
				86
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	87
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame]	88	.. cfunction:: int PyUnicode_ClearFreeList()
Christian Heimes	a156e09	2008-02-16 07:38:31 +0000	[diff] [blame]	89
				90	Clear the free list. Return the total number of freed items.
				91
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame]	92
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	93	Unicode Character Properties
				94	""""""""""""""""""""""""""""
				95
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	96	Unicode provides many different character properties. The most often needed ones
				97	are available through these macros which are mapped to C functions depending on
				98	the Python configuration.
				99
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	100
				101	.. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
				102
				103	Return 1 or 0 depending on whether ch is a whitespace character.
				104
				105
				106	.. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
				107
				108	Return 1 or 0 depending on whether ch is a lowercase character.
				109
				110
				111	.. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
				112
				113	Return 1 or 0 depending on whether ch is an uppercase character.
				114
				115
				116	.. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
				117
				118	Return 1 or 0 depending on whether ch is a titlecase character.
				119
				120
				121	.. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
				122
				123	Return 1 or 0 depending on whether ch is a linebreak character.
				124
				125
				126	.. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
				127
				128	Return 1 or 0 depending on whether ch is a decimal character.
				129
				130
				131	.. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
				132
				133	Return 1 or 0 depending on whether ch is a digit character.
				134
				135
				136	.. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
				137
				138	Return 1 or 0 depending on whether ch is a numeric character.
				139
				140
				141	.. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
				142
				143	Return 1 or 0 depending on whether ch is an alphabetic character.
				144
				145
				146	.. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
				147
				148	Return 1 or 0 depending on whether ch is an alphanumeric character.
				149
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	150
				151	.. cfunction:: int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)
				152
				153	Return 1 or 0 depending on whether ch is a printable character.
				154	Nonprintable characters are those characters defined in the Unicode character
				155	database as "Other" or "Separator", excepting the ASCII space (0x20) which is
				156	considered printable. (Note that printable characters in this context are
				157	those which should not be escaped when :func:`repr` is invoked on a string.
				158	It has no bearing on the handling of strings written to :data:`sys.stdout` or
				159	:data:`sys.stderr`.)
				160
				161
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	162	These APIs can be used for fast direct character conversions:
				163
				164
				165	.. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
				166
				167	Return the character ch converted to lower case.
				168
				169
				170	.. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
				171
				172	Return the character ch converted to upper case.
				173
				174
				175	.. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
				176
				177	Return the character ch converted to title case.
				178
				179
				180	.. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
				181
				182	Return the character ch converted to a decimal positive integer. Return
				183	``-1`` if this is not possible. This macro does not raise exceptions.
				184
				185
				186	.. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
				187
				188	Return the character ch converted to a single digit integer. Return ``-1`` if
				189	this is not possible. This macro does not raise exceptions.
				190
				191
				192	.. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
				193
				194	Return the character ch converted to a double. Return ``-1.0`` if this is not
				195	possible. This macro does not raise exceptions.
				196
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	197
				198	Plain Py_UNICODE
				199	""""""""""""""""
				200
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	201	To create Unicode objects and access their basic sequence properties, use these
				202	APIs:
				203
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	204
				205	.. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
				206
				207	Create a Unicode Object from the Py_UNICODE buffer u of the given size. u
				208	may be NULL which causes the contents to be undefined. It is the user's
				209	responsibility to fill in the needed data. The buffer is copied into the new
				210	object. If the buffer is not NULL, the return value might be a shared object.
				211	Therefore, modification of the resulting Unicode object is only allowed when u
				212	is NULL.
				213
				214
				215	.. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
				216
				217	Create a Unicode Object from the char buffer u. The bytes will be interpreted
				218	as being UTF-8 encoded. u may also be NULL which
				219	causes the contents to be undefined. It is the user's responsibility to fill in
				220	the needed data. The buffer is copied into the new object. If the buffer is not
				221	NULL, the return value might be a shared object. Therefore, modification of
				222	the resulting Unicode object is only allowed when u is NULL.
				223
				224
				225	.. cfunction:: PyObject* PyUnicode_FromString(const char *u)
				226
				227	Create a Unicode object from a UTF-8 encoded null-terminated char buffer
				228	u.
				229
				230
				231	.. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
				232
				233	Take a C :cfunc:`printf`\ -style format string and a variable number of
				234	arguments, calculate the size of the resulting Python unicode string and return
				235	a string with the values formatted into it. The variable arguments must be C
				236	types and must correspond exactly to the format characters in the format
				237	string. The following format characters are allowed:
				238
				239	.. % The descriptions for %zd and %zu are wrong, but the truth is complicated
				240	.. % because not all compilers support the %z width modifier -- we fake it
				241	.. % when necessary via interpolating PY_FORMAT_SIZE_T.
				242
				243	+-------------------+---------------------+--------------------------------+
				244	| Format Characters | Type                | Comment                        |
				245	+===================+=====================+================================+
				246	| :attr:`%%`        | n/a                 | The literal % character.       |
				247	+-------------------+---------------------+--------------------------------+
				248	| :attr:`%c`        | int                 | A single character,            |
				249	|                   |                     | represented as a C int.        |
				250	+-------------------+---------------------+--------------------------------+
				251	| :attr:`%d`        | int                 | Exactly equivalent to          |
				252	|                   |                     | ``printf("%d")``.              |
				253	+-------------------+---------------------+--------------------------------+
				254	| :attr:`%u`        | unsigned int        | Exactly equivalent to          |
				255	|                   |                     | ``printf("%u")``.              |
				256	+-------------------+---------------------+--------------------------------+
				257	| :attr:`%ld`       | long                | Exactly equivalent to          |
				258	|                   |                     | ``printf("%ld")``.             |
				259	+-------------------+---------------------+--------------------------------+
				260	| :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
				261	|                   |                     | ``printf("%lu")``.             |
				262	+-------------------+---------------------+--------------------------------+
				263	| :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
				264	|                   |                     | ``printf("%zd")``.             |
				265	+-------------------+---------------------+--------------------------------+
				266	| :attr:`%zu`       | size_t              | Exactly equivalent to          |
				267	|                   |                     | ``printf("%zu")``.             |
				268	+-------------------+---------------------+--------------------------------+
				269	| :attr:`%i`        | int                 | Exactly equivalent to          |
				270	|                   |                     | ``printf("%i")``.              |
				271	+-------------------+---------------------+--------------------------------+
				272	| :attr:`%x`        | int                 | Exactly equivalent to          |
				273	|                   |                     | ``printf("%x")``.              |
				274	+-------------------+---------------------+--------------------------------+
				275	| :attr:`%s`        | char\*              | A null-terminated C character  |
				276	|                   |                     | array.                         |
				277	+-------------------+---------------------+--------------------------------+
				278	| :attr:`%p`        | void\*              | The hex representation of a C  |
				279	|                   |                     | pointer. Mostly equivalent to  |
				280	|                   |                     | ``printf("%p")`` except that   |
				281	|                   |                     | it is guaranteed to start with |
				282	|                   |                     | the literal ``0x`` regardless  |
				283	|                   |                     | of what the platform's         |
				284	|                   |                     | ``printf`` yields.             |
				285	+-------------------+---------------------+--------------------------------+
Georg Brandl	559e5d7	2008-06-11 18:37:52 +0000	[diff] [blame]	286	| :attr:`%A`        | PyObject\*          | The result of calling          |
				287	|                   |                     | :func:`ascii`.                 |
				288	+-------------------+---------------------+--------------------------------+
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	289	| :attr:`%U`        | PyObject\*          | A unicode object.              |
				290	+-------------------+---------------------+--------------------------------+
				291	| :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
				292	|                   |                     | NULL) and a null-terminated    |
				293	|                   |                     | C character array as a second  |
				294	|                   |                     | parameter (which will be used, |
				295	|                   |                     | if the first parameter is      |
				296	|                   |                     | NULL).                         |
				297	+-------------------+---------------------+--------------------------------+
				298	| :attr:`%S`        | PyObject\*          | The result of calling          |
Benjamin Peterson	e866206	2009-03-08 23:51:13 +0000	[diff] [blame]	299	|                   |                     | :func:`PyObject_Str`.          |
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	300	+-------------------+---------------------+--------------------------------+
				301	| :attr:`%R`        | PyObject\*          | The result of calling          |
				302	|                   |                     | :func:`PyObject_Repr`.         |
				303	+-------------------+---------------------+--------------------------------+
				304
				305	An unrecognized format character causes all the rest of the format string to be
				306	copied as-is to the result string, and any extra arguments discarded.
				307
				308
				309	.. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
				310
				311	Identical to :func:`PyUnicode_FromFormat` except that it takes exactly two
				312	arguments.
				313
				314
				315	.. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
				316
				317	Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
				318	buffer, NULL if unicode is not a Unicode object.
				319
				320
				321	.. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
				322
				323	Return the length of the Unicode object.
				324
				325
				326	.. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
				327
				328	Coerce an encoded object obj to a Unicode object and return a reference with
				329	incremented refcount.
				330
Georg Brandl	c7b6908	2010-10-06 08:08:40 +0000	[diff] [blame]	331	:class:`bytes`, :class:`bytearray` and other char buffer compatible objects
				332	are decoded according to the given encoding and using the error handling
				333	defined by errors. Both can be NULL to have the interface use the default
				334	values (see the next section for details).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	335
				336	All other objects, including Unicode objects, cause a :exc:`TypeError` to be
				337	set.
				338
				339	The API returns NULL if there was an error. The caller is responsible for
				340	decref'ing the returned objects.
				341
				342
				343	.. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
				344
				345	Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
				346	throughout the interpreter whenever coercion to Unicode is needed.
				347
				348	If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
				349	Python can interface directly to this type using the following functions.
				350	Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
				351	the system's :ctype:`wchar_t`.
				352
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	353
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	354	File System Encoding
				355	""""""""""""""""""""
				356
				357	To encode and decode file names and other environment strings,
				358	:cdata:`Py_FileSystemDefaultEncoding` should be used as the encoding, and
				359	``"surrogateescape"`` should be used as the error handler (:pep:`383`). To
				360	encode file names during argument parsing, the ``"O&"`` converter should be
Georg Brandl	4b05466	2010-10-06 08:56:53 +0000	[diff] [blame^]	361	used, passing :cfunc:`PyUnicode_FSConverter` as the conversion function:
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	362
				363	.. cfunction:: int PyUnicode_FSConverter(PyObject* obj, void* result)
				364
				365	Convert obj into result, using :cdata:`Py_FileSystemDefaultEncoding`
				366	and the ``"surrogateescape"`` error handler. result must be a
				367	``PyObject*``; on success it holds a :class:`bytes` object, which must be
				368	released when it is no longer used.
				369
				370	.. versionadded:: 3.1
				371
Georg Brandl	23b4f92	2010-10-06 08:43:56 +0000	[diff] [blame]	372
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	373	.. cfunction:: PyObject* PyUnicode_DecodeFSDefaultAndSize(const char *s, Py_ssize_t size)
				374
				375	Decode size bytes of the string s using :cdata:`Py_FileSystemDefaultEncoding`
				376	and the ``"surrogateescape"`` error handler.
				377
				378	If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.
				379
				380	Use this function instead of :cfunc:`PyUnicode_DecodeFSDefault` if you know the string length.
				381
				382	.. cfunction:: PyObject* PyUnicode_DecodeFSDefault(const char *s)
				383
				384	Decode a null-terminated string using :cdata:`Py_FileSystemDefaultEncoding` and
				385	the ``"surrogateescape"`` error handler.
				386
				387	If :cdata:`Py_FileSystemDefaultEncoding` is not set, fall back to UTF-8.
				388
				389
				390	wchar_t Support
				391	"""""""""""""""
				392
				393	wchar_t support for platforms which support it:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	394
				395	.. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
				396
				397	Create a Unicode object from the :ctype:`wchar_t` buffer w of the given size.
Martin v. Löwis	790465f	2008-04-05 20:41:37 +0000	[diff] [blame]	398	Passing -1 as the size indicates that the function must itself compute the length,
				399	using wcslen.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	400	Return NULL on failure.
				401
				402
				403	.. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
				404
				405	Copy the Unicode object contents into the :ctype:`wchar_t` buffer w. At most
				406	size :ctype:`wchar_t` characters are copied (excluding a possibly trailing
				407	0-termination character). Return the number of :ctype:`wchar_t` characters
				408	copied or -1 in case of an error. Note that the resulting :ctype:`wchar_t`
				409	string may or may not be 0-terminated. It is the responsibility of the caller
				410	to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
				411	required by the application.
				412
				413
				414	.. _builtincodecs:
				415
				416	Built-in Codecs
				417	^^^^^^^^^^^^^^^
				418
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame]	419	Python provides a set of built-in codecs which are written in C for speed. All of
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	420	these codecs are directly usable via the following functions.
				421
				422	Many of the following APIs take two arguments encoding and errors. These
				423	parameters encoding and errors have the same semantics as the ones of the
Daniel Stutzbach	23ef20f	2010-09-03 18:37:34 +0000	[diff] [blame]	424	built-in :func:`str` string object constructor.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	425
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	426	Setting encoding to NULL causes the default encoding to be used,
				427	which is UTF-8. The file system calls should use
				428	:cfunc:`PyUnicode_FSConverter` for encoding file names. This uses the
				429	variable :cdata:`Py_FileSystemDefaultEncoding` internally. This
				430	variable should be treated as read-only: On some systems, it will be a
				431	pointer to a static string, on others, it will change at run-time
				432	(such as when the application invokes setlocale).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	433
				434	Error handling is set by errors which may also be set to NULL meaning to use
				435	the default handling defined for the codec. Default error handling for all
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame]	436	built-in codecs is "strict" (:exc:`ValueError` is raised).
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	437
				438	The codecs all use a similar interface. For simplicity, only deviations from
				439	the following generic ones are documented.
				440
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	441
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	442	Generic Codecs
				443	""""""""""""""
				444
				445	These are the generic codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	446
				447
				448	.. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
				449
				450	Create a Unicode object by decoding size bytes of the encoded string s.
				451	encoding and errors have the same meaning as the parameters of the same name
Georg Brandl	c5605df	2009-08-13 08:26:44 +0000	[diff] [blame]	452	in the built-in :func:`str` constructor. The codec to be used is looked up
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	453	using the Python codec registry. Return NULL if an exception was raised by
				454	the codec.
				455
				456
				457	.. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
				458
				459	Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	460	bytes object. encoding and errors have the same meaning as the
				461	parameters of the same name in the Unicode :meth:`encode` method. The codec
				462	to be used is looked up using the Python codec registry. Return NULL if an
				463	exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	464
				465
				466	.. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
				467
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	468	Encode a Unicode object and return the result as a Python bytes object.
				469	encoding and errors have the same meaning as the parameters of the same
				470	name in the Unicode :meth:`encode` method. The codec to be used is looked up
				471	using the Python codec registry. Return NULL if an exception was raised by
				472	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	473
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	474
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	475	UTF-8 Codecs
				476	""""""""""""
				477
				478	These are the UTF-8 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	479
				480
				481	.. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
				482
				483	Create a Unicode object by decoding size bytes of the UTF-8 encoded string
				484	s. Return NULL if an exception was raised by the codec.
				485
				486
				487	.. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
				488
				489	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
				490	consumed is not NULL, trailing incomplete UTF-8 byte sequences will not be
				491	treated as an error. Those bytes will not be decoded and the number of bytes
				492	that have been decoded will be stored in consumed.
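
To make the notion of a trailing incomplete sequence concrete, the following
plain C sketch computes how many leading bytes of a buffer form complete UTF-8
sequences. It does not use the Python API and is not CPython's implementation;
the helper name ``utf8_complete_prefix`` is hypothetical, and the input is
assumed to be otherwise well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: return the number of leading bytes of s that do not
   belong to a truncated multi-byte sequence at the very end of the buffer. */
static size_t
utf8_complete_prefix(const unsigned char *s, size_t size)
{
    size_t back;

    /* The lead byte of the last sequence is at most 3 bytes from the end
       if that sequence is incomplete. */
    for (back = 1; back <= 3 && back <= size; back++) {
        unsigned char c = s[size - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                   /* continuation byte, keep looking */
        if ((c & 0xE0) == 0xC0)
            need = 2;                   /* 110xxxxx: 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            need = 3;                   /* 1110xxxx: 3-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            need = 4;                   /* 11110xxx: 4-byte sequence */
        else
            need = 1;                   /* ASCII (or an invalid lead byte) */
        /* Cut before the lead byte if the sequence is truncated. */
        return need > back ? size - back : size;
    }
    return size;
}
```

For the bytes ``'a' 0xE2 0x82`` (the first two bytes of a three-byte
sequence), the helper returns 1, which is the kind of value a stateful decoder
would report through consumed.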
				493
				494
				495	.. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				496
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	497	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and
				498	return a Python bytes object. Return NULL if an exception was raised by
				499	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	500
				501
				502	.. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
				503
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	504	Encode a Unicode object using UTF-8 and return the result as a Python bytes
				505	object. Error handling is "strict". Return NULL if an exception was
				506	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	507
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	508
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	509	UTF-32 Codecs
				510	"""""""""""""
				511
				512	These are the UTF-32 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	513
				514
				515	.. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				516
				517	Decode size bytes from a UTF-32 encoded buffer string and return the
				518	corresponding Unicode object. errors (if non-NULL) defines the error
				519	handling. It defaults to "strict".
				520
				521	If byteorder is non-NULL, the decoder starts decoding using the given byte
				522	order::
				523
				524	*byteorder == -1: little endian
				525	*byteorder == 0: native order
				526	*byteorder == 1: big endian
				527
Benjamin Peterson	f3d7dbe	2009-10-04 14:54:52 +0000	[diff] [blame]	528	If ``*byteorder`` is zero, and the first four bytes of the input data are a
				529	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				530	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				531	``1``, any byte order mark is copied to the output.
				532
				533	After completion, ``*byteorder`` is set to the current byte order at the end
				534	of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	535
				536	In a narrow build codepoints outside the BMP will be decoded as surrogate pairs.
				537
				538	If byteorder is NULL, the codec starts in native order mode.
				539
				540	Return NULL if an exception was raised by the codec.
				541
				542
				543	.. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				544
				545	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
				546	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
				547	trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
				548	by four) as an error. Those bytes will not be decoded and the number of bytes
				549	that have been decoded will be stored in consumed.
				550
				551
				552	.. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				553
				554	Return a Python bytes object holding the UTF-32 encoded value of the Unicode
Benjamin Peterson	f3d7dbe	2009-10-04 14:54:52 +0000	[diff] [blame]	555	data in s. Output is written according to the following byte order::
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	556
				557	byteorder == -1: little endian
				558	byteorder == 0: native byte order (writes a BOM mark)
				559	byteorder == 1: big endian
				560
				561	If byteorder is ``0``, the output string will always start with the Unicode BOM
				562	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				563
				564	If Py_UNICODE_WIDE is not defined, surrogate pairs will be output
				565	as a single codepoint.
				566
				567	Return NULL if an exception was raised by the codec.
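
The byte order rules above can be sketched in plain C, independent of the
Python API. The helpers below are hypothetical and assume the input is already
an array of code points (that is, no surrogate pairs need to be combined);
they only illustrate how byteorder selects the byte layout and when a BOM is
written.

```c
#include <stddef.h>
#include <stdint.h>

/* Report whether the machine stores the low byte first. */
static int
native_is_little(void)
{
    const uint16_t one = 1;
    return *(const unsigned char *)&one == 1;
}

/* Store one 32-bit value in the requested byte order. */
static void
put_u32(unsigned char *out, uint32_t value, int little)
{
    int i;
    for (i = 0; i < 4; i++) {
        int shift = little ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)((value >> shift) & 0xFF);
    }
}

/* Hypothetical encoder: write n code points as UTF-32 and return the number
   of bytes written. out must have room for 4 * (n + 1) bytes. */
static size_t
encode_utf32(const uint32_t *cps, size_t n, int byteorder, unsigned char *out)
{
    int little = byteorder < 0 || (byteorder == 0 && native_is_little());
    size_t pos = 0, i;

    if (byteorder == 0) {               /* native order: prepend U+FEFF */
        put_u32(out, 0xFEFF, little);
        pos = 4;
    }
    for (i = 0; i < n; i++, pos += 4)
        put_u32(out + pos, cps[i], little);
    return pos;
}
```

With byteorder set to ``1``, the code point U+0041 is written as the bytes
``00 00 00 41``; with ``-1`` it is written as ``41 00 00 00``.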
				568
				569
				570	.. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
				571
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	572	Return a Python byte string using the UTF-32 encoding in native byte
				573	order. The string always starts with a BOM mark. Error handling is "strict".
				574	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	575
				576
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	577	UTF-16 Codecs
				578	"""""""""""""
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	579
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	580	These are the UTF-16 codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	581
				582
				583	.. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
				584
				585	Decode size bytes from a UTF-16 encoded buffer string and return the
				586	corresponding Unicode object. errors (if non-NULL) defines the error
				587	handling. It defaults to "strict".
				588
				589	If byteorder is non-NULL, the decoder starts decoding using the given byte
				590	order::
				591
				592	*byteorder == -1: little endian
				593	*byteorder == 0: native order
				594	*byteorder == 1: big endian
				595
Benjamin Peterson	f3d7dbe	2009-10-04 14:54:52 +0000	[diff] [blame]	596	If ``*byteorder`` is zero, and the first two bytes of the input data are a
				597	byte order mark (BOM), the decoder switches to this byte order and the BOM is
				598	not copied into the resulting Unicode string. If ``*byteorder`` is ``-1`` or
				599	``1``, any byte order mark is copied to the output (where it will result in
				600	either a ``\ufeff`` or a ``\ufffe`` character).
				601
				602	After completion, ``*byteorder`` is set to the current byte order at the end
				603	of input data.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	604
				605	If byteorder is NULL, the codec starts in native order mode.
				606
				607	Return NULL if an exception was raised by the codec.
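
The BOM handling for ``*byteorder == 0`` can be pictured with the following
plain C sketch. It is not the decoder itself: the helper name
``utf16_detect_bom`` is hypothetical, and it only reports which byte order a
leading BOM selects and how many bytes would be skipped.

```c
#include <stddef.h>

/* Hypothetical helper: inspect the first two bytes of a UTF-16 buffer.
   Returns -1 (little endian), 1 (big endian) or 0 (no BOM found), using
   the same values as the byteorder argument. *skip receives the number
   of BOM bytes that would not be copied to the output. */
static int
utf16_detect_bom(const unsigned char *s, size_t size, size_t *skip)
{
    *skip = 0;
    if (size >= 2) {
        if (s[0] == 0xFF && s[1] == 0xFE) {     /* U+FEFF, low byte first */
            *skip = 2;
            return -1;
        }
        if (s[0] == 0xFE && s[1] == 0xFF) {     /* U+FEFF, high byte first */
            *skip = 2;
            return 1;
        }
    }
    return 0;                                   /* stay in native order */
}
```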
				608
				609
				610	.. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
				611
				612	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
				613	consumed is not NULL, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
				614	trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
				615	split surrogate pair) as an error. Those bytes will not be decoded and the
				616	number of bytes that have been decoded will be stored in consumed.
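
The two cases named above (an odd number of bytes and a split surrogate pair)
can be sketched in plain C as follows. This is an illustration only, not
CPython's implementation; the helper name ``utf16_complete_prefix`` is
hypothetical, and byteorder is assumed to be already resolved to ``-1``
(little endian) or ``1`` (big endian).

```c
#include <stddef.h>

/* Hypothetical helper: return how many leading bytes of a UTF-16 buffer
   can be decoded without cutting a code unit or a surrogate pair. */
static size_t
utf16_complete_prefix(const unsigned char *s, size_t size, int byteorder)
{
    size_t n = size - (size % 2);       /* drop a trailing odd byte */

    if (n >= 2) {
        /* Reassemble the last complete 16-bit unit. */
        unsigned int hi = byteorder < 0 ? s[n - 1] : s[n - 2];
        unsigned int lo = byteorder < 0 ? s[n - 2] : s[n - 1];
        unsigned int unit = (hi << 8) | lo;

        /* A high surrogate at the end still waits for its low half. */
        if (unit >= 0xD800 && unit <= 0xDBFF)
            n -= 2;
    }
    return n;
}
```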
				617
				618
				619	.. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
				620
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	621	Return a Python bytes object holding the UTF-16 encoded value of the Unicode
Benjamin Peterson	f3d7dbe	2009-10-04 14:54:52 +0000	[diff] [blame]	622	data in s. Output is written according to the following byte order::
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	623
				624	byteorder == -1: little endian
				625	byteorder == 0: native byte order (writes a BOM mark)
				626	byteorder == 1: big endian
				627
				628	If byteorder is ``0``, the output string will always start with the Unicode BOM
				629	mark (U+FEFF). In the other two modes, no BOM mark is prepended.
				630
If Py_UNICODE_WIDE is defined, a single :ctype:`Py_UNICODE` value may get
represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
value is interpreted as a UCS-2 character.
				634
				635	Return NULL if an exception was raised by the codec.
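
The surrogate pair representation mentioned above follows the standard UTF-16
arithmetic. The sketch below is a standalone illustration with a hypothetical
helper name, not the encoder itself. Rejecting surrogate code points is an
assumption of this sketch and not a statement about the codec's error
handling.

```c
/* Hypothetical helper: encode one code point as UTF-16 code units.
 * Returns the number of units written to out (1 or 2), or 0 if this
 * sketch considers cp unencodable (a surrogate value or a value beyond
 * U+10FFFF). */
static int
sketch_utf16_units(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10FFFFUL || (cp >= 0xD800UL && cp <= 0xDFFFUL))
        return 0;
    if (cp < 0x10000UL) {
        out[0] = (unsigned short)cp;    /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000UL;                    /* 20 bits remain */
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));     /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the two units ``0xD83D`` and ``0xDE00``.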
				636
				637
				638	.. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
				639
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	640	Return a Python byte string using the UTF-16 encoding in native byte
				641	order. The string always starts with a BOM mark. Error handling is "strict".
				642	Return NULL if an exception was raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	643
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	644
Georg Brandl	4009c9e	2010-10-06 08:26:09 +0000	[diff] [blame]	645	UTF-7 Codecs
				646	""""""""""""
				647
				648	These are the UTF-7 codec APIs:
				649
				650
.. cfunction:: PyObject* PyUnicode_DecodeUTF7(const char *s, Py_ssize_t size, const char *errors)
				652
				653	Create a Unicode object by decoding size bytes of the UTF-7 encoded string
				654	s. Return NULL if an exception was raised by the codec.
				655
				656
.. cfunction:: PyObject* PyUnicode_DecodeUTF7Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
Georg Brandl	4009c9e	2010-10-06 08:26:09 +0000	[diff] [blame]	658
				659	If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeUTF7`. If
				660	consumed is not NULL, trailing incomplete UTF-7 base-64 sections will not
				661	be treated as an error. Those bytes will not be decoded and the number of
				662	bytes that have been decoded will be stored in consumed.
				663
				664
.. cfunction:: PyObject* PyUnicode_EncodeUTF7(const Py_UNICODE *s, Py_ssize_t size, int base64SetO, int base64WhiteSpace, const char *errors)
				666
				667	Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-7 and
				668	return a Python bytes object. Return NULL if an exception was raised by
				669	the codec.
				670
				671	If base64SetO is nonzero, "Set O" (punctuation that has no otherwise
				672	special meaning) will be encoded in base-64. If base64WhiteSpace is
				673	nonzero, whitespace will be encoded in base-64. Both are set to zero for the
				674	Python "utf-7" codec.
				675
				676
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	677	Unicode-Escape Codecs
				678	"""""""""""""""""""""
				679
				680	These are the "Unicode Escape" codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	681
				682
.. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				684
				685	Create a Unicode object by decoding size bytes of the Unicode-Escape encoded
				686	string s. Return NULL if an exception was raised by the codec.
				687
				688
				689	.. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
				690
Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
return a Python bytes object. Return NULL if an exception was raised by the
codec.
				694
				695
				696	.. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
				697
Encode a Unicode object using Unicode-Escape and return the result as a Python
bytes object. Error handling is "strict". Return NULL if an exception was
raised by the codec.
				701
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	702
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	703	Raw-Unicode-Escape Codecs
				704	"""""""""""""""""""""""""
				705
				706	These are the "Raw Unicode Escape" codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	707
				708
.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
				710
				711	Create a Unicode object by decoding size bytes of the Raw-Unicode-Escape
				712	encoded string s. Return NULL if an exception was raised by the codec.
				713
				714
.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				716
Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
and return a Python bytes object. Return NULL if an exception was raised by
the codec.
				720
				721
				722	.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
				723
Encode a Unicode object using Raw-Unicode-Escape and return the result as a
Python bytes object. Error handling is "strict". Return NULL if an exception
was raised by the codec.
				727
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	728
				729	Latin-1 Codecs
				730	""""""""""""""
				731
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	732	These are the Latin-1 codec APIs: Latin-1 corresponds to the first 256 Unicode
				733	ordinals and only these are accepted by the codecs during encoding.
				734
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	735
.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
				737
				738	Create a Unicode object by decoding size bytes of the Latin-1 encoded string
				739	s. Return NULL if an exception was raised by the codec.
				740
				741
.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				743
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	744	Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and
				745	return a Python bytes object. Return NULL if an exception was raised by
				746	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	747
				748
				749	.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
				750
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	751	Encode a Unicode object using Latin-1 and return the result as Python bytes
				752	object. Error handling is "strict". Return NULL if an exception was
				753	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	754
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	755
				756	ASCII Codecs
				757	""""""""""""
				758
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	759	These are the ASCII codec APIs. Only 7-bit ASCII data is accepted. All other
				760	codes generate errors.
				761
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	762
.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
				764
				765	Create a Unicode object by decoding size bytes of the ASCII encoded string
				766	s. Return NULL if an exception was raised by the codec.
				767
				768
.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				770
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	771	Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and
				772	return a Python bytes object. Return NULL if an exception was raised by
				773	the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	774
				775
				776	.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
				777
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	778	Encode a Unicode object using ASCII and return the result as Python bytes
				779	object. Error handling is "strict". Return NULL if an exception was
				780	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	781
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	782
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	783	Character Map Codecs
				784	""""""""""""""""""""
				785
				786	These are the mapping codec APIs:
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	787
				788	This codec is special in that it can be used to implement many different codecs
				789	(and this is in fact what was done to obtain most of the standard codecs
				790	included in the :mod:`encodings` package). The codec uses mapping to encode and
				791	decode characters.
				792
				793	Decoding mappings must map single string characters to single Unicode
				794	characters, integers (which are then interpreted as Unicode ordinals) or None
				795	(meaning "undefined mapping" and causing an error).
				796
				797	Encoding mappings must map single Unicode characters to single string
				798	characters, integers (which are then interpreted as Latin-1 ordinals) or None
				799	(meaning "undefined mapping" and causing an error).
				800
The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.
				803
If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value is interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain entries
for characters that map to different code points.
				808
				809
.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				811
Create a Unicode object by decoding size bytes of the encoded string s using
the given mapping object. Return NULL if an exception was raised by the
codec. If mapping is NULL, Latin-1 decoding is done. Otherwise it can be a
dictionary that maps byte values, or a Unicode string, which is treated as a
lookup table. Byte values greater than the length of the string and U+FFFE
"characters" are treated as "undefined mapping".
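
The lookup table case can be pictured with the following standalone sketch.
It is not the codec's implementation; the helper name is hypothetical and the
table is modelled as a plain array of 16-bit code points. Since table indexes
start at zero, the sketch also treats a byte value equal to the table length
as undefined.

```c
#include <stddef.h>

/* Hypothetical helper: decode one byte through a lookup table of 16-bit
 * code points.  Returns 0 and stores the result in *out on success, or -1
 * for an "undefined mapping". */
static int
sketch_charmap_lookup(const unsigned short *table, size_t table_len,
                      unsigned char byte, unsigned short *out)
{
    /* Valid indexes are 0 .. table_len - 1. */
    if ((size_t)byte >= table_len)
        return -1;
    /* U+FFFE marks a slot that has no mapping. */
    if (table[byte] == 0xFFFE)
        return -1;
    *out = table[byte];
    return 0;
}
```

A real caller would turn the ``-1`` result into whatever the chosen errors
handler prescribes.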
				818
				819
.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
				821
Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
mapping object and return a Python bytes object. Return NULL if an
exception was raised by the codec.
				825
				826
.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
				828
Encode a Unicode object using the given mapping object and return the result
as a Python bytes object. Error handling is "strict". Return NULL if an
exception was raised by the codec.
				832
The following codec API is special in that it maps Unicode to Unicode.
				834
				835
.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
				837
				838	Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
				839	character mapping table to it and return the resulting Unicode object. Return
				840	NULL when an exception was raised by the codec.
				841
				842	The mapping table must map Unicode ordinal integers to Unicode ordinal
				843	integers or None (causing deletion of the character).
				844
				845	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				846	and sequences work well. Unmapped character ordinals (ones which cause a
				847	:exc:`LookupError`) are left untouched and are copied as-is.
				848

MBCS codecs for Windows
"""""""""""""""""""""""

These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions. Note that MBCS (or
DBCS) is a class of encodings, not just one. The target encoding is defined by
the user settings on the machine running the codec.

				859
.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
				861
				862	Create a Unicode object by decoding size bytes of the MBCS encoded string s.
				863	Return NULL if an exception was raised by the codec.
				864
				865
.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
				867
If consumed is NULL, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
consumed is not NULL, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
a trailing lead byte, and the number of bytes that have been decoded will be
stored in consumed.
				872
				873
.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
				875
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	876	Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return
				877	a Python bytes object. Return NULL if an exception was raised by the
				878	codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	879
				880
				881	.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
				882
Benjamin Peterson	b6eba4f	2009-01-13 23:14:04 +0000	[diff] [blame]	883	Encode a Unicode object using MBCS and return the result as Python bytes
				884	object. Error handling is "strict". Return NULL if an exception was
				885	raised by the codec.
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	886
Martin v. Löwis	c15bdef	2009-05-29 14:47:46 +0000	[diff] [blame]	887
Victor Stinner	9076f9e	2010-05-14 16:08:46 +0000	[diff] [blame]	888	Methods & Slots
				889	"""""""""""""""
Georg Brandl	54a3faa	2008-01-20 09:30:57 +0000	[diff] [blame]	890
				891
				892	.. _unicodemethodsandslots:
				893
				894	Methods and Slot Functions
				895	^^^^^^^^^^^^^^^^^^^^^^^^^^
				896
				897	The following APIs are capable of handling Unicode objects and strings on input
				898	(we refer to them as strings in the descriptions) and return Unicode objects or
				899	integers as appropriate.
				900
				901	They all return NULL or ``-1`` if an exception occurs.
				902
				903
.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

Concatenate two strings, giving a new Unicode string.
				907
				908
.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
				910
				911	Split a string giving a list of Unicode strings. If sep is NULL, splitting
				912	will be done at all whitespace substrings. Otherwise, splits occur at the given
				913	separator. At most maxsplit splits will be done. If negative, no limit is
				914	set. Separators are not included in the resulting list.
				915
				916
				917	.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
				918
Split a Unicode string at line breaks, returning a list of Unicode strings.
CRLF is considered to be one line break. If keepend is 0, the line break
characters are not included in the resulting strings.
				922
				923
.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
				925
				926	Translate a string by applying a character mapping table to it and return the
				927	resulting Unicode object.
				928
				929	The mapping table must map Unicode ordinal integers to Unicode ordinal integers
				930	or None (causing deletion of the character).
				931
				932	Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
				933	and sequences work well. Unmapped character ordinals (ones which cause a
				934	:exc:`LookupError`) are left untouched and are copied as-is.
				935
				936	errors has the usual meaning for codecs. It may be NULL which indicates to
				937	use the default error handling.
				938
				939
.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
				941
				942	Join a sequence of strings using the given separator and return the resulting
				943	Unicode string.
				944
				945
.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				947
				948	Return 1 if substr matches str[start:end] at the given tail end
				949	(direction == -1 means to do a prefix match, direction == 1 a suffix match),
				950	0 otherwise. Return ``-1`` if an error occurred.
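
The prefix and suffix semantics can be illustrated on plain C strings. The
helper below is a hypothetical standalone sketch that does not use the Unicode
API and performs no index clamping; the caller is assumed to pass
``0 <= start <= end <= strlen(str)``.

```c
#include <string.h>

/* Hypothetical helper working on NUL-terminated byte strings.  Returns 1
 * if substr matches the slice str[start:end] at the requested end, and 0
 * otherwise.  The caller must ensure 0 <= start <= end <= strlen(str). */
static int
sketch_tailmatch(const char *str, const char *substr,
                 size_t start, size_t end, int direction)
{
    size_t sublen = strlen(substr);

    /* A substring longer than the slice can never match. */
    if (sublen > end - start)
        return 0;
    if (direction > 0)
        /* suffix match: compare against the end of the slice */
        return memcmp(str + end - sublen, substr, sublen) == 0;
    /* prefix match: compare against the start of the slice */
    return memcmp(str + start, substr, sublen) == 0;
}
```

Note that the slice bounds matter: ``"lo"`` is a suffix of the slice
``[0:5]`` of ``"hello world"`` even though it is not a suffix of the whole
string.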
				951
				952
.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
				954
				955	Return the first position of substr in str[start:end] using the given
				956	direction (direction == 1 means to do a forward search, direction == -1 a
				957	backward search). The return value is the index of the first match; a value of
				958	``-1`` indicates that no match was found, and ``-2`` indicates that an error
				959	occurred and an exception has been set.
				960
				961
.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
				963
				964	Return the number of non-overlapping occurrences of substr in
				965	``str[start:end]``. Return ``-1`` if an error occurred.
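
"Non-overlapping" means that the search resumes after the end of each match.
A minimal standalone sketch on plain C strings follows; the helper name is
hypothetical, and returning 0 for an empty substring is a simplification of
this sketch only.

```c
#include <string.h>

/* Hypothetical helper: count non-overlapping occurrences of substr in a
 * NUL-terminated string.  An empty substr yields 0 in this sketch. */
static long
sketch_count(const char *str, const char *substr)
{
    size_t sublen = strlen(substr);
    const char *p = str;
    long count = 0;

    if (sublen == 0)
        return 0;
    while ((p = strstr(p, substr)) != NULL) {
        count++;
        p += sublen;    /* skip past the match so matches cannot overlap */
    }
    return count;
}
```

Under this rule ``"aa"`` occurs twice in ``"aaaa"`` and once in ``"aaa"``.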
				966
				967
.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
				969
				970	Replace at most maxcount occurrences of substr in str with replstr and
				971	return the resulting Unicode object. maxcount == -1 means replace all
				972	occurrences.
				973
				974
.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
				976
				977	Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
				978	respectively.
				979
				980
.. cfunction:: int PyUnicode_CompareWithASCIIString(PyObject *uni, char *string)
				982
				983	Compare a unicode object, uni, with string and return -1, 0, 1 for less
				984	than, equal, and greater than, respectively.
				985
				986
.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
				988
				989	Rich compare two unicode strings and return one of the following:
				990
				991	* ``NULL`` in case an exception was raised
				992	* :const:`Py_True` or :const:`Py_False` for successful comparisons
				993	* :const:`Py_NotImplemented` in case the type combination is unknown
				994
				995	Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
				996	:exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
				997	with a :exc:`UnicodeDecodeError`.
				998
				999	Possible values for op are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
				1000	:const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
				1001
				1002
.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
				1004
				1005	Return a new string object from format and args; this is analogous to
				1006	``format % args``. The args argument must be a tuple.
				1007
				1008
.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
				1010
				1011	Check whether element is contained in container and return true or false
				1012	accordingly.
				1013
				1014	element has to coerce to a one element Unicode string. ``-1`` is returned if
				1015	there was an error.
				1016
				1017
				1018	.. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
				1019
Intern the argument ``*string`` in place. The argument must be the address of a
pointer variable pointing to a Python unicode string object. If there is an
existing interned string that is the same as ``*string``, it sets ``*string`` to
it (decrementing the reference count of the old string object and incrementing
the reference count of the interned string object), otherwise it leaves
``*string`` alone and interns it (incrementing its reference count).
				1026	(Clarification: even though there is a lot of talk about reference counts, think
				1027	of this function as reference-count-neutral; you own the object after the call
				1028	if and only if you owned it before the call.)
				1029
				1030
				1031	.. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
				1032
				1033	A combination of :cfunc:`PyUnicode_FromString` and
				1034	:cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
				1035	that has been interned, or a new ("owned") reference to an earlier interned
				1036	string object with the same value.
				1037