| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 1 | .. highlightlang:: c | 
|  | 2 |  | 
|  | 3 | .. _unicodeobjects: | 
|  | 4 |  | 
|  | 5 | Unicode Objects and Codecs | 
|  | 6 | -------------------------- | 
|  | 7 |  | 
|  | 8 | .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com> | 
|  | 9 |  | 
|  | 10 | Unicode Objects | 
|  | 11 | ^^^^^^^^^^^^^^^ | 
|  | 12 |  | 
|  | 13 |  | 
|  | 14 | These are the basic Unicode object types used for the Unicode implementation in | 
|  | 15 | Python: | 
|  | 16 |  | 
|  | 17 | .. % --- Unicode Type ------------------------------------------------------- | 
|  | 18 |  | 
|  | 19 |  | 
|  | 20 | .. ctype:: Py_UNICODE | 
|  | 21 |  | 
|  | 22 | This type represents the storage type which is used by Python internally as | 
|  | 23 | the basis for holding Unicode ordinals.  Python's default builds use a 16-bit type | 
|  | 24 | for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also | 
|  | 25 | possible to build a UCS4 version of Python (most recent Linux distributions come | 
|  | 26 | with UCS4 builds of Python). These builds then use a 32-bit type for | 
|  | 27 | :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms | 
|  | 28 | where :ctype:`wchar_t` is available and compatible with the chosen Python | 
|  | 29 | Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for | 
|  | 30 | :ctype:`wchar_t` to enhance native platform compatibility. On all other | 
|  | 31 | platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned | 
|  | 32 | short` (UCS2) or :ctype:`unsigned long` (UCS4). | 
|  | 33 |  | 
|  | 34 | Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep | 
|  | 35 | this in mind when writing extensions or interfaces. | 
|  | 36 |  | 
|  | 37 |  | 
|  | 38 | .. ctype:: PyUnicodeObject | 
|  | 39 |  | 
|  | 40 | This subtype of :ctype:`PyObject` represents a Python Unicode object. | 
|  | 41 |  | 
|  | 42 |  | 
|  | 43 | .. cvar:: PyTypeObject PyUnicode_Type | 
|  | 44 |  | 
|  | 45 | This instance of :ctype:`PyTypeObject` represents the Python Unicode type.  It | 
|  | 46 | is exposed to Python code as ``unicode`` and ``types.UnicodeType``. | 
|  | 47 |  | 
|  | 48 | The following APIs are really C macros and can be used to do fast checks and to | 
|  | 49 | access internal read-only data of Unicode objects: | 
|  | 50 |  | 
|  | 51 |  | 
|  | 52 | .. cfunction:: int PyUnicode_Check(PyObject *o) | 
|  | 53 |  | 
|  | 54 | Return true if the object *o* is a Unicode object or an instance of a Unicode | 
|  | 55 | subtype. | 
|  | 56 |  | 
|  | 57 | .. versionchanged:: 2.2 | 
|  | 58 | Allowed subtypes to be accepted. | 
|  | 59 |  | 
|  | 60 |  | 
|  | 61 | .. cfunction:: int PyUnicode_CheckExact(PyObject *o) | 
|  | 62 |  | 
|  | 63 | Return true if the object *o* is a Unicode object, but not an instance of a | 
|  | 64 | subtype. | 
|  | 65 |  | 
|  | 66 | .. versionadded:: 2.2 | 
|  | 67 |  | 
|  | 68 |  | 
|  | 69 | .. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o) | 
|  | 70 |  | 
|  | 71 | Return the size of the object.  *o* has to be a :ctype:`PyUnicodeObject` (not | 
|  | 72 | checked). | 
|  | 73 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 74 | .. versionchanged:: 2.5 | 
|  | 75 | This function returned an :ctype:`int` type. This might require changes | 
|  | 76 | in your code for properly supporting 64-bit systems. | 
|  | 77 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 78 |  | 
|  | 79 | .. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o) | 
|  | 80 |  | 
|  | 81 | Return the size of the object's internal buffer in bytes.  *o* has to be a | 
|  | 82 | :ctype:`PyUnicodeObject` (not checked). | 
|  | 83 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 84 | .. versionchanged:: 2.5 | 
|  | 85 | This function returned an :ctype:`int` type. This might require changes | 
|  | 86 | in your code for properly supporting 64-bit systems. | 
|  | 87 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 88 |  | 
|  | 89 | .. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o) | 
|  | 90 |  | 
|  | 91 | Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object.  *o* | 
|  | 92 | has to be a :ctype:`PyUnicodeObject` (not checked). | 
|  | 93 |  | 
|  | 94 |  | 
|  | 95 | .. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o) | 
|  | 96 |  | 
|  | 97 | Return a pointer to the internal buffer of the object. *o* has to be a | 
|  | 98 | :ctype:`PyUnicodeObject` (not checked). | 
|  | 99 |  | 
| Christian Heimes | 3b718a7 | 2008-02-14 12:47:33 +0000 | [diff] [blame] | 100 |  | 
|  | 101 | .. cfunction:: int PyUnicode_ClearFreeList(void) | 
|  | 102 |  | 
|  | 103 | Clear the free list. Return the total number of freed items. | 
|  | 104 |  | 
|  | 105 | .. versionadded:: 2.6 | 
|  | 106 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 107 | Unicode provides many different character properties. The most often needed ones | 
|  | 108 | are available through these macros which are mapped to C functions depending on | 
|  | 109 | the Python configuration. | 
|  | 110 |  | 
|  | 111 | .. % --- Unicode character properties --------------------------------------- | 
|  | 112 |  | 
|  | 113 |  | 
|  | 114 | .. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch) | 
|  | 115 |  | 
|  | 116 | Return 1 or 0 depending on whether *ch* is a whitespace character. | 
|  | 117 |  | 
|  | 118 |  | 
|  | 119 | .. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch) | 
|  | 120 |  | 
|  | 121 | Return 1 or 0 depending on whether *ch* is a lowercase character. | 
|  | 122 |  | 
|  | 123 |  | 
|  | 124 | .. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch) | 
|  | 125 |  | 
|  | 126 | Return 1 or 0 depending on whether *ch* is an uppercase character. | 
|  | 127 |  | 
|  | 128 |  | 
|  | 129 | .. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch) | 
|  | 130 |  | 
|  | 131 | Return 1 or 0 depending on whether *ch* is a titlecase character. | 
|  | 132 |  | 
|  | 133 |  | 
|  | 134 | .. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch) | 
|  | 135 |  | 
|  | 136 | Return 1 or 0 depending on whether *ch* is a linebreak character. | 
|  | 137 |  | 
|  | 138 |  | 
|  | 139 | .. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch) | 
|  | 140 |  | 
|  | 141 | Return 1 or 0 depending on whether *ch* is a decimal character. | 
|  | 142 |  | 
|  | 143 |  | 
|  | 144 | .. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch) | 
|  | 145 |  | 
|  | 146 | Return 1 or 0 depending on whether *ch* is a digit character. | 
|  | 147 |  | 
|  | 148 |  | 
|  | 149 | .. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch) | 
|  | 150 |  | 
|  | 151 | Return 1 or 0 depending on whether *ch* is a numeric character. | 
|  | 152 |  | 
|  | 153 |  | 
|  | 154 | .. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch) | 
|  | 155 |  | 
|  | 156 | Return 1 or 0 depending on whether *ch* is an alphabetic character. | 
|  | 157 |  | 
|  | 158 |  | 
|  | 159 | .. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch) | 
|  | 160 |  | 
|  | 161 | Return 1 or 0 depending on whether *ch* is an alphanumeric character. | 
|  | 162 |  | 
|  | 163 | These APIs can be used for fast direct character conversions: | 
|  | 164 |  | 
|  | 165 |  | 
|  | 166 | .. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch) | 
|  | 167 |  | 
|  | 168 | Return the character *ch* converted to lower case. | 
|  | 169 |  | 
|  | 170 |  | 
|  | 171 | .. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch) | 
|  | 172 |  | 
|  | 173 | Return the character *ch* converted to upper case. | 
|  | 174 |  | 
|  | 175 |  | 
|  | 176 | .. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch) | 
|  | 177 |  | 
|  | 178 | Return the character *ch* converted to title case. | 
|  | 179 |  | 
|  | 180 |  | 
|  | 181 | .. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch) | 
|  | 182 |  | 
|  | 183 | Return the character *ch* converted to a decimal positive integer.  Return | 
|  | 184 | ``-1`` if this is not possible.  This macro does not raise exceptions. | 
|  | 185 |  | 
|  | 186 |  | 
|  | 187 | .. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch) | 
|  | 188 |  | 
|  | 189 | Return the character *ch* converted to a single digit integer. Return ``-1`` if | 
|  | 190 | this is not possible.  This macro does not raise exceptions. | 
|  | 191 |  | 
|  | 192 |  | 
|  | 193 | .. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch) | 
|  | 194 |  | 
|  | 195 | Return the character *ch* converted to a double. Return ``-1.0`` if this is not | 
|  | 196 | possible.  This macro does not raise exceptions. | 
|  | 197 |  | 
|  | 198 | To create Unicode objects and access their basic sequence properties, use these | 
|  | 199 | APIs: | 
|  | 200 |  | 
|  | 201 | .. % --- Plain Py_UNICODE --------------------------------------------------- | 
|  | 202 |  | 
|  | 203 |  | 
|  | 204 | .. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size) | 
|  | 205 |  | 
|  | 206 | Create a Unicode Object from the Py_UNICODE buffer *u* of the given size. *u* | 
|  | 207 | may be *NULL* which causes the contents to be undefined. It is the user's | 
|  | 208 | responsibility to fill in the needed data.  The buffer is copied into the new | 
|  | 209 | object. If the buffer is not *NULL*, the return value might be a shared object. | 
|  | 210 | Therefore, modification of the resulting Unicode object is only allowed when *u* | 
|  | 211 | is *NULL*. | 
|  | 212 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 213 | .. versionchanged:: 2.5 | 
|  | 214 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 215 | changes in your code for properly supporting 64-bit systems. | 
|  | 216 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 217 |  | 
|  | 218 | .. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode) | 
|  | 219 |  | 
|  | 220 | Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE` | 
|  | 221 | buffer, or *NULL* if *unicode* is not a Unicode object. | 
|  | 222 |  | 
|  | 223 |  | 
|  | 224 | .. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode) | 
|  | 225 |  | 
|  | 226 | Return the length of the Unicode object. | 
|  | 227 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 228 | .. versionchanged:: 2.5 | 
|  | 229 | This function returned an :ctype:`int` type. This might require changes | 
|  | 230 | in your code for properly supporting 64-bit systems. | 
|  | 231 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 232 |  | 
|  | 233 | .. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors) | 
|  | 234 |  | 
|  | 235 | Coerce an encoded object *obj* to a Unicode object and return a reference with | 
|  | 236 | incremented refcount. | 
|  | 237 |  | 
|  | 238 | String and other char buffer compatible objects are decoded according to the | 
|  | 239 | given encoding and using the error handling defined by errors.  Both can be | 
|  | 240 | *NULL* to have the interface use the default values (see the next section for | 
|  | 241 | details). | 
|  | 242 |  | 
|  | 243 | All other objects, including Unicode objects, cause a :exc:`TypeError` to be | 
|  | 244 | set. | 
|  | 245 |  | 
|  | 246 | The API returns *NULL* if there was an error.  The caller is responsible for | 
|  | 247 | decref'ing the returned objects. | 
|  | 248 |  | 
|  | 249 |  | 
|  | 250 | .. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj) | 
|  | 251 |  | 
|  | 252 | Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used | 
|  | 253 | throughout the interpreter whenever coercion to Unicode is needed. | 
|  | 254 |  | 
|  | 255 | If the platform supports :ctype:`wchar_t` and provides a header file wchar.h, | 
|  | 256 | Python can interface directly to this type using the following functions. | 
|  | 257 | Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to | 
|  | 258 | the system's :ctype:`wchar_t`. | 
|  | 259 |  | 
|  | 260 | .. % --- wchar_t support for platforms which support it --------------------- | 
|  | 261 |  | 
|  | 262 |  | 
|  | 263 | .. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size) | 
|  | 264 |  | 
|  | 265 | Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size. | 
|  | 266 | Return *NULL* on failure. | 
|  | 267 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 268 | .. versionchanged:: 2.5 | 
|  | 269 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 270 | changes in your code for properly supporting 64-bit systems. | 
|  | 271 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 272 |  | 
|  | 273 | .. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size) | 
|  | 274 |  | 
|  | 275 | Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*.  At most | 
|  | 276 | *size* :ctype:`wchar_t` characters are copied (excluding a possible trailing | 
|  | 277 | 0-termination character).  Return the number of :ctype:`wchar_t` characters | 
|  | 278 | copied or -1 in case of an error.  Note that the resulting :ctype:`wchar_t` | 
|  | 279 | string may or may not be 0-terminated.  It is the responsibility of the caller | 
|  | 280 | to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is | 
|  | 281 | required by the application. | 
|  | 282 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 283 | .. versionchanged:: 2.5 | 
|  | 284 | This function returned an :ctype:`int` type and used an :ctype:`int` | 
|  | 285 | type for *size*. This might require changes in your code for properly | 
|  | 286 | supporting 64-bit systems. | 
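The termination caveat above can be handled by reserving one slot of the buffer for the terminator. The following is a minimal plain-C sketch of that pattern, not code from the interpreter: ``copy_wide`` is a hypothetical stand-in that only mimics the "copy at most *size* characters, no guaranteed terminator" contract, and ``copy_wide_terminated`` is an assumed wrapper name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a copy routine with the contract described
   above: it copies at most `size` characters and appends no terminator. */
ptrdiff_t copy_wide(const wchar_t *src, size_t src_len,
                    wchar_t *w, size_t size)
{
    assert(src != NULL && w != NULL);
    size_t n = src_len < size ? src_len : size;
    wmemcpy(w, src, n);
    return (ptrdiff_t)n;
}

/* Copy into a buffer of `capacity` elements and always 0-terminate it by
   keeping the last slot free for the terminator.  Returns the number of
   characters copied, or -1 if there is no room for a terminator. */
ptrdiff_t copy_wide_terminated(const wchar_t *src, size_t src_len,
                               wchar_t *w, size_t capacity)
{
    if (capacity == 0)
        return -1;
    ptrdiff_t n = copy_wide(src, src_len, w, capacity - 1);
    w[n] = L'\0';
    return n;
}
```

The same idea applies when calling the real API: pass one less than the buffer capacity as *size* and write the terminator at the returned index.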
|  | 287 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 288 |  | 
|  | 289 | .. _builtincodecs: | 
|  | 290 |  | 
|  | 291 | Built-in Codecs | 
|  | 292 | ^^^^^^^^^^^^^^^ | 
|  | 293 |  | 
|  | 294 | Python provides a set of builtin codecs which are written in C for speed. All of | 
|  | 295 | these codecs are directly usable via the following functions. | 
|  | 296 |  | 
|  | 297 | Many of the following APIs take two arguments, *encoding* and *errors*. These | 
|  | 298 | parameters have the same semantics as the ones of the built-in :func:`unicode` | 
|  | 299 | Unicode object constructor. | 
|  | 300 |  | 
|  | 301 | Setting *encoding* to *NULL* causes the default encoding to be used, which is | 
|  | 302 | ASCII.  The file system calls should use :cdata:`Py_FileSystemDefaultEncoding` | 
|  | 303 | as the encoding for file names. This variable should be treated as read-only: On | 
|  | 304 | some systems, it will be a pointer to a static string, on others, it will change | 
|  | 305 | at run-time (such as when the application invokes setlocale). | 
|  | 306 |  | 
|  | 307 | Error handling is set by *errors*, which may also be set to *NULL*, meaning to use | 
|  | 308 | the default handling defined for the codec.  Default error handling for all | 
|  | 309 | builtin codecs is "strict" (:exc:`ValueError` is raised). | 
|  | 310 |  | 
|  | 311 | The codecs all use a similar interface.  For simplicity, only deviations from the | 
|  | 312 | following generic ones are documented. | 
|  | 313 |  | 
|  | 314 | These are the generic codec APIs: | 
|  | 315 |  | 
|  | 316 | .. % --- Generic Codecs ----------------------------------------------------- | 
|  | 317 |  | 
|  | 318 |  | 
|  | 319 | .. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors) | 
|  | 320 |  | 
|  | 321 | Create a Unicode object by decoding *size* bytes of the encoded string *s*. | 
|  | 322 | *encoding* and *errors* have the same meaning as the parameters of the same name | 
|  | 323 | in the :func:`unicode` builtin function.  The codec to be used is looked up | 
|  | 324 | using the Python codec registry.  Return *NULL* if an exception was raised by | 
|  | 325 | the codec. | 
|  | 326 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 327 | .. versionchanged:: 2.5 | 
|  | 328 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 329 | changes in your code for properly supporting 64-bit systems. | 
|  | 330 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 331 |  | 
|  | 332 | .. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors) | 
|  | 333 |  | 
|  | 334 | Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python | 
|  | 335 | string object.  *encoding* and *errors* have the same meaning as the parameters | 
|  | 336 | of the same name in the Unicode :meth:`encode` method.  The codec to be used is | 
|  | 337 | looked up using the Python codec registry.  Return *NULL* if an exception was | 
|  | 338 | raised by the codec. | 
|  | 339 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 340 | .. versionchanged:: 2.5 | 
|  | 341 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 342 | changes in your code for properly supporting 64-bit systems. | 
|  | 343 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 344 |  | 
|  | 345 | .. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors) | 
|  | 346 |  | 
|  | 347 | Encode a Unicode object and return the result as a Python string object. | 
|  | 348 | *encoding* and *errors* have the same meaning as the parameters of the same name | 
|  | 349 | in the Unicode :meth:`encode` method. The codec to be used is looked up using | 
|  | 350 | the Python codec registry. Return *NULL* if an exception was raised by the | 
|  | 351 | codec. | 
|  | 352 |  | 
|  | 353 | These are the UTF-8 codec APIs: | 
|  | 354 |  | 
|  | 355 | .. % --- UTF-8 Codecs ------------------------------------------------------- | 
|  | 356 |  | 
|  | 357 |  | 
|  | 358 | .. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors) | 
|  | 359 |  | 
|  | 360 | Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string | 
|  | 361 | *s*. Return *NULL* if an exception was raised by the codec. | 
|  | 362 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 363 | .. versionchanged:: 2.5 | 
|  | 364 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 365 | changes in your code for properly supporting 64-bit systems. | 
|  | 366 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 367 |  | 
|  | 368 | .. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed) | 
|  | 369 |  | 
|  | 370 | If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If | 
|  | 371 | *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be | 
|  | 372 | treated as an error. Those bytes will not be decoded and the number of bytes | 
|  | 373 | that have been decoded will be stored in *consumed*. | 
|  | 374 |  | 
|  | 375 | .. versionadded:: 2.4 | 
|  | 376 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 377 | .. versionchanged:: 2.5 | 
|  | 378 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 379 | changes in your code for properly supporting 64-bit systems. | 
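The *consumed* behaviour described above amounts to stopping before a multi-byte sequence that is cut off at the end of the buffer. As a rough illustration only (this is not the interpreter's decoder, and the function name ``utf8_complete_prefix`` is an assumption), the following plain-C sketch computes how many leading bytes can be decoded without splitting a trailing sequence. It inspects only the tail of the buffer and does not validate earlier bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Return how many leading bytes of s[0..size) can be handed to a decoder
   without cutting a multi-byte UTF-8 sequence in half. */
size_t utf8_complete_prefix(const unsigned char *s, size_t size)
{
    assert(s != NULL || size == 0);

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    size_t i = size;
    size_t back = 0;
    while (i > 0 && back < 3 && (s[i - 1] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0)
        return size;            /* no lead byte found; leave it to the decoder */

    unsigned char lead = s[i - 1];
    size_t need;                /* sequence length announced by the lead byte */
    if (lead < 0x80)
        need = 1;
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return size;            /* invalid lead byte; leave it to the decoder */

    size_t have = back + 1;     /* bytes of the last sequence actually present */
    return have < need ? i - 1 : size;
}
```

A streaming caller would decode the returned prefix and carry the remaining bytes over to the next chunk.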
|  | 380 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 381 |  | 
|  | 382 | .. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors) | 
|  | 383 |  | 
|  | 384 | Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a | 
|  | 385 | Python string object.  Return *NULL* if an exception was raised by the codec. | 
|  | 386 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 387 | .. versionchanged:: 2.5 | 
|  | 388 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 389 | changes in your code for properly supporting 64-bit systems. | 
|  | 390 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 391 |  | 
|  | 392 | .. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode) | 
|  | 393 |  | 
|  | 394 | Encode a Unicode object using UTF-8 and return the result as a Python string | 
|  | 395 | object.  Error handling is "strict".  Return *NULL* if an exception was raised | 
|  | 396 | by the codec. | 
|  | 397 |  | 
|  | 398 | These are the UTF-32 codec APIs: | 
|  | 399 |  | 
|  | 400 | .. % --- UTF-32 Codecs ------------------------------------------------------ | 
|  | 401 |  | 
|  | 402 |  | 
|  | 403 | .. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder) | 
|  | 404 |  | 
|  | 405 | Decode *size* bytes from a UTF-32 encoded buffer string and return the | 
|  | 406 | corresponding Unicode object.  *errors* (if non-*NULL*) defines the error | 
|  | 407 | handling. It defaults to "strict". | 
|  | 408 |  | 
|  | 409 | If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte | 
|  | 410 | order:: | 
|  | 411 |  | 
|  | 412 | *byteorder == -1: little endian | 
|  | 413 | *byteorder == 0:  native order | 
|  | 414 | *byteorder == 1:  big endian | 
|  | 415 |  | 
|  | 416 | and then switches if the first four bytes of the input data are a byte order mark | 
|  | 417 | (BOM) and the specified byte order is native order.  This BOM is not copied into | 
|  | 418 | the resulting Unicode string.  After completion, *\*byteorder* is set to the | 
|  | 419 | current byte order at the end of input data. | 
|  | 420 |  | 
|  | 421 | In a narrow build codepoints outside the BMP will be decoded as surrogate pairs. | 
|  | 422 |  | 
|  | 423 | If *byteorder* is *NULL*, the codec starts in native order mode. | 
|  | 424 |  | 
|  | 425 | Return *NULL* if an exception was raised by the codec. | 
|  | 426 |  | 
|  | 427 | .. versionadded:: 2.6 | 
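The surrogate-pair representation mentioned above follows the standard UTF-16 arithmetic. The plain-C sketch below shows that arithmetic in isolation; the helper names ``to_surrogates`` and ``from_surrogates`` are illustrative assumptions and are not part of the Python C API.

```c
#include <assert.h>
#include <stdint.h>

/* Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair. */
void to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                              /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (cp >> 10));    /* upper 10 bits */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* lower 10 bits */
}

/* Combine a high and a low surrogate back into a single code point. */
uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                      | (uint32_t)(low - 0xDC00));
}
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, and combining that pair yields U+1F600 again.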
|  | 428 |  | 
|  | 429 |  | 
|  | 430 | .. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed) | 
|  | 431 |  | 
|  | 432 | If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If | 
|  | 433 | *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat | 
|  | 434 | trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible | 
|  | 435 | by four) as an error. Those bytes will not be decoded and the number of bytes | 
|  | 436 | that have been decoded will be stored in *consumed*. | 
|  | 437 |  | 
|  | 438 | .. versionadded:: 2.6 | 
|  | 439 |  | 
|  | 440 |  | 
|  | 441 | .. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder) | 
|  | 442 |  | 
|  | 443 | Return a Python string object holding the UTF-32 encoded value of the Unicode | 
|  | 444 | data in *s*.  The value of *byteorder* selects the byte order of the output as | 
|  | 445 | follows:: | 
|  | 446 |  | 
|  | 447 | byteorder == -1: little endian | 
|  | 448 | byteorder == 0:  native byte order (writes a BOM mark) | 
|  | 449 | byteorder == 1:  big endian | 
|  | 450 |  | 
|  | 451 | If *byteorder* is ``0``, the output string always starts with the Unicode byte | 
|  | 452 | order mark (U+FEFF). In the other two modes, no BOM is prepended. | 
|  | 453 |  | 
|  | 454 | If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output | 
|  | 455 | as a single codepoint. | 
|  | 456 |  | 
|  | 457 | Return *NULL* if an exception was raised by the codec. | 
|  | 458 |  | 
|  | 459 | .. versionadded:: 2.6 | 
|  | 460 |  | 
|  | 461 |  | 
|  | 462 | .. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode) | 
|  | 463 |  | 
|  | 464 | Return a Python string using the UTF-32 encoding in native byte order. The | 
|  | 465 | string always starts with a BOM mark.  Error handling is "strict".  Return | 
|  | 466 | *NULL* if an exception was raised by the codec. | 
|  | 467 |  | 
|  | 468 | .. versionadded:: 2.6 | 
|  | 469 |  | 
|  | 470 |  | 
|  | 471 | These are the UTF-16 codec APIs: | 
|  | 472 |  | 
|  | 473 | .. % --- UTF-16 Codecs ------------------------------------------------------ | 
|  | 474 |  | 
|  | 475 |  | 
|  | 476 | .. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder) | 
|  | 477 |  | 
|  | 478 | Decode *size* bytes from a UTF-16 encoded buffer string and return the | 
|  | 479 | corresponding Unicode object.  *errors* (if non-*NULL*) defines the error | 
|  | 480 | handling. It defaults to "strict". | 
|  | 481 |  | 
|  | 482 | If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte | 
|  | 483 | order:: | 
|  | 484 |  | 
|  | 485 | *byteorder == -1: little endian | 
|  | 486 | *byteorder == 0:  native order | 
|  | 487 | *byteorder == 1:  big endian | 
|  | 488 |  | 
|  | 489 | and then switches if the first two bytes of the input data are a byte order mark | 
|  | 490 | (BOM) and the specified byte order is native order.  This BOM is not copied into | 
|  | 491 | the resulting Unicode string.  After completion, *\*byteorder* is set to the | 
|  | 492 | current byte order at the end of input data. | 
|  | 493 |  | 
|  | 494 | If *byteorder* is *NULL*, the codec starts in native order mode. | 
|  | 495 |  | 
|  | 496 | Return *NULL* if an exception was raised by the codec. | 
|  | 497 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 498 | .. versionchanged:: 2.5 | 
|  | 499 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 500 | changes in your code for properly supporting 64-bit systems. | 
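The byte order switching described above can be pictured as a small check on the first two bytes. The following plain-C sketch illustrates that rule under the ``-1`` / ``0`` / ``1`` convention used here; it is not the interpreter's implementation, and the function name ``utf16_check_bom`` is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* If *byteorder is 0 (native) and the input starts with a UTF-16 BOM,
   set *byteorder to -1 (little endian) or 1 (big endian) and return 2,
   the number of bytes to skip.  Otherwise leave *byteorder unchanged and
   return 0. */
size_t utf16_check_bom(const unsigned char *s, size_t size, int *byteorder)
{
    assert(byteorder != NULL);
    assert(s != NULL || size == 0);

    if (*byteorder != 0 || size < 2)
        return 0;               /* explicit order requested, or too short */
    if (s[0] == 0xFF && s[1] == 0xFE) {
        *byteorder = -1;        /* U+FEFF stored low byte first */
        return 2;
    }
    if (s[0] == 0xFE && s[1] == 0xFF) {
        *byteorder = 1;         /* U+FEFF stored high byte first */
        return 2;
    }
    return 0;                   /* no BOM: keep native order */
}
```

Because the BOM is only honoured when the caller passes ``0``, an explicit ``-1`` or ``1`` causes leading BOM bytes to be treated as ordinary data in this sketch.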
|  | 501 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 502 |  | 
|  | 503 | .. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed) | 
|  | 504 |  | 
|  | 505 | If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If | 
|  | 506 | *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat | 
|  | 507 | trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a | 
|  | 508 | split surrogate pair) as an error. Those bytes will not be decoded and the | 
|  | 509 | number of bytes that have been decoded will be stored in *consumed*. | 
|  | 510 |  | 
|  | 511 | .. versionadded:: 2.4 | 
|  | 512 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 513 | .. versionchanged:: 2.5 | 
|  | 514 | This function used an :ctype:`int` type for *size* and an :ctype:`int *` | 
|  | 515 | type for *consumed*. This might require changes in your code for | 
|  | 516 | properly supporting 64-bit systems. | 
|  | 517 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 518 |  | 
|  | 519 | .. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder) | 
|  | 520 |  | 
|  | 521 | Return a Python string object holding the UTF-16 encoded value of the Unicode | 
|  | 522 | data in *s*.  The value of *byteorder* selects the byte order of the output as | 
|  | 523 | follows:: | 
|  | 524 |  | 
|  | 525 | byteorder == -1: little endian | 
|  | 526 | byteorder == 0:  native byte order (writes a BOM mark) | 
|  | 527 | byteorder == 1:  big endian | 
|  | 528 |  | 
|  | 529 | If *byteorder* is ``0``, the output string always starts with the Unicode byte | 
|  | 530 | order mark (U+FEFF). In the other two modes, no BOM is prepended. | 
|  | 531 |  | 
|  | 532 | If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get | 
|  | 533 | represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE` | 
|  | 534 | value is interpreted as a UCS-2 character. | 
|  | 535 |  | 
|  | 536 | Return *NULL* if an exception was raised by the codec. | 
|  | 537 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 538 | .. versionchanged:: 2.5 | 
|  | 539 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 540 | changes in your code for properly supporting 64-bit systems. | 
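The effect of the *byteorder* argument on the encoder's output can be sketched in plain C as follows. This is an illustration of the byte layout only, under the assumption that the input is already a sequence of 16-bit code units; the names ``host_is_little_endian``, ``put_unit`` and ``utf16_serialize`` are hypothetical and do not exist in the Python C API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Detect the host byte order by looking at the first byte of the value 1. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Store one 16-bit code unit in the requested byte order. */
void put_unit(uint16_t u, int little, unsigned char *out)
{
    if (little) {
        out[0] = (unsigned char)(u & 0xFF);
        out[1] = (unsigned char)(u >> 8);
    } else {
        out[0] = (unsigned char)(u >> 8);
        out[1] = (unsigned char)(u & 0xFF);
    }
}

/* Write n code units to out.  byteorder: -1 little endian, 1 big endian,
   0 native order preceded by a BOM.  out must have room for 2 * (n + 1)
   bytes.  Returns the number of bytes written. */
size_t utf16_serialize(const uint16_t *units, size_t n, int byteorder,
                       unsigned char *out)
{
    assert(out != NULL && (units != NULL || n == 0));
    assert(byteorder >= -1 && byteorder <= 1);

    int little = byteorder == -1
                 || (byteorder == 0 && host_is_little_endian());
    size_t pos = 0;
    if (byteorder == 0) {
        put_unit(0xFEFF, little, out);  /* BOM only in native mode */
        pos = 2;
    }
    for (size_t i = 0; i < n; i++, pos += 2)
        put_unit(units[i], little, out + pos);
    return pos;
}
```

With ``byteorder == 0`` the output is two bytes longer than in the other modes, which matters when sizing the destination buffer.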
|  | 541 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 542 |  | 
|  | 543 | .. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode) | 
|  | 544 |  | 
|  | 545 | Return a Python string using the UTF-16 encoding in native byte order. The | 
|  | 546 | string always starts with a BOM mark.  Error handling is "strict".  Return | 
|  | 547 | *NULL* if an exception was raised by the codec. | 
|  | 548 |  | 
|  | 549 | These are the "Unicode Escape" codec APIs: | 
|  | 550 |  | 
|  | 551 | .. % --- Unicode-Escape Codecs ---------------------------------------------- | 
|  | 552 |  | 
|  | 553 |  | 
|  | 554 | .. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors) | 
|  | 555 |  | 
|  | 556 | Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded | 
|  | 557 | string *s*.  Return *NULL* if an exception was raised by the codec. | 
|  | 558 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 559 | .. versionchanged:: 2.5 | 
|  | 560 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 561 | changes in your code for properly supporting 64-bit systems. | 
|  | 562 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 563 |  | 
|  | 564 | .. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size) | 
|  | 565 |  | 
|  | 566 | Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and | 
|  | 567 | return a Python string object.  Return *NULL* if an exception was raised by the | 
|  | 568 | codec. | 
|  | 569 |  | 
| Jeroen Ruigrok van der Werven | 0051bf3 | 2009-04-29 08:00:05 +0000 | [diff] [blame] | 570 | .. versionchanged:: 2.5 | 
|  | 571 | This function used an :ctype:`int` type for *size*. This might require | 
|  | 572 | changes in your code for properly supporting 64-bit systems. | 
|  | 573 |  | 
| Georg Brandl | f684272 | 2008-01-19 22:08:21 +0000 | [diff] [blame] | 574 |  | 
|  | 575 | .. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode) | 
|  | 576 |  | 
|  | 577 | Encode a Unicode object using Unicode-Escape and return the result as a Python | 
|  | 578 | string object.  Error handling is "strict". Return *NULL* if an exception was | 
|  | 579 | raised by the codec. | 
|  | 580 |  | 
|  | 581 | These are the "Raw Unicode Escape" codec APIs: | 
|  | 582 |  | 
|  | 583 | .. % --- Raw-Unicode-Escape Codecs ------------------------------------------ | 
|  | 584 |  | 

.. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
   encoded string *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
   and return a Python string object.  Return *NULL* if an exception was raised by
   the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)

   Encode a Unicode object using Raw-Unicode-Escape and return the result as a
   Python string object. Error handling is "strict". Return *NULL* if an exception
   was raised by the codec.

These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
ordinals, and the codecs accept only these ordinals during encoding.

.. % --- Latin-1 Codecs -----------------------------------------------------

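As a rough illustration of the rule that only the first 256 ordinals can be
encoded as Latin-1, here is a minimal standalone sketch in plain C. It does not
call the Python C API and is not how CPython implements the codec; the helper
name ``latin1_encode_strict`` and the use of ``uint32_t`` for code points are
assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the Python C API.
 * Encodes n code points from in into out, which must hold at least n bytes.
 * Returns 0 on success. On failure returns -1 and stores the index of the
 * first code point above 0xFF in *bad_index. */
int latin1_encode_strict(const uint32_t *in, size_t n,
                         unsigned char *out, size_t *bad_index)
{
    size_t i;
    assert(in != NULL && out != NULL && bad_index != NULL);
    for (i = 0; i < n; i++) {
        if (in[i] > 0xFF) {
            *bad_index = i;     /* a "strict" codec would raise an error here */
            return -1;
        }
        out[i] = (unsigned char)in[i];  /* ordinal and byte value coincide */
    }
    return 0;
}
```

With this sketch, a buffer holding U+0041 and U+20AC fails at index 1, because
the euro sign lies outside the Latin-1 range.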

.. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
   a Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)

   Encode a Unicode object using Latin-1 and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.

These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
codes generate errors.

.. % --- ASCII Codecs -------------------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the ASCII encoded string
   *s*.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)

   Encode a Unicode object using ASCII and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.

These are the mapping codec APIs:

.. % --- Character Map Codecs -----------------------------------------------

This codec is special in that it can be used to implement many different codecs
(and this is in fact what was done to obtain most of the standard codecs
included in the :mod:`encodings` package). The codec uses mappings to encode and
decode characters.

Decoding mappings must map single string characters to single Unicode
characters, integers (which are then interpreted as Unicode ordinals) or None
(meaning "undefined mapping" and causing an error).

Encoding mappings must map single Unicode characters to single string
characters, integers (which are then interpreted as Latin-1 ordinals) or None
(meaning "undefined mapping" and causing an error).

The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.

If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value is interpreted as a Unicode ordinal (when
decoding) or as a Latin-1 ordinal (when encoding). Because of this, mappings
only need to contain those entries which map characters to different code
points.


.. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object.  Return *NULL* if an exception was raised by the
   codec. If *mapping* is *NULL*, Latin-1 decoding is done. Otherwise it can be a
   dictionary that maps byte values, or a unicode string, which is treated as a
   lookup table. Byte values beyond the end of the string and U+FFFE "characters"
   are treated as "undefined mapping".

   .. versionchanged:: 2.4
      Allowed unicode string as mapping argument.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

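The lookup-table form of *mapping* described for
:cfunc:`PyUnicode_DecodeCharmap` can be pictured with the following standalone
sketch in plain C. It is only an illustration of the rule under "strict" error
handling, not CPython's implementation, and it does not call the Python C API;
``charmap_decode_strict``, ``UNDEFINED_MAPPING`` and the ``uint32_t`` table are
names and types assumed for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDEFINED_MAPPING 0xFFFEu   /* U+FFFE marks "no mapping" in the table */

/* Hypothetical helper, not part of the Python C API.
 * table[b] is the code point for byte value b; out must hold size entries.
 * Returns 0 on success, or -1 with the offending offset in *bad_index. */
int charmap_decode_strict(const unsigned char *s, size_t size,
                          const uint32_t *table, size_t table_len,
                          uint32_t *out, size_t *bad_index)
{
    size_t i;
    assert(s != NULL && table != NULL && out != NULL && bad_index != NULL);
    for (i = 0; i < size; i++) {
        /* A byte value past the end of the table has no mapping. */
        uint32_t cp = (s[i] < table_len) ? table[s[i]] : UNDEFINED_MAPPING;
        if (cp == UNDEFINED_MAPPING) {
            *bad_index = i;
            return -1;
        }
        out[i] = cp;
    }
    return 0;
}
```

Both failure cases from the description end up on the same path: a table entry
of U+FFFE and a byte value that indexes past the table are reported alike.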

.. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
   *mapping* object and return a Python string object. Return *NULL* if an
   exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)

   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.

The following codec API is special in that it maps Unicode to Unicode.


.. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)

   Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
   character mapping *table* to it and return the resulting Unicode object.  Return
   *NULL* when an exception was raised by the codec.

   The mapping *table* must map Unicode ordinal integers to Unicode ordinal
   integers or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.

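The three outcomes of a table lookup during translation (replace, delete, copy
unchanged) can be sketched in plain C as follows. This is a standalone
illustration, not CPython's implementation, and it does not call the Python C
API; ``struct xlate_entry``, its ``drop`` flag (standing in for a None entry)
and ``translate_ordinals`` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of a hypothetical translation table. */
struct xlate_entry {
    uint32_t from;  /* ordinal to look up */
    uint32_t to;    /* replacement ordinal, ignored when drop is set */
    int drop;       /* nonzero plays the role of a None entry */
};

/* Translates n code points from in to out (out must hold at least n).
 * Returns the number of code points written. */
size_t translate_ordinals(const uint32_t *in, size_t n,
                          const struct xlate_entry *table, size_t table_len,
                          uint32_t *out)
{
    size_t i, j, written = 0;
    assert(in != NULL && table != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        const struct xlate_entry *hit = NULL;
        for (j = 0; j < table_len; j++) {
            if (table[j].from == in[i]) {
                hit = &table[j];
                break;
            }
        }
        if (hit == NULL)
            out[written++] = in[i];     /* unmapped: copied as-is */
        else if (!hit->drop)
            out[written++] = hit->to;   /* mapped to another ordinal */
        /* otherwise the entry stands for None and the character is deleted */
    }
    return written;
}
```

The linear search keeps the sketch short; a real mapping object would be
queried through its :meth:`__getitem__` interface instead.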
These are the MBCS codec APIs. They are currently only available on Windows and
use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
DBCS) is a class of encodings, not just one.  The target encoding is defined by
the user settings on the machine running the codec.

.. % --- MBCS codecs for Windows --------------------------------------------


.. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)

   Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
   Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)

   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.

   .. versionadded:: 2.5


.. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)

   Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
   Python string object.  Return *NULL* if an exception was raised by the codec.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *size*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)

   Encode a Unicode object using MBCS and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.

.. % --- Methods & Slots ----------------------------------------------------


.. _unicodemethodsandslots:

Methods and Slot Functions
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following APIs are capable of handling Unicode objects and strings on input
(we refer to them as strings in the descriptions) and return Unicode objects or
integers as appropriate.

They all return *NULL* or ``-1`` if an exception occurs.


.. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)

   Concatenate two strings, giving a new Unicode string.


.. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)

   Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If negative, no limit is
   set.  Separators are not included in the resulting list.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxsplit*. This might require
      changes in your code for properly supporting 64-bit systems.


.. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)

   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is 0, the line break
   characters are not included in the resulting strings.

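To make the CRLF and *keepend* rules of :cfunc:`PyUnicode_Splitlines` concrete,
here is a small standalone sketch in plain C that scans one line at a time. It
is an illustration only and does not call the Python C API; ``next_line`` is a
hypothetical helper, and it recognises only ``\n``, ``\r`` and ``\r\n``, whereas
the Unicode function also treats other Unicode line break characters as line
ends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Python C API.
 * Scans s[pos..len) for one line. Stores the length of that line in
 * *line_len (counting the line break only if keepend is nonzero) and
 * returns the position at which the next line starts. */
size_t next_line(const char *s, size_t len, size_t pos,
                 int keepend, size_t *line_len)
{
    size_t end = pos, brk = 0;
    assert(s != NULL && line_len != NULL && pos <= len);
    while (end < len && s[end] != '\n' && s[end] != '\r')
        end++;
    if (end < len) {
        /* CRLF counts as a single line break that is two characters long. */
        brk = (s[end] == '\r' && end + 1 < len && s[end + 1] == '\n') ? 2 : 1;
    }
    *line_len = (end - pos) + (keepend ? brk : 0);
    return end + brk;
}
```

Calling the helper repeatedly, starting each call at the previous return value,
walks over the lines in order; the returned position is the same whether or not
*keepend* is set, only the reported line length differs.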

.. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)

   Translate a string by applying a character mapping table to it and return the
   resulting Unicode object.

   The mapping table must map Unicode ordinal integers to Unicode ordinal integers
   or None (causing deletion of the character).

   Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
   and sequences work well.  Unmapped character ordinals (ones which cause a
   :exc:`LookupError`) are left untouched and are copied as-is.

   *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
   use the default error handling.


.. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)

   Join a sequence of strings using the given separator and return the resulting
   Unicode string.


.. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return 1 if *substr* matches ``str[start:end]`` at the given tail end
   (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
   0 otherwise. Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

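The meaning of *direction* in :cfunc:`PyUnicode_Tailmatch` can be sketched on
plain C strings as follows. This standalone example does not call the Python C
API and is not CPython's implementation; ``tailmatch_sketch`` is a hypothetical
name, and the sketch assumes that *start* and *end* are already valid offsets
into the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * Assumes start <= end <= strlen(str).
 * direction == -1 tests for a prefix of str[start:end],
 * direction == 1 tests for a suffix. Returns 1 on a match, 0 otherwise. */
int tailmatch_sketch(const char *str, const char *substr,
                     size_t start, size_t end, int direction)
{
    size_t sublen = strlen(substr);
    assert(start <= end && end <= strlen(str));
    if (sublen > end - start)
        return 0;                       /* substr cannot fit in the slice */
    if (direction == -1)
        return memcmp(str + start, substr, sublen) == 0;
    return memcmp(str + end - sublen, substr, sublen) == 0;
}
```

For example, with the string ``"hello world"`` and the slice ``[0:5]``, the
substring ``"lo"`` matches as a suffix but not as a prefix.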

.. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)

   Return the first position of *substr* in ``str[start:end]`` using the given
   *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
   backward search).  The return value is the index of the first match; a value of
   ``-1`` indicates that no match was found, and ``-2`` indicates that an error
   occurred and an exception has been set.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *start* and *end*. This
      might require changes in your code for properly supporting 64-bit
      systems.

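The effect of *direction* on the search can be illustrated with a standalone
sketch on plain C strings. It does not call the Python C API and is not
CPython's implementation; ``find_in_slice`` is a hypothetical name. The sketch
assumes valid offsets and reports positions relative to the start of the whole
string, and since it cannot raise exceptions it has no counterpart to the
``-2`` error return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * Assumes start <= end <= strlen(str).
 * direction == 1 searches forward, any other value searches backward.
 * Returns the index of the match, or -1 if there is none. */
long find_in_slice(const char *str, const char *substr,
                   size_t start, size_t end, int direction)
{
    size_t sublen = strlen(substr);
    size_t i, last;
    assert(start <= end && end <= strlen(str));
    if (sublen > end - start)
        return -1;
    last = end - sublen;            /* highest index a match can start at */
    if (direction == 1) {
        for (i = start; i <= last; i++)
            if (memcmp(str + i, substr, sublen) == 0)
                return (long)i;
    } else {
        /* Visit last, last - 1, ..., start without underflowing size_t. */
        for (i = last + 1; i-- > start; )
            if (memcmp(str + i, substr, sublen) == 0)
                return (long)i;
    }
    return -1;
}
```

A forward search for ``"bc"`` in ``"abcabc"`` finds the occurrence at index 1,
while a backward search over the same slice finds the one at index 4.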

.. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)

   Return the number of non-overlapping occurrences of *substr* in
   ``str[start:end]``.  Return ``-1`` if an error occurred.

   .. versionchanged:: 2.5
      This function returned an :ctype:`int` type and used an :ctype:`int`
      type for *start* and *end*. This might require changes in your code for
      properly supporting 64-bit systems.

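What "non-overlapping" means for the count can be seen in this standalone sketch
on plain C strings. It does not call the Python C API and is not CPython's
implementation; ``count_in_slice`` is a hypothetical name, and the sketch
assumes valid offsets and a non-empty substring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Python C API.
 * Assumes start <= end <= strlen(str) and a non-empty substr.
 * Returns the number of non-overlapping matches in str[start:end]. */
size_t count_in_slice(const char *str, const char *substr,
                      size_t start, size_t end)
{
    size_t sublen = strlen(substr);
    size_t i = start, count = 0;
    assert(sublen > 0 && start <= end && end <= strlen(str));
    while (i + sublen <= end) {
        if (memcmp(str + i, substr, sublen) == 0) {
            count++;
            i += sublen;    /* skip past the match so matches cannot overlap */
        } else {
            i++;
        }
    }
    return count;
}
```

In ``"aaaa"`` the substring ``"aa"`` is counted twice, not three times, because
the scan resumes after each match instead of one character later.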

.. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)

   Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
   return the resulting Unicode object. *maxcount* == -1 means replace all
   occurrences.

   .. versionchanged:: 2.5
      This function used an :ctype:`int` type for *maxcount*. This might
      require changes in your code for properly supporting 64-bit systems.


.. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)

   Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
   respectively.


.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)

   Rich compare two unicode strings and return one of the following:

   * ``NULL`` in case an exception was raised
   * :const:`Py_True` or :const:`Py_False` for successful comparisons
   * :const:`Py_NotImplemented` in case the type combination is unknown

   Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
   :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
   with a :exc:`UnicodeDecodeError`.

   Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
   :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.


.. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)

   Return a new string object from *format* and *args*; this is analogous to
   ``format % args``.  The *args* argument must be a tuple.


.. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)

   Check whether *element* is contained in *container* and return true or false
   accordingly.

   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.