Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | 1 | |
| 2 | :mod:`string` --- Common string operations |
| 3 | ========================================== |
| 4 | |
| 5 | .. module:: string |
| 6 | :synopsis: Common string operations. |
| 7 | |
| 8 | |
| 9 | .. index:: module: re |
| 10 | |
| 11 | The :mod:`string` module contains a number of useful constants and |
| 12 | classes, as well as some deprecated legacy functions that are also |
| 13 | available as methods on strings. In addition, Python's built-in string |
| 14 | classes support the sequence type methods described in the |
| 15 | :ref:`typesseq` section, and also the string-specific methods described |
| 16 | in the :ref:`string-methods` section. To output formatted strings use |
| 17 | template strings or the ``%`` operator described in the |
| 18 | :ref:`string-formatting` section. Also, see the :mod:`re` module for |
| 19 | string functions based on regular expressions. |
| 20 | |
| 21 | |
| 22 | String constants |
| 23 | ---------------- |
| 24 | |
| 25 | The constants defined in this module are: |
| 26 | |
| 27 | |
| 28 | .. data:: ascii_letters |
| 29 | |
| 30 | The concatenation of the :const:`ascii_lowercase` and :const:`ascii_uppercase` |
| 31 | constants described below. This value is not locale-dependent. |
| 32 | |
| 33 | |
| 34 | .. data:: ascii_lowercase |
| 35 | |
| 36 | The lowercase letters ``'abcdefghijklmnopqrstuvwxyz'``. This value is not |
| 37 | locale-dependent and will not change. |
| 38 | |
| 39 | |
| 40 | .. data:: ascii_uppercase |
| 41 | |
| 42 | The uppercase letters ``'ABCDEFGHIJKLMNOPQRSTUVWXYZ'``. This value is not |
| 43 | locale-dependent and will not change. |
| 44 | |
| 45 | |
| 46 | .. data:: digits |
| 47 | |
| 48 | The string ``'0123456789'``. |
| 49 | |
| 50 | |
| 51 | .. data:: hexdigits |
| 52 | |
| 53 | The string ``'0123456789abcdefABCDEF'``. |
| 54 | |
| 55 | |
| 56 | .. data:: letters |
| 57 | |
| 58 | The concatenation of the strings :const:`lowercase` and :const:`uppercase` |
| 59 | described below. The specific value is locale-dependent, and will be updated |
| 60 | when :func:`locale.setlocale` is called. |
| 61 | |
| 62 | |
| 63 | .. data:: lowercase |
| 64 | |
| 65 | A string containing all the characters that are considered lowercase letters. |
| 66 | On most systems this is the string ``'abcdefghijklmnopqrstuvwxyz'``. Do not |
| 67 | change its definition --- the effect on the routines :func:`upper` and |
| 68 | :func:`swapcase` is undefined. The specific value is locale-dependent, and will |
| 69 | be updated when :func:`locale.setlocale` is called. |
| 70 | |
| 71 | |
| 72 | .. data:: octdigits |
| 73 | |
| 74 | The string ``'01234567'``. |
| 75 | |
| 76 | |
| 77 | .. data:: punctuation |
| 78 | |
| 79 | String of ASCII characters which are considered punctuation characters in the |
| 80 | ``C`` locale. |
| 81 | |
| 82 | |
| 83 | .. data:: printable |
| 84 | |
| 85 | String of characters which are considered printable. This is a combination of |
| 86 | :const:`digits`, :const:`letters`, :const:`punctuation`, and |
| 87 | :const:`whitespace`. |
| 88 | |
| 89 | |
| 90 | .. data:: uppercase |
| 91 | |
| 92 | A string containing all the characters that are considered uppercase letters. |
| 93 | On most systems this is the string ``'ABCDEFGHIJKLMNOPQRSTUVWXYZ'``. Do not |
| 94 | change its definition --- the effect on the routines :func:`lower` and |
| 95 | :func:`swapcase` is undefined. The specific value is locale-dependent, and will |
| 96 | be updated when :func:`locale.setlocale` is called. |
| 97 | |
| 98 | |
| 99 | .. data:: whitespace |
| 100 | |
| 101 | A string containing all characters that are considered whitespace. On most |
| 102 | systems this includes the characters space, tab, linefeed, return, formfeed, and |
| 103 | vertical tab. Do not change its definition --- the effect on the routines |
| 104 | :func:`strip` and :func:`split` is undefined. |
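As a brief sketch of how the locale-independent constants might be used, the helpers below test membership against :const:`hexdigits` and :const:`ascii_letters`. The function names ``is_hex_literal`` and ``is_ascii_word`` are hypothetical and not part of the module.

```python
import string

# Hypothetical helper: True if every character is a hexadecimal digit.
def is_hex_literal(text):
    return len(text) > 0 and all(ch in string.hexdigits for ch in text)

# Hypothetical helper: True if every character is an ASCII letter.
def is_ascii_word(text):
    return len(text) > 0 and all(ch in string.ascii_letters for ch in text)

# ascii_letters is documented as the concatenation of the two case constants.
letters_ok = string.ascii_letters == string.ascii_lowercase + string.ascii_uppercase

hex_ok = is_hex_literal('DeadBeef')   # True
hex_bad = is_hex_literal('xyz')       # False: 'x' is not a hex digit
word_ok = is_ascii_word('Python')     # True
word_bad = is_ascii_word('py3')       # False: '3' is not a letter
```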
| 105 | |
| 106 | |
| 107 | Template strings |
| 108 | ---------------- |
| 109 | |
| 110 | Templates provide simpler string substitutions as described in :pep:`292`. |
| 111 | Instead of the normal ``%``\ -based substitutions, Templates support ``$``\ |
| 112 | -based substitutions, using the following rules: |
| 113 | |
| 114 | * ``$$`` is an escape; it is replaced with a single ``$``. |
| 115 | |
| 116 | * ``$identifier`` names a substitution placeholder matching a mapping key of |
| 117 | ``"identifier"``. By default, ``"identifier"`` must spell a Python |
| 118 | identifier. The first non-identifier character after the ``$`` character |
| 119 | terminates this placeholder specification. |
| 120 | |
| 121 | * ``${identifier}`` is equivalent to ``$identifier``. It is required when valid |
| 122 | identifier characters follow the placeholder but are not part of the |
| 123 | placeholder, such as ``"${noun}ification"``. |
| 124 | |
| 125 | Any other appearance of ``$`` in the string will result in a :exc:`ValueError` |
| 126 | being raised. |
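The three rules above can be seen together in one short sketch. The template text and the keyword names ``noun`` and ``count`` are made up for illustration.

```python
from string import Template

# ${noun} needs braces because "ification" would otherwise be read as part
# of the identifier; $count ends at the following space; $$ yields one "$".
t = Template('${noun}ification of $count items costs $$5')
result = t.substitute(noun='class', count=3)
# result == 'classification of 3 items costs $5'
```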
| 127 | |
| 128 | .. versionadded:: 2.4 |
| 129 | |
| 130 | The :mod:`string` module provides a :class:`Template` class that implements |
| 131 | these rules. The methods of :class:`Template` are: |
| 132 | |
| 133 | |
| 134 | .. class:: Template(template) |
| 135 | |
| 136 | The constructor takes a single argument which is the template string. |
| 137 | |
| 138 | |
| 139 | .. method:: Template.substitute(mapping[, **kws]) |
| 140 | |
| 141 | Performs the template substitution, returning a new string. *mapping* is any |
| 142 | dictionary-like object with keys that match the placeholders in the template. |
| 143 | Alternatively, you can provide keyword arguments, where the keywords are the |
| 144 | placeholders. When both *mapping* and *kws* are given and there are duplicates, |
| 145 | the placeholders from *kws* take precedence. |
| 146 | |
| 147 | |
| 148 | .. method:: Template.safe_substitute(mapping[, **kws]) |
| 149 | |
| 150 | Like :meth:`substitute`, except that if placeholders are missing from *mapping* |
| 151 | and *kws*, instead of raising a :exc:`KeyError` exception, the original |
| 152 | placeholder will appear in the resulting string intact. Also, unlike with |
| 153 | :meth:`substitute`, any other appearance of ``$`` is simply returned as ``$`` |
| 154 | instead of raising :exc:`ValueError`. |
| 155 | |
| 156 | While other exceptions may still occur, this method is called "safe" because |
| 157 | substitution always tries to return a usable string instead of raising an |
| 158 | exception. In another sense, :meth:`safe_substitute` may be anything other than |
| 159 | safe, since it will silently ignore malformed templates containing dangling |
| 160 | delimiters, unmatched braces, or placeholders that are not valid Python |
| 161 | identifiers. |
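A minimal sketch of the difference between the two methods, using a made-up template that contains both a missing placeholder and a dangling delimiter:

```python
from string import Template

t = Template('Give $who $100 for $what')

# safe_substitute leaves the unknown placeholder and the stray "$" intact.
partial = t.safe_substitute(who='tim')
# partial == 'Give tim $100 for $what'

# substitute rejects the same template because "$100" is not a valid placeholder.
try:
    t.substitute(who='tim', what='lunch')
    strict_failed = False
except ValueError:
    strict_failed = True
```

Note that the malformed ``$100`` passes through ``safe_substitute`` silently, which is the sense in which the method may be "anything other than safe".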
| 162 | |
| 163 | :class:`Template` instances also provide one public data attribute: |
| 164 | |
| 165 | |
| 166 | .. attribute:: string.template |
| 167 | |
| 168 | This is the object passed to the constructor's *template* argument. In general, |
| 169 | you shouldn't change it, but read-only access is not enforced. |
| 170 | |
| 171 | Here is an example of how to use a Template:: |
| 172 | |
| 173 | >>> from string import Template |
| 174 | >>> s = Template('$who likes $what') |
| 175 | >>> s.substitute(who='tim', what='kung pao') |
| 176 | 'tim likes kung pao' |
| 177 | >>> d = dict(who='tim') |
| 178 | >>> Template('Give $who $100').substitute(d) |
| 179 | Traceback (most recent call last): |
| 180 | [...] |
| 181 | ValueError: Invalid placeholder in string: line 1, col 10 |
| 182 | >>> Template('$who likes $what').substitute(d) |
| 183 | Traceback (most recent call last): |
| 184 | [...] |
| 185 | KeyError: 'what' |
| 186 | >>> Template('$who likes $what').safe_substitute(d) |
| 187 | 'tim likes $what' |
| 188 | |
| 189 | Advanced usage: you can derive subclasses of :class:`Template` to customize the |
| 190 | placeholder syntax, delimiter character, or the entire regular expression used |
| 191 | to parse template strings. To do this, you can override these class attributes: |
| 192 | |
| 193 | * *delimiter* -- This is the literal string describing a placeholder-introducing |
| 194 | delimiter. The default value is ``$``. Note that this should *not* be a regular |
| 195 | expression, as the implementation will call :func:`re.escape` on this string as |
| 196 | needed. |
| 197 | |
| 198 | * *idpattern* -- This is the regular expression describing the pattern for |
| 199 | non-braced placeholders (the braces will be added automatically as |
| 200 | appropriate). The default value is the regular expression |
| 201 | ``[_a-z][_a-z0-9]*``. |
| 202 | |
| 203 | Alternatively, you can provide the entire regular expression pattern by |
| 204 | overriding the class attribute *pattern*. If you do this, the value must be a |
| 205 | regular expression object with four named capturing groups. The capturing |
| 206 | groups correspond to the rules given above, along with the invalid placeholder |
| 207 | rule: |
| 208 | |
| 209 | * *escaped* -- This group matches the escape sequence, e.g. ``$$``, in the |
| 210 | default pattern. |
| 211 | |
| 212 | * *named* -- This group matches the unbraced placeholder name; it should not |
| 213 | include the delimiter in the capturing group. |
| 214 | |
| 215 | * *braced* -- This group matches the brace-enclosed placeholder name; it should |
| 216 | not include either the delimiter or braces in the capturing group. |
| 217 | |
| 218 | * *invalid* -- This group matches any other delimiter pattern (usually a single |
| 219 | delimiter), and it should appear last in the regular expression. |
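As an illustration of the simplest customization described above, the sketch below overrides only *delimiter*. The class name ``PercentTemplate`` and the template text are hypothetical.

```python
from string import Template

# Hypothetical subclass that uses "%" instead of "$" to introduce placeholders.
class PercentTemplate(Template):
    delimiter = '%'

t = PercentTemplate('%who owes %{amount} dollars, 100%% sure')
result = t.substitute(who='tim', amount=5)
# result == 'tim owes 5 dollars, 100% sure'
```

The escape sequence follows the delimiter, so a literal percent sign is written as ``%%`` in this subclass.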
| 220 | |
| 221 | |
| 222 | String functions |
| 223 | ---------------- |
| 224 | |
| 225 | The following functions are available to operate on string and Unicode objects. |
| 226 | They are not available as string methods. |
| 227 | |
| 228 | |
| 229 | .. function:: capwords(s) |
| 230 | |
| 231 | Split the argument into words using :func:`split`, capitalize each word using |
| 232 | :func:`capitalize`, and join the capitalized words using :func:`join`. Note |
| 233 | that this replaces runs of whitespace characters by a single space, and removes |
| 234 | leading and trailing whitespace. |
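A short sketch of the whitespace handling described above, with a made-up input string:

```python
import string

# Runs of whitespace collapse to one space; leading and trailing
# whitespace is dropped; each word is capitalized.
result = string.capwords('  hello   wide  world ')
# result == 'Hello Wide World'
```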
| 235 | |
| 236 | |
| 237 | .. function:: maketrans(from, to) |
| 238 | |
| 239 | Return a translation table suitable for passing to :func:`translate`, that will |
| 240 | map each character in *from* into the character at the same position in *to*; |
| 241 | *from* and *to* must have the same length. |
| 242 | |
| 243 | .. warning:: |
| 244 | |
| 245 | Don't use strings derived from :const:`lowercase` and :const:`uppercase` as |
| 246 | arguments; in some locales, these don't have the same length. For case |
| 247 | conversions, always use :func:`lower` and :func:`upper`. |
| 248 | |
| 249 | |
| 250 | Deprecated string functions |
| 251 | --------------------------- |
| 252 | |
| 253 | The following functions are also defined as methods of string and |
| 254 | Unicode objects; see section :ref:`string-methods` for more information on |
| 255 | those. You should consider these functions as deprecated, although they will |
| 256 | not be removed until Python 3.0. The functions defined in this module are: |
| 257 | |
| 258 | |
| 259 | .. function:: atof(s) |
| 260 | |
| 261 | .. deprecated:: 2.0 |
| 262 | Use the :func:`float` built-in function. |
| 263 | |
| 264 | .. index:: builtin: float |
| 265 | |
| 266 | Convert a string to a floating point number. The string must have the standard |
| 267 | syntax for a floating point literal in Python, optionally preceded by a sign |
| 268 | (``+`` or ``-``). Note that this behaves identically to the built-in function |
| 269 | :func:`float` when passed a string. |
| 270 | |
| 271 | .. note:: |
| 272 | |
| 273 | .. index:: |
| 274 | single: NaN |
| 275 | single: Infinity |
| 276 | |
| 277 | When passing in a string, values for NaN and Infinity may be returned, depending |
| 278 | on the underlying C library. The specific set of strings accepted which cause |
| 279 | these values to be returned depends entirely on the C library and is known to |
| 280 | vary. |
| 281 | |
| 282 | |
| 283 | .. function:: atoi(s[, base]) |
| 284 | |
| 285 | .. deprecated:: 2.0 |
| 286 | Use the :func:`int` built-in function. |
| 287 | |
| 288 | .. index:: builtin: eval |
| 289 | |
| 290 | Convert string *s* to an integer in the given *base*. The string must consist |
| 291 | of one or more digits, optionally preceded by a sign (``+`` or ``-``). The |
| 292 | *base* defaults to 10. If it is 0, a default base is chosen depending on the |
| 293 | leading characters of the string (after stripping the sign): ``0x`` or ``0X`` |
| 294 | means 16, ``0`` means 8, anything else means 10. If *base* is 16, a leading |
| 295 | ``0x`` or ``0X`` is always accepted, though not required. This behaves |
| 296 | identically to the built-in function :func:`int` when passed a string. (Also |
| 297 | note: for a more flexible interpretation of numeric literals, use the built-in |
| 298 | function :func:`eval`.) |
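Because :func:`atoi` is deprecated in favor of :func:`int`, the sketch below shows the base handling with the built-in instead. It assumes the built-in behaves as described for the inputs shown; the octal ``0`` prefix is left out because its treatment differs between Python versions.

```python
# Base defaults to 10; a leading sign is accepted.
plain = int('-42')            # -42

# With base 16 the "0x" prefix is accepted but not required.
hex_bare = int('1f', 16)      # 31
hex_prefixed = int('0x1f', 16)  # 31

# With base 0 the base is inferred from the prefix.
inferred = int('0x1f', 0)     # 31
```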
| 299 | |
| 300 | |
| 301 | .. function:: atol(s[, base]) |
| 302 | |
| 303 | .. deprecated:: 2.0 |
| 304 | Use the :func:`long` built-in function. |
| 305 | |
| 306 | .. index:: builtin: long |
| 307 | |
| 308 | Convert string *s* to a long integer in the given *base*. The string must |
| 309 | consist of one or more digits, optionally preceded by a sign (``+`` or ``-``). |
| 310 | The *base* argument has the same meaning as for :func:`atoi`. A trailing ``l`` |
| 311 | or ``L`` is not allowed, except if the base is 0. Note that when invoked |
| 312 | without *base* or with *base* set to 10, this behaves identically to the built-in |
| 313 | function :func:`long` when passed a string. |
| 314 | |
| 315 | |
| 316 | .. function:: capitalize(word) |
| 317 | |
| 318 | Return a copy of *word* with only its first character capitalized. |
| 319 | |
| 320 | |
| 321 | .. function:: expandtabs(s[, tabsize]) |
| 322 | |
| 323 | Expand tabs in a string replacing them by one or more spaces, depending on the |
| 324 | current column and the given tab size. The column number is reset to zero after |
| 325 | each newline occurring in the string. This doesn't understand other non-printing |
| 326 | characters or escape sequences. The tab size defaults to 8. |
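The column tracking can be sketched with the equivalent string method, on the assumption that it behaves like the module function described here:

```python
# Default tab size 8: "a" occupies column 0, so the tab adds 7 spaces.
default_tabs = 'a\tb'.expandtabs()

# Tab size 4: "ab" occupies columns 0-1, so the tab adds 2 spaces.
narrow_tabs = 'ab\tc'.expandtabs(4)

# The column resets after the newline, so "y" is at column 0 again
# and the tab adds 3 spaces.
multi_line = 'x\ny\tz'.expandtabs(4)
```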
| 327 | |
| 328 | |
| 329 | .. function:: find(s, sub[, start[, end]]) |
| 330 | |
| 331 | Return the lowest index in *s* where the substring *sub* is found such that |
| 332 | *sub* is wholly contained in ``s[start:end]``. Return ``-1`` on failure. |
| 333 | Defaults for *start* and *end* and interpretation of negative values are the same |
| 334 | as for slices. |
| 335 | |
| 336 | |
| 337 | .. function:: rfind(s, sub[, start[, end]]) |
| 338 | |
| 339 | Like :func:`find` but find the highest index. |
| 340 | |
| 341 | |
| 342 | .. function:: index(s, sub[, start[, end]]) |
| 343 | |
| 344 | Like :func:`find` but raise :exc:`ValueError` when the substring is not found. |
| 345 | |
| 346 | |
| 347 | .. function:: rindex(s, sub[, start[, end]]) |
| 348 | |
| 349 | Like :func:`rfind` but raise :exc:`ValueError` when the substring is not found. |
| 350 | |
| 351 | |
| 352 | .. function:: count(s, sub[, start[, end]]) |
| 353 | |
| 354 | Return the number of (non-overlapping) occurrences of substring *sub* in string |
| 355 | ``s[start:end]``. Defaults for *start* and *end* and interpretation of negative |
| 356 | values are the same as for slices. |
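The search functions above differ mainly in direction and in how they report failure. The sketch below uses the equivalent string methods (an assumption, since the module functions are deprecated) on a made-up string:

```python
s = 'spam and spam'

first = s.find('spam')        # 0: lowest index
last = s.rfind('spam')        # 9: highest index
after_first = s.find('spam', 1)  # 9: search restricted to s[1:]
missing = s.find('eggs')      # -1 on failure

# index() reports failure with an exception instead of -1.
try:
    s.index('eggs')
    index_raised = False
except ValueError:
    index_raised = True

total = s.count('spam')       # 2
non_overlapping = 'aaaa'.count('aa')  # 2, not 3: matches do not overlap
```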
| 357 | |
| 358 | |
| 359 | .. function:: lower(s) |
| 360 | |
| 361 | Return a copy of *s*, but with upper case letters converted to lower case. |
| 362 | |
| 363 | |
| 364 | .. function:: split(s[, sep[, maxsplit]]) |
| 365 | |
| 366 | Return a list of the words of the string *s*. If the optional second argument |
| 367 | *sep* is absent or ``None``, the words are separated by arbitrary strings of |
| 368 | whitespace characters (space, tab, newline, return, formfeed). If the second |
| 369 | argument *sep* is present and not ``None``, it specifies a string to be used as |
| 370 | the word separator. The returned list will then have one more item than the |
| 371 | number of non-overlapping occurrences of the separator in the string. The |
| 372 | optional third argument *maxsplit* defaults to 0. If it is nonzero, at most |
| 373 | *maxsplit* number of splits occur, and the remainder of the string is returned |
| 374 | as the final element of the list (thus, the list will have at most |
| 375 | ``maxsplit+1`` elements). |
| 376 | |
| 377 | The behavior of split on an empty string depends on the value of *sep*. If *sep* |
| 378 | is not specified, or specified as ``None``, the result will be an empty list. |
| 379 | If *sep* is specified as any string, the result will be a list containing one |
| 380 | element which is an empty string. |
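The cases above can be compared side by side. This sketch uses the equivalent string method, assuming it matches the module function for the arguments shown; the sample strings are made up.

```python
# No separator: split on runs of whitespace.
words = 'a  b \t c'.split()            # ['a', 'b', 'c']

# Explicit separator: adjacent separators produce empty strings.
fields = 'a,b,,c'.split(',')           # ['a', 'b', '', 'c']

# maxsplit=1: one split, the remainder stays in the last element.
limited = 'a,b,c'.split(',', 1)        # ['a', 'b,c']

# Empty input: the result depends on whether a separator is given.
empty_default = ''.split()             # []
empty_with_sep = ''.split(',')         # ['']
```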
| 381 | |
| 382 | |
| 383 | .. function:: rsplit(s[, sep[, maxsplit]]) |
| 384 | |
| 385 | Return a list of the words of the string *s*, scanning *s* from the end. To all |
| 386 | intents and purposes, the resulting list of words is the same as returned by |
| 387 | :func:`split`, except when the optional third argument *maxsplit* is explicitly |
| 388 | specified and nonzero. When *maxsplit* is nonzero, at most *maxsplit* number of |
| 389 | splits -- the *rightmost* ones -- occur, and the remainder of the string is |
| 390 | returned as the first element of the list (thus, the list will have at most |
| 391 | ``maxsplit+1`` elements). |
| 392 | |
| 393 | .. versionadded:: 2.4 |
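A small sketch of how the scan direction matters once *maxsplit* is given, again using the string-method equivalents as an assumption:

```python
path = 'a,b,c'

from_left = path.split(',', 1)    # ['a', 'b,c']: leftmost split
from_right = path.rsplit(',', 1)  # ['a,b', 'c']: rightmost split

# Without a limit both directions give the same list.
same_without_limit = path.split(',') == path.rsplit(',')
```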
| 394 | |
| 395 | |
| 396 | .. function:: splitfields(s[, sep[, maxsplit]]) |
| 397 | |
| 398 | This function behaves identically to :func:`split`. (In the past, :func:`split` |
| 399 | was only used with one argument, while :func:`splitfields` was only used with |
| 400 | two arguments.) |
| 401 | |
| 402 | |
| 403 | .. function:: join(words[, sep]) |
| 404 | |
| 405 | Concatenate a list or tuple of words with intervening occurrences of *sep*. |
| 406 | The default value for *sep* is a single space character. It is always true that |
| 407 | ``string.join(string.split(s, sep), sep)`` equals *s*. |
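The round-trip property can be checked with the string-method equivalents (an assumption; ``sep.join(words)`` stands in for ``string.join(words, sep)``):

```python
s = 'a,b,,c'
sep = ','

parts = s.split(sep)          # ['a', 'b', '', 'c']
rebuilt = sep.join(parts)     # 'a,b,,c'
round_trip_ok = rebuilt == s
```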
| 408 | |
| 409 | |
| 410 | .. function:: joinfields(words[, sep]) |
| 411 | |
| 412 | This function behaves identically to :func:`join`. (In the past, :func:`join` |
| 413 | was only used with one argument, while :func:`joinfields` was only used with two |
| 414 | arguments.) Note that there is no :meth:`joinfields` method on string objects; |
| 415 | use the :meth:`join` method instead. |
| 416 | |
| 417 | |
| 418 | .. function:: lstrip(s[, chars]) |
| 419 | |
| 420 | Return a copy of the string with leading characters removed. If *chars* is |
| 421 | omitted or ``None``, whitespace characters are removed. If given and not |
| 422 | ``None``, *chars* must be a string; the characters in the string will be |
| 423 | stripped from the beginning of *s*. |
| 424 | |
| 425 | .. versionchanged:: 2.2.3 |
| 426 | The *chars* parameter was added. The *chars* parameter cannot be passed in |
| 427 | earlier 2.2 versions. |
| 428 | |
| 429 | |
| 430 | .. function:: rstrip(s[, chars]) |
| 431 | |
| 432 | Return a copy of the string with trailing characters removed. If *chars* is |
| 433 | omitted or ``None``, whitespace characters are removed. If given and not |
| 434 | ``None``, *chars* must be a string; the characters in the string will be |
| 435 | stripped from the end of *s*. |
| 436 | |
| 437 | .. versionchanged:: 2.2.3 |
| 438 | The *chars* parameter was added. The *chars* parameter cannot be passed in |
| 439 | earlier 2.2 versions. |
| 440 | |
| 441 | |
| 442 | .. function:: strip(s[, chars]) |
| 443 | |
| 444 | Return a copy of the string with leading and trailing characters removed. If |
| 445 | *chars* is omitted or ``None``, whitespace characters are removed. If given and |
| 446 | not ``None``, *chars* must be a string; the characters in the string will be |
| 447 | stripped from both ends of *s*. |
| 448 | |
| 449 | .. versionchanged:: 2.2.3 |
| 450 | The *chars* parameter was added. The *chars* parameter cannot be passed in |
| 451 | earlier 2.2 versions. |
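Note that *chars* is treated as a set of characters, not as a prefix or suffix. A brief sketch with the string-method equivalents (an assumption) and made-up inputs:

```python
# Default: whitespace is removed from both ends.
trimmed = '  hi  '.strip()                 # 'hi'

# One-sided variants with an explicit character set.
left_only = 'xxhixx'.lstrip('x')           # 'hixx'
right_only = 'xxhixx'.rstrip('x')          # 'xxhi'

# Any character in 'wcom.' is stripped from either end until a
# character outside the set is reached.
host = 'www.example.com'.strip('wcom.')    # 'example'
```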
| 452 | |
| 453 | |
| 454 | .. function:: swapcase(s) |
| 455 | |
| 456 | Return a copy of *s*, but with lower case letters converted to upper case and |
| 457 | vice versa. |
| 458 | |
| 459 | |
| 460 | .. function:: translate(s, table[, deletechars]) |
| 461 | |
| 462 | Delete all characters from *s* that are in *deletechars* (if present), and then |
| 463 | translate the characters using *table*, which must be a 256-character string |
| 464 | giving the translation for each character value, indexed by its ordinal. If |
| 465 | *table* is ``None``, then only the character deletion step is performed. |
| 466 | |
| 467 | |
| 468 | .. function:: upper(s) |
| 469 | |
| 470 | Return a copy of *s*, but with lower case letters converted to upper case. |
| 471 | |
| 472 | |
| 473 | .. function:: ljust(s, width) |
| 474 | rjust(s, width) |
| 475 | center(s, width) |
| 476 | |
| 477 | These functions respectively left-justify, right-justify and center a string in |
| 478 | a field of given width. They return a string that is at least *width* |
| 479 | characters wide, created by padding the string *s* with spaces until the given |
| 480 | width on the right, left or both sides. The string is never truncated. |
| 481 | |
| 482 | |
| 483 | .. function:: zfill(s, width) |
| 484 | |
| 485 | Pad a numeric string on the left with zero digits until the given width is |
| 486 | reached. Strings starting with a sign are handled correctly. |
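The padding functions can be sketched with their string-method equivalents, assuming identical behavior for these inputs:

```python
left = 'ab'.ljust(5)        # 'ab   '
right = 'ab'.rjust(5)       # '   ab'
middle = 'ab'.center(6)     # '  ab  '

# A string longer than the width is returned unchanged, never truncated.
untouched = 'abcdef'.ljust(3)   # 'abcdef'

# zfill keeps the sign in front of the inserted zeros.
padded = '42'.zfill(5)      # '00042'
signed = '-42'.zfill(6)     # '-00042'
```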
| 487 | |
| 488 | |
| 489 | .. function:: replace(str, old, new[, maxreplace]) |
| 490 | |
| 491 | Return a copy of string *str* with all occurrences of substring *old* replaced |
| 492 | by *new*. If the optional argument *maxreplace* is given, the first |
| 493 | *maxreplace* occurrences are replaced. |
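A short sketch of the effect of *maxreplace*, using the string-method equivalent (an assumption) on a made-up string:

```python
s = 'a-b-c-d'

all_replaced = s.replace('-', '+')       # 'a+b+c+d'
first_two = s.replace('-', '+', 2)       # 'a+b+c-d': only the first two
```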
| 494 | |