:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python. The scanner in this module returns comments as tokens as
well, making it useful for implementing "pretty-printers," including colorizers
for on-screen displays.

.. seealso::

   Latest version of the `tokenize module Python source code
   <http://svn.python.org/view/python/branches/release27-maint/Lib/tokenize.py?view=markup>`_

The primary entry point is a :term:`generator`:

.. function:: generate_tokens(readline)

   The :func:`generate_tokens` generator requires one argument, *readline*,
   which must be a callable object that provides the same interface as the
   :meth:`readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`). Each call to the function should return one line
   of input as a string.

   The generator produces 5-tuples with these members: the token type; the token
   string; a 2-tuple ``(srow, scol)`` of ints specifying the row and column
   where the token begins in the source; a 2-tuple ``(erow, ecol)`` of ints
   specifying the row and column where the token ends in the source; and the
   line on which the token was found. The line passed (the last tuple item) is
   the *logical* line; continuation lines are included.

   .. versionadded:: 2.2

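As a minimal sketch, the generator can be driven from an in-memory string
through the :meth:`readline` method of a ``StringIO`` object (the
token-printing loop below is only an illustration, not part of the module)::

   import tokenize
   from StringIO import StringIO

   source = "x = 1 + 2  # compute the sum\n"
   for toknum, tokval, start, end, line in tokenize.generate_tokens(
           StringIO(source).readline):
       # tok_name maps numeric token types (including COMMENT and NL,
       # which tokenize adds) back to readable names.
       print tokenize.tok_name[toknum], repr(tokval)
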
An older entry point is retained for backward compatibility:


.. function:: tokenize(readline[, tokeneater])

   The :func:`tokenize` function accepts two parameters: one representing the input
   stream, and one providing an output mechanism for :func:`tokenize`.

   The first parameter, *readline*, must be a callable object that provides the
   same interface as the :meth:`readline` method of built-in file objects (see
   section :ref:`bltin-file-objects`). Each call to the function should return one
   line of input as a string. Alternatively, *readline* may be a callable object
   that signals completion by raising :exc:`StopIteration`.

   .. versionchanged:: 2.5
      Added :exc:`StopIteration` support.

   The second parameter, *tokeneater*, must also be a callable object. It is
   called once for each token, with five arguments, corresponding to the tuples
   generated by :func:`generate_tokens`.

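A minimal sketch of this older interface (the ``print_token`` callback is a
hypothetical name, not something provided by the module) might look like this::

   import tokenize
   from StringIO import StringIO

   def print_token(toknum, tokval, start, end, line):
       # Receives, as separate arguments, the same five values that
       # generate_tokens() would yield as a single tuple.
       print tokenize.tok_name[toknum], repr(tokval)

   tokenize.tokenize(StringIO("total = 1 + 2\n").readline, print_token)
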
All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are two additional token type values that might be passed to
the *tokeneater* function by :func:`tokenize`:


.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline. The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated when
   a logical line of code is continued over multiple physical lines.

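A short sketch (using a hypothetical two-line snippet) makes the distinction
visible: the newline inside the parentheses is reported as NL, while only the
newline ending the logical line is reported as NEWLINE::

   import tokenize
   from StringIO import StringIO

   source = "total = (1 +\n         2)\n"
   for toknum, tokval, start, end, line in tokenize.generate_tokens(
           StringIO(source).readline):
       if toknum in (tokenize.NL, tokenize.NEWLINE):
           print tokenize.tok_name[toknum], start
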
Another function is provided to reverse the tokenization process. This is useful
for creating tools that tokenize a script, modify the token stream, and write
back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code. The *iterable* must return
   sequences with at least two elements, the token type and the token string. Any
   additional sequence elements are ignored.

   The reconstructed script is returned as a single string. The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured. The guarantee applies only to the token
   type and token string as the spacing between tokens (column positions) may
   change.

   .. versionadded:: 2.5

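A minimal round-trip sketch (feeding back only two-element tuples, so the
original column positions are discarded)::

   from StringIO import StringIO
   from tokenize import generate_tokens, untokenize

   source = "x=3.0+4.0\n"
   tokens = [(toknum, tokval) for toknum, tokval, _, _, _
             in generate_tokens(StringIO(source).readline)]
   rebuilt = untokenize(tokens)
   # When tokenized again, rebuilt yields the same token types and strings
   # as source, though the spacing between tokens may differ.
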
Example of a script re-writer that transforms float literals into Decimal
objects::

   from StringIO import StringIO
   from tokenize import generate_tokens, untokenize, NUMBER, STRING, NAME, OP

   def decistmt(s):
       """Substitute Decimals for floats in a string of statements.

       >>> from decimal import Decimal
       >>> s = 'print +21.3e-5*-.1234/81.7'
       >>> decistmt(s)
       "print +Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7')"

       >>> exec(s)
       -3.21716034272e-007
       >>> exec(decistmt(s))
       -3.217160342717258261933904529E-7

       """
       result = []
       g = generate_tokens(StringIO(s).readline)   # tokenize the string
       for toknum, tokval, _, _, _ in g:
           if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
               result.extend([
                   (NAME, 'Decimal'),
                   (OP, '('),
                   (STRING, repr(tokval)),
                   (OP, ')')
               ])
           else:
               result.append((toknum, tokval))
       return untokenize(result)