\section{\module{tokenize} ---
         Tokenizer for Python source}

\declaremodule{standard}{tokenize}
\modulesynopsis{Lexical scanner for Python source code.}
\moduleauthor{Ka Ping Yee}{}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}


The \module{tokenize} module provides a lexical scanner for Python
source code, implemented in Python.  The scanner in this module
returns comments as tokens as well, making it useful for implementing
``pretty-printers,'' including colorizers for on-screen displays.

The primary entry point is a generator:

\begin{funcdesc}{generate_tokens}{readline}
  The \function{generate_tokens()} generator requires one argument,
  \var{readline}, which must be a callable object that
  provides the same interface as the \method{readline()} method of
  built-in file objects (see section~\ref{bltin-file-objects}).  Each
  call to the function should return one line of input as a string.

  The generator produces 5-tuples with these members:
  the token type;
  the token string;
  a 2-tuple \code{(\var{srow}, \var{scol})} of ints specifying the
  row and column where the token begins in the source;
  a 2-tuple \code{(\var{erow}, \var{ecol})} of ints specifying the
  row and column where the token ends in the source;
  and the line on which the token was found.
  The line passed is the \emph{logical} line;
  continuation lines are included.
  \versionadded{2.2}
\end{funcdesc}
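
As a minimal sketch of driving the generator, the following loop
prints the name, text, and position of each token scanned from a
short string of source code (the \code{source} text here is purely
illustrative):

\begin{verbatim}
from StringIO import StringIO
from token import tok_name
from tokenize import generate_tokens

source = "x = 21.3e-5  # a float\n"
g = generate_tokens(StringIO(source).readline)
for toknum, tokval, start, end, line in g:
    # tok_name maps token numbers back to readable names; use .get()
    # so tokenize-specific values such as COMMENT still print usefully.
    print tok_name.get(toknum, toknum), repr(tokval), start, end
\end{verbatim}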

An older entry point is retained for backward compatibility:

\begin{funcdesc}{tokenize}{readline\optional{, tokeneater}}
  The \function{tokenize()} function accepts two parameters: one
  representing the input stream, and one providing an output mechanism
  for \function{tokenize()}.

  The first parameter, \var{readline}, must be a callable object that
  provides the same interface as the \method{readline()} method of
  built-in file objects (see section~\ref{bltin-file-objects}).  Each
  call to the function should return one line of input as a string.
  Alternatively, \var{readline} may be a callable object that signals
  completion by raising \exception{StopIteration}.
  \versionchanged[Added \exception{StopIteration} support]{2.5}

  The second parameter, \var{tokeneater}, must also be a callable
  object.  It is called once for each token, with five arguments,
  corresponding to the tuples generated by \function{generate_tokens()}.
\end{funcdesc}
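
For instance, a minimal sketch of the callback style, using a
hypothetical \function{printtoken()} token eater that simply echoes
each token (the filename \file{example.py} is illustrative):

\begin{verbatim}
import tokenize
from token import tok_name

def printtoken(toknum, tokval, start, end, line):
    # Receives the same five values that generate_tokens()
    # would deliver as a tuple.
    print tok_name.get(toknum, toknum), repr(tokval)

f = open('example.py')
try:
    tokenize.tokenize(f.readline, printtoken)
finally:
    f.close()
\end{verbatim}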


All constants from the \refmodule{token} module are also exported from
\module{tokenize}, as are two additional token type values that might be
passed to the \var{tokeneater} function by \function{tokenize()}:

\begin{datadesc}{COMMENT}
  Token value used to indicate a comment.
\end{datadesc}
\begin{datadesc}{NL}
  Token value used to indicate a non-terminating newline.  The NEWLINE
  token indicates the end of a logical line of Python code; NL tokens
  are generated when a logical line of code is continued over multiple
  physical lines.
\end{datadesc}
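
The distinction is easy to see with a small sketch: in the
illustrative source below, the newline inside the parentheses is
reported as an NL token, while the newline ending the statement is
reported as NEWLINE:

\begin{verbatim}
from StringIO import StringIO
from token import tok_name
from tokenize import generate_tokens

source = "total = (1 +\n         2)\n"
for tok in generate_tokens(StringIO(source).readline):
    # tok[0] is the token type, tok[1] the token string.
    print tok_name.get(tok[0], tok[0]), repr(tok[1])
\end{verbatim}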

Another function is provided to reverse the tokenization process.
This is useful for creating tools that tokenize a script, modify
the token stream, and write back the modified script.

76\begin{funcdesc}{untokenize}{iterable}
Gregory P. Smith2e23e082005-06-11 08:16:04 +000077 Converts tokens back into Python source code. The \var{iterable}
Raymond Hettinger68c04532005-06-10 11:05:19 +000078 must return sequences with at least two elements, the token type and
79 the token string. Any additional sequence elements are ignored.
80
81 The reconstructed script is returned as a single string. The
82 result is guaranteed to tokenize back to match the input so that
83 the conversion is lossless and round-trips are assured. The
84 guarantee applies only to the token type and token string as
85 the spacing between tokens (column positions) may change.
86 \versionadded{2.5}
87\end{funcdesc}

Example of a script re-writer that transforms float literals into
Decimal objects:
\begin{verbatim}
from StringIO import StringIO
from tokenize import generate_tokens, untokenize, NUMBER, NAME, OP, STRING

def decistmt(s):
    """Substitute Decimals for floats in a string of statements.

    >>> from decimal import Decimal
    >>> s = 'print +21.3e-5*-.1234/81.7'
    >>> decistmt(s)
    "print +Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7')"

    >>> exec(s)
    -3.21716034272e-007
    >>> exec(decistmt(s))
    -3.217160342717258261933904529E-7

    """
    result = []
    g = generate_tokens(StringIO(s).readline)   # tokenize the string
    for toknum, tokval, _, _, _ in g:
        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
            result.extend([
                (NAME, 'Decimal'),
                (OP, '('),
                (STRING, repr(tokval)),
                (OP, ')')
            ])
        else:
            result.append((toknum, tokval))
    return untokenize(result)
\end{verbatim}