:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
tokens are returned using the generic :data:`token.OP` token type.  The exact
operator can be determined by checking the second field of the tuple returned
from :func:`tokenize.generate_tokens` (the actual token string matched) for
the character sequence that identifies a specific operator token.

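For instance, a minimal sketch of that check, picking the ``+=`` operator out
of a one-line source string::

   from StringIO import StringIO
   from tokenize import generate_tokens
   import token

   # Every operator arrives as token.OP; the token string (second field)
   # identifies which operator was actually matched.
   for toknum, tokval, _, _, _ in generate_tokens(StringIO('x += 1\n').readline):
       if toknum == token.OP:
           print repr(tokval)   # prints '+='
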
The primary entry point is a :term:`generator`:

.. function:: generate_tokens(readline)

   The :func:`generate_tokens` generator requires one argument, *readline*,
   which must be a callable object that provides the same interface as the
   :meth:`readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`).  Each call to the function should return one
   line of input as a string.  Alternatively, *readline* may be a callable
   object that signals completion by raising :exc:`StopIteration`.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple
   item) is the *logical* line; continuation lines are included.

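   For example, a short sketch that prints the type name, string, and
   position of each token in a one-line program, using
   :class:`StringIO.StringIO` to supply the *readline* callable::

      import token
      from StringIO import StringIO
      from tokenize import generate_tokens

      g = generate_tokens(StringIO('x = 1\n').readline)
      for toknum, tokval, start, end, line in g:
          print token.tok_name[toknum], repr(tokval), start, end
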
   .. versionadded:: 2.2

An older entry point is retained for backward compatibility:


.. function:: tokenize(readline[, tokeneater])

   The :func:`tokenize` function accepts two parameters: one representing the
   input stream, and one providing an output mechanism for :func:`tokenize`.

   The first parameter, *readline*, must be a callable object that provides
   the same interface as the :meth:`readline` method of built-in file objects
   (see section :ref:`bltin-file-objects`).  Each call to the function should
   return one line of input as a string.  Alternatively, *readline* may be a
   callable object that signals completion by raising :exc:`StopIteration`.

   .. versionchanged:: 2.5
      Added :exc:`StopIteration` support.

   The second parameter, *tokeneater*, must also be a callable object.  It is
   called once for each token, with five arguments corresponding to the
   tuples generated by :func:`generate_tokens`.
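
   For example, a sketch of a *tokeneater* that simply names each token as it
   is encountered (the helper name ``print_token`` is only illustrative)::

      import token
      import tokenize
      from StringIO import StringIO

      def print_token(toknum, tokval, start, end, line):
          # Illustrative tokeneater: called by tokenize() once per token,
          # with the same five values generate_tokens() packs into a tuple.
          print token.tok_name[toknum], repr(tokval)

      tokenize.tokenize(StringIO('x = 1\n').readline, print_token)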

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are two additional token type values that might be passed
to the *tokeneater* function by :func:`tokenize`:


.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.
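
   For instance, in a sketch like the following, the newline inside the
   parentheses produces an :data:`NL` token, while the newline that closes
   the logical line produces a NEWLINE token::

      import token
      from StringIO import StringIO
      from tokenize import generate_tokens, NL

      source = 'x = (1 +\n     2)\n'
      for toknum, tokval, _, _, _ in generate_tokens(StringIO(source).readline):
          if toknum == NL:
              print 'NL', repr(tokval)       # continuation newline
          elif toknum == token.NEWLINE:
              print 'NEWLINE', repr(tokval)  # ends the logical line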

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input, so the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string, as the spacing between tokens (column
   positions) may change.

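   For example, a quick sketch of that round-trip property, feeding
   :func:`untokenize` 2-tuples of token type and token string::

      from StringIO import StringIO
      from tokenize import generate_tokens, untokenize

      source = 'if x > 0:\n    y = 1\n'
      tokens = [(toknum, tokval) for toknum, tokval, _, _, _
                in generate_tokens(StringIO(source).readline)]
      rebuilt = untokenize(tokens)
      # The spacing of rebuilt may differ from source, but retokenizing
      # it yields the same token types and strings.
      assert [(t, s) for t, s, _, _, _
              in generate_tokens(StringIO(rebuilt).readline)] == tokens
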
   .. versionadded:: 2.5

Example of a script rewriter that transforms float literals into Decimal
objects::

   from StringIO import StringIO
   from tokenize import generate_tokens, untokenize, NUMBER, STRING, NAME, OP

   def decistmt(s):
       """Substitute Decimals for floats in a string of statements.

       >>> from decimal import Decimal
       >>> s = 'print +21.3e-5*-.1234/81.7'
       >>> decistmt(s)
       "print +Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7')"

       >>> exec(s)
       -3.21716034272e-007
       >>> exec(decistmt(s))
       -3.217160342717258261933904529E-7

       """
       result = []
       g = generate_tokens(StringIO(s).readline)   # tokenize the string
       for toknum, tokval, _, _, _ in g:
           if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
               result.extend([
                   (NAME, 'Decimal'),
                   (OP, '('),
                   (STRING, repr(tokval)),
                   (OP, ')')
               ])
           else:
               result.append((toknum, tokval))
       return untokenize(result)