:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens are returned using the generic
:data:`token.OP` token type.  The exact type can be determined by checking the
``exact_type`` property on the :term:`named tuple` returned from
:func:`tokenize.tokenize`.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple item)
   is the *logical* line; continuation lines are included.  The 5-tuple is
   returned as a :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`tokenize` determines the source encoding of the file by looking for a
   UTF-8 BOM or encoding cookie, according to :pep:`263`.
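
As a minimal sketch of the interface described above, the following snippet tokenizes a one-line program held in memory and reads the named-tuple fields and the ``exact_type`` property. The source string and the variable names are illustrative assumptions, not part of the module.

```python
import io
import tokenize

# Hypothetical one-line program; tokenize() needs a readline callable
# that returns bytes, so the source is wrapped in a BytesIO object.
source = b"total = price + 2\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# tokens[0] is the ENCODING token; the first real source token follows it.
name = tokens[1]
name_summary = (name.type == tokenize.NAME, name.string, name.start, name.end)

# Operators all share the generic OP type; exact_type tells them apart.
operators = [
    (tok.string, tokenize.tok_name[tok.exact_type])
    for tok in tokens
    if tok.type == tokenize.OP
]
print(name_summary)
print(operators)
```

Here ``tok_name`` (re-exported from :mod:`token`) is used only to turn the numeric type into a readable name.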

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.
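
The difference between NL and NEWLINE can be seen with a statement that spans two physical lines. This is a small sketch under the assumption of a parenthesized continuation; the source text is made up for illustration.

```python
import io
import tokenize

# One logical line split over two physical lines inside parentheses.
source = b"total = (1 +\n         2)\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Collect only the line-ending tokens, in the order they were produced.
line_endings = [
    tokenize.tok_name[tok.type]
    for tok in tokens
    if tok.type in (tokenize.NL, tokenize.NEWLINE)
]
print(line_endings)
```

The line break inside the parentheses is reported as NL, and only the break that ends the statement is reported as NEWLINE.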

.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text.  The first token returned by :func:`tokenize` will always be an
   ENCODING token.
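
A short sketch of that guarantee: pulling just the first item from the generator yields the ENCODING token. The input here is an assumed snippet with no BOM and no encoding cookie, so the default encoding is reported.

```python
import io
import tokenize

# No BOM and no coding cookie, so the default encoding applies.
source = b"x = 1\n"
first = next(tokenize.tokenize(io.BytesIO(source).readline))

print(tokenize.tok_name[first.type], first.string)
```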

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   When the input starts with an ENCODING token, which is the first token
   sequence output by :func:`tokenize`, the result is returned as bytes
   encoded using that encoding.
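
The round-trip guarantee can be sketched as follows: tokens are reduced to ``(type, string)`` pairs, passed through :func:`untokenize`, and tokenized again. The sample source and variable names are assumptions made for the example.

```python
import io
import tokenize

source = b"x = 1 + 2\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Keep only the two guaranteed elements; position information is dropped.
pairs = [(tok.type, tok.string) for tok in tokens]

# The stream starts with an ENCODING token, so the result is bytes.
rebuilt = tokenize.untokenize(pairs)

# Tokenizing the rebuilt source gives back the same (type, string) pairs,
# even though the spacing between tokens may differ from the original.
roundtrip = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(rebuilt).readline)
]
print(rebuilt)
print(roundtrip == pairs)
```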

:func:`tokenize` needs to detect the encoding of source files it tokenizes.  The
function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.
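
The three cases described above (cookie, BOM, and default) can be sketched with in-memory sources. The byte strings are illustrative assumptions; note that CPython reports a normalized name for the ``latin-1`` cookie.

```python
import codecs
import io
import tokenize

# Case 1: an encoding cookie on the first line.
cookie_source = b"# -*- coding: latin-1 -*-\nx = 1\n"
cookie_encoding, lines_read = tokenize.detect_encoding(
    io.BytesIO(cookie_source).readline
)

# Case 2: a UTF-8 BOM and no cookie.
bom_source = codecs.BOM_UTF8 + b"x = 1\n"
bom_encoding, _ = tokenize.detect_encoding(io.BytesIO(bom_source).readline)

# Case 3: neither a BOM nor a cookie, so the default is returned.
plain_encoding, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)

print(cookie_encoding, lines_read)
print(bom_encoding, plain_encoding)
```

The second return value holds the raw lines already consumed, so a caller that wants to process the whole file can prepend them to whatever *readline* yields next.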

.. function:: open(filename)

   Open a file in read only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2
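
A minimal sketch of :func:`tokenize.open`, assuming a temporary file named ``legacy.py`` (a hypothetical name) that declares a Latin-1 encoding cookie. The file is written as raw bytes and then read back as text without naming the encoding.

```python
import os
import tempfile
import tokenize

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.py")  # hypothetical file name

    # Write the source as Latin-1 bytes, with a matching coding cookie.
    with open(path, "wb") as raw:
        raw.write("# -*- coding: latin-1 -*-\nname = 'café'\n".encode("latin-1"))

    # tokenize.open() detects the cookie and decodes the text accordingly.
    with tokenize.open(path) as src:
        detected = src.encoding
        text = src.read()

print(detected)
print(text)
```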

.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.
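
The stdin mode and the ``-e`` option can also be exercised from Python. This is only a sketch: it assumes the current interpreter is available as ``sys.executable`` and feeds a made-up one-line program through a pipe.

```python
import subprocess
import sys

# Run "python -m tokenize -e" with the source supplied on stdin.
result = subprocess.run(
    [sys.executable, "-m", "tokenize", "-e"],
    input=b"x = 1\n",
    capture_output=True,
    check=True,
)

# With -e, the "=" token is listed by its exact type rather than as OP.
listing = result.stdout.decode()
print(listing)
```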

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: sh

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''