:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
tokens are returned using the generic :data:`token.OP` token type.  The exact
type can be determined by checking the ``exact_type`` property on the
:term:`named tuple` returned from :func:`tokenize.tokenize`.
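
As a minimal sketch of the difference between the generic and the exact type,
the following collects both names for every operator token.  The sample source
line and the ``op_types`` list are illustrative choices only:

.. code-block:: python

    import io
    import tokenize
    from token import tok_name

    source = b"x = (1 + 2) * 3\n"  # arbitrary sample input (an assumption)

    op_types = []
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == tokenize.OP:
            # tok.type is always OP here; exact_type names the specific operator.
            op_types.append(
                (tok.string, tok_name[tok.type], tok_name[tok.exact_type])
            )

    for string, generic, exact in op_types:
        print(string, generic, exact)

For this input the generic name is ``OP`` for all five operator tokens, while
the exact names are ``EQUAL``, ``LPAR``, ``PLUS``, ``RPAR`` and ``STAR``.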

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple item)
   is the *physical* line.  The 5-tuple is returned as a :term:`named tuple`
   with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`tokenize` determines the source encoding of the file by looking for a
   UTF-8 BOM or encoding cookie, according to :pep:`263`.
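
A short sketch of a typical call, assuming the source is already available as
bytes in memory; wrapping it in :class:`io.BytesIO` is one convenient way to
obtain a suitable *readline* callable.  The sample statement is arbitrary:

.. code-block:: python

    import io
    import tokenize

    source = b"total = price * 2  # double it\n"  # arbitrary sample input

    # tokenize() expects a readline callable that returns bytes.
    tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

    for tok in tokens:
        # Each token is a named tuple with the fields
        # type, string, start, end and line.
        print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)

The first token reports the detected encoding, and the comment is returned as
a token of its own rather than being discarded.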

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.


.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text.  The first token returned by :func:`tokenize` will always be an
   ENCODING token.
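
The three additional token types can be observed with a small sketch such as
the one below.  The sample input is an assumption chosen to contain a
comment-only line and a statement continued inside brackets:

.. code-block:: python

    import io
    import tokenize

    source = (
        b"# leading comment\n"
        b"values = [1,\n"
        b"          2]\n"
    )  # arbitrary sample input

    # Collect the name of every token type, in order of appearance.
    names = [
        tokenize.tok_name[tok.type]
        for tok in tokenize.tokenize(io.BytesIO(source).readline)
    ]
    print(names)

Here the stream starts with ``ENCODING`` and ``COMMENT``.  ``NL`` appears
twice, once after the comment-only line and once inside the brackets, while
``NEWLINE`` appears only once, at the end of the single logical line.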

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   It returns bytes, encoded using the ENCODING token, which is the first
   token sequence output by :func:`tokenize`.
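
The round-trip guarantee can be illustrated with a sketch like the following,
which keeps only the token type and token string.  The irregularly spaced
sample line is an assumption chosen to show that spacing is not preserved:

.. code-block:: python

    import io
    import tokenize

    source = b"x  =  1 +   2\n"  # arbitrary sample input with odd spacing

    # Keep only (type, string) pairs, the minimum untokenize() requires.
    pairs = [
        (tok.type, tok.string)
        for tok in tokenize.tokenize(io.BytesIO(source).readline)
    ]

    # The result is bytes because the first token is an ENCODING token.
    rebuilt = tokenize.untokenize(pairs)

    # Tokenize the rebuilt source again and compare types and strings only.
    pairs_again = [
        (tok.type, tok.string)
        for tok in tokenize.tokenize(io.BytesIO(rebuilt).readline)
    ]
    print(rebuilt)
    print(pairs == pairs_again)

The rebuilt bytes need not match the original spacing, but the two lists of
``(type, string)`` pairs compare equal.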

:func:`tokenize` needs to detect the encoding of source files it tokenizes.  The
function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.
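
The three cases described above (cookie, BOM, and default) can be sketched as
follows.  The sample byte strings are assumptions made for illustration:

.. code-block:: python

    import io
    import tokenize

    # Sample source with a PEP 263 encoding cookie on the first line.
    source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

    encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    print(encoding)    # normalized codec name
    print(lines_read)  # undecoded lines consumed while searching

    # A UTF-8 BOM with no cookie is reported as 'utf-8-sig'.
    bom_encoding, _ = tokenize.detect_encoding(
        io.BytesIO(b"\xef\xbb\xbfx = 1\n").readline
    )

    # With neither a BOM nor a cookie, the default is reported.
    default_encoding, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
    print(bom_encoding, default_encoding)

Note that the returned name is normalized, so a ``latin-1`` cookie is reported
as ``'iso-8859-1'``.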

.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2
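
A minimal sketch of reading a source file whose encoding is declared by a
cookie.  The temporary file and its contents are assumptions made so that the
example is self-contained:

.. code-block:: python

    import os
    import tempfile
    import tokenize

    # Write a small Latin-1 encoded source file to a temporary location.
    source = "# -*- coding: latin-1 -*-\nname = 'caf\u00e9'\n"
    with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
        tmp.write(source.encode("latin-1"))
        path = tmp.name

    try:
        # tokenize.open() detects the encoding and returns a text-mode file.
        with tokenize.open(path) as f:
            file_encoding = f.encoding
            text = f.read()
    finally:
        os.remove(path)

    print(file_encoding)
    print(text == source)

Because the encoding is taken from the cookie, the non-ASCII character is
decoded correctly without passing an ``encoding`` argument.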

.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

Note that unclosed single-quoted strings do not cause an error to be
raised.  They are tokenized as ``ERRORTOKEN``, followed by the tokenization of
their contents.
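
A tool that tokenizes possibly incomplete input might guard against this
exception along the following lines.  This is only a sketch; the ``error``
variable is a hypothetical name used to keep the exception for inspection:

.. code-block:: python

    import io
    import tokenize

    # A triple-quoted string that is never closed.
    source = b'"""Beginning of\ndocstring\n'

    try:
        list(tokenize.tokenize(io.BytesIO(source).readline))
    except tokenize.TokenError as exc:
        error = exc
        # exc.args holds a message and a (row, column) position.
        print("incomplete input:", exc.args)
    else:
        error = None

The exact wording of the message is an implementation detail and should not be
relied upon.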

.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s)  #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any).

.. code-block:: sh

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''
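
A listing of this shape can also be produced from within a program.  The
sketch below is an approximation rather than the command-line implementation
itself: the column widths are assumptions chosen to resemble the output shown
above, and the script is supplied as an in-memory byte string:

.. code-block:: python

    import io
    import tokenize

    source = b'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'

    rows = []
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        # start and end are (row, column) pairs; join them into one range.
        token_range = "%d,%d-%d,%d:" % (tok.start + tok.end)
        # Column widths are an assumption, not taken from the module source.
        rows.append(
            "%-20s%-15s%-15r"
            % (token_range, tokenize.tok_name[tok.type], tok.string)
        )

    print("\n".join(rows))

Replacing ``tok.type`` with ``tok.exact_type`` gives the specific operator
names, comparable to what the ``-e`` option displays.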
