:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Source code: :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple item)
   is the logical line; continuation lines are included.  The 5-tuple is
   returned as a :term:`named tuple` with the field names:
   ``type string start end line``.

   .. versionchanged:: 3.1
      Added support for named tuples.

   :func:`tokenize` determines the source encoding of the file by looking for a
   UTF-8 BOM or encoding cookie, according to :pep:`263`.

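As an illustrative sketch, the following feeds an in-memory bytes buffer to
``tokenize()`` and reads the named-tuple fields described above.  The sample
source line and the variable names are assumptions made for this example
only.

```python
import io
import tokenize

# Hypothetical one-line program, supplied as bytes.
source = b"x = 1  # answer\n"

# tokenize() expects a readline callable that returns bytes;
# BytesIO.readline provides that interface for in-memory data.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

for tok in tokens:
    # Each token is a named tuple: type, string, start, end, line.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

The ``tok_name`` mapping comes from the ``token`` module and translates the
numeric token type into a readable name.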


All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.


.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text.  The first token returned by :func:`tokenize` will always be an
   ENCODING token.

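To make the three additional token types concrete, here is a small sketch
that tokenizes a statement continued across two physical lines.  The sample
source and the ``names`` list are assumptions chosen for illustration.

```python
import io
import tokenize

# Hypothetical statement: the open parenthesis continues the logical
# line, so the first physical line ends with NL rather than NEWLINE.
source = b"total = (1 +  # first\n         2)\n"

names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]
print(names)
```

The stream should start with ENCODING and contain one COMMENT, one NL for
the line break inside the parentheses, and one NEWLINE at the end of the
logical line.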


Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single object.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   If the token stream starts with an ENCODING token, as the output of
   :func:`tokenize` always does, the result is returned as bytes encoded
   using that encoding.  If there is no ENCODING token in the input, a string
   is returned instead.

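The round-trip guarantee can be sketched as follows.  The sample source and
the names ``pairs``, ``rebuilt`` and ``retokenized`` are assumptions for
this example; only the token type and token string are compared, because
spacing may differ.

```python
import io
import tokenize

# Hypothetical input script.
source = b"x = 1 + 2\n"

# Keep only (type, string) pairs, the minimum untokenize() requires.
pairs = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]

# The stream starts with an ENCODING token, so the result is bytes.
rebuilt = tokenize.untokenize(pairs)

# Tokenize the rebuilt script again and compare types and strings.
retokenized = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(rebuilt).readline)
]
print(retokenized == pairs)
```

The rebuilt bytes are not necessarily identical to ``source``; the check is
on the token stream, which is what the guarantee covers.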


:func:`tokenize` needs to detect the encoding of source files it tokenizes.  The
function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.

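A brief sketch of the three cases described above: an encoding cookie, a
UTF-8 BOM, and neither.  The byte strings are made-up inputs for
illustration.

```python
import io
import tokenize

# Hypothetical source with a PEP 263 encoding cookie on the first line.
with_cookie = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
cookie_encoding, cookie_lines = tokenize.detect_encoding(
    io.BytesIO(with_cookie).readline
)

# Hypothetical source starting with a UTF-8 BOM and no cookie.
with_bom = b"\xef\xbb\xbfx = 1\n"
bom_encoding, _ = tokenize.detect_encoding(io.BytesIO(with_bom).readline)

# Hypothetical source with neither a BOM nor a cookie.
plain = b"x = 1\n"
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(plain).readline)

print(cookie_encoding, bom_encoding, default_encoding)
```

Note that the reported name is normalized, so a ``latin-1`` cookie is
returned as ``'iso-8859-1'``.  The lines already consumed are returned
undecoded so that a caller can still process them.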


.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

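As a hedged sketch of how this might be used, the snippet below writes a
small Latin-1 encoded file to a temporary directory and reads it back with
``tokenize.open()``.  The file name ``example.py`` and its contents are
assumptions for the example.

```python
import os
import tempfile
import tokenize

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.py")  # hypothetical file name

    # Write a source file whose encoding is declared by a cookie.
    with open(path, "wb") as f:
        f.write("# -*- coding: latin-1 -*-\nname = 'café'\n".encode("latin-1"))

    # tokenize.open() detects the encoding and returns a text-mode file.
    with tokenize.open(path) as src:
        detected = src.encoding
        text = src.read()

print(detected)
print(text)
```

Opening the same file with the built-in ``open()`` and an assumed UTF-8
encoding would fail to decode the ``é`` byte, which is the problem this
helper avoids.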


.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [filename.py]

If :file:`filename.py` is specified, its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s)  #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: sh

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''
				217	5,0-5,0: ENDMARKER ''