Blame - Doc/library/tokenize.rst - platform/external/python/cpython3

blob: ff55aacbd44c5b50e4c8afeff3b7ff2559b051b2

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`tokenize` --- Tokenizer for Python source
				2	===============================================
				3
				4	.. module:: tokenize
				5	:synopsis: Lexical scanner for Python source code.
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	7	.. moduleauthor:: Ka Ping Yee
				8	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				9
Raymond Hettinger	1048094	2011-01-10 03:26:08 +0000	[diff] [blame]	10	Source code: :source:`Lib/tokenize.py`
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	11
Raymond Hettinger	4f707fd	2011-01-10 19:54:11 +0000	[diff] [blame]	12	--------------
				13
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	14	The :mod:`tokenize` module provides a lexical scanner for Python source code,
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	15	implemented in Python. The scanner in this module returns comments as tokens
				16	as well, making it useful for implementing "pretty-printers," including
				17	colorizers for on-screen displays.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	19	To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
				20	tokens are returned using the generic :data:`token.OP` token type. The exact
				21	type can be determined by checking the ``exact_type`` property on the
				22	:term:`named tuple` returned from :func:`tokenize.tokenize`.
				23
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	24	Tokenizing Input
				25	----------------
				26
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	27	The primary entry point is a :term:`generator`:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	28
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	29	.. function:: tokenize(readline)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	30
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	31	The :func:`.tokenize` generator requires one argument, *readline*, which
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	32	must be a callable object that provides the same interface as the
Antoine Pitrou	4adb288	2010-01-04 18:50:53 +0000	[diff] [blame]	33	:meth:`io.IOBase.readline` method of file objects. Each call to the
				34	function should return one line of input as bytes.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	35
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	36	The generator produces 5-tuples with these members: the token type; the
				37	token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
				38	column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
				39	ints specifying the row and column where the token ends in the source; and
Georg Brandl	c28e1fa	2008-06-10 19:20:26 +0000	[diff] [blame]	40	the line on which the token was found. The line passed (the last tuple item)
Raymond Hettinger	a48db39	2009-04-29 00:34:27 +0000	[diff] [blame]	41	is the physical line. The 5-tuple is
				42	returned as a :term:`named tuple` with the field names:
				43	``type string start end line``.
				44
Serhiy Storchaka	d65c949	2015-11-02 14:10:23 +0200	[diff] [blame]	45	The returned :term:`named tuple` has an additional property named
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	46	``exact_type`` that contains the exact operator type for
				47	:data:`token.OP` tokens. For all other token types ``exact_type``
				48	equals the named tuple ``type`` field.
				49
Raymond Hettinger	a48db39	2009-04-29 00:34:27 +0000	[diff] [blame]	50	.. versionchanged:: 3.1
				51	Added support for named tuples.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	52
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	53	.. versionchanged:: 3.3
				54	Added support for ``exact_type``.
				55
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	56	:func:`.tokenize` determines the source encoding of the file by looking for a
Georg Brandl	c28e1fa	2008-06-10 19:20:26 +0000	[diff] [blame]	57	UTF-8 BOM or encoding cookie, according to :pep:`263`.
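As a minimal sketch of the interface described above, the following tokenizes a short byte string held in memory and reads the named-tuple fields, including ``exact_type``. The sample source line and the variable names are illustrative assumptions.

```python
from io import BytesIO
import tokenize

# Illustrative input; tokenize() expects a readline callable that returns bytes.
source = b"x = (1 + 2)\n"
tokens = list(tokenize.tokenize(BytesIO(source).readline))

for tok in tokens:
    # tok.type is the generic type (OP for every operator and delimiter);
    # tok.exact_type narrows OP tokens to EQUAL, LPAR, PLUS and so on.
    print(
        tokenize.tok_name[tok.type],
        tokenize.tok_name[tok.exact_type],
        repr(tok.string),
        tok.start,
        tok.end,
    )
```

For tokens that are not operators, the two printed names are the same, since ``exact_type`` falls back to ``type``.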
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	58
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	59
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	60	All constants from the :mod:`token` module are also exported from
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	61	:mod:`tokenize`, as are three additional token type values:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	62
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	63	.. data:: COMMENT
				64
				65	Token value used to indicate a comment.
				66
				67
				68	.. data:: NL
				69
				70	Token value used to indicate a non-terminating newline. The NEWLINE token
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	71	indicates the end of a logical line of Python code; NL tokens are generated
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	72	when a logical line of code is continued over multiple physical lines.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	73
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	74
				75	.. data:: ENCODING
				76
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	77	Token value that indicates the encoding used to decode the source bytes
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	78	into text. The first token returned by :func:`.tokenize` will always be an
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	79	ENCODING token.
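The difference between these token types is easiest to see on a small input. The sketch below (the sample source is an assumption chosen for illustration) collects the token names for a comment line, a blank line, and one statement continued across two physical lines.

```python
from io import BytesIO
import tokenize

# A comment line, a blank line, then one logical line split inside brackets.
source = b"# a comment\n\nvalues = [1,\n          2]\n"

names = [
    tokenize.tok_name[tok.type]
    for tok in tokenize.tokenize(BytesIO(source).readline)
]
print(names)
```

Only the end of the ``values = [...]`` statement produces a NEWLINE token; the comment line, the blank line, and the line break inside the brackets produce NL tokens.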
				80
				81
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	82	Another function is provided to reverse the tokenization process. This is
				83	useful for creating tools that tokenize a script, modify the token stream, and
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	84	write back the modified script.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	85
				86
				87	.. function:: untokenize(iterable)
				88
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	89	Converts tokens back into Python source code. The iterable must return
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	90	sequences with at least two elements, the token type and the token string.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	91	Any additional sequence elements are ignored.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	92
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	93	The reconstructed script is returned as a single string. The result is
				94	guaranteed to tokenize back to match the input so that the conversion is
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	95	lossless and round-trips are assured. The guarantee applies only to the
				96	token type and token string as the spacing between tokens (column
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	97	positions) may change.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	98
				99	It returns bytes, encoded using the ENCODING token, which is the first
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	100	token sequence output by :func:`.tokenize`.
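A round trip can be sketched as follows. The input line and the helper ``type_string_pairs`` are hypothetical names introduced for this example; the comparison uses only the token type and token string, which is the part the guarantee covers.

```python
from io import BytesIO
import tokenize

source = b"total = price * (1 + rate)\n"
tokens = list(tokenize.tokenize(BytesIO(source).readline))

# Passing the full tokens lets untokenize() use the recorded positions.
rebuilt = tokenize.untokenize(tokens)

# Passing only (type, string) pairs is also accepted; spacing may then differ.
pairs = [(tok.type, tok.string) for tok in tokens]
rebuilt_from_pairs = tokenize.untokenize(pairs)


def type_string_pairs(data):
    """Hypothetical helper: tokenize bytes and keep only (type, string)."""
    return [(t.type, t.string) for t in tokenize.tokenize(BytesIO(data).readline)]


print(rebuilt)
print(rebuilt_from_pairs)
```

Because the token stream starts with an ENCODING token, both results are bytes rather than text.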
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	101
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	102
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	103	:func:`.tokenize` needs to detect the encoding of source files it tokenizes. The
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	104	function it uses to do this is available:
				105
				106	.. function:: detect_encoding(readline)
				107
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	108	The :func:`detect_encoding` function is used to detect the encoding that
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	109	should be used to decode a Python source file. It requires one argument,
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	110	readline, in the same way as the :func:`.tokenize` generator.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	111
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	112	It will call readline a maximum of twice, and return the encoding used
				113	(as a string) and a list of any lines (not decoded from bytes) it has read
				114	in.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	115
Ezio Melotti	a8f6f1e	2009-12-20 12:24:57 +0000	[diff] [blame]	116	It detects the encoding from the presence of a UTF-8 BOM or an encoding
				117	cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
Benjamin Peterson	689a558	2010-03-18 22:29:52 +0000	[diff] [blame]	118	but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is found,
				119	``'utf-8-sig'`` will be returned as an encoding.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	120
Benjamin Peterson	b3a4829	2010-03-18 22:43:41 +0000	[diff] [blame]	121	If no encoding is specified, then the default of ``'utf-8'`` will be
				122	returned.
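The three cases described above (encoding cookie, BOM, and no declaration) can be sketched with in-memory byte strings. The sample inputs are assumptions made for illustration.

```python
from io import BytesIO
from tokenize import detect_encoding

# Case 1: a PEP 263 encoding cookie on the first line.
cookie_source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
encoding, lines_read = detect_encoding(BytesIO(cookie_source).readline)
print(encoding, lines_read)

# Case 2: a UTF-8 BOM and no cookie.
bom_encoding, _ = detect_encoding(BytesIO(b"\xef\xbb\xbfx = 1\n").readline)
print(bom_encoding)

# Case 3: neither a BOM nor a cookie, so the default applies.
default_encoding, _ = detect_encoding(BytesIO(b"x = 1\n").readline)
print(default_encoding)
```

The returned name is normalized, so a ``latin-1`` cookie is reported as ``'iso-8859-1'``. The lines already consumed are returned as raw bytes so that a caller can still process them.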
				123
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	124	Use :func:`.open` to open Python source files: it uses
Victor Stinner	58c0752	2010-11-09 01:08:59 +0000	[diff] [blame]	125	:func:`detect_encoding` to detect the file encoding.
Benjamin Peterson	b3a4829	2010-03-18 22:43:41 +0000	[diff] [blame]	126
Victor Stinner	58c0752	2010-11-09 01:08:59 +0000	[diff] [blame]	127
				128	.. function:: open(filename)
				129
				130	Open a file in read only mode using the encoding detected by
				131	:func:`detect_encoding`.
				132
				133	.. versionadded:: 3.2
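A short sketch of :func:`.open`, assuming a throwaway file written to a temporary directory (the file name and its contents are made up for the example):

```python
import os
import tempfile
import tokenize

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "latin1_example.py")

    # Write a source file whose cookie declares a non-default encoding.
    with open(path, "wb") as f:
        f.write("# -*- coding: latin-1 -*-\nname = 'café'\n".encode("latin-1"))

    # tokenize.open() detects the encoding and returns a text-mode file object.
    with tokenize.open(path) as src:
        file_encoding = src.encoding
        text = src.read()

print(file_encoding)
print(text)
```

Reading the same file with the built-in ``open(path)`` would instead use the platform's default text encoding, which may not match the declared one.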
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	134
Benjamin Peterson	96e0430	2014-06-07 17:47:41 -0700	[diff] [blame]	135	.. exception:: TokenError
				136
				137	Raised when either a docstring or expression that may be split over several
				138	lines is not completed anywhere in the file, for example::
				139
				140	"""Beginning of
				141	docstring
				142
				143	or::
				144
				145	[1,
				146	2,
				147	3
				148
				149	Note that unclosed single-quoted strings do not cause an error to be
				150	raised. They are tokenized as ``ERRORTOKEN``, followed by the tokenization of
				151	their contents.
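The unclosed-bracket case can be reproduced with a small helper. ``first_token_error`` is a hypothetical name used only in this sketch. The generator is lazy, so the error surfaces only once the tokens are consumed.

```python
from io import BytesIO
import tokenize


def first_token_error(source):
    """Hypothetical helper: return the TokenError raised for source, or None."""
    try:
        # Exhaust the generator; the error is raised when input runs out.
        for _ in tokenize.tokenize(BytesIO(source).readline):
            pass
    except tokenize.TokenError as exc:
        return exc
    return None


unclosed = first_token_error(b"[1,\n2,\n3\n")
closed = first_token_error(b"[1,\n2,\n3]\n")
print(repr(unclosed))
print(repr(closed))
```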
				152
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	153
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	154	.. _tokenize-cli:
				155
				156	Command-Line Usage
				157	------------------
				158
				159	.. versionadded:: 3.3
				160
				161	The :mod:`tokenize` module can be executed as a script from the command line.
				162	It is as simple as:
				163
				164	.. code-block:: sh
				165
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	166	python -m tokenize [-e] [filename.py]
				167
				168	The following options are accepted:
				169
				170	.. program:: tokenize
				171
				172	.. cmdoption:: -h, --help
				173
				174	show this help message and exit
				175
				176	.. cmdoption:: -e, --exact
				177
				178	display token names using the exact type
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	179
				180	If :file:`filename.py` is specified, its contents are tokenized to stdout.
				181	Otherwise, tokenization is performed on stdin.
				182
				183	Examples
				184	------------------
				185
Raymond Hettinger	6c60d09	2010-09-09 04:32:39 +0000	[diff] [blame]	186	Example of a script rewriter that transforms float literals into Decimal
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	187	objects::
				188
Ezio Melotti	a8f6f1e	2009-12-20 12:24:57 +0000	[diff] [blame]	189	from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
				190	from io import BytesIO
				191
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	192	def decistmt(s):
				193	    """Substitute Decimals for floats in a string of statements.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	194
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	195	    >>> from decimal import Decimal
				196	    >>> s = 'print(+21.3e-5*-.1234/81.7)'
				197	    >>> decistmt(s)
				198	    "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	199
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	200	    The format of the exponent is inherited from the platform C library.
				201	    Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
				202	    we're only showing 12 digits, and the 13th isn't close to 5, the
				203	    rest of the output should be platform-independent.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	204
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	205	    >>> exec(s)  #doctest: +ELLIPSIS
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	206	    -3.21716034272e-0...7
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	207
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	208	    Output from calculations with Decimal should be identical across all
				209	    platforms.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	210
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	211	    >>> exec(decistmt(s))
				212	    -3.217160342717258261933904529E-7
				213	    """
				214	    result = []
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	215	    g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
				216	    for toknum, tokval, _, _, _ in g:
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	217	        if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
				218	            result.extend([
				219	                (NAME, 'Decimal'),
				220	                (OP, '('),
				221	                (STRING, repr(tokval)),
				222	                (OP, ')')
				223	            ])
				224	        else:
				225	            result.append((toknum, tokval))
				226	    return untokenize(result).decode('utf-8')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	227
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	228	Example of tokenizing from the command line. The script::
				229
				230	def say_hello():
				231	    print("Hello, World!")
				232
				233	say_hello()
				234
				235	will be tokenized to the following output, where the first column is the range
				236	of the line/column coordinates where the token is found, the second column is
				237	the name of the token, and the final column is the value of the token (if any):
				238
				239	.. code-block:: sh
				240
				241	$ python -m tokenize hello.py
				242	0,0-0,0: ENCODING 'utf-8'
				243	1,0-1,3: NAME 'def'
				244	1,4-1,13: NAME 'say_hello'
				245	1,13-1,14: OP '('
				246	1,14-1,15: OP ')'
				247	1,15-1,16: OP ':'
				248	1,16-1,17: NEWLINE '\n'
				249	2,0-2,4: INDENT '    '
				250	2,4-2,9: NAME 'print'
				251	2,9-2,10: OP '('
				252	2,10-2,25: STRING '"Hello, World!"'
				253	2,25-2,26: OP ')'
				254	2,26-2,27: NEWLINE '\n'
				255	3,0-3,1: NL '\n'
				256	4,0-4,0: DEDENT ''
				257	4,0-4,9: NAME 'say_hello'
				258	4,9-4,10: OP '('
				259	4,10-4,11: OP ')'
				260	4,11-4,12: NEWLINE '\n'
				261	5,0-5,0: ENDMARKER ''
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	262
				263	The exact token type names can be displayed using the ``-e`` option:
				264
				265	.. code-block:: sh
				266
				267	$ python -m tokenize -e hello.py
				268	0,0-0,0: ENCODING 'utf-8'
				269	1,0-1,3: NAME 'def'
				270	1,4-1,13: NAME 'say_hello'
				271	1,13-1,14: LPAR '('
				272	1,14-1,15: RPAR ')'
				273	1,15-1,16: COLON ':'
				274	1,16-1,17: NEWLINE '\n'
				275	2,0-2,4: INDENT '    '
				276	2,4-2,9: NAME 'print'
				277	2,9-2,10: LPAR '('
				278	2,10-2,25: STRING '"Hello, World!"'
				279	2,25-2,26: RPAR ')'
				280	2,26-2,27: NEWLINE '\n'
				281	3,0-3,1: NL '\n'
				282	4,0-4,0: DEDENT ''
				283	4,0-4,9: NAME 'say_hello'
				284	4,9-4,10: LPAR '('
				285	4,10-4,11: RPAR ')'
				286	4,11-4,12: NEWLINE '\n'
				287	5,0-5,0: ENDMARKER ''