Blame - Doc/library/tokenize.rst - platform/external/python/cpython3

blob: cd27a101a8fef2c217bc839dca99a3698b213511

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`tokenize` --- Tokenizer for Python source
				2	===============================================
				3
				4	.. module:: tokenize
				5	:synopsis: Lexical scanner for Python source code.
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	7	.. moduleauthor:: Ka Ping Yee
				8	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				9
Raymond Hettinger	1048094	2011-01-10 03:26:08 +0000	[diff] [blame]	10	Source code: :source:`Lib/tokenize.py`
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	11
Raymond Hettinger	4f707fd	2011-01-10 19:54:11 +0000	[diff] [blame]	12	--------------
				13
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	14	The :mod:`tokenize` module provides a lexical scanner for Python source code,
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	15	implemented in Python. The scanner in this module returns comments as tokens
				16	as well, making it useful for implementing "pretty-printers," including
				17	colorizers for on-screen displays.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	19	To simplify token stream handling, all :ref:`operators` and :ref:`delimiters`
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	20	tokens are returned using the generic :data:`~token.OP` token type. The exact
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	21	type can be determined by checking the ``exact_type`` property on the
				22	:term:`named tuple` returned from :func:`tokenize.tokenize`.
				23
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	24	Tokenizing Input
				25	----------------
				26
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	27	The primary entry point is a :term:`generator`:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	28
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	29	.. function:: tokenize(readline)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	30
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	31	The :func:`.tokenize` generator requires one argument, readline, which
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	32	must be a callable object that provides the same interface as the
Antoine Pitrou	4adb288	2010-01-04 18:50:53 +0000	[diff] [blame]	33	:meth:`io.IOBase.readline` method of file objects. Each call to the
				34	function should return one line of input as bytes.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	35
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	36	The generator produces 5-tuples with these members: the token type; the
				37	token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
				38	column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
				39	ints specifying the row and column where the token ends in the source; and
Georg Brandl	c28e1fa	2008-06-10 19:20:26 +0000	[diff] [blame]	40	the line on which the token was found. The line passed (the last tuple item)
Raymond Hettinger	a48db39	2009-04-29 00:34:27 +0000	[diff] [blame]	41	is the logical line; continuation lines are included. The 5-tuple is
				42	returned as a :term:`named tuple` with the field names:
				43	``type string start end line``.
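
As a minimal sketch of the interface described above, the following tokenizes a short in-memory source. The sample source line and the use of :class:`io.BytesIO` to supply ``readline`` are illustrative choices, not requirements of the module.

```python
import io
import tokenize

# Illustrative one-line source; tokenize() expects readline to return bytes.
source = b"x = 1 + 2\n"

tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

for tok in tokens:
    # Each item is a named tuple with fields: type, string, start, end, line.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

The first token reports the detected encoding, and the fields can be read either by name (``tok.start``) or by position (``tok[2]``).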
				44
Serhiy Storchaka	d65c949	2015-11-02 14:10:23 +0200	[diff] [blame]	45	The returned :term:`named tuple` has an additional property named
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	46	``exact_type`` that contains the exact operator type for
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	47	:data:`~token.OP` tokens. For all other token types ``exact_type``
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	48	equals the named tuple ``type`` field.
				49
Raymond Hettinger	a48db39	2009-04-29 00:34:27 +0000	[diff] [blame]	50	.. versionchanged:: 3.1
				51	Added support for named tuples.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	52
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	53	.. versionchanged:: 3.3
				54	Added support for ``exact_type``.
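
The difference between ``type`` and ``exact_type`` can be seen in a small sketch like the one below. The sample expression is an arbitrary assumption chosen to include several operators.

```python
import io
import tokenize

# Arbitrary sample expression containing several operators and delimiters.
source = b"total = (a + b) * 2\n"

# For every OP token, record its string, generic type name and exact type name.
ops = [
    (tok.string, tokenize.tok_name[tok.type], tokenize.tok_name[tok.exact_type])
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
    if tok.type == tokenize.OP
]

for string, generic_name, exact_name in ops:
    print(string, generic_name, exact_name)
```

Every entry has the generic name ``OP``, while the exact name distinguishes, for example, ``LPAR`` from ``PLUS``.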
				55
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	56	:func:`.tokenize` determines the source encoding of the file by looking for a
Georg Brandl	c28e1fa	2008-06-10 19:20:26 +0000	[diff] [blame]	57	UTF-8 BOM or encoding cookie, according to :pep:`263`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	58
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	59
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	60	All constants from the :mod:`token` module are also exported from
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	61	:mod:`tokenize`.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	62
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	63	Another function is provided to reverse the tokenization process. This is
				64	useful for creating tools that tokenize a script, modify the token stream, and
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	65	write back the modified script.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	66
				67
				68	.. function:: untokenize(iterable)
				69
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	70	Converts tokens back into Python source code. The iterable must return
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	71	sequences with at least two elements, the token type and the token string.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	72	Any additional sequence elements are ignored.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	73
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	74	The reconstructed script is returned as a single string. The result is
				75	guaranteed to tokenize back to match the input so that the conversion is
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	76	lossless and round-trips are assured. The guarantee applies only to the
				77	token type and token string as the spacing between tokens (column
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	78	positions) may change.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	79
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	80	The result is returned as bytes, encoded using the encoding named by the
				81	:data:`~token.ENCODING` token, which is the first token output by :func:`.tokenize`.
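
The round-trip guarantee can be checked with a short sketch such as the following. It passes only ``(type, string)`` pairs to :func:`untokenize`, so the spacing of the rebuilt source may differ from the input. The sample source is an assumption made for illustration.

```python
import io
import tokenize

# Illustrative single-line source.
source = b"x = 1 + 2\n"

# Keep only the token type and token string of each token.
pairs = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]

# The stream starts with an ENCODING token, so the result is bytes.
rebuilt = tokenize.untokenize(pairs)

# Tokenizing the rebuilt source yields the same (type, string) pairs,
# even if the whitespace between tokens has changed.
pairs_again = [
    (tok.type, tok.string)
    for tok in tokenize.tokenize(io.BytesIO(rebuilt).readline)
]
print(rebuilt)
print(pairs_again == pairs)
```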
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	82
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	83
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	84	:func:`.tokenize` needs to detect the encoding of source files it tokenizes. The
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	85	function it uses to do this is available:
				86
				87	.. function:: detect_encoding(readline)
				88
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	89	The :func:`detect_encoding` function is used to detect the encoding that
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	90	should be used to decode a Python source file. It requires one argument,
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	91	readline, in the same way as the :func:`.tokenize` generator.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	92
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	93	It will call readline a maximum of twice, and return the encoding used
				94	(as a string) and a list of any lines (not decoded from bytes) it has read
				95	in.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	96
Ezio Melotti	a8f6f1e	2009-12-20 12:24:57 +0000	[diff] [blame]	97	It detects the encoding from the presence of a UTF-8 BOM or an encoding
				98	cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	99	but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is found,
Benjamin Peterson	689a558	2010-03-18 22:29:52 +0000	[diff] [blame]	100	``'utf-8-sig'`` will be returned as an encoding.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	101
Benjamin Peterson	b3a4829	2010-03-18 22:43:41 +0000	[diff] [blame]	102	If no encoding is specified, then the default of ``'utf-8'`` will be
				103	returned.
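
A brief sketch of these cases follows. The byte strings are made-up inputs and the variable names are illustrative only.

```python
import io
import tokenize

# Made-up inputs: one with an encoding cookie, one with a UTF-8 BOM, one plain.
with_cookie = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
with_bom = b"\xef\xbb\xbfprint(1)\n"
plain = b"print(1)\n"

cookie_enc, cookie_lines = tokenize.detect_encoding(io.BytesIO(with_cookie).readline)
bom_enc, bom_lines = tokenize.detect_encoding(io.BytesIO(with_bom).readline)
plain_enc, plain_lines = tokenize.detect_encoding(io.BytesIO(plain).readline)

# A BOM combined with a disagreeing cookie is rejected with SyntaxError.
conflicting = b"\xef\xbb\xbf# -*- coding: latin-1 -*-\n"
try:
    tokenize.detect_encoding(io.BytesIO(conflicting).readline)
    conflict_rejected = False
except SyntaxError:
    conflict_rejected = True

print(cookie_enc, bom_enc, plain_enc, conflict_rejected)
```

The returned encoding name is normalized, so a ``latin-1`` cookie is reported as ``'iso-8859-1'``. The lines already consumed are returned undecoded so that a caller can process them before reading the rest of the file.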
				104
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	105	Use :func:`.open` to open Python source files: it uses
Victor Stinner	58c0752	2010-11-09 01:08:59 +0000	[diff] [blame]	106	:func:`detect_encoding` to detect the file encoding.
Benjamin Peterson	b3a4829	2010-03-18 22:43:41 +0000	[diff] [blame]	107
Victor Stinner	58c0752	2010-11-09 01:08:59 +0000	[diff] [blame]	108
				109	.. function:: open(filename)
				110
				111	Open a file in read only mode using the encoding detected by
				112	:func:`detect_encoding`.
				113
				114	.. versionadded:: 3.2
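
As a small sketch, the following writes a Latin-1 encoded source file to a temporary directory and reads it back with :func:`.open`. The file name and contents are assumptions made for the example.

```python
import os
import tempfile
import tokenize

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the cookie declares a non-default encoding.
    path = os.path.join(tmp, "latin1_example.py")
    with open(path, "wb") as f:
        f.write("# -*- coding: latin-1 -*-\ngreeting = 'café'\n".encode("latin-1"))

    # tokenize.open() detects the encoding and returns a text-mode file object.
    with tokenize.open(path) as src:
        detected = src.encoding
        text = src.read()

print(detected)
print(text)
```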
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	115
Benjamin Peterson	96e0430	2014-06-07 17:47:41 -0700	[diff] [blame]	116	.. exception:: TokenError
				117
				118	Raised when either a docstring or expression that may be split over several
				119	lines is not completed anywhere in the file, for example::
				120
				121	"""Beginning of
				122	docstring
				123
				124	or::
				125
				126	[1,
				127	2,
				128	3
				129
				130	Note that unclosed single-quoted strings do not cause an error to be
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	131	raised. They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
				132	tokenization of their contents.
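
The two incomplete inputs shown above can be tried with a sketch like this one. ``raises_token_error`` is a hypothetical helper written for the example, not part of the module.

```python
import io
import tokenize

def raises_token_error(source):
    """Hypothetical helper: report whether tokenizing source raises TokenError."""
    try:
        # The generator is lazy, so it must be consumed for the error to surface.
        for _ in tokenize.tokenize(io.BytesIO(source).readline):
            pass
    except tokenize.TokenError:
        return True
    return False

unclosed_docstring = raises_token_error(b'"""Beginning of\ndocstring\n')
unclosed_list = raises_token_error(b"[1,\n 2,\n 3\n")
complete_list = raises_token_error(b"[1,\n 2,\n 3]\n")

print(unclosed_docstring, unclosed_list, complete_list)
```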
Benjamin Peterson	96e0430	2014-06-07 17:47:41 -0700	[diff] [blame]	133
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	134
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	135	.. _tokenize-cli:
				136
				137	Command-Line Usage
				138	------------------
				139
				140	.. versionadded:: 3.3
				141
				142	The :mod:`tokenize` module can be executed as a script from the command line.
				143	It is as simple as:
				144
				145	.. code-block:: sh
				146
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	147	python -m tokenize [-e] [filename.py]
				148
				149	The following options are accepted:
				150
				151	.. program:: tokenize
				152
				153	.. cmdoption:: -h, --help
				154
				155	show this help message and exit
				156
				157	.. cmdoption:: -e, --exact
				158
				159	display token names using the exact type
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	160
				161	If :file:`filename.py` is specified, its contents are tokenized to stdout.
				162	Otherwise, tokenization is performed on stdin.
				163
				164	Examples
				165	------------------
				166
Raymond Hettinger	6c60d09	2010-09-09 04:32:39 +0000	[diff] [blame]	167	Example of a script rewriter that transforms float literals into Decimal
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	168	objects::
				169
Ezio Melotti	a8f6f1e	2009-12-20 12:24:57 +0000	[diff] [blame]	170	    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
				171	    from io import BytesIO
				172
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	173	    def decistmt(s):
				174	        """Substitute Decimals for floats in a string of statements.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	175
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	176	        >>> from decimal import Decimal
				177	        >>> s = 'print(+21.3e-5*-.1234/81.7)'
				178	        >>> decistmt(s)
				179	        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	180
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	181	        The format of the exponent is inherited from the platform C library.
				182	        Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
				183	        we're only showing 12 digits, and the 13th isn't close to 5, the
				184	        rest of the output should be platform-independent.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	185
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	186	        >>> exec(s)  #doctest: +ELLIPSIS
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	187	        -3.21716034272e-0...7
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	188
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	189	        Output from calculations with Decimal should be identical across all
				190	        platforms.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	191
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	192	        >>> exec(decistmt(s))
				193	        -3.217160342717258261933904529E-7
				194	        """
				195	        result = []
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	196	        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
				197	        for toknum, tokval, _, _, _ in g:
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	198	            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
				199	                result.extend([
				200	                    (NAME, 'Decimal'),
				201	                    (OP, '('),
				202	                    (STRING, repr(tokval)),
				203	                    (OP, ')')
				204	                ])
				205	            else:
				206	                result.append((toknum, tokval))
				207	        return untokenize(result).decode('utf-8')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	208
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	209	Example of tokenizing from the command line. The script::
				210
				211	    def say_hello():
				212	        print("Hello, World!")
				213
				214	    say_hello()
				215
				216	will be tokenized to the following output where the first column is the range
				217	of the line/column coordinates where the token is found, the second column is
				218	the name of the token, and the final column is the value of the token (if any):
				219
				220	.. code-block:: sh
				221
				222	$ python -m tokenize hello.py
				223	0,0-0,0: ENCODING 'utf-8'
				224	1,0-1,3: NAME 'def'
				225	1,4-1,13: NAME 'say_hello'
				226	1,13-1,14: OP '('
				227	1,14-1,15: OP ')'
				228	1,15-1,16: OP ':'
				229	1,16-1,17: NEWLINE '\n'
				230	2,0-2,4: INDENT '    '
				231	2,4-2,9: NAME 'print'
				232	2,9-2,10: OP '('
				233	2,10-2,25: STRING '"Hello, World!"'
				234	2,25-2,26: OP ')'
				235	2,26-2,27: NEWLINE '\n'
				236	3,0-3,1: NL '\n'
				237	4,0-4,0: DEDENT ''
				238	4,0-4,9: NAME 'say_hello'
				239	4,9-4,10: OP '('
				240	4,10-4,11: OP ')'
				241	4,11-4,12: NEWLINE '\n'
				242	5,0-5,0: ENDMARKER ''
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	243
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	244	The exact token type names can be displayed using the :option:`-e` option:
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	245
				246	.. code-block:: sh
				247
				248	$ python -m tokenize -e hello.py
				249	0,0-0,0: ENCODING 'utf-8'
				250	1,0-1,3: NAME 'def'
				251	1,4-1,13: NAME 'say_hello'
				252	1,13-1,14: LPAR '('
				253	1,14-1,15: RPAR ')'
				254	1,15-1,16: COLON ':'
				255	1,16-1,17: NEWLINE '\n'
				256	2,0-2,4: INDENT '    '
				257	2,4-2,9: NAME 'print'
				258	2,9-2,10: LPAR '('
				259	2,10-2,25: STRING '"Hello, World!"'
				260	2,25-2,26: RPAR ')'
				261	2,26-2,27: NEWLINE '\n'
				262	3,0-3,1: NL '\n'
				263	4,0-4,0: DEDENT ''
				264	4,0-4,9: NAME 'say_hello'
				265	4,9-4,10: LPAR '('
				266	4,10-4,11: RPAR ')'
				267	4,11-4,12: NEWLINE '\n'
				268	5,0-5,0: ENDMARKER ''
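
A listing in the same spirit can be produced from Python code. The sketch below is an approximation, not the implementation behind the command-line interface: ``format_tokens`` is a hypothetical helper, and its column spacing is simplified.

```python
import io
import tokenize

# The sample script from above, held in memory as bytes.
SCRIPT = b'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'

def format_tokens(source, exact=False):
    """Hypothetical helper: return one 'range: NAME value' line per token."""
    lines = []
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        # With exact=True, OP tokens are reported by their exact type.
        kind = tok.exact_type if exact else tok.type
        (srow, scol), (erow, ecol) = tok.start, tok.end
        name = tokenize.tok_name[kind]
        lines.append(f"{srow},{scol}-{erow},{ecol}: {name} {tok.string!r}")
    return lines

generic = format_tokens(SCRIPT)
exact = format_tokens(SCRIPT, exact=True)

print("\n".join(exact))
```

Passing ``exact=True`` plays the role of the ``-e`` option: operator tokens are named ``LPAR``, ``RPAR`` or ``COLON`` rather than ``OP``.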