Blame - Doc/library/tokenize.rst - platform/external/python/cpython3

blob: 02a0428f21bc769835ba970d16ec1381cf35d0b8

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`tokenize` --- Tokenizer for Python source
				2	===============================================
				3
				4	.. module:: tokenize
				5	:synopsis: Lexical scanner for Python source code.
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	7	.. moduleauthor:: Ka Ping Yee
				8	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				9
Raymond Hettinger	1048094	2011-01-10 03:26:08 +0000	[diff] [blame]	10	Source code: :source:`Lib/tokenize.py`
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	11
Raymond Hettinger	4f707fd	2011-01-10 19:54:11 +0000	[diff] [blame]	12	--------------
				13
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	14	The :mod:`tokenize` module provides a lexical scanner for Python source code,
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	15	implemented in Python. The scanner in this module returns comments as tokens
				16	as well, making it useful for implementing "pretty-printers," including
				17	colorizers for on-screen displays.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
Mariatta	ea0f7c2	2017-09-12 21:00:00 -0700	[diff] [blame]	19	To simplify token stream handling, all :ref:`operator <operators>` and
				20	:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
				21	the generic :data:`~token.OP` token type. The exact
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	22	type can be determined by checking the ``exact_type`` property on the
				23	:term:`named tuple` returned from :func:`tokenize.tokenize`.
				24
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	25	Tokenizing Input
				26	----------------
				27
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	28	The primary entry point is a :term:`generator`:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	29
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	30	.. function:: tokenize(readline)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	31
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	32	The :func:`.tokenize` generator requires one argument, readline, which
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	33	must be a callable object which provides the same interface as the
Antoine Pitrou	4adb288	2010-01-04 18:50:53 +0000	[diff] [blame]	34	:meth:`io.IOBase.readline` method of file objects. Each call to the
				35	function should return one line of input as bytes.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	36
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	37	The generator produces 5-tuples with these members: the token type; the
				38	token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
				39	column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
				40	ints specifying the row and column where the token ends in the source; and
Georg Brandl	c28e1fa	2008-06-10 19:20:26 +0000	[diff] [blame]	41	the line on which the token was found. The line passed (the last tuple item)
Raymond Hettinger	a48db39	2009-04-29 00:34:27 +0000	[diff] [blame]	42	is the logical line; continuation lines are included. The 5-tuple is
				43	returned as a :term:`named tuple` with the field names:
				44	``type string start end line``.
				45
Serhiy Storchaka	d65c949	2015-11-02 14:10:23 +0200	[diff] [blame]	46	The returned :term:`named tuple` has an additional property named
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	47	``exact_type`` that contains the exact operator type for
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	48	:data:`~token.OP` tokens. For all other token types ``exact_type``
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	49	equals the named tuple ``type`` field.
				50
Raymond Hettinger	a48db39	2009-04-29 00:34:27 +0000	[diff] [blame]	51	.. versionchanged:: 3.1
				52	Added support for named tuples.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	53
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	54	.. versionchanged:: 3.3
				55	Added support for ``exact_type``.
				56
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	57	:func:`.tokenize` determines the source encoding of the file by looking for a
Georg Brandl	c28e1fa	2008-06-10 19:20:26 +0000	[diff] [blame]	58	UTF-8 BOM or encoding cookie, according to :pep:`263`.
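As a minimal sketch of the interface described above (the sample source line is an assumption chosen for illustration), the following wraps a bytes object in ``io.BytesIO`` to obtain a suitable ``readline`` callable and prints the generic and exact type name of each token:

```python
import io
import tokenize

# Hypothetical one-line program used only for illustration.
source = b"x = 1 + 2\n"

# tokenize() expects a readline callable that returns bytes.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

for tok in tokens:
    # tok_name maps numeric token types to their symbolic names.
    print(
        tokenize.tok_name[tok.type],        # generic type, e.g. OP
        tokenize.tok_name[tok.exact_type],  # exact type, e.g. EQUAL or PLUS
        repr(tok.string),
        tok.start,
        tok.end,
    )
```

For operator tokens the two names differ; for every other token they are the same.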
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	59
Georg Brandl	55ac8f0	2007-09-01 13:51:09 +0000	[diff] [blame]	60
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	61	All constants from the :mod:`token` module are also exported from
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	62	:mod:`tokenize`.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	63
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	64	Another function is provided to reverse the tokenization process. This is
				65	useful for creating tools that tokenize a script, modify the token stream, and
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	66	write back the modified script.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	67
				68
				69	.. function:: untokenize(iterable)
				70
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	71	Converts tokens back into Python source code. The iterable must return
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	72	sequences with at least two elements, the token type and the token string.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	73	Any additional sequence elements are ignored.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	74
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	75	The reconstructed script is returned as a single string. The result is
				76	guaranteed to tokenize back to match the input so that the conversion is
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	77	lossless and round-trips are assured. The guarantee applies only to the
				78	token type and token string as the spacing between tokens (column
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	79	positions) may change.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	80
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	81	It returns bytes, encoded using the encoding named by the :data:`~token.ENCODING`
				82	token, which is the first token output by :func:`.tokenize`.
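The round-trip guarantee can be illustrated with a short sketch. The sample source and the helper name ``type_and_string`` are assumptions made for this example. It rebuilds the source once from full tokens and once from ``(type, string)`` pairs, then tokenizes each result again for comparison:

```python
import io
import tokenize

# Hypothetical input used only for illustration.
source = b"total = price * (1 + tax)\n"


def type_and_string(data):
    """Return the (type, string) pairs produced by tokenizing bytes."""
    readline = io.BytesIO(data).readline
    return [(tok.type, tok.string) for tok in tokenize.tokenize(readline)]


tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
pairs = [(tok.type, tok.string) for tok in tokens]

# Full tokens carry position information.
rebuilt = tokenize.untokenize(tokens)

# With only two elements per token, spacing between tokens may change.
rebuilt_from_pairs = tokenize.untokenize(pairs)

print(rebuilt)
print(rebuilt_from_pairs)
```

Both results are bytes because the token stream starts with an ``ENCODING`` token. Only the token types and strings are expected to survive the round trip, not the exact spacing.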
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	83
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	84
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	85	:func:`.tokenize` needs to detect the encoding of source files it tokenizes. The
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	86	function it uses to do this is available:
				87
				88	.. function:: detect_encoding(readline)
				89
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	90	The :func:`detect_encoding` function is used to detect the encoding that
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	91	should be used to decode a Python source file. It requires one argument,
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	92	readline, in the same way as the :func:`.tokenize` generator.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	93
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	94	It will call readline a maximum of twice, and return the encoding used
				95	(as a string) and a list of any lines (not decoded from bytes) it has read
				96	in.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	97
Ezio Melotti	a8f6f1e	2009-12-20 12:24:57 +0000	[diff] [blame]	98	It detects the encoding from the presence of a UTF-8 BOM or an encoding
				99	cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	100	but disagree, a :exc:`SyntaxError` will be raised. Note that if the BOM is found,
Benjamin Peterson	689a558	2010-03-18 22:29:52 +0000	[diff] [blame]	101	``'utf-8-sig'`` will be returned as an encoding.
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	102
Benjamin Peterson	b3a4829	2010-03-18 22:43:41 +0000	[diff] [blame]	103	If no encoding is specified, then the default of ``'utf-8'`` will be
				104	returned.
				105
Martin Panter	20b1bfa	2016-01-16 04:32:52 +0000	[diff] [blame]	106	Use :func:`.open` to open Python source files: it uses
Victor Stinner	58c0752	2010-11-09 01:08:59 +0000	[diff] [blame]	107	:func:`detect_encoding` to detect the file encoding.
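The three outcomes described above (encoding cookie, BOM, and default) can be sketched as follows. The byte strings are assumptions made up for this example:

```python
import codecs
import io
import tokenize

# Source with a PEP 263 encoding cookie on the first line.
with_cookie = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
cookie_encoding, cookie_lines = tokenize.detect_encoding(
    io.BytesIO(with_cookie).readline
)

# Source that starts with a UTF-8 byte order mark and has no cookie.
with_bom = codecs.BOM_UTF8 + b"x = 1\n"
bom_encoding, _ = tokenize.detect_encoding(io.BytesIO(with_bom).readline)

# Source with neither a BOM nor a cookie.
plain = b"x = 1\n"
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(plain).readline)

print(cookie_encoding, cookie_lines)
print(bom_encoding)
print(default_encoding)
```

The returned lines are still bytes, so a caller that wants to process the whole file has to handle them before reading further from the stream.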
Benjamin Peterson	b3a4829	2010-03-18 22:43:41 +0000	[diff] [blame]	108
Victor Stinner	58c0752	2010-11-09 01:08:59 +0000	[diff] [blame]	109
				110	.. function:: open(filename)
				111
				112	Open a file in read only mode using the encoding detected by
				113	:func:`detect_encoding`.
				114
				115	.. versionadded:: 3.2
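A minimal sketch of :func:`.open`, assuming a temporary file named ``example.py`` (a hypothetical name) that is written in Latin-1 with a matching encoding cookie:

```python
import os
import tempfile
import tokenize

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.py")  # hypothetical file name

    # Write a source file whose encoding is declared by a cookie.
    with open(path, "wb") as f:
        f.write("# -*- coding: latin-1 -*-\ngreeting = 'café'\n".encode("latin-1"))

    # tokenize.open() detects the encoding and returns a text-mode file.
    with tokenize.open(path) as f:
        detected = f.encoding
        text = f.read()

print(detected)
print(text)
```

Using the builtin ``open`` with a fixed encoding instead could decode the non-ASCII character incorrectly or raise an error.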
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	116
Benjamin Peterson	96e0430	2014-06-07 17:47:41 -0700	[diff] [blame]	117	.. exception:: TokenError
				118
				119	Raised when either a docstring or expression that may be split over several
				120	lines is not completed anywhere in the file, for example::
				121
				122	"""Beginning of
				123	docstring
				124
				125	or::
				126
				127	[1,
				128	2,
				129	3
				130
				131	Note that unclosed single-quoted strings do not cause an error to be
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	132	raised. They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
				133	tokenization of their contents.
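A short sketch of handling the exception, using an unclosed list like the second example above (the variable names are assumptions for this illustration):

```python
import io
import tokenize

# The opening bracket is never closed before the input ends.
incomplete = b"[1,\n 2,\n 3\n"

try:
    # The generator is lazy, so the error surfaces while it is consumed.
    list(tokenize.tokenize(io.BytesIO(incomplete).readline))
except tokenize.TokenError as exc:
    error = exc
else:
    error = None

print(error)
```

The exact message and position carried by the exception can differ between Python versions, so the sketch only checks that the exception is raised.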
Benjamin Peterson	96e0430	2014-06-07 17:47:41 -0700	[diff] [blame]	134
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	135
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	136	.. _tokenize-cli:
				137
				138	Command-Line Usage
				139	------------------
				140
				141	.. versionadded:: 3.3
				142
				143	The :mod:`tokenize` module can be executed as a script from the command line.
				144	It is as simple as:
				145
				146	.. code-block:: sh
				147
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	148	python -m tokenize [-e] [filename.py]
				149
				150	The following options are accepted:
				151
				152	.. program:: tokenize
				153
				154	.. cmdoption:: -h, --help
				155
				156	show this help message and exit
				157
				158	.. cmdoption:: -e, --exact
				159
				160	display token names using the exact type
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	161
				162	If :file:`filename.py` is specified, its contents are tokenized to stdout.
				163	Otherwise, tokenization is performed on stdin.
				164
				165	Examples
				166	------------------
				167
Raymond Hettinger	6c60d09	2010-09-09 04:32:39 +0000	[diff] [blame]	168	Example of a script rewriter that transforms float literals into Decimal
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	169	objects::
				170
Ezio Melotti	a8f6f1e	2009-12-20 12:24:57 +0000	[diff] [blame]	171	    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
				172	    from io import BytesIO
				173
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	174	    def decistmt(s):
				175	        """Substitute Decimals for floats in a string of statements.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	176
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	177	        >>> from decimal import Decimal
				178	        >>> s = 'print(+21.3e-5*-.1234/81.7)'
				179	        >>> decistmt(s)
				180	        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	181
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	182	        The format of the exponent is inherited from the platform C library.
				183	        Known cases are "e-007" (Windows) and "e-07" (not Windows). Since
				184	        we're only showing 12 digits, and the 13th isn't close to 5, the
				185	        rest of the output should be platform-independent.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	186
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	187	        >>> exec(s)  #doctest: +ELLIPSIS
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	188	        -3.21716034272e-0...7
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	189
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	190	        Output from calculations with Decimal should be identical across all
				191	        platforms.
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	192
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	193	        >>> exec(decistmt(s))
				194	        -3.217160342717258261933904529E-7
				195	        """
				196	        result = []
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	197	        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
				198	        for toknum, tokval, _, _, _ in g:
Trent Nelson	428de65	2008-03-18 22:41:35 +0000	[diff] [blame]	199	            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
				200	                result.extend([
				201	                    (NAME, 'Decimal'),
				202	                    (OP, '('),
				203	                    (STRING, repr(tokval)),
				204	                    (OP, ')')
				205	                ])
				206	            else:
				207	                result.append((toknum, tokval))
				208	        return untokenize(result).decode('utf-8')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	209
Meador Inge	14c0f03	2011-10-07 08:53:38 -0500	[diff] [blame]	210	Example of tokenizing from the command line. The script::
				211
				212	    def say_hello():
				213	        print("Hello, World!")
				214
				215	    say_hello()
				216
				217	will be tokenized to the following output where the first column is the range
				218	of the line/column coordinates where the token is found, the second column is
				219	the name of the token, and the final column is the value of the token (if any):
				220
				221	.. code-block:: sh
				222
				223	    $ python -m tokenize hello.py
				224	    0,0-0,0:            ENCODING       'utf-8'
				225	    1,0-1,3:            NAME           'def'
				226	    1,4-1,13:           NAME           'say_hello'
				227	    1,13-1,14:          OP             '('
				228	    1,14-1,15:          OP             ')'
				229	    1,15-1,16:          OP             ':'
				230	    1,16-1,17:          NEWLINE        '\n'
				231	    2,0-2,4:            INDENT         '    '
				232	    2,4-2,9:            NAME           'print'
				233	    2,9-2,10:           OP             '('
				234	    2,10-2,25:          STRING         '"Hello, World!"'
				235	    2,25-2,26:          OP             ')'
				236	    2,26-2,27:          NEWLINE        '\n'
				237	    3,0-3,1:            NL             '\n'
				238	    4,0-4,0:            DEDENT         ''
				239	    4,0-4,9:            NAME           'say_hello'
				240	    4,9-4,10:           OP             '('
				241	    4,10-4,11:          OP             ')'
				242	    4,11-4,12:          NEWLINE        '\n'
				243	    5,0-5,0:            ENDMARKER      ''
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	244
Serhiy Storchaka	5cefb6c	2017-06-06 18:43:35 +0300	[diff] [blame]	245	The exact token type names can be displayed using the :option:`-e` option:
Meador Inge	00c7f85	2012-01-19 00:44:45 -0600	[diff] [blame]	246
				247	.. code-block:: sh
				248
				249	    $ python -m tokenize -e hello.py
				250	    0,0-0,0:            ENCODING       'utf-8'
				251	    1,0-1,3:            NAME           'def'
				252	    1,4-1,13:           NAME           'say_hello'
				253	    1,13-1,14:          LPAR           '('
				254	    1,14-1,15:          RPAR           ')'
				255	    1,15-1,16:          COLON          ':'
				256	    1,16-1,17:          NEWLINE        '\n'
				257	    2,0-2,4:            INDENT         '    '
				258	    2,4-2,9:            NAME           'print'
				259	    2,9-2,10:           LPAR           '('
				260	    2,10-2,25:          STRING         '"Hello, World!"'
				261	    2,25-2,26:          RPAR           ')'
				262	    2,26-2,27:          NEWLINE        '\n'
				263	    3,0-3,1:            NL             '\n'
				264	    4,0-4,0:            DEDENT         ''
				265	    4,0-4,9:            NAME           'say_hello'
				266	    4,9-4,10:           LPAR           '('
				267	    4,10-4,11:          RPAR           ')'
				268	    4,11-4,12:          NEWLINE        '\n'
				269	    5,0-5,0:            ENDMARKER      ''
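
A listing of this kind can be approximated from Python code. The helper name ``format_tokens`` and the column widths below are assumptions for illustration, not the module's own command-line implementation:

```python
import io
import tokenize

# The same script as in the command-line example, as bytes.
source = b'def say_hello():\n    print("Hello, World!")\n\nsay_hello()\n'


def format_tokens(data, exact=False):
    """Return one formatted row per token (hypothetical helper)."""
    rows = []
    for tok in tokenize.tokenize(io.BytesIO(data).readline):
        # Choose between the generic type and the exact operator type.
        kind = tok.exact_type if exact else tok.type
        # start and end are (row, col) tuples; join them into one range.
        span = "%d,%d-%d,%d:" % (tok.start + tok.end)
        rows.append("%-20s%-15s%r" % (span, tokenize.tok_name[kind], tok.string))
    return rows


for row in format_tokens(source, exact=True):
    print(row)
```

Passing ``exact=False`` gives the generic ``OP`` name for operators and delimiters instead of names such as ``LPAR``.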