:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.

.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

**Source code:** :source:`Lib/tokenize.py`

--------------

The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

To simplify token stream handling, all :ref:`operator <operators>` and
:ref:`delimiter <delimiters>` tokens and :data:`Ellipsis` are returned using
the generic :data:`~token.OP` token type.  The exact
type can be determined by checking the ``exact_type`` property on the
:term:`named tuple` returned from :func:`tokenize.tokenize`.

Tokenizing Input
----------------

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`.tokenize` generator requires one argument, *readline*, which
   must be a callable object which provides the same interface as the
   :meth:`io.IOBase.readline` method of file objects.  Each call to the
   function should return one line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed (the last tuple
   item) is the *logical* line; continuation lines are included.  The 5-tuple
   is returned as a :term:`named tuple` with the field names:
   ``type string start end line``.

   The returned :term:`named tuple` has an additional property named
   ``exact_type`` that contains the exact operator type for
   :data:`~token.OP` tokens.  For all other token types ``exact_type``
   equals the named tuple ``type`` field.

   .. versionchanged:: 3.1
      Added support for named tuples.

   .. versionchanged:: 3.3
      Added support for ``exact_type``.

   :func:`.tokenize` determines the source encoding of the file by looking
   for a UTF-8 BOM or encoding cookie, according to :pep:`263`.

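For example, tokenizing a small expression supplied as bytes (a minimal
sketch; the variable names are illustrative):

```python
import io
from tokenize import tokenize, tok_name, OP, PLUS

# tokenize() needs a readline callable that yields bytes, so wrap the
# source in a BytesIO object.
source = b"x = 1 + 2\n"
tokens = list(tokenize(io.BytesIO(source).readline))

for tok in tokens:
    # Each token is a named tuple: type, string, start, end, line.
    print(tok_name[tok.type], repr(tok.string), tok.start, tok.end)

# For OP tokens, exact_type narrows the generic type down to the
# specific operator; here '+' has exact_type PLUS.
plus = next(t for t in tokens if t.type == OP and t.string == '+')
print(tok_name[plus.exact_type])  # PLUS
```

Note that the very first token in the stream is the :data:`~token.ENCODING`
token describing the detected source encoding.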
.. function:: generate_tokens(readline)

   Tokenize a source reading unicode strings instead of bytes.

   Like :func:`.tokenize`, the *readline* argument is a callable returning
   a single line of input.  However, :func:`generate_tokens` expects
   *readline* to return a str object rather than bytes.

   The result is an iterator yielding named tuples, exactly like
   :func:`.tokenize`.  It does not yield an :data:`~token.ENCODING` token.

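A minimal sketch of tokenizing text rather than bytes, using
:class:`io.StringIO` as the line source:

```python
import io
from tokenize import generate_tokens, NAME

# generate_tokens() expects readline to return str, so a StringIO
# wrapper around the source text is enough.
source = "spam = ham + 1\n"
tokens = list(generate_tokens(io.StringIO(source).readline))

# Unlike tokenize(), no ENCODING token is emitted; the stream starts
# directly with the first real token.
names = [t.string for t in tokens if t.type == NAME]
print(names)  # ['spam', 'ham']
```
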
All constants from the :mod:`token` module are also exported from
:mod:`tokenize`.

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.

.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single string.  The result is
   guaranteed to tokenize back to match the input so that the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string as the spacing between tokens (column
   positions) may change.

   It returns bytes, encoded using the :data:`~token.ENCODING` token, which
   is the first token sequence output by :func:`.tokenize`.  If there is no
   encoding token in the input, it returns a str instead.

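A round-trip sketch: the rebuilt source is not guaranteed to be byte-for-byte
identical to the original, but tokenizing it again yields the same token types
and strings:

```python
import io
from tokenize import tokenize, untokenize

def token_pairs(source_bytes):
    # Reduce a token stream to (type, string) pairs for comparison.
    return [(t.type, t.string)
            for t in tokenize(io.BytesIO(source_bytes).readline)]

source = b"if a:\n    b = 1\n"

# Full 5-tuples are passed through here, and the stream begins with an
# ENCODING token, so untokenize() returns bytes.
rebuilt = untokenize(tokenize(io.BytesIO(source).readline))

assert isinstance(rebuilt, bytes)
assert token_pairs(rebuilt) == token_pairs(source)
```
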

:func:`.tokenize` needs to detect the encoding of source files it tokenizes.
The function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`.tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has
   read in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.  Note that if the BOM is
   found, ``'utf-8-sig'`` will be returned as an encoding.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

   Use :func:`.open` to open Python source files: it uses
   :func:`detect_encoding` to detect the file encoding.

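For instance, feeding it a buffer with and without a :pep:`263` cookie
(a minimal sketch):

```python
import io
from tokenize import detect_encoding

# With an explicit encoding cookie on the first line, the cookie wins;
# spelling variants are normalized ('latin-1' -> 'iso-8859-1').
src = b"# -*- coding: latin-1 -*-\nx = 1\n"
encoding, lines = detect_encoding(io.BytesIO(src).readline)
print(encoding)    # iso-8859-1
print(len(lines))  # 1 (only the first line needed to be read)

# With no BOM and no cookie, the default applies.
encoding, _ = detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(encoding)    # utf-8
```
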

.. function:: open(filename)

   Open a file in read-only mode using the encoding detected by
   :func:`detect_encoding`.

   .. versionadded:: 3.2

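A sketch of the difference this makes for a non-UTF-8 file (the temporary
file is only for illustration):

```python
import os
import tempfile
import tokenize

# Write a source file whose cookie declares Latin-1 and which contains
# a byte (0xE9) that is not valid UTF-8 on its own.
fd, path = tempfile.mkstemp(suffix=".py")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"# -*- coding: iso-8859-1 -*-\nname = '\xe9'\n")
    with tokenize.open(path) as f:
        # The file is decoded with the detected encoding, so the
        # Latin-1 byte 0xE9 comes back as the character 'é'.
        text = f.read()
    print('\xe9' in text)  # True
finally:
    os.remove(path)
```
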
.. exception:: TokenError

   Raised when either a docstring or expression that may be split over several
   lines is not completed anywhere in the file, for example::

      """Beginning of
      docstring

   or::

      [1,
       2,
       3

   Note that unclosed single-quoted strings do not cause an error to be
   raised.  They are tokenized as :data:`~token.ERRORTOKEN`, followed by the
   tokenization of their contents.

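A sketch of catching the exception for the unclosed-bracket case above:

```python
import io
from tokenize import tokenize, TokenError

# The bracket opened on the first line is never closed, so exhausting
# the generator raises TokenError at end of input.
source = b"[1,\n 2,\n 3\n"
try:
    list(tokenize(io.BytesIO(source).readline))
except TokenError as err:
    print("incomplete input:", err.args[0])
```
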
.. _tokenize-cli:

Command-Line Usage
------------------

.. versionadded:: 3.3

The :mod:`tokenize` module can be executed as a script from the command line.
It is as simple as:

.. code-block:: sh

   python -m tokenize [-e] [filename.py]

The following options are accepted:

.. program:: tokenize

.. cmdoption:: -h, --help

   show this help message and exit

.. cmdoption:: -e, --exact

   display token names using the exact type

If :file:`filename.py` is specified its contents are tokenized to stdout.
Otherwise, tokenization is performed on stdin.

Examples
--------

Example of a script rewriter that transforms float literals into Decimal
objects::

    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP
    from io import BytesIO

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')

Example of tokenizing from the command line.  The script::

    def say_hello():
        print("Hello, World!")

    say_hello()

will be tokenized to the following output, where the first column is the range
of the line/column coordinates where the token is found, the second column is
the name of the token, and the final column is the value of the token (if any):

.. code-block:: shell-session

    $ python -m tokenize hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          OP             '('
    1,14-1,15:          OP             ')'
    1,15-1,16:          OP             ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           OP             '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          OP             ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           OP             '('
    4,10-4,11:          OP             ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''

The exact token type names can be displayed using the :option:`-e` option:

.. code-block:: shell-session

    $ python -m tokenize -e hello.py
    0,0-0,0:            ENCODING       'utf-8'
    1,0-1,3:            NAME           'def'
    1,4-1,13:           NAME           'say_hello'
    1,13-1,14:          LPAR           '('
    1,14-1,15:          RPAR           ')'
    1,15-1,16:          COLON          ':'
    1,16-1,17:          NEWLINE        '\n'
    2,0-2,4:            INDENT         '    '
    2,4-2,9:            NAME           'print'
    2,9-2,10:           LPAR           '('
    2,10-2,25:          STRING         '"Hello, World!"'
    2,25-2,26:          RPAR           ')'
    2,26-2,27:          NEWLINE        '\n'
    3,0-3,1:            NL             '\n'
    4,0-4,0:            DEDENT         ''
    4,0-4,9:            NAME           'say_hello'
    4,9-4,10:           LPAR           '('
    4,10-4,11:          RPAR           ')'
    4,11-4,12:          NEWLINE        '\n'
    5,0-5,0:            ENDMARKER      ''