Blame - Doc/library/tokenize.rst - platform/external/python/cpython3

:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python. The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

The primary entry point is a :term:`generator`:

.. function:: tokenize(readline)

   The :func:`tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`). Each call to the function should return one
   line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found. The line passed (the last tuple item)
   is the logical line; continuation lines are included.

   :func:`tokenize` determines the source encoding of the file by looking for a
   UTF-8 BOM or encoding cookie, according to :pep:`263`.


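As a rough illustration of the 5-tuples described above, the sketch below
tokenizes a one-line snippet held in memory. Wrapping the bytes in
``io.BytesIO`` to obtain a *readline* callable is a choice made for this
example, and the sample source is made up:

```python
import io
import token
import tokenize

# Hypothetical one-line source; tokenize() expects bytes, one line per call.
source = b"x = 1  # set x\n"
readline = io.BytesIO(source).readline

tokens = list(tokenize.tokenize(readline))

for tok_type, tok_string, start, end, line in tokens:
    # token.tok_name maps the numeric token type to a readable name.
    print(token.tok_name[tok_type], repr(tok_string), start, end)
```

Rows are counted from 1 and columns from 0, so the name ``x`` should be
reported as starting at ``(1, 0)`` and ending at ``(1, 1)``.
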
All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline. The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.


.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text. The first token returned by :func:`tokenize` will always be an
   ENCODING token.


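To see how these three token types show up in practice, the following sketch
lists the token type names for a statement that spans two physical lines and
carries a comment. The sample source is invented for the illustration:

```python
import io
import token
import tokenize

# Hypothetical source: one logical line spread over two physical lines.
source = b"total = (1 +  # first\n         2)\n"

names = [
    token.tok_name[tok[0]]
    for tok in tokenize.tokenize(io.BytesIO(source).readline)
]
print(names)
```

The stream should open with ENCODING, the line break inside the parentheses
should appear as NL (right after the COMMENT token), and only the break that
ends the statement should appear as NEWLINE.
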
Another function is provided to reverse the tokenization process. This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code. The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single bytes object, encoded
   using the ENCODING token, which is the first token sequence output by
   :func:`tokenize`. The result is guaranteed to tokenize back to match the
   input so that the conversion is lossless and round-trips are assured. The
   guarantee applies only to the token type and token string, as the spacing
   between tokens (column positions) may change.


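The round-trip guarantee can be checked with a short sketch like the one
below. It passes only ``(type, string)`` pairs to :func:`untokenize`, then
tokenizes the result again and compares the pairs. The sample source and the
variable names are illustrative only:

```python
import io
import tokenize

# Hypothetical source used only for this illustration.
source = b"x = 1 + 2\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Keep just the token type and token string of each token.
pairs = [(tok[0], tok[1]) for tok in tokens]

# The stream starts with an ENCODING token, so the result is bytes.
rebuilt = tokenize.untokenize(pairs)

# Tokenize the rebuilt source and compare types and strings only;
# spacing (and therefore the exact bytes) may differ from the input.
again = [
    (tok[0], tok[1])
    for tok in tokenize.tokenize(io.BytesIO(rebuilt).readline)
]
print(rebuilt)
print(again == pairs)
```

The comparison deliberately ignores column positions, since the guarantee
covers token types and strings rather than the original spacing.
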
:func:`tokenize` needs to detect the encoding of source files it tokenizes. The
function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file. It requires one argument,
   *readline*, in the same way as the :func:`tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`. If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.

   If no encoding is specified, then the default of ``'utf-8'`` will be returned.


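A minimal sketch of calling :func:`detect_encoding` directly is shown below.
It reads an in-memory source that declares its encoding with a cookie, and a
second source with no declaration at all. Both samples are made up for the
example:

```python
import io
import tokenize

# Hypothetical source with a PEP 263 encoding cookie on the first line.
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)

# The returned name can be passed straight to bytes.decode().
text = source.decode(encoding)
print(encoding, lines_read)

# With neither a BOM nor a cookie, the default is reported.
plain = b"x = 1\n"
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(plain).readline)
print(default_encoding)
```

The lines in ``lines_read`` are still bytes, so a caller that goes on to read
the rest of the file needs to account for the lines already consumed.
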
Example of a script rewriter that transforms float literals into Decimal
objects::

    from io import BytesIO
    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        return untokenize(result).decode('utf-8')
