:mod:`tokenize` --- Tokenizer for Python source
===============================================

.. module:: tokenize
   :synopsis: Lexical scanner for Python source code.
.. moduleauthor:: Ka Ping Yee
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


The :mod:`tokenize` module provides a lexical scanner for Python source code,
implemented in Python.  The scanner in this module returns comments as tokens
as well, making it useful for implementing "pretty-printers," including
colorizers for on-screen displays.

The primary entry point is a :term:`generator`:


.. function:: tokenize(readline)

   The :func:`tokenize` generator requires one argument, *readline*, which
   must be a callable object that provides the same interface as the
   :meth:`readline` method of built-in file objects (see section
   :ref:`bltin-file-objects`).  Each call to the function should return one
   line of input as bytes.

   The generator produces 5-tuples with these members: the token type; the
   token string; a 2-tuple ``(srow, scol)`` of ints specifying the row and
   column where the token begins in the source; a 2-tuple ``(erow, ecol)`` of
   ints specifying the row and column where the token ends in the source; and
   the line on which the token was found.  The line passed is the logical
   line; continuation lines are included.

   :func:`tokenize` determines the source encoding of the file by looking for
   a UTF-8 BOM or encoding cookie, according to :pep:`263`.

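As a minimal sketch of the interface described above, the following tokenizes
a one-line source held in memory.  The sample source and the use of
``BytesIO`` to supply a *readline* callable are illustrative assumptions, not
part of the module's documentation.

```python
from io import BytesIO
from tokenize import tokenize, tok_name

# Illustrative input; tokenize() needs bytes, not str.
source = b"x = 1  # set x\n"

# BytesIO(...).readline returns one line of bytes per call.
tokens = list(tokenize(BytesIO(source).readline))

# Each item unpacks into the five members described above.
for tok_type, tok_string, start, end, line in tokens:
    print(tok_name[tok_type], repr(tok_string), start, end)
```

The first item printed is the ENCODING token, followed by the tokens for
``x``, ``=``, ``1`` and the comment, each with its start and end positions.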

All constants from the :mod:`token` module are also exported from
:mod:`tokenize`, as are three additional token type values:

.. data:: COMMENT

   Token value used to indicate a comment.


.. data:: NL

   Token value used to indicate a non-terminating newline.  The NEWLINE token
   indicates the end of a logical line of Python code; NL tokens are generated
   when a logical line of code is continued over multiple physical lines.


.. data:: ENCODING

   Token value that indicates the encoding used to decode the source bytes
   into text.  The first token returned by :func:`tokenize` will always be an
   ENCODING token.

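The three extra token types can be seen in a short sketch like the one below.
The sample source is an assumption chosen so that a comment and a
parenthesized continuation line both occur.

```python
from io import BytesIO
from tokenize import tokenize, COMMENT, NL, NEWLINE, ENCODING

# Illustrative input: a comment line, then one logical line that is
# continued over two physical lines inside parentheses.
source = (
    b"# leading comment\n"
    b"total = (1 +\n"
    b"         2)\n"
)

# Keep only the token type of each token.
types = [tok[0] for tok in tokenize(BytesIO(source).readline)]

print(types[0] == ENCODING)   # the stream starts with ENCODING
print(COMMENT in types)       # the comment is reported as a token
print(types.count(NEWLINE))   # logical lines that were terminated
print(NL in types)            # non-terminating newlines are present
```

Only one NEWLINE is produced, because the source contains a single logical
line of code; the line break inside the parentheses is reported as NL.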

Another function is provided to reverse the tokenization process.  This is
useful for creating tools that tokenize a script, modify the token stream, and
write back the modified script.


.. function:: untokenize(iterable)

   Converts tokens back into Python source code.  The *iterable* must return
   sequences with at least two elements, the token type and the token string.
   Any additional sequence elements are ignored.

   The reconstructed script is returned as a single object.  The result is
   guaranteed to tokenize back to match the input, so the conversion is
   lossless and round-trips are assured.  The guarantee applies only to the
   token type and token string, as the spacing between tokens (column
   positions) may change.

   The return value is bytes, encoded using the ENCODING token, which is the
   first token sequence output by :func:`tokenize`.  If the input contains no
   ENCODING token, a string is returned instead.

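The round-trip guarantee can be checked with a small sketch such as the
following.  The sample source and the variable names are illustrative
assumptions; only the token type and token string are compared, since spacing
may differ.

```python
from io import BytesIO
from tokenize import tokenize, untokenize

# Illustrative input.
source = b"x = 1 + 2\n"

# Reduce every token to a (type, string) pair, the minimum untokenize() needs.
pairs = [(tok[0], tok[1]) for tok in tokenize(BytesIO(source).readline)]

# The stream starts with an ENCODING token, so the result is bytes.
rebuilt = untokenize(pairs)

# Tokenize the rebuilt source again and compare the pairs.
again = [(tok[0], tok[1]) for tok in tokenize(BytesIO(rebuilt).readline)]
print(isinstance(rebuilt, bytes), again == pairs)
```

The rebuilt bytes need not equal ``source`` character for character; the
comparison is made on the token pairs for that reason.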

:func:`tokenize` needs to detect the encoding of source files it tokenizes.  The
function it uses to do this is available:

.. function:: detect_encoding(readline)

   The :func:`detect_encoding` function is used to detect the encoding that
   should be used to decode a Python source file.  It requires one argument,
   *readline*, in the same way as the :func:`tokenize` generator.

   It will call *readline* a maximum of twice, and return the encoding used
   (as a string) and a list of any lines (not decoded from bytes) it has read
   in.

   It detects the encoding from the presence of a UTF-8 BOM or an encoding
   cookie as specified in :pep:`263`.  If both a BOM and a cookie are present,
   but disagree, a :exc:`SyntaxError` will be raised.

   If no encoding is specified, then the default of ``'utf-8'`` will be
   returned.

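A short sketch of both cases, one source with an encoding cookie and one
without, might look like this.  The sample sources and variable names are
assumptions made for illustration.

```python
from io import BytesIO
from tokenize import detect_encoding

# Illustrative source with a PEP 263 cookie; \xe9 is "é" in Latin-1.
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"
encoding, lines_read = detect_encoding(BytesIO(source).readline)

# The lines already consumed are returned undecoded, so nothing is lost.
text = source.decode(encoding)
print(encoding, lines_read)

# Without a BOM or cookie the default is reported.
default_encoding, default_lines = detect_encoding(BytesIO(b"x = 1\n").readline)
print(default_encoding, default_lines)
```

Recent CPython versions report a normalized name for the cookie, so
``latin-1`` may come back as ``iso-8859-1``; either name decodes the source
the same way.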

Example of a script re-writer that transforms float literals into Decimal
objects::

    from io import BytesIO
    from tokenize import tokenize, untokenize, NUMBER, STRING, NAME, OP

    def decistmt(s):
        """Substitute Decimals for floats in a string of statements.

        >>> from decimal import Decimal
        >>> s = 'print(+21.3e-5*-.1234/81.7)'
        >>> decistmt(s)
        "print (+Decimal ('21.3e-5')*-Decimal ('.1234')/Decimal ('81.7'))"

        The format of the exponent is inherited from the platform C library.
        Known cases are "e-007" (Windows) and "e-07" (not Windows).  Since
        we're only showing 12 digits, and the 13th isn't close to 5, the
        rest of the output should be platform-independent.

        >>> exec(s) #doctest: +ELLIPSIS
        -3.21716034272e-0...7

        Output from calculations with Decimal should be identical across all
        platforms.

        >>> exec(decistmt(s))
        -3.217160342717258261933904529E-7
        """
        result = []
        g = tokenize(BytesIO(s.encode('utf-8')).readline)  # tokenize the string
        for toknum, tokval, _, _, _ in g:
            if toknum == NUMBER and '.' in tokval:  # replace NUMBER tokens
                result.extend([
                    (NAME, 'Decimal'),
                    (OP, '('),
                    (STRING, repr(tokval)),
                    (OP, ')')
                ])
            else:
                result.append((toknum, tokval))
        # untokenize() returns bytes here because the stream starts with ENCODING
        return untokenize(result).decode('utf-8')
