"""Python parser generator


This parser generator transforms a Python grammar file into parsing tables
that can be consumed by Python's LL(1) parser written in C.

Concepts
--------

* An LL(1) parser (Left-to-right, Leftmost derivation, 1 token-lookahead) is a
  top-down parser for a subset of context-free languages. It parses the input
  from Left to right, performing Leftmost derivation of the sentence, and can
  only use 1 token of lookahead when parsing a sentence.

* A parsing table is a collection of data that a generic implementation of the
  LL(1) parser consumes to know how to parse a given context-free grammar. In
  this case the collection of data involves Deterministic Finite Automata,
  calculated first sets, keywords and transition labels.

* A grammar is defined by production rules (or just 'productions') that specify
  which symbols may replace which other symbols; these rules may be used to
  generate strings, or to parse them. Each such rule has a head, or left-hand
  side, which consists of the string that may be replaced, and a body, or
  right-hand side, which consists of a string that may replace it. In the
  Python grammar, rules are written in the form

      rule_name: rule_description;

  meaning the rule 'a: b' specifies that a can be replaced by b. A context-free
  grammar is a grammar in which the left-hand side of each production rule
  consists of only a single nonterminal symbol. Context-free grammars can
  always be recognized by a non-deterministic pushdown automaton.

* Terminal symbols are literal symbols which may appear in the outputs of the
  production rules of the grammar and which cannot be changed using the rules
  of the grammar. Applying the rules recursively to a source string of symbols
  will usually terminate in a final output string consisting only of terminal
  symbols.

* Nonterminal symbols are those symbols which can be replaced. The grammar
  includes a start symbol, a designated member of the set of nonterminals from
  which all the strings in the language may be derived by successive
  applications of the production rules.

* The language defined by the grammar is defined as the set of terminal strings
  that can be derived using the production rules.

* The first sets of a rule (FIRST(rule)) are defined to be the set of terminals
  that can appear in the first position of any string derived from the rule.
  This is useful for LL(1) parsers as the parser is only allowed to look at the
  next token in the input to decide which rule it needs to parse. For example,
  given this grammar:

      start: '(' A | B ')'
      A: 'a' '<'
      B: 'b' '<'

  and the input '(b<)' the parser can only look at 'b' to know if it needs
  to parse A or B. Because FIRST(A) = {'a'} and FIRST(B) = {'b'} it knows
  that it needs to continue parsing rule B because only that rule can start
  with 'b'.
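
  As an illustrative sketch (plain Python, not part of this module), the
  lookahead decision for the toy grammar above amounts to:

      FIRST = {"A": {"a"}, "B": {"b"}}  # the first sets computed above

      def choose_rule(lookahead):
          # LL(1): a single token of lookahead selects the rule to parse.
          for rule, first_set in FIRST.items():
              if lookahead in first_set:
                  return rule
          raise SyntaxError("no rule starts with %r" % lookahead)

      assert choose_rule("b") == "B"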

Description
-----------

The input for the parser generator is a grammar in extended BNF form (using *
for repetition, + for at-least-once repetition, [] for optional parts, | for
alternatives and () for grouping).
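
For example, a made-up rule exercising each operator (an illustration only,
not a rule of the actual Python grammar):

    sample_rule: 'a' ('b' | 'c')* ['d'] 'e'+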

Each rule in the grammar file is considered as a regular expression in its
own right. It is turned into a Non-deterministic Finite Automaton (NFA),
which is then turned into a Deterministic Finite Automaton (DFA), which is
then optimized to reduce the number of states. See [Aho&Ullman 77] chapter 3,
or similar compiler books (this technique is more often used for lexical
analyzers).
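
As a sketch, the pipeline mirrors ParserGenerator.create_dfas below, using
this package's GrammarParser and DFA helpers (the grammar path here is an
assumption for illustration):

    grammar_text = open("Grammar/Grammar").read()
    for nfa in GrammarParser(grammar_text).parse():
        dfa = DFA.from_nfa(nfa)  # subset construction: NFA -> DFA
        dfa.simplify()           # merge equivalent states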

The DFAs are used by the parser as parsing tables in a special way that's
probably unique. Before they are usable, the FIRST sets of all non-terminals
are computed so the LL(1) parser consuming the parsing tables can distinguish
between different transitions.

Reference
---------

[Aho&Ullman 77]
    Aho&Ullman, Principles of Compiler Design, Addison-Wesley 1977
    (first edition)
"""

from ast import literal_eval
import collections
import enum

from . import grammar, token
from .automata import DFA
from .metaparser import GrammarParser


class LabelType(enum.Enum):
    NONTERMINAL = 0
    NAMED_TOKEN = 1
    KEYWORD = 2
    OPERATOR = 3
    NONE = 4


class Label(str):
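    """A grammar label: the name attached to an arc in the parsing-table DFAs.

    Illustrative categorization (example labels are assumptions, not an
    exhaustive list):
        Label("expr_stmt").type -> LabelType.NONTERMINAL
        Label("NAME").type      -> LabelType.NAMED_TOKEN
        Label("'if'").type      -> LabelType.KEYWORD
        Label("'+'").type       -> LabelType.OPERATOR
    """
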
    def __init__(self, value):
        self.type = self._get_type()

    def _get_type(self):
        if self[0].isalpha():
            if self.upper() == self:
                # NAMED tokens (ASYNC, NAME...) are all uppercase by convention
                return LabelType.NAMED_TOKEN
            else:
                # If it is not uppercase, it must be a nonterminal.
                return LabelType.NONTERMINAL
        else:
            # Keywords and operators are wrapped in quotes
            assert self[0] == self[-1] in ('"', "'"), self
            value = literal_eval(self)
            if value[0].isalpha():
                return LabelType.KEYWORD
            else:
                return LabelType.OPERATOR

    def __repr__(self):
        return "{}({})".format(self.type, super().__repr__())


class ParserGenerator(object):
    def __init__(self, grammar_file, token_file, verbose=False):
        with open(grammar_file) as f:
            self.grammar = f.read()
        with open(token_file) as tok_file:
            token_lines = tok_file.readlines()
        self.tokens = dict(token.generate_tokens(token_lines))
        self.opmap = dict(token.generate_opmap(token_lines))
        # Manually add <> so it does not collide with !=
        self.opmap["<>"] = "NOTEQUAL"
        self.verbose = verbose
        self.filename = grammar_file
        self.dfas, self.startsymbol = self.create_dfas()
        self.first = {}  # map from symbol name to set of tokens
        self.calculate_first_sets()

    def create_dfas(self):
        rule_to_dfas = collections.OrderedDict()
        start_nonterminal = None
        for nfa in GrammarParser(self.grammar).parse():
            if self.verbose:
                print("Dump of NFA for", nfa.name)
                nfa.dump()
            dfa = DFA.from_nfa(nfa)
            if self.verbose:
                print("Dump of DFA for", dfa.name)
                dfa.dump()
            dfa.simplify()
            rule_to_dfas[dfa.name] = dfa

            if start_nonterminal is None:
                start_nonterminal = dfa.name

        return rule_to_dfas, start_nonterminal

    def make_grammar(self):
        c = grammar.Grammar()
        c.all_labels = set()
        names = list(self.dfas.keys())
        names.remove(self.startsymbol)
        names.insert(0, self.startsymbol)
        for name in names:
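            # Nonterminal numbers start at 256 so they never collide with
            # terminal token numbers (token.NT_OFFSET == 256 in CPython).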
            i = 256 + len(c.symbol2number)
            c.symbol2number[Label(name)] = i
            c.number2symbol[i] = Label(name)
            c.all_labels.add(name)
        for name in names:
            self.make_label(c, name)
            dfa = self.dfas[name]
            states = []
            for state in dfa:
                arcs = []
                for label, next in sorted(state.arcs.items()):
                    c.all_labels.add(label)
                    arcs.append((self.make_label(c, label), dfa.states.index(next)))
                if state.is_final:
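                    # An accepting state is marked by an arc labeled 0 that
                    # points back to the state itself.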
                    arcs.append((0, dfa.states.index(state)))
                states.append(arcs)
            c.states.append(states)
            c.dfas[c.symbol2number[name]] = (states, self.make_first_sets(c, name))
        c.start = c.symbol2number[self.startsymbol]

        if self.verbose:
            print("")
            print("Grammar summary")
            print("===============")

            print("- {n_labels} labels".format(n_labels=len(c.labels)))
            print("- {n_dfas} dfas".format(n_dfas=len(c.dfas)))
            print("- {n_tokens} tokens".format(n_tokens=len(c.tokens)))
            print("- {n_keywords} keywords".format(n_keywords=len(c.keywords)))
            print(
                "- Start symbol: {start_symbol}".format(
                    start_symbol=c.number2symbol[c.start]
                )
            )
        return c

    def make_first_sets(self, c, name):
        rawfirst = self.first[name]
        first = set()
        for label in sorted(rawfirst):
            ilabel = self.make_label(c, label)
            ##assert ilabel not in first # XXX failed on <> ... !=
            first.add(ilabel)
        return first

    def make_label(self, c, label):
        label = Label(label)
        ilabel = len(c.labels)

        if label.type == LabelType.NONTERMINAL:
            if label in c.symbol2label:
                return c.symbol2label[label]
            else:
                c.labels.append((c.symbol2number[label], None))
                c.symbol2label[label] = ilabel
                return ilabel
        elif label.type == LabelType.NAMED_TOKEN:
            # A named token (NAME, NUMBER, STRING)
            itoken = self.tokens.get(label, None)
            assert isinstance(itoken, int), label
            assert itoken in self.tokens.values(), label
            if itoken in c.tokens:
                return c.tokens[itoken]
            else:
                c.labels.append((itoken, None))
                c.tokens[itoken] = ilabel
                return ilabel
        elif label.type == LabelType.KEYWORD:
            # A keyword
            value = literal_eval(label)
            if value in c.keywords:
                return c.keywords[value]
            else:
                c.labels.append((self.tokens["NAME"], value))
                c.keywords[value] = ilabel
                return ilabel
        elif label.type == LabelType.OPERATOR:
            # An operator (any non-numeric token)
            value = literal_eval(label)
            tok_name = self.opmap[value]  # Fails if unknown token
            itoken = self.tokens[tok_name]
            if itoken in c.tokens:
                return c.tokens[itoken]
            else:
                c.labels.append((itoken, None))
                c.tokens[itoken] = ilabel
                return ilabel
        else:
            raise ValueError("Cannot categorize label {}".format(label))

    def calculate_first_sets(self):
        names = list(self.dfas.keys())
        for name in names:
            if name not in self.first:
                self.calculate_first_sets_for_rule(name)

            if self.verbose:
                print("First set for {dfa_name}".format(dfa_name=name))
                for item in self.first[name]:
                    print("    - {terminal}".format(terminal=item))

    def calculate_first_sets_for_rule(self, name):
        dfa = self.dfas[name]
        self.first[name] = None  # dummy to detect left recursion
        state = dfa.states[0]
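        # FIRST(rule) depends only on the arcs leaving the initial DFA state:
        # terminal labels contribute themselves, nonterminal labels contribute
        # their own (recursively computed) first sets.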
        totalset = set()
        overlapcheck = {}
        for label, next in state.arcs.items():
            if label in self.dfas:
                if label in self.first:
                    fset = self.first[label]
                    if fset is None:
                        raise ValueError("recursion for rule %r" % name)
                else:
                    self.calculate_first_sets_for_rule(label)
                    fset = self.first[label]
                totalset.update(fset)
                overlapcheck[label] = fset
            else:
                totalset.add(label)
                overlapcheck[label] = {label}
        inverse = {}
        for label, itsfirst in overlapcheck.items():
            for symbol in itsfirst:
                if symbol in inverse:
                    raise ValueError(
                        "rule %s is ambiguous; %s is in the"
                        " first sets of %s as well as %s"
                        % (name, symbol, label, inverse[symbol])
                    )
                inverse[symbol] = label
        self.first[name] = totalset
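

# Minimal usage sketch (illustrative only; the file paths are assumptions and
# the real entry point is this package's command-line driver):
#
#   p = ParserGenerator("Grammar/Grammar", "Grammar/Tokens", verbose=False)
#   g = p.make_grammar()  # a grammar.Grammar with states, dfas, labels, ...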