"""Python parser generator


This parser generator transforms a Python grammar file into parsing tables
that can be consumed by Python's LL(1) parser written in C.

Concepts
--------

* An LL(1) parser (Left-to-right, Leftmost derivation, 1 token-lookahead) is a
  top-down parser for a subset of context-free languages. It parses the input
  from Left to right, performing Leftmost derivation of the sentence, and can
  only use 1 token of lookahead when parsing a sentence.

* A parsing table is a collection of data that a generic implementation of the
  LL(1) parser consumes to know how to parse a given context-free grammar. In
  this case the collection of data involves Deterministic Finite Automata,
  calculated first sets, keywords and transition labels.

* A grammar is defined by production rules (or just 'productions') that specify
  which symbols may replace which other symbols; these rules may be used to
  generate strings, or to parse them. Each such rule has a head, or left-hand
  side, which consists of the string that may be replaced, and a body, or
  right-hand side, which consists of a string that may replace it. In the
  Python grammar, rules are written in the form

      rule_name: rule_description;

  meaning the rule 'a: b' specifies that a can be replaced by b. A context-free
  grammar is a grammar in which the left-hand side of each production rule
  consists of only a single nonterminal symbol. Context-free grammars can
  always be recognized by a non-deterministic pushdown automaton.
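
  For example, two rules taken from CPython's Grammar file (reproduced here
  only for illustration):

      pass_stmt: 'pass'
      while_stmt: 'while' test ':' suite ['else' ':' suite]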

* Terminal symbols are literal symbols which may appear in the outputs of the
  production rules of the grammar and which cannot be changed using the rules
  of the grammar. Applying the rules recursively to a source string of symbols
  will usually terminate in a final output string consisting only of terminal
  symbols.

* Nonterminal symbols are those symbols which can be replaced. The grammar
  includes a start symbol, a designated member of the set of nonterminals from
  which all the strings in the language may be derived by successive
  applications of the production rules.

* The language defined by the grammar is the set of terminal strings that can
  be derived using the production rules.

* The first sets of a rule (FIRST(rule)) are defined to be the set of terminals
  that can appear in the first position of any string derived from the rule.
  This is useful for LL(1) parsers as the parser is only allowed to look at the
  next token in the input to know which rule it needs to parse. For example,
  given this grammar:

      start: '(' A | B ')'
      A: 'a' '<'
      B: 'b' '<'

  and the input '(b<)' the parser can only look at 'b' to know if it needs
  to parse A or B. Because FIRST(A) = {'a'} and FIRST(B) = {'b'} it knows
  that it needs to continue parsing rule B because only that rule can start
  with 'b'.

Description
-----------

The input for the parser generator is a grammar in extended BNF form (using *
for repetition, + for at-least-once repetition, [] for optional parts, | for
alternatives and () for grouping).
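
For example, the following hypothetical rule (not part of any real grammar)
uses all of these operators:

    example_rule: 'a'+ ('b' | 'c')* ['d'] 'e'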

Each rule in the grammar file is considered as a regular expression in its
own right. It is turned into a Non-deterministic Finite Automaton (NFA),
which is then turned into a Deterministic Finite Automaton (DFA), which is
then optimized to reduce the number of states. See [Aho&Ullman 77] chapter 3,
or similar compiler books (this technique is more often used for lexical
analyzers).
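
As an illustration, in a rule like a: 'x' 'y' | 'x' 'z' both alternatives
begin with 'x', so the NFA has two 'x' transitions out of its initial state;
the subset construction merges them, and the resulting DFA has a single 'x'
arc leading to a state with separate 'y' and 'z' arcs.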

The DFAs are used by the parser as parsing tables in a special way that's
probably unique. Before they are usable, the FIRST sets of all non-terminals
are computed so the LL(1) parser consuming the parsing tables can distinguish
between different transitions.

Reference
---------

[Aho&Ullman 77]
    Aho&Ullman, Principles of Compiler Design, Addison-Wesley 1977
    (first edition)
"""

import collections
import enum
from ast import literal_eval

from . import grammar, token
from .automata import DFA
from .metaparser import GrammarParser


class LabelType(enum.Enum):
    NONTERMINAL = 0
    NAMED_TOKEN = 1
    KEYWORD = 2
    OPERATOR = 3
    NONE = 4


class Label(str):
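    """A grammar label: a nonterminal name, named token, keyword or operator.

    For illustration: Label("NAME").type is LabelType.NAMED_TOKEN,
    Label("expr_stmt").type is LabelType.NONTERMINAL, Label("'if'").type is
    LabelType.KEYWORD and Label("'+'").type is LabelType.OPERATOR.
    """
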
    def __init__(self, value):
        self.type = self._get_type()

    def _get_type(self):
        if self[0].isalpha():
            if self.upper() == self:
                # NAMED tokens (ASYNC, NAME...) are all uppercase by convention
                return LabelType.NAMED_TOKEN
            else:
                # If it is not uppercase it must be a nonterminal.
                return LabelType.NONTERMINAL
        else:
            # Keywords and operators are wrapped in quotes
            assert self[0] == self[-1] in ('"', "'"), self
            value = literal_eval(self)
            if value[0].isalpha():
                return LabelType.KEYWORD
            else:
                return LabelType.OPERATOR

    def __repr__(self):
        return "{}({})".format(self.type, super().__repr__())


class ParserGenerator(object):
    def __init__(self, grammar_file, token_file, verbose=False, graph_file=None):
        with open(grammar_file) as f:
            self.grammar = f.read()
        with open(token_file) as tok_file:
            token_lines = tok_file.readlines()
        self.tokens = dict(token.generate_tokens(token_lines))
        self.opmap = dict(token.generate_opmap(token_lines))
        # Manually add <> so it does not collide with !=
        self.opmap["<>"] = "NOTEQUAL"
        self.verbose = verbose
        self.filename = grammar_file
        self.graph_file = graph_file
        self.dfas, self.startsymbol = self.create_dfas()
        self.first = {}  # map from symbol name to set of tokens
        self.calculate_first_sets()

    def create_dfas(self):
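        """Build a DFA from the NFA of each rule and return them, together
        with the start nonterminal (the first rule in the grammar file)."""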
        rule_to_dfas = collections.OrderedDict()
        start_nonterminal = None
        for nfa in GrammarParser(self.grammar).parse():
            if self.verbose:
                print("Dump of NFA for", nfa.name)
                nfa.dump()
            if self.graph_file is not None:
                nfa.dump_graph(self.graph_file.write)
            dfa = DFA.from_nfa(nfa)
            if self.verbose:
                print("Dump of DFA for", dfa.name)
                dfa.dump()
            dfa.simplify()
            if self.graph_file is not None:
                dfa.dump_graph(self.graph_file.write)
            rule_to_dfas[dfa.name] = dfa

            if start_nonterminal is None:
                start_nonterminal = dfa.name

        return rule_to_dfas, start_nonterminal

    def make_grammar(self):
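        """Assemble the grammar.Grammar object (the parsing tables) from the
        DFAs: symbol numbers, labels, states and per-rule first sets."""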
        c = grammar.Grammar()
        c.all_labels = set()
        names = list(self.dfas.keys())
        names.remove(self.startsymbol)
        names.insert(0, self.startsymbol)
        for name in names:
            i = 256 + len(c.symbol2number)
            c.symbol2number[Label(name)] = i
            c.number2symbol[i] = Label(name)
            c.all_labels.add(name)
        for name in names:
            self.make_label(c, name)
            dfa = self.dfas[name]
            states = []
            for state in dfa:
                arcs = []
                for label, next in sorted(state.arcs.items()):
                    c.all_labels.add(label)
                    arcs.append((self.make_label(c, label), dfa.states.index(next)))
                if state.is_final:
                    arcs.append((0, dfa.states.index(state)))
                states.append(arcs)
            c.states.append(states)
            c.dfas[c.symbol2number[name]] = (states, self.make_first_sets(c, name))
        c.start = c.symbol2number[self.startsymbol]

        if self.verbose:
            print("")
            print("Grammar summary")
            print("===============")

            print("- {n_labels} labels".format(n_labels=len(c.labels)))
            print("- {n_dfas} dfas".format(n_dfas=len(c.dfas)))
            print("- {n_tokens} tokens".format(n_tokens=len(c.tokens)))
            print("- {n_keywords} keywords".format(n_keywords=len(c.keywords)))
            print(
                "- Start symbol: {start_symbol}".format(
                    start_symbol=c.number2symbol[c.start]
                )
            )
        return c

    def make_first_sets(self, c, name):
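        """Map the computed FIRST set of *name* to grammar label indexes."""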
        rawfirst = self.first[name]
        first = set()
        for label in sorted(rawfirst):
            ilabel = self.make_label(c, label)
            ##assert ilabel not in first # XXX failed on <> ... !=
            first.add(ilabel)
        return first

    def make_label(self, c, label):
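        """Return the index of *label* in c.labels, registering it first (as
        a nonterminal, named token, keyword or operator) if it is new."""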
        label = Label(label)
        ilabel = len(c.labels)

        if label.type == LabelType.NONTERMINAL:
            if label in c.symbol2label:
                return c.symbol2label[label]
            else:
                c.labels.append((c.symbol2number[label], None))
                c.symbol2label[label] = ilabel
                return ilabel
        elif label.type == LabelType.NAMED_TOKEN:
            # A named token (NAME, NUMBER, STRING)
            itoken = self.tokens.get(label, None)
            assert isinstance(itoken, int), label
            assert itoken in self.tokens.values(), label
            if itoken in c.tokens:
                return c.tokens[itoken]
            else:
                c.labels.append((itoken, None))
                c.tokens[itoken] = ilabel
                return ilabel
        elif label.type == LabelType.KEYWORD:
            # A keyword
            value = literal_eval(label)
            if value in c.keywords:
                return c.keywords[value]
            else:
                c.labels.append((self.tokens["NAME"], value))
                c.keywords[value] = ilabel
                return ilabel
        elif label.type == LabelType.OPERATOR:
            # An operator (any non-numeric token)
            value = literal_eval(label)
            tok_name = self.opmap[value]  # Fails if unknown token
            itoken = self.tokens[tok_name]
            if itoken in c.tokens:
                return c.tokens[itoken]
            else:
                c.labels.append((itoken, None))
                c.tokens[itoken] = ilabel
                return ilabel
        else:
            raise ValueError("Cannot categorize label {}".format(label))

    def calculate_first_sets(self):
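        """Compute the FIRST set of every rule in the grammar."""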
        names = list(self.dfas.keys())
        for name in names:
            if name not in self.first:
                self.calculate_first_sets_for_rule(name)

            if self.verbose:
                print("First set for {dfa_name}".format(dfa_name=name))
                for item in self.first[name]:
                    print(" - {terminal}".format(terminal=item))

    def calculate_first_sets_for_rule(self, name):
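        """Compute FIRST(name) from the arcs of the rule's initial DFA state,
        recursing into nonterminals, and reject left-recursive or ambiguous
        rules."""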
        dfa = self.dfas[name]
        self.first[name] = None  # dummy to detect left recursion
        state = dfa.states[0]
        totalset = set()
        overlapcheck = {}
        for label, next in state.arcs.items():
            if label in self.dfas:
                if label in self.first:
                    fset = self.first[label]
                    if fset is None:
                        raise ValueError("recursion for rule %r" % name)
                else:
                    self.calculate_first_sets_for_rule(label)
                    fset = self.first[label]
                totalset.update(fset)
                overlapcheck[label] = fset
            else:
                totalset.add(label)
                overlapcheck[label] = {label}
        inverse = {}
        for label, itsfirst in overlapcheck.items():
            for symbol in itsfirst:
                if symbol in inverse:
                    raise ValueError(
                        "rule %s is ambiguous; %s is in the"
                        " first sets of %s as well as %s"
                        % (name, symbol, label, inverse[symbol])
                    )
                inverse[symbol] = label
        self.first[name] = totalset