| //===----------------------------------------------------------------------===// |
| // C Language Family Front-end |
| //===----------------------------------------------------------------------===// |
| Chris Lattner |
| |
| I. Introduction: |
| |
| clang: noun |
| 1. A loud, resonant, metallic sound. |
| 2. The strident call of a crane or goose. |
| 3. C-language family front-end toolkit. |
| |
| The world needs better compiler tools, tools which are built as libraries. This |
| design point allows reuse of the tools in new and novel ways. However, building |
| the tools as libraries isn't enough: they must have clean APIs, be as |
| decoupled from each other as possible, and be easy to modify/extend. This |
| requires clean layering, decent design, and avoiding tying the libraries to a |
| specific use. Oh yeah, did I mention that we want the resultant libraries to |
| be as fast as possible? :) |
| |
| This front-end is built as a component of the LLVM toolkit that can be used |
| with the LLVM backend or independently of it. In this spirit, the API has been |
| carefully designed as the following components: |
| |
| libsupport - Basic support library, reused from LLVM. |
| libsystem - System abstraction library, reused from LLVM. |
| |
| libbasic - Diagnostics, SourceLocations, SourceBuffer abstraction, |
| file system caching for input source files. This depends on |
| libsupport and libsystem. |
| libast - Provides classes to represent the C AST, the C type system, |
| builtin functions, and various helpers for analyzing and |
| manipulating the AST (visitors, pretty printers, etc). This |
| library depends on libbasic. |
| |
| liblex - C/C++/ObjC lexing and preprocessing, identifier hash table, |
| pragma handling, tokens, and macros. This depends on libbasic. |
| libparse - C (for now) parsing and local semantic analysis. This library |
| invokes coarse-grained 'Actions' provided by the client to do |
| stuff (e.g. libsema builds ASTs). This depends on liblex. |
| libsema - Provides a set of parser actions to build a standardized AST |
| for programs. ASTs are 'streamed' out a top-level declaration |
| at a time, allowing clients to use decl-at-a-time processing, |
| build up entire translation units, or even build 'whole |
| program' ASTs depending on how they use the APIs. This depends |
| on libast and libparse. |
| |
| libcodegen - Lower the AST to LLVM IR for optimization & codegen. Depends |
| on libast. |
| clang - An example driver, client of the libraries at various levels. |
| This depends on all these libraries, and on LLVM VMCore. |
| |
| This front-end has been intentionally built as a DAG of libraries, making it |
| easy to reuse individual parts or replace pieces if desired. For example, to |
| build a preprocessor, you take the Basic and Lexer libraries. If you want an |
| indexer, you take those plus the Parser library and provide some actions for |
| indexing. If you want a refactoring, static analysis, or source-to-source |
| compiler tool, it makes sense to take those plus the AST building and semantic |
| analyzer library. Finally, if you want to use this with the LLVM backend, |
| you'd take these components plus the AST to LLVM lowering code. |
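
The "take only the pieces you need" idea can be made concrete with a small
sketch. It transcribes the dependency edges from the component list above into
a table and computes the transitive closure for a chosen set of top-level
libraries. The names directDeps, collect and closure are made up for this
illustration and are not part of the front-end; the edges are only as accurate
as the list above.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Direct dependencies, transcribed from the component list in this README.
// The keys are the README's library names, not real link targets.
using DepGraph = std::map<std::string, std::vector<std::string>>;

inline const DepGraph &directDeps() {
  static const DepGraph deps = {
      {"libsupport", {}},
      {"libsystem", {}},
      {"libbasic", {"libsupport", "libsystem"}},
      {"libast", {"libbasic"}},
      {"liblex", {"libbasic"}},
      {"libparse", {"liblex"}},
      {"libsema", {"libast", "libparse"}},
      {"libcodegen", {"libast"}},
  };
  return deps;
}

// Adds `lib` and everything it transitively depends on to `out`.
inline void collect(const std::string &lib, std::set<std::string> &out) {
  if (!out.insert(lib).second)
    return; // already visited
  auto it = directDeps().find(lib);
  if (it == directDeps().end())
    return; // unknown name: kept in the set, but no recorded dependencies
  for (const std::string &dep : it->second)
    collect(dep, out);
}

// Returns the full set of libraries needed for the given top-level pieces.
inline std::set<std::string> closure(const std::vector<std::string> &roots) {
  std::set<std::string> out;
  for (const std::string &root : roots)
    collect(root, out);
  return out;
}
```

Under these assumptions, closure({"liblex"}) yields the preprocessor set
(liblex, libbasic, libsupport, libsystem), and closure({"libparse"}) does not
pull in libast, which is why an indexer can supply its own actions without the
AST library.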
| |
| In the future I hope this toolkit will grow to include new and interesting |
| components, including a C++ front-end, ObjC support, and a whole lot of other |
| things. |
| |
| Finally, it should be pointed out that the goal here is to build something that |
| is high-quality and industrial-strength: all the obnoxious features of the C |
| family must be correctly supported (trigraphs, preprocessor arcana, K&R-style |
| prototypes, GCC/MS extensions, etc). It cannot be used if it is not 'real'. |
| |
| |
| II. Usage of clang driver: |
| |
| * Basic Command-Line Options: |
| - Help: clang --help |
| - Standard GCC options accepted: -E, -I*, -i*, -pedantic, -std=c90, etc. |
| - To make diagnostics more gcc-like: -fno-caret-diagnostics -fno-show-column |
| - Enable metric printing: -stats |
| |
| * -fsyntax-only is currently the default mode. |
| |
| * -E mode works the same way as GCC. |
| |
| * -Eonly mode does all preprocessing, but does not print the output, useful for |
| timing the preprocessor. |
| |
| * -fsyntax-only is currently partially implemented, lacking some semantic |
| analysis (some errors and warnings are not produced). |
| |
| * -parse-noop parses code without building an AST. This is useful for timing |
| the cost of the parser without including AST building time. |
| |
| * -parse-ast builds ASTs, but doesn't print them. This is most useful for |
| timing AST building vs -parse-noop. |
| |
| * -parse-ast-print pretty prints most expression and statement nodes. |
| |
| * -parse-ast-check checks that diagnostic messages that are expected are |
| reported and that those which are reported are expected. |
| |
| |
| III. Current advantages over GCC: |
| |
| * Column numbers are fully tracked (no 256 col limit, no GCC-style pruning). |
| * All diagnostics have column numbers, include 'caret diagnostics', and they |
| highlight regions of interesting code (e.g. the LHS and RHS of a binop). |
| * Full diagnostic customization by client (can format diagnostics however they |
| like, e.g. in an IDE or refactoring tool) through DiagnosticClient interface. |
| * Built as a framework, can be reused by multiple tools. |
| * All languages supported linked into same library (no cc1, cc1obj, ...). |
| * mmaps source files read-only, does not dirty the pages like GCC (mem footprint). |
| * LLVM License, can be linked into non-GPL projects. |
| * Full diagnostic control, per diagnostic. Diagnostics are identified by ID. |
| * Significantly faster than GCC at semantic analysis, parsing, preprocessing |
| and lexing. |
| * Defers exposing platform-specific stuff to as late as possible, tracks use of |
| platform-specific features (e.g. #ifdef PPC) to allow 'portable bytecodes'. |
| * The lexer doesn't rely on the "lexer hack": it has no notion of scope and |
| does not categorize identifiers as types or variables -- this is up to the |
| parser to decide. |
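
To illustrate the last point: the token sequence "T * p" is a declaration when
T names a type and a multiplication when T names a variable, so the choice
depends on scope, which only the parser tracks. The snippet below is just a
demonstration of that ambiguity in ordinary C++; the function names are made
up, and it says nothing about how the parser here is implemented.

```cpp
typedef int T; // at file scope, T names a type

// Here "T * p" is a declaration: p is a pointer to int.
int declaresPointer() {
  T *p = nullptr;
  return p == nullptr ? 1 : 0;
}

// The parameter T hides the typedef inside this function, so the very same
// token sequence "T * p" is now an arithmetic expression.
int multiplies(int T, int p) {
  return T * p;
}
```

A lexer that had to label T as "type name" or "identifier" would need to know
which of these scopes it is in; leaving the token as a plain identifier pushes
that decision to the parser.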
| |
| Potential Future Features: |
| |
| * Fine grained diag control within the source (#pragma enable/disable warning). |
| * Better token tracking within macros? (Token came from this line, which is |
| a macro argument instantiated here, recursively instantiated here). |
| * Fast #import with a module system. |
| * Dependency tracking: change to header file doesn't recompile every function |
| that textually depends on it: recompile only those functions that need it. |
| This is also known as 'incremental parsing'. |
| |
| |
| IV. Missing Functionality / Improvements |
| |
| clang driver: |
| * Include search paths are hard-coded into the driver. Doh. |
| |
| File Manager: |
| * Reduce syscalls for reduced compile time, see NOTES.txt. |
| |
| Lexer: |
| * Source character mapping. GCC supports ASCII and UTF-8. |
| See GCC options: -ftarget-charset and -ftarget-wide-charset. |
| * Universal character support. Experimental in GCC, enabled with |
| -fextended-identifiers. |
| * -fpreprocessed mode. |
| |
| Preprocessor: |
| * Know about Apple header maps. |
| * #assert/#unassert |
| * #line / #file directives (currently accepted and ignored). |
| * MSExtension: "L#param" stringizes to a wide string literal. |
| * Charize extension: "#define F(o) #@o F(a)" -> 'a'. |
| * Consider merging the parser's expression parser into the preprocessor to |
| eliminate duplicate code. |
| * Add support for -M* |
| |
| Traditional Preprocessor: |
| * Currently, we have none. :) |
| |
| Parser: |
| * C90/K&R modes are only partially implemented. |
| * __extension__ is currently just skipped and ignored. |
| * "initializers", GCC inline asm. |
| |
| Semantic Analysis: |
| * Perhaps 75% done. |
| |
| LLVM Code Gen: |
| * Still very early. |
| |