Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 1 | //===----------------------------------------------------------------------===// |
| 2 | // C Language Family Front-end |
| 3 | //===----------------------------------------------------------------------===// |
| 4 | Chris Lattner |
| 5 | |
| 6 | I. Introduction: |
| 7 | |
| 8 | clang: noun |
| 9 | 1. A loud, resonant, metallic sound. |
| 10 | 2. The strident call of a crane or goose. |
| 11 | 3. C-language family front-end toolkit. |
| 12 | |
| 13 | The world needs better compiler tools, tools which are built as libraries. This |
| 14 | design point allows reuse of the tools in new and novel ways. However, building |
| 15 | the tools as libraries isn't enough: they must have clean APIs, be as |
| 16 | decoupled from each other as possible, and be easy to modify/extend. This |
| 17 | requires clean layering, decent design, and avoiding tying the libraries to a |
| 18 | specific use. Oh yeah, did I mention that we want the resultant libraries to |
| 19 | be as fast as possible? :) |
| 20 | |
| 21 | This front-end is built as a component of the LLVM toolkit that can be used |
| 22 | with the LLVM backend or independently of it. In this spirit, the API has been |
| 23 | carefully designed as the following components: |
| 24 | |
| 25 | libsupport - Basic support library, reused from LLVM. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 26 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 27 | libsystem - System abstraction library, reused from LLVM. |
| 28 | |
| 29 | libbasic - Diagnostics, SourceLocations, SourceBuffer abstraction, |
| 30 | file system caching for input source files. This depends on |
| 31 | libsupport and libsystem. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 32 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 33 | libast - Provides classes to represent the C AST, the C type system, |
| 34 | builtin functions, and various helpers for analyzing and |
| 35 | manipulating the AST (visitors, pretty printers, etc). This |
| 36 | library depends on libbasic. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 37 | |
| 38 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 39 | liblex - C/C++/ObjC lexing and preprocessing, identifier hash table, |
| 40 | pragma handling, tokens, and macros. This depends on libbasic. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 41 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 42 | libparse - C (for now) parsing and local semantic analysis. This library |
| 43 | invokes coarse-grained 'Actions' provided by the client to do |
| 44 | stuff (e.g. libsema builds ASTs). This depends on liblex. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 45 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 46 | libsema - Provides a set of parser actions to build a standardized AST |
| 47 | for programs. ASTs are 'streamed' out a top-level declaration |
| 48 | at a time, allowing clients to use decl-at-a-time processing, |
| 49 | build up entire translation units, or even build 'whole |
| 50 | program' ASTs depending on how they use the APIs. This depends |
| 51 | on libast and libparse. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 52 | |
| 53 | librewrite - Fast, scalable rewriting of source code. This operates on |
Ted Kremenek | 14b16b4 | 2008-05-09 17:53:57 +0000 | [diff] [blame] | 54 | the raw syntactic text of source code, allowing a client |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 55 | to insert and delete text in very large source files using |
| 56 | the same source location information embedded in ASTs. This |
| 57 | is intended to be a low-level API that is useful for |
| 58 | higher-level clients and libraries such as code refactoring. |
| 59 | |
| 60 | libanalysis - Source-level dataflow analysis useful for performing analyses |
| 61 | such as computing live variables. It also includes a |
| 62 | path-sensitive "graph-reachability" engine for writing |
| 63 | analyses that reason about different possible paths of |
| 64 | execution through source code. This is currently being |
Ted Kremenek | 5eaedc5 | 2008-05-09 17:13:18 +0000 | [diff] [blame] | 65 | employed to write a set of checks for finding bugs in software. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 66 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 67 | libcodegen - Lower the AST to LLVM IR for optimization & codegen. Depends |
| 68 | on libast. |
Ted Kremenek | f07410c | 2008-05-09 17:12:45 +0000 | [diff] [blame] | 69 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 70 | clang - An example driver, client of the libraries at various levels. |
| 71 | This depends on all these libraries, and on LLVM VMCore. |
| 72 | |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 73 | This front-end has been intentionally built as a DAG of libraries, making it |
| 74 | easy to reuse individual parts or replace pieces if desired. For example, to |
| 75 | build a preprocessor, you take the Basic and Lexer libraries. If you want an |
| 76 | indexer, you take those plus the Parser library and provide some actions for |
| 77 | indexing. If you want a refactoring, static analysis, or source-to-source |
| 78 | compiler tool, it makes sense to take those plus the AST building and semantic |
| 79 | analyzer library. Finally, if you want to use this with the LLVM backend, |
| 80 | you'd take these components plus the AST to LLVM lowering code. |
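The layering described above can be sketched as a small dependency graph. This is only an illustration, not code from the project: the edges are copied from the component list (librewrite and libanalysis are omitted because their dependencies are not stated there), and the names `DepGraph`, `clangLibraryDeps`, `collectClosure`, and `linkSet` are hypothetical helpers invented for this example.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Library name -> libraries it directly depends on.
using DepGraph = std::map<std::string, std::vector<std::string>>;

// Direct dependencies exactly as stated in the component list.
// librewrite and libanalysis are left out: the list does not say
// what they depend on.
inline DepGraph clangLibraryDeps() {
  return {
      {"libsupport", {}},
      {"libsystem", {}},
      {"libbasic", {"libsupport", "libsystem"}},
      {"libast", {"libbasic"}},
      {"liblex", {"libbasic"}},
      {"libparse", {"liblex"}},
      {"libsema", {"libast", "libparse"}},
      {"libcodegen", {"libast"}},
  };
}

// Add `lib` and everything it transitively depends on to `out`.
inline void collectClosure(const DepGraph &deps, const std::string &lib,
                           std::set<std::string> &out) {
  auto it = deps.find(lib);
  assert(it != deps.end() && "unknown library name");
  if (!out.insert(lib).second)
    return; // already visited
  for (const std::string &dep : it->second)
    collectClosure(deps, dep, out);
}

// Hypothetical helper: the full set of libraries a tool needs, given the
// top-level libraries it uses directly.
inline std::set<std::string> linkSet(const std::vector<std::string> &roots) {
  const DepGraph deps = clangLibraryDeps();
  std::set<std::string> out;
  for (const std::string &root : roots)
    collectClosure(deps, root, out);
  return out;
}
```

Under these assumptions, `linkSet({"liblex"})` yields liblex, libbasic, libsupport, and libsystem, which matches the preprocessor case above, while `linkSet({"libsema"})` pulls in the parser and AST libraries but never libcodegen.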
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 81 | |
| 82 | In the future I hope this toolkit will grow to include new and interesting |
| 83 | components, including a C++ front-end, ObjC support, and a whole lot of other |
| 84 | things. |
| 85 | |
| 86 | Finally, it should be pointed out that the goal here is to build something that |
| 87 | is high-quality and industrial-strength: all the obnoxious features of the C |
| 88 | family must be correctly supported (trigraphs, preprocessor arcana, K&R-style |
| 89 | prototypes, GCC/MS extensions, etc). It cannot be used if it is not 'real'. |
| 90 | |
| 91 | |
| 92 | II. Usage of clang driver: |
| 93 | |
| 94 | * Basic Command-Line Options: |
| 95 | - Help: clang --help |
| 96 | - Standard GCC options accepted: -E, -I*, -i*, -pedantic, -std=c90, etc. |
| 97 | - To make diagnostics more gcc-like: -fno-caret-diagnostics -fno-show-column |
| 98 | - Enable metric printing: -stats |
| 99 | |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 100 | * -fsyntax-only is currently the default mode. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 101 | |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 102 | * -E mode works the same way as in GCC. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 103 | |
Ted Kremenek | a01a1ee | 2007-08-29 23:26:37 +0000 | [diff] [blame] | 104 | * -Eonly mode does all preprocessing, but does not print the output, |
| 105 | useful for timing the preprocessor. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 106 | |
Ted Kremenek | a01a1ee | 2007-08-29 23:26:37 +0000 | [diff] [blame] | 107 | * -fsyntax-only is currently partially implemented, lacking some |
| 108 | semantic analysis (some errors and warnings are not produced). |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 109 | |
Ted Kremenek | a01a1ee | 2007-08-29 23:26:37 +0000 | [diff] [blame] | 110 | * -parse-noop parses code without building an AST. This is useful |
| 111 | for timing the cost of the parser without including AST building |
| 112 | time. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 113 | |
Ted Kremenek | a01a1ee | 2007-08-29 23:26:37 +0000 | [diff] [blame] | 114 | * -parse-ast builds ASTs, but doesn't print them. This is most |
| 115 | useful for timing AST building vs -parse-noop. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 116 | |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 117 | * -parse-ast-print pretty-prints most expression and statement nodes. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 118 | |
Ted Kremenek | a01a1ee | 2007-08-29 23:26:37 +0000 | [diff] [blame] | 119 | * -parse-ast-check checks that diagnostic messages that are expected |
| 120 | are reported and that those which are reported are expected. |
| 121 | |
| 122 | * -dump-cfg builds ASTs and then CFGs. CFGs are then pretty-printed. |
| 123 | |
| 124 | * -view-cfg builds ASTs and then CFGs. CFGs are then visualized by |
| 125 | invoking Graphviz. |
| 126 | |
| 127 | For more information on getting Graphviz to work with clang/LLVM, |
| 128 | see: http://llvm.org/docs/ProgrammersManual.html#ViewGraph |
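The two-way check that -parse-ast-check performs (every expected diagnostic is reported, and every reported diagnostic is expected) amounts to a pair of set differences. The sketch below shows only that idea. It does not reflect how the driver actually implements the check or how expectations are written in test files. The `CheckResult` type, the `checkDiagnostics` function, and the "line:message" string keys are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Outcome of comparing expected diagnostics against reported ones.
struct CheckResult {
  std::vector<std::string> missing;    // expected but never reported
  std::vector<std::string> unexpected; // reported but never expected
  bool passed() const { return missing.empty() && unexpected.empty(); }
};

// Each diagnostic is modelled as an opaque string key (an assumption of
// this sketch), so the check reduces to two set differences.
inline CheckResult checkDiagnostics(const std::set<std::string> &expected,
                                    const std::set<std::string> &reported) {
  CheckResult result;
  std::set_difference(expected.begin(), expected.end(),
                      reported.begin(), reported.end(),
                      std::back_inserter(result.missing));
  std::set_difference(reported.begin(), reported.end(),
                      expected.begin(), expected.end(),
                      std::back_inserter(result.unexpected));
  return result;
}
```

A run passes only when both lists are empty. Reporting the two lists separately tells the user whether a test is stale (missing diagnostics) or the front-end is noisier than intended (unexpected diagnostics).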
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 129 | |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 130 | |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 131 | III. Current advantages over GCC: |
| 132 | |
| 133 | * Column numbers are fully tracked (no 256 col limit, no GCC-style pruning). |
| 134 | * All diagnostics have column numbers, include 'caret diagnostics', and they |
| 135 | highlight regions of interesting code (e.g. the LHS and RHS of a binop). |
| 136 | * Full diagnostic customization by client (can format diagnostics however they |
| 137 | like, e.g. in an IDE or refactoring tool) through DiagnosticClient interface. |
| 138 | * Built as a framework, can be reused by multiple tools. |
| 139 | * All languages supported linked into same library (no cc1, cc1obj, ...). |
| 140 | * mmap's code in read-only, does not dirty the pages like GCC (mem footprint). |
| 141 | * LLVM License, can be linked into non-GPL projects. |
| 142 | * Full diagnostic control, per diagnostic. Diagnostics are identified by ID. |
| 143 | * Significantly faster than GCC at semantic analysis, parsing, preprocessing |
| 144 | and lexing. |
| 145 | * Defers exposing platform-specific stuff to as late as possible, tracks use of |
| 146 | platform-specific features (e.g. #ifdef PPC) to allow 'portable bytecodes'. |
| 147 | * The lexer doesn't rely on the "lexer hack": it has no notion of scope and |
| 148 | does not categorize identifiers as types or variables -- this is up to the |
| 149 | parser to decide. |
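To make the 'caret diagnostics' item above concrete, here is a minimal sketch of how a marker line could be built from a column number and a set of highlighted ranges. It is not the front-end's own rendering code: `ColumnRange` and `caretLine` are hypothetical names, columns are assumed to be 0-based, and tabs and multi-byte characters are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Half-open column range [first, second), 0-based, within one source line.
using ColumnRange = std::pair<std::size_t, std::size_t>;

// Build the marker line printed under a source line: '~' under each
// highlighted range and '^' at the diagnostic's own column.
inline std::string caretLine(const std::string &sourceLine,
                             std::size_t caretCol,
                             const std::vector<ColumnRange> &ranges) {
  assert(caretCol < sourceLine.size() && "caret must point into the line");
  std::string marks(sourceLine.size(), ' ');
  for (const ColumnRange &r : ranges)
    for (std::size_t c = r.first; c < r.second && c < marks.size(); ++c)
      marks[c] = '~';
  marks[caretCol] = '^'; // the caret wins over any range covering it
  // Drop trailing spaces so the marker line ends at the last mark.
  marks.erase(marks.find_last_not_of(' ') + 1);
  return marks;
}
```

For the line `x = foo + bar;` with the caret on the `+` (column 8) and ranges covering `foo` and `bar`, `caretLine` returns `    ~~~ ^ ~~~`, i.e. the LHS and RHS of the binop are underlined and the operator is pointed at.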
| 150 | |
| 151 | Potential Future Features: |
| 152 | |
| 153 | * Fine-grained diag control within the source (#pragma enable/disable warning). |
| 154 | * Better token tracking within macros? (Token came from this line, which is |
| 155 | a macro argument instantiated here, recursively instantiated here). |
| 156 | * Fast #import with a module system. |
| 157 | * Dependency tracking: change to header file doesn't recompile every function |
| 158 | that textually depends on it: recompile only those functions that need it. |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 159 | This is aka 'incremental parsing'. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 160 | |
| 161 | |
| 162 | IV. Missing Functionality / Improvements |
| 163 | |
| 164 | clang driver: |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 165 | * Include search paths are hard-coded into the driver. Doh. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 166 | |
| 167 | File Manager: |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 168 | * Reduce syscalls for reduced compile time, see NOTES.txt. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 169 | |
| 170 | Lexer: |
| 171 | * Source character mapping. GCC supports ASCII and UTF-8. |
| 172 | See GCC options: -ftarget-charset and -ftarget-wide-charset. |
| 173 | * Universal character support. Experimental in GCC, enabled with |
| 174 | -fextended-identifiers. |
| 175 | * -fpreprocessed mode. |
| 176 | |
| 177 | Preprocessor: |
| 178 | * Know about Apple header maps. |
| 179 | * #assert/#unassert |
| 180 | * #line / #file directives (currently accepted and ignored). |
| 181 | * MSExtension: "L#param" stringizes to a wide string literal. |
| 182 | * Charize extension: "#define F(o) #@o F(a)" -> 'a'. |
| 183 | * Consider merging the parser's expression parser into the preprocessor to |
| 184 | eliminate duplicate code. |
| 185 | * Add support for -M* |
| 186 | |
| 187 | Traditional Preprocessor: |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 188 | * Currently, we have none. :) |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 189 | |
| 190 | Parser: |
| 191 | * C90/K&R modes are only partially implemented. |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 192 | * __extension__ is currently just skipped and ignored. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 193 | |
| 194 | Semantic Analysis: |
Chris Lattner | 7ee5cb3 | 2007-12-10 05:11:40 +0000 | [diff] [blame] | 195 | * Perhaps 85% done. |
Reid Spencer | 5f016e2 | 2007-07-11 17:01:13 +0000 | [diff] [blame] | 196 | |
Chris Lattner | 3321f9f | 2007-07-11 18:58:19 +0000 | [diff] [blame] | 197 | LLVM Code Gen: |
Chris Lattner | 037ba07 | 2008-06-27 21:56:03 +0000 | [diff] [blame] | 198 | * Most of the easy stuff is done, probably 65.42% done so far. |