//===----------------------------------------------------------------------===//
// C Language Family Front-end
//===----------------------------------------------------------------------===//
                                                             Chris Lattner

I. Introduction:

 clang: noun
    1. A loud, resonant, metallic sound.
    2. The strident call of a crane or goose.
    3. C-language family front-end toolkit.

 The world needs better compiler tools, tools which are built as libraries. This
 design point allows reuse of the tools in new and novel ways. However, building
 the tools as libraries isn't enough: they must have clean APIs, be as
 decoupled from each other as possible, and be easy to modify/extend. This
 requires clean layering, decent design, and avoiding tying the libraries to a
 specific use. Oh yeah, did I mention that we want the resultant libraries to
 be as fast as possible? :)

 This front-end is built as a component of the LLVM toolkit that can be used
 with the LLVM backend or independently of it. In this spirit, the API has been
 carefully designed as the following components:

   libsupport  - Basic support library, reused from LLVM.

   libsystem   - System abstraction library, reused from LLVM.

   libbasic    - Diagnostics, SourceLocations, SourceBuffer abstraction,
                 file system caching for input source files. This depends on
                 libsupport and libsystem.

   libast      - Provides classes to represent the C AST, the C type system,
                 builtin functions, and various helpers for analyzing and
                 manipulating the AST (visitors, pretty printers, etc). This
                 library depends on libbasic.

   liblex      - C/C++/ObjC lexing and preprocessing, identifier hash table,
                 pragma handling, tokens, and macros. This depends on libbasic.

   libparse    - C (for now) parsing and local semantic analysis. This library
                 invokes coarse-grained 'Actions' provided by the client to do
                 stuff (e.g. libsema builds ASTs). This depends on liblex.

   libsema     - Provides a set of parser actions to build a standardized AST
                 for programs. ASTs are 'streamed' out one top-level
                 declaration at a time, allowing clients to use decl-at-a-time
                 processing, build up entire translation units, or even build
                 'whole program' ASTs depending on how they use the APIs. This
                 depends on libast and libparse.

   librewrite  - Fast, scalable rewriting of source code. This operates on
                 the raw syntactic text of source code, allowing a client
                 to insert and delete text in very large source files using
                 the same source location information embedded in ASTs. This
                 is intended to be a low-level API that is useful for
                 higher-level clients and libraries such as code refactoring.

   libanalysis - Source-level dataflow analysis useful for performing analyses
                 such as computing live variables. It also includes a
                 path-sensitive "graph-reachability" engine for writing
                 analyses that reason about different possible paths of
                 execution through source code. This is currently being
                 employed to write a set of checks for finding bugs in software.

   libcodegen  - Lower the AST to LLVM IR for optimization & codegen. Depends
                 on libast.

   clang       - An example driver, client of the libraries at various levels.
                 This depends on all these libraries, and on LLVM VMCore.

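 The "depends on" notes in the list above can be written down as a small graph.
 What follows is a minimal C++ sketch, not code from clang: the names DepGraph,
 clangLibraryDeps, collectDeps, and linkSet are made up for illustration, and
 only the edges stated explicitly in the list are encoded (librewrite,
 libanalysis, and the clang driver are left out). It computes which libraries a
 tool would pull in when it starts from one component.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Direct dependencies as stated in the component list. The type and function
// names here are hypothetical and exist only for this sketch.
using DepGraph = std::map<std::string, std::vector<std::string>>;

DepGraph clangLibraryDeps() {
  return {
      {"libsupport", {}},
      {"libsystem", {}},
      {"libbasic", {"libsupport", "libsystem"}},
      {"libast", {"libbasic"}},
      {"liblex", {"libbasic"}},
      {"libparse", {"liblex"}},
      {"libsema", {"libast", "libparse"}},
      {"libcodegen", {"libast"}},
  };
}

// Collect `lib` and everything it transitively depends on. The graph is a
// DAG, so the recursion terminates; the visited set also avoids walking
// shared dependencies such as libbasic more than once.
void collectDeps(const DepGraph &graph, const std::string &lib,
                 std::set<std::string> &out) {
  if (!out.insert(lib).second)
    return; // already visited
  auto it = graph.find(lib);
  assert(it != graph.end() && "unknown library name");
  for (const std::string &dep : it->second)
    collectDeps(graph, dep, out);
}

// Convenience wrapper returning the full set of libraries needed for `lib`.
std::set<std::string> linkSet(const DepGraph &graph, const std::string &lib) {
  std::set<std::string> out;
  collectDeps(graph, lib, out);
  return out;
}
```

 Under these assumptions, linkSet(clangLibraryDeps(), "liblex") contains
 libbasic, liblex, libsupport, and libsystem, and never touches libast or
 libcodegen.
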
 This front-end has been intentionally built as a DAG of libraries, making it
 easy to reuse individual parts or replace pieces if desired. For example, to
 build a preprocessor, you take the Basic and Lexer libraries. If you want an
 indexer, you take those plus the Parser library and provide some actions for
 indexing. If you want a refactoring, static analysis, or source-to-source
 compiler tool, it makes sense to take those plus the AST building and semantic
 analyzer library. Finally, if you want to use this with the LLVM backend,
 you'd take these components plus the AST to LLVM lowering code.

 In the future I hope this toolkit will grow to include new and interesting
 components, including a C++ front-end, ObjC support, and a whole lot of other
 things.

 Finally, it should be pointed out that the goal here is to build something that
 is high-quality and industrial-strength: all the obnoxious features of the C
 family must be correctly supported (trigraphs, preprocessor arcana, K&R-style
 prototypes, GCC/MS extensions, etc). It cannot be used if it is not 'real'.


II. Usage of clang driver:

 * Basic Command-Line Options:
   - Help: clang --help
   - Standard GCC options accepted: -E, -I*, -i*, -pedantic, -std=c90, etc.
   - To make diagnostics more gcc-like: -fno-caret-diagnostics -fno-show-column
   - Enable metric printing: -stats

 * -fsyntax-only is currently the default mode.

 * -E mode works the same way as GCC.

 * -Eonly mode does all preprocessing, but does not print the output.
     This is useful for timing the preprocessor.

 * -fsyntax-only is currently partially implemented, lacking some
     semantic analysis (some errors and warnings are not produced).

 * -parse-noop parses code without building an AST. This is useful
     for timing the cost of the parser without including AST building
     time.

 * -parse-ast builds ASTs, but doesn't print them. This is most
     useful for timing AST building vs -parse-noop.

 * -parse-ast-print pretty prints most expression and statement nodes.

 * -parse-ast-check checks that diagnostic messages that are expected
     are reported and that those which are reported are expected.

 * -dump-cfg builds ASTs and then CFGs. CFGs are then pretty-printed.

 * -view-cfg builds ASTs and then CFGs. CFGs are then visualized by
     invoking Graphviz.

     For more information on getting Graphviz to work with clang/LLVM,
     see: http://llvm.org/docs/ProgrammersManual.html#ViewGraph

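 The two-way check described for -parse-ast-check can be pictured as a pair of
 set differences. The C++ below is only a sketch of that idea: the Diag
 representation (line number plus message), CheckResult, and checkDiagnostics
 are assumptions made for illustration, and nothing here describes how the
 driver actually collects or matches expected diagnostics. Using sets also
 means duplicate diagnostics are collapsed, which is a simplification.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A diagnostic is identified here by (line, message). This representation is
// an assumption made for the sketch.
using Diag = std::pair<unsigned, std::string>;

struct CheckResult {
  std::vector<Diag> missing;    // expected but never reported
  std::vector<Diag> unexpected; // reported but never expected

  bool ok() const { return missing.empty() && unexpected.empty(); }
};

// Compare both directions: every expected diagnostic must have been reported,
// and every reported diagnostic must have been expected. std::set keeps its
// elements sorted, which is what std::set_difference requires.
CheckResult checkDiagnostics(const std::set<Diag> &expected,
                             const std::set<Diag> &reported) {
  CheckResult result;
  std::set_difference(expected.begin(), expected.end(), reported.begin(),
                      reported.end(), std::back_inserter(result.missing));
  std::set_difference(reported.begin(), reported.end(), expected.begin(),
                      expected.end(), std::back_inserter(result.unexpected));
  return result;
}
```

 A run passes only when both lists come back empty; a check that looked in
 just one direction would miss either silent regressions or spurious new
 diagnostics.
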

III. Current advantages over GCC:

 * Column numbers are fully tracked (no 256 col limit, no GCC-style pruning).
 * All diagnostics have column numbers and include 'caret diagnostics'; they
   highlight regions of interesting code (e.g. the LHS and RHS of a binop).
 * Full diagnostic customization by client (can format diagnostics however they
   like, e.g. in an IDE or refactoring tool) through DiagnosticClient interface.
 * Built as a framework, can be reused by multiple tools.
 * All languages supported linked into same library (no cc1,cc1obj, ...).
 * mmap's code in read-only, does not dirty the pages like GCC (mem footprint).
 * LLVM License, can be linked into non-GPL projects.
 * Full diagnostic control, per diagnostic. Diagnostics are identified by ID.
 * Significantly faster than GCC at semantic analysis, parsing, preprocessing
   and lexing.
 * Defers exposing platform-specific stuff to as late as possible, tracks use of
   platform-specific features (e.g. #ifdef PPC) to allow 'portable bytecodes'.
 * The lexer doesn't rely on the "lexer hack": it has no notion of scope and
   does not categorize identifiers as types or variables -- this is up to the
   parser to decide.

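 As an illustration of what column tracking buys you, the C++ sketch below
 formats a caret diagnostic with highlighted ranges. It is not clang's
 implementation: ColumnRange and formatCaretDiagnostic are hypothetical names,
 the output layout is an assumption, and tabs in the source line are not
 handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Half-open range of 0-based columns to underline with '~'.
struct ColumnRange {
  std::size_t begin;
  std::size_t end;
};

// Format one diagnostic as three lines: "file:line:col: message", the source
// line itself, and a marker line with '^' under the caret column and '~'
// under each highlighted range.
std::string formatCaretDiagnostic(const std::string &file, unsigned line,
                                  const std::string &sourceLine,
                                  std::size_t caretCol,
                                  const std::vector<ColumnRange> &ranges,
                                  const std::string &message) {
  assert(caretCol < sourceLine.size() && "caret must point into the line");

  std::string marks(sourceLine.size(), ' ');
  for (const ColumnRange &r : ranges)
    for (std::size_t i = r.begin; i < r.end && i < marks.size(); ++i)
      marks[i] = '~';
  marks[caretCol] = '^'; // the caret wins over any overlapping range

  // Drop trailing blanks so the marker line ends at its last mark.
  marks.erase(marks.find_last_not_of(' ') + 1);

  // Columns are reported 1-based in the header line.
  return file + ":" + std::to_string(line) + ":" +
         std::to_string(caretCol + 1) + ": " + message + "\n" + sourceLine +
         "\n" + marks + "\n";
}
```

 For the source line "  x = a + b;" with the caret on the '+' (column offset 8)
 and ranges covering 'a' and 'b', the marker line comes out as "      ~ ^ ~",
 i.e. the operator is pointed at and both operands are underlined.
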
Potential Future Features:

 * Fine grained diag control within the source (#pragma enable/disable warning).
 * Better token tracking within macros? (Token came from this line, which is
   a macro argument instantiated here, recursively instantiated here).
 * Fast #import with a module system.
 * Dependency tracking: change to header file doesn't recompile every function
   that textually depends on it: recompile only those functions that need it.
   This is aka 'incremental parsing'.


IV. Missing Functionality / Improvements

clang driver:
 * Include search paths are hard-coded into the driver. Doh.

File Manager:
 * Reduce syscalls for reduced compile time, see NOTES.txt.

Lexer:
 * Source character mapping. GCC supports ASCII and UTF-8.
   See GCC options: -ftarget-charset and -ftarget-wide-charset.
 * Universal character support. Experimental in GCC, enabled with
   -fextended-identifiers.
 * -fpreprocessed mode.

Preprocessor:
 * Know about Apple header maps.
 * #assert/#unassert
 * #line / #file directives (currently accepted and ignored).
 * MSExtension: "L#param" stringizes to a wide string literal.
 * Charize extension: "#define F(o) #@o F(a)" -> 'a'.
 * Consider merging the parser's expression parser into the preprocessor to
   eliminate duplicate code.
 * Add support for -M*

Traditional Preprocessor:
 * Currently, we have none. :)

Parser:
 * C90/K&R modes are only partially implemented.
 * __extension__ is currently just skipped and ignored.

Semantic Analysis:
 * Perhaps 85% done.

LLVM Code Gen:
 * Most of the easy stuff is done, probably 65.42% done so far.