//===----------------------------------------------------------------------===//
// C Language Family Front-end
//===----------------------------------------------------------------------===//
                                                             Chris Lattner

I. Introduction:

 clang: noun
    1. A loud, resonant, metallic sound.
    2. The strident call of a crane or goose.
    3. C-language family front-end toolkit.

 The world needs better compiler tools, tools which are built as libraries. This
 design point allows reuse of the tools in new and novel ways. However, building
 the tools as libraries isn't enough: they must have clean APIs, be as
 decoupled from each other as possible, and be easy to modify/extend. This
 requires clean layering, decent design, and avoiding tying the libraries to a
 specific use. Oh yeah, did I mention that we want the resultant libraries to
 be as fast as possible? :)

 This front-end is built as a component of the LLVM toolkit that can be used
 with the LLVM backend or independently of it. In this spirit, the API has been
 carefully designed as the following components:

   libsupport  - Basic support library, reused from LLVM.
   libsystem   - System abstraction library, reused from LLVM.

   libbasic    - Diagnostics, SourceLocations, SourceBuffer abstraction,
                 file system caching for input source files. This depends on
                 libsupport and libsystem.
   libast      - Provides classes to represent the C AST, the C type system,
                 builtin functions, and various helpers for analyzing and
                 manipulating the AST (visitors, pretty printers, etc). This
                 library depends on libbasic.

   liblex      - C/C++/ObjC lexing and preprocessing, identifier hash table,
                 pragma handling, tokens, and macros. This depends on libbasic.
   libparse    - C (for now) parsing and local semantic analysis. This library
                 invokes coarse-grained 'Actions' provided by the client to do
                 stuff (e.g. libsema builds ASTs). This depends on liblex.
   libsema     - Provides a set of parser actions to build a standardized AST
                 for programs. ASTs are 'streamed' out a top-level declaration
                 at a time, allowing clients to use decl-at-a-time processing,
                 build up entire translation units, or even build 'whole
                 program' ASTs depending on how they use the APIs. This depends
                 on libast and libparse.

   libcodegen  - Lower the AST to LLVM IR for optimization & codegen. Depends
                 on libast.
   clang       - An example driver, client of the libraries at various levels.
                 This depends on all these libraries, and on LLVM VMCore.
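 The 'Actions' split between libparse and its clients can be pictured with a
 toy example. The sketch below is not clang's API: the names Actions,
 NoopActions, IndexActions, and parseDeclarations are made up for illustration,
 and the "parser" only understands whitespace-separated "<type> <name> ;"
 declarations. It only shows the shape of the idea: the parser recognizes
 syntax, and the client decides what, if anything, to build from it.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical callback interface: the parser reports what it recognized,
// and the client decides what to do with it.
struct Actions {
  virtual ~Actions() = default;
  virtual void onDeclaration(const std::string &type,
                             const std::string &name) = 0;
};

// A client that does no semantic work at all.
struct NoopActions : Actions {
  void onDeclaration(const std::string &, const std::string &) override {}
};

// A client that records every declared name, as an indexer might.
struct IndexActions : Actions {
  std::vector<std::string> names;
  void onDeclaration(const std::string &, const std::string &name) override {
    assert(!name.empty());  // stream extraction never yields an empty token
    names.push_back(name);
  }
};

// Toy "parser": accepts only whitespace-separated "<type> <name> ;" triples.
// Returns false on the first malformed declaration.
bool parseDeclarations(const std::string &source, Actions &actions) {
  std::istringstream in(source);
  std::string type, name, semi;
  while (in >> type) {
    if (!(in >> name >> semi) || semi != ";") return false;
    actions.onDeclaration(type, name);
  }
  return true;
}
```

 Swapping IndexActions for NoopActions changes what a parse produces without
 touching parseDeclarations, which is the property that lets different tools
 share one parser.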

 This front-end has been intentionally built as a DAG, making it easy to
 reuse individual parts or replace pieces if desired. For example, to build a
 preprocessor, you take the Basic and Lexer libraries. If you want an indexer,
 you take those plus the Parser library and provide some actions for indexing.
 If you want a refactoring, static analysis, or source-to-source compiler tool,
 it makes sense to take those plus the AST building and semantic analyzer
 library. Finally, if you want to use this with the LLVM backend, you'd take
 these components plus the AST to LLVM lowering code.
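 Working out which libraries a given tool pulls in amounts to following the
 dependency edges in the component list. The sketch below encodes those edges
 as listed (the clang driver and LLVM VMCore are left out for brevity) and
 collects the transitive closure for a set of root libraries. The names
 DepGraph, clangLibraries, collect, and linkSet are hypothetical helpers for
 this illustration, not part of the toolkit.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using DepGraph = std::map<std::string, std::vector<std::string>>;

// Direct dependencies as stated in the component list.
DepGraph clangLibraries() {
  return {
      {"libsupport", {}},
      {"libsystem", {}},
      {"libbasic", {"libsupport", "libsystem"}},
      {"libast", {"libbasic"}},
      {"liblex", {"libbasic"}},
      {"libparse", {"liblex"}},
      {"libsema", {"libast", "libparse"}},
      {"libcodegen", {"libast"}},
  };
}

// Adds `lib` and everything it depends on, directly or transitively, to `out`.
void collect(const DepGraph &g, const std::string &lib,
             std::set<std::string> &out) {
  auto it = g.find(lib);
  assert(it != g.end() && "unknown library name");
  if (!out.insert(lib).second) return;  // already visited
  for (const auto &dep : it->second) collect(g, dep, out);
}

// Returns the full set of libraries needed to use the given roots.
std::set<std::string> linkSet(const DepGraph &g,
                              const std::vector<std::string> &roots) {
  std::set<std::string> out;
  for (const auto &r : roots) collect(g, r, out);
  return out;
}
```

 Under these edges, linkSet(g, {"liblex"}) yields liblex, libbasic, libsupport,
 and libsystem, matching the preprocessor case, while asking for libparse adds
 the parser without dragging in libast.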

 In the future I hope this toolkit will grow to include new and interesting
 components, including a C++ front-end, ObjC support, and a whole lot of other
 things.

 Finally, it should be pointed out that the goal here is to build something that
 is high-quality and industrial-strength: all the obnoxious features of the C
 family must be correctly supported (trigraphs, preprocessor arcana, K&R-style
 prototypes, GCC/MS extensions, etc). It cannot be used if it is not 'real'.


II. Usage of clang driver:

 * Basic Command-Line Options:
   - Help: clang --help
   - Standard GCC options accepted: -E, -I*, -i*, -pedantic, -std=c90, etc.
   - To make diagnostics more gcc-like: -fno-caret-diagnostics -fno-show-column
   - Enable metric printing: -stats

 * -fsyntax-only is the default mode.

 * -E mode gives output nearly identical to GCC, though not all bugs in
   whitespace calculation have been emulated (e.g. the number of blank lines
   emitted).

 * -fsyntax-only is currently partially implemented, lacking some semantic
   analysis.

 * -Eonly mode does all preprocessing, but does not print the output, useful for
   timing the preprocessor.

 * -parse-print-callbacks prints almost no callbacks so far.

 * -parse-ast builds ASTs, but doesn't print them. This is most useful for
   timing AST building vs -parse-noop.

 * -parse-ast-print prints most expression and statement nodes, but some
   minor things are missing.

 * -parse-ast-check checks that diagnostic messages that are expected are
   reported and that those which are reported are expected.
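 The two-way check that -parse-ast-check describes boils down to a pair of set
 differences. How the driver actually encodes and collects expectations is not
 covered here, so the sketch below assumes diagnostics are plain strings; the
 names CheckResult and checkDiagnostics are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

struct CheckResult {
  std::set<std::string> missing;     // expected but never reported
  std::set<std::string> unexpected;  // reported but never expected
  bool ok() const { return missing.empty() && unexpected.empty(); }
};

// Compares the expected diagnostics against the reported ones in both
// directions. Assumption: a diagnostic is identified by its message text.
CheckResult checkDiagnostics(const std::set<std::string> &expected,
                             const std::set<std::string> &reported) {
  CheckResult r;
  std::set_difference(expected.begin(), expected.end(),
                      reported.begin(), reported.end(),
                      std::inserter(r.missing, r.missing.end()));
  std::set_difference(reported.begin(), reported.end(),
                      expected.begin(), expected.end(),
                      std::inserter(r.unexpected, r.unexpected.end()));
  return r;
}
```

 A run passes only when both sets come back empty, so a test fails both when an
 expected message goes missing and when a new, unannounced one appears.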

III. Current advantages over GCC:

 * Column numbers are fully tracked (no 256 col limit, no GCC-style pruning).
 * All diagnostics have column numbers, include 'caret diagnostics', and
   highlight regions of interesting code (e.g. the LHS and RHS of a binop).
 * Full diagnostic customization by client (can format diagnostics however they
   like, e.g. in an IDE or refactoring tool) through DiagnosticClient interface.
 * Built as a framework, can be reused by multiple tools.
 * All languages supported linked into same library (no cc1,cc1obj, ...).
 * mmap's code in read-only, does not dirty the pages like GCC (mem footprint).
 * LLVM License, can be linked into non-GPL projects.
 * Full diagnostic control, per diagnostic. Diagnostics are identified by ID.
 * Significantly faster than GCC at semantic analysis, parsing, preprocessing
   and lexing.
 * Defers exposing platform-specific stuff to as late as possible, tracks use of
   platform-specific features (e.g. #ifdef PPC) to allow 'portable bytecodes'.
 * The lexer doesn't rely on the "lexer hack": it has no notion of scope and
   does not categorize identifiers as types or variables -- this is up to the
   parser to decide.
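 To make the last point concrete: in C, "T * x;" declares a pointer if T names
 a type and is a multiplication expression otherwise. The toy sketch below
 keeps the lexer scope-free, so every name comes out as a plain identifier, and
 leaves the decision to a parser-side helper that consults its own set of type
 names. All names here (TokKind, Token, lex, classify) are hypothetical, and
 the lexer only handles identifiers, '*', and ';'.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum class TokKind { Identifier, Star, Semi };
struct Token { TokKind kind; std::string text; };

// Scope-free lexer: every name is just an Identifier, never a "type name".
std::vector<Token> lex(const std::string &src) {
  std::vector<Token> toks;
  for (std::size_t i = 0; i < src.size();) {
    unsigned char c = src[i];
    if (c == '*') { toks.push_back({TokKind::Star, "*"}); ++i; }
    else if (c == ';') { toks.push_back({TokKind::Semi, ";"}); ++i; }
    else if (std::isalpha(c) || c == '_') {
      std::size_t start = i;
      while (i < src.size() &&
             (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
        ++i;
      toks.push_back({TokKind::Identifier, src.substr(start, i - start)});
    }
    else { ++i; }  // toy lexer: skip whitespace and anything unrecognized
  }
  return toks;
}

enum class StmtKind { Declaration, Expression };

// The parser side owns the set of known type names and makes the call.
StmtKind classify(const std::vector<Token> &toks,
                  const std::set<std::string> &typeNames) {
  assert(!toks.empty());
  bool startsWithType = toks[0].kind == TokKind::Identifier &&
                        typeNames.count(toks[0].text) > 0;
  return startsWithType ? StmtKind::Declaration : StmtKind::Expression;
}
```

 The same token stream is classified differently depending only on the type
 names the parser supplies, so the lexer never needs feedback from scope
 tracking.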

Potential Future Features:

 * Fine grained diag control within the source (#pragma enable/disable warning).
 * Better token tracking within macros? (Token came from this line, which is
   a macro argument instantiated here, recursively instantiated here).
 * Fast #import with a module system.
 * Dependency tracking: a change to a header file doesn't recompile every
   function that textually depends on it: recompile only those functions that
   need it.


IV. Missing Functionality / Improvements

clang driver:
 * Include search paths are hard-coded into the driver.

File Manager:
 * Reduce syscalls, see NOTES.txt.

Lexer:
 * Source character mapping. GCC supports ASCII and UTF-8.
   See GCC options: -ftarget-charset and -ftarget-wide-charset.
 * Universal character support. Experimental in GCC, enabled with
   -fextended-identifiers.
 * -fpreprocessed mode.

Preprocessor:
 * Know about Apple header maps.
 * #assert/#unassert
 * #line / #file directives (currently accepted and ignored).
 * MSExtension: "L#param" stringizes to a wide string literal.
 * Charize extension: "#define F(o) #@o F(a)" -> 'a'.
 * Consider merging the parser's expression parser into the preprocessor to
   eliminate duplicate code.
 * Add support for -M*

Traditional Preprocessor:
 * All.

Parser:
 * C90/K&R modes are only partially implemented.
 * __extension__, __attribute__ [currently just skipped and ignored].
 * "initializers", GCC inline asm.

Semantic Analysis:
 * Perhaps 75% done.

Code Gen:
 * Mostly missing.