[PDB] Begin adding documentation for the PDB file format.

Differential Revision: https://reviews.llvm.org/D26374

llvm-svn: 286491
diff --git a/llvm/docs/PDB/DbiStream.rst b/llvm/docs/PDB/DbiStream.rst
new file mode 100644
index 0000000..0a247a1
--- /dev/null
+++ b/llvm/docs/PDB/DbiStream.rst
@@ -0,0 +1,3 @@
+=====================================

+The PDB DBI (Debug Info) Stream

+=====================================

diff --git a/llvm/docs/PDB/GlobalStream.rst b/llvm/docs/PDB/GlobalStream.rst
new file mode 100644
index 0000000..314b9f0
--- /dev/null
+++ b/llvm/docs/PDB/GlobalStream.rst
@@ -0,0 +1,3 @@
+=====================================

+The PDB Global Symbol Stream

+=====================================

diff --git a/llvm/docs/PDB/HashStream.rst b/llvm/docs/PDB/HashStream.rst
new file mode 100644
index 0000000..a758db4
--- /dev/null
+++ b/llvm/docs/PDB/HashStream.rst
@@ -0,0 +1,3 @@
+=====================================

+The TPI & IPI Hash Streams

+=====================================

diff --git a/llvm/docs/PDB/ModiStream.rst b/llvm/docs/PDB/ModiStream.rst
new file mode 100644
index 0000000..3eb4505
--- /dev/null
+++ b/llvm/docs/PDB/ModiStream.rst
@@ -0,0 +1,3 @@
+=====================================

+The Module Information Stream

+=====================================

diff --git a/llvm/docs/PDB/MsfFile.rst b/llvm/docs/PDB/MsfFile.rst
new file mode 100644
index 0000000..bdceca3
--- /dev/null
+++ b/llvm/docs/PDB/MsfFile.rst
@@ -0,0 +1,121 @@
+=====================================

+The MSF File Format

+=====================================

+

+.. contents::

+   :local:

+

+.. _msf_superblock:

+

+The Superblock

+==============

+At file offset 0 in an MSF file is the MSF *SuperBlock*, which is laid out as

+follows:

+

+.. code-block:: c++

+

+  struct SuperBlock {

+    char FileMagic[sizeof(Magic)];

+    ulittle32_t BlockSize;

+    ulittle32_t FreeBlockMapBlock;

+    ulittle32_t NumBlocks;

+    ulittle32_t NumDirectoryBytes;

+    ulittle32_t Unknown;

+    ulittle32_t BlockMapAddr;

+  };

+

+- **FileMagic** - Must be equal to ``"Microsoft C/C++ MSF 7.00\r\n"``

+  followed by the bytes ``1A 44 53 00 00 00``.

+- **BlockSize** - The block size of the internal file system.  Valid values are

+  512, 1024, 2048, and 4096 bytes.  Certain aspects of the MSF file layout vary

+  depending on the block size.  LLVM handles only a block size of 4KiB, and all

+  further discussion assumes a block size of 4KiB.

+- **FreeBlockMapBlock** - The index of a block within the file, at which begins

+  a bitfield representing the set of all blocks within the file which are "free"

+  (i.e. the data within that block is not used).  This bitfield is spread across

+  the MSF file at ``BlockSize`` intervals.

+  **Important**: ``FreeBlockMapBlock`` can only be ``1`` or ``2``!  This field

+  is designed to support incremental and atomic updates of the underlying MSF

+  file.  While writing to an MSF file, if the value of this field is ``1``, you

+  can write your new modified bitfield to block ``2``, and vice versa.  Only when

+  you commit the file to disk do you need to swap the value in the SuperBlock

+  to point to the new ``FreeBlockMapBlock``.

+- **NumBlocks** - The total number of blocks in the file.  ``NumBlocks * BlockSize``

+  should equal the size of the file on disk.

+- **NumDirectoryBytes** - The size of the stream directory, in bytes.  The stream

+  directory contains information about each stream's size and the set of blocks

+  that it occupies.  It will be described in more detail later.

+- **BlockMapAddr** - The index of a block within the MSF file.  At this block is

+  an array of ``ulittle32_t``'s listing the blocks that the stream directory

+  resides on.  For large MSF files, the stream directory (which describes the

+  block layout of each stream) may not fit entirely on a single block.  As a

+  result, this extra layer of indirection is introduced, whereby this block

+  contains the list of blocks that the stream directory occupies, and the stream

+  directory itself can be stitched together accordingly.  The number of

+  ``ulittle32_t``'s in this array is given by ``ceil(NumDirectoryBytes / BlockSize)``.

+  
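The field descriptions above can be sketched as a small validation routine. This is only an illustration under stated assumptions: ``SuperBlockInfo``, ``parseSuperBlock``, ``numDirectoryBlocks``, ``readLE32`` and ``kMsfMagic`` are hypothetical names, not LLVM APIs, and the sketch chooses to reject a file whose length differs from ``NumBlocks * BlockSize`` even though the text only says the two "should" match.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// The 32-byte magic: the text, "\r\n", then the bytes 1A 44 53 00 00 00.
static const unsigned char kMsfMagic[32] = {
    'M', 'i', 'c', 'r', 'o', 's', 'o', 'f', 't', ' ', 'C',
    '/', 'C', '+', '+', ' ', 'M', 'S', 'F', ' ', '7', '.',
    '0', '0', '\r', '\n', 0x1A, 0x44, 0x53, 0, 0, 0};

// Hypothetical host-side copy of the SuperBlock fields that follow the magic.
struct SuperBlockInfo {
  std::uint32_t BlockSize;
  std::uint32_t FreeBlockMapBlock;
  std::uint32_t NumBlocks;
  std::uint32_t NumDirectoryBytes;
  std::uint32_t Unknown;
  std::uint32_t BlockMapAddr;
};

// Decode one little-endian 32-bit value (the on-disk ulittle32_t layout).
static std::uint32_t readLE32(const unsigned char *p) {
  return std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
         (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);
}

// Returns true if `file` starts with a SuperBlock that passes the checks
// described in the field list; `sb` receives the decoded fields.
bool parseSuperBlock(const std::vector<unsigned char> &file,
                     SuperBlockInfo &sb) {
  const std::size_t headerSize = sizeof(kMsfMagic) + 6 * sizeof(std::uint32_t);
  if (file.size() < headerSize)
    return false;
  if (std::memcmp(file.data(), kMsfMagic, sizeof(kMsfMagic)) != 0)
    return false;

  const unsigned char *p = file.data() + sizeof(kMsfMagic);
  sb.BlockSize = readLE32(p);
  sb.FreeBlockMapBlock = readLE32(p + 4);
  sb.NumBlocks = readLE32(p + 8);
  sb.NumDirectoryBytes = readLE32(p + 12);
  sb.Unknown = readLE32(p + 16);
  sb.BlockMapAddr = readLE32(p + 20);

  const bool sizeOk = sb.BlockSize == 512 || sb.BlockSize == 1024 ||
                      sb.BlockSize == 2048 || sb.BlockSize == 4096;
  const bool fpmOk = sb.FreeBlockMapBlock == 1 || sb.FreeBlockMapBlock == 2;
  // Assumption of this sketch: a length mismatch is treated as an error.
  const bool lengthOk =
      std::uint64_t(sb.NumBlocks) * sb.BlockSize == file.size();
  return sizeOk && fpmOk && lengthOk;
}

// ceil(NumDirectoryBytes / BlockSize): the number of ulittle32_t entries in
// the array stored at block BlockMapAddr.
std::uint32_t numDirectoryBlocks(const SuperBlockInfo &sb) {
  assert(sb.BlockSize != 0);
  return std::uint32_t(
      (std::uint64_t(sb.NumDirectoryBytes) + sb.BlockSize - 1) / sb.BlockSize);
}
```

A caller would read the whole file into a byte vector, call ``parseSuperBlock``, and then read ``numDirectoryBlocks(sb)`` block indices from the block at ``BlockMapAddr``.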

+The Stream Directory

+====================

+The Stream Directory is the root of all access to the other streams in an MSF

+file.  Beginning at byte 0 of the stream directory is the following structure:

+

+.. code-block:: c++

+

+  struct StreamDirectory {

+    ulittle32_t NumStreams;

+    ulittle32_t StreamSizes[NumStreams];

+    ulittle32_t StreamBlocks[NumStreams][];

+  };

+  

+This structure occupies exactly ``SuperBlock->NumDirectoryBytes`` bytes.  Both

+arrays are of variable length, and ``StreamBlocks`` is jagged: row ``i`` holds

+``ceil(StreamSizes[i] / BlockSize)`` block indices.

+

+**Example:** Suppose a hypothetical PDB file with a 4KiB block size, and 4

+streams of lengths {1000 bytes, 8000 bytes, 16000 bytes, 9000 bytes}.

+

+Stream 0: ceil(1000 / 4096) = 1 block

+

+Stream 1: ceil(8000 / 4096) = 2 blocks

+

+Stream 2: ceil(16000 / 4096) = 4 blocks

+

+Stream 3: ceil(9000 / 4096) = 3 blocks

+

+In total, 10 blocks are used.  Let's see what the stream directory might look

+like:

+

+.. code-block:: c++

+

+  struct StreamDirectory {

+    ulittle32_t NumStreams = 4;

+    ulittle32_t StreamSizes[] = {1000, 8000, 16000, 9000};

+    ulittle32_t StreamBlocks[][] = {

+      {4},

+      {5, 6},

+      {11, 9, 7, 8},

+      {10, 15, 12}

+    };

+  };

+  

+In total, this occupies ``15 * 4 = 60`` bytes, so ``SuperBlock->NumDirectoryBytes``

+would equal ``60``, and the block at ``SuperBlock->BlockMapAddr`` would hold an

+array of one ``ulittle32_t``, since ``60 <= SuperBlock->BlockSize``.

+

+Note also that the streams are discontiguous: block 10, which belongs to stream

+3, sits between blocks 9 and 11 of stream 2.  You cannot assume anything about

+the layout of the blocks!
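As a rough sketch of how the jagged layout can be decoded, the routine below walks a stream directory that has already been stitched together and converted to host-order 32-bit words. ``StreamLayout`` and ``parseStreamDirectory`` are hypothetical names chosen for this illustration, and the sketch assumes every entry of ``StreamSizes`` is an ordinary byte count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded form of one stream: its size and its block list.
struct StreamLayout {
  std::uint32_t Size;
  std::vector<std::uint32_t> Blocks;
};

// `words` is the whole stream directory as 32-bit values:
//   NumStreams, StreamSizes[NumStreams], then the jagged StreamBlocks rows.
// Returns false if the directory is truncated or has trailing words.
bool parseStreamDirectory(const std::vector<std::uint32_t> &words,
                          std::uint32_t blockSize,
                          std::vector<StreamLayout> &out) {
  assert(blockSize != 0);
  out.clear();
  if (words.empty())
    return false;

  const std::size_t numStreams = words[0];
  if (words.size() - 1 < numStreams)
    return false;

  // The block lists start right after the StreamSizes array.
  std::size_t cursor = 1 + numStreams;
  for (std::size_t i = 0; i < numStreams; ++i) {
    StreamLayout s;
    s.Size = words[1 + i];
    // Row i holds ceil(Size / blockSize) block indices.
    const std::size_t numBlocks = static_cast<std::size_t>(
        (std::uint64_t(s.Size) + blockSize - 1) / blockSize);
    if (words.size() - cursor < numBlocks)
      return false;
    s.Blocks.assign(words.begin() + cursor, words.begin() + cursor + numBlocks);
    cursor += numBlocks;
    out.push_back(s);
  }
  return cursor == words.size();
}
```

Fed the 15 words of the example above (``4``, the four sizes, then the ten block indices), this yields four layouts whose block lists match the rows shown.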

+

+Alignment and Block Boundaries

+==============================

+As may be clear by now, it is possible for a single field (whether it be a

+high-level record, a long string field, or even a single ``uint16``) to begin and

+end in separate blocks.  For example, if the block size is 4096 bytes, and a

+``uint16`` field begins at the last byte of the current block, then it would

+need to end on the first byte of the next block.  Since blocks are not

+necessarily contiguously laid out in the file, this means that both the consumer

+and the producer of an MSF file must be prepared to split data apart

+accordingly.  In the aforementioned example, the low byte of the ``uint16``

+(fields are little-endian) would be written to the last byte of block N, and

+the high byte to the first byte of block N+1, which could be tens of thousands

+of bytes later (or even earlier!) in the file, depending on the stream directory.
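A minimal sketch of a reader that copes with such splits is shown below. It maps a stream offset to a file offset one block at a time, so a field that straddles two blocks is gathered from both. ``readStreamBytes`` and ``readStreamLE16`` are hypothetical helpers, not LLVM functions, and the sketch bounds reads by the stream's block list only, without consulting the stream's recorded size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies `length` bytes starting at stream offset `offset` into `out`.
// `blocks` is the stream's block list taken from the stream directory.
bool readStreamBytes(const std::vector<unsigned char> &file,
                     const std::vector<std::uint32_t> &blocks,
                     std::uint32_t blockSize, std::uint64_t offset,
                     std::size_t length, std::vector<unsigned char> &out) {
  assert(blockSize != 0);
  out.clear();
  while (length > 0) {
    const std::uint64_t indexInStream = offset / blockSize;
    const std::uint64_t offsetInBlock = offset % blockSize;
    if (indexInStream >= blocks.size())
      return false;
    // Where this piece of the stream actually lives in the file.
    const std::uint64_t fileOffset =
        std::uint64_t(blocks[indexInStream]) * blockSize + offsetInBlock;
    // Never read past the end of the current block.
    const std::size_t chunk = static_cast<std::size_t>(
        std::min<std::uint64_t>(length, blockSize - offsetInBlock));
    if (fileOffset + chunk > file.size())
      return false;
    out.insert(out.end(), file.begin() + fileOffset,
               file.begin() + fileOffset + chunk);
    offset += chunk;
    length -= chunk;
  }
  return true;
}

// Reads a little-endian 16-bit value that may straddle two blocks.
bool readStreamLE16(const std::vector<unsigned char> &file,
                    const std::vector<std::uint32_t> &blocks,
                    std::uint32_t blockSize, std::uint64_t offset,
                    std::uint16_t &value) {
  std::vector<unsigned char> bytes;
  if (!readStreamBytes(file, blocks, blockSize, offset, 2, bytes))
    return false;
  value = static_cast<std::uint16_t>(bytes[0] | (bytes[1] << 8));
  return true;
}
```

For a stream whose block list is ``{3, 1}`` with 4096-byte blocks, a ``uint16`` at stream offset 4095 takes its first byte from the end of block 3 and its second from the start of block 1, which is earlier in the file.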

diff --git a/llvm/docs/PDB/PdbStream.rst b/llvm/docs/PDB/PdbStream.rst
new file mode 100644
index 0000000..65adb9c
--- /dev/null
+++ b/llvm/docs/PDB/PdbStream.rst
@@ -0,0 +1,3 @@
+========================================

+The PDB Info Stream (aka the PDB Stream)

+========================================

diff --git a/llvm/docs/PDB/PublicStream.rst b/llvm/docs/PDB/PublicStream.rst
new file mode 100644
index 0000000..5b413cf
--- /dev/null
+++ b/llvm/docs/PDB/PublicStream.rst
@@ -0,0 +1,3 @@
+=====================================

+The PDB Public Symbol Stream

+=====================================

diff --git a/llvm/docs/PDB/TpiStream.rst b/llvm/docs/PDB/TpiStream.rst
new file mode 100644
index 0000000..1e3297e
--- /dev/null
+++ b/llvm/docs/PDB/TpiStream.rst
@@ -0,0 +1,3 @@
+=====================================

+The PDB TPI Stream

+=====================================

diff --git a/llvm/docs/PDB/index.rst b/llvm/docs/PDB/index.rst
new file mode 100644
index 0000000..2c1c3e3
--- /dev/null
+++ b/llvm/docs/PDB/index.rst
@@ -0,0 +1,160 @@
+=====================================

+The PDB File Format

+=====================================

+

+.. contents::

+   :local:

+

+.. _pdb_intro:

+

+Introduction

+============

+

+PDB (Program Database) is a file format invented by Microsoft that contains

+debug information that can be consumed by debuggers and other tools.  Since

+officially supported APIs exist on Windows for querying debug information from

+PDBs even without the user understanding the internals of the file format, a

+large ecosystem of tools has been built for Windows to consume this format.  In

+order for Clang to be able to generate programs that can interoperate with these

+tools, it is necessary for us to generate PDB files ourselves.

+

+At the same time, LLVM has a long history of being able to cross-compile from

+any platform to any platform, and we wish for the same to be true here.  So it

+is necessary for us to understand the PDB file format at the byte-level so that

+we can generate PDB files entirely on our own.

+

+This manual describes what we know about the PDB file format today: the layout

+of the file, the various streams contained within, the format of individual

+records within, and more.

+

+We would like to extend our heartfelt gratitude to Microsoft, without whom we

+would not be where we are today.  Much of the knowledge contained within this

+manual was learned through reading code published by Microsoft on their `GitHub

+repo <https://github.com/Microsoft/microsoft-pdb>`__.

+

+.. _pdb_layout:

+

+File Layout

+===========

+

+.. toctree::

+   :hidden:

+   

+   MsfFile

+   PdbStream

+   TpiStream

+   DbiStream

+   ModiStream

+   PublicStream

+   GlobalStream

+   HashStream

+

+.. _msf:

+

+The MSF Container

+-----------------

+A PDB file is really just a special case of an MSF (Multi-Stream Format) file.

+An MSF file is actually a miniature "file system within a file".  It contains

+multiple streams (aka files) which can represent arbitrary data, and these

+streams are divided into blocks which may not necessarily be contiguously

+laid out within the file (aka fragmented).  Additionally, the MSF contains a

+stream directory (aka MFT) which describes how the streams (files) are laid

+out within the MSF.

+

+For more information about the MSF container format, stream directory, and

+block layout, see :doc:`MsfFile`.

+

+.. _streams:

+

+Streams

+-------

+The PDB format contains a number of streams which describe various information

+such as the types, symbols, source files, and compilands (e.g. object files)

+of a program.  Additional streams contain hash tables that are used by

+debuggers and other tools to provide fast lookup of records and types by name,

+and still others record how the program was compiled, such as the specific

+toolchain used.  A summary of the streams contained in a PDB file is as

+follows:

+

++--------------------+------------------------------+-------------------------------------------+

+| Name               | Stream Index                 | Contents                                  |

++====================+==============================+===========================================+

+| Old Directory      | - Fixed Stream Index 0       | - Previous MSF Stream Directory           |

++--------------------+------------------------------+-------------------------------------------+

+| PDB Stream         | - Fixed Stream Index 1       | - Basic File Information                  |

+|                    |                              | - Fields to match EXE to this PDB         |

+|                    |                              | - Map of named streams to stream indices  |

++--------------------+------------------------------+-------------------------------------------+

+| TPI Stream         | - Fixed Stream Index 2       | - CodeView Type Records                   |

+|                    |                              | - Index of TPI Hash Stream                |

++--------------------+------------------------------+-------------------------------------------+

+| DBI Stream         | - Fixed Stream Index 3       | - Module/Compiland Information            |

+|                    |                              | - Indices of individual module streams    |

+|                    |                              | - Indices of public / global streams      |

+|                    |                              | - Section Contribution Information        |

+|                    |                              | - Source File Information                 |

+|                    |                              | - FPO / PGO Data                          |

++--------------------+------------------------------+-------------------------------------------+

+| IPI Stream         | - Fixed Stream Index 4       | - CodeView Type Records                   |

+|                    |                              | - Index of IPI Hash Stream                |

++--------------------+------------------------------+-------------------------------------------+

+| /LinkInfo          | - Contained in PDB Stream    | - Unknown                                 |

+|                    |   Named Stream map           |                                           |

++--------------------+------------------------------+-------------------------------------------+

+| /src/headerblock   | - Contained in PDB Stream    | - Unknown                                 |

+|                    |   Named Stream map           |                                           |

++--------------------+------------------------------+-------------------------------------------+

+| /names             | - Contained in PDB Stream    | - PDB-wide global string table used for   |

+|                    |   Named Stream map           |   string de-duplication                   |

++--------------------+------------------------------+-------------------------------------------+

+| Module Info Stream | - Contained in DBI Stream    | - CodeView Symbol Records for this module |

+|                    | - One for each compiland     | - Line Number Information                 |

++--------------------+------------------------------+-------------------------------------------+

+| Public Stream      | - Contained in DBI Stream    | - Public (Exported) Symbol Records        |

+|                    |                              | - Index of Public Hash Stream             |

++--------------------+------------------------------+-------------------------------------------+

+| Global Stream      | - Contained in DBI Stream    | - Global Symbol Records                   |

+|                    |                              | - Index of Global Hash Stream             |

++--------------------+------------------------------+-------------------------------------------+

+| TPI Hash Stream    | - Contained in TPI Stream    | - Hash table for looking up TPI records   |

+|                    |                              |   by name                                 |

++--------------------+------------------------------+-------------------------------------------+

+| IPI Hash Stream    | - Contained in IPI Stream    | - Hash table for looking up IPI records   |

+|                    |                              |   by name                                 |

++--------------------+------------------------------+-------------------------------------------+

+

+More information about the structure of each of these can be found on the

+following pages:

+   

+:doc:`PdbStream`

+   Information about the PDB Info Stream and how it is used to match PDBs to EXEs.

+

+:doc:`TpiStream`

+   Information about the TPI stream and the CodeView records contained within.

+

+:doc:`DbiStream`

+   Information about the DBI stream and relevant substreams including the Module Substreams,

+   source file information, and CodeView symbol records contained within.

+

+:doc:`ModiStream`

+   Information about the Module Information Stream, of which there is one for each

+   compilation unit, and the format of the symbols contained within.

+

+:doc:`PublicStream`

+   Information about the Public Symbol Stream.

+

+:doc:`GlobalStream`

+   Information about the Global Symbol Stream.

+

+:doc:`HashStream`

+   Information about the Hash Table stream, and how it can be used to quickly look up records

+   by name.

+

+CodeView

+========

+CodeView is another format which comes into the picture.  While MSF defines

+the structure of the overall file, and PDB defines the set of streams that

+appear within the MSF file and the format of those streams, CodeView defines

+the format of **symbol and type records** that appear within specific streams.

+Refer to the pages on `CodeView Symbol Records` and `CodeView Type Records` for

+more information about the CodeView format.