The PDB Serialized Hash Table Format
====================================

.. contents::
   :local:

.. _hash_intro:

Introduction
============

One of the design goals of the PDB format is to provide accelerated access to
debug information. For this reason, hash tables are serialized and embedded
directly in the file in several places, rather than requiring a consumer to
read a list of values and reconstruct the hash table on the fly.

The serialization format supports hash tables of arbitrarily large size and
capacity, as well as arbitrary value types and hash functions. The only
supported key type is a uint32. The only requirement is that the producer and
consumer agree on the hash function. As such, the hash function is not
discussed further in this document; it is assumed that for a particular
instance of a PDB file hash table, the appropriate hash function is being used.

On-Disk Format
==============

.. code-block:: none

  .--------------------.-- +0
  |        Size        |
  .--------------------.-- +4
  |      Capacity      |
  .--------------------.-- +8
  | Present Bit Vector |
  .--------------------.-- +N
  | Deleted Bit Vector |
  .--------------------.-- +M                  ─╮
  |        Key         |                        │
  .--------------------.-- +M+4                 │
  |       Value        |                        │
  .--------------------.-- +M+4+sizeof(Value)   │
           ...                                  ├─ |Capacity| Bucket entries
  .--------------------.                        │
  |        Key         |                        │
  .--------------------.                        │
  |       Value        |                        │
  .--------------------.                       ─╯

- **Size** - The number of values contained in the hash table.

- **Capacity** - The number of buckets in the hash table. Producers should
  keep ``Size`` no greater than ``2/3*Capacity+1``, that is, a load factor of
  roughly two thirds.

- **Present Bit Vector** - A serialized bit vector which contains information
  about which buckets have valid values. If the bucket has a value, the
  corresponding bit will be set, and if the bucket doesn't have a value (either
  because the bucket is empty or because the value is a tombstone value) the bit
  will be unset.

- **Deleted Bit Vector** - A serialized bit vector which contains information
  about which buckets have tombstone values. If the entry in this bucket is
  deleted, the bit will be set, otherwise it will be unset.

- **Keys and Values** - A list of ``Capacity`` hash buckets, where the first
  entry of each bucket is the key (always a uint32), and the second entry is
  the value. The state of each bucket (valid, empty, deleted) can be determined
  by examining the present and deleted bit vectors.
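
As a rough illustration of the fields above, the following Python sketch reads
the two header fields and classifies a bucket from its present and deleted
bits. It assumes little-endian 32-bit unsigned integers and integer division in
the load bound, neither of which this document states explicitly. The names
``read_header``, ``max_load`` and ``bucket_state`` are hypothetical and not part
of any PDB library.

.. code-block:: python

  import struct

  def read_header(data, offset=0):
      """Return (size, capacity) from the first 8 bytes at offset."""
      # Assumption: both fields are little-endian uint32 values.
      size, capacity = struct.unpack_from("<II", data, offset)
      return size, capacity

  def max_load(capacity):
      """Suggested upper bound on Size: 2/3 * Capacity + 1."""
      # Assumption: the bound is evaluated with integer arithmetic.
      return capacity * 2 // 3 + 1

  def bucket_state(index, present, deleted):
      """Classify a bucket given sets of present and deleted bucket indices."""
      if index in present:
          return "valid"
      if index in deleted:
          return "deleted"
      return "empty"

  # A made-up header: 3 values stored in a table of 8 buckets.
  header = struct.pack("<II", 3, 8)
  size, capacity = read_header(header)
  assert size <= max_load(capacity)

Representing the bit vectors as sets of bucket indices keeps the state check
short; a real reader might prefer to test bits in place.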


.. _hash_bit_vectors:

Present and Deleted Bit Vectors
===============================

The bit vectors indicating the status of each bucket are serialized as follows:

.. code-block:: none

  .--------------------.-- +0
  |     Word Count     |
  .--------------------.-- +4
  |       Word_0       |      ─╮
  .--------------------.-- +8  │
  |       Word_1       |       │
  .--------------------.-- +12 ├─ |Word Count| values
           ...                 │
  .--------------------.       │
  |       Word_N       |       │
  .--------------------.      ─╯
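
A minimal sketch of reading this layout in Python follows. It assumes every
field is a little-endian 32-bit unsigned integer, which this document does not
state explicitly, and the helper name ``read_bit_vector`` is hypothetical.

.. code-block:: python

  import struct

  def read_bit_vector(data, offset=0):
      """Return (words, next_offset) for a serialized bit vector at offset."""
      # Assumption: Word Count and each word are little-endian uint32 values.
      (word_count,) = struct.unpack_from("<I", data, offset)
      words = struct.unpack_from("<%dI" % word_count, data, offset + 4)
      return list(words), offset + 4 + 4 * word_count

  # A made-up vector with two words.
  blob = struct.pack("<III", 2, 0b101, 0x80000000)
  words, end = read_bit_vector(blob)

Returning the offset just past the last word lets a caller read the present
bit vector and then the deleted bit vector back to back.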

The words, when viewed as a contiguous block of bytes, represent a bit vector
with the following layout:

.. code-block:: none

    .------------.         .------------.------------.
    |   Word_N   |   ...   |   Word_1   |   Word_0   |
    .------------.         .------------.------------.
    |            |         |            |            |
  +N*32      +(N-1)*32    +64          +32          +0

where the k'th bit of this bit vector represents the status of the k'th bucket
in the hash table.
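
To make the mapping concrete, the sketch below looks up bit ``k`` in a list of
already-decoded words. It assumes bits within a word are numbered from the
least significant bit, consistent with ``Word_0`` starting at offset +0 in the
layout above. ``bit_is_set`` and ``set_bits`` are hypothetical helper names.

.. code-block:: python

  def bit_is_set(words, k):
      """Return True if bit k is set; bit k lives in words[k // 32]."""
      word_index, bit_index = divmod(k, 32)
      if word_index >= len(words):
          # Assumption: bits beyond the stored words count as unset.
          return False
      return (words[word_index] >> bit_index) & 1 == 1

  def set_bits(words):
      """List the indices of all set bits, i.e. the flagged buckets."""
      return [k for k in range(32 * len(words)) if bit_is_set(words, k)]

  # Bits 0 and 2 of Word_0 and bit 31 of Word_1 are set.
  print(set_bits([0b101, 0x80000000]))  # prints [0, 2, 63]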