.. _module-pw_kvs:

------
pw_kvs
------

.. note::
  The documentation for this module is currently under construction.

``pw_kvs`` is Pigweed's Key Value Store (KVS) library. KVS is a flash-backed
persistent storage system with integrated wear-leveling that serves as a
relatively lightweight alternative to a file system.

KeyValueStore
=============

The KVS system stores key and value data pairs. The key-value pairs are stored
in `flash memory`_ as a `key-value entry`_ (KV entry) that consists of a
header/metadata, the key data, and value data. KV entries are accessed through
Put, Get, and Delete operations.

Each flash sector is written sequentially in an append-only manner, with each
new entry written at a higher address than all previous entries written to
that sector since its last erase. Once information (header, metadata, data,
etc.) is written to flash, that information is not modified or cleared until a
full sector erase occurs as part of garbage collection.

Individual KV entries are contained within a single flash sector (do not cross
sector boundaries). Flash sectors can contain as many KV entries as fit in the
sector.

KVS does not store any data/metadata/state in flash beyond the KV entries. All
KVS system state can be derived from the stored KV entries. Current KVS system
state is determined at boot from flash-stored KV entries and then maintained in
RAM by the KVS. The KVS is at all times in a valid state on-flash, so there are
no windows of vulnerability to unexpected power loss or crash. The old entry
for a key is maintained until the new entry for that key is written and
verified.

Each `key-value entry`_ has a unique transaction ID that is incremented for
each KVS update transaction. When determining system state from flash-stored KV
entries, the valid entry with the highest transaction ID is considered to be
the “current” entry of the key. All stored entries of the same key with a lower
transaction ID are considered old or “stale”.

An update/rewrite of a previously stored key is done as a new KV entry with an
updated transaction ID and the new value for the key. The KVS internal state is
updated to reflect the new entry. The previously stored KV entries for that key
are not modified or removed from flash storage until garbage collection
reclaims the “stale” entries.

`Garbage collection`_ is done by copying any currently valid KV entries in the
sector to be garbage collected to a different sector and then erasing the
sector.

Flash Memory
------------

The flash storage used by KVS is composed of two layers, FlashMemory and
FlashPartition.

FlashMemory is the lower level that manages the raw read/write/erase of the
flash memory device.

FlashPartition is a portion of a FlashMemory. A FlashMemory may have multiple
FlashPartitions that represent different parts of the FlashMemory, such as
partitions for KVS, OTA, snapshots/crashlogs, etc. Each FlashPartition has its
own separate logical address space starting from zero to the size of the
partition. FlashPartition logical addresses do not always map directly to
FlashMemory addresses due to partition encryption, sector headers, etc.

Writes to flash must have a start address that is a multiple of the flash
write alignment. Write size must also be a multiple of flash write alignment.
Write alignment varies by flash device and partition type. Reads from flash do
not have any address or size alignment requirement - reads always have a
minimum alignment of 1.

FlashPartitions may have a different alignment than the FlashMemory they are
part of, so long as the partition's alignment is a multiple of the alignment
for the FlashMemory.

Sectors are the minimum erase size for both FlashMemory and FlashPartition.
FlashPartitions may have a different logical sector size than the FlashMemory
they are part of. Partition logical sectors may be smaller due to partition
overhead (encryption, wear tracking, etc) or larger due to combining raw
sectors into larger logical sectors.

FlashPartition supports access via NonSeekableWriter and SeekableReader.

Size report
-----------
The following size report showcases the memory usage of the KVS and
FlashPartition.

.. include:: kvs_size

Storage Allocation
------------------

KVS requires more storage space than the size of the key-value data stored.
This is due to the always free sector required for garbage collection and the
"write and garbage collect later" approach KVS uses.

KVS works poorly with stored data being more than 75% of the available
storage. It works best with stored data being less than 50% of the available
storage. Applications that prefer/need to do garbage collection at scheduled
times or that write very heavily can benefit from additional flash store
space.

The flash storage used by KVS is multiplied by the `redundancy`_ level used. A
redundancy of 2 will use twice the storage.

Key-Value Entry
---------------

Each key-value (KV) entry consists of a header/metadata, the key data, and
value data. Individual KV entries are contained within a single flash sector
(do not cross sector boundaries). Because of this, the maximum KV entry size is
the partition sector size.

KV entries are appended as needed to sectors, with append operations spread
over time. Each individual KV entry is written completely as a single
high-level operation. KV entries are appended to a sector as long as space is
available for a given KV entry. Multiple sectors can be active for writing at
any time.

When a key is rewritten (writing a new KV entry of an existing key), the KV
entry is stored at a new location that may or may not be located in the same
sector as the previous entry for that key. The new entry uses a transaction
ID greater than the previous transaction ID. The previous KV entry for that key
remains unaltered “on-disk” but is considered “stale”. It is garbage collected
at some future time.

Redundancy
----------

KVS supports storing redundant copies of KV entries. For a given redundancy
level (N), N total copies of each KV entry are stored. Redundant copies are
always stored in different sectors. This protects against corruption of, or
even complete loss of, up to N-1 sectors without data loss.

Redundancy increases flash usage in proportion to the redundancy level. The
RAM usage for KVS internal state has a small increase with redundancy.

Garbage Collection
------------------

Storage space occupied by stale KV entries is reclaimed and made available
for reuse through a garbage collection process. The base garbage collection
operation reclaims one sector at a time.

KVS keeps at least one sector free at all times to ensure the ability to
garbage collect. This free sector is used to copy valid entries from the
sector being garbage collected, which is then erased. The always free sector
is rotated as part of the KVS wear leveling.

Full Maintenance does garbage collection of all sectors except those that have
current valid KV entries.

Heavy Maintenance does garbage collection of all sectors. Use strong caution
when doing Heavy Maintenance as it can, compared to Full Maintenance, result
in a significant amount of moving valid entries.

Garbage collection can be performed by request of higher level software or
automatically as needed to make space available to write new entries.

Flash wear management
---------------------

Wear leveling is accomplished by cycling selection of the next sector to write
to. This cycling spreads flash wear across all free sectors so that no one
sector is prematurely worn out.

Wear leveling through cycling selection of the next sector to write:

* Location of new writes/rewrites of key-values will prefer sectors already
  in-use (partially filled), with new (blank) sectors used when no in-use
  sectors have large enough available space for the new write.
* New (blank) sectors selected cycle sequentially between available free
  sectors.
* The search for the first available sector starts from the current write
  sector + 1 and wraps around to the start when the end of the partition is
  reached.
* This spreads the erase/write cycles for heavily written/rewritten key-values
  across all free sectors, reducing wear on any single sector.
* Erase count is not considered as part of the wear leveling decision making
  process.
* Sectors with already written key-values that are not modified will remain in
  the original sector and not participate in wear-leveling, so long as the
  key-values in the sector remain unchanged.