Dan Williams | bc30196 | 2015-06-25 04:48:19 -0400 | [diff] [blame] | 1 | LIBNVDIMM: Non-Volatile Devices |
libnvdimm - kernel / libndctl - userspace helper library
linux-nvdimm@lists.01.org
v13


Glossary
Overview
Supporting Documents
Git Trees
LIBNVDIMM PMEM and BLK
Why BLK?
PMEM vs BLK
BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
Example NVDIMM Platform
LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
LIBNDCTL: Context
libndctl: instantiate a new library context example
LIBNVDIMM/LIBNDCTL: Bus
libnvdimm: control class device in /sys/class
libnvdimm: bus
libndctl: bus enumeration example
LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
libnvdimm: DIMM (NMEM)
libndctl: DIMM enumeration example
LIBNVDIMM/LIBNDCTL: Region
libnvdimm: region
libndctl: region enumeration example
Why Not Encode the Region Type into the Region Name?
How Do I Determine the Major Type of a Region?
LIBNVDIMM/LIBNDCTL: Namespace
libnvdimm: namespace
libndctl: namespace enumeration example
libndctl: namespace creation example
Why the Term "namespace"?
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
libnvdimm: btt layout
libndctl: btt creation example
Summary LIBNDCTL Diagram
| 40 | |
| 41 | |
Glossary
--------

PMEM: A system-physical-address range where writes are persistent. A
block device composed of PMEM is capable of DAX. A PMEM address range
may span an interleave of several DIMMs.

BLK: A set of one or more programmable memory mapped apertures provided
by a DIMM to access its media. This indirection precludes the
performance benefit of interleaving, but enables DIMM-bounded failure
modes.

DPA: DIMM Physical Address, a DIMM-relative offset. With one DIMM in
the system there would be a 1:1 system-physical-address:DPA association.
Once more DIMMs are added, a memory controller interleave must be
decoded to determine the DPA associated with a given
system-physical-address. BLK capacity always has a 1:1 relationship
with a single DIMM's DPA range.

DAX: File system extensions to bypass the page cache and block layer to
mmap persistent memory, from a PMEM block device, directly into a
process address space.

DSM: Device Specific Method: An ACPI method to control a specific
device - in this case the firmware.

DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT: Block Translation Table: Persistent memory is byte addressable.
Existing software may have an expectation that the power-fail-atomicity
of writes is at least one sector, 512 bytes. The BTT is an indirection
table with atomic update semantics to front a PMEM/BLK block device
driver and present arbitrary atomic sector sizes.

LABEL: Metadata stored on a DIMM device that partitions and identifies
(persistently names) storage between PMEM and BLK. It also partitions
BLK storage to host BTTs with different parameters per BLK-partition.
Note that traditional partition tables, GPT/MBR, are layered on top of a
BLK or PMEM device.
| 82 | |
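The DPA entry above notes that a memory controller interleave must be
decoded to find the DPA behind a system-physical-address. As a rough
illustration only, the sketch below assumes a plain round-robin
interleave with a fixed granularity; real platforms may stripe in an
opaque, hardware-specific manner. The names spa_to_dpa() and struct
dpa_loc are hypothetical and are not part of LIBNVDIMM or LIBNDCTL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: which DIMM of the set, and the offset in it. */
struct dpa_loc {
	unsigned int dimm;	/* index of the DIMM within the interleave set */
	uint64_t dpa;		/* offset into that DIMM's share of the set */
};

/*
 * Assumption: chunks of 'granularity' bytes are dealt out to 'ways'
 * DIMMs in round-robin order, starting with DIMM 0.  'spa_offset' is
 * relative to the start of the interleaved range.
 */
static struct dpa_loc spa_to_dpa(uint64_t spa_offset, unsigned int ways,
				 uint64_t granularity)
{
	struct dpa_loc loc;
	uint64_t chunk;

	assert(ways > 0 && granularity > 0);

	chunk = spa_offset / granularity;	/* global chunk number */
	loc.dimm = (unsigned int)(chunk % ways);
	/* full rounds completed, plus the offset inside the current chunk */
	loc.dpa = (chunk / ways) * granularity + spa_offset % granularity;
	return loc;
}
```

With ways == 1 the function degenerates to the 1:1 association
described for a single-DIMM system.
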
| 83 | |
Overview
--------

The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
and BLK mode access. These three modes of operation are described by
the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
implementation is generic and supports pre-NFIT platforms, it was guided
by the superset of capabilities needed to support this ACPI 6 definition
for NVDIMM resources. The bulk of the kernel implementation is in place
to handle the case where DPA accessible via PMEM is aliased with DPA
accessible via BLK. When that occurs a LABEL is needed to reserve DPA
for exclusive access via one mode at a time.

Supporting Documents
	ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
	NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
	DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
	Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
	LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
	LIBNDCTL: https://github.com/pmem/ndctl.git
	PMEM: https://github.com/01org/prd
| 108 | |
| 109 | |
LIBNVDIMM PMEM and BLK
----------------------

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways. Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss. Now, the NFIT
specification standardizes not only the description of PMEM, but also
BLK and platform message-passing entry points for control and
configuration.

For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
device driver:

1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
    range is contiguous in system memory and may be interleaved (hardware
    memory controller striped) across multiple DIMMs. When interleaved the
    platform may optionally provide details of which DIMMs are participating
    in the interleave.

    Note that while LIBNVDIMM describes system-physical-address ranges that may
    alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
    alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
    distinction. The different device-types are an implementation detail
    that userspace can exploit to implement policies like "only interface
    with address ranges from certain DIMMs". It is worth noting that when
    aliasing is present and a DIMM lacks a label, no block device can
    be created by default, as userspace needs to do at least one allocation
    of DPA to the PMEM range. In contrast, ND_NAMESPACE_IO ranges, once
    registered, can be immediately attached to nd_pmem.

2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
    defined apertures. A set of apertures will access just one DIMM.
    Multiple windows (apertures) allow multiple concurrent accesses, much like
    tagged-command-queuing, and would likely be used by different threads or
    different CPUs.

    The NFIT specification defines a standard format for a BLK-aperture, but
    the spec also allows for vendor specific layouts, and non-NFIT BLK
    implementations may have other designs for BLK I/O. For this reason
    "nd_blk" calls back into platform-specific code to perform the I/O.
    One such implementation is defined in the "Driver Writer's Guide" and "DSM
    Interface Example".
| 153 | |
| 154 | |
Why BLK?
--------

While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (reliability,
availability, and serviceability) model. An access to a corrupted
system-physical-address causes a CPU exception, while an access to a
corrupted address through a BLK-aperture causes that block window to
raise an error status in a register. The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.
Also, if an administrator ever wants to replace a memory module, it is
easier to service a system at DIMM module boundaries. Compare this to
PMEM, where data could be interleaved in an opaque hardware-specific
manner across several DIMMs.

PMEM vs BLK
BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem. They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture. Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results. For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions. For simplicity a DIMM is allowed one PMEM "region" per
interleave set in which it is a member. The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.
| 186 | |
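The exclusivity rule above (a given DPA is reachable through PMEM or
through BLK, never both) can be pictured as an overlap check over the
extents recorded for one DIMM. The sketch below is a simplified model
under stated assumptions: the struct, the enum, and the function names
are hypothetical, extents are assumed to be non-empty, and the actual
LABEL format is defined by the NVDIMM Namespace specification, not by
this code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum dpa_mode { DPA_PMEM, DPA_BLK };	/* hypothetical names */

/* Hypothetical per-DIMM allocation record: [start, start + len) in DPA. */
struct dpa_extent {
	uint64_t start;
	uint64_t len;
	enum dpa_mode mode;
};

/* Two half-open ranges overlap if each one starts before the other ends. */
static bool extents_overlap(const struct dpa_extent *a,
			    const struct dpa_extent *b)
{
	return a->start < b->start + b->len && b->start < a->start + a->len;
}

/* True if no PMEM extent shares any DPA with a BLK extent. */
static bool dpa_allocation_is_exclusive(const struct dpa_extent *ext, size_t n)
{
	assert(ext != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (ext[i].mode != ext[j].mode &&
			    extents_overlap(&ext[i], &ext[j]))
				return false;
	return true;
}
```

Note that two BLK extents are only compared against PMEM extents here;
a fuller model would also reject overlap between extents that belong
to different namespaces of the same mode.
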
BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
--------------------------------------------------

One of the few
reasons to allow multiple BLK namespaces per REGION is so that each
BLK-namespace can be configured with a BTT with unique atomic sector
sizes. While a PMEM device can host a BTT, the LABEL specification does
not provide for a sector size to be specified for a PMEM namespace.
This is due to the expectation that the primary usage model for PMEM is
via DAX, and the BTT is incompatible with DAX. However, for the cases
where an application or filesystem still needs atomic sector update
guarantees, it can register a BTT on a PMEM device or partition. See
LIBNVDIMM/LIBNDCTL: Block Translation Table "btt".
| 200 | |
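To make the "indirection table with atomic update semantics" idea
concrete, here is a toy, in-memory model of a sector write that never
modifies live data in place. It is only an illustration of the
principle: the names, sizes, and single spare block are assumptions,
and it does not reflect the on-media BTT layout, nor the logging
needed to recover the map after a real power failure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SECTOR_SIZE	512
#define TOY_NR_LBAS	4

/* Hypothetical toy device: one more internal block than external LBAs. */
struct toy_btt {
	uint32_t map[TOY_NR_LBAS];	/* external LBA -> internal block */
	uint32_t free_block;		/* the single spare internal block */
	uint8_t media[TOY_NR_LBAS + 1][TOY_SECTOR_SIZE];
};

static void toy_btt_init(struct toy_btt *btt)
{
	memset(btt, 0, sizeof(*btt));
	for (uint32_t i = 0; i < TOY_NR_LBAS; i++)
		btt->map[i] = i;		/* identity map to start */
	btt->free_block = TOY_NR_LBAS;
}

static void toy_btt_write(struct toy_btt *btt, uint32_t lba,
			  const uint8_t *data)
{
	uint32_t old;

	assert(lba < TOY_NR_LBAS);

	/* 1. Fill the spare block; readers still see the old sector. */
	memcpy(btt->media[btt->free_block], data, TOY_SECTOR_SIZE);
	/* 2. Publish the new data by switching one map entry. */
	old = btt->map[lba];
	btt->map[lba] = btt->free_block;
	/* 3. The previously mapped block becomes the new spare. */
	btt->free_block = old;
}

static const uint8_t *toy_btt_read(const struct toy_btt *btt, uint32_t lba)
{
	assert(lba < TOY_NR_LBAS);
	return btt->media[btt->map[lba]];
}
```

An interruption before step 2 leaves the old sector intact, and one
after step 2 exposes the complete new sector, which is the "all or
nothing" behavior that sector-atomic software expects.
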
| 201 | |
Example NVDIMM Platform
-----------------------

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts.


                             (a)               (b)          DIMM   BLK-REGION
          +-------------------+--------+--------+--------+
+------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0      region2
| imc0 +--+- - - region0- - - +--------+        +--------+
+--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1      region3
   |      +-------------------+--------v        v--------+
+--+---+                               |                 |
| cpu0 |                                     region1
+--+---+                               |                 |
   |      +----------------------------^        ^--------+
+--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2      region4
| imc1 +--+----------------------------|        +--------+
+------+  |           blk5.0           | pm1.0  | blk5.0 |    3      region5
          +----------------------------+--------+--------+
| 223 | |
In this platform we have four DIMMs and two memory controllers in one
socket. Each unique interface (BLK or PMEM) to DPA space is identified
by a region device with a dynamically assigned id (REGION0 - REGION5).

1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
    single PMEM namespace is created in the REGION0-SPA-range that spans most
    of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
    interleaved system-physical-address range is reclaimed as BLK-aperture
    accessed space starting at DPA-offset (a) into each DIMM. In that
    reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
    REGION3 where "blk2.0" and "blk3.0" are just human readable names that
    could be set to any user-desired name in the LABEL.

2. In the last portion of DIMM0 and DIMM1 we have an interleaved
    system-physical-address range, REGION1, that spans those two DIMMs as
    well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
    named "pm1.0"; the rest is reclaimed in 4 BLK-aperture namespaces (one
    for each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
    "blk5.0".

3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
    interleaved system-physical-address range (i.e. the DPA address past
    offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
    Note that this example shows that BLK-aperture namespaces don't need to
    be contiguous in DPA-space.

This bus is provided by the kernel under the device
/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
the nfit_test.ko module is loaded. This tests not only LIBNVDIMM but
the acpi_nfit.ko driver as well.
| 254 | |
| 255 | |
LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
--------------------------------------------------------

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API. The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state. The library is
based on the libabc template:
https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example

	struct ndctl_ctx *ctx;

	if (ndctl_new(&ctx) == 0)
		return ctx;
	else
		return NULL;
| 279 | |
LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs; the specification
does not preclude it. The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the
unit test.

LIBNVDIMM: control class device in /sys/class

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle.

	/sys/class/nd/ndctl0
	|-- dev
	|-- device -> ../../../ndbus0
	|-- subsystem -> ../../../../../../../class/nd
| 300 | |
| 301 | |
LIBNVDIMM: bus

	struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
			struct nvdimm_bus_descriptor *nfit_desc);

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- commands
	|-- nd
	|-- nfit
	|-- nmem0
	|-- nmem1
	|-- nmem2
	|-- nmem3
	|-- power
	|-- provider
	|-- region0
	|-- region1
	|-- region2
	|-- region3
	|-- region4
	|-- region5
	|-- uevent
	`-- wait_probe

LIBNDCTL: bus enumeration example
Find the bus handle that describes the bus from Example NVDIMM Platform

	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
			const char *provider)
	{
		struct ndctl_bus *bus;

		ndctl_bus_foreach(ctx, bus)
			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
				return bus;

		return NULL;
	}

	bus = get_bus_by_provider(ctx, "nfit_test.0");
| 342 | |
| 343 | |
LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
-------------------------------

The DIMM device provides a character device for sending commands to
hardware, and it is a container for LABELs. If the DIMM is defined by
NFIT then an optional 'nfit' attribute sub-directory is available to add
NFIT-specifics.

Note that the kernel device name for "DIMMs" is "nmemX". The NFIT
describes these devices via "Memory Device to System Physical Address
Range Mapping Structure", and there is no requirement that they actually
be physical DIMMs, so we use a more generic name.

LIBNVDIMM: DIMM (NMEM)

	struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
			const struct attribute_group **groups, unsigned long flags,
			unsigned long *dsm_mask);

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- nmem0
	|   |-- available_slots
	|   |-- commands
	|   |-- dev
	|   |-- devtype
	|   |-- driver -> ../../../../../bus/nd/drivers/nvdimm
	|   |-- modalias
	|   |-- nfit
	|   |   |-- device
	|   |   |-- format
	|   |   |-- handle
	|   |   |-- phys_id
	|   |   |-- rev_id
	|   |   |-- serial
	|   |   `-- vendor
	|   |-- state
	|   |-- subsystem -> ../../../../../bus/nd
	|   `-- uevent
	|-- nmem1
	[..]
| 384 | |
| 385 | |
LIBNDCTL: DIMM enumeration example

Note, in this example we are assuming NFIT-defined DIMMs, which are
identified by an "nfit_handle", a 32-bit value where:

	Bit 3:0    DIMM number within the memory channel
	Bit 7:4    memory channel number
	Bit 11:8   memory controller ID
	Bit 15:12  socket ID (within scope of a Node controller if node controller is present)
	Bit 27:16  Node Controller ID
	Bit 31:28  Reserved

	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
			unsigned int handle)
	{
		struct ndctl_dimm *dimm;

		ndctl_dimm_foreach(bus, dimm)
			if (ndctl_dimm_get_handle(dimm) == handle)
				return dimm;

		return NULL;
	}

	/* arguments are parenthesized so that expressions expand safely */
	#define DIMM_HANDLE(n, s, i, c, d) \
		((((n) & 0xfff) << 16) | (((s) & 0xf) << 12) | (((i) & 0xf) << 8) \
		 | (((c) & 0xf) << 4) | ((d) & 0xf))

	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
| 414 | |
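Going the other way, a handle value read from sysfs or returned by the
library can be split back into the fields of the bit layout listed
above. This is a minimal sketch that relies only on that layout; the
struct and function names are hypothetical and are not LIBNDCTL
interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of an "nfit_handle". */
struct nfit_handle_fields {
	unsigned int node;	/* bits 27:16, Node Controller ID */
	unsigned int socket;	/* bits 15:12, socket ID */
	unsigned int imc;	/* bits 11:8, memory controller ID */
	unsigned int channel;	/* bits 7:4, memory channel number */
	unsigned int dimm;	/* bits 3:0, DIMM number within the channel */
};

static struct nfit_handle_fields nfit_handle_decode(uint32_t handle)
{
	struct nfit_handle_fields f;

	f.node = (handle >> 16) & 0xfff;
	f.socket = (handle >> 12) & 0xf;
	f.imc = (handle >> 8) & 0xf;
	f.channel = (handle >> 4) & 0xf;
	f.dimm = handle & 0xf;
	return f;
}

/* Usage example: 0x0101 is memory controller 1, channel 0, DIMM 1. */
static void nfit_handle_decode_example(void)
{
	struct nfit_handle_fields f = nfit_handle_decode(0x0101);

	assert(f.node == 0 && f.socket == 0);
	assert(f.imc == 1 && f.channel == 0 && f.dimm == 1);
}
```

The reserved bits 31:28 are ignored rather than rejected, on the
assumption that a caller would prefer to tolerate unexpected values.
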
LIBNVDIMM/LIBNDCTL: Region
--------------------------

A generic REGION device is registered for each PMEM range or BLK-aperture
set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
sets on the "nfit_test.0" bus. The primary role of a region is to be a
container of "mappings". A mapping is a tuple of <DIMM,
DPA-start-offset, length>.

LIBNVDIMM provides a built-in driver for these REGION devices. This driver
is responsible for reconciling the aliased DPA mappings across all
regions, parsing the LABEL, if present, and then emitting NAMESPACE
devices with the resolved/exclusive DPA-boundaries for the nd_pmem or
nd_blk device driver to consume.

In addition to the generic attributes of "mapping"s, "interleave_ways",
and "size", the REGION device also exports some convenience attributes.
"nstype" indicates the integer type of namespace-device this region
emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
'add' event, "modalias" duplicates the MODALIAS variable stored by udev
at the 'add' event, and finally, the optional "spa_index" is provided in
the case where the region is defined by a SPA.
| 437 | |
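Integer-valued region attributes such as "nstype" or "size" can also
be read directly as small text files when libndctl is not available.
The helper below is a sketch under assumptions: the name
read_u64_attr() is hypothetical, and because this document does not
state whether a given attribute is printed in decimal or hexadecimal,
the parser accepts both (base 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success or a negative errno-style value on failure.
 */
static int read_u64_attr(const char *path, uint64_t *val)
{
	char buf[64];
	char *end;
	FILE *f;

	assert(path != NULL && val != NULL);

	f = fopen(path, "r");
	if (!f)
		return errno ? -errno : -EIO;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	errno = 0;
	*val = strtoull(buf, &end, 0);	/* base 0: decimal or 0x-prefixed */
	if (errno || end == buf)
		return -EINVAL;
	return 0;
}
```

For example, on the test bus described earlier a caller might pass
"/sys/devices/platform/nfit_test.0/ndbus0/region0/nstype" as the path.
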
LIBNVDIMM: region

	struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
			struct nd_region_desc *ndr_desc);
	struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
			struct nd_region_desc *ndr_desc);

	/sys/devices/platform/nfit_test.0/ndbus0
	|-- region0
	|   |-- available_size
	|   |-- btt0
	|   |-- btt_seed
	|   |-- devtype
	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
	|   |-- init_namespaces
	|   |-- mapping0
	|   |-- mapping1
	|   |-- mappings
	|   |-- modalias
	|   |-- namespace0.0
	|   |-- namespace_seed
	|   |-- numa_node
	|   |-- nfit
	|   |   `-- spa_index
	|   |-- nstype
	|   |-- set_cookie
	|   |-- size
	|   |-- subsystem -> ../../../../../bus/nd
	|   `-- uevent
	|-- region1
	[..]
| 469 | |
LIBNDCTL: region enumeration example

Sample region retrieval routines based on NFIT-unique data like
"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
BLK.

	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
			unsigned int spa_index)
	{
		struct ndctl_region *region;

		ndctl_region_foreach(bus, region) {
			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
				continue;
			if (ndctl_region_get_spa_index(region) == spa_index)
				return region;
		}
		return NULL;
	}

	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
			unsigned int handle)
	{
		struct ndctl_region *region;

		ndctl_region_foreach(bus, region) {
			struct ndctl_mapping *map;

			/* linux/ndctl.h names this constant ND_DEVICE_REGION_BLK */
			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLK)
				continue;
			ndctl_mapping_foreach(region, map) {
				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);

				if (ndctl_dimm_get_handle(dimm) == handle)
					return region;
			}
		}
		return NULL;
	}
| 509 | |
| 510 | |
Why Not Encode the Region Type into the Region Name?
----------------------------------------------------

At first glance it seems that, since NFIT defines just PMEM and BLK
interface types, we should simply name REGION devices with something
derived from those type names. However, the ND subsystem explicitly
keeps the REGION name generic and expects userspace to always consider
the region-attributes for four reasons:

1. There are already more than two REGION and "namespace" types. For
    PMEM there are two subtypes. As mentioned previously we have PMEM where
    the constituent DIMM devices are known and anonymous PMEM. For BLK
    regions the NFIT specification already anticipates vendor specific
    implementations. The exact distinction of what a region contains is in
    the region-attributes, not the region-name or the region-devtype.

2. A region with zero child-namespaces is a possible configuration. For
    example, the NFIT allows for a DCR to be published without a
    corresponding BLK-aperture. This equates to a DIMM that can only accept
    control/configuration messages, but no I/O through a descendant block
    device. Again, this "type" is advertised in the attributes ('mappings'
    == 0) and the name does not tell you much.

3. What if a third major interface type arises in the future? Outside
    of vendor specific implementations, it's not difficult to envision a
    third class of interface type beyond BLK and PMEM. With a generic name
    for the REGION level of the device-hierarchy, old userspace
    implementations can still make sense of new kernel advertised
    region-types. Userspace can always rely on the generic region
    attributes like "mappings", "size", etc. and the expected child devices
    named "namespace". This generic format of the device-model hierarchy
    allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
    future-proof.

4. There are more robust mechanisms for determining the major type of a
    region than a device name. See the next section, How Do I Determine the
    Major Type of a Region?
| 548 | |
How Do I Determine the Major Type of a Region?
----------------------------------------------

Outside of the blanket recommendation of "use libndctl", or simply
looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
"nstype" integer attribute, here are some other options.

1. module alias lookup:

    The whole point of region/namespace device type differentiation is to
    decide which block-device driver will attach to a given LIBNVDIMM namespace.
    One can simply use the modalias to look up the resulting module. It's
    important to note that this method is robust in the presence of a
    vendor-specific driver down the road. If a vendor-specific
    implementation wants to supplant the standard nd_blk driver it can with
    minimal impact to the rest of LIBNVDIMM.

    In fact, a vendor may also want to have a vendor-specific region-driver
    (outside of nd_region). For example, if a vendor defined its own LABEL
    format it would need its own region driver to parse that LABEL and emit
    the resulting namespaces. The output from module resolution is more
    accurate than a region-name or region-devtype.

2. udev:

    The kernel "devtype" is registered in the udev database

	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
	P: /devices/platform/nfit_test.0/ndbus0/region0
	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
	E: DEVTYPE=nd_pmem
	E: MODALIAS=nd:t2
	E: SUBSYSTEM=nd

	# udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
	P: /devices/platform/nfit_test.0/ndbus0/region4
	E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
	E: DEVTYPE=nd_blk
	E: MODALIAS=nd:t3
	E: SUBSYSTEM=nd

    ...and is available as a region attribute, but keep in mind that the
    "devtype" does not indicate sub-type variations, so scripts should
    consult the other attributes as well.

3. type specific attributes:

    As it currently stands a BLK-aperture region will never have a
    "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
    BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM
    that does not allow I/O. A PMEM region with a "mappings" value of zero
    is a simple system-physical-address range.
| 600 | |
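As a small companion to the options above, the "nd:t<N>" modalias
strings shown in the udev output can be parsed into the device-type
integer and mapped to a readable name. This is a sketch: the function
names are hypothetical, and the numeric values in the table are
assumed to match the ND_DEVICE_* definitions in linux/ndctl.h, so they
should be checked against the header installed on the target system.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed to mirror the ND_DEVICE_* values in linux/ndctl.h. */
static const char *nd_device_type_name(int type)
{
	switch (type) {
	case 1: return "DIMM";
	case 2: return "PMEM region";
	case 3: return "BLK region";
	case 4: return "IO namespace (PMEM without BLK alias)";
	case 5: return "PMEM namespace";
	case 6: return "BLK namespace";
	default: return "unknown";
	}
}

/*
 * Parse a modalias of the form "nd:t<N>", optionally followed by a
 * newline as read from a sysfs file.  Returns N, or -1 if the string
 * does not have that form.
 */
static int nd_modalias_to_type(const char *modalias)
{
	char trailing;
	int type, n;

	assert(modalias != NULL);

	n = sscanf(modalias, "nd:t%d%c", &type, &trailing);
	if (n == 1 || (n == 2 && trailing == '\n'))
		return type;
	return -1;
}
```

With the udev output above, "nd:t2" for region0 resolves to a PMEM
region and "nd:t3" for region4 resolves to a BLK region.
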
| 601 | |
LIBNVDIMM/LIBNDCTL: Namespace
-----------------------------

A REGION, after resolving DPA aliasing and LABEL specified boundaries,
surfaces one or more "namespace" devices. The arrival of a "namespace"
device currently triggers either the nd_blk or nd_pmem driver to load
and register a disk/block device.

LIBNVDIMM: namespace
Here is a sample layout from the three major types of NAMESPACE, where
namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
attribute), namespace2.0 represents a BLK namespace (note that it has a
'sector_size' attribute), and namespace6.0 represents an anonymous PMEM
namespace (note that it has no 'uuid' attribute because it does not
support a LABEL).

	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
	|-- alt_name
	|-- devtype
	|-- dpa_extents
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	|-- uevent
	`-- uuid
	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
	|-- alt_name
	|-- devtype
	|-- dpa_extents
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- sector_size
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	|-- uevent
	`-- uuid
	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
	|-- block
	|   `-- pmem0
	|-- devtype
	|-- driver -> ../../../../../../bus/nd/drivers/pmem
	|-- force_raw
	|-- modalias
	|-- numa_node
	|-- resource
	|-- size
	|-- subsystem -> ../../../../../../bus/nd
	|-- type
	`-- uevent
| 657 | |
| 658 | LIBNDCTL: namespace enumeration example |
| 659 | Namespaces are indexed relative to their parent region, as shown in the |
| 660 | example below. These indexes are mostly static from boot to boot, but the |
| 661 | subsystem makes no guarantees in this regard. For a static namespace |
| 662 | identifier use its 'uuid' attribute. |
| 663 | |
| 664 | static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region, |
| 665 | 		unsigned int id) |
| 666 | { |
| 667 | 	struct ndctl_namespace *ndns; |
| 668 | |
| 669 | 	ndctl_namespace_foreach(region, ndns) |
| 670 | 		if (ndctl_namespace_get_id(ndns) == id) |
| 671 | 			return ndns; |
| 672 | |
| 673 | 	return NULL; |
| 674 | } |
| 675 | |
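Because a namespace is indexed relative to its parent region, a device name such as namespace2.0 carries both the region id and the region-relative index. A minimal sketch of splitting such a name follows; parse_namespace_name() is a hypothetical helper written for this illustration and is not a libndctl function.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Split a device name such as "namespace2.0" into the parent region id
 * and the region-relative namespace index. Returns 0 on success, or -1
 * if the name does not match "namespace<region>.<index>" exactly.
 */
static int parse_namespace_name(const char *name, unsigned int *region_id,
		unsigned int *ns_id)
{
	int consumed = 0;

	assert(name && region_id && ns_id);
	if (sscanf(name, "namespace%u.%u%n", region_id, ns_id, &consumed) != 2)
		return -1;
	/* reject trailing characters after the index */
	if (name[consumed] != '\0')
		return -1;
	return 0;
}
```

Since the indexes are not guaranteed to be stable across boots, a parsed (region, index) pair is only meaningful for the current boot; the 'uuid' attribute remains the persistent identifier.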
| 676 | LIBNDCTL: namespace creation example |
| 677 | Idle namespaces are automatically created by the kernel if a given |
| 678 | region has enough available capacity to create a new namespace. |
| 679 | Namespace instantiation involves finding an idle namespace and |
| 680 | configuring it. For the most part, namespace attributes can be set in |
| 681 | any order; the only constraint is that 'uuid' must be set before |
| 682 | 'size'. This enables the kernel to track DPA allocations internally |
| 683 | with a static identifier. |
| 684 | |
| 685 | static int configure_namespace(struct ndctl_region *region, |
| 686 | 		struct ndctl_namespace *ndns, |
| 687 | 		struct namespace_parameters *parameters) |
| 688 | { |
| 689 | 	char devname[50]; |
| 690 | |
| 691 | 	snprintf(devname, sizeof(devname), "namespace%d.%d", |
| 692 | 			ndctl_region_get_id(region), parameters->id); |
| 693 | |
| 694 | 	ndctl_namespace_set_alt_name(ndns, devname); |
| 695 | 	/* 'uuid' must be set prior to setting size! */ |
| 696 | 	ndctl_namespace_set_uuid(ndns, parameters->uuid); |
| 697 | 	ndctl_namespace_set_size(ndns, parameters->size); |
| 698 | 	/* unlike pmem namespaces, blk namespaces have a sector size */ |
| 699 | 	if (parameters->lbasize) |
| 700 | 		ndctl_namespace_set_sector_size(ndns, parameters->lbasize); |
| 701 | 	return ndctl_namespace_enable(ndns); |
| 702 | } |
| 703 | |
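The creation example leaves 'struct namespace_parameters' undefined. One possible definition is sketched below, together with a pre-flight check that a 'uuid' is actually populated before any sizing is attempted. The field types and the check_namespace_parameters() helper are assumptions made for this illustration (the 16-byte array stands in for libuuid's uuid_t); they are not part of libndctl.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter block matching the fields used in the example. */
struct namespace_parameters {
	unsigned int id;		/* region-relative namespace index */
	unsigned char uuid[16];		/* same layout as libuuid's uuid_t */
	unsigned long long size;	/* requested capacity in bytes */
	unsigned long lbasize;		/* sector size, 0 for pmem namespaces */
};

/* Return 1 if every byte of the uuid is zero, i.e. it was never set. */
static int uuid_is_unset(const unsigned char uuid[16])
{
	size_t i;

	for (i = 0; i < 16; i++)
		if (uuid[i] != 0)
			return 0;
	return 1;
}

/*
 * Return 0 if the parameters are usable for the uuid-then-size sequence,
 * or -1 if the uuid is unset or the size is zero.
 */
static int check_namespace_parameters(const struct namespace_parameters *p)
{
	assert(p != NULL);
	if (uuid_is_unset(p->uuid))
		return -1;
	if (p->size == 0)
		return -1;
	return 0;
}
```

A caller could run this check before invoking configure_namespace() so that a missing uuid is caught up front rather than at the point where the size is written.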
| 704 | |
| 705 | Why the Term "namespace"? |
| 706 | |
Konrad Rzeszutek Wilk | 8de5dff | 2015-11-10 16:10:45 -0800 | [diff] [blame] | 707 | 1. Why not "volume" for instance? "volume" ran the risk of confusing |
| 708 | ND (libnvdimm subsystem) with a volume manager like device-mapper. |
| 709 | |
| 710 | 2. The term originated to describe the sub-devices that can be created |
| 711 | within an NVME controller (see the nvme specification: |
| 712 | http://www.nvmexpress.org/specifications/), and NFIT namespaces are |
| 713 | meant to parallel the capabilities and configurability of |
| 714 | NVME-namespaces. |
| 715 | |
| 716 | |
| 717 | LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" |
| 718 | --------------------------------------------- |
| 719 | |
| 720 | A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked |
| 721 | block device driver that fronts either the whole block device or a |
| 722 | partition of a block device emitted by either a PMEM or BLK NAMESPACE. |
| 723 | |
| 724 | LIBNVDIMM: btt layout |
| 725 | Every region will start out with at least one BTT device which is the |
| 726 | seed device. To activate it set the "namespace", "uuid", and |
| 727 | "sector_size" attributes and then bind the device to the nd_pmem or |
| 728 | nd_blk driver depending on the region type. |
| 729 | |
| 730 | /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ |
| 731 | |-- namespace |
| 732 | |-- delete |
| 733 | |-- devtype |
| 734 | |-- modalias |
| 735 | |-- numa_node |
| 736 | |-- sector_size |
| 737 | |-- subsystem -> ../../../../../bus/nd |
| 738 | |-- uevent |
| 739 | `-- uuid |
| 740 | |
| 741 | LIBNDCTL: btt creation example |
| 742 | Similar to namespaces, an idle BTT device is automatically created per |
| 743 | region. Each time this "seed" btt device is configured and enabled, a new |
| 744 | seed is created. Creating a BTT configuration involves two steps: |
| 745 | finding an idle BTT and assigning it to consume a PMEM or BLK namespace. |
| 746 | |
| 747 | static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) |
| 748 | { |
| 749 | 	struct ndctl_btt *btt; |
| 750 | |
| 751 | 	ndctl_btt_foreach(region, btt) |
| 752 | 		if (!ndctl_btt_is_enabled(btt) |
| 753 | 				&& !ndctl_btt_is_configured(btt)) |
| 754 | 			return btt; |
| 755 | |
| 756 | 	return NULL; |
| 757 | } |
| 758 | |
| 759 | static int configure_btt(struct ndctl_region *region, |
| 760 | 		struct btt_parameters *parameters) |
| 761 | { |
| 762 | 	struct ndctl_btt *btt = get_idle_btt(region); |
| 763 | |
| 764 | 	ndctl_btt_set_uuid(btt, parameters->uuid); |
| 765 | 	ndctl_btt_set_sector_size(btt, parameters->sector_size); |
| 766 | 	ndctl_btt_set_namespace(btt, parameters->ndns); |
| 767 | 	/* turn off raw mode device */ |
| 768 | 	ndctl_namespace_disable(parameters->ndns); |
| 769 | 	/* turn on btt access */ |
| 770 | 	return ndctl_btt_enable(btt); |
| 771 | } |
| 772 | |
| 773 | Once a BTT is instantiated, a new inactive btt seed device will appear |
| 774 | underneath the region. |
| 775 | |
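The seed behavior can be pictured with a small user-space model. This is purely an illustration of the lifecycle described above, not the kernel implementation: every type and function name below (model_btt, model_region, model_enable_seed, and so on) is hypothetical, and the fixed MAX_BTT bound exists only to keep the sketch simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BTT 8	/* arbitrary bound for this model */

struct model_btt {
	bool configured;
	bool enabled;
};

struct model_region {
	struct model_btt btts[MAX_BTT];
	size_t nr_btt;
};

/* A region starts out with exactly one idle seed device. */
static void model_region_init(struct model_region *r)
{
	assert(r != NULL);
	r->nr_btt = 1;
	r->btts[0].configured = false;
	r->btts[0].enabled = false;
}

/* Same test as get_idle_btt(): neither enabled nor configured. */
static struct model_btt *model_get_idle_btt(struct model_region *r)
{
	size_t i;

	for (i = 0; i < r->nr_btt; i++)
		if (!r->btts[i].enabled && !r->btts[i].configured)
			return &r->btts[i];
	return NULL;
}

/* Configure and enable the current seed; a fresh idle seed then appears. */
static int model_enable_seed(struct model_region *r)
{
	struct model_btt *btt = model_get_idle_btt(r);

	if (!btt)
		return -1;
	btt->configured = true;
	btt->enabled = true;
	if (r->nr_btt < MAX_BTT) {
		r->btts[r->nr_btt].configured = false;
		r->btts[r->nr_btt].enabled = false;
		r->nr_btt++;
	}
	return 0;
}
```

In this model, each successful model_enable_seed() call leaves one more device under the region, and the newest device is the idle seed that the next lookup returns.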
| 776 | Once a "namespace" is removed from a BTT that instance of the BTT device |
| 777 | will be deleted or otherwise reset to default values. This deletion is |
| 778 | only at the device model level. In order to destroy a BTT the "info |
| 779 | block" needs to be destroyed. Note that to destroy a BTT, the media |
| 780 | needs to be written in raw mode. By default, the kernel will autodetect |
| 781 | the presence of a BTT and disable raw mode. This autodetect behavior |
| 782 | can be suppressed by enabling raw mode for the namespace via the |
| 783 | ndctl_namespace_set_raw_mode() API. |
| 784 | |
| 785 | |
| 786 | Summary LIBNDCTL Diagram |
| 787 | ------------------------ |
| 788 | |
| 789 | For the given example above, here is the view of the objects as seen by the |
| 790 | LIBNDCTL API: |
| 791 |               +---+ |
| 792 |               |CTX|    +---------+   +--------------+  +---------------+ |
| 793 |               +-+-+  +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | |
| 794 |                 |    | +---------+   +--------------+  +---------------+ |
| 795 |   +-------+     |    | +---------+   +--------------+  +---------------+ |
| 796 |   | DIMM0 <-+   |    +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | |
| 797 |   +-------+ |   |    | +---------+   +--------------+  +---------------+ |
| 798 |   | DIMM1 <-+ +-v--+ | +---------+   +--------------+  +---------------+ |
| 799 |   +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6  "blk2.0" | |
| 800 |   | DIMM2 <-+ +----+ | +---------+ | +--------------+  +----------------------+ |
| 801 |   +-------+ |        |             +-> NAMESPACE2.1 +--> ND5  "blk2.1" | BTT2 | |
| 802 |   | DIMM3 <-+        |               +--------------+  +----------------------+ |
| 803 |   +-------+          | +---------+   +--------------+  +---------------+ |
| 804 |                      +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4  "blk3.0" | |
| 805 |                      | +---------+ | +--------------+  +----------------------+ |
| 806 |                      |             +-> NAMESPACE3.1 +--> ND3  "blk3.1" | BTT1 | |
| 807 |                      |               +--------------+  +----------------------+ |
| 808 |                      | +---------+   +--------------+  +---------------+ |
| 809 |                      +-> REGION4 +---> NAMESPACE4.0 +--> ND2  "blk4.0" | |
| 810 |                      | +---------+   +--------------+  +---------------+ |
| 811 |                      | +---------+   +--------------+  +----------------------+ |
| 812 |                      +-> REGION5 +---> NAMESPACE5.0 +--> ND1  "blk5.0" | BTT0 | |
| 813 |                        +---------+   +--------------+  +---------------+------+ |
| 814 | |
| 815 | |