Linux and the Device Tree
-------------------------
The Linux usage model for device tree data

Author: Grant Likely <grant.likely@secretlab.ca>

This article describes how Linux uses the device tree. An overview of
the device tree data format can be found on the device tree usage page
at devicetree.org[1].

[1] http://devicetree.org/Device_Tree_Usage

The "Open Firmware Device Tree", or simply Device Tree (DT), is a data
structure and language for describing hardware. More specifically, it
is a description of hardware that is readable by an operating system
so that the operating system doesn't need to hard code details of the
machine.

Structurally, the DT is a tree, or acyclic graph with named nodes, and
nodes may have an arbitrary number of named properties encapsulating
arbitrary data. A mechanism also exists to create arbitrary
links from one node to another outside of the natural tree structure.

Conceptually, a common set of usage conventions, called 'bindings',
is defined for how data should appear in the tree to describe typical
hardware characteristics including data busses, interrupt lines, GPIO
connections, and peripheral devices.

As much as possible, hardware is described using existing bindings to
maximize use of existing support code, but since property and node
names are simply text strings, it is easy to extend existing bindings
or create new ones by defining new nodes and properties. Be wary,
however, of creating a new binding without first doing some homework
about what already exists. There are currently two different,
incompatible, bindings for i2c busses that came about because the new
binding was created without first investigating how i2c devices were
already being enumerated in existing systems.

1. History
----------
The DT was originally created by Open Firmware as part of the
communication method for passing data from Open Firmware to a client
program (such as an operating system). An operating system used the
Device Tree to discover the topology of the hardware at runtime, and
thereby support a majority of available hardware without hard coded
information (assuming drivers were available for all devices).

Since Open Firmware is commonly used on PowerPC and SPARC platforms,
the Linux support for those architectures has for a long time used the
Device Tree.

In 2005, when PowerPC Linux began a major cleanup and merged 32-bit
and 64-bit support, the decision was made to require DT support on all
powerpc platforms, regardless of whether or not they used Open
Firmware. To do this, a DT representation called the Flattened Device
Tree (FDT) was created which could be passed to the kernel as a binary
blob without requiring a real Open Firmware implementation. U-Boot,
kexec, and other bootloaders were modified to support both passing a
Device Tree Binary (dtb) and modifying a dtb at boot time. DT was
also added to the PowerPC boot wrapper (arch/powerpc/boot/*) so that
a dtb could be wrapped up with the kernel image to support booting
existing non-DT aware firmware.

Some time later, FDT infrastructure was generalized to be usable by
all architectures. At the time of this writing, 6 mainlined
architectures (arm, microblaze, mips, powerpc, sparc, and x86) and 1
out of mainline (nios) have some level of DT support.

2. Data Model
-------------
If you haven't already read the Device Tree Usage[1] page,
then go read it now. It's okay, I'll wait....

2.1 High Level View
-------------------
The most important thing to understand is that the DT is simply a data
structure that describes the hardware. There is nothing magical about
it, and it doesn't magically make all hardware configuration problems
go away. What it does do is provide a language for decoupling the
hardware configuration from the board and device driver support in the
Linux kernel (or any other operating system for that matter). Using
it allows board and device support to become data driven; to make
setup decisions based on data passed into the kernel instead of on
per-machine hard coded selections.

Ideally, data driven platform setup should result in less code
duplication and make it easier to support a wide range of hardware
with a single kernel image.

Linux uses DT data for three major purposes:
1) platform identification,
2) runtime configuration, and
3) device population.

2.2 Platform Identification
---------------------------
First and foremost, the kernel will use data in the DT to identify the
specific machine. In a perfect world, the specific platform shouldn't
matter to the kernel because all platform details would be described
perfectly by the device tree in a consistent and reliable manner.
Hardware is not perfect though, and so the kernel must identify the
machine during early boot so that it has the opportunity to run
machine-specific fixups.

In the majority of cases, the machine identity is irrelevant, and the
kernel will instead select setup code based on the machine's core
CPU or SoC. On ARM for example, setup_arch() in
arch/arm/kernel/setup.c will call setup_machine_fdt() in
arch/arm/kernel/devicetree.c which searches through the machine_desc
table and selects the machine_desc which best matches the device tree
data. It determines the best match by looking at the 'compatible'
property in the root device tree node, and comparing it with the
dt_compat list in struct machine_desc.

The 'compatible' property contains a sorted list of strings starting
with the exact name of the machine, followed by an optional list of
boards it is compatible with sorted from most compatible to least. For
example, the root compatible properties for the TI BeagleBoard and its
successor, the BeagleBoard xM board might look like:

    compatible = "ti,omap3-beagleboard", "ti,omap3450", "ti,omap3";
    compatible = "ti,omap3-beagleboard-xm", "ti,omap3450", "ti,omap3";

Where "ti,omap3-beagleboard-xm" specifies the exact model, it also
claims that it is compatible with the OMAP 3450 SoC, and the omap3
family of SoCs in general. You'll notice that the list is sorted from
most specific (exact board) to least specific (SoC family).

Astute readers might point out that the Beagle xM could also claim
compatibility with the original Beagle board. However, one should be
cautioned about doing so at the board level since there is typically a
high level of change from one board to another, even within the same
product line, and it is hard to nail down exactly what is meant when one
board claims to be compatible with another. For the top level, it is
better to err on the side of caution and not claim one board is
compatible with another. The notable exception would be when one
board is a carrier for another, such as a CPU module attached to a
carrier board.

One more note on compatible values. Any string used in a compatible
property must be documented as to what it indicates. Add
documentation for compatible strings in Documentation/devicetree/bindings.

Again on ARM, for each machine_desc, the kernel looks to see if
any of the dt_compat list entries appear in the compatible property.
If one does, then that machine_desc is a candidate for driving the
machine. After searching the entire table of machine_descs,
setup_machine_fdt() returns the 'most compatible' machine_desc based
on which entry in the compatible property each machine_desc matches
against. If no matching machine_desc is found, then it returns NULL.

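The selection rule can be sketched in ordinary user-space C. This is
only an illustration of the "earliest entry in the compatible list
wins" idea, not the kernel's implementation: the struct below is a
cut-down stand-in for struct machine_desc, and match_score() and
select_machine() are hypothetical names invented for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's struct machine_desc. */
struct machine_desc {
    const char *name;
    const char *const *dt_compat;   /* NULL-terminated list of strings */
};

/*
 * Return the index of the first entry in the root 'compatible' list that
 * also appears in dt_compat, or -1 if nothing matches. Lower is better,
 * because the compatible list runs from most to least specific.
 */
static int match_score(const char *const *root_compat,
                       const char *const *dt_compat)
{
    for (int i = 0; root_compat[i]; i++)
        for (int j = 0; dt_compat[j]; j++)
            if (strcmp(root_compat[i], dt_compat[j]) == 0)
                return i;
    return -1;
}

/* Pick the machine_desc with the lowest score; NULL if none matches. */
static const struct machine_desc *
select_machine(const struct machine_desc *table, size_t n,
               const char *const *root_compat)
{
    const struct machine_desc *best = NULL;
    int best_score = -1;

    for (size_t i = 0; i < n; i++) {
        int score = match_score(root_compat, table[i].dt_compat);

        if (score >= 0 && (best == NULL || score < best_score)) {
            best = &table[i];
            best_score = score;
        }
    }
    return best;
}
```

With a table holding one entry that lists "ti,omap3" and another that
lists only "ti,omap3-beagleboard", the original Beagle's compatible
list selects the board-specific entry, while the xM's list falls
through to the generic one.
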
The reasoning behind this scheme is the observation that in the majority
of cases, a single machine_desc can support a large number of boards
if they all use the same SoC, or same family of SoCs. However,
invariably there will be some exceptions where a specific board will
require special setup code that is not useful in the generic case.
Special cases could be handled by explicitly checking for the
troublesome board(s) in generic setup code, but doing so very quickly
becomes ugly and/or unmaintainable if it is more than just a couple of
cases.

Instead, the compatible list allows a generic machine_desc to provide
support for a wide common set of boards by specifying "less
compatible" values in the dt_compat list. In the example above,
generic board support can claim compatibility with "ti,omap3" or
"ti,omap3450". If a bug was discovered on the original beagleboard
that required special workaround code during early boot, then a new
machine_desc could be added which implements the workarounds and only
matches on "ti,omap3-beagleboard".

PowerPC uses a slightly different scheme where it calls the .probe()
hook from each machine_desc, and the first one returning TRUE is used.
However, this approach does not take into account the priority of the
compatible list, and probably should be avoided for new architecture
support.

2.3 Runtime configuration
-------------------------
In most cases, a DT will be the sole method of communicating data from
firmware to the kernel, so it also gets used to pass in runtime and
configuration data like the kernel parameters string and the location
of an initrd image.

Most of this data is contained in the /chosen node, and when booting
Linux it will look something like this:

    chosen {
        bootargs = "console=ttyS0,115200 loglevel=8";
        initrd-start = <0xc8000000>;
        initrd-end = <0xc8200000>;
    };

The bootargs property contains the kernel arguments, and the initrd-*
properties define the start and end addresses of an initrd blob. The
chosen node may also optionally contain an arbitrary number of
additional properties for platform-specific configuration data.

During early boot, the architecture setup code calls of_scan_flat_dt()
several times with different helper callbacks to parse device tree
data before paging is set up. The of_scan_flat_dt() code scans through
the device tree and uses the helpers to extract information required
during early boot. Typically the early_init_dt_scan_chosen() helper
is used to parse the chosen node including kernel parameters,
early_init_dt_scan_root() to initialize the DT address space model,
and early_init_dt_scan_memory() to determine the size and
location of usable RAM.

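The scan-with-callback pattern can be sketched in user-space C as
follows. Everything here is a simplified assumption made for
illustration: struct flat_node, scan_flat() and scan_chosen() are
hypothetical names, the real flattened tree is a binary blob rather
than an array of records, and the rule that a nonzero return stops the
scan is a choice made for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view: one record per node, in tree order. */
struct flat_node {
    const char *name;
    int depth;              /* 0 for the root node */
    const char *bootargs;   /* NULL unless the node has the property */
};

typedef int (*scan_cb)(const struct flat_node *node, void *data);

/* Visit every node in order; stop at the first nonzero callback result. */
static int scan_flat(const struct flat_node *nodes, size_t n,
                     scan_cb cb, void *data)
{
    int rc = 0;

    for (size_t i = 0; i < n && !rc; i++)
        rc = cb(&nodes[i], data);
    return rc;
}

/* Helper in the spirit of early_init_dt_scan_chosen(): grab bootargs. */
static int scan_chosen(const struct flat_node *node, void *data)
{
    const char **cmdline = data;

    if (node->depth != 1 || strcmp(node->name, "chosen") != 0)
        return 0;           /* not /chosen, keep scanning */
    if (node->bootargs)
        *cmdline = node->bootargs;
    return 1;               /* found it, stop the scan */
}
```

Each early helper only looks for the one node it cares about, which is
why the same scan can be run several times with different callbacks
before any dynamic data structures exist.
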
On ARM, the function setup_machine_fdt() is responsible for early
scanning of the device tree after selecting the correct machine_desc
that supports the board.

2.4 Device population
---------------------
After the board has been identified, and after the early configuration data
has been parsed, then kernel initialization can proceed in the normal
way. At some point in this process, unflatten_device_tree() is called
to convert the data into a more efficient runtime representation.
This is also when machine-specific setup hooks will get called, like
the machine_desc .init_early(), .init_irq() and .init_machine() hooks
on ARM. The remainder of this section uses examples from the ARM
implementation, but all architectures will do pretty much the same
thing when using a DT.

As can be guessed by the names, .init_early() is used for any
machine-specific setup that needs to be executed early in the boot
process, and .init_irq() is used to set up interrupt handling. Using a
DT doesn't materially change the behaviour of either of these functions.
If a DT is provided, then both .init_early() and .init_irq() are able
to call any of the DT query functions (of_* in include/linux/of*.h) to
get additional data about the platform.

The most interesting hook in the DT context is .init_machine() which
is primarily responsible for populating the Linux device model with
data about the platform. Historically this has been implemented on
embedded platforms by defining a set of static clock structures,
platform_devices, and other data in the board support .c file, and
registering it en-masse in .init_machine(). When DT is used, then
instead of hard coding static devices for each platform, the list of
devices can be obtained by parsing the DT, and allocating device
structures dynamically.

The simplest case is when .init_machine() is only responsible for
registering a block of platform_devices. A platform_device is a concept
used by Linux for memory or I/O mapped devices which cannot be detected
by hardware, and for 'composite' or 'virtual' devices (more on those
later). While there is no 'platform device' terminology for the DT,
platform devices roughly correspond to device nodes at the root of the
tree and children of simple memory mapped bus nodes.

About now is a good time to lay out an example. Here is part of the
device tree for the NVIDIA Tegra board.

/{
    compatible = "nvidia,harmony", "nvidia,tegra20";
    #address-cells = <1>;
    #size-cells = <1>;
    interrupt-parent = <&intc>;

    chosen { };
    aliases { };

    memory {
        device_type = "memory";
        reg = <0x00000000 0x40000000>;
    };

    soc {
        compatible = "nvidia,tegra20-soc", "simple-bus";
        #address-cells = <1>;
        #size-cells = <1>;
        ranges;

        intc: interrupt-controller@50041000 {
            compatible = "nvidia,tegra20-gic";
            interrupt-controller;
            #interrupt-cells = <1>;
            reg = <0x50041000 0x1000>, <0x50040100 0x0100>;
        };

        serial@70006300 {
            compatible = "nvidia,tegra20-uart";
            reg = <0x70006300 0x100>;
            interrupts = <122>;
        };

        i2s1: i2s@70002800 {
            compatible = "nvidia,tegra20-i2s";
            reg = <0x70002800 0x100>;
            interrupts = <77>;
            codec = <&wm8903>;
        };

        i2c@7000c000 {
            compatible = "nvidia,tegra20-i2c";
            #address-cells = <1>;
            #size-cells = <0>;
            reg = <0x7000c000 0x100>;
            interrupts = <70>;

            wm8903: codec@1a {
                compatible = "wlf,wm8903";
                reg = <0x1a>;
                interrupts = <347>;
            };
        };
    };

    sound {
        compatible = "nvidia,harmony-sound";
        i2s-controller = <&i2s1>;
        i2s-codec = <&wm8903>;
    };
};

At .init_machine() time, Tegra board support code will need to look at
this DT and decide which nodes to create platform_devices for.
However, looking at the tree, it is not immediately obvious what kind
of device each node represents, or even if a node represents a device
at all. The /chosen, /aliases, and /memory nodes are informational
nodes that don't describe devices (although arguably memory could be
considered a device). The children of the /soc node are memory mapped
devices, but the codec@1a is an i2c device, and the sound node
represents not a device, but rather how other devices are connected
together to create the audio subsystem. I know what each device is
because I'm familiar with the board design, but how does the kernel
know what to do with each node?

The trick is that the kernel starts at the root of the tree and looks
for nodes that have a 'compatible' property. First, it is generally
assumed that any node with a 'compatible' property represents a device
of some kind, and second, it can be assumed that any node at the root
of the tree is either directly attached to the processor bus, or is a
miscellaneous system device that cannot be described any other way.
For each of these nodes, Linux allocates and registers a
platform_device, which in turn may get bound to a platform_driver.

Why is using a platform_device for these nodes a safe assumption?
Well, for the way that Linux models devices, just about all bus_types
assume that their devices are children of a bus controller. For
example, each i2c_client is a child of an i2c_master. Each spi_device
is a child of an SPI bus. Similarly for USB, PCI, MDIO, etc. The
same hierarchy is also found in the DT, where I2C device nodes only
ever appear as children of an I2C bus node. Ditto for SPI, MDIO, USB,
etc. The only devices which do not require a specific type of parent
device are platform_devices (and amba_devices, but more on that
later), which will happily live at the base of the Linux /sys/devices
tree. Therefore, if a DT node is at the root of the tree, then it
is probably best registered as a platform_device.

Linux board support code calls of_platform_populate(NULL, NULL, NULL,
NULL) to kick off discovery of devices at the root of the tree. The
parameters are all NULL because when starting from the root of the
tree, there is no need to provide a starting node (the first NULL), a
parent struct device (the last NULL), and we're not using a match
table (yet). For a board that only needs to register devices,
.init_machine() can be completely empty except for the
of_platform_populate() call.

In the Tegra example, this accounts for the /soc and /sound nodes, but
what about the children of the SoC node? Shouldn't they be registered
as platform devices too? For Linux DT support, the generic behaviour
is for child devices to be registered by the parent's device driver at
driver .probe() time. So, an i2c bus device driver will register an
i2c_client for each child node, an SPI bus driver will register
its spi_device children, and similarly for other bus_types.
According to that model, a driver could be written that binds to the
SoC node and simply registers platform_devices for each of its
children. The board support code would allocate and register an SoC
device, a (theoretical) SoC device driver could bind to the SoC device,
and register platform_devices for /soc/interrupt-controller, /soc/serial,
/soc/i2s, and /soc/i2c in its .probe() hook. Easy, right?

Actually, it turns out that registering children of some
platform_devices as more platform_devices is a common pattern, and the
device tree support code reflects that and makes the above example
simpler. The second argument to of_platform_populate() is an
of_device_id table, and any node that matches an entry in that table
will also get its child nodes registered. In the Tegra case, the code
can look something like this:

static void __init harmony_init_machine(void)
{
    /* ... */
    of_platform_populate(NULL, of_default_bus_match_table, NULL, NULL);
}

"simple-bus" is defined in the ePAPR 1.0 specification as a compatible
value meaning a simple memory mapped bus, so the of_platform_populate()
code could be written to just assume simple-bus compatible nodes will
always be traversed. However, we pass it in as an argument so that
board support code can always override the default behaviour.

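The effect of the match table can be sketched in user-space C. This
models only the traversal rule described above and is not the kernel
code: struct node, struct dev_list, is_compatible() and populate() are
hypothetical names for this sketch, and "registering" a device is
reduced to recording the node name.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEVS 16

/* Hypothetical in-memory node; the kernel's struct device_node differs. */
struct node {
    const char *name;
    const char *const *compatible;      /* NULL-terminated, or NULL */
    const struct node *const *children; /* NULL-terminated, or NULL */
};

/* Names of the nodes that would have become platform_devices. */
struct dev_list {
    const char *names[MAX_DEVS];
    size_t count;
};

/* Nonzero if any of the node's compatible strings is in the table. */
static int is_compatible(const struct node *np, const char *const *matches)
{
    if (!np->compatible || !matches)
        return 0;
    for (size_t i = 0; np->compatible[i]; i++)
        for (size_t j = 0; matches[j]; j++)
            if (strcmp(np->compatible[i], matches[j]) == 0)
                return 1;
    return 0;
}

/*
 * Record each child of 'parent' that has a 'compatible' property, and
 * descend into a child only when it matches the bus table.
 */
static void populate(const struct node *parent,
                     const char *const *bus_matches, struct dev_list *out)
{
    if (!parent->children)
        return;
    for (size_t i = 0; parent->children[i]; i++) {
        const struct node *np = parent->children[i];

        if (!np->compatible)
            continue;   /* e.g. /chosen, /aliases, /memory */
        if (out->count < MAX_DEVS)
            out->names[out->count++] = np->name;
        if (is_compatible(np, bus_matches))
            populate(np, bus_matches, out);
    }
}
```

Run against a model of the Tegra tree with a NULL table, only soc and
sound are recorded. With a table containing "simple-bus", the children
of soc are recorded as well, but codec@1a is not, because the i2c node
is not a simple bus and its children are left to the i2c bus driver.
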
[Need to add discussion of adding i2c/spi/etc child devices]

Appendix A: AMBA devices
------------------------

ARM Primecells are a certain kind of device attached to the ARM AMBA
bus which include some support for hardware detection and power
management. In Linux, struct amba_device and the amba_bus_type are
used to represent Primecell devices. However, the fiddly bit is that
not all devices on an AMBA bus are Primecells, and for Linux it is
typical for both amba_device and platform_device instances to be
siblings of the same bus segment.

When using the DT, this creates problems for of_platform_populate()
because it must decide whether to register each node as either a
platform_device or an amba_device. This unfortunately complicates the
device creation model a little bit, but the solution turns out not to
be too invasive. If a node is compatible with "arm,primecell", then
of_platform_populate() will register it as an amba_device instead of a
platform_device.
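
As a rough sketch, the decision amounts to one extra compatible check
per node. The enum and the classify() function below are hypothetical
names for this illustration, and the Primecell marker string is passed
in as a parameter rather than hard-coded.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tag for the kind of Linux device a node would become. */
enum dev_kind { DEV_PLATFORM, DEV_AMBA };

/*
 * Return DEV_AMBA if the node's NULL-terminated compatible list contains
 * the Primecell marker string, DEV_PLATFORM otherwise (including when
 * the list is absent).
 */
static enum dev_kind classify(const char *const *compatible,
                              const char *amba_compat)
{
    for (size_t i = 0; compatible && compatible[i]; i++)
        if (strcmp(compatible[i], amba_compat) == 0)
            return DEV_AMBA;
    return DEV_PLATFORM;
}
```

Because the check is made per node, Primecell and non-Primecell
siblings under the same bus node can each end up on the appropriate
bus_type.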