An overview of memory management in QEMU:
(David 'Digit' Turner, commit 454cb83, 2014-04-12 01:02:51 +0200)

I. RAM Management:
==================

I.1. RAM Address space:
-----------------------

All pages of virtual RAM used by QEMU at runtime are allocated from
contiguous blocks in a specific abstract "RAM address space".
|ram_addr_t| is the type of block addresses in this space.

A single block of contiguous RAM is allocated with 'qemu_ram_alloc()', which
takes a size in bytes, and allocates the pages through mmap() in the QEMU
host process. It also sets up the corresponding KVM / Xen / HAX mappings,
depending on each accelerator's specific needs.

Each block has a name, which is used for snapshot support.

'qemu_ram_alloc_from_ptr()' can also be used to allocate a new RAM
block, by passing its content explicitly (this can be useful for pages of
ROM).

'qemu_get_ram_ptr()' translates a 'ram_addr_t' into the corresponding
address in the QEMU host process. 'qemu_ram_addr_from_host()' does the
opposite (i.e. it translates a host address into a ram_addr_t if possible,
or returns an error).

Note that ram_addr_t addresses are an internal implementation detail of
QEMU, i.e. the virtual CPU never sees their values directly; it relies
instead on addresses in its virtual physical address space, described
in section II. below.

As an example, when emulating an Android/x86 virtual device, the following
RAM space is being used:

  0x0000_0000 ... 0x1000_0000  "pc.ram"
  0x1000_0000 ... 0x1002_0000  "bios.bin"
  0x1002_0000 ... 0x1004_0000  "pc.rom"
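
The translation done by 'qemu_get_ram_ptr()' and 'qemu_ram_addr_from_host()'
can be pictured with the following sketch. It is a toy model, not QEMU's
implementation: |ToyRamBlock|, |toy_ram_addr_t| and the toy_* functions are
hypothetical names, and the blocks are backed by caller-provided buffers
instead of mmap().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: any unsigned integer type is good enough for this sketch. */
typedef uint64_t toy_ram_addr_t;

/* Hypothetical descriptor for one contiguous block of the RAM address space. */
typedef struct {
    const char     *name;    /* e.g. "pc.ram" */
    toy_ram_addr_t  offset;  /* start of the block in the RAM address space */
    toy_ram_addr_t  size;    /* size of the block in bytes */
    uint8_t        *host;    /* start of the block in host memory */
} ToyRamBlock;

/* RAM address -> host pointer, or NULL if no block contains |addr|. */
static uint8_t *toy_get_ram_ptr(const ToyRamBlock *blocks, size_t count,
                                toy_ram_addr_t addr)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= blocks[i].offset &&
            addr - blocks[i].offset < blocks[i].size) {
            return blocks[i].host + (addr - blocks[i].offset);
        }
    }
    return NULL;
}

/* Host pointer -> RAM address. Returns 0 on success, or -1 if |ptr| does
 * not point inside any block. */
static int toy_ram_addr_from_host(const ToyRamBlock *blocks, size_t count,
                                  const uint8_t *ptr, toy_ram_addr_t *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < count; i++) {
        uintptr_t start = (uintptr_t)blocks[i].host;
        uintptr_t p = (uintptr_t)ptr;
        if (p >= start && p - start < blocks[i].size) {
            *out = blocks[i].offset + (p - start);
            return 0;
        }
    }
    return -1;
}
```

With the layout above, a lookup of 0x1000_0010 would fall into the second
block ("bios.bin") at offset 0x10 from its host base address.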


I.2. RAM Dirty tracking:
------------------------

QEMU also associates with each RAM page an 8-bit 'dirty' bitmap. The
main idea is that whenever a page is written to, the value 0xff is
written to the page's 'dirty' bitmap. Various clients can later inspect
some of the flags and clear them. For example:

VGA_DIRTY_FLAG (0x1) is typically used by framebuffer drivers to detect
which pages of video RAM were touched since the latest VSYNC. The driver
typically copies the pixel values to the real QEMU output, then clears
the bits. This avoids needless copies when nothing changed in the
framebuffer.

MIGRATION_DIRTY_FLAG (0x8) is used to track modified RAM pages during
live migration (i.e. moving a QEMU virtual machine from one host to
another).

CODE_DIRTY_FLAG (0x2) is a bit more special, and is used to support
self-modifying code properly. More on this later.
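
The scheme above can be sketched as follows. This is only an illustration:
the flag values come from the text, but |TOY_PAGE_BITS|, |TOY_NUM_PAGES|
and the toy_* functions are hypothetical, and the sketch assumes that each
client clears only its own flag so that other clients still see the page
as dirty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as listed in the text. */
#define VGA_DIRTY_FLAG        0x01
#define CODE_DIRTY_FLAG       0x02
#define MIGRATION_DIRTY_FLAG  0x08

#define TOY_PAGE_BITS  12  /* assumption: 4 KiB pages */
#define TOY_NUM_PAGES  16  /* assumption: tiny RAM space, for illustration */

/* One byte of dirty flags per RAM page. */
static uint8_t toy_dirty[TOY_NUM_PAGES];

/* Called on every write: sets all the flags of the page at once. */
static void toy_mark_dirty(uint64_t ram_addr)
{
    size_t page = (size_t)(ram_addr >> TOY_PAGE_BITS);
    assert(page < TOY_NUM_PAGES);
    toy_dirty[page] = 0xff;
}

/* Called by a client: reports whether any of |flags| was set for the page,
 * then clears only those flags. */
static int toy_test_and_clear_dirty(uint64_t ram_addr, uint8_t flags)
{
    size_t page = (size_t)(ram_addr >> TOY_PAGE_BITS);
    assert(page < TOY_NUM_PAGES);
    int was_dirty = (toy_dirty[page] & flags) != 0;
    toy_dirty[page] &= (uint8_t)~flags;
    return was_dirty;
}
```

A framebuffer driver would call toy_test_and_clear_dirty() with
VGA_DIRTY_FLAG for each page of video RAM and copy only the pages for
which it returns non-zero.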


II. The physical address space:
===============================

This is the address space that the virtual CPU can read from / write to.
|hwaddr| is the type of addresses in this space, which is decomposed
into 'pages'. Each page in the address space is either unassigned, or
mapped to a specific kind of memory region.

See |phys_page_find()| and |phys_page_find_alloc()| in translate-all.c for
the implementation details.


II.1. Memory region types:
--------------------------

There are several memory region types:

  - Regions of RAM pages.
  - Regions of ROM pages (similar to RAM, but cannot be written to).
  - Regions of I/O pages, used to communicate with virtual hardware.

Virtual devices can register a new I/O region type by calling
|cpu_register_io_memory()|. This function allows them to provide
callbacks that will be invoked every time the virtual CPU reads from
or writes to any page of the corresponding type.

The memory region type of a given page is encoded using PAGE_BITS bits
in the following format:

    +-------------------------------+
    |  mem_type_index     |  flags  |
    +-------------------------------+

Where |mem_type_index| is a unique value identifying a given memory
region type, and |flags| is a 3-bit bitmap used to store flags that are
only relevant for I/O pages.

The following memory region type values are important:

  IO_MEM_RAM (mem_type_index=0, flags=0):
    Used for regular RAM pages, always all zero on purpose.

  IO_MEM_ROM (mem_type_index=1, flags=0):
    Used for ROM pages.

  IO_MEM_UNASSIGNED (mem_type_index=2, flags=0):
    Used to identify unassigned pages of the physical address space.

  IO_MEM_NOTDIRTY (mem_type_index=3, flags=0):
    Used to implement tracking of dirty RAM pages. This is essentially
    used for RAM pages that have not been written to yet.

Any mem_type_index value of 4 or higher corresponds to a device-specific
I/O memory region type (i.e. with custom read/write callbacks and a
corresponding 'opaque' value), and can also use the following bits
in |flags|:

  IO_MEM_ROMD (0x1):
    Used for ROM-like I/O pages, i.e. they are backed by a page from
    the RAM address space, but writing to them triggers a device-specific
    write callback (instead of being ignored or faulting the CPU).

  IO_MEM_SUBPAGE (0x2):
    Used to indicate that not all addresses in this page map to the same
    I/O region type / callbacks.

  IO_MEM_SUBWIDTH (0x4):
    Probably obsolete. Set to indicate that the corresponding I/O region
    type doesn't support reading/writing values of all possible sizes
    (1, 2 and 4 bytes). This appears to be unused by the current code.

Note that cpu_register_io_memory() returns a new memory region type value.
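
The encoding described in this section can be sketched as follows. The
sketch assumes that the diagram reads with the most significant bits on the
left, so that |flags| occupies the 3 lowest bits and |mem_type_index| sits
just above them. The toy_* names are hypothetical, and
toy_register_io_type() only mimics the "returns a new memory region type
value" behaviour; it is not the signature of cpu_register_io_memory().

```c
#include <assert.h>
#include <stdint.h>

#define TOY_FLAGS_BITS  3                            /* |flags| is 3 bits wide */
#define TOY_FLAGS_MASK  ((1u << TOY_FLAGS_BITS) - 1)

/* Flag bits, with the values listed in the text. */
#define IO_MEM_ROMD      0x1
#define IO_MEM_SUBPAGE   0x2
#define IO_MEM_SUBWIDTH  0x4

/* mem_type_index values listed in the text. */
enum {
    TOY_INDEX_RAM          = 0,
    TOY_INDEX_ROM          = 1,
    TOY_INDEX_UNASSIGNED   = 2,
    TOY_INDEX_NOTDIRTY     = 3,
    TOY_INDEX_FIRST_DEVICE = 4   /* 4 or higher: device-specific I/O types */
};

/* Pack an index and a set of flags into a memory region type value. */
static uint32_t toy_mem_type(uint32_t index, uint32_t flags)
{
    assert((flags & ~TOY_FLAGS_MASK) == 0);
    return (index << TOY_FLAGS_BITS) | flags;
}

/* Extract each field from a memory region type value. */
static uint32_t toy_mem_type_index(uint32_t mem_type)
{
    return mem_type >> TOY_FLAGS_BITS;
}

static uint32_t toy_mem_type_flags(uint32_t mem_type)
{
    return mem_type & TOY_FLAGS_MASK;
}

/* Hypothetical allocator: hands out the next free device-specific type. */
static uint32_t toy_next_index = TOY_INDEX_FIRST_DEVICE;

static uint32_t toy_register_io_type(void)
{
    return toy_mem_type(toy_next_index++, 0);
}
```

Under these assumptions the RAM type is the all-zero value, which is the
property that the physical address map relies on.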

II.2. Physical address map:
---------------------------

QEMU maintains for each assigned page in the physical address space
two values:

  |phys_offset|, a combination of RAM address and memory region type.

  |region_offset|, an optional offset into the region backing the
  page. This is only useful for I/O pages.

The |phys_offset| value has several encodings that require
further clarification:

  - Generally speaking, a phys_offset value is decomposed into
    the following bit fields:

      +-----------------------------------------------------+
      |            high_addr              |    mem_type     |
      +-----------------------------------------------------+

    where |mem_type| is a PAGE_BITS memory region type as described
    previously, and |high_addr| may contain the high bits of a
    ram_addr_t address for RAM-backed pages.

More specifically:

  - Unassigned pages always have the special value IO_MEM_UNASSIGNED
    (high_addr=0, mem_type=IO_MEM_UNASSIGNED).

  - RAM pages have mem_type=0 (i.e. IO_MEM_RAM) while high_addr holds
    the high bits of the corresponding ram_addr_t. Hence, a simple call to
    qemu_get_ram_ptr(phys_offset) will return the corresponding
    address in host QEMU memory.

    This is the reason why IO_MEM_RAM is always 0:

    RAM page phys_offset value:
      +-----------------------------------------------------+
      |            high_addr              |        0        |
      +-----------------------------------------------------+


  - ROM pages are like RAM pages, but have mem_type=IO_MEM_ROM.
    QEMU ensures that writing to such a page is a no-op, except that on
    some target architectures, like Sparc, it may cause a CPU fault.

    ROM page phys_offset value:
      +-----------------------------------------------------+
      |            high_addr              |   IO_MEM_ROM    |
      +-----------------------------------------------------+

  - Dirty RAM page tracking is implemented by using special
    phys_offset values with mem_type=IO_MEM_NOTDIRTY. Note that these
    values do not appear directly in the physical page map, but in
    the CPU TLB cache (explained later).

    non-dirty RAM page phys_offset value (CPU TLB cache only):
      +-----------------------------------------------------+
      |            high_addr              | IO_MEM_NOTDIRTY |
      +-----------------------------------------------------+

  - Other pages are I/O pages, and their high_addr value will
    be 0 / ignored:

    I/O page phys_offset value:
      +------------------------------------------------------------+
      |                 0                 | mem_type_index | flags |
      +------------------------------------------------------------+

    Note that when reading from or writing to I/O pages, the lowest
    PAGE_BITS bits of the corresponding hwaddr value will be added
    to the page's |region_offset| value. This new address is passed
    to the read/write callback as the 'i/o address' for the operation.

  - As a special exception, if the I/O page's IO_MEM_ROMD flag is
    set, then high_addr is not 0, but the high bits of the corresponding
    ram_addr_t backing the page's contents on reads. On write operations
    though, the I/O region type's write callback will be called instead.

    ROMD I/O page phys_offset value:
      +------------------------------------------------------------+
      |            high_addr              | mem_type_index | flags |
      +------------------------------------------------------------+

    Note that |region_offset| is ignored when reading from such pages;
    it is only used when writing to the I/O page.
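
The decomposition of a |phys_offset| value, and the computation of the
'i/o address' passed to callbacks, can be sketched as follows. This is an
illustration only: the toy_* names are hypothetical, |TOY_PAGE_BITS| is
assumed to be 12 (the real PAGE_BITS value depends on the target), and
64-bit integers stand in for the real address types.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_BITS  12  /* assumption: 4 KiB pages */
#define TOY_PAGE_MASK  (((uint64_t)1 << TOY_PAGE_BITS) - 1)

/* Low PAGE_BITS bits: the memory region type (mem_type). */
static uint64_t toy_phys_offset_mem_type(uint64_t phys_offset)
{
    return phys_offset & TOY_PAGE_MASK;
}

/* Remaining high bits, kept in place: the page-aligned RAM address
 * (high_addr). Only meaningful for RAM, ROM, NOTDIRTY and ROMD pages. */
static uint64_t toy_phys_offset_ram_addr(uint64_t phys_offset)
{
    return phys_offset & ~TOY_PAGE_MASK;
}

/* RAM pages have mem_type == 0, so their phys_offset is directly the
 * RAM address of the page. */
static int toy_is_ram(uint64_t phys_offset)
{
    return toy_phys_offset_mem_type(phys_offset) == 0;
}

/* 'i/o address' passed to a read/write callback: the offset of the access
 * within its page, added to the page's region_offset. */
static uint64_t toy_io_address(uint64_t region_offset, uint64_t hwaddr)
{
    return region_offset + (hwaddr & TOY_PAGE_MASK);
}
```

For instance, with these assumptions, an access at hwaddr 0xfee00abc on an
I/O page whose |region_offset| is 0x100 would reach the callback with an
i/o address of 0xbbc.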