# Android ELF TLS (Draft)

Ryan Prichard, commit 9491c54, 2018-11-09 15:18:05 -0800

Internal links:
* [go/android-elf-tls](http://go/android-elf-tls)
* [One-pager](https://docs.google.com/document/d/1leyPTnwSs24P2LGiqnU6HetnN5YnDlZkihigi6qdf_M)
* Tracking bugs: http://b/110100012, http://b/78026329

[TOC]

# Overview

ELF TLS is a system for automatically allocating thread-local variables with cooperation among the
compiler, linker, dynamic loader, and libc.

Thread-local variables are declared in C and C++ with a specifier, e.g.:

```cpp
thread_local int tls_var;
```

At run-time, TLS variables are allocated on a module-by-module basis, where a module is a shared
object or executable. At program startup, TLS for all initially-loaded modules comprises the "Static
TLS Block". TLS variables within the Static TLS Block exist at fixed offsets from an
architecture-specific thread pointer (TP) and can be accessed very efficiently -- typically just a
few instructions. TLS variables belonging to dlopen'ed shared objects, on the other hand, may be
allocated lazily, and accessing them typically requires a function call.
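
As a minimal illustration of the per-thread semantics (not of how Bionic implements them), the
sketch below gives each thread its own copy of a `thread_local` counter. The names `tls_counter`,
`bump_local_counter`, and `bump_on_new_thread` are hypothetical and exist only for this example.

```cpp
#include <thread>

// One independent copy of this variable exists per thread.
thread_local int tls_counter = 0;

// Increments the calling thread's copy `n` times and returns its value.
int bump_local_counter(int n) {
  for (int i = 0; i < n; ++i) ++tls_counter;
  return tls_counter;
}

// Runs bump_local_counter(n) on a fresh thread and returns that thread's
// result. The new thread starts from its own zero-initialized copy.
int bump_on_new_thread(int n) {
  int result = -1;
  std::thread worker([&] { result = bump_local_counter(n); });
  worker.join();
  return result;
}
```

Calling `bump_on_new_thread` does not disturb the caller's copy of `tls_counter`, because the
worker thread only ever touches its own.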

# Thread-Specific Memory Layout

Ulrich Drepper's ELF TLS document specifies two ways of organizing memory pointed at by the
architecture-specific thread-pointer ([`__get_tls()`] in Bionic):

[Figures: memory layout diagrams for TLS variant 1 and TLS variant 2]

Variant 1 places the static TLS block after the TP, whereas variant 2 places it before the TP.
According to Drepper, variant 2 was motivated by backwards compatibility, and variant 1 was designed
for Itanium. The choice has effects on the toolchain, loader, and libc. In particular, when linking
an executable, the linker needs to know where an executable's TLS segment is relative to the TP so
it can correctly relocate TLS accesses. Both variants are incompatible with Bionic's current
thread-specific data layout, but variant 1 is more problematic than variant 2.

Each thread has a "Dynamic Thread Vector" (DTV) with a pointer to each module's TLS block (or NULL
if it hasn't been allocated yet). If the executable has a TLS segment, then it will always be module
1, and its storage will always be immediately after (or before) the TP. In variant 1, the TP is
expected to point immediately at the DTV pointer, whereas in variant 2, the DTV pointer's offset
from TP is implementation-defined.

The DTV's "generation" field is used to lazily update/reallocate the DTV when new modules are loaded
or unloaded.

[`__get_tls()`]: https://android.googlesource.com/platform/bionic/+/7245c082658182c15d2a423fe770388fec707cbc/libc/private/__get_tls.h

# Access Models

When a C/C++ file references a TLS variable, the toolchain generates instructions to find its
address using a TLS "access model". The access models trade generality against efficiency. The four
models are:

* GD: General Dynamic (aka Global Dynamic)
* LD: Local Dynamic
* IE: Initial Exec
* LE: Local Exec

A TLS variable may be in a different module than the reference.

## General Dynamic (or Global Dynamic) (GD)

A GD access can refer to a TLS variable anywhere. To access a variable `tls_var` using the
"traditional" non-TLSDESC design described in Drepper's TLS document, the compiler emits a call to a
`__tls_get_addr` function provided by libc.

For example, if we have this C/C++ code in a shared object:

```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```

The toolchain generates code like this:

```cpp
struct TlsIndex {
  long module; // starts counting at 1
  long offset;
};

char* get_tls_var() {
  static TlsIndex tls_var_idx = { // allocated in the .got
    R_TLS_DTPMOD(tls_var), // dynamic TP module ID
    R_TLS_DTPOFF(tls_var), // dynamic TP offset
  };
  return (char*)__tls_get_addr(&tls_var_idx);
}
```

`R_TLS_DTPMOD` is a dynamic relocation to the index of the module containing `tls_var`, and
`R_TLS_DTPOFF` is a dynamic relocation to the offset of `tls_var` within its module's `PT_TLS`
segment.

`__tls_get_addr` looks up `TlsIndex::module`'s entry in the DTV and adds `TlsIndex::offset` to the
module's TLS block. Before it can do this, it ensures that the module's TLS block is allocated. A
simple approach is to allocate memory lazily:

1. If the current thread's DTV generation count is less than the current global TLS generation, then
   `__tls_get_addr` may reallocate the DTV or free blocks for unloaded modules.

2. If the DTV's entry for the given module is `NULL`, then `__tls_get_addr` allocates the module's
   memory.

If an allocation fails, `__tls_get_addr` calls `abort` (like emutls).

musl, on the other hand, preallocates TLS memory in `pthread_create` and in `dlopen`, and each can
report out-of-memory.
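
The two lazy-allocation steps above can be sketched as a user-space simulation. This is not
Bionic's or glibc's implementation: `SimModule`, `SimDtv`, `sim_load_module`, `sim_unload_module`,
and `sim_tls_get_addr` are hypothetical names, the module list stands in for loader state, and
locking is omitted, so the sketch assumes modules are loaded and unloaded from a single thread.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

struct TlsIndex { long module; long offset; };  // module IDs start at 1

// Hypothetical loader-side record describing one module's TLS segment.
struct SimModule {
  std::vector<char> init_image;  // initialized bytes; the rest is zero-filled
  size_t size;                   // total size of the module's TLS block
  bool loaded;
};

// Stand-ins for global loader state (a real implementation needs a lock).
static std::vector<SimModule> g_modules;  // index = module ID - 1
static unsigned long g_generation = 0;

struct SimDtv {
  unsigned long generation = 0;
  std::vector<char*> blocks;  // index = module ID - 1; nullptr = unallocated
};
static thread_local SimDtv t_dtv;

// Registers a module and bumps the global generation. Returns the module ID.
long sim_load_module(const char* init, size_t init_size, size_t total_size) {
  g_modules.push_back({std::vector<char>(init, init + init_size), total_size, true});
  ++g_generation;
  return static_cast<long>(g_modules.size());
}

void sim_unload_module(long id) {
  g_modules[id - 1].loaded = false;
  ++g_generation;
}

char* sim_tls_get_addr(const TlsIndex* ti) {
  SimDtv& dtv = t_dtv;
  // Step 1: the DTV is out of date, so resize it and free blocks that belong
  // to modules unloaded since this thread last looked.
  if (dtv.generation < g_generation) {
    dtv.blocks.resize(g_modules.size(), nullptr);
    for (size_t i = 0; i < dtv.blocks.size(); ++i) {
      if (!g_modules[i].loaded && dtv.blocks[i] != nullptr) {
        std::free(dtv.blocks[i]);
        dtv.blocks[i] = nullptr;
      }
    }
    dtv.generation = g_generation;
  }
  // Step 2: allocate this thread's block for the module on first use.
  char*& block = dtv.blocks[ti->module - 1];
  if (block == nullptr) {
    const SimModule& mod = g_modules[ti->module - 1];
    block = static_cast<char*>(std::calloc(1, mod.size));
    if (block == nullptr) std::abort();  // mirrors the abort-on-OOM policy
    if (!mod.init_image.empty()) {
      std::memcpy(block, mod.init_image.data(), mod.init_image.size());
    }
  }
  return block + ti->offset;
}
```

A thread that never calls `sim_tls_get_addr` for a module never pays for that module's block, which
is the main attraction of the lazy approach. The sketch does not free blocks at thread exit.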

## Local Dynamic (LD)

LD is a specialization of GD that's useful when a function has references to two or more TLS
variables that are all part of the same module as the reference. Instead of a call to
`__tls_get_addr` for each variable, the compiler calls `__tls_get_addr` once to get the current
module's TLS block, then adds each variable's DTPOFF to the result.

For example, suppose we have this C/C++ code:

```cpp
static thread_local int x;
static thread_local int y;
int sum() {
  return x + y;
}
```

The toolchain generates code like this:

```cpp
int sum() {
  static TlsIndex tls_module_idx = { // allocated in the .got
    // a dynamic relocation against symbol 0 => current module ID
    R_TLS_DTPMOD(NULL),
    0,
  };
  char* base = (char*)__tls_get_addr(&tls_module_idx);
  // These R_TLS_DTPOFF() relocations are resolved at link-time.
  int* px = (int*)(base + R_TLS_DTPOFF(x));
  int* py = (int*)(base + R_TLS_DTPOFF(y));
  return *px + *py;
}
```

(XXX: LD might be important for C++ `thread_local` variables -- even a single `thread_local`
variable with a dynamic initializer has an associated TLS guard variable.)

## Initial Exec (IE)

If the variable is part of the Static TLS Block (i.e. the executable or an initially-loaded shared
object), then its offset from the TP is known at load-time. The variable can be accessed with a few
loads.

Example: a C file for an executable:

```cpp
// tls_var could be defined in the executable, or it could be defined
// in a shared object the executable links against.
extern thread_local char tls_var;
char* get_addr() { return &tls_var; }
```

Compiles to:

```cpp
// allocated in the .got, resolved at load-time with a dynamic reloc.
// Unlike DTPOFF, which is relative to the start of the module's block,
// TPOFF is directly relative to the thread pointer.
static long tls_var_gotoff = R_TLS_TPOFF(tls_var);

char* get_addr() {
  return (char*)__get_tls() + tls_var_gotoff;
}
```

## Local Exec (LE)

LE is a specialization of IE. If the variable is not just part of the Static TLS Block, but is also
part of the executable (and referenced from the executable), then a GOT access can be avoided. The
IE example compiles to:

```cpp
char* get_addr() {
  // R_TLS_TPOFF() is resolved at (static) link-time
  return (char*)__get_tls() + R_TLS_TPOFF(tls_var);
}
```

## Selecting an Access Model

The compiler selects an access model for each variable reference using these factors:
* The absence of `-fpic` implies an executable, so use IE/LE.
* Code compiled with `-fpic` could be in a shared object, so use GD/LD.
* The per-file default can be overridden with `-ftls-model=<model>`.
* Specifiers on the variable (`static`, `extern`, ELF visibility attributes).
* A variable can be annotated with `__attribute__((tls_model(...)))`. Clang may still use a more
  efficient model than the one specified.
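
For instance, the per-variable attribute can be written as in the sketch below, which assumes GCC
or Clang (the attribute is a compiler extension). The variable and function names are hypothetical.
Only the `global-dynamic` and `initial-exec` models are shown, because `local-exec` is valid only
for code linked into an executable.

```cpp
// Request the most general model explicitly. The compiler and linker may
// still pick or relax to a cheaper model when they can prove it is safe.
thread_local int gd_counter __attribute__((tls_model("global-dynamic"))) = 0;

// Request static TLS. In a shared object, this makes the variable usable
// only if the module's TLS fits in the Static TLS Block.
thread_local int ie_counter __attribute__((tls_model("initial-exec"))) = 0;

int bump_gd() { return ++gd_counter; }
int bump_ie() { return ++ie_counter; }
```

The source-level behavior of both variables is identical. Only the generated access sequence
differs, which can be checked by inspecting the compiler's assembly output.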

# Shared Objects with Static TLS

Shared objects are sometimes compiled with `-ftls-model=initial-exec` (i.e. "static TLS") for better
performance. On Ubuntu, for example, `libc.so.6` and `libOpenGL.so.0` are compiled this way. Shared
objects using static TLS can't be loaded with `dlopen` unless libc has reserved enough surplus
memory in the static TLS block. glibc reserves a kilobyte or two (`TLS_STATIC_SURPLUS`) with the
intent that only a few core system libraries would use static TLS. Non-core libraries also sometimes
use it, which can break `dlopen` if the surplus area is exhausted. See:
* https://bugzilla.redhat.com/show_bug.cgi?id=1124987
* web search: [`"dlopen: cannot load any more object with static TLS"`][glibc-static-tls-error]

Neither musl nor the Bionic TLS prototype currently allocates any surplus TLS memory.

In general, supporting surplus TLS memory probably requires maintaining a thread list so that
`dlopen` can initialize the new static TLS memory in all existing threads. A thread list could be
omitted if the loader only allowed zero-initialized TLS segments and didn't reclaim memory on
`dlclose`.
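
To make the exhaustion failure concrete, here is a toy model of a fixed surplus area, assuming a
simple bump allocator that never reclaims space (the "no reclaim on `dlclose`" variant described
above). `StaticTlsSurplus` and its capacity are hypothetical and do not reflect glibc's actual
bookkeeping.

```cpp
#include <cstddef>

// Toy model of the surplus space reserved in the static TLS block.
class StaticTlsSurplus {
 public:
  explicit StaticTlsSurplus(size_t capacity) : capacity_(capacity) {}

  // Tries to place a TLS segment of `size` bytes with alignment `align`
  // (a power of two). On success, stores the segment's offset within the
  // surplus area and returns true. A false result models dlopen failing
  // because the surplus is exhausted.
  bool reserve(size_t size, size_t align, size_t* offset) {
    size_t start = (used_ + align - 1) & ~(align - 1);
    if (start > capacity_ || size > capacity_ - start) return false;
    *offset = start;
    used_ = start + size;
    return true;
  }

 private:
  size_t capacity_;
  size_t used_ = 0;  // never decreases: space is not reclaimed
};
```

Once a large static-TLS library has consumed most of the surplus, any later library that needs more
than the remainder fails to load, regardless of how many libraries were unloaded in between.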

As long as a shared object is one of the initially-loaded modules, a better option is to use
TLSDESC.

[glibc-static-tls-error]: https://www.google.com/search?q=%22dlopen:+cannot+load+any+more+object+with+static+TLS%22

# TLS Descriptors (TLSDESC)

The code fragments above match the "traditional" TLS design from Drepper's document. For the GD and
LD models, there is a newer, more efficient design that uses "TLS descriptors". Each TLS variable
reference has a corresponding descriptor, which contains a resolver function address and an argument
to pass to the resolver.

For example, if we have this C code in a shared object:

```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```

The toolchain generates code like this:

```cpp
struct TlsDescriptor { // NB: arm32 reverses these fields
  long (*resolver)(long);
  long arg;
};

char* get_tls_var() {
  // allocated in the .got, uses a dynamic relocation
  static TlsDescriptor desc = R_TLS_DESC(tls_var);
  return (char*)__get_tls() + desc.resolver(desc.arg);
}
```

The dynamic loader fills in the TLS descriptors. For a reference to a variable allocated in the
Static TLS Block, it can use a simple resolver function:

```cpp
long static_tls_resolver(long arg) {
  return arg;
}
```

The loader writes `tls_var@TPOFF` into the descriptor's argument.

To support modules loaded with `dlopen`, the loader must use a resolver function that calls
`__tls_get_addr`. In principle, this simple implementation would work:

```cpp
long dynamic_tls_resolver(TlsIndex* arg) {
  return (long)__tls_get_addr(arg) - (long)__get_tls();
}
```

There are optimizations that complicate the design a little:
* Unlike `__tls_get_addr`, the resolver function has a special calling convention that preserves
  almost all registers, reducing register pressure in the caller
  ([example](https://godbolt.org/g/gywcxk)).
* In general, the resolver function must call `__tls_get_addr`, so it must save and restore all
  registers.
* To keep the fast path fast, the resolver inlines the fast path of `__tls_get_addr`.
* By storing the module's initial generation alongside the TlsIndex, the resolver function doesn't
  need to use an atomic or synchronized access of the global TLS generation counter.

The resolver must be written in assembly, but in C, the function looks like so:

```cpp
struct TlsDescDynamicArg {
  unsigned long first_generation;
  TlsIndex idx;
};

struct TlsDtv { // DTV == dynamic thread vector
  unsigned long generation;
  char* modules[];
};

long dynamic_tls_resolver(TlsDescDynamicArg* arg) {
  TlsDtv* dtv = __get_dtv();
  char* addr;
  if (dtv->generation >= arg->first_generation &&
      dtv->modules[arg->idx.module] != nullptr) {
    addr = dtv->modules[arg->idx.module] + arg->idx.offset;
  } else {
    addr = __tls_get_addr(&arg->idx);
  }
  return (long)addr - (long)__get_tls();
}
```

The loader needs to allocate a table of `TlsDescDynamicArg` objects for each TLS module with dynamic
TLSDESC relocations.
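
The descriptor mechanism can be exercised in plain C++ with the simulation below. It is a sketch
under several assumptions: `sim_static_tls` stands in for the thread's static TLS area (its start
plays the role of the TP), `sim_dynamic_block` stands in for a block that `__tls_get_addr` would
have allocated, and the dynamic resolver takes a plain offset rather than a pointer to a
`TlsDescDynamicArg`. All `sim_*` names and `access_via_descriptor` are hypothetical.

```cpp
#include <cstdint>

struct TlsDescriptor {
  std::intptr_t (*resolver)(std::intptr_t);
  std::intptr_t arg;
};

// Stand-in for the static TLS area; its first byte acts as the TP.
static thread_local char sim_static_tls[64];
// Stand-in for a dlopen'ed module's block, which lives somewhere else.
static thread_local char sim_dynamic_block[64];

char* sim_get_tls() { return sim_static_tls; }

// Static case: the argument already is the TP-relative offset.
std::intptr_t static_tls_resolver(std::intptr_t arg) { return arg; }

// Dynamic case: find the variable's address, then convert it into a
// TP-relative offset so that the caller's code sequence stays the same.
std::intptr_t dynamic_tls_resolver(std::intptr_t arg) {
  char* addr = sim_dynamic_block + arg;
  return reinterpret_cast<std::intptr_t>(addr) -
         reinterpret_cast<std::intptr_t>(sim_get_tls());
}

// What the compiler-generated access sequence computes: TP + resolver(arg).
char* access_via_descriptor(const TlsDescriptor& desc) {
  std::intptr_t tp = reinterpret_cast<std::intptr_t>(sim_get_tls());
  return reinterpret_cast<char*>(tp + desc.resolver(desc.arg));
}
```

The caller never needs to know which resolver the loader installed. Swapping the function pointer
in the descriptor is enough to switch between the static and dynamic paths.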

The static linker can still relax a TLSDESC-based access to an IE/LE access.

The traditional TLS design is implemented everywhere, but the TLSDESC design has less toolchain
support:
* GCC and the BFD linker support both designs on all supported Android architectures (arm32, arm64,
  x86, x86-64).
* GCC can select the design with a command-line flag, `-mtls-dialect=<dialect>` (`trad`-vs-`desc`
  on arm64, otherwise `gnu`-vs-`gnu2`). Clang always uses the default mode.
* GCC and Clang default to TLSDESC on arm64 and the traditional design on other architectures.
* Gold and LLD support for TLSDESC is spotty (except when targeting arm64).

# Linker Relaxations

The (static) linker frequently has more information about the location of a referenced TLS variable
than the compiler, so it can "relax" TLS accesses to more efficient models. For example, if an
object file compiled with `-fpic` is linked into an executable, the linker could relax GD accesses
to IE or LE. To relax a TLS access, the linker looks for an expected sequence of instructions and
static relocations, then replaces the sequence with a different one of equal size. It may need to
add or remove no-op instructions.

## Current Support for GD->LE Relaxations Across Linkers

Versions tested:
* BFD and Gold linkers: version 2.30
* LLD version 6.0.0 (upstream)

Linker support for GD->LE relaxation with `-mtls-dialect=gnu/trad` (traditional):

Architecture    | BFD | Gold | LLD
--------------- | --- | ---- | ---
arm32           | no  | no   | no
arm64 (unusual) | yes | yes  | no
x86             | yes | yes  | yes
x86_64          | yes | yes  | yes

Linker support for GD->LE relaxation with `-mtls-dialect=gnu2/desc` (TLSDESC):

Architecture          | BFD | Gold               | LLD
--------------------- | --- | ------------------ | ------------------
arm32 (experimental)  | yes | unsupported relocs | unsupported relocs
arm64                 | yes | yes                | yes
x86 (experimental)    | yes | yes                | unsupported relocs
x86_64 (experimental) | yes | yes                | unsupported relocs

arm32 linkers can't relax traditional TLS accesses. BFD can relax an arm32 TLSDESC access, but LLD
can't link code using TLSDESC at all, except on arm64, where it's used by default.

# dlsym

Calling `dlsym` on a TLS variable returns the address of the current thread's variable.

# Debugger Support

## gdb

gdb uses a libthread_db plugin library to retrieve thread-related information from a target. This
library is typically a shared object, but for Android, we link our own `libthread_db.a` into
gdbserver. We will need to implement at least 2 APIs in `libthread_db.a` to find TLS variables, and
gdb provides APIs for looking up symbols, reading or writing memory, and retrieving the current
thread pointer (e.g. `ps_get_thread_area`).
* Reference: [gdb_proc_service.h]: APIs gdb provides to libthread_db
* Reference: [Currently unimplemented TLS functions in Android's libthread_db][libthread_db.c]

[gdb_proc_service.h]: https://android.googlesource.com/toolchain/gdb/+/a7e49fd02c21a496095c828841f209eef8ae2985/gdb-8.0.1/gdb/gdb_proc_service.h#41
[libthread_db.c]: https://android.googlesource.com/platform/ndk/+/e1f0ad12fc317c0ca3183529cc9625d3f084d981/sources/android/libthread_db/libthread_db.c#115

## LLDB

LLDB more-or-less implemented Linux TLS debugging in [r192922][rL192922] ([D1944]) for x86 and
x86-64. [arm64 support came later][D5073]. However, the Linux TLS functionality no longer does
anything: the `GetThreadPointer` function is no longer implemented. Code for reading the thread
pointer was removed in [D10661] ([this function][r240543]). (arm32 was apparently never supported.)

[rL192922]: https://reviews.llvm.org/rL192922
[D1944]: https://reviews.llvm.org/D1944
[D5073]: https://reviews.llvm.org/D5073
[D10661]: https://reviews.llvm.org/D10661
[r240543]: https://github.com/llvm-mirror/lldb/commit/79246050b0f8d6b54acb5366f153d07f235d2780#diff-52dee3d148892cccfcdab28bc2165548L962

## Threading Library Metadata

Both debuggers need metadata from the threading library (`libc.so` / `libpthread.so`) to find TLS
variables. From [LLDB r192922][rL192922]'s commit message:

> ... All OSes use basically the same algorithm (a per-module lookup table) as detailed in Ulrich
> Drepper's TLS ELF ABI document, so we can easily write code to decode it ourselves. The only
> question therefore is the exact field layouts required. Happily, the implementors of libpthread
> expose the structure of the DTV via metadata exported as symbols from the .so itself, designed
> exactly for this kind of thing. So this patch simply reads that metadata in, and re-implements
> libthread_db's algorithm itself. We thereby get cross-platform TLS lookup without either requiring
> third-party libraries, while still being independent of the version of libpthread being used.

LLDB uses these variables:

Name                              | Notes
--------------------------------- | ---------------------------------------------------------------------------------------
`_thread_db_pthread_dtvp`         | Offset from TP to DTV pointer (0 for variant 1, implementation-defined for variant 2)
`_thread_db_dtv_dtv`              | Size of a DTV slot (typically/always sizeof(void*))
`_thread_db_dtv_t_pointer_val`    | Offset within a DTV slot to the pointer to the allocated TLS block (typically/always 0)
`_thread_db_link_map_l_tls_modid` | Offset of a `link_map` field containing the module's 1-based TLS module ID

The metadata variables are local symbols in glibc's `libpthread.so` symbol table (but not its
dynamic symbol table). Debuggers can access them, but applications can't.

The debugger lookup process is straightforward:
* Find the `link_map` object and module-relative offset for a TLS variable.
* Use `_thread_db_link_map_l_tls_modid` to find the TLS variable's module ID.
* Read the target thread pointer.
* Use `_thread_db_pthread_dtvp` to find the thread's DTV.
* Use `_thread_db_dtv_dtv` and `_thread_db_dtv_t_pointer_val` to find the desired module's block
  within the DTV.
* Add the module-relative offset to the module pointer.

This process doesn't appear robust in the face of lazy DTV initialization -- presumably it could
read past the end of an out-of-date DTV or access an unloaded module. To be robust, it needs to
compare a module's initial generation count against the DTV's generation count. (XXX: Does gdb have
these sorts of problems with glibc's libpthread?)
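
The lookup steps can be sketched against a simulated inferior, where target memory is a byte vector
and addresses are offsets into it. This is an illustration, not LLDB's or libthread_db's code:
`ThreadDbMetadata`, `read_word`, `write_word`, and `find_tls_variable` are hypothetical names, words
are assumed to be 64 bits, and DTV slot 0 is assumed to hold the generation count so that module ID
`n` lives in slot `n`. Like the process described above, it performs no generation or bounds check.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

using Addr = std::uint64_t;

// Values a debugger would read from the _thread_db_* metadata symbols.
struct ThreadDbMetadata {
  Addr pthread_dtvp;          // offset from TP to the DTV pointer
  Addr dtv_slot_size;         // size of one DTV slot
  Addr dtv_t_pointer_val;     // offset of the block pointer within a slot
  Addr link_map_l_tls_modid;  // offset of the module ID within link_map
};

// Reads one 64-bit word from simulated inferior memory.
Addr read_word(const std::vector<std::uint8_t>& mem, Addr addr) {
  Addr value = 0;
  std::memcpy(&value, mem.data() + addr, sizeof(value));
  return value;
}

// Writes one 64-bit word into simulated inferior memory (test setup only).
void write_word(std::vector<std::uint8_t>& mem, Addr addr, Addr value) {
  std::memcpy(mem.data() + addr, &value, sizeof(value));
}

// Returns the address of a TLS variable for the thread whose thread pointer
// is `tp`, or 0 if that thread has not allocated the module's block yet.
Addr find_tls_variable(const std::vector<std::uint8_t>& mem,
                       const ThreadDbMetadata& md, Addr link_map, Addr tp,
                       Addr offset_in_module) {
  Addr modid = read_word(mem, link_map + md.link_map_l_tls_modid);
  Addr dtv = read_word(mem, tp + md.pthread_dtvp);
  Addr slot = dtv + modid * md.dtv_slot_size;
  Addr block = read_word(mem, slot + md.dtv_t_pointer_val);
  if (block == 0) return 0;
  return block + offset_in_module;
}
```

A more robust version would also read the DTV's generation and length before indexing into it,
which is the gap the preceding paragraph points out.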

## Reading the Thread Pointer with Ptrace

There are ptrace interfaces for reading the thread pointer for each of arm32, arm64, x86, and x86-64
(XXX: check 32-vs-64-bit for inferiors, debuggers, and kernels):
* arm32: `PTRACE_GET_THREAD_AREA`
* arm64: `PTRACE_GETREGSET`, `NT_ARM_TLS`
* x86_32: `PTRACE_GET_THREAD_AREA`
* x86_64: use `PTRACE_PEEKUSER` to read the `{fs,gs}_base` fields of `user_regs_struct`

# C/C++ Specifiers

C/C++ TLS variables are declared with a specifier:

Specifier       | Notes
--------------- | -----------------------------------------------------------------------------------------------------------------------------
`__thread`      | - non-standard, but ubiquitous in GCC and Clang<br/> - cannot have dynamic initialization or destruction
`_Thread_local` | - a keyword standardized in C11<br/> - cannot have dynamic initialization or destruction
`thread_local`  | - C11: a macro for `_Thread_local` via `threads.h`<br/> - C++11: a keyword, allows dynamic initialization and/or destruction

The dynamic initialization and destruction of C++ `thread_local` variables is layered on top of ELF
TLS (or emutls), so this design document mostly ignores it. Like emutls, ELF TLS variables either
have a static initializer or are zero-initialized.

Aside: Because a `__thread` variable cannot have dynamic initialization, `__thread` is more
efficient in C++ than `thread_local` when the compiler cannot see the definition of a declared TLS
variable. The compiler assumes the variable could have a dynamic initializer and generates code, at
each access, to call a function to initialize the variable.
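
The difference between the two kinds of initialization can be observed with the sketch below, which
assumes GCC or Clang (for `__thread`). The names `g_init_calls`, `make_initial_value`, `dyn_var`,
`const_var`, and `read_dyn_on_new_thread` are hypothetical.

```cpp
#include <atomic>
#include <thread>

static std::atomic<int> g_init_calls{0};

// Not a constant expression, so any variable initialized with it needs
// dynamic initialization.
int make_initial_value() {
  ++g_init_calls;
  return 42;
}

// Dynamic initializer: runs once in each thread that uses the variable.
// Accesses go through compiler-generated code that ensures this has happened.
thread_local int dyn_var = make_initial_value();

// Constant initializer: the value comes straight from the TLS initialization
// image, so no per-access initialization check is needed.
__thread int const_var = 42;

// Reads dyn_var on a fresh thread, which triggers a fresh initialization.
int read_dyn_on_new_thread() {
  int value = 0;
  std::thread worker([&] { value = dyn_var; });
  worker.join();
  return value;
}
```

Each additional thread that touches `dyn_var` increments `g_init_calls` by one, whereas `const_var`
never runs any initialization code at all.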

# Graceful Failure on Old Platforms

ELF TLS isn't implemented on older Android platforms, so dynamic executables and shared objects
using it generally won't work on them. Ideally, the older platforms would reject these binaries
rather than experience memory corruption at run-time.

Static executables aren't a problem--the necessary runtime support is part of the executable, so TLS
just works.

XXX: Shared objects are less of a problem.
* On arm32, x86, and x86_64, the loader [should reject a TLS relocation]. (XXX: I haven't verified
  this.)
* On arm64, the primary TLS relocation (R_AARCH64_TLSDESC) is [confused with an obsolete
  R_AARCH64_TLS_DTPREL32 relocation][R_AARCH64_TLS_DTPREL32] and is [quietly ignored].
* Android P [added compatibility checks] for TLS symbols and `DT_TLSDESC_{GOT|PLT}` entries.

XXX: A dynamic executable using ELF TLS would have a PT_TLS segment and no other distinguishing
marks, so running it on an older platform would result in memory corruption. Should we add something
to these executables that only newer platforms recognize? (e.g. maybe an entry in .dynamic, a
reference to a symbol only a new libc.so has...)

[should reject a TLS relocation]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2852
[R_AARCH64_TLS_DTPREL32]: https://android-review.googlesource.com/c/platform/bionic/+/723696
[quietly ignored]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2784
[added compatibility checks]: https://android-review.googlesource.com/c/platform/bionic/+/648760

# Bionic Prototype Notes

There is an [ELF TLS prototype] uploaded on Gerrit. It implements:
* Static TLS Block allocation for static and dynamic executables
* TLS for dynamically loaded and unloaded modules (`__tls_get_addr`)
* TLSDESC for arm64 only

Missing:
* `dlsym` of a TLS variable
* debugger support

[ELF TLS prototype]: https://android-review.googlesource.com/q/topic:%22elf-tls-prototype%22+(status:open%20OR%20status:merged)

## Loader/libc Communication

The loader exposes a list of TLS modules ([`struct TlsModules`][TlsModules]) to `libc.so` using the
`__libc_shared_globals` variable (see `tls_modules()` in [linker_tls.cpp][tls_modules-linker] and
[elf_tls.cpp][tls_modules-libc]). `__tls_get_addr` in libc.so acquires the `TlsModules::mutex` and
iterates its module list to lazily allocate and free TLS blocks.

[TlsModules]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.h#53
[tls_modules-linker]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/linker/linker_tls.cpp#45
[tls_modules-libc]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.cpp#49

## TLS Allocator

The prototype currently allocates a `pthread_internal_t` object and static TLS in a single mmap'ed
region, along with a thread's stack if it needs one allocated. It doesn't place TLS memory on a
preallocated stack (either the main thread's stack or one provided with `pthread_attr_setstack`).

The DTV and blocks for dlopen'ed modules are instead allocated using the Bionic loader's
`LinkerMemoryAllocator`, adapted to avoid the STL and to provide `memalign`. The prototype tries to
achieve async-signal safety by blocking signals and acquiring a lock.
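
The "block signals, then lock" pattern can be sketched with POSIX primitives as follows. This is
not the prototype's code: `ScopedSignalBlockAndLock`, `g_alloc_mutex`, `sigusr1_blocked`, and
`blocked_inside_critical_section` are hypothetical names chosen for the example.

```cpp
#include <pthread.h>
#include <signal.h>

// Blocks all signals on the calling thread, then takes the mutex. Because no
// handler can run on this thread while the mutex is held, a handler that
// needs the same mutex cannot deadlock against its own thread.
class ScopedSignalBlockAndLock {
 public:
  explicit ScopedSignalBlockAndLock(pthread_mutex_t* mutex) : mutex_(mutex) {
    sigset_t all;
    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, &old_mask_);
    pthread_mutex_lock(mutex_);
  }
  ~ScopedSignalBlockAndLock() {
    // Release in the reverse order: unlock first, then restore the mask.
    pthread_mutex_unlock(mutex_);
    pthread_sigmask(SIG_SETMASK, &old_mask_, nullptr);
  }
  ScopedSignalBlockAndLock(const ScopedSignalBlockAndLock&) = delete;
  ScopedSignalBlockAndLock& operator=(const ScopedSignalBlockAndLock&) = delete;

 private:
  pthread_mutex_t* mutex_;
  sigset_t old_mask_;
};

static pthread_mutex_t g_alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

// Returns true if SIGUSR1 is currently blocked on the calling thread.
bool sigusr1_blocked() {
  sigset_t current;
  pthread_sigmask(SIG_SETMASK, nullptr, &current);  // query only
  return sigismember(&current, SIGUSR1) == 1;
}

// Reports whether SIGUSR1 is blocked while the guard is held. An allocator
// would do its work at this point instead.
bool blocked_inside_critical_section() {
  ScopedSignalBlockAndLock guard(&g_alloc_mutex);
  return sigusr1_blocked();
}
```

The ordering matters: if the mutex were taken before signals were blocked, a handler could still
interrupt the thread between the two steps while the lock is held.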

There are three "entry points" to dynamically locate a TLS variable's address:
* libc.so: `__tls_get_addr`
* loader: TLSDESC dynamic resolver
* loader: dlsym

The loader's entry points need to call `__tls_get_addr`, which needs to allocate memory. Currently,
the prototype uses a [special function pointer] to call libc.so's `__tls_get_addr` from the loader.
(This should probably be removed.)

The prototype currently allows for arbitrarily-large TLS variable alignment. IIRC, different
implementations (glibc, musl, FreeBSD) vary in their level of respect for TLS alignment. It looks
like the Bionic loader ignores segments' alignment and aligns loaded libraries to 256 KiB. See
`ReserveAligned`.

[special function pointer]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/private/bionic_globals.h#52

## Async-Signal Safety

The prototype's `__tls_get_addr` might be async-signal safe. Making it AS-safe is a good idea if
it's feasible. musl's function is AS-safe, but glibc's isn't (or wasn't). Google had a patch to make
glibc AS-safe back in 2012-2013. See:
* https://sourceware.org/glibc/wiki/TLSandSignals
* https://sourceware.org/ml/libc-alpha/2012-06/msg00335.html
* https://sourceware.org/ml/libc-alpha/2013-09/msg00563.html

## Out-of-Memory Handling (abort)

The prototype lazily allocates TLS memory for dlopen'ed modules (see `__tls_get_addr`), and an
out-of-memory error on a TLS access aborts the process. musl, on the other hand, preallocates TLS
memory on `pthread_create` and `dlopen`, so either function can return out-of-memory. Both functions
probably need to acquire the same lock.

Maybe Bionic should do the same as musl? Perhaps musl's robustness argument doesn't hold for Bionic,
though, because Bionic (at least the linker) probably already aborts on OOM. musl doesn't support
`dlclose`/unloading, so it might have an easier time.

On the other hand, maybe lazy allocation is a feature, because not all threads will use a dlopen'ed
solib's TLS variables. Drepper makes this argument in his TLS document:

> In addition the run-time support should avoid creating the thread-local storage if it is not
> necessary. For instance, a loaded module might only be used by one thread of the many which make
> up the process. It would be a waste of memory and time to allocate the storage for all threads. A
> lazy method is wanted. This is not much extra burden since the requirement to handle dynamically
> loaded objects already requires recognizing storage which is not yet allocated. This is the only
> alternative to stopping all threads and allocating storage for all threads before letting them run
> again.

FWIW: emutls also aborts on out-of-memory.

## ELF TLS Not Usable in libc

The dynamic loader currently can't use ELF TLS, so any part of libc linked into the loader (i.e.
most of it) also can't use ELF TLS. It might be possible to lift this restriction, perhaps with
specialized `__tls_get_addr` and TLSDESC resolver functions.

# Open Issues

## Bionic Memory Layout Conflicts with Common TLS Layout

Bionic already allocates thread-specific data in a way that conflicts with TLS variants 1 and 2:

[Figure: Bionic's current thread-specific memory layout]

TLS variant 1 allocates everything after the TP to ELF TLS (except the first two words), and variant
2 allocates everything before the TP. Bionic currently allocates memory before and after the TP to
the `pthread_internal_t` struct.

The `bionic_tls.h` header is marked with a warning:

```cpp
/** WARNING WARNING WARNING
 **
 ** This header file is *NOT* part of the public Bionic ABI/API
 ** and should not be used/included by user-serviceable parts of
 ** the system (e.g. applications).
 **
 ** It is only provided here for the benefit of the system dynamic
 ** linker and the OpenGL sub-system (which needs to access the
 ** pre-allocated slot directly for performance reason).
 **/
```

There are issues with rearranging this memory:

* `TLS_SLOT_STACK_GUARD` is used for `-fstack-protector`. The location (word #5) was initially used
  by GCC on x86 (and x86-64), where it is compatible with x86's TLS variant 2. We [modified Clang
  to use this slot for arm64 in 2016][D18632], though, and the slot isn't compatible with ARM's
  variant 1 layout. This change shipped in NDK r14, and the NDK's build systems (ndk-build and the
  CMake toolchain file) enable `-fstack-protector-strong` by default.

* `TLS_SLOT_TSAN` is used for more than just TSAN -- it's also used by [HWASAN and
  Scudo](https://reviews.llvm.org/D53906#1285002).

* The Go runtime allocates a thread-local "g" variable on Android by creating a pthread key and
  searching for its TP-relative offset, which it assumes is nonnegative:
    * On arm32/arm64, it creates a pthread key, sets it to a magic value, then scans forward from
      the thread pointer looking for it. [The scan count was bumped to 384 to fix a reported
      breakage happening with Android N.](https://go-review.googlesource.com/c/go/+/38636) (XXX: I
      suspect the actual platform breakage happened with Android M's [lock-free pthread key
      work][bionic-lockfree-keys].)
    * On x86/x86-64, it uses a fixed offset from the thread pointer (TP+0xf8 or TP+0x1d0) and
      creates pthread keys until one of them hits the fixed offset.
| 627 | * CLs: |
| 628 | * arm32: https://codereview.appspot.com/106380043 |
| 629 | * arm64: https://go-review.googlesource.com/c/go/+/17245 |
| 630 | * x86: https://go-review.googlesource.com/c/go/+/16678 |
| 631 | * x86-64: https://go-review.googlesource.com/c/go/+/15991 |
| 632 | * Moving the pthread keys before the thread pointer breaks Go-based apps. |
| 633 | * It's unclear how many Android apps use Go. There are at least two with 1,000,000+ installs. |
| 634 | * [Some motivation for Go's design][golang-post], [runtime/HACKING.md][go-hacking] |
| 635 | * [On x86/x86-64 Darwin, Go uses a TLS slot reserved for both Go and Wine][go-darwin-x86] (On |
| 636 | [arm32][go-darwin-arm32]/[arm64][go-darwin-arm64] Darwin, Go scans for pthread keys like it |
| 637 | does on Android.) |
| 638 | |
| 639 | * Android's "native bridge" system allows the Zygote to load an app solib of a non-native ABI. (For |
| 640 | example, it could be used to load an arm32 solib into an x86 Zygote.) The solib is translated |
| 641 | into the host architecture. TLS accesses in the app solib (whether ELF TLS, Bionic slots, or |
| 642 | `pthread_internal_t` fields) become host accesses. Laying out TLS memory differently across |
| 643 | architectures could complicate this translation. |
| 644 | |
| 645 | * A `pthread_t` is practically just a `pthread_internal_t*`, and some apps directly access the |
| 646 | `pthread_internal_t::tid` field. Past examples: http://b/17389248, [aosp/107467]. Reorganizing |
| 647 | the initial `pthread_internal_t` fields could break those apps. |
| 648 | |
| 649 | It seems easy to fix the incompatibility for variant 2 (x86 and x86_64) by splitting out the Bionic |
| 650 | slots into a new data structure. Variant 1 is a harder problem. |
| 651 | |
| 652 | The TLS prototype currently uses a patched LLD that uses a variant 1 TLS layout with a 16-word TCB |
| 653 | on all architectures. |
| 654 | |
| 655 | Aside: GCC's arm64ilp32 target uses a 32-bit unsigned offset for a TLS IE access |
| 656 | (https://godbolt.org/z/_NIXjF). If Android ever supports this target, and in a configuration with |
| 657 | variant 2 TLS, we might need to change the compiler to emit a sign-extending load. |
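
The difference between the two kinds of load can be sketched with plain integer arithmetic. This is only an illustration with hypothetical function names: it models a 32-bit GOT entry being added to a 64-bit thread pointer, and assumes the usual two's-complement conversion from `uint32_t` to `int32_t`.

```cpp
#include <cassert>
#include <cstdint>

// Zero-extending load: the 32-bit GOT entry is treated as an unsigned offset.
inline std::uint64_t tls_addr_zero_extend(std::uint64_t tp, std::uint32_t got_entry) {
  return tp + static_cast<std::uint64_t>(got_entry);
}

// Sign-extending load: the same 32 bits are treated as a signed offset, which is
// what a negative (variant 2) TP offset would need.
inline std::uint64_t tls_addr_sign_extend(std::uint64_t tp, std::uint32_t got_entry) {
  const std::int64_t offset = static_cast<std::int32_t>(got_entry);
  return tp + static_cast<std::uint64_t>(offset);
}
```

With a TP of `0x10000` and a stored offset of -16 (`0xFFFFFFF0`), the sign-extending form yields `0xFFF0`, while the zero-extending form lands 4 GiB higher at `0x10000FFF0`.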
| 658 | |
| 659 | [D18632]: https://reviews.llvm.org/D18632 |
| 660 | [bionic-lockfree-keys]: https://android-review.googlesource.com/c/platform/bionic/+/134202 |
| 661 | [golang-post]: https://groups.google.com/forum/#!msg/golang-nuts/EhndTzcPJxQ/i-w7kAMfBQAJ |
| 662 | [go-hacking]: https://github.com/golang/go/blob/master/src/runtime/HACKING.md |
| 663 | [go-darwin-x86]: https://github.com/golang/go/issues/23617 |
| 664 | [go-darwin-arm32]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm.c |
| 665 | [go-darwin-arm64]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm64.c |
| 666 | [aosp/107467]: https://android-review.googlesource.com/c/platform/bionic/+/107467 |
| 667 | |
| 668 | ### Workaround: Use Variant 2 on arm32/arm64 |
| 669 | |
| 670 | Pros: simplifies Bionic |
| 671 | |
| 672 | Cons: |
| 673 | * arm64: requires either subtle reinterpretation of a TLS relocation or addition of a new |
| 674 | relocation |
| 675 | * arm64: a new TLS relocation reduces compiler/assembler compatibility with non-Android |
| 676 | |
| 677 | The point of variant 2 was backwards-compatibility, and ARM Android needs to remain |
| 678 | backwards-compatible, so we could use variant 2 for ARM. Problems: |
| 679 | |
| 680 | * When linking an executable, the static linker needs to know how TLS is allocated because it |
| 681 | writes TP-relative offsets for IE/LE-model accesses. Clang doesn't tell the linker to target |
| 682 | Android, so it could pass a `--tls-variant2` flag to configure lld. |
| 683 | |
| 684 | * On arm64, there are different sets of static LE relocations accommodating different ranges of |
| 685 | offsets from TP: |
| 686 | |
| 687 | Size | TP offset range | Static LE relocation types |
| 688 | ---- | ----------------- | --------------------------------------- |
| 689 | 12 | 0 <= x < 2^12 | `R_AARCH64_TLSLE_ADD_TPREL_LO12` |
| 690 | " | " | `R_AARCH64_TLSLE_LDST8_TPREL_LO12` |
| 691 | " | " | `R_AARCH64_TLSLE_LDST16_TPREL_LO12` |
| 692 | " | " | `R_AARCH64_TLSLE_LDST32_TPREL_LO12` |
| 693 | " | " | `R_AARCH64_TLSLE_LDST64_TPREL_LO12` |
| 694 | " | " | `R_AARCH64_TLSLE_LDST128_TPREL_LO12` |
| 695 | 16 | -2^16 <= x < 2^16 | `R_AARCH64_TLSLE_MOVW_TPREL_G0` |
| 696 | 24 | 0 <= x < 2^24 | `R_AARCH64_TLSLE_ADD_TPREL_HI12` |
| 697 | " | " | `R_AARCH64_TLSLE_ADD_TPREL_LO12_NC` |
| 698 | " | " | `R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC` |
| 699 | " | " | `R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC` |
| 700 | " | " | `R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC` |
| 701 | " | " | `R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC` |
| 702 | " | " | `R_AARCH64_TLSLE_LDST128_TPREL_LO12_NC` |
| 703 | 32 | -2^32 <= x < 2^32 | `R_AARCH64_TLSLE_MOVW_TPREL_G1` |
| 704 | " | " | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC` |
| 705 | 48 | -2^48 <= x < 2^48 | `R_AARCH64_TLSLE_MOVW_TPREL_G2` |
| 706 | " | " | `R_AARCH64_TLSLE_MOVW_TPREL_G1_NC` |
| 707 | " | " | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC` |
| 708 | |
| 709 | GCC for arm64 defaults to the 24-bit model and has an `-mtls-size=SIZE` option for setting other |
| 710 | supported sizes. (It supports 12, 24, 32, and 48.) Clang has only implemented the 24-bit model, |
| 711 | but that could change. (Clang [briefly used][D44355] load/store relocations, but the change was |
| 712 | reverted because no linker supported them: [BFD], [Gold], [LLD].) |
| 713 | |
| 714 | The 16-, 32-, and 48-bit models use a `movn/movz` instruction to set the highest 16 bits to a |
| 715 | positive or negative value, then `movk` to set the remaining 16-bit chunks. In principle, these |
| 716 | relocations should be able to accommodate a negative TP offset. |
| 717 | |
| 718 | The 24-bit model uses `add` to set the high 12 bits, then places the low 12 bits into another |
| 719 | `add` or a load/store instruction. |
| 720 | |
| 721 | Maybe we could modify the `R_AARCH64_TLSLE_ADD_TPREL_HI12` relocation to allow a negative TP offset |
| 722 | by converting the relocated `add` instruction to a `sub`. Alternately, we could add a new |
| 723 | `R_AARCH64_TLSLE_SUB_TPREL_HI12` relocation, and Clang would use a different TLS LE instruction |
| 724 | sequence when targeting Android/arm64. |
| 725 | |
| 726 | * LLD's arm64 relaxations from GD and IE to LE would need to use `movn` instead of `movz` for |
| 727 | Android. |
| 728 | |
| 729 | * Binaries linked with the flag crash on non-Bionic, and binaries without the flag crash on Bionic. |
| 730 | We might want to mark the binaries somehow to indicate the non-standard TLS ABI. Suggestion: |
| 731 | * Use an `--android-tls-variant2` flag (or `--bionic-tls-variant2`, since we're trying to make |
| 732 | [Bionic run on the host](http://b/31559095)) |
| 733 | * Add a `PT_ANDROID_TLS_TPOFF` segment? |
| 734 | * Add a [`.note.gnu.property`](https://reviews.llvm.org/D53906#1283425) with a |
| 735 | "`GNU_PROPERTY_TLS_TPOFF`" property value? |
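
As a rough illustration of why the 24-bit model is the sticking point, the sketch below splits a TP offset into the two 12-bit immediates that the `add`/`add` (or `add` plus load/store) sequence would carry. The struct and function names are hypothetical, and the range check simply mirrors the `0 <= x < 2^24` row of the table above; it is not LLD's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// The two immediates of the 24-bit LE sequence.
struct TprelAddPair {
  std::uint32_t hi12;  // for `add xN, tp, #hi12, lsl #12` (R_AARCH64_TLSLE_ADD_TPREL_HI12)
  std::uint32_t lo12;  // for the following `add` or load/store (..._TPREL_LO12_NC)
};

// Returns the immediates for a TP offset, or nullopt if the offset is outside
// the range the 24-bit model can encode. Negative offsets are rejected because
// both immediates are unsigned.
inline std::optional<TprelAddPair> split_tprel24(std::int64_t tp_offset) {
  if (tp_offset < 0 || tp_offset >= (std::int64_t{1} << 24)) return std::nullopt;
  return TprelAddPair{static_cast<std::uint32_t>((tp_offset >> 12) & 0xfff),
                      static_cast<std::uint32_t>(tp_offset & 0xfff)};
}
```

A variant 2 offset such as -16 has no encoding here, which is why the options above involve either turning the `add` into a `sub` or introducing a new relocation.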
| 736 | |
| 737 | [D44355]: https://reviews.llvm.org/D44355 |
| 738 | [BFD]: https://sourceware.org/bugzilla/show_bug.cgi?id=22970 |
| 739 | [Gold]: https://sourceware.org/bugzilla/show_bug.cgi?id=22969 |
| 740 | [LLD]: https://bugs.llvm.org/show_bug.cgi?id=36727 |
| 741 | |
| 742 | ### Workaround: Reserve an Extra-Large TCB on ARM |
| 743 | |
| 744 | Pros: Minimal linker change, no change to TLS relocations. |
| 745 | Cons: The reserved amount becomes an arbitrary but immutable part of the Android ABI. |
| 746 | |
| 747 | Add an lld option: `--android-tls[-tcb=SIZE]` |
| 748 | |
| 749 | As with the first workaround, we'd probably want to mark the binary to indicate the non-standard |
| 750 | TP-to-TLS-segment offset. |
| 751 | |
| 752 | Reservation amount: |
| 753 | * We would reserve at least 6 words to cover the stack guard. |
| 754 | * Reserving 16 covers all the existing Bionic slots and gives a little room for expansion. (If we |
| 755 | ever needed more than 16 slots, we could allocate the space before TP.) |
| 756 | * 16 isn't enough for the pthread keys, so the Go runtime is still a problem. |
| 757 | * Reserving 138 words is enough for existing slots and pthread keys. |
| 758 | |
| 759 | ### Workaround: Use Variant 1 Everywhere with an Extra-Large TCB |
| 760 | |
| 761 | Pros: |
| 762 | * memory layout is the same on all architectures, avoids native bridge complications |
| 763 | * x86/x86-64 relocations probably handle positive offsets without issue |
| 764 | |
| 765 | Cons: |
| 766 | * The reserved amount is still arbitrary. |
| 767 | |
| 768 | ### Workaround: No LE Model in Android Executables |
| 769 | |
| 770 | Pros: |
| 771 | * Keeps options open. We can allow LE later if we want. |
| 772 | * Bionic's existing memory layout doesn't change, and arm32 and 32-bit x86 have the same layout |
| 773 | * Fixes everything but static executables |
| 774 | |
| 775 | Cons: |
| 776 | * more intrusive toolchain changes (affects both Clang and LLD) |
| 777 | * statically-linked executables still need another workaround |
| 778 | * somewhat larger/slower executables (they must use IE, not LE) |
| 779 | |
| 780 | The layout conflict is apparently only a problem because an executable assumes that its TLS segment |
| 781 | is located at a statically-known offset from the TP (i.e. it uses the LE model). An initially-loaded |
| 782 | shared object can still use the efficient IE access model, but its TLS segment offset is known at |
| 783 | load-time, not link-time. If we can guarantee that Android's executables also use the IE model, not |
| 784 | LE, then the Bionic loader can place the executable's TLS segment at any offset from the TP, leaving |
| 785 | the existing thread-specific memory layout untouched. |
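
For reference, GCC and Clang already let source code request the IE model for an individual variable through the `tls_model` attribute (or globally with `-ftls-model=initial-exec`). The sketch below uses a hypothetical variable name. Note that this is only a request: the compiler and linker generally remain free to relax an executable's access to LE, so the attribute alone does not provide the guarantee this workaround needs.

```cpp
#include <cassert>

// Ask for the initial-exec access model for this one variable. The toolchain may
// still pick local-exec when it can prove the variable lives in the executable.
thread_local int ie_counter __attribute__((tls_model("initial-exec"))) = 0;

// Each thread increments its own copy of the counter.
int bump_ie_counter() { return ++ie_counter; }
```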
| 786 | |
| 787 | This workaround doesn't help with statically-linked executables, but they're probably less of a |
| 788 | problem, because the linker and `libc.a` are usually packaged together. |
| 789 | |
| 790 | A likely problem: LD is normally relaxed to LE, not to IE. We'd either have to disable LD usage in |
| 791 | the compiler (bad for performance) or add LD->IE relaxation. This relaxation requires that IE code |
| 792 | sequences be no larger than LD code sequences, which may not be the case on some architectures. |
| 793 | (XXX: In some past testing, it looked feasible for TLSDESC but not the traditional design.) |
| 794 | |
| 795 | To implement: |
| 796 | * Clang would need to stop generating LE accesses. |
| 797 | * LLD would need to relax GD and LD to IE instead of LE. |
| 798 | * LLD should abort if it sees a TLS LE relocation. |
| 799 | * LLD must not statically resolve an executable's IE relocation in the GOT. (It might assume that |
| 800 | it knows its value.) |
| 801 | * Perhaps LLD should mark executables specially, because a normal ELF linker's output would quietly |
| 802 | trample on `pthread_internal_t`. We need something like `DF_STATIC_TLS`, but instead of |
| 803 | indicating IE in an solib, we want to indicate the lack of LE in an executable. |
| 804 | |
| 805 | ### (Non-)workaround for Go: Allocate a Slot with Go's Magic Values |
| 806 | |
| 807 | The Go runtime allocates its thread-local "g" variable by searching for a hard-coded magic constant |
| 808 | (`0x23581321` for arm32 and `0x23581321345589` for arm64). As long as it finds its constant at a |
| 809 | small positive offset from TP (within the first 384 words), it will think it has found the pthread |
| 810 | key it allocated. |
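
The search can be modeled over an ordinary array, as in the sketch below. This is a simulation with hypothetical names, not the Go runtime's code: `tp` stands in for the thread pointer, and the constant is the arm32 magic value quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kGoMagic = 0x23581321;  // arm32 magic value from the text
constexpr std::size_t kGoScanWords = 384;        // scan limit from the text

// Scans forward from `tp` for the magic value, looking at no more than
// kGoScanWords words (and never past `available_words`). Returns the word index
// of the first match, or -1 if the value is not found within the scan window.
inline std::ptrdiff_t find_magic_slot(const std::uintptr_t* tp, std::size_t available_words) {
  const std::size_t limit = available_words < kGoScanWords ? available_words : kGoScanWords;
  for (std::size_t i = 0; i < limit; ++i) {
    if (tp[i] == kGoMagic) return static_cast<std::ptrdiff_t>(i);
  }
  return -1;
}
```

Anything that moves the pthread key data to a negative offset, or beyond word 384, makes this search fail.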
| 811 | |
| 812 | As a temporary compatibility hack, we might try to keep these programs running by reserving a TLS |
| 813 | slot with this magic value. This hack doesn't appear to work, however. The runtime finds its pthread |
| 814 | key, but apps segfault. Perhaps the Go runtime expects its "g" variable to be zero-initialized ([one |
| 815 | example][go-tlsg-zero]). With this hack, it's never zero, but with its current allocation strategy, |
| 816 | it is typically zero. After [Bionic's pthread key system was rewritten to be |
| 817 | lock-free][bionic-lockfree-keys] for Android M, though, it's not guaranteed, because a key could be |
| 818 | recycled. |
| 819 | |
| 820 | [go-tlsg-zero]: https://go.googlesource.com/go/+/5bc1fd42f6d185b8ff0201db09fb82886978908b/src/runtime/asm_arm64.s#980 |
| 821 | |
| 822 | ### Workaround for Go: place pthread keys after the executable's TLS |
| 823 | |
| 824 | Most Android executables do not use any `thread_local` variables. In the current prototype, with the |
| 825 | AOSP hikey960 build, only `/system/bin/netd` has a TLS segment, and it's only 32 bytes. As long as |
| 826 | `/system/bin/app_process{32,64}` limits its use of TLS memory, then the pthread keys could be |
| 827 | allocated after `app_process`' TLS segment, and Go will still find them. |
| 828 | |
| 829 | Go scans 384 words from the thread pointer. If there are at most 16 Bionic slots and 130 pthread |
| 830 | keys (2 words per key), then `app_process` can use at most 108 words of TLS memory. |
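
The budget works out as follows. The constant and function names are illustrative only; the numbers are the ones quoted in the paragraph above.

```cpp
#include <cassert>

constexpr int kGoScanWords = 384;  // words Go scans from the thread pointer
constexpr int kBionicSlots = 16;   // assumed upper bound on Bionic slots
constexpr int kPthreadKeys = 130;  // assumed upper bound on pthread keys
constexpr int kWordsPerKey = 2;    // words of per-thread data for each key

// Words left for app_process' TLS segment if the pthread keys must stay inside
// Go's scan window.
constexpr int app_process_tls_budget_words(int scan_words, int slots, int keys,
                                           int words_per_key) {
  return scan_words - slots - keys * words_per_key;
}

static_assert(app_process_tls_budget_words(kGoScanWords, kBionicSlots, kPthreadKeys,
                                           kWordsPerKey) == 108,
              "384 - 16 - 130 * 2 == 108");
```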
| 831 | |
| 832 | Drawback: In principle, this might make pthread key accesses slower, because Bionic can't assume |
| 833 | that pthread keys are at a fixed offset from the thread pointer anymore. It must load an offset from |
| 834 | somewhere (a global variable, another TLS slot, ...). `__get_thread()` already uses a TLS slot to |
| 835 | find `pthread_internal_t`, though, rather than assume a fixed offset. (XXX: I think it could be |
| 836 | optimized.) |
| 837 | |
| 838 | ## TODO: Memory Layout Querying APIs (Proposed) |
| 839 | |
| 840 | * https://sourceware.org/glibc/wiki/ThreadPropertiesAPI |
| 841 | * http://b/30609580 |
| 842 | |
| 843 | ## TODO: Sanitizers |
| 844 | |
| 845 | XXX: Maybe a sanitizer would want to intercept allocations of TLS memory, and that could be hard if |
| 846 | the loader is allocating it. |
| 847 | * It looks like glibc's ld.so re-relocates itself after loading a program, so a program's symbols |
| 848 | can interpose calls in the loader: https://sourceware.org/ml/libc-alpha/2014-01/msg00501.html |
| 849 | |
| 850 | # References |
| 851 | |
| 852 | General (and x86/x86-64) |
| 853 | * Ulrich Drepper's TLS document, ["ELF Handling For Thread-Local Storage."][drepper] Describes the |
| 854 | overall ELF TLS design and ABI details for x86 and x86-64 (as well as several other architectures |
| 855 | that Android doesn't target). |
| 856 | * Alexandre Oliva's TLSDESC proposal with details for x86 and x86-64: ["Thread-Local Storage |
| 857 | Descriptors for IA32 and AMD64/EM64T."][tlsdesc-x86] |
| 858 | * [x86 and x86-64 SystemV psABIs][psabi-x86]. |
| 859 | |
| 860 | arm32: |
| 861 | * Alexandre Oliva's TLSDESC proposal for arm32: ["Thread-Local Storage Descriptors for the ARM |
| 862 | platform."][tlsdesc-arm] |
| 863 | * ["Addenda to, and Errata in, the ABI for the ARM® Architecture."][arm-addenda] Section 3, |
| 864 | "Addendum: Thread Local Storage" has details for arm32 non-TLSDESC ELF TLS. |
| 865 | * ["Run-time ABI for the ARM® Architecture."][arm-rtabi] Documents `__aeabi_read_tp`. |
| 866 | * ["ELF for the ARM® Architecture."][arm-elf] Lists TLS relocations (traditional and TLSDESC). |
| 867 | |
| 868 | arm64: |
| 869 | * [2015 LLVM bugtracker comment][llvm22408] with an excerpt from an unnamed ARM draft specification |
| 870 | describing arm64 code sequences necessary for linker relaxation. |
| 871 | * ["ELF for the ARM® 64-bit Architecture (AArch64)."][arm64-elf] Lists TLS relocations (traditional |
| 872 | and TLSDESC). |
| 873 | |
| 874 | [drepper]: https://www.akkadia.org/drepper/tls.pdf |
| 875 | [tlsdesc-x86]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt |
| 876 | [psabi-x86]: https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI |
| 877 | [tlsdesc-arm]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt |
| 878 | [arm-addenda]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0045e/IHI0045E_ABI_addenda.pdf |
| 879 | [arm-rtabi]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf |
| 880 | [arm-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf |
| 881 | [llvm22408]: https://bugs.llvm.org/show_bug.cgi?id=22408#c10 |
| 882 | [arm64-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0056b/IHI0056B_aaelf64.pdf |