# Android ELF TLS (Draft)

Last changed by Ryan Prichard in commit 9491c54 (2018-11-09 15:18:05 -0800).
				2
				3	Internal links:
				4	* [go/android-elf-tls](http://go/android-elf-tls)
				5	* [One-pager](https://docs.google.com/document/d/1leyPTnwSs24P2LGiqnU6HetnN5YnDlZkihigi6qdf_M)
				6	* Tracking bugs: http://b/110100012, http://b/78026329
				7
				8	[TOC]
				9
				10	# Overview
				11
				12	ELF TLS is a system for automatically allocating thread-local variables with cooperation among the
				13	compiler, linker, dynamic loader, and libc.
				14
				15	Thread-local variables are declared in C and C++ with a specifier, e.g.:
				16
				17	```cpp
				18	thread_local int tls_var;
				19	```
				20
				21	At run-time, TLS variables are allocated on a module-by-module basis, where a module is a shared
				22	object or executable. At program startup, TLS for all initially-loaded modules comprises the "Static
				23	TLS Block". TLS variables within the Static TLS Block exist at fixed offsets from an
				24	architecture-specific thread pointer (TP) and can be accessed very efficiently -- typically just a
				25	few instructions. TLS variables belonging to dlopen'ed shared objects, on the other hand, may be
				26	allocated lazily, and accessing them typically requires a function call.
				27
				28	# Thread-Specific Memory Layout
				29
				30	Ulrich Drepper's ELF TLS document specifies two ways of organizing memory pointed at by the
				31	architecture-specific thread-pointer ([`__get_tls()`] in Bionic):
				32
				33	![TLS Variant 1 Layout](img/tls-variant1.png)
				34
				35	![TLS Variant 2 Layout](img/tls-variant2.png)
				36
				37	Variant 1 places the static TLS block after the TP, whereas variant 2 places it before the TP.
				38	According to Drepper, variant 2 was motivated by backwards compatibility, and variant 1 was designed
				39	for Itanium. The choice has effects on the toolchain, loader, and libc. In particular, when linking
				40	an executable, the linker needs to know where an executable's TLS segment is relative to the TP so
				41	it can correctly relocate TLS accesses. Both variants are incompatible with Bionic's current
				42	thread-specific data layout, but variant 1 is more problematic than variant 2.
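
To make the difference between the variants concrete, the sketch below computes TP-relative offsets for a list of TLS segments under each layout, following the rounding rules in Drepper's document. The `TlsSegment` struct, the function names, and the idea of passing the reserved header size as a parameter are illustrative assumptions, not Bionic code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one module's PT_TLS segment.
struct TlsSegment {
  size_t size;
  size_t align;  // must be a power of two
};

static size_t align_up(size_t value, size_t align) {
  return (value + align - 1) & ~(align - 1);
}

// Variant 1: blocks are placed after the TP, following a reserved header.
// Returns each module's (positive) offset from the TP.
std::vector<long> layout_variant1(const std::vector<TlsSegment>& mods,
                                  size_t header_size) {
  std::vector<long> offsets;
  size_t cursor = header_size;
  for (const TlsSegment& m : mods) {
    cursor = align_up(cursor, m.align);
    offsets.push_back(static_cast<long>(cursor));
    cursor += m.size;
  }
  return offsets;
}

// Variant 2: blocks are placed before the TP, growing downward.
// Returns each module's (negative) offset from the TP.
std::vector<long> layout_variant2(const std::vector<TlsSegment>& mods) {
  std::vector<long> offsets;
  size_t cursor = 0;
  for (const TlsSegment& m : mods) {
    cursor = align_up(cursor + m.size, m.align);
    offsets.push_back(-static_cast<long>(cursor));
  }
  return offsets;
}
```

For segments of (12 bytes, align 4) and (8 bytes, align 8) with a 16-byte header, variant 1 yields offsets 16 and 32, and variant 2 yields -12 and -24. Both results assume the TP itself is aligned to the largest segment alignment.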
				43
				44	Each thread has a "Dynamic Thread Vector" (DTV) with a pointer to each module's TLS block (or NULL
				45	if it hasn't been allocated yet). If the executable has a TLS segment, then it will always be module
				46	1, and its storage will always be immediately after (or before) the TP. In variant 1, the TP is
				47	expected to point immediately at the DTV pointer, whereas in variant 2, the DTV pointer's offset
				48	from TP is implementation-defined.
				49
				50	The DTV's "generation" field is used to lazily update/reallocate the DTV when new modules are loaded
				51	or unloaded.
				52
				53	[`__get_tls()`]: https://android.googlesource.com/platform/bionic/+/7245c082658182c15d2a423fe770388fec707cbc/libc/private/__get_tls.h
				54
				55	# Access Models
				56
				57	When a C/C++ file references a TLS variable, the toolchain generates instructions to find its
				58	address using a TLS "access model". The access models trade generality against efficiency. The four
				59	models are:
				60
				61	* GD: General Dynamic (aka Global Dynamic)
				62	* LD: Local Dynamic
				63	* IE: Initial Exec
				64	* LE: Local Exec
				65
				66	A TLS variable may be in a different module than the reference.
				67
				68	## General Dynamic (or Global Dynamic) (GD)
				69
A GD access can refer to a TLS variable anywhere. To access a variable `tls_var` using the
"traditional" non-TLSDESC design described in Drepper's TLS document, the compiler emits a call to
a `__tls_get_addr` function provided by libc.
				73
				74	For example, if we have this C code in a shared object:
				75
```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```
				82
				83	The toolchain generates code like this:
				84
```cpp
struct TlsIndex {
  long module; // starts counting at 1
  long offset;
};

char* get_tls_var() {
  static TlsIndex tls_var_idx = { // allocated in the .got
    R_TLS_DTPMOD(tls_var), // dynamic TP module ID
    R_TLS_DTPOFF(tls_var), // dynamic TP offset
  };
  return __tls_get_addr(&tls_var_idx);
}
```
				99
				100	`R_TLS_DTPMOD` is a dynamic relocation to the index of the module containing `tls_var`, and
				101	`R_TLS_DTPOFF` is a dynamic relocation to the offset of `tls_var` within its module's `PT_TLS`
				102	segment.
				103
				104	`__tls_get_addr` looks up `TlsIndex::module`'s entry in the DTV and adds `TlsIndex::offset` to the
				105	module's TLS block. Before it can do this, it ensures that the module's TLS block is allocated. A
				106	simple approach is to allocate memory lazily:
				107
				108	1. If the current thread's DTV generation count is less than the current global TLS generation, then
				109	`__tls_get_addr` may reallocate the DTV or free blocks for unloaded modules.
				110
				111	2. If the DTV's entry for the given module is `NULL`, then `__tls_get_addr` allocates the module's
				112	memory.
				113
				114	If an allocation fails, `__tls_get_addr` calls `abort` (like emutls).
				115
musl, on the other hand, preallocates TLS memory in `pthread_create` and in `dlopen`, and each can
report out-of-memory.
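
The two lazy-allocation steps above can be sketched as a small single-file model. Everything except `TlsIndex` is hypothetical: `ModuleInfo`, `g_modules`, `g_generation`, `model_register_module`, and `model_tls_get_addr` are names invented for the illustration. The model also takes shortcuts that a real libc cannot: it stores the DTV in a C++ `thread_local` object, ignores segment alignment, takes no lock, and never frees blocks for unloaded modules.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

struct TlsIndex {
  long module;  // starts counting at 1
  long offset;
};

// Hypothetical loader-side record for one TLS module.
struct ModuleInfo {
  const char* init_image;  // initialized data, may be shorter than size
  size_t init_size;
  size_t size;             // total size; the tail is zero-filled
};

// Hypothetical global state; a real loader updates it in dlopen/dlclose.
static std::vector<ModuleInfo> g_modules;  // g_modules[i] is module ID i + 1
static unsigned long g_generation = 0;

// Per-thread DTV model: one slot per module ID, slot 0 unused.
struct Dtv {
  unsigned long generation = 0;
  std::vector<char*> blocks;
};
static thread_local Dtv t_dtv;

char* model_tls_get_addr(const TlsIndex* idx) {
  Dtv& dtv = t_dtv;
  // Step 1: catch up with modules registered since this thread last looked.
  if (dtv.generation < g_generation) {
    dtv.blocks.resize(g_modules.size() + 1, nullptr);
    dtv.generation = g_generation;
  }
  // Step 2: allocate and initialize the module's block on first use.
  char*& block = dtv.blocks[idx->module];
  if (block == nullptr) {
    const ModuleInfo& m = g_modules[idx->module - 1];
    block = static_cast<char*>(calloc(1, m.size));
    if (block == nullptr) abort();  // abort on out-of-memory, as described above
    if (m.init_size != 0) memcpy(block, m.init_image, m.init_size);
  }
  return block + idx->offset;
}

// Hypothetical stand-in for the loader registering a new TLS module.
// Returns the new module's ID.
long model_register_module(const ModuleInfo& m) {
  g_modules.push_back(m);
  ++g_generation;
  return static_cast<long>(g_modules.size());
}
```

A caller registers a module, builds a `TlsIndex` from the returned ID and a variable offset, and passes it to `model_tls_get_addr`. Repeated calls on the same thread return addresses inside the same block.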
				118
				119	## Local Dynamic (LD)
				120
LD is a specialization of GD that's useful when a function has references to two or more TLS
variables that are all part of the same module as the reference. Instead of a call to
`__tls_get_addr` for each variable, the compiler calls `__tls_get_addr` once to get the current
module's TLS block, then adds each variable's DTPOFF to the result.
				125
				126	For example, suppose we have this C code:
				127
```cpp
static thread_local int x;
static thread_local int y;
int sum() {
  return x + y;
}
```
				135
				136	The toolchain generates code like this:
				137
```cpp
int sum() {
  static TlsIndex tls_module_idx = { // allocated in the .got
    // a dynamic relocation against symbol 0 => current module ID
    R_TLS_DTPMOD(NULL),
    0,
  };
  char* base = __tls_get_addr(&tls_module_idx);
  // These R_TLS_DTPOFF() relocations are resolved at link-time.
  int* px = (int*)(base + R_TLS_DTPOFF(x));
  int* py = (int*)(base + R_TLS_DTPOFF(y));
  return *px + *py;
}
```
				152
				153	(XXX: LD might be important for C++ `thread_local` variables -- even a single `thread_local`
				154	variable with a dynamic initializer has an associated TLS guard variable.)
				155
				156	## Initial Exec (IE)
				157
				158	If the variable is part of the Static TLS Block (i.e. the executable or an initially-loaded shared
				159	object), then its offset from the TP is known at load-time. The variable can be accessed with a few
				160	loads.
				161
				162	Example: a C file for an executable:
				163
				164	```cpp
				165	// tls_var could be defined in the executable, or it could be defined
				166	// in a shared object the executable links against.
				167	extern thread_local char tls_var;
				168	char* get_addr() { return &tls_var; }
				169	```
				170
				171	Compiles to:
				172
```cpp
// allocated in the .got, resolved at load-time with a dynamic reloc.
// Unlike DTPOFF, which is relative to the start of the module’s block,
// TPOFF is directly relative to the thread pointer.
static long tls_var_gotoff = R_TLS_TPOFF(tls_var);

char* get_addr() {
  return (char*)__get_tls() + tls_var_gotoff;
}
```
				183
				184	## Local Exec (LE)
				185
				186	LE is a specialization of IE. If the variable is not just part of the Static TLS Block, but is also
				187	part of the executable (and referenced from the executable), then a GOT access can be avoided. The
				188	IE example compiles to:
				189
```cpp
char* get_addr() {
  // R_TLS_TPOFF() is resolved at (static) link-time
  return (char*)__get_tls() + R_TLS_TPOFF(tls_var);
}
```
				196
				197	## Selecting an Access Model
				198
				199	The compiler selects an access model for each variable reference using these factors:
				200	* The absence of `-fpic` implies an executable, so use IE/LE.
				201	* Code compiled with `-fpic` could be in a shared object, so use GD/LD.
				202	* The per-file default can be overridden with `-ftls-model=<model>`.
				203	* Specifiers on the variable (`static`, `extern`, ELF visibility attributes).
				204	* A variable can be annotated with `__attribute__((tls_model(...)))`. Clang may still use a more
				205	efficient model than the one specified.
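
As a small illustration of the last item, the snippet below requests the initial-exec model for a single variable using the GCC/Clang `tls_model` attribute. The variable and function names are made up for the example. Whether initial-exec is safe depends on the module being loaded at program startup.

```cpp
// Request the initial-exec model for this variable, even if the file is
// compiled with -fpic. (Hypothetical names; GCC/Clang attribute syntax.)
thread_local int request_count
    __attribute__((tls_model("initial-exec"))) = 0;

int count_request() {
  return ++request_count;
}
```

Each thread that calls `count_request()` increments its own copy of `request_count`.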
				206
				207	# Shared Objects with Static TLS
				208
				209	Shared objects are sometimes compiled with `-ftls-model=initial-exec` (i.e. "static TLS") for better
				210	performance. On Ubuntu, for example, `libc.so.6` and `libOpenGL.so.0` are compiled this way. Shared
				211	objects using static TLS can't be loaded with `dlopen` unless libc has reserved enough surplus
				212	memory in the static TLS block. glibc reserves a kilobyte or two (`TLS_STATIC_SURPLUS`) with the
				213	intent that only a few core system libraries would use static TLS. Non-core libraries also sometimes
				214	use it, which can break `dlopen` if the surplus area is exhausted. See:
				215	* https://bugzilla.redhat.com/show_bug.cgi?id=1124987
				216	* web search: [`"dlopen: cannot load any more object with static TLS"`][glibc-static-tls-error]
				217
Neither musl nor the Bionic TLS prototype currently allocates any surplus TLS memory.
				219
				220	In general, supporting surplus TLS memory probably requires maintaining a thread list so that
				221	`dlopen` can initialize the new static TLS memory in all existing threads. A thread list could be
				222	omitted if the loader only allowed zero-initialized TLS segments and didn't reclaim memory on
				223	`dlclose`.
				224
				225	As long as a shared object is one of the initially-loaded modules, a better option is to use
				226	TLSDESC.
				227
				228	[glibc-static-tls-error]: https://www.google.com/search?q=%22dlopen:+cannot+load+any+more+object+with+static+TLS%22
				229
				230	# TLS Descriptors (TLSDESC)
				231
				232	The code fragments above match the "traditional" TLS design from Drepper's document. For the GD and
				233	LD models, there is a newer, more efficient design that uses "TLS descriptors". Each TLS variable
				234	reference has a corresponding descriptor, which contains a resolver function address and an argument
				235	to pass to the resolver.
				236
				237	For example, if we have this C code in a shared object:
				238
```cpp
extern thread_local char tls_var;
char* get_tls_var() {
  return &tls_var;
}
```
				245
				246	The toolchain generates code like this:
				247
```cpp
struct TlsDescriptor { // NB: arm32 reverses these fields
  long (*resolver)(long);
  long arg;
};

char* get_tls_var() {
  // allocated in the .got, uses a dynamic relocation
  static TlsDescriptor desc = R_TLS_DESC(tls_var);
  return (char*)__get_tls() + desc.resolver(desc.arg);
}
```
				260
				261	The dynamic loader fills in the TLS descriptors. For a reference to a variable allocated in the
				262	Static TLS Block, it can use a simple resolver function:
				263
```cpp
long static_tls_resolver(long arg) {
  return arg;
}
```
				269
				270	The loader writes `tls_var@TPOFF` into the descriptor's argument.
				271
				272	To support modules loaded with `dlopen`, the loader must use a resolver function that calls
				273	`__tls_get_addr`. In principle, this simple implementation would work:
				274
```cpp
long dynamic_tls_resolver(TlsIndex* arg) {
  return (long)__tls_get_addr(arg) - (long)__get_tls();
}
```
				280
				281	There are optimizations that complicate the design a little:
				282	* Unlike `__tls_get_addr`, the resolver function has a special calling convention that preserves
				283	almost all registers, reducing register pressure in the caller
				284	([example](https://godbolt.org/g/gywcxk)).
				285	* In general, the resolver function must call `__tls_get_addr`, so it must save and restore all
				286	registers.
				287	* To keep the fast path fast, the resolver inlines the fast path of `__tls_get_addr`.
				288	* By storing the module's initial generation alongside the TlsIndex, the resolver function doesn't
				289	need to use an atomic or synchronized access of the global TLS generation counter.
				290
				291	The resolver must be written in assembly, but in C, the function looks like so:
				292
```cpp
struct TlsDescDynamicArg {
  unsigned long first_generation;
  TlsIndex idx;
};

struct TlsDtv { // DTV == dynamic thread vector
  unsigned long generation;
  char* modules[];
};

long dynamic_tls_resolver(TlsDescDynamicArg* arg) {
  TlsDtv* dtv = __get_dtv();
  char* addr;
  if (dtv->generation >= arg->first_generation &&
      dtv->modules[arg->idx.module] != nullptr) {
    addr = dtv->modules[arg->idx.module] + arg->idx.offset;
  } else {
    addr = __tls_get_addr(&arg->idx);
  }
  return (long)addr - (long)__get_tls();
}
```
				316
				317	The loader needs to allocate a table of `TlsDescDynamicArg` objects for each TLS module with dynamic
				318	TLSDESC relocations.
				319
				320	The static linker can still relax a TLSDESC-based access to an IE/LE access.
				321
				322	The traditional TLS design is implemented everywhere, but the TLSDESC design has less toolchain
				323	support:
				324	* GCC and the BFD linker support both designs on all supported Android architectures (arm32, arm64,
				325	x86, x86-64).
* GCC can select the design on the command line using `-mtls-dialect=<dialect>` (`trad`-vs-`desc`
  on arm64, otherwise `gnu`-vs-`gnu2`). Clang always uses the default mode.
				328	* GCC and Clang default to TLSDESC on arm64 and the traditional design on other architectures.
				329	* Gold and LLD support for TLSDESC is spotty (except when targeting arm64).
				330
				331	# Linker Relaxations
				332
				333	The (static) linker frequently has more information about the location of a referenced TLS variable
				334	than the compiler, so it can "relax" TLS accesses to more efficient models. For example, if an
				335	object file compiled with `-fpic` is linked into an executable, the linker could relax GD accesses
to IE or LE. To relax a TLS access, the linker looks for an expected sequence of instructions and
				337	static relocations, then replaces the sequence with a different one of equal size. It may need to
				338	add or remove no-op instructions.
				339
				340	## Current Support for GD->LE Relaxations Across Linkers
				341
				342	Versions tested:
				343	* BFD and Gold linkers: version 2.30
				344	* LLD version 6.0.0 (upstream)
				345
				346	Linker support for GD->LE relaxation with `-mtls-dialect=gnu/trad` (traditional):
				347
Architecture    | BFD | Gold | LLD
--------------- | --- | ---- | ---
arm32           | no  | no   | no
arm64 (unusual) | yes | yes  | no
x86             | yes | yes  | yes
x86_64          | yes | yes  | yes
				354
				355	Linker support for GD->LE relaxation with `-mtls-dialect=gnu2/desc` (TLSDESC):
				356
Architecture          | BFD | Gold               | LLD
--------------------- | --- | ------------------ | ------------------
arm32 (experimental)  | yes | unsupported relocs | unsupported relocs
arm64                 | yes | yes                | yes
x86 (experimental)    | yes | yes                | unsupported relocs
x86_64 (experimental) | yes | yes                | unsupported relocs
				363
				364	arm32 linkers can't relax traditional TLS accesses. BFD can relax an arm32 TLSDESC access, but LLD
				365	can't link code using TLSDESC at all, except on arm64, where it's used by default.
				366
				367	# dlsym
				368
				369	Calling `dlsym` on a TLS variable returns the address of the current thread's variable.
				370
				371	# Debugger Support
				372
				373	## gdb
				374
				375	gdb uses a libthread_db plugin library to retrieve thread-related information from a target. This
				376	library is typically a shared object, but for Android, we link our own `libthread_db.a` into
				377	gdbserver. We will need to implement at least 2 APIs in `libthread_db.a` to find TLS variables, and
				378	gdb provides APIs for looking up symbols, reading or writing memory, and retrieving the current
				379	thread pointer (e.g. `ps_get_thread_area`).
				380	* Reference: [gdb_proc_service.h]: APIs gdb provides to libthread_db
* Reference: [Currently unimplemented TLS functions in Android's libthread_db][libthread_db.c]
				382
				383	[gdb_proc_service.h]: https://android.googlesource.com/toolchain/gdb/+/a7e49fd02c21a496095c828841f209eef8ae2985/gdb-8.0.1/gdb/gdb_proc_service.h#41
				384	[libthread_db.c]: https://android.googlesource.com/platform/ndk/+/e1f0ad12fc317c0ca3183529cc9625d3f084d981/sources/android/libthread_db/libthread_db.c#115
				385
				386	## LLDB
				387
				388	LLDB more-or-less implemented Linux TLS debugging in [r192922][rL192922] ([D1944]) for x86 and
				389	x86-64. [arm64 support came later][D5073]. However, the Linux TLS functionality no longer does
				390	anything: the `GetThreadPointer` function is no longer implemented. Code for reading the thread
				391	pointer was removed in [D10661] ([this function][r240543]). (arm32 was apparently never supported.)
				392
				393	[rL192922]: https://reviews.llvm.org/rL192922
				394	[D1944]: https://reviews.llvm.org/D1944
				395	[D5073]: https://reviews.llvm.org/D5073
				396	[D10661]: https://reviews.llvm.org/D10661
				397	[r240543]: https://github.com/llvm-mirror/lldb/commit/79246050b0f8d6b54acb5366f153d07f235d2780#diff-52dee3d148892cccfcdab28bc2165548L962
				398
				399	## Threading Library Metadata
				400
				401	Both debuggers need metadata from the threading library (`libc.so` / `libpthread.so`) to find TLS
				402	variables. From [LLDB r192922][rL192922]'s commit message:
				403
				404	> ... All OSes use basically the same algorithm (a per-module lookup table) as detailed in Ulrich
				405	> Drepper's TLS ELF ABI document, so we can easily write code to decode it ourselves. The only
				406	> question therefore is the exact field layouts required. Happily, the implementors of libpthread
				407	> expose the structure of the DTV via metadata exported as symbols from the .so itself, designed
				408	> exactly for this kind of thing. So this patch simply reads that metadata in, and re-implements
				409	> libthread_db's algorithm itself. We thereby get cross-platform TLS lookup without either requiring
				410	> third-party libraries, while still being independent of the version of libpthread being used.
				411
				412	LLDB uses these variables:
				413
Name                              | Notes
--------------------------------- | ---------------------------------------------------------------------------------------
`_thread_db_pthread_dtvp`         | Offset from TP to DTV pointer (0 for variant 1, implementation-defined for variant 2)
`_thread_db_dtv_dtv`              | Size of a DTV slot (typically/always sizeof(void*))
`_thread_db_dtv_t_pointer_val`    | Offset within a DTV slot to the pointer to the allocated TLS block (typically/always 0)
`_thread_db_link_map_l_tls_modid` | Offset of a `link_map` field containing the module's 1-based TLS module ID
				420
				421	The metadata variables are local symbols in glibc's `libpthread.so` symbol table (but not its
				422	dynamic symbol table). Debuggers can access them, but applications can't.
				423
				424	The debugger lookup process is straightforward:
				425	* Find the `link_map` object and module-relative offset for a TLS variable.
				426	* Use `_thread_db_link_map_l_tls_modid` to find the TLS variable's module ID.
				427	* Read the target thread pointer.
				428	* Use `_thread_db_pthread_dtvp` to find the thread's DTV.
				429	* Use `_thread_db_dtv_dtv` and `_thread_db_dtv_t_pointer_val` to find the desired module's block
				430	within the DTV.
				431	* Add the module-relative offset to the module pointer.
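
The lookup steps above can be sketched against a simulated target. The `TargetMemory` map, `read_word`, `ThreadDbMetadata`, and `find_tls_address` are all hypothetical names for this illustration; a real debugger would read the inferior's memory through ptrace or a remote protocol. The sketch also assumes 64-bit words and that DTV slots are indexed directly by module ID from the start of the DTV, which matches the `TlsDtv` description only loosely.

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for the target's memory: address -> 64-bit word.
using TargetMemory = std::map<uint64_t, uint64_t>;

static uint64_t read_word(const TargetMemory& mem, uint64_t addr) {
  return mem.at(addr);  // throws std::out_of_range on an unmapped address
}

// Values of the metadata symbols listed in the table above.
struct ThreadDbMetadata {
  uint64_t pthread_dtvp;          // offset from TP to the DTV pointer
  uint64_t dtv_slot_size;         // size of one DTV slot
  uint64_t dtv_pointer_val;       // offset of the block pointer within a slot
  uint64_t link_map_l_tls_modid;  // offset of the module ID within link_map
};

uint64_t find_tls_address(const TargetMemory& mem, const ThreadDbMetadata& md,
                          uint64_t link_map_addr, uint64_t tp,
                          uint64_t var_offset) {
  // Module ID of the object that defines the variable.
  uint64_t modid = read_word(mem, link_map_addr + md.link_map_l_tls_modid);
  // The thread's DTV, found through the thread pointer.
  uint64_t dtv = read_word(mem, tp + md.pthread_dtvp);
  // The module's TLS block, found through its DTV slot (assumed indexing).
  uint64_t slot = dtv + modid * md.dtv_slot_size;
  uint64_t block = read_word(mem, slot + md.dtv_pointer_val);
  // Address of the variable inside the block.
  return block + var_offset;
}
```

Like the process it models, this sketch performs no generation check, so it shares the robustness concerns discussed next.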
				432
				433	This process doesn't appear robust in the face of lazy DTV initialization -- presumably it could
				434	read past the end of an out-of-date DTV or access an unloaded module. To be robust, it needs to
				435	compare a module's initial generation count against the DTV's generation count. (XXX: Does gdb have
				436	these sorts of problems with glibc's libpthread?)
				437
				438	## Reading the Thread Pointer with Ptrace
				439
				440	There are ptrace interfaces for reading the thread pointer for each of arm32, arm64, x86, and x86-64
				441	(XXX: check 32-vs-64-bit for inferiors, debuggers, and kernels):
				442	* arm32: `PTRACE_GET_THREAD_AREA`
				443	* arm64: `PTRACE_GETREGSET`, `NT_ARM_TLS`
				444	* x86_32: `PTRACE_GET_THREAD_AREA`
				445	* x86_64: use `PTRACE_PEEKUSER` to read the `{fs,gs}_base` fields of `user_regs_struct`
				446
				447	# C/C++ Specifiers
				448
				449	C/C++ TLS variables are declared with a specifier:
				450
Specifier       | Notes
--------------- | -----------------------------------------------------------------------------------------------------------------------------
`__thread`      | - non-standard, but ubiquitous in GCC and Clang<br/> - cannot have dynamic initialization or destruction
`_Thread_local` | - a keyword standardized in C11<br/> - cannot have dynamic initialization or destruction
`thread_local`  | - C11: a macro for `_Thread_local` via `threads.h`<br/> - C++11: a keyword, allows dynamic initialization and/or destruction
				456
				457	The dynamic initialization and destruction of C++ `thread_local` variables is layered on top of ELF
				458	TLS (or emutls), so this design document mostly ignores it. Like emutls, ELF TLS variables either
				459	have a static initializer or are zero-initialized.
				460
				461	Aside: Because a `__thread` variable cannot have dynamic initialization, `__thread` is more
				462	efficient in C++ than `thread_local` when the compiler cannot see the definition of a declared TLS
				463	variable. The compiler assumes the variable could have a dynamic initializer and generates code, at
				464	each access, to call a function to initialize the variable.
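
The difference between the two specifiers shows up even within a single file. In the sketch below (all names are hypothetical), the `thread_local` variable has a dynamic initializer that runs once per thread before the variable is first used, while the `__thread` variable is restricted to a constant initializer.

```cpp
static int init_calls = 0;

static int make_value() {
  ++init_calls;
  return 42;
}

// C++ thread_local: the initializer runs once per thread, before first use.
thread_local int lazy_value = make_value();

// __thread: only constant initialization is allowed.
__thread int plain_value = 7;

int read_lazy() { return lazy_value; }
int read_plain() { return plain_value; }
```

After any number of `read_lazy()` calls on one thread, `init_calls` has been incremented exactly once for that thread.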
				465
				466	# Graceful Failure on Old Platforms
				467
				468	ELF TLS isn't implemented on older Android platforms, so dynamic executables and shared objects
				469	using it generally won't work on them. Ideally, the older platforms would reject these binaries
				470	rather than experience memory corruption at run-time.
				471
				472	Static executables aren't a problem--the necessary runtime support is part of the executable, so TLS
				473	just works.
				474
				475	XXX: Shared objects are less of a problem.
				476	* On arm32, x86, and x86_64, the loader [should reject a TLS relocation]. (XXX: I haven't verified
				477	this.)
				478	* On arm64, the primary TLS relocation (R_AARCH64_TLSDESC) is [confused with an obsolete
				479	R_AARCH64_TLS_DTPREL32 relocation][R_AARCH64_TLS_DTPREL32] and is [quietly ignored].
* Android P [added compatibility checks] for TLS symbols and `DT_TLSDESC_{GOT|PLT}` entries.
				481
				482	XXX: A dynamic executable using ELF TLS would have a PT_TLS segment and no other distinguishing
				483	marks, so running it on an older platform would result in memory corruption. Should we add something
				484	to these executables that only newer platforms recognize? (e.g. maybe an entry in .dynamic, a
				485	reference to a symbol only a new libc.so has...)
				486
				487	[should reject a TLS relocation]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2852
				488	[R_AARCH64_TLS_DTPREL32]: https://android-review.googlesource.com/c/platform/bionic/+/723696
				489	[quietly ignored]: https://android.googlesource.com/platform/bionic/+/android-8.1.0_r48/linker/linker.cpp#2784
				490	[added compatibility checks]: https://android-review.googlesource.com/c/platform/bionic/+/648760
				491
				492	# Bionic Prototype Notes
				493
				494	There is an [ELF TLS prototype] uploaded on Gerrit. It implements:
				495	* Static TLS Block allocation for static and dynamic executables
				496	* TLS for dynamically loaded and unloaded modules (`__tls_get_addr`)
				497	* TLSDESC for arm64 only
				498
				499	Missing:
				500	* `dlsym` of a TLS variable
				501	* debugger support
				502
				503	[ELF TLS prototype]: https://android-review.googlesource.com/q/topic:%22elf-tls-prototype%22+(status:open%20OR%20status:merged)
				504
				505	## Loader/libc Communication
				506
				507	The loader exposes a list of TLS modules ([`struct TlsModules`][TlsModules]) to `libc.so` using the
				508	`__libc_shared_globals` variable (see `tls_modules()` in [linker_tls.cpp][tls_modules-linker] and
				509	[elf_tls.cpp][tls_modules-libc]). `__tls_get_addr` in libc.so acquires the `TlsModules::mutex` and
				510	iterates its module list to lazily allocate and free TLS blocks.
				511
				512	[TlsModules]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.h#53
				513	[tls_modules-linker]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/linker/linker_tls.cpp#45
				514	[tls_modules-libc]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/bionic/elf_tls.cpp#49
				515
				516	## TLS Allocator
				517
				518	The prototype currently allocates a `pthread_internal_t` object and static TLS in a single mmap'ed
				519	region, along with a thread's stack if it needs one allocated. It doesn't place TLS memory on a
				520	preallocated stack (either the main thread's stack or one provided with `pthread_attr_setstack`).
				521
				522	The DTV and blocks for dlopen'ed modules are instead allocated using the Bionic loader's
				523	`LinkerMemoryAllocator`, adapted to avoid the STL and to provide `memalign`. The prototype tries to
				524	achieve async-signal safety by blocking signals and acquiring a lock.
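
A minimal sketch of the "block signals, then lock" pattern follows. It is not the prototype's code: `SignalSafeLock` and `signal_is_blocked` are hypothetical names, and the sketch uses `sigprocmask` for brevity where multi-threaded code would normally call `pthread_sigmask`.

```cpp
#include <signal.h>
#include <mutex>

// Hypothetical RAII guard: block all signals on the calling thread, then take
// the allocator lock, so a signal handler cannot re-enter the allocator while
// the interrupted code holds the lock.
class SignalSafeLock {
 public:
  explicit SignalSafeLock(std::mutex& m) : mutex_(m) {
    sigset_t all;
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &saved_);  // SIGKILL/SIGSTOP stay unblockable
    mutex_.lock();
  }
  ~SignalSafeLock() {
    mutex_.unlock();
    sigprocmask(SIG_SETMASK, &saved_, nullptr);  // restore the previous mask
  }
  SignalSafeLock(const SignalSafeLock&) = delete;
  SignalSafeLock& operator=(const SignalSafeLock&) = delete;

 private:
  std::mutex& mutex_;
  sigset_t saved_;
};

// Reports whether signo is currently blocked, without changing the mask.
static bool signal_is_blocked(int signo) {
  sigset_t current;
  sigprocmask(SIG_BLOCK, nullptr, &current);
  return sigismember(&current, signo) == 1;
}
```

The order matters: signals are blocked before the lock is taken and restored only after it is released, so no handler can run on this thread while the lock is held.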
				525
				526	There are three "entry points" to dynamically locate a TLS variable's address:
				527	* libc.so: `__tls_get_addr`
				528	* loader: TLSDESC dynamic resolver
				529	* loader: dlsym
				530
				531	The loader's entry points need to call `__tls_get_addr`, which needs to allocate memory. Currently,
				532	the prototype uses a [special function pointer] to call libc.so's `__tls_get_addr` from the loader.
				533	(This should probably be removed.)
				534
				535	The prototype currently allows for arbitrarily-large TLS variable alignment. IIRC, different
				536	implementations (glibc, musl, FreeBSD) vary in their level of respect for TLS alignment. It looks
				537	like the Bionic loader ignores segments' alignment and aligns loaded libraries to 256 KiB. See
				538	`ReserveAligned`.
				539
				540	[special function pointer]: https://android-review.googlesource.com/c/platform/bionic/+/723698/1/libc/private/bionic_globals.h#52
				541
				542	## Async-Signal Safety
				543
				544	The prototype's `__tls_get_addr` might be async-signal safe. Making it AS-safe is a good idea if
				545	it's feasible. musl's function is AS-safe, but glibc's isn't (or wasn't). Google had a patch to make
				546	glibc AS-safe back in 2012-2013. See:
				547	* https://sourceware.org/glibc/wiki/TLSandSignals
				548	* https://sourceware.org/ml/libc-alpha/2012-06/msg00335.html
				549	* https://sourceware.org/ml/libc-alpha/2013-09/msg00563.html
				550
				551	## Out-of-Memory Handling (abort)
				552
				553	The prototype lazily allocates TLS memory for dlopen'ed modules (see `__tls_get_addr`), and an
				554	out-of-memory error on a TLS access aborts the process. musl, on the other hand, preallocates TLS
				555	memory on `pthread_create` and `dlopen`, so either function can return out-of-memory. Both functions
				556	probably need to acquire the same lock.
				557
Maybe Bionic should do the same as musl? Perhaps musl's robustness argument doesn't hold for Bionic,
though, because Bionic (at least the linker) probably already aborts on OOM. musl doesn't support
`dlclose`/unloading, so it might have an easier time.
				561
				562	On the other hand, maybe lazy allocation is a feature, because not all threads will use a dlopen'ed
				563	solib's TLS variables. Drepper makes this argument in his TLS document:
				564
				565	> In addition the run-time support should avoid creating the thread-local storage if it is not
				566	> necessary. For instance, a loaded module might only be used by one thread of the many which make
				567	> up the process. It would be a waste of memory and time to allocate the storage for all threads. A
				568	> lazy method is wanted. This is not much extra burden since the requirement to handle dynamically
				569	> loaded objects already requires recognizing storage which is not yet allocated. This is the only
				570	> alternative to stopping all threads and allocating storage for all threads before letting them run
				571	> again.
				572
				573	FWIW: emutls also aborts on out-of-memory.
				574
				575	## ELF TLS Not Usable in libc
				576
				577	The dynamic loader currently can't use ELF TLS, so any part of libc linked into the loader (i.e.
				578	most of it) also can't use ELF TLS. It might be possible to lift this restriction, perhaps with
				579	specialized `__tls_get_addr` and TLSDESC resolver functions.
				580
				581	# Open Issues
				582
				583	## Bionic Memory Layout Conflicts with Common TLS Layout
				584
				585	Bionic already allocates thread-specific data in a way that conflicts with TLS variants 1 and 2:
				586	![Bionic TLS Layout in Android P](img/bionic-tls-layout-in-p.png)
				587
				588	TLS variant 1 allocates everything after the TP to ELF TLS (except the first two words), and variant
				589	2 allocates everything before the TP. Bionic currently allocates memory before and after the TP to
				590	the `pthread_internal_t` struct.
				591
				592	The `bionic_tls.h` header is marked with a warning:
				593
				594	```cpp
				595	/** WARNING WARNING WARNING
				596	**
				597	** This header file is NOT part of the public Bionic ABI/API
				598	** and should not be used/included by user-serviceable parts of
				599	** the system (e.g. applications).
				600	**
				601	** It is only provided here for the benefit of the system dynamic
				602	** linker and the OpenGL sub-system (which needs to access the
				603	** pre-allocated slot directly for performance reason).
				604	**/
				605	```
				606
				607	There are issues with rearranging this memory:
				608
				609	* `TLS_SLOT_STACK_GUARD` is used for `-fstack-protector`. The location (word #5) was initially used
				610	by GCC on x86 (and x86-64), where it is compatible with x86's TLS variant 2. We [modified Clang
				611	to use this slot for arm64 in 2016][D18632], though, and the slot isn't compatible with ARM's
				612	variant 1 layout. This change shipped in NDK r14, and the NDK's build systems (ndk-build and the
				613	CMake toolchain file) enable `-fstack-protector-strong` by default.
				614
				615	* `TLS_SLOT_TSAN` is used for more than just TSAN -- it's also used by [HWASAN and
				616	Scudo](https://reviews.llvm.org/D53906#1285002).
				617
				618	* The Go runtime allocates a thread-local "g" variable on Android by creating a pthread key and
				619	  searching for its TP-relative offset, which it assumes is nonnegative:
				620	    * On arm32/arm64, it creates a pthread key, sets it to a magic value, then scans forward from
				621	      the thread pointer looking for it. [The scan count was bumped to 384 to fix a reported
				622	      breakage happening with Android N.](https://go-review.googlesource.com/c/go/+/38636) (XXX: I
				623	      suspect the actual platform breakage happened with Android M's [lock-free pthread key
				624	      work][bionic-lockfree-keys].)
				625	    * On x86/x86-64, it uses a fixed offset from the thread pointer (TP+0xf8 or TP+0x1d0) and
				626	      creates pthread keys until one of them hits the fixed offset.
				627	    * CLs:
				628	        * arm32: https://codereview.appspot.com/106380043
				629	        * arm64: https://go-review.googlesource.com/c/go/+/17245
				630	        * x86: https://go-review.googlesource.com/c/go/+/16678
				631	        * x86-64: https://go-review.googlesource.com/c/go/+/15991
				632	    * Moving the pthread keys before the thread pointer breaks Go-based apps.
				633	    * It's unclear how many Android apps use Go. There are at least two with 1,000,000+ installs.
				634	    * [Some motivation for Go's design][golang-post], [runtime/HACKING.md][go-hacking]
				635	    * [On x86/x86-64 Darwin, Go uses a TLS slot reserved for both Go and Wine][go-darwin-x86] (On
				636	      [arm32][go-darwin-arm32]/[arm64][go-darwin-arm64] Darwin, Go scans for pthread keys like it
				637	      does on Android.)
				638
				639	* Android's "native bridge" system allows the Zygote to load an app solib of a non-native ABI. (For
				640	example, it could be used to load an arm32 solib into an x86 Zygote.) The solib is translated
				641	into the host architecture. TLS accesses in the app solib (whether ELF TLS, Bionic slots, or
				642	`pthread_internal_t` fields) become host accesses. Laying out TLS memory differently across
				643	architectures could complicate this translation.
				644
				645	* A `pthread_t` is practically just a `pthread_internal_t*`, and some apps directly access the
				646	`pthread_internal_t::tid` field. Past examples: http://b/17389248, [aosp/107467]. Reorganizing
				647	the initial `pthread_internal_t` fields could break those apps.
				648
				649	It seems easy to fix the incompatibility for variant 2 (x86 and x86_64) by splitting out the Bionic
				650	slots into a new data structure. Variant 1 is a harder problem.
				651
				652	The TLS prototype currently uses a patched LLD that uses a variant 1 TLS layout with a 16-word TCB
				653	on all architectures.
				654
				655	Aside: GCC's arm64ilp32 target uses a 32-bit unsigned offset for a TLS IE access
				656	(https://godbolt.org/z/_NIXjF). If Android ever supports this target in a configuration with
				657	variant 2 TLS, we might need to change the compiler to emit a sign-extending load.
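
To see why the extension matters, here is a minimal sketch of the two ways a 32-bit TP offset could be widened before being added to the thread pointer. It assumes the address arithmetic happens in 64-bit registers; the function names and the sample values are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Address computed when the 32-bit TP offset is zero-extended (unsigned load).
constexpr std::uint64_t AddZeroExtended(std::uint64_t tp, std::uint32_t offset32) {
  return tp + static_cast<std::uint64_t>(offset32);
}

// Address computed when the same 32 bits are sign-extended first.
constexpr std::uint64_t AddSignExtended(std::uint64_t tp, std::uint32_t offset32) {
  return tp + static_cast<std::uint64_t>(
                  static_cast<std::int64_t>(static_cast<std::int32_t>(offset32)));
}
```

With a thread pointer of `0x10000` and a stored offset of `0xFFFFFFF0` (that is, -16 as a 32-bit value), sign extension yields `0xFFF0`, the intended address just below the TP, whereas zero extension yields `0x10000FFF0`.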
				658
				659	[D18632]: https://reviews.llvm.org/D18632
				660	[bionic-lockfree-keys]: https://android-review.googlesource.com/c/platform/bionic/+/134202
				661	[golang-post]: https://groups.google.com/forum/#!msg/golang-nuts/EhndTzcPJxQ/i-w7kAMfBQAJ
				662	[go-hacking]: https://github.com/golang/go/blob/master/src/runtime/HACKING.md
				663	[go-darwin-x86]: https://github.com/golang/go/issues/23617
				664	[go-darwin-arm32]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm.c
				665	[go-darwin-arm64]: https://github.com/golang/go/blob/15c106d99305411b587ec0d9e80c882e538c9d47/src/runtime/cgo/gcc_darwin_arm64.c
				666	[aosp/107467]: https://android-review.googlesource.com/c/platform/bionic/+/107467
				667
				668	### Workaround: Use Variant 2 on arm32/arm64
				669
				670	Pros: simplifies Bionic
				671
				672	Cons:
				673	* arm64: requires either subtle reinterpretation of a TLS relocation or addition of a new
				674	relocation
				675	* arm64: a new TLS relocation reduces compiler/assembler compatibility with non-Android targets
				676
				677	The point of variant 2 was backwards-compatibility, and ARM Android needs to remain
				678	backwards-compatible, so we could use variant 2 for ARM. Problems:
				679
				680	* When linking an executable, the static linker needs to know how TLS is allocated because it
				681	writes TP-relative offsets for IE/LE-model accesses. Clang doesn't tell the linker to target
				682	Android, so it could pass a `--tls-variant2` flag to configure lld.
				683
				684	* On arm64, there are different sets of static LE relocations accommodating different ranges of
				685	offsets from TP:
				686
				687	Size | TP offset range | Static LE relocation types
				688	---- | ----------------- | ---------------------------------------
				689	12 | 0 <= x < 2^12 | `R_AARCH64_TLSLE_ADD_TPREL_LO12`
				690	" | " | `R_AARCH64_TLSLE_LDST8_TPREL_LO12`
				691	" | " | `R_AARCH64_TLSLE_LDST16_TPREL_LO12`
				692	" | " | `R_AARCH64_TLSLE_LDST32_TPREL_LO12`
				693	" | " | `R_AARCH64_TLSLE_LDST64_TPREL_LO12`
				694	" | " | `R_AARCH64_TLSLE_LDST128_TPREL_LO12`
				695	16 | -2^16 <= x < 2^16 | `R_AARCH64_TLSLE_MOVW_TPREL_G0`
				696	24 | 0 <= x < 2^24 | `R_AARCH64_TLSLE_ADD_TPREL_HI12`
				697	" | " | `R_AARCH64_TLSLE_ADD_TPREL_LO12_NC`
				698	" | " | `R_AARCH64_TLSLE_LDST8_TPREL_LO12_NC`
				699	" | " | `R_AARCH64_TLSLE_LDST16_TPREL_LO12_NC`
				700	" | " | `R_AARCH64_TLSLE_LDST32_TPREL_LO12_NC`
				701	" | " | `R_AARCH64_TLSLE_LDST64_TPREL_LO12_NC`
				702	" | " | `R_AARCH64_TLSLE_LDST128_TPREL_LO12_NC`
				703	32 | -2^32 <= x < 2^32 | `R_AARCH64_TLSLE_MOVW_TPREL_G1`
				704	" | " | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC`
				705	48 | -2^48 <= x < 2^48 | `R_AARCH64_TLSLE_MOVW_TPREL_G2`
				706	" | " | `R_AARCH64_TLSLE_MOVW_TPREL_G1_NC`
				707	" | " | `R_AARCH64_TLSLE_MOVW_TPREL_G0_NC`
				708
				709	GCC for arm64 defaults to the 24-bit model and has an `-mtls-size=SIZE` option for setting other
				710	supported sizes. (It supports 12, 24, 32, and 48.) Clang has only implemented the 24-bit model,
				711	but that could change. (Clang [briefly used][D44355] load/store relocations, but it was reverted
				712	because no linker supported them: [BFD], [Gold], [LLD]).
				713
				714	The 16-, 32-, and 48-bit models use a `movn/movz` instruction to set the highest 16 bits to a
				715	positive or negative value, then `movk` to set the remaining 16 bit chunks. In principle, these
				716	relocations should be able to accommodate a negative TP offset.
				717
				718	The 24-bit model uses `add` to set the high 12 bits, then places the low 12 bits into another
				719	`add` or a load/store instruction.
				720
				721	Maybe we could modify the `R_AARCH64_TLSLE_ADD_TPREL_HI12` relocation to allow a negative TP offset
				722	by converting the relocated `add` instruction to a `sub`. Alternately, we could add a new
				723	`R_AARCH64_TLSLE_SUB_TPREL_HI12` relocation, and Clang would use a different TLS LE instruction
				724	sequence when targeting Android/arm64.
				725
				726	* LLD's arm64 relaxations from GD and IE to LE would need to use `movn` instead of `movz` for
				727	Android.
				728
				729	* Binaries linked with the flag crash on non-Bionic, and binaries without the flag crash on Bionic.
				730	We might want to mark the binaries somehow to indicate the non-standard TLS ABI. Suggestion:
				731	    * Use an `--android-tls-variant2` flag (or `--bionic-tls-variant2`, we're trying to make [Bionic
				732	      run on the host](http://b/31559095))
				733	    * Add a `PT_ANDROID_TLS_TPOFF` segment?
				734	    * Add a [`.note.gnu.property`](https://reviews.llvm.org/D53906#1283425) with a
				735	      "`GNU_PROPERTY_TLS_TPOFF`" property value?
				736
				737	[D44355]: https://reviews.llvm.org/D44355
				738	[BFD]: https://sourceware.org/bugzilla/show_bug.cgi?id=22970
				739	[Gold]: https://sourceware.org/bugzilla/show_bug.cgi?id=22969
				740	[LLD]: https://bugs.llvm.org/show_bug.cgi?id=36727
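
The arm64 relocation ranges in this section's table can be restated as a small sketch. `SmallestLeModel` picks the smallest "Size" row whose TP offset range contains a given offset, and `SplitTprel24` shows how the 24-bit model divides an offset between the `HI12` and `LO12_NC` relocations. Both function names and the `Hi12Lo12` struct are hypothetical; only the ranges come from the table.

```cpp
#include <cassert>
#include <cstdint>

// Smallest "Size" row of the table whose TP offset range contains x,
// or 0 if no row fits. Ranges are copied from the table.
constexpr int SmallestLeModel(std::int64_t x) {
  return (x >= 0 && x < (INT64_C(1) << 12)) ? 12
       : (x >= -(INT64_C(1) << 16) && x < (INT64_C(1) << 16)) ? 16
       : (x >= 0 && x < (INT64_C(1) << 24)) ? 24
       : (x >= -(INT64_C(1) << 32) && x < (INT64_C(1) << 32)) ? 32
       : (x >= -(INT64_C(1) << 48) && x < (INT64_C(1) << 48)) ? 48
       : 0;
}

// Immediates for the 24-bit sequence:
//   add x0, tp, #hi12, lsl #12   (R_AARCH64_TLSLE_ADD_TPREL_HI12)
//   add x0, x0, #lo12            (R_AARCH64_TLSLE_ADD_TPREL_LO12_NC)
struct Hi12Lo12 {
  std::uint32_t hi12;
  std::uint32_t lo12;
};

// Splits a TP offset for the 24-bit model. Negative offsets are rejected,
// matching the 0 <= x < 2^24 range in the table.
inline Hi12Lo12 SplitTprel24(std::int64_t x) {
  assert(x >= 0 && x < (INT64_C(1) << 24));
  return Hi12Lo12{static_cast<std::uint32_t>(x >> 12),
                  static_cast<std::uint32_t>(x & 0xfff)};
}
```

For example, `SmallestLeModel(-16)` returns 16 and `SmallestLeModel(-(1 << 20))` returns 32. No negative offset ever maps to the 24-bit model, which is the only one Clang implements, so a variant 2 layout on arm64 would need one of the changes discussed above.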
				741
				742	### Workaround: Reserve an Extra-Large TCB on ARM
				743
				744	Pros: Minimal linker change, no change to TLS relocations.
				745	Cons: The reserved amount becomes an arbitrary but immutable part of the Android ABI.
				746
				747	Add an lld option: `--android-tls[-tcb=SIZE]`
				748
				749	As with the first workaround, we'd probably want to mark the binary to indicate the non-standard
				750	TP-to-TLS-segment offset.
				751
				752	Reservation amount:
				753	* We would reserve at least 6 words to cover the stack guard
				754	* Reserving 16 covers all the existing Bionic slots and gives a little room for expansion. (If we
				755	ever needed more than 16 slots, we could allocate the space before TP.)
				756	* 16 isn't enough for the pthread keys, so the Go runtime is still a problem.
				757	* Reserving 138 words is enough for existing slots and pthread keys.
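
The effect of the reservation amount can be sketched as follows. This is a simplified model that assumes the executable's TLS segment starts at the reserved TCB size rounded up to the segment's alignment; it ignores any further alignment subtleties. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Rounds value up to a multiple of alignment, which must be a power of two.
inline std::size_t AlignUp(std::size_t value, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return (value + alignment - 1) & ~(alignment - 1);
}

// TP-relative byte offset of the executable's TLS segment in a variant 1
// layout that reserves tcb_words words after the TP (simplified model).
inline std::size_t Variant1SegmentOffset(std::size_t tcb_words, std::size_t word_size,
                                         std::size_t segment_align) {
  return AlignUp(tcb_words * word_size, segment_align);
}

// True if a reservation of tcb_words words covers the given slot index.
inline bool ReservationCoversSlot(std::size_t tcb_words, std::size_t slot_index) {
  return slot_index < tcb_words;
}
```

With 8-byte words, a 16-word reservation puts an 8-byte-aligned segment at TP+128, and a 6-word reservation is the smallest that covers slot 5 (the stack guard).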
				758
				759	### Workaround: Use Variant 1 Everywhere with an Extra-Large TCB
				760
				761	Pros:
				762	* memory layout is the same on all architectures, avoids native bridge complications
				763	* x86/x86-64 relocations probably handle positive offsets without issue
				764
				765	Cons:
				766	* The reserved amount is still arbitrary.
				767
				768	### Workaround: No LE Model in Android Executables
				769
				770	Pros:
				771	* Keeps options open. We can allow LE later if we want.
				772	* Bionic's existing memory layout doesn't change, and arm32 and 32-bit x86 have the same layout
				773	* Fixes everything but static executables
				774
				775	Cons:
				776	* more intrusive toolchain changes (affects both Clang and LLD)
				777	* statically-linked executables still need another workaround
				778	* somewhat larger/slower executables (they must use IE, not LE)
				779
				780	The layout conflict is apparently only a problem because an executable assumes that its TLS segment
				781	is located at a statically-known offset from the TP (i.e. it uses the LE model). An initially-loaded
				782	shared object can still use the efficient IE access model, but its TLS segment offset is known at
				783	load-time, not link-time. If we can guarantee that Android's executables also use the IE model, not
				784	LE, then the Bionic loader can place the executable's TLS segment at any offset from the TP, leaving
				785	the existing thread-specific memory layout untouched.
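
The difference between the two models can be mimicked in ordinary C++. The sketch below is a simulation, not real TLS code: `FakeThread` stands in for a thread's memory, `kLinkTimeOffset` plays the role of an LE offset baked in at link time, and `got_entry` plays the role of a GOT slot that the loader fills in. All of these names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one thread's memory; `tp` is the index within
// `block` that plays the role of the thread pointer.
struct FakeThread {
  std::vector<unsigned char> block;
  std::size_t tp;
};

// Reads an int located tp_offset bytes from the simulated thread pointer.
inline int ReadInt(const FakeThread& t, std::ptrdiff_t tp_offset) {
  const std::ptrdiff_t index = static_cast<std::ptrdiff_t>(t.tp) + tp_offset;
  assert(index >= 0 && static_cast<std::size_t>(index) + sizeof(int) <= t.block.size());
  int value;
  std::memcpy(&value, t.block.data() + index, sizeof(value));
  return value;
}

// Writes an int located tp_offset bytes from the simulated thread pointer.
inline void WriteInt(FakeThread& t, std::ptrdiff_t tp_offset, int value) {
  const std::ptrdiff_t index = static_cast<std::ptrdiff_t>(t.tp) + tp_offset;
  assert(index >= 0 && static_cast<std::size_t>(index) + sizeof(int) <= t.block.size());
  std::memcpy(t.block.data() + index, &value, sizeof(value));
}

// LE-style access: the offset is a constant fixed when the executable is linked.
constexpr std::ptrdiff_t kLinkTimeOffset = 16;
inline int LoadLE(const FakeThread& t) { return ReadInt(t, kLinkTimeOffset); }

// IE-style access: the offset is read from a GOT-like entry set at load time.
inline int LoadIE(const FakeThread& t, const std::ptrdiff_t* got_entry) {
  return ReadInt(t, *got_entry);
}
```

With a 256-byte block and `tp` at index 64, a loader could set the GOT entry to 128 or to -32 and `LoadIE` would follow it, whereas `LoadLE` always reads at offset 16. That freedom is what would let the Bionic loader keep its existing layout.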
				786
				787	This workaround doesn't help with statically-linked executables, but they're probably less of a
				788	problem, because the linker and `libc.a` are usually packaged together.
				789
				790	A likely problem: LD is normally relaxed to LE, not to IE. We'd either have to disable LD usage in
				791	the compiler (bad for performance) or add LD->IE relaxation. This relaxation requires that IE code
				792	sequences be no larger than LD code sequences, which may not be the case on some architectures.
				793	(XXX: In some past testing, it looked feasible for TLSDESC but not the traditional design.)
				794
				795	To implement:
				796	* Clang would need to stop generating LE accesses.
				797	* LLD would need to relax GD and LD to IE instead of LE.
				798	* LLD should abort if it sees a TLS LE relocation.
				799	* LLD must not statically resolve an executable's IE relocation in the GOT. (It might assume that
				800	it knows its value.)
				801	* Perhaps LLD should mark executables specially, because a normal ELF linker's output would quietly
				802	trample on `pthread_internal_t`. We need something like `DF_STATIC_TLS`, but instead of
				803	indicating IE in an solib, we want to indicate the lack of LE in an executable.
				804
				805	### (Non-)workaround for Go: Allocate a Slot with Go's Magic Values
				806
				807	The Go runtime allocates its thread-local "g" variable by searching for a hard-coded magic constant
				808	(`0x23581321` for arm32 and `0x23581321345589` for arm64). As long as it finds its constant at a
				809	small positive offset from TP (within the first 384 words), it will think it has found the pthread
				810	key it allocated.
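
The scan described above can be approximated with a short sketch. It operates on a caller-supplied array rather than a real thread pointer, and `ScanForMagic` is a hypothetical name; the Go runtime's actual implementation is in the CLs linked earlier in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scans max_words words forward from a (simulated) thread pointer and returns
// the index of the first word equal to magic, or -1 if none is found.
inline std::ptrdiff_t ScanForMagic(const std::uintptr_t* tp, std::size_t max_words,
                                   std::uintptr_t magic) {
  assert(tp != nullptr);
  for (std::size_t i = 0; i < max_words; ++i) {
    if (tp[i] == magic) return static_cast<std::ptrdiff_t>(i);
  }
  return -1;
}
```

Because the loop only looks at nonnegative indices, a key stored before the thread pointer is never found, even when it is within 384 words of the TP. Any word holding the magic value within the scan window is accepted, whether or not it is the key's real storage.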
				811
				812	As a temporary compatibility hack, we might try to keep these programs running by reserving a TLS
				813	slot with this magic value. This hack doesn't appear to work, however. The runtime finds its pthread
				814	key, but apps segfault. Perhaps the Go runtime expects its "g" variable to be zero-initialized ([one
				815	example][go-tlsg-zero]). With this hack, it's never zero, but with its current allocation strategy,
				816	it is typically zero. After [Bionic's pthread key system was rewritten to be
				817	lock-free][bionic-lockfree-keys] for Android M, though, it's not guaranteed, because a key could be
				818	recycled.
				819
				820	[go-tlsg-zero]: https://go.googlesource.com/go/+/5bc1fd42f6d185b8ff0201db09fb82886978908b/src/runtime/asm_arm64.s#980
				821
				822	### Workaround for Go: place pthread keys after the executable's TLS
				823
				824	Most Android executables do not use any `thread_local` variables. In the current prototype, with the
				825	AOSP hikey960 build, only `/system/bin/netd` has a TLS segment, and it's only 32 bytes. As long as
				826	`/system/bin/app_process{32,64}` limits its use of TLS memory, then the pthread keys could be
				827	allocated after `app_process`' TLS segment, and Go will still find them.
				828
				829	Go scans 384 words from the thread pointer. If there are at most 16 Bionic slots and 130 pthread
				830	keys (2 words per key), then `app_process` can use at most 108 words of TLS memory.
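
The 108-word figure follows from simple arithmetic, spelled out here as a sketch. The constant and function names are hypothetical; the values are the ones quoted in the paragraph above.

```cpp
#include <cassert>

constexpr int kGoScanWords = 384;  // words Go scans forward from the TP
constexpr int kBionicSlots = 16;   // upper bound on Bionic TLS slots
constexpr int kPthreadKeys = 130;  // upper bound on pthread keys
constexpr int kWordsPerKey = 2;    // each key occupies two words

// Words left for app_process' TLS segment if the slots come first and the
// pthread keys must still end within Go's scan window.
constexpr int AppProcessTlsBudgetWords(int scan_words, int slots, int keys,
                                       int words_per_key) {
  return scan_words - slots - keys * words_per_key;
}

static_assert(AppProcessTlsBudgetWords(kGoScanWords, kBionicSlots, kPthreadKeys,
                                       kWordsPerKey) == 108,
              "384 - 16 - 130 * 2 == 108");
```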
				831
				832	Drawback: In principle, this might make pthread key accesses slower, because Bionic can't assume
				833	that pthread keys are at a fixed offset from the thread pointer anymore. It must load an offset from
				834	somewhere (a global variable, another TLS slot, ...). `__get_thread()` already uses a TLS slot to
				835	find `pthread_internal_t`, though, rather than assume a fixed offset. (XXX: I think it could be
				836	optimized.)
				837
				838	## TODO: Memory Layout Querying APIs (Proposed)
				839
				840	* https://sourceware.org/glibc/wiki/ThreadPropertiesAPI
				841	* http://b/30609580
				842
				843	## TODO: Sanitizers
				844
				845	XXX: Maybe a sanitizer would want to intercept allocations of TLS memory, and that could be hard if
				846	the loader is allocating it.
				847	* It looks like glibc's ld.so re-relocates itself after loading a program, so a program's symbols
				848	can interpose calls in the loader: https://sourceware.org/ml/libc-alpha/2014-01/msg00501.html
				849
				850	# References
				851
				852	General (and x86/x86-64)
				853	* Ulrich Drepper's TLS document, ["ELF Handling For Thread-Local Storage."][drepper] Describes the
				854	overall ELF TLS design and ABI details for x86 and x86-64 (as well as several other architectures
				855	that Android doesn't target).
				856	* Alexandre Oliva's TLSDESC proposal with details for x86 and x86-64: ["Thread-Local Storage
				857	Descriptors for IA32 and AMD64/EM64T."][tlsdesc-x86]
				858	* [x86 and x86-64 SystemV psABIs][psabi-x86].
				859
				860	arm32:
				861	* Alexandre Oliva's TLSDESC proposal for arm32: ["Thread-Local Storage Descriptors for the ARM
				862	platform."][tlsdesc-arm]
				863	* ["Addenda to, and Errata in, the ABI for the ARM® Architecture."][arm-addenda] Section 3,
				864	"Addendum: Thread Local Storage" has details for arm32 non-TLSDESC ELF TLS.
				865	* ["Run-time ABI for the ARM® Architecture."][arm-rtabi] Documents `__aeabi_read_tp`.
				866	* ["ELF for the ARM® Architecture."][arm-elf] Lists TLS relocations (traditional and TLSDESC).
				867
				868	arm64:
				869	* [2015 LLVM bugtracker comment][llvm22408] with an excerpt from an unnamed ARM draft specification
				870	describing arm64 code sequences necessary for linker relaxation
				871	* ["ELF for the ARM® 64-bit Architecture (AArch64)."][arm64-elf] Lists TLS relocations (traditional
				872	and TLSDESC).
				873
				874	[drepper]: https://www.akkadia.org/drepper/tls.pdf
				875	[tlsdesc-x86]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-x86.txt
				876	[psabi-x86]: https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI
				877	[tlsdesc-arm]: https://www.fsfla.org/~lxoliva/writeups/TLS/RFC-TLSDESC-ARM.txt
				878	[arm-addenda]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0045e/IHI0045E_ABI_addenda.pdf
				879	[arm-rtabi]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0043d/IHI0043D_rtabi.pdf
				880	[arm-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044f/IHI0044F_aaelf.pdf
				881	[llvm22408]: https://bugs.llvm.org/show_bug.cgi?id=22408#c10
				882	[arm64-elf]: http://infocenter.arm.com/help/topic/com.arm.doc.ihi0056b/IHI0056B_aaelf64.pdf