use a dedicated futex object for pthread_join instead of tid field

the tid field in the pthread structure is not volatile, and really
shouldn't be, so as not to limit the compiler's ability to reorder,
merge, or split loads in code paths that may be relevant to
performance (like controlling lock ownership).

however, use of objects which are not volatile or atomic with futex
wait is inherently broken, since the compiler is free to transform a
single load into multiple loads, thereby using a different value for
the controlling expression of the loop and the value passed to the
futex syscall, leading the syscall to block instead of returning.

reportedly glibc's pthread_join was actually affected by an equivalent
issue on s390.

add a separate, dedicated join_futex object for pthread_join to use.
diff --git a/src/thread/pthread_create.c b/src/thread/pthread_create.c
index 439ee36..ac06d7a 100644
--- a/src/thread/pthread_create.c
+++ b/src/thread/pthread_create.c
@@ -282,9 +282,10 @@
 	new->robust_list.head = &new->robust_list.head;
 	new->unblock_cancel = self->cancel;
 	new->CANARY = self->CANARY;
+	new->join_futex = -1;
 
 	a_inc(&libc.threads_minus_1);
-	ret = __clone((c11 ? start_c11 : start), stack, flags, new, &new->tid, TP_ADJ(new), &new->tid);
+	ret = __clone((c11 ? start_c11 : start), stack, flags, new, &new->tid, TP_ADJ(new), &new->join_futex);
 
 	__release_ptc();