completely new barrier implementation, addressing major correctness issues

the previous implementation had at least 2 problems:

1. the case where additional threads reached the barrier before the
first wave had finished leaving it was untested and did not appear
to work.

2. threads leaving the barrier continued to access memory within the
barrier object after other threads had successfully returned from
pthread_barrier_wait. this could lead to memory corruption or crashes
if the barrier object had automatic storage in one of the waiting
threads and went out of scope before all threads finished returning,
or if one thread unmapped the memory in which the barrier object
lived.

the new implementation avoids both problems by making the barrier
state essentially local to the first thread which enters the barrier
wait, and by forcing that thread to be the last to return.
diff --git a/src/internal/pthread_impl.h b/src/internal/pthread_impl.h
index 304bf98..049f5df 100644
--- a/src/internal/pthread_impl.h
+++ b/src/internal/pthread_impl.h
@@ -68,10 +68,10 @@
 #define _rw_readers __u.__i[1]
 #define _rw_waiters __u.__i[2]
 #define _rw_owner __u.__i[3]
-#define _b_count __u.__i[0]
-#define _b_limit __u.__i[1]
-#define _b_left __u.__i[2]
-#define _b_waiters __u.__i[3]
+#define _b_inst __u.__p[0]
+#define _b_limit __u.__i[2]
+#define _b_lock __u.__i[3]
+#define _b_waiters __u.__i[4]
 
 #include "pthread_arch.h"