Block library shutdown until unreaped threads finish spin-waiting

This change fixes a possible invalid access to internal data structures during
library shutdown.  In a heavily oversubscribed situation, the shutdown sequence
can reach the point where resources are deallocated while some threads are
still in their final spin-wait loop.  The loop added to __kmp_internal_end()
checks whether any thread is still busy-waiting (its th_blocking flag is set)
and blocks the shutdown sequence until no such thread remains.  Two versions of
kmp_wait_template() are now used to minimize the performance impact.

Patch by Hansang Bae

Differential Revision: https://reviews.llvm.org/D49452

llvm-svn: 337486
diff --git a/openmp/runtime/src/kmp_runtime.cpp b/openmp/runtime/src/kmp_runtime.cpp
index 69eba21..a16d2fd 100644
--- a/openmp/runtime/src/kmp_runtime.cpp
+++ b/openmp/runtime/src/kmp_runtime.cpp
@@ -4334,6 +4334,9 @@
 
   new_thr->th.th_spin_here = FALSE;
   new_thr->th.th_next_waiting = 0;
+#if KMP_OS_UNIX
+  new_thr->th.th_blocking = false;
+#endif
 
 #if OMP_40_ENABLED && KMP_AFFINITY_SUPPORTED
   new_thr->th.th_current_place = KMP_PLACE_UNDEFINED;
@@ -5961,6 +5964,18 @@
 
     __kmp_reap_task_teams();
 
+#if KMP_OS_UNIX
+    // Threads that are not reaped should not access any resources since they
+    // are going to be deallocated soon, so the shutdown sequence should wait
+    // until all threads either exit the final spin-waiting loop or begin
+    // sleeping after the given blocktime.
+    for (i = 0; i < __kmp_threads_capacity; i++) {
+      kmp_info_t *thr = __kmp_threads[i];
+      while (thr && KMP_ATOMIC_LD_ACQ(&thr->th.th_blocking))
+        KMP_CPU_PAUSE();
+    }
+#endif
+
     for (i = 0; i < __kmp_threads_capacity; ++i) {
       // TBD: Add some checking...
       // Something like KMP_DEBUG_ASSERT( __kmp_thread[ i ] == NULL );