tsan: speed up race deduplication

Race deduplication code proved to be a performance bottleneck in the past if suppressions/annotations are used, or just some races left unaddressed. And we still get user complaints about this:
https://groups.google.com/forum/#!topic/thread-sanitizer/hB0WyiTI4e4
ReportRace already has several layers of caching for racy pcs/addresses to make deduplication faster. However, ReportRace still takes a global mutex (ThreadRegistry and ReportMutex) during deduplication and also calls mmap/munmap (which take process-wide semaphore in kernel), this makes deduplication non-scalable.

This patch moves race deduplication outside of global mutexes and also removes all mmap/munmap calls.
As the result, race_stress.cc with 100 threads and 10000 iterations become 30x faster:

before:
real	0m21.673s
user	0m5.932s
sys	0m34.885s

after:
real	0m0.720s
user	0m23.646s
sys	0m1.254s

http://reviews.llvm.org/D12554

llvm-svn: 246758
diff --git a/compiler-rt/lib/tsan/rtl/tsan_rtl.h b/compiler-rt/lib/tsan/rtl/tsan_rtl.h
index e51e466..ae44bc9 100644
--- a/compiler-rt/lib/tsan/rtl/tsan_rtl.h
+++ b/compiler-rt/lib/tsan/rtl/tsan_rtl.h
@@ -458,7 +458,7 @@
 
 struct FiredSuppression {
   ReportType type;
-  uptr pc;
+  uptr pc_or_addr;
   Suppression *supp;
 };
 
@@ -480,9 +480,11 @@
 
   ThreadRegistry *thread_registry;
 
+  Mutex racy_mtx;
   Vector<RacyStacks> racy_stacks;
   Vector<RacyAddress> racy_addresses;
   // Number of fired suppressions may be large enough.
+  Mutex fired_suppressions_mtx;
   InternalMmapVector<FiredSuppression> fired_suppressions;
   DDetector *dd;
 
@@ -587,8 +589,7 @@
 
 void ReportRace(ThreadState *thr);
 bool OutputReport(ThreadState *thr, const ScopedReport &srep);
-bool IsFiredSuppression(Context *ctx, const ScopedReport &srep,
-                        StackTrace trace);
+bool IsFiredSuppression(Context *ctx, ReportType type, StackTrace trace);
 bool IsExpectedReport(uptr addr, uptr size);
 void PrintMatchedBenignRaces();