[MCA] Show aggregate over Average Wait times for the whole snippet (PR43219)

Summary:
As disscused in https://bugs.llvm.org/show_bug.cgi?id=43219,
i believe it may be somewhat useful to show //some// aggregates
over all the sea of statistics provided.

Example:
```
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     3     1.0    1.0    4.7       vmulps     %xmm0, %xmm1, %xmm2
1.     3     2.7    0.0    2.3       vhaddps    %xmm2, %xmm2, %xmm3
2.     3     6.0    0.0    0.0       vhaddps    %xmm3, %xmm3, %xmm4
       3     3.2    0.3    2.3       <total>
```
I.e. we average the averages.

Reviewers: andreadb, mattd, RKSimon

Reviewed By: andreadb

Subscribers: gbedwell, arphaman, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68714

llvm-svn: 374361
diff --git a/llvm/docs/CommandGuide/llvm-mca.rst b/llvm/docs/CommandGuide/llvm-mca.rst
index 4f8704a..efe29c4 100644
--- a/llvm/docs/CommandGuide/llvm-mca.rst
+++ b/llvm/docs/CommandGuide/llvm-mca.rst
@@ -523,6 +523,7 @@
   0.     3     1.0    1.0    3.3       vmulps	%xmm0, %xmm1, %xmm2
   1.     3     3.3    0.7    1.0       vhaddps	%xmm2, %xmm2, %xmm3
   2.     3     5.7    0.0    0.0       vhaddps	%xmm3, %xmm3, %xmm4
+         3     3.3    0.5    1.4       <total>
 
 The timeline view is interesting because it shows instruction state changes
 during execution.  It also gives an idea of how the tool processes instructions
@@ -574,7 +575,8 @@
 
 Table *Average Wait times* helps diagnose performance issues that are caused by
 the presence of long latency instructions and potentially long data dependencies
-which may limit the ILP.  Note that :program:`llvm-mca`, by default, assumes at
+which may limit the ILP. Last row, ``<total>``, shows a global average over all
+instructions measured. Note that :program:`llvm-mca`, by default, assumes at
 least 1cy between the dispatch event and the issue event.
 
 When the performance is limited by data dependencies and/or long latency