More Quick compile-time tuning: labels & branches

This CL represents a roughly 3.5% performance improvement for the
compile phase of dex2oat.  Most of the gain comes from avoiding
the generation of dex boundary LIR labels unless a debug listing
is requested.  The other significant change is moving the handling of
basic-block-ending branches from an "always generate a fall-through
branch, and then delete it if we can" model to an "only generate a
fall-through branch if we need it" model.

The data motivating these changes follow.  Note that two areas of
potentially attractive gain remain: restructuring the assembler model
and reworking the register handling utilities.  These will be
addressed in subsequent CLs.

--- data follows

The Quick compiler's assembler has shown up on profile reports a bit
more than seems reasonable.  We've tried a few quick fixes to apparently
hot portions of the code, but without much gain.  So, I've been looking at
the assembly process at a somewhat higher level.  There appear to be
several potentially good opportunities.

First, an analysis of the makeup of the LIR graph showed a surprisingly
high proportion of LIR pseudo ops.  Using the boot classpath as a basis,
we get:

   32.8% of all LIR nodes are pseudo ops.
   10.4% are LIR instructions which require pc-relative fixups.
   11.8% are LIR instructions that have been nop'd by the various
         optimization passes.

Looking only at the LIR pseudo ops, we get:
 kPseudoDalvikByteCodeBoundary 43.46%
       kPseudoNormalBlockLabel 21.14%
            kPseudoSafepointPC 20.20%
            kPseudoThrowTarget  6.94%
                 kPseudoTarget  3.03%
          kPseudoSuspendTarget  1.95%
             kPseudoMethodExit  1.26%
            kPseudoMethodEntry  1.26%
             kPseudoExportedPC  0.37%
              kPseudoCaseLabel  0.30%
                kPseudoBarrier  0.07%
         kPseudoIntrinsicRetry  0.02%
Total LIR count: 10167292

The standout here is the Dalvik opcode boundary marker.  This is just a
label inserted at the beginning of the codegen for each Dalvik bytecode.
If we're doing a verbose listing, this is also where we hang the
pretty-print disassembly string.  However, this label was also
being used as a convenient way to find the target of switch case
statements (and, I think at one point was used in the Mir->GBC conversion
process).
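
In sketch form (the pseudo-op name is real; the surrounding types and
helpers are invented for illustration), the verbose-only gating
described below looks roughly like this:

  enum LirOpcode { kPseudoDalvikByteCodeBoundary, kRealInstruction };

  struct Lir {
    LirOpcode opcode;
    const char* disasm = nullptr;  // pretty-print string, verbose only
    Lir* next = nullptr;
  };

  struct CompilationUnit {
    bool verbose = false;  // true when a debug listing was requested
    Lir* last = nullptr;
    Lir* Append(LirOpcode op) {
      Lir* lir = new Lir();
      lir->opcode = op;
      if (last != nullptr) last->next = lir;
      last = lir;
      return lir;
    }
  };

  // Called once per Dalvik bytecode, before its codegen.
  void MarkBytecodeBoundary(CompilationUnit* cu, const char* disasm) {
    if (!cu->verbose) {
      return;  // normal compiles get no boundary label at all
    }
    Lir* label = cu->Append(kPseudoDalvikByteCodeBoundary);
    label->disasm = disasm;  // hang the listing string off the label
  }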

This CL moves the use of kPseudoDalvikByteCodeBoundary labels to only
verbose listing runs, and replaces the codegen uses of the label with
the kPseudoNormalBlockLabel attached to the basic block that contains the
switch case target.  Great savings here - 14.3% reduction in the number of
LIR nodes needed.  After this CL, our LIR pseudo proportions drop to 21.6%
of all LIR.  That's still a lot, but much better.  Further
improvements are possible: combining normal labels with
kPseudoSafepointPC labels where appropriate, and perhaps reducing
memory usage by using a short-hand form for labels rather than a full
LIR node.  Also, many of the basic block labels are no longer branch
targets by the time we get to assembly - is it cheaper to delete
them, or just ignore them?
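
For the switch-case replacement mentioned above, a sketch of
resolving a case target through the containing block's label; the
block map and types are invented for illustration:

  #include <cstdint>
  #include <unordered_map>

  struct Lir;

  struct BasicBlock {
    Lir* block_label;  // the block's kPseudoNormalBlockLabel node
  };

  // Dalvik offset -> basic block starting at that offset, assumed to
  // be available from basic block construction.
  using BlockMap = std::unordered_map<uint32_t, BasicBlock*>;

  // Resolve a switch case to the label of the block containing the
  // target, instead of a per-bytecode boundary label.
  Lir* SwitchCaseTarget(const BlockMap& blocks, uint32_t dex_offset) {
    auto it = blocks.find(dex_offset);
    return (it != blocks.end()) ? it->second->block_label : nullptr;
  }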

Here's the "after" LIR pseudo op breakdown:

       kPseudoNormalBlockLabel 37.39%
            kPseudoSafepointPC 35.72%
            kPseudoThrowTarget 12.28%
                 kPseudoTarget  5.36%
          kPseudoSuspendTarget  3.45%
            kPseudoMethodEntry  2.24%
             kPseudoMethodExit  2.22%
             kPseudoExportedPC  0.65%
              kPseudoCaseLabel  0.53%
                kPseudoBarrier  0.12%
         kPseudoIntrinsicRetry  0.04%
Total LIR count: 5748232

Not done in this CL, but it will be worth experimenting with actually
deleting LIR nodes from the graph when they are optimized away, rather
than just setting the NOP bit.  Keeping them around is invaluable
during debugging - but when not debugging it may pay off if the cost of
node removal is less than the cost of traversing through dead nodes
in subsequent passes.
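
If the LIR list is doubly linked, physical removal is cheap; a
sketch, with a heavily simplified node layout:

  struct Lir {
    Lir* prev = nullptr;
    Lir* next = nullptr;
    bool is_nop = false;  // what the optimization passes set today
  };

  // O(1) per node; subsequent passes never traverse the dead node,
  // at the cost of losing it from debug dumps.
  void UnlinkDeadNode(Lir* lir) {
    if (lir->prev != nullptr) lir->prev->next = lir->next;
    if (lir->next != nullptr) lir->next->prev = lir->prev;
    lir->prev = lir->next = nullptr;
  }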

Next up (and partially in this CL - but mostly to be done in follow-on
CLs) is the overall assembly process.  Inherited from the trace JIT,
the Quick compiler has a fairly simple-minded approach to instruction
assembly.  First, a pass is made over the LIR list to assign offsets
to each instruction.  Then, the assembly pass is made - which generates
the actual machine instruction bit patterns and pushes the instruction
data into the code_buffer.  However, the code generator takes the "always
optimistic" approach to instruction selection and emits the shortest
instruction.  If, during assembly, we find that a branch or load doesn't
reach, that short-form instruction is replaced with a longer sequence.

Of course, this invalidates the previously-computed offset calculations.
Assembly thus is an iterative process: compute offsets and then assemble
until we survive an assembly pass without invalidation.  This seems
like a likely candidate for improvement.  First, I analyzed the
number of retries required, and the reason for invalidation over the
boot classpath load.
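
The retry loop, in sketch form; the widths and reach limits here are
illustrative stand-ins for the real Thumb encodings:

  #include <cstdlib>
  #include <vector>

  // Each instruction knows its current encoded size and, if it is a
  // pc-relative branch or load, its target.
  struct Ins {
    int size;               // bytes in the current (optimistic) form
    int wide_size;          // size after widening
    int max_reach;          // e.g. ~126 bytes for a Thumb cbz/cbnz
    int offset = 0;         // assigned code offset
    Ins* target = nullptr;  // non-null for pc-relative instructions
  };

  void AssignOffsets(std::vector<Ins>& code) {
    int offset = 0;
    for (Ins& ins : code) {
      ins.offset = offset;
      offset += ins.size;
    }
  }

  // Returns true if any instruction had to be widened, which
  // invalidates the offsets computed above and forces another pass.
  bool AssembleOnce(std::vector<Ins>& code) {
    bool dirty = false;
    for (Ins& ins : code) {
      if (ins.target == nullptr) continue;
      int dist = std::abs(ins.target->offset - ins.offset);
      if (dist > ins.max_reach && ins.size < ins.wide_size) {
        ins.size = ins.wide_size;  // short form replaced with long
        dirty = true;
      }
    }
    return dirty;
  }

  void Assemble(std::vector<Ins>& code) {
    do {
      AssignOffsets(code);
    } while (AssembleOnce(code));  // retry until a pass survives
  }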

The results: more than half of methods don't require a retry, and
very few require more than 1 extra pass:

5 or more: 6 of 96334
4 or more: 22 of 96334
3 or more: 140 of 96334
2 or more: 1794 of 96334  -  2%
1 or more: 40911 of 96334 - 42%
0 retries: 55423 of 96334 - 58%

The interesting group here is the one that requires 1 retry.  Looking
at the reason, we see three typical reasons:

  1.  A cbnz/cbz doesn't reach (only 7 bits of offset)
  2.  A 16-bit Thumb1 unconditional branch doesn't reach.
  3.  An unconditional branch which branches to the next instruction
      is encountered, and deleted.

The first 2 cases are the cost of the optimistic strategy - nothing
much to change there.  However, the interesting case is #3 - dead
branch elimination.  A further analysis of the single retry group showed
that 42% of the methods (16305) that required a single retry did so
*only* because of dead branch elimination.  The big question here is
why so many dead branches survive to the assembly stage.  We have
a dead branch elimination pass that is supposed to catch these -
perhaps it's not working correctly, perhaps it should run later in
the optimization process, or perhaps it should run more than once.
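
One cheap option would be a final sweep just before offset
assignment; a sketch, with heavily simplified LIR types:

  enum Op { kBranchUncond, kLabel, kOther };

  struct Lir {
    Op op;
    Lir* next = nullptr;
    Lir* target = nullptr;  // label this branch jumps to, if any
    bool is_nop = false;
  };

  // Nop any unconditional branch whose target label is reachable by
  // falling through (nothing but labels and nops in between).
  void EliminateBranchesToNext(Lir* head) {
    for (Lir* p = head; p != nullptr; p = p->next) {
      if (p->op != kBranchUncond || p->is_nop) continue;
      for (Lir* q = p->next; q != nullptr; q = q->next) {
        if (q == p->target) { p->is_nop = true; break; }
        if (q->op != kLabel && !q->is_nop) break;  // real instruction
      }
    }
  }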

Other things to consider:

  o Combine the offset generation pass with the assembly pass.  Skip
    pc-relative fixup assembly (other than assigning offset), but push
    LIR* for them into work list.  Following the main pass, zip through
    the work list and assemble the pc-relative instructions (now that we
    know the offsets).  This would significantly cut back on traversal
    costs.  (A sketch of this follows the list.)

  o Store the assembled bits into both the code buffer and the LIR.
    In the event we have to retry, only the pc-relative instructions
    would need to be assembled, and we'd finish with a pass over the
    LIR just to dump the bits into the code buffer.
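
Here's a sketch of the first bullet's worklist idea; for simplicity
it assumes fixed-width instructions whose sizes can't change after
the main pass, which sidesteps the widening problem above:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct Ins {
    Ins* target = nullptr;   // non-null for pc-relative instructions
    int offset = 0;
    uint32_t bits = 0;       // placeholder until fixup if pc-relative
    std::size_t buf_index = 0;  // slot in the code buffer
  };

  // Purely illustrative encoding of a pc-relative delta.
  uint32_t EncodePcRel(uint32_t base_bits, int delta) {
    return base_bits | (static_cast<uint32_t>(delta) & 0x00FFFFFFu);
  }

  void AssembleWithWorklist(std::vector<Ins>& code,
                            std::vector<uint32_t>& code_buffer) {
    std::vector<Ins*> fixups;
    int offset = 0;
    // Main pass: assign offsets, emit everything, and push the
    // pc-relative instructions onto the worklist.
    for (Ins& ins : code) {
      ins.offset = offset;
      offset += static_cast<int>(sizeof(uint32_t));
      ins.buf_index = code_buffer.size();
      code_buffer.push_back(ins.bits);
      if (ins.target != nullptr) fixups.push_back(&ins);
    }
    // Worklist pass: offsets are final, so only these need patching,
    // with no traversal of the rest of the LIR.
    for (Ins* ins : fixups) {
      code_buffer[ins->buf_index] =
          EncodePcRel(ins->bits, ins->target->offset - ins->offset);
    }
  }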

Change-Id: I50029d216fa14f273f02b6f1c8b6a0dde5a7d6a6
11 files changed