Update Clang for rebase to r212749.

This also fixes a small issue where arm_neon.h was not always generated.

Includes a cherry-pick of:
r213450 - fixes mac-specific header issue
r213126 - removes a default -Bsymbolic on Android

Change-Id: I2a790a0f5d3b2aab11de596fc3a74e7cbc99081d
diff --git a/docs/LanguageExtensions.rst b/docs/LanguageExtensions.rst
index 5e957cb..ad7821a 100644
--- a/docs/LanguageExtensions.rst
+++ b/docs/LanguageExtensions.rst
@@ -1446,6 +1446,24 @@
     return __builtin_addressof(value);
   }
 
+``__builtin_operator_new`` and ``__builtin_operator_delete``
+------------------------------------------------------------
+
+``__builtin_operator_new`` allocates memory just like a non-placement non-class
+*new-expression*. This is exactly like directly calling the normal
+non-placement ``::operator new``, except that it allows certain optimizations
+that the C++ standard does not permit for a direct function call to
+``::operator new`` (in particular, removing ``new`` / ``delete`` pairs and
+merging allocations).
+
+Likewise, ``__builtin_operator_delete`` deallocates memory just like a
+non-class *delete-expression*, and is exactly like directly calling the normal
+``::operator delete``, except that it permits optimizations. Only the unsized
+form of ``__builtin_operator_delete`` is currently available.
+
+These builtins are intended for use in the implementation of ``std::allocator``
+and other similar allocation libraries, and are only available in C++.
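+
+As a rough sketch (the ``MinimalAllocator`` name and its structure are purely
+illustrative, not part of any library), an allocator might forward to these
+builtins as follows:
+
+.. code-block:: c++
+
+  #include <cstddef>
+
+  template <class T> struct MinimalAllocator {
+    using value_type = T;
+
+    T *allocate(std::size_t n) {
+      // Behaves like ::operator new(n * sizeof(T)), but the compiler is
+      // allowed to remove or merge the allocation.
+      return static_cast<T *>(__builtin_operator_new(n * sizeof(T)));
+    }
+
+    void deallocate(T *p, std::size_t) {
+      // Unsized form; behaves like ::operator delete(p).
+      __builtin_operator_delete(p);
+    }
+  };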
+
 Multiprecision Arithmetic Builtins
 ----------------------------------
 
@@ -1562,7 +1580,9 @@
 .. code-block:: c
 
   T __builtin_arm_ldrex(const volatile T *addr);
+  T __builtin_arm_ldaex(const volatile T *addr);
   int __builtin_arm_strex(T val, volatile T *addr);
+  int __builtin_arm_stlex(T val, volatile T *addr);
   void __builtin_arm_clrex(void);
 
 The types ``T`` currently supported are:
@@ -1571,11 +1591,11 @@
 * Pointer types.
 
 Note that the compiler does not guarantee it will not insert stores which clear
-the exclusive monitor in between an ``ldrex`` and its paired ``strex``. In
-practice this is only usually a risk when the extra store is on the same cache
-line as the variable being modified and Clang will only insert stack stores on
-its own, so it is best not to use these operations on variables with automatic
-storage duration.
+the exclusive monitor in between an ``ldrex``-type operation and its paired
+``strex``. In practice this is usually only a risk when the extra store is on
+the same cache line as the variable being modified, and Clang will only insert
+stack stores on its own, so it is best not to use these operations on variables
+with automatic storage duration.
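+
+As a sketch of the intended usage, assuming a target on which these builtins
+are available (``counter`` and ``increment_counter`` are illustrative names),
+an atomic increment could be written as a retry loop:
+
+.. code-block:: c++
+
+  static volatile int counter;
+
+  void increment_counter() {
+    int value;
+    do {
+      // Load-exclusive the current value and compute the new one.
+      value = __builtin_arm_ldrex(&counter) + 1;
+      // __builtin_arm_strex returns 0 on success and non-zero if the
+      // exclusive monitor was cleared, in which case the loop retries.
+    } while (__builtin_arm_strex(value, &counter));
+  }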
 
 Also, loads and stores may be implicit in code written between the ``ldrex`` and
 ``strex``. Clang will not necessarily mitigate the effects of these either, so
@@ -1667,3 +1687,190 @@
 
 Use ``__has_feature(memory_sanitizer)`` to check if the code is being built
 with :doc:`MemorySanitizer`.
+
+
+Extensions for selectively disabling optimization
+=================================================
+
+Clang provides a mechanism for selectively disabling optimizations in functions
+and methods.
+
+To disable optimizations in a single function definition, the ``optnone``
+attribute can be used, either in its GNU style or as the non-standard C++11
+attribute ``clang::optnone``.
+
+.. code-block:: c++
+
+  // The following functions will not be optimized.
+  // GNU-style attribute
+  __attribute__((optnone)) int foo() {
+    // ... code
+  }
+  // C++11 attribute
+  [[clang::optnone]] int bar() {
+    // ... code
+  }
+
+To facilitate disabling optimization for a range of function definitions, a
+range-based pragma is provided. Its syntax is ``#pragma clang optimize``
+followed by ``off`` or ``on``.
+
+All function definitions in the region between an ``off`` and the following
+``on`` will be decorated with the ``optnone`` attribute unless doing so would
+conflict with explicit attributes already present on the function (e.g. the
+ones that control inlining).
+
+.. code-block:: c++
+
+  #pragma clang optimize off
+  // This function will be decorated with optnone.
+  int foo() {
+    // ... code
+  }
+
+  // optnone conflicts with always_inline, so bar() will not be decorated.
+  __attribute__((always_inline)) int bar() {
+    // ... code
+  }
+  #pragma clang optimize on
+
+If no ``on`` is found to close an ``off`` region, the end of the region is the
+end of the compilation unit.
+
+Note that a stray ``#pragma clang optimize on`` does not selectively enable
+additional optimizations when compiling at low optimization levels. This feature
+can only be used to selectively disable optimizations.
+
+The pragma has an effect on functions only at the point of their definition; for
+function templates, this means that the state of the pragma at the point of an
+instantiation is not necessarily relevant. Consider the following example:
+
+.. code-block:: c++
+
+  template<typename T> T twice(T t) {
+    return 2 * t;
+  }
+
+  #pragma clang optimize off
+  template<typename T> T thrice(T t) {
+    return 3 * t;
+  }
+
+  int container(int a, int b) {
+    return twice(a) + thrice(b);
+  }
+  #pragma clang optimize on
+
+In this example, the definition of the template function ``twice`` is outside
+the pragma region, whereas the definition of ``thrice`` is inside the region.
+The ``container`` function is also in the region and will not be optimized, but
+it causes the instantiation of ``twice`` and ``thrice`` with an ``int`` type; of
+these two instantiations, ``twice`` will be optimized (because its definition
+was outside the region) and ``thrice`` will not be optimized.
+
+Extensions for loop hint optimizations
+======================================
+
+The ``#pragma clang loop`` directive is used to specify hints for optimizing the
+subsequent for, while, do-while, or C++11 range-based for loop. The directive
+provides options for vectorization, interleaving, and unrolling. Loop hints can
+be specified before any loop and will be ignored if the optimization is not safe
+to apply.
+
+Vectorization and Interleaving
+------------------------------
+
+A vectorized loop performs multiple iterations of the original loop
+in parallel using vector instructions. The instruction set of the target
+processor determines which vector instructions are available and their vector
+widths. This restricts the types of loops that can be vectorized. The vectorizer
+automatically determines if the loop is safe and profitable to vectorize. A
+vector instruction cost model is used to select the vector width.
+
+Interleaving multiple loop iterations allows modern processors to further
+improve instruction-level parallelism (ILP) using advanced hardware features,
+such as multiple execution units and out-of-order execution. The vectorizer uses
+a cost model that depends on the register pressure and generated code size to
+select the interleaving count.
+
+Vectorization is enabled by ``vectorize(enable)`` and interleaving is enabled
+by ``interleave(enable)``. This is useful when compiling with ``-Os`` to
+manually enable vectorization or interleaving.
+
+.. code-block:: c++
+
+  #pragma clang loop vectorize(enable)
+  #pragma clang loop interleave(enable)
+  for(...) {
+    ...
+  }
+
+The vector width is specified by ``vectorize_width(value)`` and the interleave
+count is specified by ``interleave_count(value)``, where ``value`` is a
+positive integer. This is useful for specifying the optimal width/count for
+the set of target architectures supported by your application.
+
+.. code-block:: c++
+
+  #pragma clang loop vectorize_width(2)
+  #pragma clang loop interleave_count(2)
+  for(...) {
+    ...
+  }
+
+Specifying a width/count of 1 disables the optimization, and is equivalent to
+``vectorize(disable)`` or ``interleave(disable)``.
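+
+For example (the function and variable names here are only illustrative),
+vectorization and interleaving can be explicitly disabled for a particular
+loop:
+
+.. code-block:: c++
+
+  void copy_scalar(int *out, const int *in, int n) {
+    // Keep this loop scalar even if the vectorizer considers it profitable.
+    #pragma clang loop vectorize(disable)
+    #pragma clang loop interleave(disable)
+    for (int i = 0; i < n; ++i)
+      out[i] = in[i];
+  }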
+
+Loop Unrolling
+--------------
+
+Unrolling a loop reduces the loop control overhead and exposes more
+opportunities for ILP. Loops can be fully or partially unrolled. Full unrolling
+eliminates the loop and replaces it with an enumerated sequence of loop
+iterations. Full unrolling is only possible if the loop trip count is known at
+compile time. Partial unrolling replicates the loop body within the loop and
+reduces the trip count.
+
+If ``unroll(enable)`` is specified, the unroller will attempt to fully unroll
+the loop if the trip count is known at compile time. If the trip count is not
+known, or the fully unrolled code size is greater than the limit specified by
+the ``-pragma-unroll-threshold`` command line option, the loop will be
+partially unrolled subject to the same limit.
+
+.. code-block:: c++
+
+  #pragma clang loop unroll(enable)
+  for(...) {
+    ...
+  }
+
+The unroll count can be specified explicitly with ``unroll_count(value)``,
+where ``value`` is a positive integer. If this value is greater than the trip
+count, the loop will be fully unrolled. Otherwise the loop is partially
+unrolled subject to the ``-pragma-unroll-threshold`` limit.
+
+.. code-block:: c++
+
+  #pragma clang loop unroll_count(8)
+  for(...) {
+    ...
+  }
+
+Unrolling of a loop can be prevented by specifying ``unroll(disable)``.
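+
+For instance (again with illustrative names), a loop can be kept rolled like
+this:
+
+.. code-block:: c++
+
+  void accumulate(int *sum, const int *in, int n) {
+    // Ask the unroller to leave this loop alone.
+    #pragma clang loop unroll(disable)
+    for (int i = 0; i < n; ++i)
+      *sum += in[i];
+  }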
+
+Additional Information
+----------------------
+
+For convenience, multiple loop hints can be specified on a single line.
+
+.. code-block:: c++
+
+  #pragma clang loop vectorize_width(4) interleave_count(8)
+  for(...) {
+    ...
+  }
+
+If an optimization cannot be applied, any hints that apply to it will be
+ignored. For example, the hint ``vectorize_width(4)`` is ignored if the loop
+is not proven safe to vectorize. To identify and diagnose optimization issues,
+use the ``-Rpass``, ``-Rpass-missed``, and ``-Rpass-analysis`` command line
+options. See the user guide for details.