gemmlowp: a small self-contained low-precision GEMM library
===========================================================

This is not a full linear algebra library, only a GEMM library: it only does
general matrix multiplication ("GEMM").
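
For readers new to the term, the operation can be sketched in plain C++ as
below. This is only a naive reference loop, not gemmlowp's implementation:
the name ReferenceGemm, the row-major layout, and the choice of 8-bit unsigned
inputs with 32-bit accumulators are assumptions made for illustration. The
actual low-precision arithmetic is defined in doc/low-precision.txt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference routine, not part of gemmlowp.
// Computes C = A * B, where A is rows x depth and B is depth x cols.
// All three matrices are stored row-major in flat vectors (an assumption).
std::vector<std::int32_t> ReferenceGemm(const std::vector<std::uint8_t>& a,
                                        const std::vector<std::uint8_t>& b,
                                        std::size_t rows, std::size_t depth,
                                        std::size_t cols) {
  assert(a.size() == rows * depth);
  assert(b.size() == depth * cols);
  std::vector<std::int32_t> c(rows * cols, 0);
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t k = 0; k < depth; ++k) {
      // Widen to 32 bits before multiplying so products cannot overflow.
      const std::int32_t a_ik = a[i * depth + k];
      for (std::size_t j = 0; j < cols; ++j) {
        c[i * cols + j] += a_ik * static_cast<std::int32_t>(b[k * cols + j]);
      }
    }
  }
  return c;
}
```

The three nested loops make the O(rows * depth * cols) cost explicit, which is
the work that optimized kernels and packing code aim to speed up.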

The meaning of "low precision" is detailed in this document:
  doc/low-precision.txt

Some of the general design is explained in
  doc/design.txt


Disclaimer
==========

This is not an official Google product (experimental or otherwise), it is just
code that happens to be owned by Google.


Mailing list
============

gemmlowp-related discussion, about either development or usage, is welcome
on this Google Group (mailing list / forum):

  https://groups.google.com/forum/#!forum/gemmlowp


Portability, target platforms/architectures
===========================================

gemmlowp should be portable to any platform with some C++11 and POSIX
support. Optimized code paths for specific architectures are optional.

Required:
  C++11 (a small conservative subset of it)

Required for some features:
  * Some POSIX interfaces:
    * pthreads (for multi-threaded operation and for profiling).
    * sysconf (for multi-threaded operation to detect number of cores;
      may be bypassed).
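
As an illustration of the sysconf dependency, core detection with a bypass
might look like the sketch below. DetectCoreCount and its override parameter
are hypothetical names, not gemmlowp's API, and _SC_NPROCESSORS_ONLN is a
widely available extension (Linux, Android, macOS) rather than a name that
every POSIX system is guaranteed to provide.

```cpp
#include <cassert>
#include <unistd.h>

// Hypothetical helper, not part of gemmlowp.
// Returns the number of worker threads to use. A positive override_count
// bypasses detection; otherwise the number of online cores is queried.
int DetectCoreCount(int override_count = 0) {
  assert(override_count >= 0);
  if (override_count > 0) {
    return override_count;
  }
  const long online = sysconf(_SC_NPROCESSORS_ONLN);
  // sysconf returns -1 on failure; fall back to a single thread.
  return online > 0 ? static_cast<int>(online) : 1;
}
```

Passing an explicit count is one way the detection "may be bypassed" on
platforms where sysconf is unavailable or reports a misleading value.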

Optional:
  Architecture-specific code paths use intrinsics or inline assembly.
  See "Architecture-specific optimized code paths" below.

Architecture-specific optimized code paths
==========================================

We have some optimized code paths for specific instruction sets.
Some are written in inline assembly, some are written in C++ using
intrinsics. Both GCC and Clang are supported.

At the moment, we have a full set of optimized code paths (kernels,
packing and unpacking paths) only for ARM NEON, supporting both
ARMv7 (32-bit) and ARMv8 (64-bit).

We also have a partial set of optimized code paths (only kernels
at the moment) for Intel SSE. It supports both x86 and x86-64 but
only targets SSE4. The lack of packing/unpacking code paths means
that performance isn't optimal yet.
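
To illustrate how code paths like these are typically chosen at compile time,
the sketch below branches on macros that GCC and Clang predefine for the
target instruction set. SelectedKernelPath is a hypothetical helper, and the
macros shown are the compilers' own; gemmlowp's internal configuration macros
may differ and are not reproduced here.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper, not part of gemmlowp: reports which code path the
// compiler-predefined target macros would select at compile time.
const char* SelectedKernelPath() {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  return "neon";     // ARM NEON available (ARMv7 with NEON, or ARMv8)
#elif defined(__SSE4_1__)
  return "sse4";     // x86 or x86-64 built with SSE4.1 enabled
#else
  return "generic";  // portable C++ fallback
#endif
}
```

On x86 the "sse4" branch is only taken when the compiler is told to target
SSE4.1 (for example with -msse4.1); otherwise the generic path is reported.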

Details of what it takes to make an efficient port of gemmlowp, namely
writing a suitable GEMM kernel and accompanying packing code, are
explained in this file:
  doc/kernels.txt


Public interfaces
=================

1. gemmlowp public interface
----------------------------

gemmlowp's main public interface is in the public/ subdirectory. The
header to include is
  public/gemmlowp.h
This is a header-only library, so there is nothing to link to.

2. EightBitIntGemm standard interface
-------------------------------------

Additionally, the eight_bit_int_gemm/ subdirectory provides an
implementation of the standard EightBitIntGemm interface. The header
to include is
  eight_bit_int_gemm/eight_bit_int_gemm.h
This is *NOT* a header-only library: users need to compile and link
  eight_bit_int_gemm/eight_bit_int_gemm.cc
The API is similar to the standard BLAS GEMM interface, and implements
C = A * B. If the transpose flag for a matrix argument is false, that matrix's
memory order is treated as column-major; if the flag is true, as row-major.
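
The effect of a transpose flag on memory order can be sketched as an offset
computation. ElementOffset and its stride parameter are hypothetical names
used only for illustration; they are not part of the EightBitIntGemm
interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the EightBitIntGemm interface.
// Maps (row, col) to an offset in a flat buffer. `stride` is the distance in
// elements between consecutive columns (column-major) or between consecutive
// rows (row-major).
std::size_t ElementOffset(std::size_t row, std::size_t col, std::size_t stride,
                          bool transpose) {
  // transpose == false: column-major, entries of one column are adjacent.
  // transpose == true:  row-major, entries of one row are adjacent.
  return transpose ? row * stride + col : row + col * stride;
}
```

For a 2 x 3 matrix, the entry at row 0, column 1 sits at offset 2 in
column-major order (stride 2) and at offset 1 in row-major order (stride 3).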


Building
========

Building by manually invoking your compiler
-------------------------------------------

Because gemmlowp is so simple, working with it involves only
single-command-line compiler invocations. Therefore we expect that
most people working with gemmlowp will either manually invoke their
compiler, or write their own rules for their own preferred build
system.

Keep in mind (previous section) that gemmlowp itself is a header-only
library so there is nothing to build, and the eight_bit_int_gemm library
consists of a single eight_bit_int_gemm.cc file to build.

For an Android gemmlowp development workflow, the scripts/ directory
contains a script to build and run a program on an Android device:
  scripts/test-android.sh

Building using Bazel
--------------------

That being said, we also maintain a Bazel BUILD system as part of
gemmlowp. Its usage is not mandatory at all and is only one
possible way that gemmlowp libraries and tests may be built. If
you are interested, Bazel's home page is
  http://bazel.io/
You can get started with using Bazel to build gemmlowp targets
by first creating an empty WORKSPACE file in a parent directory,
for instance:

$ cd gemmlowp/..  # change to parent directory containing gemmlowp/
$ touch WORKSPACE # declare that to be our workspace root
$ bazel build gemmlowp:all


Testing
=======

Testing by manually building and running tests
----------------------------------------------

The test/ directory contains unit tests. The primary unit test is
  test/test.cc
Since it also covers the EightBitIntGemm interface, it needs to be
linked against
  eight_bit_int_gemm/eight_bit_int_gemm.cc
It also uses realistic data captured from a neural network run, stored in
  test/test_data.cc

Thus you'll want to pass the following list of source files to your
compiler/linker:
  test/test.cc
  eight_bit_int_gemm/eight_bit_int_gemm.cc
  test/test_data.cc

The scripts/ directory contains a script to build and run a program
on an Android device:
  scripts/test-android.sh

It expects the CXX environment variable to point to an Android toolchain's
C++ compiler, and expects source files (and optionally, cflags) as
command-line parameters. To build and run the above-mentioned main unit test,
first set CXX, e.g.:

$ export CXX=/some/toolchains/arm-linux-androideabi-4.8/bin/arm-linux-androideabi-g++

Then run:

$ ./scripts/test-android.sh \
  test/test.cc \
  eight_bit_int_gemm/eight_bit_int_gemm.cc \
  test/test_data.cc


Testing using Bazel
-------------------

Alternatively, you can use Bazel to build and run tests. See the Bazel
instructions in the above section on building. Once your Bazel workspace
is set up, you can for instance do:

$ bazel test gemmlowp:all


Troubleshooting Compilation
===========================

If you're having trouble finding the compiler, follow these instructions to
build a standalone toolchain:
  https://developer.android.com/ndk/guides/standalone_toolchain.html

Here's an example of setting up Clang 3.5:

$ export INSTALL_DIR=~/toolchains/clang-21-stl-gnu
$ $NDK/build/tools/make-standalone-toolchain.sh \
  --toolchain=arm-linux-androideabi-clang3.5 --platform=android-21 \
  --install-dir=$INSTALL_DIR
$ export CXX="$INSTALL_DIR/bin/arm-linux-androideabi-g++ \
  --sysroot=$INSTALL_DIR/sysroot"

Some compilers (e.g. the default clang++ in the same bin directory) don't
support NEON assembly. The benchmark build process will issue a warning if
support isn't detected, and you should make sure you're using a compiler like
arm-linux-androideabi-g++ that does support NEON assembly.


Benchmarking
============

The main benchmark is
  test/benchmark.cc
It doesn't need to be linked to any other source file. We recommend building
with assertions disabled (-DNDEBUG).

For example, the benchmark can be built and run on an Android device by doing:

$ ./scripts/test-android.sh test/benchmark.cc -DNDEBUG

If GEMMLOWP_TEST_PROFILE is defined then the benchmark will be built with
profiling instrumentation (which makes it slower) and will dump profiles.
See next section on profiling.


Profiling
=========

The profiling/ subdirectory offers a very simple non-interrupting sampling
profiler that only requires pthreads (no signals).

It relies on source code being instrumented with pseudo-stack labels.
See profiling/instrumentation.h.
A full example of using this profiler is given in profiling/profiler.h.
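
The idea of pseudo-stack labels can be sketched with a scoped object that
pushes a label on entry and pops it on exit. Everything below (ScopedLabel,
g_label_stack, CurrentDepth, TopLabel) is hypothetical and written only to
illustrate the concept; the real macros and types are the ones in
profiling/instrumentation.h. A real sampling profiler also needs a
thread-safe way for the sampling thread to read each worker's stack, which
this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-thread pseudo-stack of labels (not gemmlowp's data type).
thread_local std::vector<const char*> g_label_stack;

// Pushes a label for the lifetime of the enclosing scope.
class ScopedLabel {
 public:
  explicit ScopedLabel(const char* label) { g_label_stack.push_back(label); }
  ~ScopedLabel() { g_label_stack.pop_back(); }
  ScopedLabel(const ScopedLabel&) = delete;
  ScopedLabel& operator=(const ScopedLabel&) = delete;
};

// Number of labels currently active on this thread.
std::size_t CurrentDepth() { return g_label_stack.size(); }

// Innermost active label; a sampler would record this at each tick.
std::string TopLabel() {
  assert(!g_label_stack.empty());
  return g_label_stack.back();
}
```

Because labels are maintained by the instrumented code itself, a sampler only
has to read the current stack periodically and never has to interrupt the
thread or unwind a native call stack.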


Contributing
============

Contribution-related discussion is always welcome on the gemmlowp
mailing list (see above).

We try to keep a current list of TODO items in the todo/ directory.
Prospective contributors are welcome to pick one to work on, and
communicate about it on the gemmlowp mailing list.

Details of the contributing process, including legalese, are in CONTRIBUTING.

Performance goals
=================

Our performance goals differ from typical GEMM performance goals in the
following ways:

1. We care not only about speed, but also about minimizing power usage.
   We specifically care about charge usage in mobile/embedded devices.
   This implies that we care doubly about minimizing memory bandwidth usage:
   we care about it, like any GEMM, because of the impact on speed, and we
   also care about it because it is a key factor in power usage.

2. Most GEMMs are optimized primarily for large dense matrix sizes (>= 1000).
   We do care about large sizes, but we also care specifically about the
   typically smaller matrix sizes encountered in various mobile applications.
   This means that we have to optimize for all sizes, not just for large
   sizes.