GL Dispatch
===========

Several factors combine to make efficient dispatch of OpenGL functions
fairly complicated. This document attempts to explain some of the issues
and introduce the reader to Mesa's implementation. Readers already
familiar with the issues around GL dispatch can safely skip ahead to the
:ref:`overview of Mesa's implementation <overview>`.

1. Complexity of GL Dispatch
----------------------------

Every GL application has at least one object called a GL *context*. This
object, which is an implicit parameter to every GL function, stores all
of the GL related state for the application. Every texture, every buffer
object, every enable, and much, much more is stored in the context.
Since an application can have more than one context, the context to be
used is selected by a window-system dependent function such as
``glXMakeContextCurrent``.

In environments that implement OpenGL on the X Window System using GLX,
every GL function, including the pointers returned by
``glXGetProcAddress``, is *context independent*. This means that no
matter what context is currently active, the same ``glVertex3fv``
function is used.

This creates the first bit of dispatch complexity. An application can
have two GL contexts. One context is a direct rendering context where
function calls are routed directly to a driver loaded within the
application's address space. The other context is an indirect rendering
context where function calls are converted to GLX protocol and sent to a
server. The same ``glVertex3fv`` has to do the right thing depending on
which context is current.

Highly optimized drivers or GLX protocol implementations may want to
change the behavior of GL functions depending on current state. For
example, ``glFogCoordf`` may operate differently depending on whether or
not fog is enabled.

In multi-threaded environments, it is possible for each thread to have a
different GL context current. This means that poor old ``glVertex3fv``
has to know which GL context is current in the thread where it is being
called.

.. _overview:

2. Overview of Mesa's Implementation
------------------------------------

Mesa uses two per-thread pointers. The first pointer stores the address
of the context current in the thread, and the second pointer stores the
address of the *dispatch table* associated with that context. The
dispatch table stores pointers to functions that actually implement
specific GL functions. Each time a new context is made current in a
thread, these pointers are updated.

The implementation of functions such as ``glVertex3fv`` becomes
conceptually simple:

- Fetch the current dispatch table pointer.
- Fetch the pointer to the real ``glVertex3fv`` function from the
  table.
- Call the real function.

This can be implemented in just a few lines of C code. The file
``src/mesa/glapi/glapitemp.h`` contains code very similar to this.

.. code-block:: c
   :caption: Sample dispatch function

   void glVertex3f(GLfloat x, GLfloat y, GLfloat z)
   {
       const struct _glapi_table * const dispatch = GET_DISPATCH();

       (*dispatch->Vertex3f)(x, y, z);
   }

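To see the whole pattern in one place, here is a minimal single-threaded
sketch. Every name in it (``demo_table``, ``demo_make_current``,
``demoVertex3f`` and so on) is hypothetical and is not part of Mesa; the
entry point returns a tag only so that the routing is observable.

.. code-block:: c
   :caption: Sketch: one entry point, two dispatch tables (hypothetical names)

   enum demo_path { DEMO_DIRECT = 1, DEMO_INDIRECT = 2 };

   /* Hypothetical miniature dispatch table with a single entry. */
   struct demo_table {
       enum demo_path (*Vertex3f)(float x, float y, float z);
   };

   /* Stand-in for a direct rendering driver function. */
   static enum demo_path demo_direct_Vertex3f(float x, float y, float z)
   {
       (void) x; (void) y; (void) z;
       return DEMO_DIRECT;
   }

   /* Stand-in for a function that would emit GLX protocol. */
   static enum demo_path demo_indirect_Vertex3f(float x, float y, float z)
   {
       (void) x; (void) y; (void) z;
       return DEMO_INDIRECT;
   }

   static const struct demo_table demo_direct_table = { demo_direct_Vertex3f };
   static const struct demo_table demo_indirect_table = { demo_indirect_Vertex3f };

   /* Assumption: a plain global is enough because only one thread exists. */
   static const struct demo_table *demo_current = &demo_direct_table;

   #define DEMO_GET_DISPATCH() demo_current

   /* Stand-in for a make-current call: swap the table pointer. */
   static void demo_make_current(const struct demo_table *table)
   {
       demo_current = table;
   }

   /* The same public entry point is used whichever context is current. */
   enum demo_path demoVertex3f(float x, float y, float z)
   {
       const struct demo_table *const dispatch = DEMO_GET_DISPATCH();

       return dispatch->Vertex3f(x, y, z);
   }

Calling ``demo_make_current(&demo_indirect_table)`` changes what
``demoVertex3f`` does without changing the function the application
calls.
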
The problem with this simple implementation is the large amount of
overhead that it adds to every GL function call.

In a multithreaded environment, a naive implementation of
``GET_DISPATCH`` involves a call to ``pthread_getspecific`` or a similar
function. Mesa provides a wrapper function called
``_glapi_get_dispatch`` that is used by default.

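As an illustration of that naive approach, the sketch below keeps the
dispatch table pointer in a POSIX thread-specific slot. The ``demo_``
names are hypothetical, and this is not the actual body of
``_glapi_get_dispatch``; it only shows why every lookup costs a library
call.

.. code-block:: c
   :caption: Sketch: per-thread lookup with ``pthread_getspecific`` (hypothetical names)

   #include <pthread.h>

   /* Hypothetical stand-in for struct _glapi_table. */
   struct demo_table { int unused; };

   static pthread_key_t demo_dispatch_key;
   static pthread_once_t demo_key_once = PTHREAD_ONCE_INIT;

   /* Create the key exactly once, whichever thread gets here first. */
   static void demo_create_key(void)
   {
       pthread_key_create(&demo_dispatch_key, NULL);
   }

   /* Record the dispatch table for the calling thread. */
   void demo_set_dispatch(struct demo_table *table)
   {
       pthread_once(&demo_key_once, demo_create_key);
       pthread_setspecific(demo_dispatch_key, table);
   }

   /* Look up the calling thread's table; NULL if none was ever set. */
   struct demo_table *demo_get_dispatch(void)
   {
       pthread_once(&demo_key_once, demo_create_key);
       return pthread_getspecific(demo_dispatch_key);
   }

Every GL entry point would pay for the ``pthread_getspecific`` call in
``demo_get_dispatch`` before reaching the real function.
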
3. Optimizations
----------------

A number of optimizations have been made over the years to diminish the
performance hit imposed by GL dispatch. This section describes these
optimizations. The benefits of each optimization and the situations
where each can or cannot be used are listed.

3.1. Dual dispatch table pointers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The vast majority of OpenGL applications use the API in a single
threaded manner. That is, the application has only one thread that makes
calls into the GL. In these cases, not only do the calls to
``pthread_getspecific`` hurt performance, but they are completely
unnecessary! It is possible to detect this common case and avoid these
calls.

Each time a new dispatch table is set, Mesa examines and records the ID
of the executing thread. If the same thread ID is always seen, Mesa
knows that the application is, from OpenGL's point of view, single
threaded.

As long as an application is single threaded, Mesa stores a pointer to
the dispatch table in a global variable called ``_glapi_Dispatch``. The
pointer is also stored in a per-thread location via
``pthread_setspecific``. When Mesa detects that an application has
become multithreaded, ``NULL`` is stored in ``_glapi_Dispatch``.

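The bookkeeping described above might look roughly like the following
sketch. All ``demo_`` names are hypothetical, and the code deliberately
omits the synchronization a real implementation would need around the
shared flags; it is meant to show the idea, not Mesa's actual code.

.. code-block:: c
   :caption: Sketch: detecting the single-threaded case (hypothetical names)

   #include <pthread.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical stand-in for struct _glapi_table. */
   struct demo_table { int unused; };

   /* Non-NULL only while every caller so far has been the same thread. */
   struct demo_table *demo_Dispatch = NULL;

   static pthread_key_t demo_key;
   static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
   static pthread_t demo_first_thread;
   static bool demo_seen_thread = false;
   static bool demo_multithreaded = false;

   static void demo_create_key(void)
   {
       pthread_key_create(&demo_key, NULL);
   }

   void demo_set_dispatch(struct demo_table *table)
   {
       pthread_once(&demo_once, demo_create_key);

       /* Remember the first thread; flag any later, different thread. */
       if (!demo_seen_thread) {
           demo_first_thread = pthread_self();
           demo_seen_thread = true;
       } else if (!pthread_equal(demo_first_thread, pthread_self())) {
           demo_multithreaded = true;
       }

       /* The per-thread slot is always kept up to date. */
       pthread_setspecific(demo_key, table);

       /* The global shortcut is only valid in the single-threaded case. */
       demo_Dispatch = demo_multithreaded ? NULL : table;
   }

Once ``demo_multithreaded`` becomes true it never reverts, so the global
pointer stays ``NULL`` for the rest of the process lifetime.
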
Using this simple mechanism the dispatch functions can detect the
multithreaded case by comparing ``_glapi_Dispatch`` to ``NULL``. The
resulting implementation of ``GET_DISPATCH`` is slightly more complex,
but it avoids the expensive ``pthread_getspecific`` call in the common
case.

.. code-block:: c
   :caption: Improved ``GET_DISPATCH`` Implementation

   /* Fast path: _glapi_Dispatch is non-NULL while single threaded. */
   #define GET_DISPATCH() \
      ((_glapi_Dispatch != NULL) \
       ? _glapi_Dispatch : pthread_getspecific(_glapi_Dispatch_key))

3.2. ELF TLS
~~~~~~~~~~~~

Starting with the 2.4.20 Linux kernel, each thread is allocated an area
of per-thread, global storage. Variables can be put in this area using
some extensions to GCC. By storing the dispatch table pointer in this
area, the expensive call to ``pthread_getspecific`` and the test of
``_glapi_Dispatch`` can be avoided.

The dispatch table pointer is stored in a new variable called
``_glapi_tls_Dispatch``. A new variable name is used so that a single
libGL can implement both interfaces. This allows the libGL to operate
with direct rendering drivers that use either interface. Once the
pointer is properly declared, ``GET_DISPATCH`` becomes a simple variable
reference.

.. code-block:: c
   :caption: TLS ``GET_DISPATCH`` Implementation

   extern __thread struct _glapi_table *_glapi_tls_Dispatch
       __attribute__((tls_model("initial-exec")));

   #define GET_DISPATCH() _glapi_tls_Dispatch

Use of this path is controlled by the preprocessor define
``USE_ELF_TLS``. Any platform capable of using ELF TLS should use this
as the default dispatch method.

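The per-thread behaviour can be checked with a small self-contained
sketch. The ``demo_`` names are hypothetical, the ``tls_model``
attribute is left out for brevity, and ``__thread`` is assumed to be
available (GCC or a compatible compiler).

.. code-block:: c
   :caption: Sketch: each thread sees its own TLS pointer (hypothetical names)

   #include <pthread.h>
   #include <stddef.h>

   /* Hypothetical stand-in for struct _glapi_table. */
   struct demo_table { int id; };

   /* One pointer per thread, reached with a plain variable reference. */
   static __thread struct demo_table *demo_tls_Dispatch = NULL;

   #define DEMO_GET_DISPATCH() demo_tls_Dispatch

   /* Thread body: install a table, then report what this thread sees. */
   static void *demo_worker(void *arg)
   {
       demo_tls_Dispatch = arg;
       return DEMO_GET_DISPATCH();
   }

   /* Run a worker with `table` and return the pointer it observed. */
   struct demo_table *demo_run_in_thread(struct demo_table *table)
   {
       pthread_t thread;
       void *seen = NULL;

       if (pthread_create(&thread, NULL, demo_worker, table) != 0)
           return NULL;
       pthread_join(thread, &seen);
       return seen;
   }

The worker's assignment does not disturb the pointer held by the calling
thread, which is the property the dispatch stubs rely on.
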
3.3. Assembly Language Dispatch Stubs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many platforms have difficulty properly optimizing the tail-call in the
dispatch stubs. Platforms like x86 that pass parameters on the stack
seem to have even more difficulty optimizing these routines. All of the
dispatch routines are very short, and it is trivial to create optimal
assembly language versions. The amount of optimization provided by using
assembly stubs varies from platform to platform and application to
application. However, by using the assembly stubs, many platforms can
use an additional space optimization (see :ref:`below <fixedsize>`).

The biggest hurdle to creating assembly stubs is handling the various
ways that the dispatch table pointer can be accessed. There are four
different methods that can be used:

#. Using ``_glapi_Dispatch`` directly in builds for non-multithreaded
   environments.
#. Using ``_glapi_Dispatch`` and ``_glapi_get_dispatch`` in
   multithreaded environments.
#. Using ``_glapi_Dispatch`` and ``pthread_getspecific`` in
   multithreaded environments.
#. Using ``_glapi_tls_Dispatch`` directly in TLS enabled multithreaded
   environments.

People wishing to implement assembly stubs for new platforms should
focus on #4 if the new platform supports TLS. Otherwise, implement #2
followed by #3. Environments that do not support multithreading are
uncommon and not terribly relevant.

Selection of the dispatch table pointer access method is controlled by a
few preprocessor defines.

- If ``USE_ELF_TLS`` is defined, method #4 is used.
- If ``HAVE_PTHREAD`` is defined, method #2 is used.
- If none of the preceding are defined, method #1 is used.

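In C preprocessor terms, that precedence could be written as the
following sketch. The ``DEMO_`` macros and the helper function are
hypothetical; only ``USE_ELF_TLS`` and ``HAVE_PTHREAD`` come from the
text above.

.. code-block:: c
   :caption: Sketch: choosing the access method at build time (hypothetical names)

   /* Hypothetical tags for the access mechanisms. */
   #define DEMO_ACCESS_GLOBAL   1   /* plain _glapi_Dispatch */
   #define DEMO_ACCESS_WRAPPER  2   /* _glapi_Dispatch plus _glapi_get_dispatch */
   #define DEMO_ACCESS_TLS      3   /* _glapi_tls_Dispatch */

   /* The first matching define wins, as in the list above. */
   #if defined(USE_ELF_TLS)
   #  define DEMO_ACCESS_METHOD DEMO_ACCESS_TLS
   #elif defined(HAVE_PTHREAD)
   #  define DEMO_ACCESS_METHOD DEMO_ACCESS_WRAPPER
   #else
   #  define DEMO_ACCESS_METHOD DEMO_ACCESS_GLOBAL
   #endif

   /* Report the mechanism selected for this build. */
   static const char *demo_access_method_name(void)
   {
       switch (DEMO_ACCESS_METHOD) {
       case DEMO_ACCESS_TLS:     return "tls";
       case DEMO_ACCESS_WRAPPER: return "wrapper";
       default:                  return "global";
       }
   }

Because the choice is made by the preprocessor, each build contains
exactly one access path.
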
Two different techniques are used to handle these cases. On x86 and
SPARC, a macro called ``GL_STUB`` is used. In the preamble of the
assembly source file, different implementations of the macro are
selected based on the defined preprocessor variables. The assembly code
then consists of a series of invocations of the macro, such as:

.. code-block:: c
   :caption: SPARC Assembly Implementation of ``glColor3fv``

   GL_STUB(Color3fv, _gloffset_Color3fv)

The benefit of this technique is that changes to the calling pattern
(i.e., addition of a new dispatch table pointer access method) require
fewer changed lines in the assembly code.

However, this technique can only be used on platforms where the function
implementation does not change based on the parameters passed to the
function. For example, since x86 passes all parameters on the stack, no
additional code is needed to save and restore function parameters around
a call to ``pthread_getspecific``. Since x86-64 passes parameters in
registers, varying amounts of code need to be inserted around the call
to ``pthread_getspecific`` to save and restore the GL function's
parameters.

The other technique, used by platforms like x86-64 that cannot use the
first technique, is to insert ``#ifdef`` within the assembly
implementation of each function. This makes the assembly file
considerably larger (e.g., 29,332 lines for ``glapi_x86-64.S`` versus
1,155 lines for ``glapi_x86.S``) and causes simple changes to the
function implementation to generate many lines of diffs. Since the
assembly files are typically generated by scripts, this isn't a
significant problem.

Once a new assembly file is created, it must be inserted in the build
system. There are two steps to this. The file must first be added to
``src/mesa/sources``. That gets the file built and linked. The second
step is to add the correct ``#ifdef`` magic to
``src/mesa/glapi/glapi_dispatch.c`` to prevent the C version of the
dispatch functions from being built.

.. _fixedsize:

3.4. Fixed-Length Dispatch Stubs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To implement ``glXGetProcAddress``, Mesa stores a table that associates
function names with pointers to those functions. This table is stored in
``src/mesa/glapi/glprocs.h``. For different reasons on different
platforms, storing all of those pointers is inefficient. On most
platforms, including all known platforms that support TLS, we can avoid
this added overhead.

If the assembly stubs are all the same size, the pointer need not be
stored for every function. The location of the function can instead be
calculated by multiplying the size of the dispatch stub by the offset of
the function in the table. This value is then added to the address of
the first dispatch stub.

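The address calculation amounts to a single multiply and add, as in this
sketch. ``DEMO_STUB_SIZE`` and ``demo_stub_address`` are hypothetical
names, and the 16-byte stub size is an arbitrary assumption; the real
size depends on the platform's assembly stubs.

.. code-block:: c
   :caption: Sketch: locating a fixed-size stub by offset (hypothetical names)

   #include <stddef.h>

   /* Assumed stub size in bytes, for illustration only. */
   #define DEMO_STUB_SIZE 16u

   /* Address of the stub at table offset `offset`, given the first stub. */
   static inline const void *demo_stub_address(const void *first_stub,
                                               size_t offset)
   {
       return (const unsigned char *) first_stub + DEMO_STUB_SIZE * offset;
   }

With 16-byte stubs, the stub at offset 3 starts 48 bytes after the first
one, so no per-function pointer has to be stored.
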
This path is activated by adding the correct ``#ifdef`` magic to
``src/mesa/glapi/glapi.c`` just before ``glprocs.h`` is included.