AMD_multi_draw_indirect is a feature that Direct3D 11 programmers can be particularly jealous of: it reached the OpenGL 4.3 specification, but it is still missing from the Direct3D 11 world. Direct3D 11 and OpenGL 4.0 introduced the draw indirect functionality, which submits a draw using parameters stored in a buffer object. AMD_multi_draw_indirect extends this feature by allowing multiple draw submissions in a single call. This approach moves a CPU loop to a GPU command processor loop, hence potentially significantly reducing the CPU overhead and increasing the possible number of draws per frame. Unfortunately, unlike instanced draws, which expose gl_InstanceID, OpenGL doesn't provide a unique identifier (let's call it gl_DrawID) for each draw...
A first approach we could think of is to use an atomic counter and increment it once per draw, when gl_VertexID and gl_InstanceID are both equal to 0. Unfortunately, this idea doesn't work: because of the GPU architecture, OpenGL provides no guarantee on the order of execution, so with this method we can't be sure that the first draw will be identified with the value 0. On AMD hardware the expected order is obtained most of the time, but not always. On NVIDIA hardware atomic counters are executed asynchronously, which nearly guarantees that we will never have the right identifier for a draw.
Fortunately, there is one method we can use: a vertex attribute with a divisor, combined with the base instance.
With a divisor equal to 1, all the vertices of a draw (rendering a single instance) read the same buffer value. Then, for each draw, we use the base instance parameter as an offset that sets where in the buffer we read the DrawID value. The base instance value acts here as an indirection, which increases the execution latency but allows us to set a specific value for each DrawID, providing a more advanced behavior than a simple increment of the value for each draw. This behavior allows the application to make the most of each draw. When each draw is statically associated with a specific mesh, a lot of draws will be useless for the rendering of a given frame. This wouldn't matter if discarding a draw was particularly efficient, but it isn't on either the Kepler or the Southern Islands architecture. By being able to assign a specific DrawID to each draw, we can assign a specific mesh to each draw, allowing us to build a list containing only the meshes that are needed.
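To make the mechanism more concrete, here is a minimal CPU-side sketch of how such a draw list could be built. The DrawElementsIndirectCommand layout follows the OpenGL specification; everything else (MeshRange, DrawList, buildDrawList, fetchInstancedAttribute) is a hypothetical name invented for this illustration. The last function only emulates on the CPU how an instanced attribute is addressed, it is not an OpenGL call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Layout of one indexed indirect draw, as defined by the OpenGL specification.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices
    std::uint32_t instanceCount; // number of instances
    std::uint32_t firstIndex;    // first index, in indices
    std::int32_t  baseVertex;    // value added to each fetched index
    std::uint32_t baseInstance;  // offset applied to instanced attributes
};

// Hypothetical description of where a mesh lives in the shared buffers.
struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// Hypothetical container: one command and one DrawID value per draw.
struct DrawList {
    std::vector<DrawElementsIndirectCommand> commands;
    std::vector<std::uint32_t> drawIDs;
};

// Builds one draw per needed mesh. The DrawID stored for a draw is the mesh
// index (an assumption for this sketch), and baseInstance is the slot of that
// value in the DrawID buffer.
DrawList buildDrawList(const std::vector<MeshRange>& meshes,
                       const std::vector<std::uint32_t>& neededMeshes) {
    DrawList list;
    for (std::uint32_t meshIndex : neededMeshes) {
        assert(meshIndex < meshes.size());
        const MeshRange& mesh = meshes[meshIndex];
        const std::uint32_t slot = static_cast<std::uint32_t>(list.commands.size());
        list.commands.push_back({mesh.indexCount, 1u, mesh.firstIndex, mesh.baseVertex, slot});
        list.drawIDs.push_back(meshIndex);
    }
    return list;
}

// CPU emulation of instanced attribute addressing:
// element = instance / divisor + baseInstance.
std::uint32_t fetchInstancedAttribute(const std::vector<std::uint32_t>& buffer,
                                      std::uint32_t divisor,
                                      std::uint32_t instance,
                                      std::uint32_t baseInstance) {
    assert(divisor > 0);
    return buffer[instance / divisor + baseInstance];
}
```

In a real renderer, drawIDs would presumably be uploaded to a buffer exposed as an integer vertex attribute (glVertexAttribIPointer with glVertexAttribDivisor set to 1), and commands would be uploaded to the buffer bound to GL_DRAW_INDIRECT_BUFFER and consumed by glMultiDrawElementsIndirect.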
This DrawID gives an interesting functional quality to the multi draw indirect approach; however, adding application specific semantics could increase the flexibility and even change the application performance balance. This DrawID is useful because it allows indexing resources in the shader. Typically, we would like to use it to index materials in the fragment shader stage. We can even imagine that multiple resources will be indexed thanks to this DrawID. Hence, we need a per-draw indirection table, and this table will be embodied by a uniform block which stores indexes to access each resource per draw. Unfortunately, indirections imply latencies and potentially performance...
A possible improvement for this DrawID is to give it some semantics in order to skip one level of indirection. Hence, we can create multiple DrawIDs using both a divisor equal to 1 and BaseInstance, and call them "MaterialID", "VertexFormatID" or whatever an OpenGL application needs.
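As a rough illustration of these semantic DrawIDs, the sketch below stores one small record per draw, so that a single BaseInstance value selects both identifiers at once. The PerDrawIDs record and the appendDraw helper are hypothetical names, and interleaving the two identifiers in one buffer is only one possible choice; two separate buffers would work as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-draw record. Each member would be exposed to the vertex
// shader as its own integer attribute with a divisor equal to 1.
struct PerDrawIDs {
    std::uint32_t materialID;
    std::uint32_t vertexFormatID;
};

// The record is tightly packed, so its size is the stride of both attributes.
static_assert(sizeof(PerDrawIDs) == 2 * sizeof(std::uint32_t),
              "PerDrawIDs is expected to have no padding");

// Appends the identifiers of one draw and returns the slot of the record,
// which is the value to store in the baseInstance field of that draw.
std::uint32_t appendDraw(std::vector<PerDrawIDs>& table,
                         std::uint32_t materialID,
                         std::uint32_t vertexFormatID) {
    const std::uint32_t slot = static_cast<std::uint32_t>(table.size());
    table.push_back({materialID, vertexFormatID});
    return slot;
}
```

With this layout, both attributes would share the same buffer and stride, and differ only by their offset, given by offsetof(PerDrawIDs, materialID) and offsetof(PerDrawIDs, vertexFormatID).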
A scenario with multi draw indirect is to pack multiple meshes of different objects into a single set of buffers and manually fetch each vertex for each draw, so that each draw can have a different vertex format. One tricky issue with this approach is that the BaseVertex and the BaseInstance parameters are not exposed in the vertex shader stage as input variables. Using a divisor equal to 1 and BaseInstance is also a way to expose these variables in the vertex shader stage. People used to ridiculously optimized code (If you do VHDL you will understand me !:p) could even naturally pack the BaseVertex and the VertexFormat ids into a single integer, by allocating some of the bits to BaseVertex and the rest to VertexFormat, relying on the limitations given by GL_MAX_SHADER_STORAGE_BLOCK_SIZE and GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT.
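The packing idea can be sketched as follows. The split of 24 bits for BaseVertex and 8 bits for VertexFormat is an arbitrary assumption made for this example, as are the function names; an application would derive the actual split from its own limits. BaseVertex is also assumed to be non-negative here.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit budget: the low 24 bits hold BaseVertex, the high 8 bits hold
// the VertexFormat id. This split is only an example.
const std::uint32_t kBaseVertexBits = 24u;
const std::uint32_t kVertexFormatBits = 32u - kBaseVertexBits;
const std::uint32_t kBaseVertexMask = (1u << kBaseVertexBits) - 1u;

// Packs both values into a single 32-bit integer attribute value.
inline std::uint32_t packDrawParams(std::uint32_t baseVertex, std::uint32_t vertexFormat) {
    assert(baseVertex <= kBaseVertexMask);              // must fit in 24 bits
    assert(vertexFormat < (1u << kVertexFormatBits));   // must fit in 8 bits
    return (vertexFormat << kBaseVertexBits) | baseVertex;
}

// The same mask and shift would be applied in the vertex shader.
inline std::uint32_t unpackBaseVertex(std::uint32_t packed) {
    return packed & kBaseVertexMask;
}

inline std::uint32_t unpackVertexFormat(std::uint32_t packed) {
    return packed >> kBaseVertexBits;
}
```

The packed value would then be stored in the per-draw buffer read through the divisor, and unpacked in the vertex shader before fetching the vertex manually.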
The divisor equal to 1 and BaseInstance method opens a lot of opportunities for the multi draw indirect approach. However, we can question whether its performance is optimal. Even if a single attribute value is shared by all the vertex shader invocations of a draw, there is little reason to think that this value won't be fetched again for each invocation. Chances are that the latency and bandwidth impact will be small, as we can expect good cache reuse, assuming that many vertex shader invocations are triggered at the same time.
A question remains: how fast would an automatically incremented gl_DrawID be compared to the proposed solution? It would have no immediate bandwidth and latency impact, but this approach would require an indirection table (uniform block) to identify which actual resource to access, which may not achieve better cache reuse than the divisor approach. Another approach would be to add another draw parameter alongside BaseVertex and BaseInstance, where the user could set their own value for the DrawID parameter. We could even imagine several of those parameters. However, what guarantees that those parameters would not have to be fetched for each vertex shader invocation? As a result, this approach could perform the same as using vertex attributes with divisors.
A sample showing the usage of multi draw indirect is available in the OpenGL Samples Pack 4.3.