The announcement of Mantle a few months ago has triggered a lot of discussion about graphics API design. I think there are technical issues in OpenGL, but those are specific problems that need to be solved individually, following the hardware designs. That said, redesigning an entire API might be fun: it keeps marketing people busy and it gets OpenGL people communicating about how good OpenGL is.
What I like about Direct3D is that Microsoft was influential enough to drive IHVs to standardize hardware features. The Khronos Group is certainly getting better at it; the ASTC texture format is a good example, as I expect this format to be supported on all mobile and desktop GPUs in the future. How strong is Microsoft these days? That is something we will be able to judge at GDC.
What's an OpenGL 5 hardware feature? Following the conventions used for OpenGL 3 and OpenGL 4, it's any hardware feature that can't be implemented on all OpenGL 4 hardware but could be implemented on newer hardware by all IHVs.
In this article I would like to point at hardware features available through OpenGL extensions, and at ideas that may or may not be interesting to standardize. As we will see, there are a lot of great features that could build this OpenGL 5 and Direct3D 12 hardware generation.
Multi draw indirect is part of the OpenGL 4.3 core specification, but it's arguably an OpenGL 5 hardware feature, allowing the GPU to submit draws to itself for execution. This feature can be emulated in software quite easily, using the CPU to push each individual draw, but this is really slow. Currently, all Intel GPUs and AMD Evergreen support multi draw indirect in software. A hardware implementation gives another order of magnitude of performance.
A synthetic test rendering 2 triangles per draw over 4 pixels reaches 800000 draws per frame at 60Hz on Kepler and 300000 draws per frame on Southern Islands. That huge number of draws provides amazing control over the rendering.
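For reference, here is a minimal sketch of what submission looks like: a CPU-filled indirect buffer and a single call to draw everything. The DRAW_COUNT constant and IndirectBufferName are placeholders; vertex array and shader setup are omitted.

```c
// Layout of one draw record, as defined by the OpenGL specification.
typedef struct
{
    GLuint Count;         // number of indices for this draw
    GLuint InstanceCount; // number of instances
    GLuint FirstIndex;    // offset into the element array
    GLuint BaseVertex;    // value added to each index
    GLuint BaseInstance;  // first instance ID
} DrawElementsIndirectCommand;

// Fill the indirect buffer on the CPU; it could equally be written by the GPU.
DrawElementsIndirectCommand Commands[DRAW_COUNT] = {0};
// ... fill Commands[i] for each draw ...

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectBufferName);
glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(Commands), Commands, GL_STATIC_DRAW);

// Submit all the draws with a single call.
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, NULL, DRAW_COUNT, 0);
```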
With OpenGL 4.3 we got shader storage buffers that we can use in place of vertex array objects for programmable vertex pulling, which allows each individual draw to use a custom vertex format. To make that approach really useful we were missing access to some draw arguments in the vertex shader stage. With this extension we get gl_BaseVertexARB and gl_BaseInstanceARB, which reflect the draw arguments, but also gl_DrawIDARB, an equivalent of gl_InstanceID for draws.
Only NVIDIA implements this extension at the moment, but it should be implementable on Southern Islands. However, the performance of gl_DrawIDARB is poor compared with a dedicated vertex attribute using a divisor of 1 and a base instance.
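To illustrate the idea, here is a hedged GLSL sketch of vertex pulling where each draw selects its own per-draw data through gl_DrawIDARB; the DrawData structure and buffer bindings are assumptions made for this example.

```glsl
#version 430 core
#extension GL_ARB_shader_draw_parameters : require

// Hypothetical per-draw data, one entry per indirect draw record.
struct DrawData
{
    mat4 Transform;
    int VertexOffset;
    int Padding[3];
};

layout(binding = 0, std430) readonly buffer DrawBuffer { DrawData Draws[]; };
layout(binding = 1, std430) readonly buffer PositionBuffer { float Positions[]; };

void main()
{
    // gl_DrawIDARB identifies which draw of the multi draw we belong to.
    DrawData Draw = Draws[gl_DrawIDARB];

    // Programmable vertex pulling: fetch the position manually instead of
    // relying on a fixed vertex format declared in a vertex array object.
    int Index = (Draw.VertexOffset + gl_VertexID) * 3;
    vec3 Position = vec3(Positions[Index + 0], Positions[Index + 1], Positions[Index + 2]);

    gl_Position = Draw.Transform * vec4(Position, 1.0);
}
```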
One idea with multi draw indirect is to use a compute shader to generate the indirect draw buffer. However, with GL_ARB_multi_draw_indirect the number of draws is submitted to the GPU through a draw call parameter. Hence, if we generate the list of draws with a compute shader, we need to read back the number of draws just to submit it again to the GPU. With GL_ARB_indirect_parameters, the number of draws can be stored in a buffer object instead.
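A hedged sketch of what the submission side could then look like, assuming the compute shader has written both the draw records and the draw count into buffer objects (the buffer names and MAX_DRAW_COUNT are placeholders):

```c
// The compute shader wrote the draw records into IndirectBufferName and the
// number of draws into CountBufferName (a single GLuint at offset 0).
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectBufferName);
glBindBuffer(GL_PARAMETER_BUFFER_ARB, CountBufferName);

// Make the compute shader writes visible to the indirect draw commands.
glMemoryBarrier(GL_COMMAND_BARRIER_BIT);

// The actual draw count is read by the GPU from the parameter buffer;
// MAX_DRAW_COUNT only bounds how many records may be consumed.
glMultiDrawElementsIndirectCountARB(GL_TRIANGLES, GL_UNSIGNED_INT, 0, 0, MAX_DRAW_COUNT, 0);
```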
This is the last piece of the NVIDIA bindless API, allowing us to submit draws without binding vertex arrays or the indirect draw buffer, for lower CPU overhead.
This extension is really ugly but the functionality is really interesting. Instead of having a single element array, thanks to this extension we can have up to 4 element arrays and index each vertex attribute with the element array of our choice. Quite a few software packages actually generate meshes using multiple element arrays, so this functionality sounds extremely useful. Taking advantage of this feature can save bandwidth by avoiding duplicated vertex attributes.
Bindless resources is a feature supported by the Kepler and Southern Islands architectures. When we bind a texture, the driver writes a texture descriptor into a special location on the GPU. The number of these special locations is fixed: 32 on OpenGL 4 / Direct3D 11 hardware. With bindless textures we pass a handle (a pointer) directly to the shader stages. The shader invocations fetch and cache the texture descriptors. This approach gives access to an unlimited number of textures, but there is a chance of texture descriptor cache misses, so we need to keep accesses coherent, which multi draw indirect and dynamically uniform expressions can provide by design.
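As a sketch of how the API side looks with bindless textures, assuming TextureName is an existing texture and the handle is stored in a uniform buffer read by the shaders (MaterialBufferName is a placeholder):

```c
// Query a 64-bit handle once; it stays valid for the lifetime of the texture.
GLuint64 Handle = glGetTextureHandleARB(TextureName);

// The descriptor must be made resident before any shader may sample from it.
glMakeTextureHandleResidentARB(Handle);

// Store the handle in a buffer; on the GLSL side it can be declared as
// "layout(bindless_sampler) uniform sampler2D Diffuse;" or rebuilt from a
// packed uvec2 with a sampler constructor.
glBindBuffer(GL_UNIFORM_BUFFER, MaterialBufferName);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(GLuint64), &Handle);
```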
With OpenGL 4.4, NVIDIA provides bindless buffers, but this is still something we need an ARB extension for.
ARB_sparse_texture takes advantage of the virtual memory support of GPUs (Fermi and Southern Islands). A sparse texture can be used to create "giant" textures, bigger than the graphics card memory, for virtual texturing. AMD_sparse_texture is more powerful as it adds shader queries to figure out within a shader invocation whether a texture page is committed or not.
There are quite a few obvious features that would be nice to have: the possibility to share the same texture tile between two texture pages without consuming the memory twice, but also support for actual giant textures. Sparse texture sizes are currently bound to the same limits as non-sparse textures; 16384 * 16384 for a 2D texture is big, but for a virtual texture, not really. Texture stitching could be an option for large virtual textures, where we would use multiple texture 2D array layers and allow filtering across these layers. Lastly, supporting sparse shadow maps would be great and could simplify shadow rendering a lot, by using very high resolution shadow maps allocated only where it matters. In ARB_sparse_texture, this is only optional.
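For reference, a minimal sketch of committing a single page of a sparse texture; TextureName and the mip level count are placeholders, and a real renderer would commit and decommit pages on demand.

```c
// Request sparse storage before allocating the mipmap chain.
glBindTexture(GL_TEXTURE_2D, TextureName);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
glTexStorage2D(GL_TEXTURE_2D, 15, GL_RGBA8, 16384, 16384);

// Query the virtual page size for this format (typically around 256 x 256).
GLint PageSizeX = 0, PageSizeY = 0;
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_X_ARB, 1, &PageSizeX);
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_Y_ARB, 1, &PageSizeY);

// Commit physical memory only for the first page of mip level 0.
glTexPageCommitmentARB(GL_TEXTURE_2D, 0, 0, 0, 0, PageSizeX, PageSizeY, 1, GL_TRUE);
```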
The Khronos Group has standardized a new texture format called ASTC that provides very low bit rates and HDR support. Because it's a KHR extension, both the OpenGL ES group and the OpenGL ARB group voted to support this feature, which gives me good hope that we will "soon" have support for this format on all platforms.
This feature allows mapping the storage of a 2D texture just like we can do with a buffer. The extension supports linear and tiled formats; however, if this feature were to be standardized, only the linear storage could be supported across current GPUs. Unfortunately, linear storage is texture cache inefficient. Each GPU tiled format is vendor specific, so to standardize this feature, the texture layout would have to be standardized too.
Seamless cube map filtering is a functionality that is globally enabled. The AMD Radeon HD 4000 series introduced the possibility to toggle this feature per cube map. Globally enabled seamless cube map filtering is already not that useful, and having that toggle per cube map is even less useful, but Kepler added support for this feature as well. At least the per-texture toggle works fine, while the global toggle is clunky on all available OpenGL implementations.
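With the per-texture toggle, enabling it is a regular texture parameter; a minimal sketch assuming an existing cube map object:

```c
// Per-texture seamless filtering, instead of the global
// glEnable(GL_TEXTURE_CUBE_MAP_SEAMLESS) switch.
glBindTexture(GL_TEXTURE_CUBE_MAP, CubeMapName);
glTexParameteri(GL_TEXTURE_CUBE_MAP, GL_TEXTURE_CUBE_MAP_SEAMLESS, GL_TRUE);
```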
NVIDIA Fermi and AMD Northern Islands have dedicated DMA engines that can live their own lives. Hence a dedicated thread could be in charge of streaming resources as soon as the application figures out that they might become useful. During these transfers, the graphics engine can continue its life independently, without any required synchronization. Obviously, the transfers would have to be completed before using the resources, but with enough anticipation we would need a synchronization object only to guarantee correctness on all possible hardware, without ever actually hitting that fence.
Currently NVIDIA supports this behaviour, but only by creating a separate context on a dedicated thread. This is workable but cumbersome, and it costs a thread safety penalty for the entire OpenGL implementation.
An explicit way to use the DMA engines for fully asynchronous transfers, performing transfers outside of the rendering code, would be really nice to have.
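As a rough sketch of the current workaround, assuming a hypothetical work queue and platform context functions (PopUploadTask, MakeCurrent and the Upload structure are placeholders):

```c
// Streaming thread owning its own OpenGL context that shares objects with the
// rendering context. Transfers run on the DMA engine while rendering continues.
void StreamingThread(void)
{
    MakeCurrent(StreamingContext); // placeholder for wglMakeCurrent / glXMakeCurrent

    while(Running)
    {
        Upload* Task = PopUploadTask(); // placeholder work queue

        // Upload through a pixel buffer so the transfer is driver-managed.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, Task->BufferName);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, Task->Size, Task->Data, GL_STREAM_DRAW);
        glBindTexture(GL_TEXTURE_2D, Task->TextureName);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, Task->Width, Task->Height,
            GL_RGBA, GL_UNSIGNED_BYTE, 0);

        // Publish a fence; with enough anticipation the rendering thread
        // checks it for correctness but never actually blocks on it.
        Task->Fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush();
    }
}
```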
This extension is the first of a new kind of idea: "Maybe we can make decisions or perform work per warp/wavefront." For example, let's say that we are rendering some objects but the area they cover is actually blurry. Maybe it isn't very useful to use the nicest lighting equation for these pixels. If we make this decision per shader invocation, we will trigger complex branching machinery. If we make this decision per warp/wavefront, we will trigger simple jump instructions, keeping performance high.
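A hedged GLSL sketch of that idea: the expensive path is skipped only when every invocation of the group agrees, so the whole warp/wavefront follows a single branch (the blur mask and the two lighting functions are placeholders):

```glsl
#version 430 core
#extension GL_ARB_shader_group_vote : require

in vec2 TexCoord;
out vec4 Color;

uniform sampler2D BlurMask; // placeholder: how blurry this area is

vec4 CheapLighting() { return vec4(0.5); }                // placeholder
vec4 NiceLighting()  { return vec4(1.0, 0.9, 0.8, 1.0); } // placeholder

void main()
{
    bool Blurry = texture(BlurMask, TexCoord).r > 0.5;

    // If all invocations of the warp/wavefront agree, the whole group takes a
    // single branch with a simple jump instead of diverging per invocation.
    if(allInvocationsARB(Blurry))
        Color = CheapLighting();
    else
        Color = NiceLighting();
}
```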
This extension goes into the "super resolution" range of ideas, where we no longer want to think at a fixed pixel resolution but instead at a higher or lower resolution than the native one. A GPU doesn't actually execute anything on a per-pixel or per-vertex basis but in many different kinds of groupings. The warp/wavefront is the grouping for shader invocations, and another famous one is the quadpixel, a set of 4 fragments. The texture LOD calculation is computed per quadpixel because it is very complex to compute analytically the derivatives required for the texture LOD computation, but it is really easy to compute within a quadpixel: it's simply the difference between the values across the quadpixel.
This extension gives access to quadpixels, allowing us to swizzle intermediate results across the fragments. Let's say a fragment shader requires 4 texture samples. In some areas, we could consider that it is not that useful to sample per fragment and instead sample per quadpixel. This feature should interact pretty well with GL_ARB_shader_group_vote.
This extension extends GL_NV_shader_thread_group to any of the shader invocations of a warp/wavefront. It seems very likely that we could use GL_NV_shader_thread_group on any GPU because all GPUs use quadpixels; however, the warp/wavefront size is different for each GPU vendor: 32 for NVIDIA, 64 for AMD and variable for Intel, between 4 and 16 depending on the case. This feature sounds particularly useful for post-processed antialiasing and maybe things like soft shadows.
This extension simply extends the add and exchange atomic operations to floats.
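A small hedged GLSL sketch of what that enables, for example accumulating luminance into a single float without fixed-point tricks (the buffer layout is an assumption):

```glsl
#version 430 core
#extension GL_NV_shader_atomic_float : require

layout(binding = 0, std430) buffer LuminanceBuffer { float TotalLuminance; };

in vec4 FragmentColor;
out vec4 Color;

void main()
{
    // Atomic add on a float in a shader storage buffer, previously only
    // possible with integer atomics and manual fixed-point encoding.
    atomicAdd(TotalLuminance, dot(FragmentColor.rgb, vec3(0.2126, 0.7152, 0.0722)));
    Color = FragmentColor;
}
```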
ARB_shader_atomic_counters and OpenGL 4.2 introduced the concept of atomic counter operations, but those were limited to increment, decrement and query. AMD GPUs perform these atomic operations in GDS memory, which is faster than image and buffer atomic operations. However, AMD GPUs support more atomic operations on GDS: increment and decrement with wrap; addition and subtraction; minimum and maximum; bitwise operators (AND, OR, XOR, etc.); a masked OR operator; exchange; and compare-and-exchange. GL_AMD_shader_atomic_counter_ops exposes all these operations.
AMD extends the min and max GLSL functions to take three arguments. It also provides a mid3 function that returns the median of three values.
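For example, a hedged GLSL sketch of a small median filter built on mid3 (the texture setup is assumed):

```glsl
#version 420 core
#extension GL_AMD_shader_trinary_minmax : require

uniform sampler2D Source;
in vec2 TexCoord;
out vec4 Color;

void main()
{
    // Median of three horizontal neighbours, one trinary operation per channel.
    vec3 Left   = textureOffset(Source, TexCoord, ivec2(-1, 0)).rgb;
    vec3 Center = texture(Source, TexCoord).rgb;
    vec3 Right  = textureOffset(Source, TexCoord, ivec2( 1, 0)).rgb;

    Color = vec4(mid3(Left, Center, Right), 1.0);
}
```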
One particularly annoying behaviour of AMD hardware is that the group size has to be known at compile time. Hence, this is how OpenGL 4.3 specifies it. With GL_ARB_compute_variable_group_size, the ARB relaxes this behaviour; however, this extension is only implemented by NVIDIA at the moment.
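With the ARB extension, the shader declares a variable local size and the actual group size becomes a dispatch-time parameter; a minimal sketch (the program name and sizes are placeholders):

```c
// In the compute shader, the fixed "layout(local_size_x = ...) in;" declaration
// is replaced by: layout(local_size_variable) in;

// At dispatch time the local group size is a runtime parameter instead of a
// compile-time constant baked into the shader binary.
glUseProgram(ComputeProgramName);
glDispatchComputeGroupSizeARB(
    GroupCountX, GroupCountY, 1,  // number of work groups
    LocalSizeX, LocalSizeY, 1);   // local group size chosen at runtime
```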
GLSL is pretty limited in terms of the types it supports: int, float, double (GL4). With its OpenGL 4 hardware, NVIDIA provides more storage types, including 64-bit integers that can be used with vertex attributes.
This extension was released with the Fermi GPUs. It offers many functionalities that have since gone into the core specification. However, one notable aspect of this specification is the large range of types it provides: 8, 16 and 32 bit integers and half-precision floating point types.
Setting the sample positions is, to me, a very useful feature for post-processing antialiasing, but also for very highly multisampled rendering, which could be nice for text rendering, for example.
This extension is a collaboration between Apple and NVIDIA. It is obviously designed to handle high-DPI screens. When calling glBlitFramebuffer to resolve a multisampled framebuffer, the OpenGL implementation can resolve and scale the framebuffer in a single operation.
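If I understand correctly, this is exposed by adding new filter values to glBlitFramebuffer; a hedged sketch assuming EXT_framebuffer_multisample_blit_scaled tokens and placeholder dimensions:

```c
// Resolve the multisampled framebuffer and scale it to the window size in a
// single blit, using a scaled-resolve filter instead of GL_LINEAR.
glBindFramebuffer(GL_READ_FRAMEBUFFER, MultisampledFramebufferName);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(
    0, 0, RenderWidth, RenderHeight,  // multisampled source rectangle
    0, 0, WindowWidth, WindowHeight,  // scaled destination rectangle
    GL_COLOR_BUFFER_BIT, GL_SCALED_RESOLVE_FASTEST_EXT);
```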
NVIDIA uses something called multisample coverage, which allows having more coverage samples than color samples. Hence, the implementation can adapt the multisampling according to the number of coverage samples covering each color sample.
In the second part of this article, we will discuss the possibilities for blending, stencil, profiling, the rendering pipeline and miscellaneous features. We will also discuss the announcements made at GDC, if enough public information is given away for me to discuss relevant things. We will conclude with my personal wish list for the OpenGL 5 and Direct3D 12 hardware class.