The announcement of Mantle a few months ago has triggered a lot of discussion about graphics API design. I think there are technical issues in OpenGL, but those are specific problems that need to be solved individually, following the hardware designs. That said, redesigning an entire API might be fun: it keeps marketing people busy and it gets OpenGL people communicating about how good OpenGL is.
What I like about Direct3D is that Microsoft was influential enough to drive IHVs to standardize hardware features. The Khronos Group is certainly getting better at it; the ASTC texture format is a good example, as I expect this format to be supported on all mobile and desktop GPUs in the future. How strong is Microsoft these days? That is something we will be able to judge at GDC.
What's an OpenGL 5 hardware feature? Following the conventions used for OpenGL 3 and OpenGL 4, it's any hardware feature that can't be implemented on all OpenGL 4 hardware but could be implemented on newer hardware by all IHVs.
In this article I would like to point at hardware features available through OpenGL extensions, and at ideas that may or may not be interesting to standardize. As we will see, there are a lot of great features that could build this OpenGL 5 and Direct3D 12 hardware generation.
Multi draw indirect is part of the OpenGL 4.3 core specification, but it's arguably an OpenGL 5 hardware feature, allowing the GPU to submit draws to itself for execution. This feature can be emulated in software quite easily, using the CPU to push each individual draw, but this is really slow. Currently, all Intel GPUs and AMD Evergreen support multi draw indirect in software. A hardware implementation gives another order of magnitude of performance.
A synthetic test rendering 2 triangles per draw over 4 pixels reaches 800000 draws per frame at 60Hz on Kepler and 300000 draws per frame on Southern Islands. That huge number of draws provides amazing control over the rendering.
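For reference, here is a minimal sketch of what submission looks like: a CPU-filled indirect buffer and a single call to draw everything. The DRAW_COUNT constant and IndirectBufferName are placeholders; vertex array and shader setup are omitted.

```c
// Layout of one draw record, as defined by the OpenGL specification.
typedef struct
{
    GLuint Count;         // number of indices for this draw
    GLuint InstanceCount; // number of instances
    GLuint FirstIndex;    // offset into the element array
    GLuint BaseVertex;    // value added to each index
    GLuint BaseInstance;  // first instance ID
} DrawElementsIndirectCommand;

// Fill the indirect buffer on the CPU; it could equally be written by the GPU.
DrawElementsIndirectCommand Commands[DRAW_COUNT] = {0};
// ... fill Commands[i] for each draw ...

glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectBufferName);
glBufferData(GL_DRAW_INDIRECT_BUFFER, sizeof(Commands), Commands, GL_STATIC_DRAW);

// Submit all the draws with a single call.
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, NULL, DRAW_COUNT, 0);
```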
With OpenGL 4.3 we got shader storage buffers that we can use in place of vertex array objects for programmable vertex pulling, which allows each individual draw to use a custom vertex format. To make that approach really useful we were missing access to some draw arguments in the vertex shader stage. With this extension we get gl_BaseVertexARB and gl_BaseInstanceARB, which reflect the draw arguments, but also gl_DrawIDARB, an equivalent of gl_InstanceID for draws.
Only NVIDIA implements this extension at the moment, but it should be implementable on Southern Islands. However, the performance of gl_DrawIDARB is poor compared with a dedicated vertex attribute using a divisor of 1 and a base instance.
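To illustrate the idea, here is a hedged GLSL sketch of vertex pulling where each draw selects its own per-draw data through gl_DrawIDARB; the DrawData structure and buffer bindings are assumptions made for this example.

```glsl
#version 430 core
#extension GL_ARB_shader_draw_parameters : require

// Hypothetical per-draw data, one entry per indirect draw record.
struct DrawData
{
    mat4 Transform;
    int VertexOffset;
    int Padding[3];
};

layout(binding = 0, std430) readonly buffer DrawBuffer { DrawData Draws[]; };
layout(binding = 1, std430) readonly buffer PositionBuffer { float Positions[]; };

void main()
{
    // gl_DrawIDARB identifies which draw of the multi draw we belong to.
    DrawData Draw = Draws[gl_DrawIDARB];

    // Programmable vertex pulling: fetch the position manually instead of
    // relying on a fixed vertex format declared in a vertex array object.
    int Index = (Draw.VertexOffset + gl_VertexID) * 3;
    vec3 Position = vec3(Positions[Index + 0], Positions[Index + 1], Positions[Index + 2]);

    gl_Position = Draw.Transform * vec4(Position, 1.0);
}
```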
One idea with multi draw indirect is to use a compute shader to generate the indirect draw buffer. However, with GL_ARB_multi_draw_indirect the number of draws is submitted to the GPU through a draw call parameter. Hence, if we generate the list of draws with a compute shader, we need to read back the number of draws just to submit it again to the GPU. With GL_ARB_indirect_parameters, the number of draws can be stored in a buffer object instead.
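A hedged sketch of what the submission side could then look like, assuming the compute shader has written both the draw records and the draw count into buffer objects (the buffer names and MAX_DRAW_COUNT are placeholders):

```c
// The compute shader wrote the draw records into IndirectBufferName and the
// number of draws into CountBufferName (a single GLuint at offset 0).
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, IndirectBufferName);
glBindBuffer(GL_PARAMETER_BUFFER_ARB, CountBufferName);

// Make the compute shader writes visible to the indirect draw commands.
glMemoryBarrier(GL_COMMAND_BARRIER_BIT);

// The actual draw count is read by the GPU from the parameter buffer;
// MAX_DRAW_COUNT only bounds how many records may be consumed.
glMultiDrawElementsIndirectCountARB(GL_TRIANGLES, GL_UNSIGNED_INT, 0, 0, MAX_DRAW_COUNT, 0);
```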
This is the last piece of the NVIDIA bindless API, allowing us to submit draws without binding vertex arrays or the indirect draw buffer, for lower CPU overhead.
This extension is really ugly but the functionality is really interesting. Instead of having a single element array, thanks to this extension we can have up to 4 element arrays and index each vertex attribute with the element array of our choice. Quite a few software packages actually generate meshes using multiple element arrays, so this functionality sounds extremely useful. Taking advantage of this feature can save bandwidth by avoiding duplicated vertex attributes.
Bindless resources is a feature supported by the Kepler and Southern Islands architectures. When we bind a texture, the driver writes a texture descriptor into a special location on the GPU. The number of these special locations is fixed: 32 on OpenGL 4 / Direct3D 11 hardware. With bindless textures we pass a handle (a pointer) directly to the shader stages. The shader invocations fetch and cache the texture descriptors. This approach gives access to an unlimited number of textures, but there is a chance of texture descriptor cache misses, so we need to keep accesses coherent, which multi draw indirect and dynamically uniform expressions can provide by design.
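As a sketch of how the API side looks with bindless textures, assuming TextureName is an existing texture and the handle is stored in a uniform buffer read by the shaders (MaterialBufferName is a placeholder):

```c
// Query a 64-bit handle once; it stays valid for the lifetime of the texture.
GLuint64 Handle = glGetTextureHandleARB(TextureName);

// The descriptor must be made resident before any shader may sample from it.
glMakeTextureHandleResidentARB(Handle);

// Store the handle in a buffer; on the GLSL side it can be declared as
// "layout(bindless_sampler) uniform sampler2D Diffuse;" or rebuilt from a
// packed uvec2 with a sampler constructor.
glBindBuffer(GL_UNIFORM_BUFFER, MaterialBufferName);
glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(GLuint64), &Handle);
```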
With OpenGL 4.4, NVIDIA provides bindless buffers, but this is still something we need an ARB extension for.
ARB_sparse_texture takes advantage of the virtual memory support of GPUs (Fermi and Southern Islands). A sparse texture can be used to create "giant" textures, bigger than the graphics card memory, for virtual texturing. AMD_sparse_texture is more powerful as it adds shader queries to figure out within a shader invocation whether a texture page is committed or not.
There are quite a few obvious features that would be nice to have: the possibility to share the same texture tile between two texture pages without consuming the memory twice, but also support for actual giant textures. Sparse texture sizes are currently bound to the same limits as non-sparse textures; 16384 * 16384 for a 2D texture is big, but for a virtual texture, not really. Texture stitching could be an option for large virtual textures, where we would use multiple texture 2D array layers and allow filtering across these layers. Lastly, supporting sparse shadow maps would be great and could simplify shadow rendering a lot, by using very high resolution shadow maps allocated only where it matters. In ARB_sparse_texture, this is only optional.
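For reference, a minimal sketch of committing a single page of a sparse texture; TextureName and the mip level count are placeholders, and a real renderer would commit and decommit pages on demand.

```c
// Request sparse storage before allocating the mipmap chain.
glBindTexture(GL_TEXTURE_2D, TextureName);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
glTexStorage2D(GL_TEXTURE_2D, 15, GL_RGBA8, 16384, 16384);

// Query the virtual page size for this format (typically around 256 x 256).
GLint PageSizeX = 0, PageSizeY = 0;
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_X_ARB, 1, &PageSizeX);
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_Y_ARB, 1, &PageSizeY);

// Commit physical memory only for the first page of mip level 0.
glTexPageCommitmentARB(GL_TEXTURE_2D, 0, 0, 0, 0, PageSizeX, PageSizeY, 1, GL_TRUE);
```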
The Khronos Group has standardized a new texture format called ASTC that provides very low bit rates and HDR support. Because it's a KHR extension, both the OpenGL ES group and the OpenGL ARB group voted to support this feature, which gives me good hope that we will "soon" have support for this format on all platforms.
This feature allows mapping the storage of a 2D texture just like we can do with a buffer. The extension supports linear and tiled formats; however, if this feature were to be standardized, only the linear storage could be supported across current GPUs. Unfortunately, linear storage is texture cache inefficient. Each GPU tiled format is vendor specific, so to standardize this feature, the texture layout would have to be standardized too.
Seamless cube map filtering is a functionality that is globally enabled. The AMD Radeon HD 4000 series introduced the possibility to toggle this feature per cube map. Globally enabled seamless cube map filtering is already not that useful, and having that toggle per cube map is even less useful, but Kepler added support for this feature as well. At least the per-texture toggle works fine, while the global toggle is clunky on all available OpenGL implementations.
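With the per-texture toggle, enabling it is a regular texture parameter; a minimal sketch assuming an existing cube map object:

```c
// Per-texture seamless filtering, instead of the global
// glEnable(GL_TEXTURE_CUBE_MAP_SEAMLESS) switch.
glBindTexture(GL_TEXTURE_CUBE_MAP, CubeMapName);
glTexParameteri(GL_TEXTURE_CUBE_MAP, GL_TEXTURE_CUBE_MAP_SEAMLESS, GL_TRUE);
```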
NVIDIA Fermi and AMD Northern Islands have dedicated DMA engines that can live their own lives. Hence a dedicated thread could be in charge of streaming resources as soon as the application figures out that they might become useful. During these transfers, the graphics engine can continue its life independently, without any required synchronization. Obviously, the transfers would have to be completed before using the resources, but with enough anticipation we would need a synchronization object only to guarantee correctness on all possible hardware, without ever actually hitting that fence.
Currently NVIDIA supports this behaviour, but only by creating a separate context on a dedicated thread. This is workable but cumbersome, and it costs a thread safety penalty for the entire OpenGL implementation.
An explicit way to use the DMA engines for fully asynchronous transfers, performing transfers outside of the rendering code, would be really nice to have.
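As a rough sketch of the current workaround, assuming a hypothetical work queue and platform context functions (PopUploadTask, MakeCurrent and the Upload structure are placeholders):

```c
// Streaming thread owning its own OpenGL context that shares objects with the
// rendering context. Transfers run on the DMA engine while rendering continues.
void StreamingThread(void)
{
    MakeCurrent(StreamingContext); // placeholder for wglMakeCurrent / glXMakeCurrent

    while(Running)
    {
        Upload* Task = PopUploadTask(); // placeholder work queue

        // Upload through a pixel buffer so the transfer is driver-managed.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, Task->BufferName);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, Task->Size, Task->Data, GL_STREAM_DRAW);
        glBindTexture(GL_TEXTURE_2D, Task->TextureName);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, Task->Width, Task->Height,
            GL_RGBA, GL_UNSIGNED_BYTE, 0);

        // Publish a fence; with enough anticipation the rendering thread
        // checks it for correctness but never actually blocks on it.
        Task->Fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush();
    }
}
```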
This extension is the first of a new kind of idea: "Maybe we can make decisions or perform work per warp/wavefront." For example, let's say that we are rendering some objects but the area they cover is actually blurry. Maybe it isn't very useful to use the nicest lighting equation for these pixels. If we make this decision per shader invocation, we will trigger complex branching machinery. If we make this decision per warp/wavefront, we will trigger simple jump instructions, keeping performance high.
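A hedged GLSL sketch of that idea: the expensive path is skipped only when every invocation of the group agrees, so the whole warp/wavefront follows a single branch (the blur mask and the two lighting functions are placeholders):

```glsl
#version 430 core
#extension GL_ARB_shader_group_vote : require

in vec2 TexCoord;
out vec4 Color;

uniform sampler2D BlurMask; // placeholder: how blurry this area is

vec4 CheapLighting() { return vec4(0.5); }                // placeholder
vec4 NiceLighting()  { return vec4(1.0, 0.9, 0.8, 1.0); } // placeholder

void main()
{
    bool Blurry = texture(BlurMask, TexCoord).r > 0.5;

    // If all invocations of the warp/wavefront agree, the whole group takes a
    // single branch with a simple jump instead of diverging per invocation.
    if(allInvocationsARB(Blurry))
        Color = CheapLighting();
    else
        Color = NiceLighting();
}
```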
This extension goes into the "super resolution" range of ideas, where we no longer want to think at a fixed pixel resolution but instead at a higher or lower resolution than the native one. A GPU doesn't actually execute anything on a per-pixel or per-vertex basis but in many different kinds of groupings. The warp/wavefront is the grouping for shader invocations, and another famous one is the quadpixel, a set of 4 fragments. The texture LOD calculation is computed per quadpixel because it is very complex to compute analytically the derivatives required for the texture LOD computation, but it is really easy to compute within a quadpixel: it's simply the difference between the values across the quadpixel.
This extension gives access to quadpixels, allowing us to swizzle intermediate results across the fragments. Let's say a fragment shader requires 4 texture samples. In some areas, we could consider that it is not that useful to sample per fragment and instead sample per quadpixel. This feature should interact pretty well with GL_ARB_shader_group_vote.
This extension extends GL_NV_shader_thread_group to any of the shader invocations of a warp/wavefront. It seems very likely that we could use GL_NV_shader_thread_group on any GPU because all GPUs use quadpixels; however, the warp/wavefront size is different for each GPU vendor: 32 for NVIDIA, 64 for AMD and variable for Intel, between 4 and 16 depending on the case. This feature sounds particularly useful for post-processed antialiasing and maybe things like soft shadows.
This extension simply extends the add and exchange atomic operations to floats.
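A small hedged GLSL sketch of what that enables, for example accumulating luminance into a single float without fixed-point tricks (the buffer layout is an assumption):

```glsl
#version 430 core
#extension GL_NV_shader_atomic_float : require

layout(binding = 0, std430) buffer LuminanceBuffer { float TotalLuminance; };

in vec4 FragmentColor;
out vec4 Color;

void main()
{
    // Atomic add on a float in a shader storage buffer, previously only
    // possible with integer atomics and manual fixed-point encoding.
    atomicAdd(TotalLuminance, dot(FragmentColor.rgb, vec3(0.2126, 0.7152, 0.0722)));
    Color = FragmentColor;
}
```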
ARB_shader_atomic_counters and OpenGL 4.2 introduced the concept of atomic counter operations, but those were limited to increment, decrement and query. AMD GPUs perform these atomic operations in GDS memory, which is faster than image and buffer atomic operations. However, AMD GPUs support more atomic operations on GDS: increment and decrement with wrap; addition and subtraction; minimum and maximum; bitwise operators (AND, OR, XOR, etc.); a masked OR operator; exchange; and compare-and-exchange. GL_AMD_shader_atomic_counter_ops exposes all these operations.
AMD extends the min and max GLSL functions to take three arguments. It also provides a mid3 function that returns the median of three values.
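For example, a hedged GLSL sketch of a small median filter built on mid3 (the texture setup is assumed):

```glsl
#version 420 core
#extension GL_AMD_shader_trinary_minmax : require

uniform sampler2D Source;
in vec2 TexCoord;
out vec4 Color;

void main()
{
    // Median of three horizontal neighbours, one trinary operation per channel.
    vec3 Left   = textureOffset(Source, TexCoord, ivec2(-1, 0)).rgb;
    vec3 Center = texture(Source, TexCoord).rgb;
    vec3 Right  = textureOffset(Source, TexCoord, ivec2( 1, 0)).rgb;

    Color = vec4(mid3(Left, Center, Right), 1.0);
}
```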
One particularly annoying behaviour of AMD hardware is that the group size has to be known at compile time. Hence, this is how OpenGL 4.3 specifies it. With GL_ARB_compute_variable_group_size, the ARB relaxes this behaviour; however, this extension is only implemented by NVIDIA at the moment.
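With the ARB extension, the shader declares a variable local size and the actual group size becomes a dispatch-time parameter; a minimal sketch (the program name and sizes are placeholders):

```c
// In the compute shader, the fixed "layout(local_size_x = ...) in;" declaration
// is replaced by: layout(local_size_variable) in;

// At dispatch time the local group size is a runtime parameter instead of a
// compile-time constant baked into the shader binary.
glUseProgram(ComputeProgramName);
glDispatchComputeGroupSizeARB(
    GroupCountX, GroupCountY, 1,  // number of work groups
    LocalSizeX, LocalSizeY, 1);   // local group size chosen at runtime
```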
GLSL is pretty limited in terms of the types it supports: int, float, double (GL4). With its OpenGL 4 hardware, NVIDIA provides more storage types, including 64-bit integers that can be used with vertex attributes.
This extension was released with the Fermi GPUs. It offers many functionalities that have since gone into the core specification. However, one notable aspect of this specification is the large range of types it provides: 8, 16 and 32 bit integers and half-precision floating point types.
Setting the sample positions is, to me, a very useful feature for post-processing antialiasing, but also for very highly multisampled rendering, which could be nice for text rendering, for example.
This extension is a collaboration between Apple and NVIDIA. It is obviously designed to handle high-DPI screens. When calling glBlitFramebuffer to resolve a multisampled framebuffer, the OpenGL implementation can resolve and scale the framebuffer in a single operation.
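If I understand correctly, this is exposed by adding new filter values to glBlitFramebuffer; a hedged sketch assuming EXT_framebuffer_multisample_blit_scaled tokens and placeholder dimensions:

```c
// Resolve the multisampled framebuffer and scale it to the window size in a
// single blit, using a scaled-resolve filter instead of GL_LINEAR.
glBindFramebuffer(GL_READ_FRAMEBUFFER, MultisampledFramebufferName);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(
    0, 0, RenderWidth, RenderHeight,  // multisampled source rectangle
    0, 0, WindowWidth, WindowHeight,  // scaled destination rectangle
    GL_COLOR_BUFFER_BIT, GL_SCALED_RESOLVE_FASTEST_EXT);
```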
NVIDIA uses something called multisample coverage, which allows having more coverage samples than color samples. Hence, the implementation can adapt the multisampling according to the number of coverage samples covering each color sample.
In the second part of this article, we will discuss the possibilities for blending, stencil, profiling, the rendering pipeline and miscellaneous features. We will also discuss the announcements made at GDC, if enough public information is given away for me to discuss relevant things. We will conclude with my personal wish list for the OpenGL 5 and Direct3D 12 hardware class.