For this wish-list, I thought that instead of just taking a view on OpenGL 4.2, I could look at OpenGL for the OpenGL 4 hardware class as a whole (I might have been borderline on what is actually possible for some features...).
I call it 'my OpenGL 4.2+ wish-list' as I don't think all the following ideas could be part of OpenGL 4.2. You can see this wish-list as a statement against the 'OpenGL is close to complete' idea I read too often after the OpenGL 4.1 specification release.
Don't get me wrong, I think that OpenGL is already great, but real-time graphics is still in its youth, so I don't accept the idea that we are done when plenty of opportunities for new ideas are awaiting us. I even see a lot of hardware evolution ahead for OpenGL 5 hardware and beyond, even if it's quite far away. :)
This article is mainly based on some tests performed with my OpenGL Samples, experiments in engine design, and my understanding of the GeForce 400 series and Radeon 5000 series.
I have been requesting direct state access ever since the GL_EXT_direct_state_access extension was released, and it is really nice to see that both AMD and nVidia have now implemented it. However, once again I would like to repeat that in its current form, this extension is far from perfect and I really don't want it included in any OpenGL specification.
The ARB has added parts of direct state access through new API features: sampler objects with OpenGL 3.3, separate program objects with OpenGL 4.1 and, if we are picky, even uniform blocks with OpenGL 3.1. I like this approach because it avoids including a lot of already deprecated functions in core: OpenGL is already a big fat mammoth, especially the compatibility profile, and there is no need to make it heavier for no reason, it's complicated enough to learn! Old software that uses deprecated features probably relies on mechanisms which are not perfect but which work, and rewriting it to use only DSA versions of deprecated features is just a waste of time. For a new code path, it's also a waste of time to use deprecated features, so there is still no need for DSA of deprecated features.
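A minimal sketch of the difference, assuming TextureName and SamplerName already exist: the classic path has to disturb a binding point just to edit state, while the sampler object and GL_EXT_direct_state_access paths don't.

    // Classic bind-to-edit: a texture unit is disturbed just to set a parameter.
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, TextureName);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);

    // Sampler object (OpenGL 3.3): direct state access by design, no bind needed to edit.
    glSamplerParameteri(SamplerName, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);

    // GL_EXT_direct_state_access style for the texture object itself.
    glTextureParameteriEXT(TextureName, GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);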
There are still a lot of DSA functions missing and I guess that if we want improved multithreaded rendering we will need them, as I assume DSA removes the need for queries, which would have to be thread safe and hence potentially slow. Reading extension strings, I found a GL_EXT_direct_state_access_memory extension which could be a subset of GL_EXT_direct_state_access for buffers and textures. It could cover most of the remaining missing DSA functions.
Nowadays, every CPU sold has multiple cores. Unfortunately, OpenGL doesn't give much help for programming an efficient multithreaded renderer, unlike Direct3D 11.
Regarding multithreading, I think that Direct3D 11 gets it right. Thus, for OpenGL I would like the possibility to build command lists from multiple threads, at least one per thread.
Let's admit that compiling a GLSL program is slow. I don't see any good reason to compile a program within the thread that executes the OpenGL command list. Using multiple threads to build command lists could even give the drivers more opportunities to optimize the final command list used for rendering without slowing down the rendering, because the optimization would be hidden in a separate thread; these are optimizations that we usually have to take care of ourselves using delayed OpenGL calls, for example.
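A sketch of what I mean on the application side today, since OpenGL itself offers nothing like this: the command list type and its functions are mine, not OpenGL's, ProgramName and VertexCount are assumed to be filled in elsewhere, and a GL loader header is assumed. Worker threads only record the calls; the rendering thread, which owns the context, replays them.

    #include <cstddef>
    #include <functional>
    #include <vector>

    GLuint ProgramName = 0;
    GLsizei VertexCount = 0;

    typedef std::vector<std::function<void ()> > commandList;

    // Worker threads record commands without touching the context...
    void record(commandList & Commands)
    {
        Commands.push_back([]{ glUseProgram(ProgramName); });
        Commands.push_back([]{ glDrawArrays(GL_TRIANGLES, 0, VertexCount); });
    }

    // ... and only the rendering thread replays them, in order.
    void submit(std::vector<commandList> const & Lists)
    {
        for(std::size_t List = 0; List < Lists.size(); ++List)
            for(std::size_t Command = 0; Command < Lists[List].size(); ++Command)
                Lists[List][Command]();
    }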
Direct3D 11 mainly sold itself on 3 features: tessellation, multithreading and compute shaders. This is all good but I really think that one of the most interesting parts of Direct3D 11 is RWBuffer and RWTexture*, what we call image and buffer load and store in OpenGL.
When we think about it, image and buffer load and store is one of the craziest features integrated in OpenGL 4 hardware... With this feature it's possible to read and write any buffer and any image wherever we want, and even perform atomic operations.
It's a wonderful source of new ideas and another step toward a programmable blending stage after the texture barrier, even if I don't think it could be as efficient or lead to as many hardware optimizations as a proper programmable blend stage. Performance is apparently not so good with the current hardware generation, but the possibilities are here. These capabilities are embodied by a set of extensions including GL_EXT_shader_image_load_store, GL_NV_shader_buffer_load and GL_NV_shader_buffer_store.
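On the C++ side, GL_EXT_shader_image_load_store roughly boils down to binding a texture level to an image unit and fencing the writes, something like the sketch below (TextureName is assumed to exist; the shader then reads and writes it with imageLoad, imageStore and the imageAtomic* functions).

    // Bind level 0 of a texture to image unit 0 for read/write access in the shader.
    glBindImageTextureEXT(0, TextureName, 0, GL_FALSE, 0, GL_READ_WRITE, GL_R32UI);

    // ... draw something that calls imageLoad / imageStore / imageAtomicAdd on it ...

    // Make the image writes visible to the following texture fetches.
    glMemoryBarrierEXT(GL_TEXTURE_FETCH_BARRIER_BIT_EXT);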
Unfortunately, the lack of coherence between these extensions shows me that there isn't an agreement between AMD and nVidia on the way it should be designed, or at least there wasn't in time for the OpenGL 4.1 release.
Image and buffer load and store is a great feature but I actually think it will only reach its best if we also have an API for what D3D11 calls UAV buffers, which allow building and traversing linked lists on the GPU for very complex data structures and access patterns... A step toward hybrid rendering with ray tracing, or at least ray casting? Sparse voxel octrees?
One good thing with uniform blocks is that they give efficient data access, as they are based on AoS (Array of Structures) which provides contiguous memory access between structure elements. When buffer data is accessed per-framebuffer, per-program, per-draw call, per-instance, per-primitive, per-vertex and maybe even per-fragment, etc., this reduces the waste of memory bandwidth. The waste comes from unused data being fetched because of memory alignment and because GPUs (and CPUs too) can't fetch less than a certain amount of data. The GPU is a processor that works per task, so all the data fetched beyond the current task might reach the GPU cache and never be used before being invalidated and fetched again later when that task is actually scheduled.
I think that, used in a proper way, uniform blocks can have significant performance benefits as they naturally fetch contiguous data. Obviously, it's possible to use arrays inside blocks, which might jeopardize some of the performance benefits of blocks if the data is not used within the task...
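For reference, hooking up a uniform block today looks roughly like this, assuming the program declares a 'transform' block and BufferName already holds std140-layout data:

    // Associate the program's 'transform' block with binding point 0...
    GLuint const TransformBinding = 0;
    GLuint const BlockIndex = glGetUniformBlockIndex(ProgramName, "transform");
    glUniformBlockBinding(ProgramName, BlockIndex, TransformBinding);

    // ... and expose the relevant range of the buffer at that binding point.
    GLsizeiptr const TransformSize = sizeof(float) * 16; // a single mat4 in std140 layout
    glBindBufferRange(GL_UNIFORM_BUFFER, TransformBinding, BufferName, 0, TransformSize);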
OpenGL 3.1 introduced texture buffer objects, which allow access to a large amount of data with either a SoA or an AoS model, but which also make possible some quite advanced data structures in GPU memory. I believe that in many cases the AoS model is more efficient with texture buffers too. Unfortunately, it doesn't feel really natural to use because GLSL only provides the texelFetch function, which only returns up to 4-component vectors. Using multiple calls we can actually fetch a contiguous memory structure and rebuild it in the shader... How fastidious!
My request here is a structFetch function which allows directly fetching a data structure from a buffer as it is (no normalization, no cast). There are probably some type limitations for the structure elements on some OpenGL 4 hardware (on the Radeon 5000 series but maybe not on the GeForce 400 series) but as part of the idea of building complex data structures, it would be very convenient and hopefully a good guide for best practices of buffer access.
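To make the point concrete, here is the fastidious way we have today and, as a purely hypothetical syntax, the structFetch I am asking for; the instance layout is only an example.

    // Today: rebuild the structure by hand, one texelFetch per vec4 (GLSL source as a C string).
    char const * VertSource =
        "#version 410 core \n"
        "uniform samplerBuffer InstanceBuffer; \n"
        "struct instance { vec4 Position; vec4 Color; }; \n"
        "instance fetchInstance(int Index) \n"
        "{ \n"
        "    instance Instance; \n"
        "    Instance.Position = texelFetch(InstanceBuffer, Index * 2 + 0); \n"
        "    Instance.Color    = texelFetch(InstanceBuffer, Index * 2 + 1); \n"
        "    return Instance; \n"
        "} \n";

    // Wished for (hypothetical): instance Instance = structFetch(InstanceBuffer, Index);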
OpenGL 3.1 includes the GL_ARB_copy_buffer extension, which allows copying data from one buffer to another without going through global memory. As part of creating complex data structures, I think it would be very convenient to be able to do the same between images. GL_NV_copy_image already provides such a feature in a very powerful way, allowing cross-target, cross-dimension and even cross-context copies of texture and renderbuffer sub-data. I'm not sure about the idea of being able to read or write from a renderbuffer... Since OpenGL 3.2 we can consider renderbuffers deprecated if we want to; however, I quite believe that in the future (OpenGL 5 hardware?) renderbuffers might become interesting again if we consider them as surfaces we can't explicitly reuse, which could bring some hardware optimizations.
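Both copies are already expressible today, one in core and one through the nVidia extension; a rough sketch, assuming the source and destination objects and the sizes are already set up:

    // OpenGL 3.1 core: buffer to buffer copy, no round trip through client memory.
    glBindBuffer(GL_COPY_READ_BUFFER, SourceBufferName);
    glBindBuffer(GL_COPY_WRITE_BUFFER, DestBufferName);
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, Size);

    // GL_NV_copy_image: texture to texture copy, cross target and even cross context.
    glCopyImageSubDataNV(
        SourceTextureName, GL_TEXTURE_2D, 0, 0, 0, 0,
        DestTextureName, GL_TEXTURE_2D, 0, 0, 0, 0,
        Width, Height, 1);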
After a test of the GeForce 470 on The Froggy FragSniffer, it quickly appears that the Fermi architecture works on tiles of 16 by 16 fragments, where each 'GPC' works separately.
In some ways, I would like to call the Fermi architecture a hybrid tile renderer GPU, because I don't think that on the vertex side it actually works like a tile renderer; this tiling probably exists only to have big enough 'warps'/'work groups'.
Thanks to transform feedback, we can simulate a tile renderer behaviour even if it would be quite slow. One issue is that we can't control the output format, so everything is saved as floats, ints or uints, an issue I wish I could work around by being able to set up a different external format thanks to a vertex layout object... See section 6.1 for more details on this vertex layout object.
In this part I would like to deal with wishes close to the specification, which would simply remove some specification limitations, fix simple mistakes or bring subtle feature refinements.
With OpenGL 4.1 we can use uniform buffers with a uniform block array, one buffer per array entry. However, I would rather use only a single buffer for all the per-instance data, for example. On top of this, it would be great to have a function to set up this single buffer for the whole uniform block array.
I think that using a uniform block array element for per-instance data is a good thing, as it reduces memory bandwidth by forcing the developer to work AoS instead of SoA.
Finally, this feature request removes the overhead of many small buffer allocations and reduces the risk of GPU memory fragmentation.
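Today the closest we can get is to slice one big buffer ourselves, one glBindBufferRange call per array element; what I would like is a single call for the whole array. A sketch under these assumptions (InstanceCount, FirstBinding and BufferName come from elsewhere, and the 'array' binding function at the end is purely hypothetical):

    // Current workaround: one binding call per uniform block array element, all
    // ranges living in the same buffer and respecting the offset alignment.
    GLint Alignment = 0;
    glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &Alignment);

    GLsizeiptr const InstanceSize = sizeof(float) * 16; // one mat4 per instance, as an example
    GLsizeiptr const BlockSize = ((InstanceSize + Alignment - 1) / Alignment) * Alignment;
    for(GLuint Instance = 0; Instance < InstanceCount; ++Instance)
        glBindBufferRange(GL_UNIFORM_BUFFER, FirstBinding + Instance,
            BufferName, BlockSize * Instance, InstanceSize);

    // Wished for (hypothetical):
    // glBindBufferRangeArray(GL_UNIFORM_BUFFER, FirstBinding, InstanceCount, BufferName, 0, BlockSize);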
With the release of OpenGL 4.1, I had high expectations for more efficient cascaded shadow map rendering thanks to the new GL_ARB_viewport_array extension and core feature, which allows setting up multiple viewports.
One big difference between Direct3D and OpenGL is that OpenGL doesn't require all the colorbuffers to have the same size. However, when a colorbuffer has a smaller size, some pixels are clipped. With OpenGL 4.1, I was expecting to use one viewport per layer so that I could rasterize some triangles at a lower resolution than others, which would be useful for cascaded shadow map generation in a single pass, as we usually want higher resolution maps close to the camera position and lower resolution maps far from the camera position. Unfortunately, layered rendering has a limitation: all the layers must have the same size... I don't really know if it is a hardware limitation (in that case, let's remove it for OpenGL 5 hardware!) or just a specification detail.
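The viewport side of this is already expressible with GL_ARB_viewport_array; a rough sketch of the setup I was hoping for, with one viewport per cascade (the sizes are arbitrary):

    // One viewport per shadow cascade, each one smaller than the previous.
    GLuint const CascadeCount = 4;
    GLuint const BaseSize = 2048;
    for(GLuint Cascade = 0; Cascade < CascadeCount; ++Cascade)
    {
        GLfloat const Size = GLfloat(BaseSize >> Cascade); // 2048, 1024, 512, 256
        glViewportIndexedf(Cascade, 0.0f, 0.0f, Size, Size);
    }
    // The geometry shader writes gl_Layer and gl_ViewportIndex per cascade, but all
    // the layers of the attached texture array still have to share a single size.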
OpenGL 4.0 brings proper support for sampler arrays, however some limitations remain. It's not possible to freely access any element of a sampler array: we are restricted to constants and uniform variable indices. After some tests, I figured out that this limitation doesn't apply on nVidia drivers. On AMD drivers there are some lookup issues, but it might be possible to fix them.
I think that removing this limitation could provide some great benefits for instancing, and if there actually are some hardware limitations, maybe it could at least be relaxed to allow per-shader-invocation indices.
Since OpenGL 3.2, it's possible to use blocks to communicate between shader stages, and I especially like this feature to define communication protocols between stages. With OpenGL 4.1, there are almost no more constraints, except that vertex shader inputs and fragment shader outputs can't be blocks. I remember doing some tests on nVidia OpenGL 3.3 beta drivers to see if it was possible, and for some reason it resulted in linking errors... It would be nice to have this limitation removed for a fully consistent approach to programming shaders with blocks.
There is a choice I really don't understand about the program pipeline object design: why does the pipeline object have to be created by glBindProgramPipeline? From a DSA point of view, this makes it impossible to consider this object as a DSA object, just because of this limitation, even if the rest of the API is perfectly DSA oriented. At draw time, the implementation needs to check whether the program pipeline is still correct or whether, in the meantime, a program pipeline has actually been created.
With the sampler object, the ARB got it right because the object is actually created by the first glSamplerParameter* call. Often with new OpenGL features, the previous specifications already provide solutions; for the program pipeline object it should be the same.
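To illustrate the inconsistency, assuming VertProgramName is a separable program: a sampler object is fully usable without ever being bound, while the program pipeline name only becomes an actual object once it goes through glBindProgramPipeline.

    // Sampler object: generated and edited without any bind, DSA friendly.
    GLuint SamplerName = 0;
    glGenSamplers(1, &SamplerName);
    glSamplerParameteri(SamplerName, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    // Program pipeline: the name is reserved here...
    GLuint PipelineName = 0;
    glGenProgramPipelines(1, &PipelineName);
    // ... but the object only really exists after a bind, which breaks the DSA spirit.
    glBindProgramPipeline(PipelineName);
    glUseProgramStages(PipelineName, GL_VERTEX_SHADER_BIT, VertProgramName);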
I find the few possibilities we have regarding compiler and preprocessor options quite crazy. One side of writing an OpenGL 'engine' is to simulate behaviours that Visual Studio or GCC would provide for us...
From the beginning, Cg has included the possibility to build a shader for a specific 'SM' version. There is the #version preprocessor directive to set in the shader, but if we want to manage the version from the C++ program... it's just hand work.
Another very convenient tool is the GCC -D parameter, which allows defining a value at build time. More do-it-yourself with GLSL, and it's the same for the extension list...
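The do-it-yourself workaround is well known: since glShaderSource takes an array of strings, we can prepend the #version line and our own #define block from the C++ side. A small sketch, where the Arguments string plays the role of GCC's -D options and ShaderSource is assumed to hold the body of the shader:

    // Emulating '-D' and version selection by concatenating source strings.
    char const * Version = "#version 410 core \n";
    char const * Arguments = "#define USE_INSTANCING 1 \n#define MAX_LIGHT 4 \n";
    char const * Sources[3] = {Version, Arguments, ShaderSource};

    GLuint ShaderName = glCreateShader(GL_VERTEX_SHADER);
    glShaderSource(ShaderName, 3, Sources, NULL);
    glCompileShader(ShaderName);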
OpenGL 4.1 allows getting the binary of a GLSL program and reusing it later on. This also allows creating an offline compiler tool which would allow more optimized GLSL programs... It would be nice to be able to control the optimizations outside of the source, with a finer granularity.
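For reference, the OpenGL 4.1 binary path looks roughly like this (ProgramName and ReloadedProgramName are assumed to exist); since the format is vendor specific, the result is only good for a cache keyed on the driver, not for offline distribution:

    #include <vector>

    // Ask for the binary of a linked program...
    GLint BinaryLength = 0;
    glGetProgramiv(ProgramName, GL_PROGRAM_BINARY_LENGTH, &BinaryLength);

    std::vector<unsigned char> Binary(BinaryLength);
    GLenum BinaryFormat = 0;
    glGetProgramBinary(ProgramName, BinaryLength, NULL, &BinaryFormat, &Binary[0]);

    // ... and reload it later, skipping the GLSL compilation entirely.
    glProgramBinary(ReloadedProgramName, BinaryFormat, &Binary[0], BinaryLength);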
To significantly decrease shader compilation time, a good idea is to do all the queries after all the builds, because querying the logs introduces a delay: we have to wait for the result of the query. Also, we certainly need logs... but maybe only during software development! An option like GCC's '-quiet' might be really useful to speed up the build even more.
Finally, compilers are far from perfect at following the GLSL specification, and at least in nVidia's case, I don't even think they really want to follow it in a strict manner. My belief is that for nVidia, their compiler should be the GLSL specification plus a set of extra features. Is this bad? From the point of view of cross-platform development, yes, and in that regard I prefer to develop on an AMD platform. From the point of view of innovation, no. To get the best of both, a simple compiler option like GCC's '-pedantic' would simply allow these two worlds to co-exist on every platform.
Through its multiple versions, GLSL has significantly increased the number of qualifiers. All of them have a purpose, but the syntax is complex and just so ugly. Why are some qualifiers part of the 'layout' and some outside? All this seems a bit messy.
In C and C++, for that kind of scenario (and actually for even simpler scenarios) we would use typedefs, which is part of my wish. Also, GLSL defines an arbitrary order for the qualifiers... This is a really annoying choice because this order doesn't rely on anything logical besides historical reasons, so I would like this arbitrary order limitation removed.
Actually, I am not sure which way would be better: a typedef or a qualifier keyword to declare variable qualifiers. I see positive and negative consequences for both.
With OpenGL and GLSL it's possible to create a program using multiple shaders. This is particularly nice for reusing functions, structures, defines and typedefs across different programs associated with the same program stages. Unfortunately, it's still impossible with OpenGL 4.1 to use a single shader object for multiple stages: we need to duplicate shader libraries for each stage... even if the code is word for word identical! Not great. For example, I typically would like to reuse the same structure between input and output blocks across program stages, as in the following sketch.
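Something along these lines, with the GLSL given as C strings and the structure and block names being only examples:

    // A structure shared by the output block of the vertex shader...
    char const * VertSource =
        "#version 410 core \n"
        "struct vertex { vec4 Color; }; \n"
        "out block { vertex Vertex; } Out; \n"
        "void main() \n"
        "{ \n"
        "    Out.Vertex.Color = vec4(1.0); \n"
        "    gl_Position = vec4(0.0, 0.0, 0.0, 1.0); \n"
        "} \n";

    // ... and by the input block of the fragment shader, word for word the same declaration.
    char const * FragSource =
        "#version 410 core \n"
        "struct vertex { vec4 Color; }; \n"
        "in block { vertex Vertex; } In; \n"
        "out vec4 FragColor; \n"
        "void main() { FragColor = In.Vertex.Color; } \n";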
NOTE: The previous code doesn't build on nVidia drivers yet and I hope this driver bug will be fixed soon.
We can already do it ourselves by playing with strings, but this is the compiler/preprocessor/linker's job, not ours. Instead, I propose to create a 'common/library shader target' that could not contain a main function or any built-in variable or per-stage specific item, but that could be reused across program stages.
With OpenGL 4.1 and separate programs, OpenGL had to evolve to replace the old rendezvous-by-name approach, which requires a long linking phase to connect the variables between stages by checking strings. This evolution has been possible thanks to the explicit location qualifier introduced by GL_ARB_explicit_attrib_location and generalized to varying variables.
We are supposed to be able to set the locations of varying structures since OpenGL 4.1, but it is still not supported by OpenGL drivers. I hope it's just a matter of time.
I think that the generalization of locations should be extended further to uniform variables and uniform blocks (explicit index qualifier). On one side, it would improve the API consistency, and on the other side it would improve software modularity by removing string-based queries.
Finally, nVidia already has the function glTransformFeedbackVaryingsNV based on 'rendezvous by resource', which is a perfect candidate for promotion into the core specification.
I found on the OpenGL forum an idea that I find quite interesting: what about having the possibility to check whether an OpenGL object is created correctly or whether it is complete? This is something we already have with the framebuffer, shader, program and program pipeline objects, but it could be generalized to other objects. It's possible that this feature could be implemented as a function, or maybe through a message of the debug output extension.
Drivers have bugs and will always have some, just like any other software, at least until something better than humans writes them. When using OpenGL, we can check features using the OpenGL context version and the supported extensions, but this is not always enough. Just because the drivers report that a feature is supported doesn't mean that this feature will entirely work. A common practice is to create a database of faulty drivers and to check the driver version to warn the software users that their drivers have a bug that prevents the software from running properly, and to advise them to update to a newer version or to a version we recommend because it has been specifically tested.
Querying the driver version (and release date?) could have all sorts of uses, and it would be great to have this possibility from the OpenGL API.
These queries could be extended to the memory quantity and availability, GPU temperatures or even some indications of GPU performance, and a lot more following the developer's imagination. Some of these features are already exposed by the extensions GL_ATI_meminfo and GL_NVX_gpu_memory_info.
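Both vendor extensions are simple glGetIntegerv queries; a sketch of what's already possible today, using tokens from GL_NVX_gpu_memory_info and GL_ATI_meminfo:

    // nVidia: dedicated and currently available video memory, in KiB.
    GLint DedicatedMemory = 0;
    GLint AvailableMemory = 0;
    glGetIntegerv(GL_GPU_MEMORY_INFO_DEDICATED_VIDMEM_NVX, &DedicatedMemory);
    glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &AvailableMemory);

    // AMD: free memory for the texture pool, 4 values per query, also in KiB.
    GLint TextureFreeMemory[4] = {0, 0, 0, 0};
    glGetIntegerv(GL_TEXTURE_FREE_MEMORY_ATI, TextureFreeMemory);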
I'm not a big fan of CrossFire and SLI technologies but still, using multiple GPUs is possible and can have some real use cases for research and very expensive computation scenarios. These possibilities have been available for years, but the OpenGL support remains limited to vendor extensions: WGL_AMD_gpu_association and WGL_NV_gpu_affinity. It would be great to get an ARB extension for this support.
After debates over debates, I remain a VAO hater, even if it's maybe not that bad for performance. It's not bad for performance, but not really good either, and it's a real pain on the software design side. VAOs don't make sense, so only a simple and stupid design based on one VAO per draw call works fine... no thank you! VAO certainly looks pretty at first glance, but this might even make it more awful, and in any case it keeps, for me, the title of worst idea ever integrated into the OpenGL core specification.
To relieve the software constraints of VAO, I suggest updating VAO or creating an object that works like a vertex layout object, as the OpenGL community has always requested. This object would only describe the structure of the vertex to tell the GPU how to gather the vertex attributes, but also how to output transform feedback varyings. This means it could be bound to multiple targets.
This way, the API points developers toward an area of optimization, advises sorting the draw calls by vertex layout and remains flexible, as the array buffer would not be attached to this object, providing the essential complete freedom for custom vertex data management. nVidia bindless graphics already allows this type of approach, but it's possible to design it without GPU pointers too.
I am quite up for GPU pointers but let's face it: in case of an invalid access the drivers restart on Windows 7 and the computer simply freezes on Windows XP... It feels pretty hard to expect the ARB to reach an agreement on this. What about having both? It would be my favourite option!
With many new versions of OpenGL we get a new draw call function. In a way, we could deprecate the previous functions after each new draw call function is introduced. Considering this and the issues with the VAO object, I think that OpenGL would benefit from a draw object. For each new draw parameter, a new draw object parameter would be created, and we could keep the same draw function and expect default parameter values. If it sounds more acceptable to the OpenGL driver teams, the draw object could work as a container.
With the program pipeline object I saw some opportunities to design an environment program object in a useful way. The environment program object is maybe the last promise of Longs Peak that we haven't got yet.
The idea behind an environment program object is to be able to group all the data that would be used by a program in a single memory location, where the drivers would be able to set up how to access those data. The environment program object is for programs what the layout object would be for 'vertex pulling'. For a successful environment program object, it is really important to keep it decoupled from the program objects and from the buffers. This is the only way to keep the level of flexibility we have today and prevent VAO-type constraints. Uniform variables would be set directly, but a uniform block would only have its level of indirection hidden.
Immutable objects are one of my very old requests. For a long time, I liked to use display lists to build some static objects and be able to quickly switch from one object to another. Back at that time (2-3 years ago?), I measured some interesting performance gains. It was convenient to use and it gave me a software design solution to handle the 'lost global states'. Unfortunately, display lists have been deprecated since OpenGL 3.0, so they are not an option anymore, even if my code usually still groups the global states into C++ objects that match my software design.
Direct3D already has some objects for specific groups of 'lost global states', but I am not sure it's the right approach. I guess that these groups could be quite dependent on the hardware, which would make it hard to reach an agreement, except if OpenGL strictly followed Direct3D. Another idea is to let the developer group the states the way he wants and let the drivers optimize the state groups however they can. This is quite flexible, but in practice I am not sure the drivers would be able to perform a lot of optimizations. At least we could expect the level of efficiency that display lists provide, which is already great. One drawback is that developers could create immutable objects that make no sense at all, but that seems fair enough.
One note: the idea of custom state objects could probably come together really well with the command lists designed for multithreading.
This part is dedicated to ideas that caught my attention but into which I haven't put enough thought, or for which I just didn't gather enough clues and experience to let me settle on where I think we need to go.
This is maybe too soon, but why not think about a second deprecation pass in the OpenGL API? I'm not saying we should remove features yet, but at least mark some features as deprecated. For example the texture proxy... who is using this? Most of the draw call functions, or all of them if we had a draw object. The renderbuffer, which is only a subset of what we can do with textures. glViewport, glClear, glClearColor / glClearStencil / glClearDepth, all these functions that have alternatives. The only purpose of this deprecation pass would be to simplify the API, to only keep the useful functions. I also think that deprecation should only be seen as advice on what to use or not.
API coherency is a long topic that requires much more research and time than what I unfortunately had for this wish-list. However, I think that many of my requests try to address some consistency issues within OpenGL, which today might make OpenGL much harder to learn than Direct3D 11.
For example, the various rules that handle the ways to set uniform values, subroutines and blocks follow 3 totally different sets of rules... which in practice we will probably limit to the greatest common denominator of these 3 sets of rules, something that my idea of the program environment object follows.
Another example: when several 'slots' are used for a feature, we have all sorts of postfix tokens: 'i', 'index', 'array'.
The communication between program stages, but also between GLSL and C++ programs, is built on top of various mechanisms that don't obviously work everywhere. The GLSL cast rules are quite awful and follow their own specific logic, where it would have been nicer to reuse rules that already exist, like the C++ rules, or to just keep the OpenGL 3.3 rule (explicit only).
Some objects (texture, sampler, program) use parameter functions to set up their settings, while others create a new function for each new functionality (draw call functions)...
Some features work only through DSA-style functions (sampler, uniform blocks), while others rely on multiple targets (buffer, framebuffer), some even provide both ways (program) and others use multiple units (texture)...
Some object names are reserved with glGen*, but some use glCreate* and can only reserve one name at a time instead of multiple as with glGen*... and I could easily find more examples!
I think that this lack of consistency makes OpenGL a very complex API to work with, especially because I believe in OpenGL everywhere for everything, including 'Paint' types of software, which means that OpenGL must be usable by developers that don't have specific skills in OpenGL or graphics rendering in general.
The ARB has spoken about streamlining the OpenGL API when they created the core profile, and I think even during Longs Peak development. I don't think that removing features has anything to do with streamlining an API; it has to do with bringing more sense into the API, each element fitting with the others, but unfortunately during the past few years we have witnessed the exact opposite of streamlining... I believe that a streamlined API is what has made OpenGL a backward compatible API for years, and what I am afraid of is that in the future OpenGL becomes so complex and irrational that it would be too complex to evolve and to work with. What if it's already the case?
One thing I am not interested in at all is adding functions that do nothing but look pretty. I think that GL_AMD_name_gen_delete is one of those extensions that doesn't lead to anything new, even if the concept remains interesting. A larger API means a more complex API by definition. This extension formally defines the concept of named objects and consequently it opens the door for other approaches. Why not introduce a pointed object or some sort of pointer-based object access? It has been an old topic within the ARB, but I don't see why these 2 conventions couldn't co-exist if they were generalized to every object.
In this extension, or actually in an extended version of this extension, I also imagine a way to standardize how named objects would be bound, but also a way to create several objects of different types.
The purpose of these last functions would be to let the developer create objects that use several other objects which live and are used together. It could give a hint to the drivers to hopefully enable some optimizations.
With all the possibilities provided by OpenGL 4 hardware and the programming freedom given by this generation of hardware, an OpenGL programmer could wish to be able to program the GPU like he would the CPU: a wish for object orientation. So far OpenGL and OpenCL have stayed out of it because of design decisions. Meanwhile, HLSL11 and Cuda have embraced it at different levels.
On one side, Direct3D 11 brings the keyword 'class' to the language. On the other side, Cuda brings some sort of C++ support as part of its language. I don't know the details of how this is actually possible, but it is certainly impressive and it raises a question for me: do we want object orientation in OpenGL and OpenCL through C++?
Some differences between Cuda and OpenCL are that Cuda is based on an offline compiler and that Cuda is platform specific. As Walter Bright shows in his article on C++ compilation speed, the language is slow to compile by nature. I don't know what the Cuda compiler is based on, but as the language is built on top of C++, I assume that Cuda compilation is really slow even if it is based on the very impressive LLVM. However, even if this compiler is slow, it's an offline compiler, and at program execution time the code doesn't need to be built again. This is only possible because Cuda is platform specific, which means that it is only going to run on nVidia drivers and hardware.
With OpenGL 4.1 we finally have the possibility to get the binary of a GLSL program. I am quite skeptical about this feature, as building GLSL programs is really fast (compared to a C++ program), and in any case these binaries can only be used for a cache system because there is no standard binary format defined.
Thus, using an offline compiler is complicated, which leads to forgetting about the idea of using C++ instead of GLSL or the OpenCL language, unless the ARB decides to sit around the table to define a standard binary... for nothing sooner than OpenGL 5 hardware.
This brings us to an alternative, which D3D11 probably uses: a semi-compiled language that AMD and nVidia would transcode into their own OpenGL 4 hardware binary code.
A standard binary code even opens the door for alternative languages other than C++. C++ programmers might have fun programming the GPU with C++, but I can't imagine it would make C#, Java or Python programmers happy. Furthermore, nothing prevents GLSL programs from being built into this standard GPU binary code. Other side effects could be some open source work to bring GPU binary support to GCC and LLVM, for example. It might become an ongoing topic in the future and I am already impatient to read what people want and think about it!
OpenGL is like all these interesting topics: the more we think about it, the more ideas we can find, and I bet that in the next months I will find more new ideas or refinements. In any case, with all this, who could still dare to say that OpenGL is 'nearly completed'? :p What about all the work required as part of the OpenGL ecosystem?
I'm not used to doing it on this website, but I would like to give special thanks to all the people who give me the energy for this endeavour and the ones with whom I have shared blossoming OpenGL discussions, but also the AMD and nVidia OpenGL teams for their support, which allows me to touch the OpenGL state of the art and beyond.