One difference between a good API and a great API is, from my point of view, how comfortably it lets us develop with it, and that includes debugging.
Currently, the main tool we have in the API is glGetError, which feels seriously limited. Some software like gDEBugger or glslDevil will tell us about the order of function calls or about the results of shader computations, but does that really amount to debugging completeness?
For example, I recently experienced something that I would expect such software to miss but a debug profile to catch. I was trying, on AMD, a code sample that generates mipmaps from a framebuffer texture, a sample I had written previously on nVidia. At launch, the sample simply crashed without any notice (no OpenGL error)... not nice! After a while, I understood that the AMD drivers expect the mipmaps to be generated before rendering to the texture: we need to call glGenerateMipmap once up front so that the memory gets allocated. I understand how this sounds like good practice and that we should probably do it even on nVidia, but it crashed without a word. With a debug profile, we could have had a clear message that even explains what is going on and why.
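For reference, here is a minimal sketch of the workaround, with illustrative texture sizes and formats (assuming an OpenGL 3.x header or extension loader is already included): calling glGenerateMipmap once before rendering to the texture allocates the whole mipmap chain.

```cpp
GLuint Texture = 0;
glGenTextures(1, &Texture);
glBindTexture(GL_TEXTURE_2D, Texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 512, 512, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glGenerateMipmap(GL_TEXTURE_2D); // allocate every mipmap level up front

GLuint Framebuffer = 0;
glGenFramebuffers(1, &Framebuffer);
glBindFramebuffer(GL_FRAMEBUFFER, Framebuffer);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, Texture, 0);

// ... render into the framebuffer ...

glBindTexture(GL_TEXTURE_2D, Texture);
glGenerateMipmap(GL_TEXTURE_2D); // rebuild the mipmaps from the rendered base level
```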
I actually believe such a profile could even be used to give a lot of comments and advice to developers, simply to encourage a better use of OpenGL, because performance wouldn't be a critical criterion on that profile.
Instead of glGetError, we could imagine that every function parameter would be checked, reporting an error as soon as an invalid parameter is given, or even emitting warnings about anything that should work but isn't standard.
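To illustrate the clunkiness a debug profile would remove, here is the kind of glGetError boilerplate we end up writing by hand today; a hedged sketch, the helper name is mine (assuming an OpenGL header or extension loader is already included):

```cpp
#include <cstdio>

bool checkError(char const* Title)
{
    GLenum Error = glGetError();
    if(Error == GL_NO_ERROR)
        return true;

    char const* String = "unknown OpenGL error";
    switch(Error)
    {
    case GL_INVALID_ENUM: String = "GL_INVALID_ENUM"; break;
    case GL_INVALID_VALUE: String = "GL_INVALID_VALUE"; break;
    case GL_INVALID_OPERATION: String = "GL_INVALID_OPERATION"; break;
    case GL_OUT_OF_MEMORY: String = "GL_OUT_OF_MEMORY"; break;
    }
    std::fprintf(stderr, "%s (%s)\n", String, Title);
    return false;
}

// Usage: sprinkle checkError("glTexImage2D"); after suspicious calls...
// which still says nothing about *which* parameter was wrong, or why.
```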
When I compare AMD and nVidia OpenGL drivers, I think the AMD drivers are really strict and follow the specification really well. The nVidia drivers have this 'it always works' effect, which is a bit annoying: if we develop on nVidia and then test the software on AMD, we will hit problems on AMD when the actual issue is the lack of strictness of the nVidia drivers. I have run into several recent examples of this.
With a debug profile, nVidia would be free to keep this permissiveness and simply output a warning: "this is not standard but we support it because it's a cool feature". Compare with C++ compilers: almost all of them take their own liberties. Visual Studio has supported anonymous unions forever, which is a really cool feature, and even GCC has the 'pedantic' compiler parameter to be really strict (not set by default!) because sometimes specifications are a bit too rigid.
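As an illustration of that compiler analogy, the nameless struct/union idiom that many math libraries rely on compiles on Visual Studio, GCC and Clang, yet it is not strict ISO C++; GCC only complains once -pedantic is passed. A minimal sketch:

```cpp
// 'g++ -pedantic' warns here: ISO C++ prohibits anonymous structs
struct vec3
{
    union
    {
        struct { float x, y, z; }; // nameless struct: the non-standard part
        float data[3];
    };
};

int main()
{
    vec3 v;
    v.x = 1.0f; // accessed through the nameless struct
    v.y = 2.0f;
    v.z = 3.0f;
    return v.data[2] == 3.0f ? 0 : 1;
}
```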
Better debugging capabilities, better implementation freedom with comments about that freedom, and better documentation on good use of the drivers: it sounds great, but even if such a profile reaches OpenGL one day, it would probably take a long time for both AMD and nVidia to implement it properly...
One aspect of GL_EXT_shader_image_load_store defines how to handle the 'early depth test', an optimization that discards some fragment processing. On its side, AMD has released GL_AMD_conservative_depth, which actually implements a Direct3D 11 feature: it passes enough information to the OpenGL implementation so that even when the gl_FragDepth output is modified, early z-culling can still be performed... nice!
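A minimal fragment shader sketch of how GL_AMD_conservative_depth is used (shown here as a source string for glShaderSource, with illustrative outputs): by promising that gl_FragDepth will only move further away, the implementation can keep early depth testing enabled.

```cpp
char const* FragmentSource =
    "#version 330 core\n"
    "#extension GL_AMD_conservative_depth : enable\n"
    "layout(depth_greater) out float gl_FragDepth;\n" // the written depth will only increase
    "out vec4 Color;\n"
    "void main()\n"
    "{\n"
    "    Color = vec4(1.0);\n"
    "    gl_FragDepth = gl_FragCoord.z + 0.01;\n" // consistent with the depth_greater promise
    "}\n";
```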
When Direct3D 11 was released, it felt a bit ironic that Direct3D 11 brought the concept of display lists into its API while OpenGL deprecated it. I still use display lists because I find them really efficient at a software design level when working with 'macro state objects': I check whether the current macro state object is the one I need and, if not, I call the display list of the new macro state object. A very efficient and simple design.
Basically, I use display lists only for non-displayable things... The OpenGL API is quite a nightmare when it comes to knowing what can be placed in a display list and what can't. Having display lists deprecated actually seems good to me; we need a similar concept, but more generic.
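A sketch of that 'macro state object' pattern with display lists; the state bundles are illustrative, the point being that switching states costs a single glCallList:

```cpp
// Build each macro state object once, as a (non-displayable) bundle of state commands.
GLuint StateOpaque = glGenLists(1);
glNewList(StateOpaque, GL_COMPILE);
    glDisable(GL_BLEND);
    glEnable(GL_DEPTH_TEST);
    glDepthMask(GL_TRUE);
    glEnable(GL_CULL_FACE);
glEndList();

GLuint StateBlended = glGenLists(1);
glNewList(StateBlended, GL_COMPILE);
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE);
glEndList();

// At render time, only switch when the required macro state changes.
static GLuint CurrentState = 0;
GLuint RequiredState = StateBlended;
if(CurrentState != RequiredState)
{
    glCallList(RequiredState);
    CurrentState = RequiredState;
}
```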
In Direct3D 11, command lists are mainly used for multithreaded rendering: each thread builds a list of commands and, finally, the main graphics thread executes these lists.
With OpenGL display lists we can almost do something similar, except that I'm really not sure how an OpenGL developer could take care of the commands that display lists don't record...
Unfortunately, I haven't actually found any information that could make me think the ARB is considering improving OpenGL multithreaded rendering...
When GLSL was created, shaders were really simple: basically a vertex shader and a fragment shader linked into a program, a single function (main) in each, and we were done. As time goes by, they have become more and more complex, to the point that we now speak of uber shaders, shaders that can handle all cases.
The GLSL build system of OpenGL is quite poor because it was designed for those initial needs. Coming along with OpenGL 3.3, GL_ARB_shading_language_include (not implemented by any driver yet) tries to bring some improvements to this build system by managing the GLSL sources.
Whether or not the GLSL build system is a role of the OpenGL API seems like a good debate topic, but I don't really have an idea of where we are going on that side. It's too soon, still a research topic.
Some people are requesting shader binaries; I have even heard of a 'shader blob' idea, but I don't know anything about it. From my point of view, two simple things would be a great help. The first is an option like GCC's -Dmacro[=defn] to define symbols at build time. In a way, this is already possible through the glShaderSource function by using the first string to hold all our defines. It works... but how nice is it? (That's a real question actually!)
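The glShaderSource trick looks like this in practice; a sketch, where the define names and the ShaderSourceLoadedFromFile variable are purely illustrative:

```cpp
// The #version directive has to come first, so it lives in the prepended string,
// and the file itself must not declare one.
char const* Defines =
    "#version 330 core\n"
    "#define USE_DIFFUSE_MAP 1\n"
    "#define LIGHT_COUNT 4\n";

char const* Sources[2] = {Defines, ShaderSourceLoadedFromFile};

GLuint Shader = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(Shader, 2, Sources, NULL); // the strings are simply concatenated
glCompileShader(Shader);
```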
Something else that quite bothers me is how to share structure definitions between shader stages. Vertex shader output blocks need to match fragment shader input blocks; this is good and required. I would like a way to guarantee that the stages match, and this will become even more important with separate shader programs, where program linking can no longer tell us about varying variable mismatches.
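Concretely, the same block ends up being written twice, once per stage, and nothing but discipline keeps the two copies in sync; a sketch with illustrative members:

```cpp
char const* VertexBlock =
    "out Block\n"
    "{\n"
    "    vec3 Normal;\n"
    "    vec2 Texcoord;\n"
    "} Out;\n";

char const* FragmentBlock =   // must match the output block member for member
    "in Block\n"
    "{\n"
    "    vec3 Normal;\n"
    "    vec2 Texcoord;\n"
    "} In;\n";
```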
Should we have a proper compiler object that we could load and release to save memory (like on embedded systems)? Should we be able to get the binaries? Should we be able to configure the GLSL compiler for debugging or optimisations? Should we actually get the same (or a closer) level of features as C++ compilers?
Well, there is a lot to do and to study in this area. I'm quite curious to see how it will evolve, but I'm not sure we can expect anything on that side for OpenGL 3.4 or OpenGL 4.1.
Both AMD and nVidia seem really interested these days in the topic of giving processing information back from the OpenGL drivers. The first official example on that topic is obviously GL_ARB_timer_query, part of OpenGL 3.3 and just great because it allows real-time timing without stalling the rendering pipeline. Why only now? I don't know. In any case, this interest in giving developers more feedback is confirmed by multiple extension releases.
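A minimal GL_ARB_timer_query sketch: measure the GPU time of a block of commands and only read the result back once it is available, so the pipeline never stalls.

```cpp
GLuint Query = 0;
glGenQueries(1, &Query);

glBeginQuery(GL_TIME_ELAPSED, Query);
// ... the draw calls to measure ...
glEndQuery(GL_TIME_ELAPSED);

// Typically a frame or two later: check availability before reading.
GLint Available = 0;
glGetQueryObjectiv(Query, GL_QUERY_RESULT_AVAILABLE, &Available);
if(Available)
{
    GLuint64 TimeElapsed = 0; // in nanoseconds
    glGetQueryObjectui64v(Query, GL_QUERY_RESULT, &TimeElapsed);
}
```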
This is something I am 80% sure we will see in OpenGL 4.1. With their OpenGL 4.0 drivers, nVidia has released an extension called GL_EXT_vertex_attrib_64bit that defines 64-bit floating-point attributes. AMD is a contributor to this extension and it is quite a trivial change, beside the fact that a 3 or 4 component double vector counts as 2 generic vertex attributes.
nVidia has also released an extension called GL_NV_vertex_attrib_integer_64bit, which introduces u/int8_t, u/int16_t, u/int32_t and u/int64_t scalar and vector types in GLSL and an API to send 64-bit integers through attributes. I don't think the AMD Radeon 5000 series has support for such features, so we can't expect this to make it into the OpenGL specification before OpenGL 5.0, unless AMD chooses emulation just like double-float on the Radeon HD 5700 series.
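For the floating-point side, a hedged sketch of how a double-precision attribute would be set up with GL_EXT_vertex_attrib_64bit (the glVertexAttribLPointer entry point is shown in its core form, and BufferName is an illustrative buffer object):

```cpp
// GLSL side, for reference: layout(location = 0) in dvec4 Position;
// A dvec4 (or dvec3) input also counts against a second attribute location.

glBindBuffer(GL_ARRAY_BUFFER, BufferName);
glVertexAttribLPointer(0, 4, GL_DOUBLE, sizeof(GLdouble) * 4, 0);
glEnableVertexAttribArray(0);
// The next attribute should therefore start at location 2, not 1.
```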
Anyway, I'm not convinced yet of how useful 64-bit types are.
The most obvious OpenGL 4.0 features that I would expect to see in OpenGL 3.4 actually come from GLSL. I find it quite crazy, but in GLSL 4.00 the old explicit type casts have been replaced by implicit casts. There is no reason I can see for not having the same behaviour in GLSL 3.XX; it would actually increase shader code compatibility, which sounds good to me, even if I am more into the old strict type casting of GLSL 3.30. On one hand, with implicit casts there are always cases where we can miss a specific behaviour; on the other hand, with explicit casts the worst we can get is a compiler error...
GLSL 4.00 also adds a qualifier called precise, to make sure that the compiler is not going to apply some crazy optimisation which could, for example, produce cracks during tessellation. A 'precise' qualifier in GLSL 3.40 would be welcome too. Coming along with precise, the fma GLSL function allows to compute a MAD (MUL+ADD) operation as if it were a single instruction (which it is for half of the G80 cores, for example).
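A GLSL 4.00 sketch (again as a source string, with illustrative variable names): precise forbids re-associating the blend of the three control points, which is the kind of re-ordering that produces cracks along shared edges, while fma expresses the MUL+ADD pair explicitly.

```cpp
char const* Snippet =
    "#version 400 core\n"
    "in vec3 Coord;\n"
    "in vec4 P0, P1, P2;\n"
    "precise out vec4 Position;\n" // the expression assigned to it must not be re-ordered
    "void main()\n"
    "{\n"
    "    Position = Coord.x * P0 + Coord.y * P1 + Coord.z * P2;\n"
    "    float Mad = fma(Coord.x, Coord.y, Coord.z);\n" // a * b + c as one operation
    "}\n";
```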