During the first years of GLM development, I became interested in the efficiency of compiler-generated code in order to answer a simple question: is writing SSE code using intrinsics worth the effort? In other words, will it give a performance gain? For a long time the answer was an easy yes, because compilers did a poor job. Today's compilers do a much better job; however, they are still incapable of vectorizing the code...
One of the appeals of the SSE and AVX instruction sets is that they provide SIMD instructions, which process Multiple Data with a Single Instruction. However, these are only a subset of the SSE and AVX instructions, as all the Multiple Data instructions also come with a Single Data equivalent. It is these Single Data instructions that the compilers actually generate. Here is a code sample that we will use to generate the ASM code with GCC 4.4, GCC 4.8, ICC 13, VC12, Clang 3.0 and Clang 3.2.
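As a rough illustration of the kind of code being compared, here is a minimal sketch of a 4x4 matrix by vector product written two ways. The names mul_cpp and mul_inst_like are the ones used in the discussion; the vec4 and mat4 structs, the helper functions and the function bodies are assumptions made for this sketch, not GLM's actual types nor necessarily the exact listing that was compiled.

```cpp
#include <cassert>

// Hypothetical stand-ins for GLM types, kept as plain structs so the sketch is self-contained.
struct vec4 { float x, y, z, w; };
struct mat4 { vec4 col[4]; }; // column-major storage, one vec4 per column

// Plain C++: one scalar expression per output component.
vec4 mul_cpp(mat4 const& m, vec4 const& v)
{
    return vec4{
        m.col[0].x * v.x + m.col[1].x * v.y + m.col[2].x * v.z + m.col[3].x * v.w,
        m.col[0].y * v.x + m.col[1].y * v.y + m.col[2].y * v.z + m.col[3].y * v.w,
        m.col[0].z * v.x + m.col[1].z * v.y + m.col[2].z * v.z + m.col[3].z * v.w,
        m.col[0].w * v.x + m.col[1].w * v.y + m.col[2].w * v.z + m.col[3].w * v.w};
}

// Helpers that mimic what a packed multiply and a packed add would do on four floats.
static vec4 scale(vec4 const& c, float s) { return vec4{c.x * s, c.y * s, c.z * s, c.w * s}; }
static vec4 add(vec4 const& a, vec4 const& b) { return vec4{a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}; }
static bool equal(vec4 const& a, vec4 const& b) { return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w; }

// Same math, ordered the way hand-written SSE code would be:
// scale each column by one component of v, then sum the scaled columns pairwise.
vec4 mul_inst_like(mat4 const& m, vec4 const& v)
{
    vec4 const m0 = scale(m.col[0], v.x);
    vec4 const m1 = scale(m.col[1], v.y);
    vec4 const m2 = scale(m.col[2], v.z);
    vec4 const m3 = scale(m.col[3], v.w);
    return add(add(m0, m1), add(m2, m3));
}

// Small self-check on an input whose results are exactly representable as floats.
void self_check()
{
    mat4 const m = {{{1.f, 2.f, 3.f, 4.f}, {5.f, 6.f, 7.f, 8.f}, {9.f, 10.f, 11.f, 12.f}, {13.f, 14.f, 15.f, 16.f}}};
    vec4 const v = {1.f, 2.f, 3.f, 4.f};
    assert(equal(mul_cpp(m, v), mul_inst_like(m, v)));
}
```

The two functions add their terms in a different order, so on arbitrary inputs the results may differ in the last bits; the self-check uses small integer values to avoid that.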
Reading the generated assembly, we immediately see that no compiler is capable of generating vectorized instructions.
We can notice some useless mov instructions generated by some compilers. Also, Clang tries to interleave different instructions while ICC regroups identical instructions. The other compilers more or less interleave or regroup identical instructions, but in every case the compiler is capable of massively reordering the instructions, to the point that GCC 4.8 generates exactly the same assembly code for both mul_cpp and mul_inst_like. Yet it is still incapable of vectorizing the code.
It seems to me that being capable of such reordering shows how much compiler optimizations have focused on the result and on the dependencies of that result. With such a strategy based on ASTs, compilers can remove dead code and useless operations such as sequences of mov instructions. However, today's CPU performance is bound more by memory usage: how we maximize cache usage and how we reduce data movement and transfers. Two consequences: there is still a lot of room for compiler optimizations, and hand-writing code with intrinsics remains relevant.
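For comparison, here is a minimal sketch of what a hand-written intrinsic version of a 4x4 matrix by vector product could look like. The names mat4, mul_sse, mul_lowest_lane and self_check_sse are hypothetical and chosen for this illustration; they are not GLM's API. The intrinsics come from `<xmmintrin.h>`, so the sketch assumes an x86 target with SSE.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics, x86 / x86-64 only

// Hypothetical column-major matrix: one 128-bit register per column.
struct mat4 { __m128 col[4]; };

// Packed (Multiple Data) version: each _mm_mul_ps / _mm_add_ps works on four floats at once.
__m128 mul_sse(mat4 const& m, __m128 v)
{
    // Broadcast each component of v across a whole register.
    __m128 const vx = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 const vy = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 const vz = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 const vw = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));

    // Scale each column, then sum the four scaled columns pairwise.
    __m128 const a0 = _mm_add_ps(_mm_mul_ps(m.col[0], vx), _mm_mul_ps(m.col[1], vy));
    __m128 const a1 = _mm_add_ps(_mm_mul_ps(m.col[2], vz), _mm_mul_ps(m.col[3], vw));
    return _mm_add_ps(a0, a1);
}

// Scalar (Single Data) counterpart of _mm_mul_ps: only the lowest lane is multiplied,
// the three upper lanes are copied from a.
__m128 mul_lowest_lane(__m128 a, __m128 b)
{
    return _mm_mul_ss(a, b);
}

// Small self-check on an input whose results are exactly representable as floats.
void self_check_sse()
{
    mat4 const m = {{_mm_setr_ps(1.f, 2.f, 3.f, 4.f), _mm_setr_ps(5.f, 6.f, 7.f, 8.f),
                     _mm_setr_ps(9.f, 10.f, 11.f, 12.f), _mm_setr_ps(13.f, 14.f, 15.f, 16.f)}};
    float r[4];
    _mm_storeu_ps(r, mul_sse(m, _mm_setr_ps(1.f, 2.f, 3.f, 4.f)));
    assert(r[0] == 90.f && r[1] == 100.f && r[2] == 110.f && r[3] == 120.f);
}
```

Here mul_sse needs four packed multiplies and three packed adds, where a purely scalar version needs sixteen multiplies and twelve adds. That gap is the potential gain that the compilers leave on the table when they only emit the Single Data forms.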
There is ongoing research to resolve the issue of generating vectorized code. ISPC seems inspired by GPU architectures and it generates C++ source code using SIMD instruction sets on demand. Then Polly is a compiler optimizer that directly tackles the issues of memory access patterns. Finally, LLVM 3.3 is going to integrate a new optimizer called the SLP Vectorizer.
For GLM, what I would enjoy is figuring out an approach where I could avoid writing intrinsic code but still write my C++ code in a way that makes the compiler generate the SSE code I expect. Even if I have to look at the assembly code, such an approach would allow me to keep a single implementation of each operation, making it easier to maintain.
So far GLM provides dedicated simdVec4 and simdMat4 classes for SIMD optimizations. David Reid even contributed a SIMD version of GLM quaternions for GLM 0.9.5. It is obvious that using GLM to write very fast code is not a good idea, but this is no reason why GLM shouldn't be as fast as possible. Ideally it should be fast transparently, but for that the compilers will need to do a better job.
Since Xcode 4.1, we can display the assembly of a file using the menu "Product/Generate Output/Assembly File". However, with Clang the IDE will show the LLVM IR, which might be great for the compiler to use but which I find harder to read than old-fashioned CPU or GPU instructions. Fortunately, we can enable x86 assembly generation using the argument "-no-integrated-as". This argument can be set using the menu "Product/Scheme/Manage Schemes". Also, "-integrated-as" can be used to explicitly request LLVM IR.
I discovered a few months ago that many compilers can be used online. This is a very convenient idea that allows us to quickly test code on different compilers. Isn't it nice to be able to use VC12 on MacOS X? My favourite website is gcc.godbolt.org, which supports many versions of GCC but also Clang 3.0 and ICC 13. A great feature is that this website displays the generated ASM code. For Visual C++ 12 there is the great rise4fun.com/vcpp; however, it doesn't display the generated ASM code, only the compiler errors.