GLM transformation functions take as their first parameter a matrix to be transformed. They could instead drop that parameter and simply build a transformation matrix, leaving it to the caller to multiply it with that matrix afterwards. However, because transformation matrices are filled with a lot of zeros, a dedicated implementation could be a lot more efficient than a full matrix product.
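To illustrate the idea, here is a minimal sketch of the two approaches for a translation. It is not GLM's actual code: `Mat4`, `identity`, `multiply`, `translate_generic`, `translate_dedicated` and `nearly_equal` are hypothetical names made up for this example, and a column-major layout is assumed.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for a 4x4 matrix (not GLM's own type).
// Assumed layout: column-major, accessed as col[column][row].
struct Mat4 {
    std::array<std::array<float, 4>, 4> col{};
};

Mat4 identity() {
    Mat4 m;
    for (std::size_t i = 0; i < 4; ++i) m.col[i][i] = 1.0f;
    return m;
}

// General matrix product a * b: 64 multiplications.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r;
    for (std::size_t c = 0; c < 4; ++c) {
        for (std::size_t row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.col[k][row] * b.col[c][k];
            }
            r.col[c][row] = sum;
        }
    }
    return r;
}

// Generic path: build a full translation matrix, then multiply.
// Most of the work is spent multiplying by 0 and 1.
Mat4 translate_generic(const Mat4& m, float x, float y, float z) {
    Mat4 t = identity();
    t.col[3] = {x, y, z, 1.0f};
    return multiply(m, t);
}

// Dedicated path: the first three columns of m * T are unchanged,
// so only the last column is computed (12 multiplications).
Mat4 translate_dedicated(const Mat4& m, float x, float y, float z) {
    Mat4 r = m;
    for (std::size_t row = 0; row < 4; ++row) {
        r.col[3][row] = m.col[0][row] * x
                      + m.col[1][row] * y
                      + m.col[2][row] * z
                      + m.col[3][row];
    }
    return r;
}

// Helper for comparing the two results within a small tolerance.
bool nearly_equal(const Mat4& a, const Mat4& b, float eps = 1e-5f) {
    for (std::size_t c = 0; c < 4; ++c)
        for (std::size_t row = 0; row < 4; ++row)
            if (std::fabs(a.col[c][row] - b.col[c][row]) > eps) return false;
    return true;
}
```

Both functions return the same matrix for the same input; the dedicated version just skips every term that is known in advance to be multiplied by zero or one. The same reasoning applies to scale and, with more terms left over, to rotate.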
I went through all these optimisations and the results are as expected: rotate went from ~900 cycles to ~675 cycles, translate from ~459 cycles to ~153 cycles and scale from ~432 cycles to ~126 cycles. All of this on a Q6600, using the FPU!
Finally, I wrote the code specifically so that the compiler can easily optimize it for SIMD instructions, but obviously my next step is to write a SIMD version.
Available in the next release!