A while ago, I talked about my progress on GLM 0.8.4.0 development, where I used SSE instructions to optimize some of the code.
I published the source code of a fast 4 by 4 matrix inverse and product at DevMaster.net.
The results seem really interesting to me: 162 CPU cycles instead of 918 on a Core 2 Q6600 for the inverse, and 63 CPU cycles instead of 378 for the matrix product.
That is a speedup of about 5.67 times for the matrix inverse and 6 times for the matrix product. Not bad?
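
For readers who have not seen this kind of code before, here is a minimal sketch of the idea behind an SSE matrix product. It is not the code published at DevMaster.net and not GLM's implementation: `Mat4`, `mul_sse` and `mul_scalar` are hypothetical names, and the sketch assumes a column-major, 16-byte aligned matrix of floats on an x86 CPU with SSE. Each column of the result is built from four multiplies and three adds on whole columns of the left matrix, instead of sixteen separate scalar dot products.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86 / x86-64 only)

// Hypothetical 4x4 float matrix for illustration, stored column-major:
// element (row r, column c) lives at m[c * 4 + r].
// alignas(16) lets us use the aligned load/store intrinsics.
struct alignas(16) Mat4 {
    float m[16];
};

// SSE product: out = a * b.
// Column c of the result is a0*b[0][c] + a1*b[1][c] + a2*b[2][c] + a3*b[3][c],
// where a0..a3 are the columns of a, each held in one 128-bit register.
inline Mat4 mul_sse(const Mat4& a, const Mat4& b) {
    const __m128 a0 = _mm_load_ps(a.m + 0);
    const __m128 a1 = _mm_load_ps(a.m + 4);
    const __m128 a2 = _mm_load_ps(a.m + 8);
    const __m128 a3 = _mm_load_ps(a.m + 12);

    Mat4 out;
    for (int c = 0; c < 4; ++c) {
        const float* bc = b.m + c * 4; // column c of b
        // Broadcast each scalar of b's column and scale the matching column of a.
        __m128 r = _mm_mul_ps(a0, _mm_set1_ps(bc[0]));
        r = _mm_add_ps(r, _mm_mul_ps(a1, _mm_set1_ps(bc[1])));
        r = _mm_add_ps(r, _mm_mul_ps(a2, _mm_set1_ps(bc[2])));
        r = _mm_add_ps(r, _mm_mul_ps(a3, _mm_set1_ps(bc[3])));
        _mm_store_ps(out.m + c * 4, r);
    }
    return out;
}

// Plain scalar product, used only as a reference to check the SSE version.
inline Mat4 mul_scalar(const Mat4& a, const Mat4& b) {
    Mat4 out;
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a.m[k * 4 + r] * b.m[c * 4 + k];
            }
            out.m[c * 4 + r] = sum;
        }
    }
    return out;
}
```

The cycle counts quoted above come from my own measurements of the published code, not from this sketch, so expect different numbers if you time it yourself.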