[sldev] Optimization target: Avatar skinning, LLViewerJointMesh,
matrix multiply
Dzonatas
dzonatas at dzonux.net
Sun Apr 15 20:45:05 PDT 2007
James Cook wrote:
> This looks great. On what system are you running it? What headers do
> you include to get the _MM_TRANSPOSE4_PS macro?
>
#include <xmmintrin.h>
Also note that GCC 4.2 has a faster _MM_TRANSPOSE4_PS than earlier
versions. I used GCC 4.1.2 on coLinux, which runs paravirtualized
alongside XP on a 1.8 GHz P4.
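For anyone who has not used the macro before, a minimal sketch of how it
is typically called follows. The helper name transpose4x4 and the flat
row-major float[16] layout are assumptions made for illustration; this is
not code from the viewer or from the benchmark.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics, including _MM_TRANSPOSE4_PS

// Hypothetical helper (not from the viewer source): transposes a
// row-major 4x4 float matrix. Unaligned loads and stores are used so
// the caller does not need 16-byte aligned buffers.
void transpose4x4(const float in[16], float out[16])
{
    assert(in != nullptr && out != nullptr);

    // Load the four rows into SSE registers.
    __m128 r0 = _mm_loadu_ps(in + 0);
    __m128 r1 = _mm_loadu_ps(in + 4);
    __m128 r2 = _mm_loadu_ps(in + 8);
    __m128 r3 = _mm_loadu_ps(in + 12);

    // The macro rewrites its four arguments in place:
    // afterwards r0 holds the old first column, r1 the second, etc.
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    _mm_storeu_ps(out + 0, r0);
    _mm_storeu_ps(out + 4, r1);
    _mm_storeu_ps(out + 8, r2);
    _mm_storeu_ps(out + 12, r3);
}
```

The macro takes four __m128 lvalues, so the rows have to be in named
variables rather than expressions.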
> I'm surprised that Blend DS1 is faster than the others, even including
> recopying all the vector3 data to vector4s!
>
With all optimizations turned off and debug on, BlendDS1 is slightly
slower than Blend2. With inlining and optimizations enabled, the
temporary vectors are optimized away. The recopy wasn't needed; I put it
in there just in case, since you stated further on that it needs to be a
packed vector3.
> Unfortunately in the LLViewerJointMesh::updateGeometry() function I
> believe we are both reading from and writing to packed arrays of
> vector3s with other data interleaved for OpenGL. So writing out 4
> floats to get the 3 we want (which is what I think _mm_storeu_ps does)
> will obliterate the other data.
>
>
This is one area to look into for further optimization of the outer
loop. I originally had o.setVec(j.v[0],j.v[1],j.v[2]) instead of
_mm_storeu_ps(). It didn't make much difference either way with an
unaligned store.
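If overwriting the fourth float does turn out to matter, one possible way
to write only three floats from an SSE register is to split the store in
two. This is a sketch under that assumption, not what the benchmark does;
store3 is a hypothetical helper name, and its speed relative to setVec()
or _mm_storeu_ps() has not been measured here.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: writes only lanes 0..2 (x, y, z) of v to
// dst[0..2] and leaves dst[3] untouched. No alignment is required.
inline void store3(float* dst, __m128 v)
{
    assert(dst != nullptr);

    // Store the low two lanes (x, y) as one 64-bit write.
    _mm_storel_pi(reinterpret_cast<__m64*>(dst), v);

    // Move the high two lanes down so z sits in lane 0, then store
    // that single float.
    _mm_store_ss(dst + 2, _mm_movehl_ps(v, v));
}
```

This costs two stores and a shuffle per vertex instead of one store, so
it only makes sense where the neighbouring data really must be preserved.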
> In blendDS1 is it faster to copy through memory backwards (VERTEX_COUNT
> -> 0) than forwards (0 -> VERTEX_COUNT)?
>
Counting down *generally* lets the compiler produce faster loop code
more easily; it is not that the copy itself is faster.
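To make the two loop directions concrete, here is a minimal sketch. Vec3,
Vec4, copyForward and copyBackward are hypothetical names, and setting w
to 1.0f is an assumption; none of this is taken from the benchmark. Both
functions produce the same result. Whether the countdown form actually
compiles to faster code depends on the compiler and flags, so it needs
to be measured.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };     // packed 3-float vector (assumed layout)
struct Vec4 { float x, y, z, w; };  // padded 4-float vector (assumed layout)

// Forward copy: 0 -> count. The loop test compares i against count.
void copyForward(const Vec3* src, Vec4* dst, std::size_t count)
{
    assert(count == 0 || (src != nullptr && dst != nullptr));
    for (std::size_t i = 0; i < count; ++i)
    {
        dst[i] = Vec4{src[i].x, src[i].y, src[i].z, 1.0f};
    }
}

// Backward copy: count -> 0. The loop test compares against zero,
// which a compiler can often fold into the decrement itself.
void copyBackward(const Vec3* src, Vec4* dst, std::size_t count)
{
    assert(count == 0 || (src != nullptr && dst != nullptr));
    for (std::size_t i = count; i-- > 0; )
    {
        dst[i] = Vec4{src[i].x, src[i].y, src[i].z, 1.0f};
    }
}
```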
> I'll play with this a little bit tomorrow.
>
> This rocks!
>
> James
>
Awesome. I'm now working on upgrading OSLCC to use GCC 4.2 instead of
GCC 3.4. ;-)