[sldev] Optimization target: Avatar skinning, LLViewerJointMesh, matrix multiply

Dzonatas dzonatas at dzonux.net
Sun Apr 15 20:45:05 PDT 2007


James Cook wrote:
> This looks great.  On what system are you running it?  What headers do
> you include to get the _MM_TRANSPOSE4_PS macro?
>   
#include <xmmintrin.h>

Also note, GCC 4.2 has a faster _MM_TRANSPOSE4_PS than earlier versions.
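
For anyone who hasn't used the macro, here is a minimal sketch of it on a
plain row-major float[16]. The helper name transpose4x4 is made up for
illustration and is not taken from the viewer source or from my test code.
It uses unaligned loads and stores because nothing here guarantees 16-byte
alignment.

```cpp
#include <xmmintrin.h>  // SSE intrinsics, including _MM_TRANSPOSE4_PS

// Hypothetical helper: transpose a row-major 4x4 float matrix in place.
inline void transpose4x4(float m[16])
{
    // Load the four rows; loadu makes no alignment assumption.
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    // The macro rewrites its four arguments, so they must be lvalues.
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    // Write the transposed rows back.
    _mm_storeu_ps(m + 0, r0);
    _mm_storeu_ps(m + 4, r1);
    _mm_storeu_ps(m + 8, r2);
    _mm_storeu_ps(m + 12, r3);
}
```

If the matrix is known to be 16-byte aligned, _mm_load_ps and _mm_store_ps
can replace the unaligned variants.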

I used GCC 4.1.2 on coLinux, which runs paravirtualized with XP on a 1.8 GHz P4.

> I'm surprised that Blend DS1 is faster than the others, even including
> recopying all the vector3 data to vector4s!
>   
With all optimizations off and debug on, BlendDS1 is slightly slower
than Blend2. With inlining and optimizations on, the temporary vectors
are optimized away. The recopy wasn't needed; I put it in there just in
case, since you note further down that the data needs to stay packed
vector3s.
> Unfortunately in the LLViewerJointMesh::updateGeometry() function I
> believe we are both reading from and writing to packed arrays of
> vector3s with other data interleaved for OpenGL.  So writing out 4
> floats to get the 3 we want (which is what I think _mm_storeu_ps does)
> will obliterate the other data.
>
>   
This is one area to look into for further optimization of the outer 
loop. I originally had o.setVec(j.v[0],j.v[1],j.v[2]) instead of 
_mm_storeu_ps(). It didn't make much difference either way with an 
unaligned store.
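
If the 4-float store does turn out to clobber interleaved data, one
possible workaround is to write only three lanes. This is a sketch only:
store3 is a hypothetical helper name, and I have not benchmarked it
against setVec() or _mm_storeu_ps().

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: write lanes 0..2 of v to dst[0..2] and leave
// dst[3] untouched, so whatever follows the packed vector3 survives.
inline void store3(float* dst, __m128 v)
{
    // Lanes 0 and 1: one 64-bit store of the low half (no alignment needed).
    _mm_storel_pi(reinterpret_cast<__m64*>(dst), v);

    // Lane 2: move the high half down to lane 0, then store one float.
    _mm_store_ss(dst + 2, _mm_movehl_ps(v, v));
}
```

This costs two stores and a shuffle per vertex instead of one store, so
whether it beats the scalar setVec() would need measuring.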
> In blendDS1 is it faster to copy through memory backwards (VERTEX_COUNT
> -> 0) than forwards (0 -> VERTEX_COUNT)?
>   
It *generally* optimizes more easily into faster code; it's not that the
copy itself is faster.
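
To show the loop shape I mean, here is a rough sketch of a count-down
copy from packed vector3s into vector4s. Vec3, Vec4 and copyBackward are
stand-in names, not the viewer's types, and setting w to 1.0f is just an
assumption for the example. My guess at why this form tends to compile
well is that the loop test compares against zero, but that depends on
the compiler and flags.

```cpp
#include <cstddef>

// Stand-in types for illustration only.
struct Vec3 { float v[3]; };
struct Vec4 { float v[4]; };

// Copy count packed vector3s into vector4s, walking from the last
// element down to index 0.
inline void copyBackward(const Vec3* src, Vec4* dst, std::size_t count)
{
    // i-- > 0 tests before decrementing, so the body sees count-1 .. 0
    // and the loop is safe for count == 0 with an unsigned index.
    for (std::size_t i = count; i-- > 0; )
    {
        dst[i].v[0] = src[i].v[0];
        dst[i].v[1] = src[i].v[1];
        dst[i].v[2] = src[i].v[2];
        dst[i].v[3] = 1.0f;  // assumed w component
    }
}
```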
> I'll play with this a little bit tomorrow.
>
> This rocks!
>
> James
>
Awesome. I'm now working on upgrading OSLCC to use GCC 4.2 instead of
GCC 3.4. ;-)
