[sldev] Optimization target: Avatar skinning, LLViewerJointMesh, matrix multiply

Dzonatas dzonatas at dzonux.net
Sun Apr 15 16:37:55 PDT 2007


Continuing with the interleaving: for generic SSE there is the 
_MM_TRANSPOSE4_PS() macro.

[note: SSE3 has horizontal add instructions, so we wouldn't have to 
interleave on SSE3.]
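
For comparison, here is a minimal sketch of what the SSE3 route might look 
like. It is not code from this thread: rowSumsSSE3() and rowSums() are 
hypothetical helpers, and the target attribute assumes GCC or Clang on x86 
(building with -msse3 should work as well). Three _mm_hadd_ps() calls reduce 
four rows to their four horizontal sums, with no transpose needed.

```cpp
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps (also pulls in the SSE headers)
#include <cassert>

// Hypothetical helper, not from the thread: returns
// { sum(r0), sum(r1), sum(r2), sum(r3) } using three horizontal adds.
// The target attribute (a GCC/Clang extension) enables SSE3 for this
// function only.
__attribute__((target("sse3")))
__m128 rowSumsSSE3(__m128 r0, __m128 r1, __m128 r2, __m128 r3)
{
    // hadd(a, b) = { a0+a1, a2+a3, b0+b1, b2+b3 }
    __m128 s01 = _mm_hadd_ps(r0, r1);
    __m128 s23 = _mm_hadd_ps(r2, r3);
    // A second level of hadd finishes all four row sums at once.
    return _mm_hadd_ps(s01, s23);
}

// Convenience wrapper for plain float arrays (also hypothetical).
// Unaligned loads and stores are used so any float storage works.
void rowSums(const float m[4][4], float out[4])
{
    assert(m != nullptr && out != nullptr);
    __m128 s = rowSumsSSE3(_mm_loadu_ps(m[0]), _mm_loadu_ps(m[1]),
                           _mm_loadu_ps(m[2]), _mm_loadu_ps(m[3]));
    _mm_storeu_ps(out, s);
}
```

Calling rowSums() on the rows {1,2,3,4}, {5,6,7,8}, {9,10,11,12}, 
{13,14,15,16} fills out with 10, 26, 42, 58.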

The whole operation is fast - except for the final vector store back 
into the LLVector3.

Here is the new multiplyDS1() with the reference code below:

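void multiplyDS1(const DSVector3& a, const DSMatrix4& b, LLVector3& o)
{
    DSMatrix4 j;
    // Component-wise products of the vertex with the first three matrix rows.
    j.v[0] = a.v * b.v[0];
    j.v[1] = a.v * b.v[1];
    j.v[2] = a.v * b.v[2];
    j.v[3] = b.v[3];
    // Interleave, so that summing the rows adds each original row horizontally.
    _MM_TRANSPOSE4_PS(j.v[0], j.v[1], j.v[2], j.v[3]);
    j.v[0] += j.v[1] + j.v[2] + j.v[3];
    // LLVector3 holds only three floats: store x and y, then z, instead of
    // writing all four lanes past the end of o.
    _mm_storel_pi((__m64*)&o, j.v[0]);
    _mm_store_ss((float*)&o + 2, _mm_movehl_ps(j.v[0], j.v[0]));
}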

Since it is that last store that is the bottleneck here, the next step 
would be to optimize the outer loop from here for whatever calls this 
generic vector/matrix multiply.
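
One possible direction for that outer loop, sketched under assumptions that 
are not from this thread: if the output buffer is padded to four floats per 
vertex and 16-byte aligned, each result can be written with a single aligned 
_mm_store_ps(), and the copy into packed three-float vectors can happen in a 
separate pass. PaddedVec4, Vec3, transformAll() and unpack() are hypothetical 
names, and the matrix is assumed to be laid out so that each output component 
is the dot product of the vertex with one row. Whether this is actually 
faster would have to be measured.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _MM_TRANSPOSE4_PS
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical padded vertex: x, y, z plus one extra lane, 16-byte aligned.
struct alignas(16) PaddedVec4 { float f[4]; };

// Hypothetical packed vertex standing in for LLVector3.
struct Vec3 { float mV[3]; };

// For every vertex, out[i].f[k] = dot(in[i], rows[k]), written with one
// aligned 16-byte store per vertex.
void transformAll(const PaddedVec4* in, const PaddedVec4 rows[4],
                  PaddedVec4* out, std::size_t count)
{
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    // Load the matrix rows once, outside the vertex loop.
    const __m128 r0 = _mm_load_ps(rows[0].f);
    const __m128 r1 = _mm_load_ps(rows[1].f);
    const __m128 r2 = _mm_load_ps(rows[2].f);
    const __m128 r3 = _mm_load_ps(rows[3].f);

    for (std::size_t i = 0; i < count; ++i)
    {
        const __m128 a = _mm_load_ps(in[i].f);
        __m128 p0 = _mm_mul_ps(a, r0);
        __m128 p1 = _mm_mul_ps(a, r1);
        __m128 p2 = _mm_mul_ps(a, r2);
        __m128 p3 = _mm_mul_ps(a, r3);
        // After the transpose, adding the four registers sums each
        // original row horizontally.
        _MM_TRANSPOSE4_PS(p0, p1, p2, p3);
        const __m128 sum = _mm_add_ps(_mm_add_ps(p0, p1), _mm_add_ps(p2, p3));
        _mm_store_ps(out[i].f, sum);
    }
}

// Separate pass: copy the three used lanes into the packed vectors.
void unpack(const PaddedVec4* in, Vec3* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        out[i].mV[0] = in[i].f[0];
        out[i].mV[1] = in[i].f[1];
        out[i].mV[2] = in[i].f[2];
    }
}
```

The idea is to call transformAll() once per blend into a scratch array of 
PaddedVec4, then unpack() once into the final vertex array, so the inner loop 
never stores into a 12-byte vector.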

Enjoy

Dzonatas wrote:
> James Cook wrote:
>> This time, with actual attachment.  :-P
> I ran some tests. It appears there is a bottleneck with o.setVec() and 
> the addition.
>
> Here are the tests of my code with the commented-out o.setVec() and 
> addition step (under DS1).
>
> Initializing
> 0.06 sec
> Blend DS1
> 1.45 sec
> Blend 2
> 3.31 sec
> Blend 3
> 4.06 sec
> Blend 4
> 3.33 sec
>
> While not complete with interleaving, I thought the result of the 
> multiplication alone was interesting. I expected it to be faster.
>
> Here is the code:
>
>
> typedef float v4sf __attribute__ ((vector_size (16)));
>
> struct DSMatrix4
> {
>    union {
>    v4sf v[4];
>    float mMatrix[4][4];
>    };
> } __attribute__ ((aligned (16)));
>
> union DSVector3
> {
>    float unit[4];
>    v4sf v;
> } __attribute__ ((aligned (16)));
>
>
> void multiplyDS1(const DSVector3& a, const DSMatrix4& b, LLVector3& o)
> {
>    DSMatrix4 j;
>    j.v[0] = a.v * b.v[0];
>    j.v[1] = a.v * b.v[1];
>    j.v[2] = a.v * b.v[2];
> //    o.setVec(j.mMatrix[VX][VX] + j.mMatrix[VY][VX] + j.mMatrix[VZ][VX] + b.mMatrix[VW][VX],
> //             j.mMatrix[VX][VY] + j.mMatrix[VY][VY] + j.mMatrix[VZ][VY] + b.mMatrix[VW][VY],
> //             j.mMatrix[VX][VZ] + j.mMatrix[VY][VZ] + j.mMatrix[VZ][VZ] + b.mMatrix[VW][VZ]);
> }
>
> void blendDS1(LLVector3* in, LLVector3* out)
> {
>    extern void randomize_floats(float*);
>    DSMatrix4 blend;
>    DSVector3 DSin[VERTEX_COUNT+1];
>    for( int k = VERTEX_COUNT; --k>=0;)
>        DSin[k].unit[VX] = in[k].mV[VX],
>        DSin[k].unit[VY] = in[k].mV[VY],
>        DSin[k].unit[VZ] = in[k].mV[VZ];
>    randomize_floats(&blend.mMatrix[0][0]);
>    for (int loop = 0; loop < LOOP_COUNT; loop++)
>    {
>        for (int i = 0; i < VERTEX_COUNT; i++)
>        {
>            multiplyDS1(DSin[i], blend, out[i]);
>        }
>    }
> }
>
