[sldev] Optimization target: Avatar skinning, LLViewerJointMesh, matrix multiply
Dzonatas
dzonatas at dzonux.net
Sun Apr 15 16:37:55 PDT 2007
Continuing with the interleaving: generic SSE provides the
_MM_TRANSPOSE4_PS() macro (in xmmintrin.h), which transposes four
4-float rows in place.
[note: SSE3 has horizontal add instructions, so we wouldn't have to
interleave on SSE3.]
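
To make the interleave-then-add idea and the SSE3 note concrete, here is a minimal sketch. It is not taken from the viewer source: the helper names row_sums_transpose and row_sums_hadd are made up for illustration, and the target("sse3") attribute assumes GCC or Clang on x86. Both helpers compute the horizontal sum of each row of a row-major 4x4 float matrix, first with plain SSE (transpose, then vertical adds) and then with SSE3 horizontal adds and no transpose.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _MM_TRANSPOSE4_PS
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

// Hypothetical helper: out[i] = sum of the four floats in row i of m.
// Plain SSE has no horizontal add, so interleave (transpose) first and
// then use ordinary vertical adds.
void row_sums_transpose(const float* m, float* out)
{
    assert(m != nullptr && out != nullptr);
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // rows become columns
    __m128 sum = _mm_add_ps(_mm_add_ps(r0, r1), _mm_add_ps(r2, r3));
    _mm_storeu_ps(out, sum);            // lane i = sum of original row i
}

// Same result with SSE3: two levels of horizontal adds, no transpose.
// The target attribute is a GCC/Clang assumption; it lets this one
// function use SSE3 without building the whole file with -msse3.
__attribute__((target("sse3")))
void row_sums_hadd(const float* m, float* out)
{
    assert(m != nullptr && out != nullptr);
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);
    __m128 h01 = _mm_hadd_ps(r0, r1);   // pair sums of rows 0 and 1
    __m128 h23 = _mm_hadd_ps(r2, r3);   // pair sums of rows 2 and 3
    _mm_storeu_ps(out, _mm_hadd_ps(h01, h23));
}
```

Whether the three horizontal adds actually beat the transpose depends on the CPU, so this would need timing before drawing any conclusion.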
The whole operation is fast, except for the final vector store back
into the LLVector3.
Here is the new multiplyDS1(); the earlier version, for reference, is in
the quoted message below:
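void multiplyDS1(const DSVector3& a, const DSMatrix4& b, LLVector3& o)
{
    DSMatrix4 j;
    // Lane-wise products of the input vector with matrix rows 0..2.
    // Note that a.v's fourth lane (unit[3]) takes part in these products.
    j.v[0] = a.v * b.v[0];
    j.v[1] = a.v * b.v[1];
    j.v[2] = a.v * b.v[2];
    j.v[3] = b.v[3];
    // Interleave so that the lanes of each row end up in separate vectors.
    _MM_TRANSPOSE4_PS(j.v[0], j.v[1], j.v[2], j.v[3]);
    // Vertical adds: lane i now holds the sum across original row i.
    j.v[0] += j.v[1] + j.v[2] + j.v[3];
    // Unaligned 4-float store. If LLVector3 holds only 3 floats, the
    // fourth lane is written past the end of o.
    _mm_storeu_ps((float*)&o, j.v[0]);
}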
Since that last store is the bottleneck, the next step would be to
optimize the outer loop of whatever calls this generic vector/matrix
multiply.
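
One possible direction for that outer loop, as a rough sketch under stated assumptions rather than a drop-in replacement: Vec3 is a hypothetical stand-in for LLVector3 (assumed to be three packed floats), and store3 and transform_all are illustrative names. It swaps the transpose for a broadcast multiply-add, which is a different technique from multiplyDS1() above, and assumes the intended result is the usual row-vector transform o[c] = a.x*m[0][c] + a.y*m[1][c] + a.z*m[2][c] + m[3][c]. The matrix rows are loaded once outside the loop, and the store writes exactly three floats so nothing spills into the next element.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical stand-in for LLVector3: three packed floats (12 bytes).
struct Vec3 { float mV[3]; };

// Store only lanes 0..2 of v, so nothing is written past the third float.
inline void store3(Vec3& o, __m128 v)
{
    _mm_storel_pi(reinterpret_cast<__m64*>(o.mV), v);  // lanes 0 and 1
    _mm_store_ss(o.mV + 2, _mm_movehl_ps(v, v));       // lane 2
}

// Transform count vectors by a row-major 4x4 matrix m (16 floats),
// treating each input as (x, y, z, 1). Assumed convention:
//   out[c] = x*m[0][c] + y*m[1][c] + z*m[2][c] + m[3][c]
void transform_all(const Vec3* in, Vec3* out, int count, const float* m)
{
    assert(in != nullptr && out != nullptr && m != nullptr);
    // Hoist the matrix rows out of the loop: loaded once, kept in registers.
    const __m128 r0 = _mm_loadu_ps(m + 0);
    const __m128 r1 = _mm_loadu_ps(m + 4);
    const __m128 r2 = _mm_loadu_ps(m + 8);
    const __m128 r3 = _mm_loadu_ps(m + 12);
    for (int i = 0; i < count; ++i)
    {
        // Broadcast each component across all lanes and accumulate;
        // no transpose or horizontal add is needed in this formulation.
        __m128 acc = r3;
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[i].mV[0]), r0));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[i].mV[1]), r1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(in[i].mV[2]), r2));
        store3(out[i], acc);
    }
}
```

The two-part store costs an extra shuffle per vertex, so it is not obviously faster than the single _mm_storeu_ps; it would have to be timed in the same harness as the other Blend variants.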
Enjoy
Dzonatas wrote:
> James Cook wrote:
>> This time, with actual attachment. :-P
> I ran some tests. It appears there is a bottleneck with o.setVec() and
> the addition.
>
> Here are the tests of my code with the commented-out o.setVec() and
> addition step (under DS1).
>
> Initializing
> 0.06 sec
> Blend DS1
> 1.45 sec
> Blend 2
> 3.31 sec
> Blend 3
> 4.06 sec
> Blend 4
> 3.33 sec
>
> While not complete with interleaving, I thought the result of the
> multiplication alone was interesting. I expected it to be faster.
>
> Here is the code:
>
>
> typedef float v4sf __attribute__ ((vector_size (16)));
>
> struct DSMatrix4
> {
>     union {
>         v4sf v[4];
>         float mMatrix[4][4];
>     };
> } __attribute__ ((aligned (16)));
>
> union DSVector3
> {
>     float unit[4];
>     v4sf v;
> } __attribute__ ((aligned (16)));
>
>
> void multiplyDS1(const DSVector3& a, const DSMatrix4& b, LLVector3& o)
> {
>     DSMatrix4 j;
>     j.v[0] = a.v * b.v[0];
>     j.v[1] = a.v * b.v[1];
>     j.v[2] = a.v * b.v[2];
>     // o.setVec(j.mMatrix[VX][VX] + j.mMatrix[VY][VX] + j.mMatrix[VZ][VX] + b.mMatrix[VW][VX],
>     //          j.mMatrix[VX][VY] + j.mMatrix[VY][VY] + j.mMatrix[VZ][VY] + b.mMatrix[VW][VY],
>     //          j.mMatrix[VX][VZ] + j.mMatrix[VY][VZ] + j.mMatrix[VZ][VZ] + b.mMatrix[VW][VZ]);
> }
>
> void blendDS1(LLVector3* in, LLVector3* out)
> {
>     extern void randomize_floats(float*);
>     DSMatrix4 blend;
>     DSVector3 DSin[VERTEX_COUNT+1];
>     for( int k = VERTEX_COUNT; --k>=0;)
>         DSin[k].unit[VX] = in[k].mV[VX],
>         DSin[k].unit[VY] = in[k].mV[VY],
>         DSin[k].unit[VZ] = in[k].mV[VZ];
>     randomize_floats(&blend.mMatrix[0][0]);
>     for (int loop = 0; loop < LOOP_COUNT; loop++)
>     {
>         for (int i = 0; i < VERTEX_COUNT; i++)
>         {
>             multiplyDS1(DSin[i], blend, out[i]);
>         }
>     }
> }
>