[sldev] getting serious about software.

Dirk Moerenhout blakar at gmail.com
Sat Jun 23 02:53:52 PDT 2007


Your reference seems incomplete. I assume what you wanted to tell us
is that in the meantime you've learned that the compiler translates
temp/2 or temp>>1 into a single instruction (SAR on Intel) which
handles the sign correctly (as long as you correctly typed your
variable as being signed)?
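
One detail worth keeping in mind here: for signed values the two
spellings are not quite interchangeable. Division rounds toward zero,
while an arithmetic right shift rounds toward negative infinity, so a
compiler has to add a small sign fix-up around the shift when you write
temp/2. A minimal sketch of the difference follows; the function names
are made up for illustration, and the shift result assumes a compiler
that shifts signed values arithmetically (mainstream compilers do, and
C++20 requires it).

```cpp
#include <cassert>

// Signed division by two: rounds toward zero.
int half_by_divide(int temp) { return temp / 2; }

// Right shift by one: rounds toward negative infinity when the shift is
// arithmetic (assumed here; implementation-defined before C++20).
int half_by_shift(int temp) { return temp >> 1; }

// Hypothetical self-check showing where the two agree and where they differ.
void check_halving()
{
    assert(half_by_divide(7) == 3);
    assert(half_by_shift(7) == 3);    // same for non-negative values
    assert(half_by_divide(-6) == -3);
    assert(half_by_shift(-6) == -3);  // same for even negative values
    assert(half_by_divide(-7) == -3);
    assert(half_by_shift(-7) == -4);  // differs for odd negative values
}
```

So "code what you mean" also matters for correctness: pick the
expression whose rounding you actually want and let the compiler choose
the instructions.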

The issue with your attempted optimisation is that you are trying to
replace integer math by hand, and that is where you'll hardly ever be
able to beat the compiler. The magic is in FP math and rarely in
integer math.

I'll give a fun example from the LL code:

This is the current code for finding the next power of two:
F32 next_power_of_two(F32 value)
{
	S32 power = (S32)llceil((F32)log((double)value)/(F32)log(2.0));
	return pow(2.0f, power);
}

Note that it is indeed a mathematically sane way to do it.

This is my version, exploiting what we know of IEEE:
inline F32 next_power_of_two(F32 value)
{
	union
	{
		F32 result;
		S32 iresult;
	};
	if (value <= 0) return 0;
	result=value;
	iresult&= 0x7F800000;
	iresult+= 0x00800000;
	return result;
}

What it does is rather simple: IEEE has you multiply 1.xxxxxx by a
power of 2 to represent numbers. In such a format the only way to
store an exact power of 2 is as 1.0 times that power. We hence first
clear all the bits of the fraction so we get 1.0 times the current
power of 2. Then we add 1 to the exponent so we multiply by the next
power of 2. Et voila, you're done :)
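
For anyone who wants to try the same bit manipulation without relying
on reading the inactive member of a union, here is a rough sketch that
uses memcpy instead. It is an illustration, not the LL code: it uses
plain float and std::uint32_t rather than F32/S32, the function name is
made up, and it assumes 32-bit IEEE floats and finite, normal inputs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch only: assumes float is a 32-bit IEEE 754 value and that the
// input is finite and not denormal.
inline float next_power_of_two_bits(float value)
{
    if (!(value > 0.0f)) return 0.0f;          // zero, negatives and NaN

    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // copy the raw float bits

    bits &= 0x7F800000u;  // keep the exponent only: 1.0 * 2^e
    bits += 0x00800000u;  // bump the exponent:      1.0 * 2^(e+1)

    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}

// Hypothetical self-check with a few hand-computed values.
inline void check_next_power_of_two_bits()
{
    assert(next_power_of_two_bits(5.0f) == 8.0f);
    assert(next_power_of_two_bits(0.75f) == 1.0f);
    assert(next_power_of_two_bits(4.0f) == 8.0f);  // exact powers move up too
}
```

Note the last assertion: an input that is already a power of 2 comes
back as the next one up (4 gives 8), whereas the log/pow version
returns 4. If the old behaviour is needed, testing whether the fraction
bits (bits & 0x007FFFFF) are all zero before masking and returning the
input unchanged in that case would be one way to get it.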

On my Athlon 64 my code runs about 200 times faster than the current
LL code. Admittedly it's not guaranteed to be fully portable. It
requires 2 things:
- The endianness is the same for float and int
- Floats are stored in IEEE format
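
Both requirements can be checked mechanically rather than taken on
faith. The following is a sketch under my own assumptions: the helper
names are invented, and the byte-order test simply checks that 1.0f
reads back as the integer 0x3F800000.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Compile-time checks: float must be an IEEE 754 type and 32 bits wide.
static_assert(std::numeric_limits<float>::is_iec559, "float is not IEEE 754");
static_assert(sizeof(float) == sizeof(std::uint32_t), "float is not 32 bits");

// Run-time check: if float and int share the same byte order, the bits
// of 1.0f (sign 0, exponent 127, fraction 0) read back as 0x3F800000.
inline bool float_and_int_byte_order_match()
{
    const float one = 1.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &one, sizeof bits);
    return bits == 0x3F800000u;
}

// Hypothetical guard to call once at startup in debug builds.
inline void require_bit_trick_support()
{
    assert(float_and_int_byte_order_match());
}
```

On a platform that fails either check, falling back to the log/pow
version would keep things correct at the cost of speed.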

My first bet would be that it's supported on all CPUs we want to work
with (now and in the future). Are there any PowerPC users who are
willing to play guinea pig from time to time? :)

Kind regards,

Dirk aka Blakar Ogre

On 6/23/07, Callum Lerwick <seg at haxxed.com> wrote:
> Don't underestimate the compiler's ability to optimize. I did a little
> writeup on my LJ:
>
> http://ninjaseg.livejournal.com/49842.html
>
> Lesson: Code what you mean, "tmp / 2", and the compiler can optimize it
> in the best way for the platform you tell it to target.
>
> Though notably, gcc 4.1 is too dumb to vectorize an integer divide on
> its own. But on the other hand, vectorization hasn't proven to be a
> benefit in this case. MMX/SSE really is of little benefit in a loop with
> a single divide, in fact it seems to slow it down a little. Moving
> things in and out of the registers is the bottleneck. Vectorization is
> really only a benefit if you can keep it all in the registers and do a
> large number of operations in a row.
>
> The cleanup on the way to vectorization has resulted in a measurable
> speedup though.
>
> But the biggest speedup by far has been from reducing cache pollution
> and memory overhead.
>
> > --- Write custom code for your math ---
> > All modern CPU's support SIMD instructions aimed at 3D. Compilers
> > don't convert your vector math for you, you need to do it yourself.
>
> Compilers are actually starting to be pretty good at autovectorization.
> You still need to design the code with vectorization in mind. In gcc's
> case, it is very picky about aliasing, basically requiring C99
> "restrict" to be used on all pointers, and assigning structure members
> to temporary variables...
>
> I've experimented with Intel's compiler, but I can't actually get the
> thing to link right on Fedora 6/7, so I've only been able to look at its
> assembler output.

