[sldev] QueryPerformanceCounter() & related issues VWR-940, VWR-962, VWR975

Dzonatas dzonatas at dzonux.net
Sat Jun 2 11:55:10 PDT 2007


Paul TBBle Hampson wrote:
> So that assumption has no basis here.
>   
Correct.
> On the other hand, the simplicity was asserted by the person who wrote
> the PowerPC assembly routine I used in my patch. And authorative as he
> might be for his code, I'm not sure he can speak for anyone else who's
> implemented functionally identical code.
>   
A quick google reveals similar code used elsewhere. Actually, a cmpw 
instruction instead of a cmplw seems more common. Hmm.
> Main change I did was move rdtsc and likewise code into the header
>> file, llfasttimers.h, so that code can now assemble in-line. Eh...
>> saves a little overhead.
>>     
>
> As per coding standard, please provide profiling justification for this.
>   
Consider that the code that calls this function is in-line (in the class 
header), it creates a bias in the compiler to not in-line any of the 
code -- even the code of the class header. The reason that it originally 
exists in the object module rather than the header may be based on the 
pentium 1 or pre VS6.0 era. That is a easy presumption because of the 
emit usage. The instruction rdtsc was undocumented, and the only sure 
way to use it was to wrap it up in a call structure with a stack. Newer 
compilers know about rdtsc and how to handle its usage, so there is no 
need to wrap it anymore unless you want to time individual assembly 
instructions and need to prevent out-of-order execution.

The basic recommended system is past the pentium 1 era.
> It's already hard enough to make sense of as it is, and moving things
> around like that makes it _very_ hard to review a patch. Once the code
> is done and working, profile and inline it as neccessary.
>   
I've profiled the viewer and found that there is high cache pollution. A 
polluted cache will not let any of the viewer run at expected speed. One 
step to fix this is to reduce bandwidth bottlenecks. The change to make 
this more in-line will help localize data access. For one, more in-line 
code avoids a global call stack and further page faults. Such isolated 
cache fetches are not easy to profile, if they can be isolated, due to 
the nature of live data being much more random with the overhead of 50 
threads.

General data alignment is another way to reduce cache pollution, but we 
are aways from that being worthwhile. I've turned on alignment and 
optimizations for more recent processors. With the viewer completely 
compiled that way, I've seen code run 2 to 3 times faster. However, that 
would mean maintenance overhead for separate binaries for similar 
processors, which I've learned has been discussed and passed on.

We need incentives to improve end user hardware.
> cpu MHz only exists on x86 and AMD64. It's "clock" on PowerPC.
>   
Thanks. I'll add that.
> /proc might not be the best place to fetch the current CPU speed from,
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is prolly more
> reliable,
Between what Tofu and a google recommends, this appears to be the 
preferred order:

Test /sys/devices/system/cpu/cpu0/cpuinfo_cur_freq, else
Test /sys/devices/system/cpu/cpu0/scaling_cur_freq, else
Test /proc/cpuinfo, else
Test gettimeofday() & rdtsc (or likewise), else
Test gettimeofday() & expect 1Mhz
>  ##else
>   
>>  U64 LLFastTimer::countsPerSecond()
>>  {
>> @@ -108,6 +62,8 @@
>>  	{
>>  		CProcessor proc;
>>  		sCPUClockFrequency = proc.GetCPUFrequency(50);
>> +		if(!sCPUClockFrequency)
>> +			sCPUClockFrequency = 1000000.0;
>>     
>>  	return U64(sCPUClockFrequency);
>>     
>
> I think this rather confusingly ties "has a high-performance counter"
> code to the "has a sensible GetCPUFrequency value" code.
>   
> This is fine on win32 (has both), Linux x86 (has both), Linux AMD64 (has
> both), and PowerPC (Special-cased with #define, never hits this).
>
> Also the high performance counters falling back to gettime of day, will
> be correct.
>
> But it took me two readings to realise that, and I suggest it's a rather
> tangled way of approaching the problem.
>   
Yep. Needs work.

Perhaps, it is best to call LLCPUInfo.getMhz() instead of a direct call 
to CProcessor::GetCPUFrequency(). That way the best frequency 
determination can be more localized to LLCPUInfo, and fallback can be in 
CProcessor::GetCPUFrequency().

...almost quietly fixed a bug here.

-- 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.secondlife.com/pipermail/sldev/attachments/20070602/fc585fb2/attachment-0001.htm


More information about the SLDev mailing list