[sldev] QueryPerformanceCounter() & related issues VWR-940,
VWR-962, VWR975
Dzonatas
dzonatas at dzonux.net
Sat Jun 2 11:55:10 PDT 2007
Paul TBBle Hampson wrote:
> So that assumption has no basis here.
>
Correct.
> On the other hand, the simplicity was asserted by the person who wrote
> the PowerPC assembly routine I used in my patch. And authorative as he
> might be for his code, I'm not sure he can speak for anyone else who's
> implemented functionally identical code.
>
A quick google reveals similar code used elsewhere. Actually, a cmpw
instruction instead of a cmplw seems more common. Hmm.
> Main change I did was move rdtsc and likewise code into the header
>> file, llfasttimers.h, so that code can now assemble in-line. Eh...
>> saves a little overhead.
>>
>
> As per coding standard, please provide profiling justification for this.
>
Consider that the code that calls this function is in-line (in the class
header), it creates a bias in the compiler to not in-line any of the
code -- even the code of the class header. The reason that it originally
exists in the object module rather than the header may be based on the
pentium 1 or pre VS6.0 era. That is a easy presumption because of the
emit usage. The instruction rdtsc was undocumented, and the only sure
way to use it was to wrap it up in a call structure with a stack. Newer
compilers know about rdtsc and how to handle its usage, so there is no
need to wrap it anymore unless you want to time individual assembly
instructions and need to prevent out-of-order execution.
The basic recommended system is past the pentium 1 era.
> It's already hard enough to make sense of as it is, and moving things
> around like that makes it _very_ hard to review a patch. Once the code
> is done and working, profile and inline it as neccessary.
>
I've profiled the viewer and found that there is high cache pollution. A
polluted cache will not let any of the viewer run at expected speed. One
step to fix this is to reduce bandwidth bottlenecks. The change to make
this more in-line will help localize data access. For one, more in-line
code avoids a global call stack and further page faults. Such isolated
cache fetches are not easy to profile, if they can be isolated, due to
the nature of live data being much more random with the overhead of 50
threads.
General data alignment is another way to reduce cache pollution, but we
are aways from that being worthwhile. I've turned on alignment and
optimizations for more recent processors. With the viewer completely
compiled that way, I've seen code run 2 to 3 times faster. However, that
would mean maintenance overhead for separate binaries for similar
processors, which I've learned has been discussed and passed on.
We need incentives to improve end user hardware.
> cpu MHz only exists on x86 and AMD64. It's "clock" on PowerPC.
>
Thanks. I'll add that.
> /proc might not be the best place to fetch the current CPU speed from,
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is prolly more
> reliable,
Between what Tofu and a google recommends, this appears to be the
preferred order:
Test /sys/devices/system/cpu/cpu0/cpuinfo_cur_freq, else
Test /sys/devices/system/cpu/cpu0/scaling_cur_freq, else
Test /proc/cpuinfo, else
Test gettimeofday() & rdtsc (or likewise), else
Test gettimeofday() & expect 1Mhz
> ##else
>
>> U64 LLFastTimer::countsPerSecond()
>> {
>> @@ -108,6 +62,8 @@
>> {
>> CProcessor proc;
>> sCPUClockFrequency = proc.GetCPUFrequency(50);
>> + if(!sCPUClockFrequency)
>> + sCPUClockFrequency = 1000000.0;
>>
>> return U64(sCPUClockFrequency);
>>
>
> I think this rather confusingly ties "has a high-performance counter"
> code to the "has a sensible GetCPUFrequency value" code.
>
> This is fine on win32 (has both), Linux x86 (has both), Linux AMD64 (has
> both), and PowerPC (Special-cased with #define, never hits this).
>
> Also the high performance counters falling back to gettime of day, will
> be correct.
>
> But it took me two readings to realise that, and I suggest it's a rather
> tangled way of approaching the problem.
>
Yep. Needs work.
Perhaps, it is best to call LLCPUInfo.getMhz() instead of a direct call
to CProcessor::GetCPUFrequency(). That way the best frequency
determination can be more localized to LLCPUInfo, and fallback can be in
CProcessor::GetCPUFrequency().
...almost quietly fixed a bug here.
--
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.secondlife.com/pipermail/sldev/attachments/20070602/fc585fb2/attachment-0001.htm
More information about the SLDev
mailing list