[sldev] Optimizing OpenJPEG (oprofile kicks ass)
Callum Lerwick
seg at haxxed.com
Thu Mar 29 08:26:08 PDT 2007
So, I decided to fire up oprofile and let it loose upon Second Life.
My incredibly craptacular laptop is my guinea pig.
CPU: PIII, speed 1328.94 MHz (estimated)
samples  %        linenr info                image name            symbol name
-------------------------------------------------------------------------------
486766   15.9762  (no location information)  libc-2.5.so           memcpy
449195   14.7431  t1.c:1001                  libopenjpeg.so.1.0.0  t1_decode_cblks
167071   5.4835   t_vb_lighttmp.h:239        i915_dri.so           light_rgba
149348   4.9018   intel_tex.c:754            i915_dri.so           intelUploadTexImages
146105   4.7953   t_vb_lighttmp.h:239        i915_dri.so           light_rgba_material
140674   4.6171   dwt.c:524                  libopenjpeg.so.1.0.0  dwt_decode_real
86711    2.8459   dwt.c:181                  libopenjpeg.so.1.0.0  dwt_interleave_v
83991    2.7567   tcd.c:1231                 libopenjpeg.so.1.0.0  tcd_decode_tile
82734    2.7154   dwt.c:285                  libopenjpeg.so.1.0.0  dwt_decode_1_real
79416    2.6065   mct.c:111                  libopenjpeg.so.1.0.0  mct_decode_real
70358    2.3092   (no location information)  libc-2.5.so           memset
47125    1.5467   light.c:599                i915_dri.so           _mesa_update_material
So a bunch of memcpy-ing tops the list (wonder where that's coming
from), followed by OpenJPEG t1_decode_cblks as expected, then the i915
drivers, then the OpenJPEG dwt, followed by memset and the i915 drivers
again.
Let's take a closer look at t1_decode_cblks:
: /* Changed by Dmitry Kolyadin */
673 0.1498 : for (j = 0; j <= h; j++) {
27823 6.1940 : for (i = 0; i <= w; i++) {
144 0.0321 : t1->flags[j][i] = 0;
: }
: }
:
: /* Changed by Dmitry Kolyadin */
2103 0.4682 : for (i = 0; i < w; i++) {
156170 34.7666 : for (j = 0; j < h; j++){
52543 11.6971 : t1->data[j][i] = 0;
: }
: }
I don't know what Dmitry Kolyadin was trying to accomplish, but for some
reason that second loop is the opposite way around (i outside, j inside),
so every write lands in a different row, and you can see how it thrashes
the cache. And look at what it's doing. The t1 is spending an awful lot
of time JUST ZEROING ARRAYS! What the hell??
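For anyone who wants to poke at the effect outside of OpenJPEG, here is a
minimal sketch of the two loop orders. It is not the library's code:
grid_t, GRID_DIM and the three function names are made up for
illustration, and the 1024x1024 int array is only an assumption based on
the array size mentioned in this post. The memset variant is just one
more way to zero each row and is not something OpenJPEG does here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GRID_DIM 1024  /* assumed size, matching the 1024x1024 arrays */

/* Hypothetical stand-in for the t1 data array. */
typedef struct {
    int data[GRID_DIM][GRID_DIM];
} grid_t;

/* Column-first: consecutive writes are GRID_DIM ints apart, so each
 * one touches a different cache line. */
void zero_column_first(grid_t *g, int w, int h)
{
    int i, j;
    assert(g != NULL && w <= GRID_DIM && h <= GRID_DIM);
    for (i = 0; i < w; i++) {
        for (j = 0; j < h; j++) {
            g->data[j][i] = 0;
        }
    }
}

/* Row-first: the inner loop walks consecutive addresses, which is
 * cache friendly and the shape a vectorizer can work with. */
void zero_row_first(grid_t *g, int w, int h)
{
    int i, j;
    assert(g != NULL && w <= GRID_DIM && h <= GRID_DIM);
    for (j = 0; j < h; j++) {
        for (i = 0; i < w; i++) {
            g->data[j][i] = 0;
        }
    }
}

/* Same result, one memset per row over the first w entries. */
void zero_rows_memset(grid_t *g, int w, int h)
{
    int j;
    assert(g != NULL && w <= GRID_DIM && h <= GRID_DIM);
    for (j = 0; j < h; j++) {
        memset(g->data[j], 0, (size_t)w * sizeof g->data[j][0]);
    }
}
```

All three leave the same w x h region zeroed; only the memory access
pattern differs.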
Let's flip that second loop around and let gcc4's autovectorizer loose on
it:
gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m32 -march=pentium3
-fasynchronous-unwind-tables -ftree-vectorize
-ftree-vectorizer-verbose=5 -ffast-math -fPIC -Ilibopenjpeg -c
libopenjpeg/t1.c -o libopenjpeg/t1.o
libopenjpeg/t1.c:659: note: Alignment of access forced using peeling.
libopenjpeg/t1.c:659: note: LOOP VECTORIZED.
libopenjpeg/t1.c:666: note: Alignment of access forced using peeling.
libopenjpeg/t1.c:666: note: LOOP VECTORIZED.
libopenjpeg/t1.c:1057: note: vectorized 2 loops in function.
And see what that gets us:
samples  %        linenr info                image name            symbol name
-------------------------------------------------------------------------------
1032663  20.3752  (no location information)  libc-2.5.so           memcpy
439716   8.6759   t1.c:1001                  libopenjpeg.so.1.0.0  t1_decode_cblks
321558   6.3446   intel_tex.c:754            i915_dri.so           intelUploadTexImages
271098   5.3490   dwt.c:524                  libopenjpeg.so.1.0.0  dwt_decode_real
252458   4.9812   t_vb_lighttmp.h:239        i915_dri.so           light_rgba
228712   4.5127   t_vb_lighttmp.h:239        i915_dri.so           light_rgba_material
170216   3.3585   dwt.c:181                  libopenjpeg.so.1.0.0  dwt_interleave_v
147816   2.9165   dwt.c:285                  libopenjpeg.so.1.0.0  dwt_decode_1_real
138798   2.7386   tcd.c:1231                 libopenjpeg.so.1.0.0  tcd_decode_tile
99387    1.9610   mct.c:111                  libopenjpeg.so.1.0.0  mct_decode_real
88111    1.7385   (no location information)  libc-2.5.so           memset
74694    1.4738   light.c:599                i915_dri.so           _mesa_update_material
: /* Changed by Dmitry Kolyadin */
1589 0.5284 : for (j = 0; j <= h; ++j) {
4952 1.6466 : for (i = 0; i <= w; ++i) {
14814 4.9258 : t1->flags[j][i] = 0;
: }
: }
:
: /* Changed by Dmitry Kolyadin */
5198 1.7284 : for (j = 0; j < h; ++j) {
21078 7.0086 : for (i = 0; i < w; ++i) {
23117 7.6866 : t1->data[j][i] = 0;
: }
: }
Nice. Our hot spot has moved down here:
70 0.0233 : w = tilec->x1 - tilec->x0;
51 0.0170 : if (tcp->tccps[compno].qmfbid == 1) {
73 0.0243 : for (j = 0; j < cblk->y1 - cblk->y0; j++) {
6770 2.2511 : for (i = 0; i < cblk->x1 - cblk->x0; i++) {
841 0.2796 : tilec->data[x + i + (y + j) * w] = t1->data[j][i]/2;
: }
: }
: } else { /* if (tcp->tccps[compno].qmfbid == 0) */
447 0.1486 : for (j = 0; j < cblk->y1 - cblk->y0; j++) {
79057 26.2872 : for (i = 0; i < cblk->x1 - cblk->x0; i++) {
28888 9.6055 : if (t1->data[j][i] >> 1 == 0) {
2348 0.7807 : tilec->data[x + i + (y + j) * w] = 0;
: } else {
405 0.1347 : double tmp = (double)((t1->data[j][i] << 12) * band->stepsize);
5086 1.6911 : int tmp2 = ((int) (floor(fabs(tmp)))) + ((int) floor(fabs(tmp*2))%2);
626 0.2082 : tilec->data[x + i + (y + j) * w] = ((tmp<0)?-tmp2:tmp2);
: }
Which is a bit more sensible. I guess. t1->flags and t1->data are huge
static 1024x1024 arrays, eating 8 MB(!) of RAM total between them if I'm
doing my math right. Christ. So, I'm looking into making them
dynamically allocated; I don't see slviewer ever using more than 64x64
(33 KB!). That should eliminate quite a bit of cache thrashing...
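
A rough sketch of what per-code-block allocation could look like, and of
the size arithmetic. This is not a patch: cblk_buf_t and the cblk_buf_*
functions are hypothetical names, and I'm assuming int elements plus a
one-entry border around the flags (the flags loop above runs with <=
bounds, so it needs at least one extra row and column). With 4-byte ints
that works out to 33,808 bytes for a 64x64 block versus roughly 8 MB for
1024x1024.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical replacement for the static arrays. */
typedef struct {
    int w, h;    /* code-block size this buffer was allocated for */
    int *data;   /* w * h coefficients, row-major */
    int *flags;  /* (w + 2) * (h + 2) flags, assumed 1-entry border */
} cblk_buf_t;

/* Bytes needed for both arrays at a given code-block size. */
size_t cblk_buf_bytes(int w, int h)
{
    return ((size_t)w * h + (size_t)(w + 2) * (h + 2)) * sizeof(int);
}

/* Allocate zeroed buffers; returns 0 on success, -1 on failure.
 * calloc hands back zeroed memory, so a fresh buffer needs no
 * separate clearing pass. */
int cblk_buf_alloc(cblk_buf_t *b, int w, int h)
{
    assert(b != NULL && w > 0 && h > 0);
    b->w = w;
    b->h = h;
    b->data = calloc((size_t)w * h, sizeof *b->data);
    b->flags = calloc((size_t)(w + 2) * (h + 2), sizeof *b->flags);
    if (b->data == NULL || b->flags == NULL) {
        free(b->data);
        free(b->flags);
        b->data = NULL;
        b->flags = NULL;
        return -1;
    }
    return 0;
}

/* Row-major accessor standing in for data[j][i]. */
int *cblk_buf_at(cblk_buf_t *b, int i, int j)
{
    assert(i >= 0 && i < b->w && j >= 0 && j < b->h);
    return &b->data[(size_t)j * b->w + i];
}

void cblk_buf_free(cblk_buf_t *b)
{
    free(b->data);
    free(b->flags);
    b->data = NULL;
    b->flags = NULL;
}
```

If the buffer gets reused across code blocks it still has to be cleared
each time, but only over w x h entries instead of whatever part of a
1024-wide array gets touched.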