[sldev] Optimizing OpenJPEG (oprofile kicks ass)

Callum Lerwick seg at haxxed.com
Thu Mar 29 08:26:08 PDT 2007


So, I decided to fire up oprofile and let it loose upon Second Life.

My incredibly craptacular laptop is my guinea pig.

CPU: PIII, speed 1328.94 MHz (estimated)
samples  %        linenr info                 image name               symbol name
-------------------------------------------------------------------------------
486766   15.9762  (no location information)   libc-2.5.so              memcpy
449195   14.7431  t1.c:1001                   libopenjpeg.so.1.0.0     t1_decode_cblks
167071    5.4835  t_vb_lighttmp.h:239         i915_dri.so              light_rgba
149348    4.9018  intel_tex.c:754             i915_dri.so              intelUploadTexImages
146105    4.7953  t_vb_lighttmp.h:239         i915_dri.so              light_rgba_material
140674    4.6171  dwt.c:524                   libopenjpeg.so.1.0.0     dwt_decode_real
86711     2.8459  dwt.c:181                   libopenjpeg.so.1.0.0     dwt_interleave_v
83991     2.7567  tcd.c:1231                  libopenjpeg.so.1.0.0     tcd_decode_tile
82734     2.7154  dwt.c:285                   libopenjpeg.so.1.0.0     dwt_decode_1_real
79416     2.6065  mct.c:111                   libopenjpeg.so.1.0.0     mct_decode_real
70358     2.3092  (no location information)   libc-2.5.so              memset
47125     1.5467  light.c:599                 i915_dri.so              _mesa_update_material

So a bunch of memcpy-ing tops the list (wonder where that's coming
from), followed by OpenJPEG t1_decode_cblks as expected, then the i915
drivers, then the OpenJPEG dwt, followed by memset and the i915 drivers
again.

Let's take a closer look at t1_decode_cblks:

               :  /* Changed by Dmitry Kolyadin */
   673  0.1498 :  for (j = 0; j <= h; j++) {
 27823  6.1940 :     for (i = 0; i <= w; i++) {
   144  0.0321 :        t1->flags[j][i] = 0;
               :     }
               :  }
               :
               :  /* Changed by Dmitry Kolyadin */
  2103  0.4682 :  for (i = 0; i < w; i++) {
156170 34.7666 :     for (j = 0; j < h; j++){
 52543 11.6971 :        t1->data[j][i] = 0;
               :     }
               :  }

I don't know what Dmitry Kolyadin was trying to accomplish, but for some
reason that second loop is the opposite way around (i outside, j inside),
so the inner loop strides down a column instead of along a row, and you
can see how it thrashes the cache. And look at what it's doing. The t1 is
spending an awful lot of time JUST ZEROING ARRAYS! What the hell??
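
To make the cache point concrete, here is a minimal standalone sketch of
the same zeroing done in both loop orders. It is not OpenJPEG code: the
array `grid`, the size `N` and all the function names are made up for
illustration. C stores 2D arrays row-major, so only the second version
walks memory sequentially.

```c
#include <assert.h>

#define N 256                /* hypothetical size, only for this demo */

static int grid[N][N];       /* stand-in for an array like t1->data */

/* Set every element to v so the zeroing has something to clear. */
void fill(int v)
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            grid[j][i] = v;
}

/* Count elements that are still non-zero. */
long count_nonzero(void)
{
    int i, j;
    long n = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            if (grid[j][i] != 0)
                n++;
    return n;
}

/* Column-first: each inner step jumps N * sizeof(int) bytes, so
 * consecutive stores land on different cache lines. */
void zero_column_first(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            grid[j][i] = 0;
}

/* Row-first: the inner loop touches consecutive addresses, matching
 * the row-major layout. */
void zero_row_first(void)
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            grid[j][i] = 0;
}

/* Both orders must leave the array fully zeroed; only the memory
 * access pattern differs. */
void check_both_orders(void)
{
    fill(1);
    zero_column_first();
    assert(count_nonzero() == 0);
    fill(1);
    zero_row_first();
    assert(count_nonzero() == 0);
}
```

Both functions produce the same result, so any timing difference between
them comes from the access pattern alone.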

Let's flip that second loop around and let gcc4's autovectorizer loose on
it:

gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m32 -march=pentium3
-fasynchronous-unwind-tables -ftree-vectorize
-ftree-vectorizer-verbose=5 -ffast-math -fPIC -Ilibopenjpeg -c
libopenjpeg/t1.c -o libopenjpeg/t1.o

libopenjpeg/t1.c:659: note: Alignment of access forced using peeling.
libopenjpeg/t1.c:659: note: LOOP VECTORIZED.
libopenjpeg/t1.c:666: note: Alignment of access forced using peeling.
libopenjpeg/t1.c:666: note: LOOP VECTORIZED.
libopenjpeg/t1.c:1057: note: vectorized 2 loops in function.

And see what that gets us:

samples  %        linenr info                 image name               symbol name
-------------------------------------------------------------------------------
1032663  20.3752  (no location information)   libc-2.5.so              memcpy
439716    8.6759  t1.c:1001                   libopenjpeg.so.1.0.0     t1_decode_cblks
321558    6.3446  intel_tex.c:754             i915_dri.so              intelUploadTexImages
271098    5.3490  dwt.c:524                   libopenjpeg.so.1.0.0     dwt_decode_real
252458    4.9812  t_vb_lighttmp.h:239         i915_dri.so              light_rgba
228712    4.5127  t_vb_lighttmp.h:239         i915_dri.so              light_rgba_material
170216    3.3585  dwt.c:181                   libopenjpeg.so.1.0.0     dwt_interleave_v
147816    2.9165  dwt.c:285                   libopenjpeg.so.1.0.0     dwt_decode_1_real
138798    2.7386  tcd.c:1231                  libopenjpeg.so.1.0.0     tcd_decode_tile
99387     1.9610  mct.c:111                   libopenjpeg.so.1.0.0     mct_decode_real
88111     1.7385  (no location information)   libc-2.5.so              memset
74694     1.4738  light.c:599                 i915_dri.so              _mesa_update_material

               :  /* Changed by Dmitry Kolyadin */
  1589  0.5284 :  for (j = 0; j <= h; ++j) {
  4952  1.6466 :     for (i = 0; i <= w; ++i) {
 14814  4.9258 :        t1->flags[j][i] = 0;
               :     }
               :  }
               :
               :  /* Changed by Dmitry Kolyadin */
  5198  1.7284 :  for (j = 0; j < h; ++j) {
 21078  7.0086 :     for (i = 0; i < w; ++i) {
 23117  7.6866 :        t1->data[j][i] = 0;
               :     }
               :  }

Nice. t1_decode_cblks is down from 14.7% to 8.7% of samples, and our hot
spot has moved down here:

    70  0.0233 :  w = tilec->x1 - tilec->x0;
    51  0.0170 :  if (tcp->tccps[compno].qmfbid == 1) {
    73  0.0243 :     for (j = 0; j < cblk->y1 - cblk->y0; j++) {
  6770  2.2511 :        for (i = 0; i < cblk->x1 - cblk->x0; i++) {
   841  0.2796 :           tilec->data[x + i + (y + j) * w] = t1->data[j][i]/2;
               :        }
               :     }
               :  } else {    /* if (tcp->tccps[compno].qmfbid == 0) */
   447  0.1486 :     for (j = 0; j < cblk->y1 - cblk->y0; j++) {
 79057 26.2872 :        for (i = 0; i < cblk->x1 - cblk->x0; i++) {
 28888  9.6055 :           if (t1->data[j][i] >> 1 == 0) {
  2348  0.7807 :              tilec->data[x + i + (y + j) * w] = 0;
               :           } else {
   405  0.1347 :              double tmp = (double)((t1->data[j][i] << 12) * band->stepsize);
  5086  1.6911 :              int tmp2 = ((int) (floor(fabs(tmp)))) + ((int) floor(fabs(tmp*2))%2);
   626  0.2082 :              tilec->data[x + i + (y + j) * w] = ((tmp<0)?-tmp2:tmp2);
               :           }

Which is a bit more sensible. I guess. t1->flags and t1->data are huge
static 1024x1024 arrays, eating 8 MB(!) of RAM total between them if I'm
doing my math right. Christ. So I'm looking into making them dynamically
allocated. I don't see slviewer ever using more than 64x64 (33 KB!).
That should eliminate quite a bit of cache thrashing...
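
A rough sketch of what per-code-block allocation might look like, not the
actual patch. Everything here is hypothetical: the `cblk_buf` struct and
its functions are invented names, entries are assumed to be `int`, and
flags is assumed to carry a one-element border on each side (a guess
suggested by the `<=` loop bounds above).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical replacement for the two static arrays; none of these
 * names come from OpenJPEG. */
typedef struct {
    int w, h;     /* code block size */
    int *data;    /* h rows of w ints, row-major */
    int *flags;   /* (h + 2) rows of (w + 2) ints: assumed border */
} cblk_buf;

/* Bytes needed for one w x h code block under the assumptions above. */
size_t cblk_buf_bytes(int w, int h)
{
    size_t data_n  = (size_t)w * (size_t)h;
    size_t flags_n = (size_t)(w + 2) * (size_t)(h + 2);
    return (data_n + flags_n) * sizeof(int);
}

/* Allocate both buffers, already zeroed by calloc.
 * Returns 0 on success, -1 if either allocation fails. */
int cblk_buf_alloc(cblk_buf *b, int w, int h)
{
    assert(b != NULL && w > 0 && h > 0);
    b->w = w;
    b->h = h;
    b->data  = calloc((size_t)w * (size_t)h, sizeof(int));
    b->flags = calloc((size_t)(w + 2) * (size_t)(h + 2), sizeof(int));
    if (b->data == NULL || b->flags == NULL) {
        free(b->data);
        free(b->flags);
        b->data = NULL;
        b->flags = NULL;
        return -1;
    }
    return 0;
}

/* Release both buffers; safe to call after a failed alloc. */
void cblk_buf_free(cblk_buf *b)
{
    free(b->data);
    free(b->flags);
    b->data = NULL;
    b->flags = NULL;
}
```

With 4-byte ints, cblk_buf_bytes(64, 64) works out to
(64*64 + 66*66) * 4 = 33808 bytes, which lines up with the 33 KB figure,
while two 1024x1024 int arrays come to 2 * 4 MB = 8 MB.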