[sldev] Optimizing OpenJPEG (oprofile kicks ass)

James Cook james at lindenlab.com
Thu Mar 29 09:26:48 PDT 2007


I have not read the OpenJPEG source, but this might be a good place to 
use memset(), which is highly optimized on some platforms.  In the 
production viewer we use a copy of Intel's fast memcpy() library that 
includes a fast memset().  I'm not sure whether that's part of the open 
source package, but memset() won't be slower than zeroing an array yourself.

James

Callum Lerwick wrote:
> So, I decided to fire up oprofile and let it loose upon Second Life.
> 
> My incredibly craptacular laptop is my guinea pig.
> 
> CPU: PIII, speed 1328.94 MHz (estimated)
> samples  %        linenr info                 image name               symbol name
> -------------------------------------------------------------------------------
> 486766   15.9762  (no location information)   libc-2.5.so              memcpy
> 449195   14.7431  t1.c:1001                   libopenjpeg.so.1.0.0     t1_decode_cblks
> 167071    5.4835  t_vb_lighttmp.h:239         i915_dri.so              light_rgba
> 149348    4.9018  intel_tex.c:754             i915_dri.so              intelUploadTexImages
> 146105    4.7953  t_vb_lighttmp.h:239         i915_dri.so              light_rgba_material
> 140674    4.6171  dwt.c:524                   libopenjpeg.so.1.0.0     dwt_decode_real
> 86711     2.8459  dwt.c:181                   libopenjpeg.so.1.0.0     dwt_interleave_v
> 83991     2.7567  tcd.c:1231                  libopenjpeg.so.1.0.0     tcd_decode_tile
> 82734     2.7154  dwt.c:285                   libopenjpeg.so.1.0.0     dwt_decode_1_real
> 79416     2.6065  mct.c:111                   libopenjpeg.so.1.0.0     mct_decode_real
> 70358     2.3092  (no location information)   libc-2.5.so              memset
> 47125     1.5467  light.c:599                 i915_dri.so              _mesa_update_material
> 
> So a bunch of memcpy-ing tops the list (wonder where that's coming
> from), followed by OpenJPEG t1_decode_cblks as expected, then the i915
> drivers, then the OpenJPEG dwt, followed by memset and the i915 drivers
> again.
> 
> Let's take a closer look at t1_decode_cblks:
> 
>                :  /* Changed by Dmitry Kolyadin */
>    673  0.1498 :  for (j = 0; j <= h; j++) {
>  27823  6.1940 :     for (i = 0; i <= w; i++) {
>    144  0.0321 :        t1->flags[j][i] = 0;
>                :     }
>                :  }
>                :
>                :  /* Changed by Dmitry Kolyadin */
>   2103  0.4682 :  for (i = 0; i < w; i++) {
> 156170 34.7666 :     for (j = 0; j < h; j++){
>  52543 11.6971 :        t1->data[j][i] = 0;
>                :     }
>                :  }
> 
> I don't know what Dmitry Kolyadin was trying to accomplish, but for some
> reason that second loop is inverted (i in the outer loop, j in the inner),
> and you can see how it thrashes the cache. And look at what it's doing:
> the t1 stage is spending an awful lot of time JUST ZEROING ARRAYS! What
> the hell??
> 
> Let's flip that second loop around and let gcc 4's autovectorizer loose
> on it:
> 
> gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector --param=ssp-buffer-size=4 -m32 -march=pentium3
> -fasynchronous-unwind-tables -ftree-vectorize
> -ftree-vectorizer-verbose=5 -ffast-math -fPIC -Ilibopenjpeg -c
> libopenjpeg/t1.c -o libopenjpeg/t1.o
> 
> libopenjpeg/t1.c:659: note: Alignment of access forced using peeling.
> libopenjpeg/t1.c:659: note: LOOP VECTORIZED.
> libopenjpeg/t1.c:666: note: Alignment of access forced using peeling.
> libopenjpeg/t1.c:666: note: LOOP VECTORIZED.
> libopenjpeg/t1.c:1057: note: vectorized 2 loops in function.
> 
> And see what that gets us:
> 
> samples  %        linenr info                 image name               symbol name
> -------------------------------------------------------------------------------
> 1032663  20.3752  (no location information)   libc-2.5.so              memcpy
> 439716    8.6759  t1.c:1001                   libopenjpeg.so.1.0.0     t1_decode_cblks
> 321558    6.3446  intel_tex.c:754             i915_dri.so              intelUploadTexImages
> 271098    5.3490  dwt.c:524                   libopenjpeg.so.1.0.0     dwt_decode_real
> 252458    4.9812  t_vb_lighttmp.h:239         i915_dri.so              light_rgba
> 228712    4.5127  t_vb_lighttmp.h:239         i915_dri.so              light_rgba_material
> 170216    3.3585  dwt.c:181                   libopenjpeg.so.1.0.0     dwt_interleave_v
> 147816    2.9165  dwt.c:285                   libopenjpeg.so.1.0.0     dwt_decode_1_real
> 138798    2.7386  tcd.c:1231                  libopenjpeg.so.1.0.0     tcd_decode_tile
> 99387     1.9610  mct.c:111                   libopenjpeg.so.1.0.0     mct_decode_real
> 88111     1.7385  (no location information)   libc-2.5.so              memset
> 74694     1.4738  light.c:599                 i915_dri.so              _mesa_update_material
> 
>                :  /* Changed by Dmitry Kolyadin */
>   1589  0.5284 :  for (j = 0; j <= h; ++j) {
>   4952  1.6466 :     for (i = 0; i <= w; ++i) {
>  14814  4.9258 :        t1->flags[j][i] = 0;
>                :     }
>                :  }
>                :
>                :  /* Changed by Dmitry Kolyadin */
>   5198  1.7284 :  for (j = 0; j < h; ++j) {
>  21078  7.0086 :     for (i = 0; i < w; ++i) {
>  23117  7.6866 :        t1->data[j][i] = 0;
>                :     }
>                :  }
> 
> Nice. Our hot spot has moved down here:
> 
>     70  0.0233 :  w = tilec->x1 - tilec->x0;
>     51  0.0170 :  if (tcp->tccps[compno].qmfbid == 1) {
>     73  0.0243 :     for (j = 0; j < cblk->y1 - cblk->y0; j++) {
>   6770  2.2511 :        for (i = 0; i < cblk->x1 - cblk->x0; i++) {
>    841  0.2796 :           tilec->data[x + i + (y + j) * w] = t1->data[j][i]/2;
>                :        }
>                :     }
>                :  } else {    /* if (tcp->tccps[compno].qmfbid == 0) */
>    447  0.1486 :     for (j = 0; j < cblk->y1 - cblk->y0; j++) {
>  79057 26.2872 :        for (i = 0; i < cblk->x1 - cblk->x0; i++) {
>  28888  9.6055 :           if (t1->data[j][i] >> 1 == 0) {
>   2348  0.7807 :              tilec->data[x + i + (y + j) * w] = 0;
>                :           } else {
>    405  0.1347 :              double tmp = (double)((t1->data[j][i] << 12) * band->stepsize);
>   5086  1.6911 :              int tmp2 = ((int) (floor(fabs(tmp)))) + ((int) floor(fabs(tmp*2))%2);
>    626  0.2082 :              tilec->data[x + i + (y + j) * w] = ((tmp<0)?-tmp2:tmp2);
>                :           }
> 
> Which is a bit more sensible, I guess. t1->flags and t1->data are huge
> static 1024x1024 arrays, eating 8 MB(!) of RAM between them if I'm
> doing my math right. Christ. So I'm looking into making them
> dynamically allocated; I don't see slviewer ever using more than 64x64
> (33 KB!). That should eliminate quite a bit of cache thrashing...
> 
> 