[sldev] Optimizing OpenJPEG (oprofile kicks ass)
James Cook
james at lindenlab.com
Thu Mar 29 09:26:48 PDT 2007
I have not read the OpenJPEG source, but this might be a good place to
use memset(), which on some platforms is highly optimized. In the
production viewer we use a copy of Intel's fast memcpy() library that
includes a fast memset. I'm not sure if that's part of the open source
package. But memset() won't be slower than zeroing an array yourself.
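Something along these lines might work, assuming each row of the static
array is contiguous (an untested sketch, not anything from our tree):

    #include <string.h>

    /* Zero just the used w-by-h corner of the static array, one
     * memset() per contiguous row. MAXW/MAXH stand in for whatever
     * bounds t1.h actually declares; the flags array could be
     * handled the same way. */
    #define MAXW 1024
    #define MAXH 1024

    static void zero_data(int data[MAXH][MAXW], int w, int h)
    {
        int j;
        for (j = 0; j < h; j++)
            memset(data[j], 0, (size_t)w * sizeof(int));
    }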
James
Callum Lerwick wrote:
> So, I decided to fire up oprofile and let it loose upon Second Life.
>
> My incredibly craptacular laptop is my guinea pig.
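>
> (For anyone wanting to reproduce this: opcontrol --start, run the
> viewer for a while, opcontrol --shutdown, then opreport -l for the
> symbol table and opannotate --source for the per-line numbers below.)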
>
> CPU: PIII, speed 1328.94 MHz (estimated)
> samples % linenr info image name symbol name
> -------------------------------------------------------------------------------
> 486766 15.9762 (no location information) libc-2.5.so memcpy
> 449195 14.7431 t1.c:1001 libopenjpeg.so.1.0.0 t1_decode_cblks
> 167071 5.4835 t_vb_lighttmp.h:239 i915_dri.so light_rgba
> 149348 4.9018 intel_tex.c:754 i915_dri.so intelUploadTexImages
> 146105 4.7953 t_vb_lighttmp.h:239 i915_dri.so light_rgba_material
> 140674 4.6171 dwt.c:524 libopenjpeg.so.1.0.0 dwt_decode_real
> 86711 2.8459 dwt.c:181 libopenjpeg.so.1.0.0 dwt_interleave_v
> 83991 2.7567 tcd.c:1231 libopenjpeg.so.1.0.0 tcd_decode_tile
> 82734 2.7154 dwt.c:285 libopenjpeg.so.1.0.0 dwt_decode_1_real
> 79416 2.6065 mct.c:111 libopenjpeg.so.1.0.0 mct_decode_real
> 70358 2.3092 (no location information) libc-2.5.so memset
> 47125 1.5467 light.c:599 i915_dri.so _mesa_update_material
>
> So a bunch of memcpy-ing tops the list (wonder where that's coming
> from), followed by OpenJPEG t1_decode_cblks as expected, then the i915
> drivers, then the OpenJPEG dwt, followed by memset and the i915 drivers
> again.
>
> Let's take a closer look at t1_decode_cblks:
>
> : /* Changed by Dmitry Kolyadin */
> 673 0.1498 : for (j = 0; j <= h; j++) {
> 27823 6.1940 : for (i = 0; i <= w; i++) {
> 144 0.0321 : t1->flags[j][i] = 0;
> : }
> : }
> :
> : /* Changed by Dmitry Kolyadin */
> 2103 0.4682 : for (i = 0; i < w; i++) {
> 156170 34.7666 : for (j = 0; j < h; j++){
> 52543 11.6971 : t1->data[j][i] = 0;
> : }
> : }
>
> I don't know what Dmitry Kolyadin was trying to accomplish, but for some
> reason that second loop is nested the wrong way around, and you can see
> how it thrashes the cache: with static 1024-int rows, each inner-loop
> store lands 4KB away from the previous one, so nearly every store
> misses. And look at what it's doing. The t1 is spending an awful lot of
> time JUST ZEROING ARRAYS! What the hell??
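>
> The fix is just swapping the nesting so the inner index walks
> consecutive ints in memory (C arrays are row-major):
>
>     for (j = 0; j < h; j++) {
>         for (i = 0; i < w; i++) {
>             t1->data[j][i] = 0;
>         }
>     }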
>
> With that second loop flipped around, let's let gcc4's autovectorizer
> loose on it:
>
> gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
> -fstack-protector --param=ssp-buffer-size=4 -m32 -march=pentium3
> -fasynchronous-unwind-tables -ftree-vectorize
> -ftree-vectorizer-verbose=5 -ffast-math -fPIC -Ilibopenjpeg -c
> libopenjpeg/t1.c -o libopenjpeg/t1.o
>
> libopenjpeg/t1.c:659: note: Alignment of access forced using peeling.
> libopenjpeg/t1.c:659: note: LOOP VECTORIZED.
> libopenjpeg/t1.c:666: note: Alignment of access forced using peeling.
> libopenjpeg/t1.c:666: note: LOOP VECTORIZED.
> libopenjpeg/t1.c:1057: note: vectorized 2 loops in function.
>
> And see what that gets us:
>
> samples % linenr info image name symbol name
> -------------------------------------------------------------------------------
> 1032663 20.3752 (no location information) libc-2.5.so memcpy
> 439716 8.6759 t1.c:1001 libopenjpeg.so.1.0.0 t1_decode_cblks
> 321558 6.3446 intel_tex.c:754 i915_dri.so intelUploadTexImages
> 271098 5.3490 dwt.c:524 libopenjpeg.so.1.0.0 dwt_decode_real
> 252458 4.9812 t_vb_lighttmp.h:239 i915_dri.so light_rgba
> 228712 4.5127 t_vb_lighttmp.h:239 i915_dri.so light_rgba_material
> 170216 3.3585 dwt.c:181 libopenjpeg.so.1.0.0 dwt_interleave_v
> 147816 2.9165 dwt.c:285 libopenjpeg.so.1.0.0 dwt_decode_1_real
> 138798 2.7386 tcd.c:1231 libopenjpeg.so.1.0.0 tcd_decode_tile
> 99387 1.9610 mct.c:111 libopenjpeg.so.1.0.0 mct_decode_real
> 88111 1.7385 (no location information) libc-2.5.so memset
> 74694 1.4738 light.c:599 i915_dri.so _mesa_update_material
>
> : /* Changed by Dmitry Kolyadin */
> 1589 0.5284 : for (j = 0; j <= h; ++j) {
> 4952 1.6466 : for (i = 0; i <= w; ++i) {
> 14814 4.9258 : t1->flags[j][i] = 0;
> : }
> : }
> :
> : /* Changed by Dmitry Kolyadin */
> 5198 1.7284 : for (j = 0; j < h; ++j) {
> 21078 7.0086 : for (i = 0; i < w; ++i) {
> 23117 7.6866 : t1->data[j][i] = 0;
> : }
> : }
>
> Nice. Our hot spot has moved down here:
>
> 70 0.0233 : w = tilec->x1 - tilec->x0;
> 51 0.0170 : if (tcp->tccps[compno].qmfbid == 1) {
> 73 0.0243 : for (j = 0; j < cblk->y1 - cblk->y0; j++) {
> 6770 2.2511 : for (i = 0; i < cblk->x1 - cblk->x0; i++) {
> 841 0.2796 : tilec->data[x + i + (y + j) * w] = t1->data[j][i]/2;
> : }
> : }
> : } else { /* if (tcp->tccps[compno].qmfbid == 0) */
> 447 0.1486 : for (j = 0; j < cblk->y1 - cblk->y0; j++) {
> 79057 26.2872 : for (i = 0; i < cblk->x1 - cblk->x0; i++) {
> 28888 9.6055 : if (t1->data[j][i] >> 1 == 0) {
> 2348 0.7807 : tilec->data[x + i + (y + j) * w] = 0;
> : } else {
> 405 0.1347 : double tmp = (double)((t1->data[j][i] << 12) * band->stepsize);
> 5086 1.6911 : int tmp2 = ((int) (floor(fabs(tmp)))) + ((int) floor(fabs(tmp*2))%2);
> 626 0.2082 : tilec->data[x + i + (y + j) * w] = ((tmp<0)?-tmp2:tmp2);
> : }
>
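> (If I'm reading it right, that floor/fabs dance is just rounding
> |tmp| to the nearest integer, i.e. tmp2 == (int)floor(fabs(tmp) + 0.5),
> with the sign put back by the ternary at the end.)
>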
> Which is a bit more sensible, I guess. t1->flags and t1->data are huge
> static 1024x1024 arrays, eating 8MB(!) of RAM between them if I'm doing
> my math right. Christ. So I'm looking into making them dynamically
> allocated; I don't see slviewer ever using more than 64x64 (33KB!).
> That should eliminate quite a bit of cache thrashing...
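>
> Something like this, maybe (untested sketch; the names, the flat
> data[j * w + i] indexing, and the flags dimensions are my guesses,
> sized from the zeroing loops above):
>
>     #include <stdlib.h>
>
>     /* Allocate per-codeblock buffers instead of static 1024x1024
>      * arrays. calloc() zeroes them too, which replaces the loops
>      * above for fresh blocks; caller frees both on failure. */
>     static int t1_alloc(int **data, int **flags, int w, int h)
>     {
>         *data  = (int *) calloc((size_t)w * h, sizeof(int));
>         *flags = (int *) calloc((size_t)(w + 1) * (h + 1), sizeof(int));
>         return (*data != NULL) && (*flags != NULL);
>     }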
>
>