[sldev] [patch] OpenJPEG fix_mul() optimization
Dzonatas
dzonatas at dzonux.net
Fri Apr 6 11:02:49 PDT 2007
alissa_sabre at yahoo.co.jp wrote:
> > Attached is a simple change that gives up to 10% extra performance in
> > DWT time.
>
> Hmm. If this fix_mul() is so important, why don't you make it as
> follows? It's simpler, does same thing, and should be slightly faster
> in general...
>
> static INLINE int fix_mul(int a, int b) {
> return (int) (((int64) a * (int64) b + 4096) >> 13);
> }
>
Very good.
I tested it, and it gave me the same output. However, it gave me these
times:
[INFO] tile 1 of 1
[INFO] - tiers-1 took 1.640000 s
[INFO] - dwt took 10.250000 s
[INFO] - tile decoded in 15.630000 s
Generated Outfile bird.bmp
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 49.04      4.36     4.36        3     1.45     2.32  dwt_decode_tile
 29.13      6.95     2.59    29760     0.00     0.00  dwt_decode_1_real
  5.74      7.46     0.51        1     0.51     0.51  imagetobmp
  5.62      7.96     0.50        1     0.50     8.36  tcd_decode_tile
  5.17      8.42     0.46        1     0.46     0.46  mct_decode_real

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 49.59     13.24    13.24        9     1.47     2.32  dwt_decode_tile
 28.58     20.87     7.63    89280     0.00     0.00  dwt_decode_1_real
  6.63     22.64     1.77        3     0.59     8.48  tcd_decode_tile
  5.28     24.05     1.41        3     0.47     0.47  mct_decode_real
  4.57     25.27     1.22        3     0.41     0.41  imagetobmp
The number to look at in each table is the second column of the second
row, the cumulative seconds through dwt_decode_1_real (6.95s total on a
single run and 20.87s total after 3 runs).
Here is my version:
static INLINE int fix_mul(int a, int b) {
    int64 temp = (int64) a * (int64) b;
    return (int) ((temp + (temp & 4096)) >> 13);
}
The results:
[INFO] tile 1 of 1
[INFO] - tiers-1 took 1.590000 s
[INFO] - dwt took 9.790000 s
[INFO] - tile decoded in 15.110000 s
Generated Outfile bird.bmp
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 52.78      4.56     4.56        3     1.52     2.20  dwt_decode_tile
 23.50      6.59     2.03    29760     0.00     0.00  dwt_decode_1_real
  7.29      7.22     0.63        1     0.63     8.20  tcd_decode_tile
  5.79      7.72     0.50        1     0.50     0.50  mct_decode_real
  4.86      8.14     0.42        1     0.42     0.42  imagetobmp

  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 53.17     13.69    13.69        9     1.52     2.20  dwt_decode_tile
 23.61     19.77     6.08    89280     0.00     0.00  dwt_decode_1_real
  6.83     21.53     1.76        3     0.59     8.15  tcd_decode_tile
  5.67     22.99     1.46        3     0.49     0.49  mct_decode_real
  4.85     24.24     1.25        3     0.42     0.42  imagetobmp
You would expect your version to run faster, but mine shows 6.59s total
on a single run and 19.77s total after 3 runs.
That is about a second's difference over the 3 runs (19.77s vs. 20.87s)
in DWT time to decode "bird.jp2".
It just depends on how the compiler sees the variables and whether it
knows how to optimize them away.
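
For anyone who wants to check the "same output" claim without running a full decode, here is a minimal sketch that compares the two roundings directly. It rests on some assumptions: int64_t from <stdint.h> stands in for OpenJPEG's int64 typedef, the INLINE macro is dropped, and the names fix_mul_add, fix_mul_mask and count_mismatches are made up for this illustration. It also assumes that >> on a negative value is an arithmetic shift, which is implementation-defined in C but is what the usual compilers do.

```c
#include <stdint.h>

/* Variant from the quoted reply: add half of 2^13, then shift. */
static int fix_mul_add(int a, int b) {
    return (int) (((int64_t) a * (int64_t) b + 4096) >> 13);
}

/* Variant from this message: add bit 12 of the product, then shift. */
static int fix_mul_mask(int a, int b) {
    int64_t temp = (int64_t) a * (int64_t) b;
    return (int) ((temp + (temp & 4096)) >> 13);
}

/* Hypothetical helper: count the inputs on a grid over [-limit, limit]
 * for which the two variants disagree. Assumes arithmetic right shift
 * for negative products. */
static long count_mismatches(int limit, int step) {
    long mismatches = 0;
    for (int a = -limit; a <= limit; a += step)
        for (int b = -limit; b <= limit; b += step)
            if (fix_mul_add(a, b) != fix_mul_mask(a, b))
                mismatches++;
    return mismatches;
}
```

With 13 fractional bits, 8192 represents 1.0, so both variants should return 8192 for fix_mul(8192, 8192), and count_mismatches(20000, 13) should come back as 0. Adding 4096 carries into bit 13 exactly when bit 12 of the product is set, which is also the only case where (temp & 4096) is nonzero, so the two forms round the same way and any speed difference comes down to the code the compiler generates.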