[sldev] [patch] OpenJPEG DWT optimizations
Callum Lerwick
seg at haxxed.com
Tue Apr 3 22:48:55 PDT 2007
On Mon, 2007-04-02 at 23:40 -0700, Dzonatas wrote:
> The overall performance gain is estimated to about 30%.
>
> I hope you like the attached patch. =)
Nice, works just fine here.
My current finalized patches (apply in order):
http://www.haxxed.com/code/openjpeg-1.1-makefile.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-memset.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-static-luts.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-flag-type.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-formatting-cleanup.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-use-prefix-increment.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-consolidate-redundant-calculations.patch
I finally got the dynamic array stuff working:
http://www.haxxed.com/code/openjpeg-1.1.1-t1-dynamic-array.patch
http://www.haxxed.com/code/openjpeg-1.1.1-t1-float.patch
And today's magnum opus:
http://www.haxxed.com/code/openjpeg-1.1.1-t1-autovectorize.patch
Thanks to the previous cleanup work I was able to pick apart the inner
loops in t1_decode_cblks and make them autovectorizable. I have no idea
what that tangled mess of floating point was trying to accomplish, but
it's a nightmare. Looking at it the wrong way resulted in it breaking
(as I discovered previously). I ended up tearing all the floating point
out and sticking to integer (gcc can't vectorize casts...). Much nicer.
Note that this *does* very slightly change the decoder's output on
images that hit that code path (not all of them do), but it amounts to
just a scattered few LSBs here and there, which IMO is acceptable. It's
probably actually more accurate. It's certainly far less than
the difference between the output of OpenJPEG and KDU. :)
This likely needs some macro magic to remain compatible with older/other
compilers. Getting the vectorization to happen required the use of C99's
"restrict" keyword. To get it to work you need to pass gcc something
like "-ffast-math -fstrict-aliasing -Wstrict-aliasing=2 -std=c99
-ftree-vectorize -ftree-vectorizer-verbose=5"
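As a rough sketch of the kind of macro magic meant here, assuming nothing
about what the patch itself does: EXAMPLE_RESTRICT and example_halve are
made-up names for illustration, and the loop is a stand-in, not the actual
code from t1_decode_cblks. The macro picks whichever spelling of "restrict"
the compiler understands and expands to nothing otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* EXAMPLE_RESTRICT is a hypothetical name, not something from the patch. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define EXAMPLE_RESTRICT restrict      /* C99 and later */
#elif defined(__GNUC__)
#  define EXAMPLE_RESTRICT __restrict__  /* GCC extension spelling */
#elif defined(_MSC_VER)
#  define EXAMPLE_RESTRICT __restrict    /* MSVC extension spelling */
#else
#  define EXAMPLE_RESTRICT               /* unknown compiler: no promise */
#endif

/* Hypothetical integer-only loop: halve n values from src into dst.
 * The qualifiers promise the compiler that dst and src never overlap,
 * which removes the aliasing doubt that can block vectorization. */
void example_halve(int *EXAMPLE_RESTRICT dst,
                   const int *EXAMPLE_RESTRICT src,
                   size_t n)
{
    size_t i; /* declared up front so pre-C99 compilers accept it too */

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; ++i) {
        dst[i] = src[i] / 2;
    }
}
```

On a compiler that falls through to the empty definition the code still
builds and behaves the same; it just loses the no-overlap promise, so the
loop may not get vectorized.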
Here's what I get, compiling for x86_64:
gcc -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic
-ftree-vectorize -ftree-vectorizer-verbose=5 -ffast-math
-fstrict-aliasing -Wstrict-aliasing=2 -std=c99 -fPIC -Ilibopenjpeg -c
libopenjpeg/t1.c -o libopenjpeg/t1.o
libopenjpeg/t1.c:1197: note: Alignment of access forced using peeling.
libopenjpeg/t1.c:1197: note: Vectorizing an unaligned access.
libopenjpeg/t1.c:1197: note: LOOP VECTORIZED.
libopenjpeg/t1.c:1186: note: Alignment of access forced using peeling.
libopenjpeg/t1.c:1186: note: Vectorizing an unaligned access.
libopenjpeg/t1.c:1186: note: LOOP VECTORIZED.
libopenjpeg/t1.c:1186: note: vectorized 2 loops in function.
I haven't quite figured out how to go about aligning something that is
allocated on the heap. Note that gcc is unable to vectorize these loops
when compiling for pentium3. Looks like SSE doesn't cut it, you need SSE2
(pentium4)...
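One common way to get an aligned heap buffer, sketched here under the
assumption that 16-byte alignment is what is wanted (the width of an SSE2
vector), is to over-allocate with malloc and round the pointer up.
example_aligned_malloc and example_aligned_free are hypothetical helper
names, not OpenJPEG functions. On POSIX systems posix_memalign() does the
same job without the bookkeeping.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: return a block of `size` bytes whose address is a
 * multiple of `alignment` (a power of two), or NULL on failure. */
void *example_aligned_malloc(size_t size, size_t alignment)
{
    void *raw;
    uintptr_t addr;

    /* alignment must be a non-zero power of two */
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* refuse sizes that would overflow the padded request */
    if (size > SIZE_MAX - alignment - sizeof(void *)) {
        return NULL;
    }

    /* over-allocate: padding for alignment plus a slot for the raw pointer */
    raw = malloc(size + alignment + sizeof(void *));
    if (raw == NULL) {
        return NULL;
    }

    /* skip the slot, then round up to the next multiple of alignment */
    addr = ((uintptr_t)raw + sizeof(void *) + alignment - 1)
           & ~(uintptr_t)(alignment - 1);

    /* remember the raw pointer just below the aligned block */
    ((void **)addr)[-1] = raw;
    return (void *)addr;
}

/* Hypothetical helper: free a block from example_aligned_malloc. */
void example_aligned_free(void *ptr)
{
    if (ptr != NULL) {
        free(((void **)ptr)[-1]); /* recover and free the raw pointer */
    }
}
```

A buffer obtained with example_aligned_malloc(n * sizeof(int), 16) must be
released with example_aligned_free, never with plain free(). Whether gcc
can then skip the peeling step also depends on it being able to see that
the pointer is aligned, which this sketch does not address.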