For reference: I aligned the data, which got rid of the unaligned stores, and I changed the call method so that the compiler would generate inline code. The results:

Initializing  0.04 sec
Blend DS1     1.75 sec
Blend 2       3.28 sec
Blend 3       4.08 sec
Blend 4       3.28 sec

Blend DS1 is almost a 2x speed-up over Blend 2 and Blend 4 (3.28 / 1.75 is about 1.9x).
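To illustrate the two changes described above (aligned data and an inlinable call), here is a minimal C++ sketch. It is not the actual benchmark code: `PixelBuffer`, `kPixels`, `blend_pixel`, and `blend_buffers` are hypothetical names, the blend formula is an assumed 8-bit weighted average, and the 16-byte alignment is an assumption chosen to match a 128-bit SIMD store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed buffer size for illustration only.
constexpr std::size_t kPixels = 1024;

// Hypothetical buffer type. alignas(16) guarantees the data starts on a
// 16-byte boundary, so a vectorizing compiler can use aligned stores.
struct alignas(16) PixelBuffer {
    std::uint8_t data[kPixels];
};

// Hypothetical blend: rounded weighted average of two 8-bit values.
// It is defined inline in the same translation unit, so the compiler can
// expand it inside the loop instead of emitting a call per pixel.
inline std::uint8_t blend_pixel(std::uint8_t a, std::uint8_t b, std::uint8_t alpha) {
    return static_cast<std::uint8_t>((a * alpha + b * (255 - alpha) + 127) / 255);
}

// Blends src over dst in place, one byte at a time.
inline void blend_buffers(PixelBuffer& dst, const PixelBuffer& src, std::uint8_t alpha) {
    // Sanity check that the alignment assumption actually holds.
    assert(reinterpret_cast<std::uintptr_t>(dst.data) % 16 == 0);
    for (std::size_t i = 0; i < kPixels; ++i) {
        dst.data[i] = blend_pixel(src.data[i], dst.data[i], alpha);
    }
}
```

Whether the compiler really inlines the call and emits aligned stores depends on the compiler and optimization flags, so it is worth checking the generated assembly rather than assuming it.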