Speed improvement for 32-bit mode: 20.1 cpb for Skein-512

Randall Farmer's SSE2 code, compiled with gcc on 64-bit Ubuntu Linux and run on an Intel Core 2 Duo CPU, achieved the following results:
== Skein-256: 21.6 clks/byte
== Skein-512: 20.1 clks/byte
== Skein-1024: 25.5 clks/byte
As noted previously, Randall's code actually runs two independent Skein/Threefish blocks in parallel in the SSE2 registers, so this approach could be used to run Skein-512 at approximately 10 clks/byte (half of the 20.1 clks/byte above) in 32-bit mode when using either tree hashing mode or counter mode.
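
The two-blocks-in-parallel idea can be sketched in plain Python. This is only a model, not Randall's code: each 128-bit SSE2 register is represented as a pair of 64-bit lanes, one lane per independent block, and the function names (mix, mix_two_lanes, effective_cpb) and the rotation amount used in the example are made up for illustration. It applies the Threefish MIX step (a 64-bit add, a rotate, and an XOR) to both lanes at once, and then does the throughput arithmetic, assuming both lanes carry useful data.

```python
MASK64 = (1 << 64) - 1  # Threefish works on 64-bit words


def rotl64(x, r):
    """Rotate a 64-bit word left by r bits (0 <= r < 64)."""
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(x0, x1, r):
    """One Threefish MIX on a single block: add, rotate, XOR."""
    y0 = (x0 + x1) & MASK64
    y1 = rotl64(x1, r) ^ y0
    return y0, y1


def mix_two_lanes(a, b, r):
    """MIX applied to two independent blocks at once.

    a and b are (lane0, lane1) pairs modelling one 128-bit register
    each; lane i belongs to block i. The lanes never interact, which
    is why the blocks must be independent of each other.
    """
    y0 = tuple((a[i] + b[i]) & MASK64 for i in range(2))   # lane-wise 64-bit add
    y1 = tuple(rotl64(b[i], r) ^ y0[i] for i in range(2))  # lane-wise rotate and XOR
    return y0, y1


def effective_cpb(measured_cpb, lanes):
    """Clocks per byte when every lane carries useful data."""
    return measured_cpb / lanes


# Two unrelated blocks, e.g. two leaves of a hash tree (values are arbitrary).
block_a = (0x0123456789ABCDEF, 0xFEDCBA9876543210)
block_b = (0x1111111111111111, 0x8000000000000001)
R = 46  # illustrative rotation amount

y0, y1 = mix_two_lanes((block_a[0], block_b[0]), (block_a[1], block_b[1]), R)
print(y0[0] == mix(block_a[0], block_a[1], R)[0])  # lane 0 matches block A computed alone
print(effective_cpb(20.1, 2))                      # about 10 clks/byte
```

Plain sequential hashing cannot fill the second lane, because each block depends on the chaining value produced by the previous one. Tree hashing and counter mode both supply blocks that do not depend on each other, which is what makes the halved figure reachable.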