Randall Farmer (whose email address is twotwotwo on gmail) has modified our Skein code to use Intel's SSE2 instructions in 32-bit mode. This brought the speed to 23 clocks per byte (cpb). The same approach applies to any 32-bit processor that offers equivalent instructions.
He also pointed out that his code leaves part of the CPU unused, so a parallel application such as tree hashing could roughly double that speed, to 11 to 12cpb. He added that if our code used the 64-bit SSE instructions, we could shave off an additional clock per byte.
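To make the parallel remark concrete, here is a minimal sketch of how two independent blocks could share one SSE2 register. It is not Randall's code. It assumes the unused capacity is the second 64-bit lane of a 128-bit SSE2 register, and the function names (mix_scalar, mix_sse2, mix_selfcheck), the test values, and the rotation count of 14 are all illustrative. It shows Threefish's MIX step (add, rotate, xor) done once in scalar form and once on two lanes at a time.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: one MIX step on a pair of 64-bit words.
   r must be in the range 1..63. */
static void mix_scalar(uint64_t *x0, uint64_t *x1, int r)
{
    *x0 += *x1;
    *x1 = ((*x1 << r) | (*x1 >> (64 - r))) ^ *x0;
}

/* SSE2 version: each register holds the same word position from two
   independent blocks (lane 0 = block A, lane 1 = block B), so both
   lanes can share one rotation count. */
static void mix_sse2(__m128i *x0, __m128i *x1, int r)
{
    __m128i left  = _mm_cvtsi32_si128(r);        /* shift counts live in a register */
    __m128i right = _mm_cvtsi32_si128(64 - r);
    __m128i sum   = _mm_add_epi64(*x0, *x1);     /* two 64-bit adds at once */
    __m128i rot   = _mm_or_si128(_mm_sll_epi64(*x1, left),
                                 _mm_srl_epi64(*x1, right));
    *x0 = sum;
    *x1 = _mm_xor_si128(rot, sum);
}

/* Returns 1 if the two-lane SSE2 result agrees with the scalar result
   computed separately for each lane, 0 otherwise. */
int mix_selfcheck(void)
{
    /* Arbitrary test words (hypothetical data, not Skein test vectors). */
    uint64_t a0[2] = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
    uint64_t a1[2] = { 0x0f1e2d3c4b5a6978ULL, 0x8796a5b4c3d2e1f0ULL };
    uint64_t s0[2], s1[2], v0[2], v1[2];
    const int r = 14;  /* illustrative rotation count */
    __m128i x0, x1;
    int i;

    /* Scalar path: process each block on its own. */
    for (i = 0; i < 2; i++) {
        s0[i] = a0[i];
        s1[i] = a1[i];
        mix_scalar(&s0[i], &s1[i], r);
    }

    /* SSE2 path: process both blocks in one pass. */
    x0 = _mm_loadu_si128((const __m128i *)a0);
    x1 = _mm_loadu_si128((const __m128i *)a1);
    mix_sse2(&x0, &x1, r);
    _mm_storeu_si128((__m128i *)v0, x0);
    _mm_storeu_si128((__m128i *)v1, x1);

    for (i = 0; i < 2; i++) {
        if (s0[i] != v0[i] || s1[i] != v1[i])
            return 0;
    }
    return 1;
}
```

Because the SSE2 shift instructions apply one count to both lanes, this layout only works when both lanes need the same rotation, which is why the sketch pairs the same word position from two separate blocks rather than two word pairs from a single block. Build it with SSE2 enabled (for example, -msse2 with GCC or Clang on a 32-bit x86 target).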
His sample code is rough and not useful for much beyond timing, but it is included here.
(Correction: A previous draft said 18cpb, which was the initial estimate. The code ended up being 23cpb.)