[TAG good working version, 16byte alignment req. 2cycle core2
peter@cordes.ca**20080316233700] 
<
[update benchmark comments.
peter@cordes.ca**20080316233505] 
[shufpd version faster than movhlps for size=4, same rest of the time
peter@cordes.ca**20080315060439] 
[putting the pxor after the loop helps performance, but makes the fn bigger
peter@cordes.ca**20080315060359] 
[update benchmark comments
peter@cordes.ca**20080315060309] 
[faster intro avoiding 8bit reg access. movhlps for data shuffling in the loop. some register allocation changes
peter@cordes.ca**20080315054049] 
[comment on endian weirdness in my notes
peter@cordes.ca**20080315053919] 
[order instructions to minimize ROB read port and other stalls. No measurable speedup, though
peter@cordes.ca**20080314212019] 
[edit benchmark and optimization comments
peter@cordes.ca**20080314185706] 
[take clock from the command line
peter@cordes.ca**20080314174155] 
[move pxor further before it's result is needed
peter@cordes.ca**20080314174135] 
[update benchmark comments
peter@cordes.ca**20080314174107] 
[print out function name properly
peter@cordes.ca**20080314061217] 
[benchmark results in comments
peter@cordes.ca**20080314061207] 
[remove the jmp from the loop
peter@cordes.ca**20080314061107] 
[initial working versions of sse2 rshift and test program
peter@cordes.ca**20080314043242] 
> {
}