[TAG good working version, 16byte alignment req. 2cycle core2
peter@cordes.ca**20080316233700] 
<
[update benchmark comments.
peter@cordes.ca**20080316233505] 
[shufpd version is faster than movhlps for size=4, the same otherwise
peter@cordes.ca**20080315060439] 
[putting the pxor after the loop helps performance, but makes the function bigger
peter@cordes.ca**20080315060359] 
[update benchmark comments
peter@cordes.ca**20080315060309] 
[faster intro avoiding 8-bit register access.  movhlps for data shuffling in the loop.  some register allocation changes
peter@cordes.ca**20080315054049] 
[comment on endian weirdness in my notes
peter@cordes.ca**20080315053919] 
[order instructions to minimize ROB read port and other stalls.  No measurable speedup, though
peter@cordes.ca**20080314212019] 
[edit benchmark and optimization comments
peter@cordes.ca**20080314185706] 
[take clock from the command line
peter@cordes.ca**20080314174155] 
[move pxor further before its result is needed
peter@cordes.ca**20080314174135] 
[update benchmark comments
peter@cordes.ca**20080314174107] 
[print out function name properly
peter@cordes.ca**20080314061217] 
[benchmark results in comments
peter@cordes.ca**20080314061207] 
[remove the jmp from the loop
peter@cordes.ca**20080314061107] 
[initial working versions of sse2 rshift and test program
peter@cordes.ca**20080314043242] 
> {
}