[update benchmark comments.
peter@cordes.ca**20080316233505] {
hunk ./rshift.asm 74
-C ++++++++++++++++++++ fast intro (%ecx), movhlps version +++++++++++++++++++
+C ++++++++++++++++++++ fast intro (%ecx), new loop, shufpd (and sometimes movhlps) version +++++++++++++++++++
hunk ./rshift.asm 76
-C size 1 15.938 cycles
-C size 496 3.044 cycles/limb
+C size 1 15.938 cycles ( movhlps)
+C size 4 5.947 cycles/limb (shufpd)
+C size 496 3.791 cycles/limb (3.044 movhlps, still slower than MMX)
hunk ./rshift.asm 82
-C size 4 6.420 cycles (4.0 with movdqa ; shufpd version uncommented instead.)
-C size 496 2.036-2.068 cycles/limb
+C size 4 4.0 cycles/limb (6.420 with movhlps)
+C size 496 2.052 cycles/limb (2.036-2.068 movhlps)
hunk ./rshift.asm 87
-C size 496 2.062 cycles/limb
+C size 4 4.013 cycles/limb (shufpd)
+C size 496 2.062 cycles/limb (shufpd/movhlps)
}