[benchmark results in comments
peter@cordes.ca**20080314061207] {
hunk ./rshift.asm 29
+C all benchmarks on tesla: E6600 (2.4GHz) w/DDR800 dual-channel. system not idle, azureus running.
+
+C MMX:
+C size 1 18.000 cycles/limb
+C size 2 9.960-10.020 cyc/limb
+C size 49: 2.673 cycles/limb
+C size 496: 2.465 cycles/limb
+C size 1600: 2.402 cycles/limb
+
+
+
+C this SSE2 version: requires 16byte aligned input and output
+
+C AMD K8 (awarnach master node, 2.0GHz, Solaris)
+C size 1: 14-15 cycles
+C size 2: 8.025 cycles/limb
+C size 49: 4.337 cycles/limb
+C size 496: 3.808 cycles/limb
+C size 1600: 3.652-3.802 cycles/limb
+C size 4000: 3.751 cycles/limb
+C size 496001: 12.807 cycles/limb
+
+C Intel Core 2(64bit mode)
+C size 1: 13.080 cycles.
+C size 2: 6.5-6.6 cycles/limb
+C size 49: 2.245 cycles/limb
+C size 496,800: 2.064 cycles/limb
+C size 1600: 1.981-2.021 cycles/limb
+C size 4000: 2.401 cycles/limb
+C size 496001: 8.607 cycles/limb
+
+
+C movdqu (unaligned allowed, times for the aligned case)
+C size 1: 14.000 cycles/limb
+C size 2: 6.990-7.050 cycles/limb
+C size 496, 4000: 4.048 cycles/limb
+C size 496001: 8.787-8.807
}