Starting with tag:
[TAG triple jcc version
peter@cordes.ca**20080320081143] 
[working computed-jmp version
peter@cordes.ca**20080320084938
 not as fast as triple-branch except when triple branch fell through to the last one.
 But hopefully less polluting of the branch predictor
] 
[handy command line for shift in the comments
peter@cordes.ca**20080320085219] 
[align 1024
peter@cordes.ca**20080322044318] 
[simple loop instead of computed goto to deal with alignment
peter@cordes.ca**20080322050316] 
[update comments
peter@cordes.ca**20080322102733] 
[better-pipelined main loop based on Torbjorn's work. With good short-loop intro
peter@cordes.ca**20080322105200
 runs 1.331c/l
] 
[shift.c: set up a useful test pattern
peter@cordes.ca**20080322105348] 
[distribute ALU ops through the loop. add commented-out 8-way unroll
peter@cordes.ca**20080323013830
 spreading out the ALU ops doesn't seem to make any difference.
 The 8-way unroll is slightly faster with large n, but way slower with small n that make the intro loop run more.
 change test $3, %dl to test $7, %dl and uncomment the bottom half of the main loop.
] 
[test-pattern generator doesn't segfault
peter@cordes.ca**20080323014421] 
[change config.m4 path for use in speed-ext
peter@cordes.ca**20080323060840] 
[fast 8-limb loop, 1.66 c/l.
peter@cordes.ca**20080323060928
 still reads past the end of src, so needs finishing touches
] 
[use macros to allow switching between 4-limb and 8-limb unroll
peter@cordes.ca**20080324212029
 also make it easy to move the add/lea up/down in the loop
] 
[speed-ext.c for tune/speed-ext
peter@cordes.ca**20080325024627
 Your darcs repo should be gmp-.../tune instead of tests/devel. Hardlinks should do the trick.
 Use the patch for Makefile.in. See the suggested command line in speed-ext.c
] 
[do the cleanup after the loop. debugged and working, and pretty fast
peter@cordes.ca**20080325055033
 This works well: we can have special cleanup for coming out of the pipeline.
 Tested with electric fence, and it doesn't read past the end of src.
] 
[fifo rt prio
peter@cordes.ca**20080325101519] 
[new layout of mostly the same code. only 2 icache lines for n<12, and mostly better speed
peter@cordes.ca**20080326023333] 
[notes on branching a repo and setting up speed-ext to use the other version
peter@cordes.ca**20080326030443] 
[better block ordering to remove more branches
peter@cordes.ca**20080326041203
 saves a few cycles for most n.
] 
[CPP macros to rename functions in speed-ext.c
peter@cordes.ca**20080326031120] 