alt.hn

2/8/2026 at 7:19:11 AM

80386 Barrel Shifter

https://nand2mario.github.io/posts/2026/80386_barrel_shifter/

by jamesbowman

2/10/2026 at 3:17:33 PM

Implementing rotate through carry like that was a really bad decision IMO - it's almost never used to rotate by more than one bit at a time, and the one-bit case could be done much more efficiently than with the constant-time code, which is only faster when the count is > 6.
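To make the comparison concrete, here is a rough Python sketch of the two approaches for a 32-bit RCL. It is only a behavioural model, not the 386 microcode: the function names are made up for illustration, and the 5-bit count masking is the documented behaviour of 286-and-later parts. The first version costs one step per bit; the second treats carry plus operand as a single 33-bit value and rotates it in one go, which is the kind of thing a wide barrel shifter does.

```python
MASK32 = 0xFFFFFFFF
MASK33 = (1 << 33) - 1


def rcl_bitwise(value, carry, count):
    """RCL one bit per iteration: the work grows with the count."""
    count &= 31  # count is masked to 5 bits
    for _ in range(count):
        new_carry = (value >> 31) & 1          # top bit moves into carry
        value = ((value << 1) | carry) & MASK32  # old carry moves into bit 0
        carry = new_carry
    return value, carry


def rcl_wide(value, carry, count):
    """Same result as a single 33-bit rotate of carry:value."""
    count &= 31
    wide = (carry << 32) | value  # carry sits above bit 31
    wide = ((wide << count) | (wide >> (33 - count))) & MASK33
    return wide & MASK32, wide >> 32
```

Both functions return a `(value, carry)` pair and agree for every count from 0 to 31; only the amount of work differs.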

Is the full microcode available anywhere?

by rep_lodsb

2/10/2026 at 3:40:20 PM

I haven't published it yet as there are still some rough edges to clear up, but if you email me (andrew@reenigne.org) I'll send you the current work-in-progress (the same one that nand2mario is working from).

by ajenner

2/10/2026 at 3:31:17 PM

Since the shifter is also used for bit tests, the assumption that 'most things are a 1-bit shift' might not hold. Perhaps they did the analysis and it made sense.

by kjs3

2/10/2026 at 4:07:02 PM

There are separate opcodes for shift/rotate by 1, by CL, or by an immediate operand. Those are decoded to separate microcode entry points, so they could have at least optimized the "RCL/RCR x,1" case.

And the microcode for bit test has to be different anyway.

by rep_lodsb

2/12/2026 at 6:26:49 AM

Except that there are tremendous advantages to constant-time execution, not the least of which is protection from timing attacks and information leakage (which admittedly were less of a concern back then). Sure, you can make the instruction run faster when the count is < 6, but the transistor budget for that isn't worth it, particularly if you pipeline the execution into stages. It makes optimization far more complex...

by cbsmith

2/10/2026 at 6:06:56 PM

> For memory operands, there's an additional twist: the bit index is a signed offset that can address bits outside the nominal operand. A bit index of 35 on a dword accesses bit 3 of the next dword in memory.

I wonder what the use case is for testing a bit outside the operand at the given memory address.

by cmovq

2/10/2026 at 6:20:17 PM

So you can have bit arrays of any length in memory, rather than just 32 bits in a register.
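A small Python model of that addressing, as I understand it from the quoted text: the bit index is split into a dword offset (index divided by 32, rounded down) and a bit within that dword (index mod 32). `bt_mem` and the `bytearray` standing in for memory are hypothetical names for this sketch, not anything from the article.

```python
import struct


def bt_mem(memory, base, bit_index):
    """Model of BT on a 32-bit memory operand with a signed bit index.

    memory    -- bytes-like object standing in for RAM (assumption)
    base      -- byte address of the nominal dword operand
    bit_index -- signed bit offset relative to bit 0 of that dword
    """
    # >> on a Python int rounds toward minus infinity, so negative
    # indices select dwords below the base address.
    dword_addr = base + 4 * (bit_index >> 5)
    (dword,) = struct.unpack_from("<I", memory, dword_addr)
    return (dword >> (bit_index & 31)) & 1


# Three dwords of "memory"; the nominal operand is the middle one.
ram = bytearray(12)
struct.pack_into("<I", ram, 8, 1 << 3)   # bit 3 of the next dword
struct.pack_into("<I", ram, 0, 1 << 31)  # bit 31 of the previous dword

next_dword_bit = bt_mem(ram, 4, 35)   # index 35 -> next dword, bit 3
prev_dword_bit = bt_mem(ram, 4, -1)   # index -1 -> previous dword, bit 31
```

With a base pointing at the start of a bit array, the index simply walks through the whole array, however long it is.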

by rep_lodsb

2/10/2026 at 6:50:38 PM

That makes sense. LLVM could probably do better here by using the memory operand version:

https://godbolt.org/z/jeqbaPsMz

by cmovq

2/11/2026 at 3:51:31 AM

Don't think the memory operand version would work here. If I understand the x86 architectural manual description, the 32-bit operand form interprets the bit offset as signed. A 64-bit operand could work around that but then run into issues with over-read due to fetching 64 bits of data.
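To illustrate the signedness problem with a sketch (my reading of the manual, with helper names invented here): if the compiler has an unsigned 32-bit index, any value with the top bit set gets treated as negative by the 32-bit form, so the access lands below the base instead of above it.

```python
def as_signed32(x):
    """Reinterpret the low 32 bits of x as a two's-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def byte_offset_signed(bit_index):
    """Byte offset from the base if the index is taken as signed 32-bit."""
    return 4 * (as_signed32(bit_index) >> 5)


def byte_offset_unsigned(bit_index):
    """Byte offset the programmer intended for an unsigned 32-bit index."""
    return 4 * ((bit_index & 0xFFFFFFFF) >> 5)


index = 0x80000000
intended = byte_offset_unsigned(index)  # 2**28 bytes above the base
actual = byte_offset_signed(index)      # 2**28 bytes below the base
```

For indices below 2**31 the two interpretations agree, so the mismatch only shows up for very large bit arrays, but a compiler can't generally prove the index stays in that range.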

by ack_complete

2/10/2026 at 10:17:03 PM

The memory operand version tends to be as slow or slower than the manual implementation, so LLVM is right to avoid it.

by jxors

2/13/2026 at 5:03:18 PM

Right, it has much worse throughput:

Memory: https://uica.uops.info/tmp/f022a3c0a70e4ae5ab3588ebe65fd2a5_...

by cmovq

2/10/2026 at 6:15:47 PM

It was probably easier to just implement it that way, given that the barrel shifter is 64 bits wide.

by juancn