4/13/2026 at 5:52:10 AM
This paper missed the state of the art!!!This concerns unsigned division by a 32 bit constant divisor. Compilers routinely optimize this to a high-multiply by a "magic number" but this number may be up to 33 bits (worst-case, divisor dependent, 7 is a problem child). So you may need a 32x33-bit high multiply.
What compilers do today is some bit tricks to perform a 32x33 bit high multiply through a 32x32 multiply and then handling the last bit specially, through additions and bitshifts. That's the "GM Method" in the paper; the juice to be squeezed out is the extra stuff to handle the 33rd bit.
What the paper observes is that 32x33 high multiply can be performed via a 64x64 high multiply and then the extra stuff goes away. Well yes, of course.
But amazingly in ALL of these worst case situations you can compute a different magic number, and perform a (saturating) increment of the dividend at runtime, and ONLY need a 32 bit high multiply.
That is, these are the two algorithms for unsigned division by constants where the magic number overflows:
- This paper: Promote to twice the bit width, followed by high multiply (32x64 => 64) and then bitshift
- SOTA: Saturating increment of the dividend, then high multiply (32x32 => 32) and then bitshift
Probably the paper's approach is a little better on wide multipliers, but they missed this well-known technique published 15 years ago (and before that, I merely rediscovered it):
https://ridiculousfish.com/blog/posts/labor-of-division-epis...
by ridiculous_fish
4/13/2026 at 6:20:51 AM
The paper doesn't require a bitshift after multiplication -- it directly uses the high half of the product as the quotient, so it saves at least one tick over the solution you mentioned. And on x86, saturating addition can't be done in a tick and 32->64 zero-extension is implicit, so the distinction is even wider.by purplesyringa
4/13/2026 at 9:36:42 AM
> And on x86, saturating addition can't be done in a tickPerhaps I misunderstand your point, but I am rather sure that in SSE.../AVX... there do exist instructions for saturating addition:
* (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW
* (V)PHADDSW, (V)PHSUBSW
by aleph_minus_one
4/13/2026 at 10:40:24 AM
On x86, there is no vector instruction to get the upper half of integer product (64-bits x 64-bits). ARM SVE2 and RISC-V RVV have one, x86 unfortunately does not (and probably wont for a long time as AVX10 does not add it, either).by oxxoxoxooo
4/13/2026 at 9:37:04 AM
From my research you can always fit it in a 32x32 multiply, you only need the extra bit at compile time. The extra bit tells you how to adjust the result in the end, but the adjustment is also a constant.by vrighter
4/13/2026 at 6:24:00 AM
Any idea why no one applied this to llvm in the interim?by dooglius