alt.hn

4/4/2026 at 5:53:40 PM

Floating point from scratch: Hard Mode

https://essenceia.github.io/projects/floating_dragon/

by random__duck

4/7/2026 at 4:32:57 PM

Well, that escalated quickly. From skimming the first few screenfuls of the page, I thought it was going to be about a beginner programmer's first time dissecting the low-level bit structure of floating-point numbers and implementing some of the arithmetic and logic from scratch using integer operations.

And then she writes about logic cells, hardware pipeline architectures, RTL (register-transfer level), ASIC floorplans, and tapeout. Building hardware is hard, as iteration cycles are long and fabricating stuff costs real money. That was quite a journey.

Her "about" page adds more context that helps everything make sense:

> Professionally, I have a Masters in Electrical and Computer Engineering, have previously worked as a CPU designer at ARM and an FPGA engineer at Optiver. -- https://essenceia.github.io/about/

by nayuki

4/7/2026 at 12:29:27 PM

I had to do an FP8 adder as my final project for my FPGA lab. It was at least a full page of state machine, and I write small. I ended up skipping the rounding and truncating instead, because I was so done with it.

Consider me educated on the mantissa. That's a nifty bit of pedantry.

Typo in the c++?

> cout << "x =/= x: " << ((x=x) ? "true" : "false") << endl;

Should be x != x?

For the leading 0 counter, I've found it's even better for the tool if I use a for loop. It keeps things scalable and it's even less code. I'm not understanding this takeaway, though:

> Sometimes a good design is about more than just performance.

The good design (unless the author's note that it's easier to read and maintain makes it worse design?) was better performing. So sometimes a good design creates performance.

Likewise for pipelining: it would have been interesting to know if the tools can reorder operations if you give them some pipeline registers to play with. In Xilinx land it'll move your logic around and utilize the flops as it sees fit.

by Neywiny

4/8/2026 at 4:48:33 PM

Yup, that is a typo.

by random__duck

4/7/2026 at 3:09:56 PM

Worked with retiming using Design Compiler in an ASIC implementation. I remember a lot of back and forth; sometimes the tool just doesn't add enough registers to meet the constraint, so I had to test variable register depths. This was a design that used Synopsys DesignWare for the FP ops, lol.

by abhikul0

4/7/2026 at 12:18:16 PM

I think the thing that truly scares me about floating point is not IEEE 754, or even the weird custom floating points that people come up with, but the horrible, horrible things that some people think pass for a floating point implementation in hardware, people who think that things like infinities or precision in the last place are kind of overrated.

by saagarjha

4/8/2026 at 10:44:32 AM

"until you have proven it works, it is broken!"

immediately followed by

"Time to run some tests!"

Prompted a grin. But then: "If we wanted to test this using directed testing we would need to test for all 2^32 input combinations, which sounds like a terrible idea …

… and exactly what I am going to do"

Wait, exhaustively testing float number space ... I read about that before. Might have been https://randomascii.wordpress.com/2014/01/27/theres-only-fou... covered also here https://news.ycombinator.com/item?id=34726919

by guenthert

4/7/2026 at 3:37:27 PM

Nice write-up.

Let me offer a nitpick: in the "Gradual underflow" section it says this about subnormal numbers:

    Bonus: we have now acquired extra armour against a division by zero:

    if ( x != y ) z = 1.0 / ( x - y );

But that's not that useful: just because you're not dividing by zero doesn't mean the result won't overflow to infinity, which is what you get when you do divide by zero.

Think about it this way: the smallest subnormal double is on the order of 10^-324, but the largest double is on the order of 10^308. If `x - y` is smaller than 10^-308, `1.0 / (x - y)` will be larger than 10^308, which can't be represented and must overflow to infinity.

This C program demonstrates this:

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    // return a (subnormal) number that results in zero when divided by 2:
    double calc_tiny(void)
    {
        double x = DBL_MIN; // the normal number closest to 0.0
        while (1) {
            double next = x / 2.0;
            if (next == 0.0) {
                return x;
            }
            x = next;
        }
    }

    int main(void)
    {
        double y = calc_tiny();
        double x = 2.0 * y;
        if (x != y) {
            double z = 1.0 / (x - y);
            printf("division is safe: %g\n", z);
        } else {
            printf("refusing to divide by zero\n");
        }
    }

(It will print something like "division is safe: inf", or however else your libc prints infinity.)

by moefh

4/7/2026 at 4:58:12 PM

Other things worth noting about denormal numbers:

- It’s not just ‘old’ FPUs that handle them verrry slowly. Benchmark this aspect of your target processors if this detail matters.

- Most modern processors provide a “flush to zero” denormal handling mode (and sometimes you can separately specify “denormals are zero”, relevant when e.g. loading old data). However, math libraries usually haven’t been written with this mode in mind, so you need to be careful with this.

by chrchang523

4/8/2026 at 2:11:28 AM

It's really annoying that IEEE set the exponent bias wrong. x != 0 => 1/x != Inf was a totally achievable property if they had wanted it (by tweaking the exponent bias).

by adgjlsfhk1

4/7/2026 at 11:06:13 AM

Everyone who has ever had to build a floating point unit has hated it with a passion. I've watched it done from afar, and done it myself.

by Taniwha

4/7/2026 at 11:54:29 AM

And anyone implementing numerical algorithms is thankful for the tremendous amount of thought put into the fp spec. The complexity is worth it and makes the code much safer.

by CyLith

4/7/2026 at 12:41:28 PM

imo they were wrong almost as much as they were right. -0.0, the plethora of NaNs, and having separate Inf and NaN all make the life of people writing algorithms a lot more annoying for very little benefit.

by adgjlsfhk1

4/7/2026 at 2:48:02 PM

There was actually no "thought" put into the IEEE spec as such. It was merely a codification of the design of the Intel FPU (only one of many, very different implementations of FP units pre-standardisation). There was thought put into that implementation, but the standard simply codified that design.

It has many, many warts, and many design choices were made given the constraints of the hardware of that time, not by considerations in terms of a standard.

by andrepd

4/7/2026 at 2:54:04 PM

William Kahan would beg to differ on this.

https://people.eecs.berkeley.edu/~wkahan/ieee754status/754st...

by jmalicki

4/7/2026 at 3:40:36 PM

He would, but he very much designed the standard around the idea that if you wanted to implement a floating point algorithm you would hire him.

by adgjlsfhk1

4/7/2026 at 1:14:47 PM

I think I would find it very challenging but fun. Certainly more fun than writing a date/time library (way more inconsistent cases; daylight saving time horrors; leap seconds; date jumps when moving from Julian to Gregorian) or a file system (also fun, I think, but thoroughly testing it scares me off).

by Someone

4/7/2026 at 12:40:38 PM

I just wish there were a widespread decimal-based floating point standard & units.

by trollbridge

4/7/2026 at 9:09:47 PM

When people see that binary-float-64 causes 0.1 + 0.2 != 0.3, the immediate instinct is to reach for decimal arithmetic. And then they claim that you must use decimal arithmetic for financial calculations. I would rate these statements as half-true at best. Yes, 0.1 + 0.2 = 0.3 using decimal floating-point or fixed-point arithmetic, and yes, it's bad accounting practice to sum a bunch of items and get a total that differs from the true answer.

But decimal floats fall short in subtle ways. Here is the simplest example: sales tax. In Ontario it's 13%. If you buy two items for $0.98 each, the tax on each is $0.1274. There is no legal, interoperable mechanism to charge the customer a fractional number of cents, so you just can't do that. If you are in charge of producing an invoice, you have to decide where to perform the rounding(s). You can round the tax on each item, which is $0.13 each, so the total is ($0.98 + $0.13) × 2 = $2.22. Or you can add up all the pre-tax items ($1.96) and calculate the tax ($0.2548) and round that ($0.25), which brings the total to $0.98×2 + $0.25 = $2.21, a different amount.

Not only do you have to decide where to perform the rounding(s), you also have to keep track of how many extra decimal places you need. Massachusetts's sales tax is 6.25%, so that's two more decimal places. If you have discounts like "25% off", now you have another phenomenon that can introduce extra decimal places.

If you do any kind of interest calculation, you will necessarily have decimal places exploding. The simplest example is to take $100 at 10% annual interest compounded annually, which will give you $110, $121, $133.1, $146.41, $161.051, $177.1561, etc., and you will need to round eventually. Or another example is, 10% annual interest, but computed daily (so 10%/365 per day) and added to the account at the end of the month - not only is 10%/365 inexact in decimal arithmetic, but also many decimal places will be generated in the tiny interest calculations per day.

If you do anything that philosophically uses "real numbers", then decimal FP has zero advantages compared to binary FP. If you use pow(), exp(), cos(), sin(), etc. for engineering calculations, continuous interest, physics modeling, describing objects in a 3D scene, etc., there will necessarily be all sorts of rational, irrational, and transcendental numbers flying around, and they will have to be approximated one way or another.

by nayuki

4/8/2026 at 6:36:18 PM

When writing financial software, one almost always reaches for a decimal library in that language and ends up using that instead of the language's built-in floats. (Sometimes you can use ints, but you can't once you need to do things like those described above.)

Overall, yes, results need to be rounded, but it's pretty much financial software 101 not to use floats.

by trollbridge

4/8/2026 at 3:02:51 PM

This is legitimately a great explanation.

by random__duck

4/7/2026 at 9:19:40 PM

The one advantage of decimal floating point is that high schoolers have a better understanding of where decimal rounding happens.

by adgjlsfhk1

4/8/2026 at 6:03:13 AM

Wearing my chip designer's hat, decimal FP just means more (and slower) gates.

by Taniwha

4/7/2026 at 5:36:09 PM

Would it help though? IMHO, being binary is one of the least confusing sides of IEEE 754.

by volemo

4/7/2026 at 1:28:36 PM

Doesn't IEEE 754 define a decimal format? Specifically "decimal64".

by clnhlzmn

4/7/2026 at 7:51:22 PM

Great article. You really start to appreciate floating point when you have to squeeze some arbitrary level of performance out of an underpowered (say embedded) CPU and you decide to use fixed point. Suddenly all those nasty little edge cases that the floating point library would have handled silently, reliably and hopefully correctly for you need to be dealt with.

Just keeping track of the shifts during a chain of multiplications and additions can really ruin your day. And the good code will look exactly the same as the bad code. I'm doing something like that right now and have moved from doubles to fixed point 64 bit ints (32.32); it works, but it took me much longer than I thought it would (phase angle estimator for SDR output).

by jacquesm

4/7/2026 at 3:23:49 PM

I love this article; the edge cases are where the seeming “simplicity” of floating-point numbers breaks down.

I recently wrote a chapter in the tiny-vllm course about floats in the context of LLM inference. It's much shorter and not as deep as this one, but for anyone interested in the topic you might like it too: https://github.com/jmaczan/tiny-vllm?tab=readme-ov-file#how-...

by yu3zhou4

4/7/2026 at 3:22:56 PM

Super cool to tape out a nearly 500MHz systolic array!!! I sometimes wish I had gone far enough with my hardware education to be able to do more than simple FPGA designs.

A really cool thing to see would be the newer block scaled fp4/fp8 data types. Maybe for their next asic they can try it haha - https://arxiv.org/abs/2310.10537

by buildbot

4/7/2026 at 3:32:28 PM

I wish I knew more people like this author. Then, maybe, I would have maintained my faith in humanity.

by balamatom