4/5/2025 at 4:13:12 AM
Every large-scale cloud/SaaS system has been using parallelism for decades [well, almost two decades, give or take]. Requests run in parallel on millions of servers, and each server processes many requests in parallel. In that sort of setup, the old-school approach of optimizing one program by doing things in parallel is somewhat less relevant, because everything is by its nature massively parallel, and complex operations are often broken down into pieces that all run in parallel. In that sort of environment, over-parallelizing can actually be a pessimization: there is usually some cost to breaking the work up to run in parallel, and your little piece of this massively parallel system only gets some fraction of the machine's resources anyway.
Not to say there's no value in knowing "old-school" parallelization (including SIMD, GPUs, or whatnot), but for many types of applications that's not where I'd focus.
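To make the split-cost point concrete, here's a minimal Go sketch (workload size and shard count are made up for illustration) where fanning small work out to goroutines can lose to just doing it inline, because the goroutine startup and synchronization cost exceeds the work saved:

    package main

    import (
    	"fmt"
    	"sync"
    	"time"
    )

    func sum(xs []int) int {
    	total := 0
    	for _, x := range xs {
    		total += x
    	}
    	return total
    }

    func main() {
    	xs := make([]int, 10000) // small enough that split overhead can dominate
    	for i := range xs {
    		xs[i] = i
    	}

    	start := time.Now()
    	_ = sum(xs)
    	fmt.Println("inline:    ", time.Since(start))

    	start = time.Now()
    	const shards = 8
    	chunk := len(xs) / shards
    	partial := make([]int, shards) // distinct index per goroutine, no race
    	var wg sync.WaitGroup
    	for s := 0; s < shards; s++ {
    		wg.Add(1)
    		go func(s int) { // goroutine startup + scheduling is the overhead
    			defer wg.Done()
    			partial[s] = sum(xs[s*chunk : (s+1)*chunk])
    		}(s)
    	}
    	wg.Wait()
    	fmt.Println("fanned out:", time.Since(start))
    }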
by YZF
4/5/2025 at 4:11:45 PM
Parallelism has different constraints when you've got a bunch of tasks with their own deadlines versus a bunch of tasks with a single common deadline. With a common deadline it matters a lot more that the tasks are roughly the same size, for instance. For either scenario you may have to parallelize single tasks anyway if the variance is too wide.
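To put hypothetical numbers on the common-deadline case: in a fork-join, completion time is the slowest shard, so skewed shards leave the other workers idle:

    package main

    import "fmt"

    // makespan of a fork-join: completion is gated by the slowest shard.
    func makespan(shards []float64) float64 {
    	m := 0.0
    	for _, t := range shards {
    		if t > m {
    			m = t
    		}
    	}
    	return m
    }

    func main() {
    	equal := []float64{2, 2, 2, 2}  // 8 units of work, 4 workers
    	skewed := []float64{1, 1, 1, 5} // same 8 units, badly balanced

    	fmt.Println(makespan(equal))  // 2 -> full 4x speedup
    	fmt.Println(makespan(skewed)) // 5 -> barely beats serial (8)
    }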
by hinkley
4/5/2025 at 5:57:14 PM
It's also about resource constraints. If your parallel request workload (let's take S3 as an example) is already using all the available resources, then parallelizing a single task in the hope of making things more efficient or faster is just going to make things slower, because of the overhead of, e.g., moving data around, synchronization, etc., at least at the higher end of request concurrency. Of course, if a machine just happens to be processing a single request, that can be a win, but it also means you're not utilizing your hardware. The way to make those heavily parallel request-workload systems go faster is simply to do less work, i.e. optimize them, not parallelize them. The exception is if you want some "quality of service" controls within this setup, which is fairly unusual in the types of systems I'm thinking about.
I've seen this happen in practice and it's a common anti-pattern. An engineer tries to get requests to go faster, parallelizes some of the pieces of a single request, only to find the system can now handle fewer requests/s. The reason is easy to see: they're now doing more work on the same amount of resources.
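A back-of-the-envelope model of that anti-pattern (all numbers invented): at full utilization, fanning each request out adds per-shard overhead, so total CPU per request grows and throughput drops, even though a lone request's latency would improve:

    package main

    import "fmt"

    func main() {
    	const cores = 32.0
    	const work = 10.0 // ms of CPU per request, handled serially

    	// Serial handling: throughput = cores / CPU-per-request.
    	fmt.Printf("serial:     %.0f req/s\n", cores/work*1000) // 3200 req/s

    	// Fan each request out 4 ways, paying 1 ms split/merge overhead
    	// per shard. A lone request's latency improves (10 -> 10/4+1 =
    	// 3.5 ms), but CPU per request grows from 10 to 10 + 4*1 = 14 ms.
    	const shards = 4.0
    	const overhead = 1.0 // ms per shard
    	cpuPerReq := work + shards*overhead
    	fmt.Printf("fanned out: %.0f req/s\n", cores/cpuPerReq*1000) // ~2286 req/s
    }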
This is very different than a single task running on a multi-core machine with 31 cores sitting idle and one doing all the work.
I think your statement mostly applies in the second case, where you want to chop things up into more or less equal-size pieces. Otherwise you'll bottleneck on the piece that takes the longest.
by YZF