3/9/2026 at 10:36:48 AM
One thing I'm curious about here is the operational impact. In production systems we often see Python services scaling horizontally because of the GIL limitations. If true parallelism becomes common, it might actually reduce the number of containers/services needed for some workloads.
But that also changes failure patterns — concurrency bugs, race conditions, and deadlocks might become more common in systems that were previously "protected" by the GIL.
It will be interesting to see whether observability and incident tooling evolves alongside this shift.
by devrimozcay
3/9/2026 at 3:02:45 PM
This is surely why Facebook was interested in funding this work. It is common to have N workers or containers of Python because you are generally restricted to one CPU core per Python process (you can get a bit higher if you use libs that unlock the GIL for significant work). So the only scaling option is horizontal, because vertical scaling is very limited. The main downside of this was memory usage. You would have to load all of your code and libraries N times, and in-process caches would become less effective. So by being able to vertically scale a Python process much further you can run fewer processes and save a lot of memory.

Generally speaking, the optimal amount of horizontal scaling is as little as you have to. You may want a bit of horizontal scaling for redundancy and geo distribution, but past that, vertically scaling to fewer, larger processes tends to be more efficient, easier to load balance, and has a handful of other benefits.
by kevincox
3/9/2026 at 5:25:02 PM
> The main downside of this was memory usage. You would have to load all of your code and libraries N times and in-process caches would become less effective.

You can load modules and then fork child processes. Children will share memory with each other (if they need to modify any shared memory, they get copy-on-write pages allocated by the kernel) and you'll save quite a lot on memory.
by philsnow
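The preload-then-fork pattern described above can be sketched in a few lines. This is a minimal illustration using `os.fork()` directly; `json` stands in for whatever heavy modules a real application would import before forking:

```python
import os
import sys

import json  # stand-in for a real app's heavy imports


def worker(worker_id: int) -> None:
    # The child starts with every module the parent imported, sharing
    # those pages copy-on-write instead of re-importing per process.
    assert "json" in sys.modules
    os._exit(0)


def main() -> None:
    pids = []
    for i in range(4):
        pid = os.fork()
        if pid == 0:
            worker(i)          # child: never returns (os._exit above)
        pids.append(pid)       # parent: remember the child pid
    for pid in pids:
        os.waitpid(pid, 0)     # reap the children


if __name__ == "__main__":
    main()
```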
3/9/2026 at 5:29:17 PM
Yes, this can help a lot, but it definitely isn't perfect. Especially since CPython uses reference counting, it is likely that many pages get modified relatively quickly as they are accessed. Many other GC strategies are also pretty hostile to CoW memory (for example mark bits, moving, ...). Additionally this doesn't help for lazily loaded data and caches in code and libraries.
by kevincox
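For what it's worth, CPython has a partial mitigation for the GC side of this: `gc.freeze()` (available since 3.7, contributed largely for this preload-then-fork use case). It moves everything currently tracked into a permanent generation that the cyclic collector never scans, so collections after `fork()` stop dirtying those pages. It does not help with reference-count writes, which still break CoW as objects are merely accessed:

```python
import gc

gc.disable()       # avoid a collection between freeze() and fork()
# ... import modules and build caches here, in the parent ...
gc.freeze()        # existing objects become invisible to the collector

# Every object allocated so far is now in the permanent generation.
print(gc.get_freeze_count() > 0)
# ... fork() workers here; they can re-enable gc for new objects ...
```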
3/9/2026 at 12:32:04 PM
For big things the current way works fine. Having a separate container/deployment for celery, the web server, etc. is nice so you can deploy and scale separately. Mostly it works fine, but there are of course some drawbacks: things like Prometheus scraping of processes that aren't able to run a web server in parallel are clunky to work around.

And for smaller projects it's such an annoyance. Having a simple project running, and having to muck around to get cron jobs, background/async tasks etc. to work in a nice way is one of the reasons I never reach for Python in these instances. I hope removing the GIL makes it better, but I'm also afraid it will expose a whole can of worms where lots of apps, tools and frameworks aren't written with this possibility in mind.
by matsemann
3/9/2026 at 5:22:41 PM
> observability tooling for Python evolving

As much as I dislike Java the language, this is somewhere where the difference between CPython and JVM languages (and probably BEAM too) is hugely stark. Want to know if garbage collection or memory allocation is a problem in your long running Python program? I hope you're ready to be disappointed and need to roll a lot of stuff yourself. On the JVM the tooling for all kinds of observability is immensely better. I'm not hopeful that the gap is really going to close.
by rpcope1
3/9/2026 at 5:33:17 PM
> If true parallelism becomes common, it might actually reduce the number of containers/services needed for some workloads

Not by much. The cases where you can replace processes with threads and save memory are rather limited.
by fiedzia
3/9/2026 at 5:50:35 PM
Citation needed? Tall tasks are standard practice to improve utilization and reduce hotspots by reducing load variance across tasks.
by aoeusnth1
3/9/2026 at 7:45:21 PM
I would have thought most of those would have been moved to async Python by now.
by influx
3/9/2026 at 8:40:04 PM
async Python still uses a single thread for the main loop; it just hides non-blocking IO.
by LtWorf
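A small demonstration of that point: under asyncio, "concurrent" tasks all run on the same OS thread, so pure-Python CPU work in one coroutine blocks all the others; only the I/O waits overlap:

```python
import asyncio
import threading

thread_ids = set()


async def cpu_task(n: int) -> int:
    # Record which OS thread actually runs this coroutine.
    thread_ids.add(threading.get_ident())
    return sum(range(n))  # pure-Python work: holds the event loop


async def main() -> None:
    # gather() interleaves coroutines, but does not parallelize them.
    await asyncio.gather(cpu_task(100_000), cpu_task(100_000))


asyncio.run(main())
print(len(thread_ids))  # → 1: both "concurrent" tasks shared one thread
```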
3/9/2026 at 1:18:00 PM
A lot of that has already been solved by scaling workers to cores, along with techniques like greenlets/eventlets that support concurrency without true multithreading, to take better advantage of CPU capacity.
by apothegm
3/9/2026 at 9:15:02 PM
That's great for concurrency, but doesn't improve parallelism. Unless you mean you have multiple worker processes (or GIL-free threads).
by Sohcahtoa82
3/9/2026 at 3:00:17 PM
But you are still more or less limited to one CPU core per Python process. Yes, you can use that core more effectively, but you still can't scale up very far.
by kevincox
3/9/2026 at 3:59:21 PM
But Python can fork itself and run multiple processes in one single container. Why would there be a need to run several containers to run several processes? There's even the multiprocessing module in the stdlib to achieve this.
by LtWorf
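A minimal sketch of that stdlib route: one entrypoint (one container) spreading CPU-bound work over several forked worker processes with `multiprocessing.Pool`:

```python
from multiprocessing import Pool


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    # Four worker processes inside the same container; work is farmed
    # out to them over IPC and results are collected in order.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```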
3/9/2026 at 5:21:45 PM
Threads are cheap: you can do N pieces of work simultaneously with N threads in one process, without serialization, IPC, or process creation overhead. With multiprocessing, processes are expensive and each unit of work hogs a whole process. You must serialize data twice for IPC, which is expensive and time consuming.
You shouldn't have to break out multiple processes, for example, to do some simple pure-Python math in parallel. It doesn't make sense to use multiple processes for something like that because the actual work you want to do will be overwhelmed by the IPC overhead.
There are also limitations: only some data can be sent to and from multiple processes. Not all of your objects can be serialized for IPC.
by heavyset_go
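A quick way to see that restriction is to try the serialization step multiprocessing performs (pickle) on something unpicklable, such as a lambda:

```python
import pickle

square = lambda x: x * x  # fine to pass to a thread, but...

try:
    # ...this is what multiprocessing would have to do to ship it
    # to a worker process, and it fails for lambdas, open sockets,
    # locks, and many other objects.
    pickle.dumps(square)
    sendable = True
except (pickle.PicklingError, AttributeError, TypeError):
    sendable = False

print(sendable)  # → False: a lambda can't be serialized for IPC
```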
3/9/2026 at 8:58:48 PM
It makes sense to me that a program currently written using multiple processes would now be re-written to use multiple truly parallel threads. But it seems very odd to suggest (as your grandparent comment does) that a program currently run in multiple containers would likely be migrated to run on multiple threads.

In other words, I imagine anyone who cares about the overhead from serialization, IPC, or process creation would already be avoiding (as much as possible) using containers to scale in the first place.
by connorboyle
3/9/2026 at 5:42:59 PM
I think you have a good point on IPC, but process creation in Linux is almost as fast as thread creation. Unless the app would constantly be creating and killing processes, the process creation overhead would not be that much, but IPC is the killer.
And also your types aren't picklable or whatever, and now you gotta change a lot of stuff to get it to work lol.
by akdev1l
3/9/2026 at 4:41:43 PM
Forking and multithreading do not coexist. Even if one of your transitive dependencies decides to launch a thread that's 99% idle, it becomes unsafe to fork.
by kccqzy
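The usual mitigation, for what it's worth, is to avoid bare `fork()` entirely: multiprocessing's "spawn" start method launches children as fresh interpreters instead of forking a copy of a possibly-threaded parent (and since Python 3.12, CPython emits a DeprecationWarning when `fork()` happens while other threads are alive). A minimal sketch:

```python
import multiprocessing as mp


def work(x: int) -> int:
    return x + 1


if __name__ == "__main__":
    # "spawn" execs a fresh interpreter per child, so none of the
    # parent's threads, locks, or half-mutated state is inherited.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))  # → [2, 3, 4]
```

The trade-off is startup cost: spawned children re-import your modules instead of inheriting them copy-on-write.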
3/9/2026 at 5:28:16 PM
I'm curious as to the downvotes on this. It's absolutely true, and when I was maintaining a job runner daemon that ran hundreds of thousands of who-knows-what Python tasks/jobs a day on some shared infra with arbitrary code for a certain megacorp from 2016-2020 or so, this was one of the most insidious and ugly failure modes to go debug and handle. The docs really make it sound like you can mix threading and multiprocessing, but you can never really completely ensure that threading and then a bare fork will be safe, period. It's really irritating that the docs would have you believe that this is OK or safe, but it is in keeping with the Python philosophy of trying to hide the edge of the blade you're using until it's too late and you've cut the shit out of yourself.
by rpcope1
3/9/2026 at 5:44:07 PM
Why is it unsafe?
by akdev1l
3/9/2026 at 8:30:27 PM
In general only the thread calling fork() gets forked, so unless you call exec() soon after, there are a lot of complications with signals and shared memory.
by LtWorf
3/9/2026 at 9:02:56 PM
What are the complications? A single thread with its own process sandbox with everything from the parent is exactly what I'd expect coming from C land. Are the complications you refer to specific to the Python VM or more general?
by fc417fc802
3/9/2026 at 10:57:21 PM
Even treating the process as read-only after forking is potentially fraught. What if a background thread is mutating some data structure? When you fork, the data structure might be internally inconsistent because the work to finish the mutation might not be completed. Imagine there are locks held by various threads at the moment of the fork: trying to lock those in the child might deadlock, or even worse. There are tons of these types of gotchas.
by grogers
3/9/2026 at 10:26:26 PM
If you have multiple threads, you almost certainly have mutexes. If your fork happens when a non-main thread holds a mutex, your main thread will never again be able to hold that mutex.

An imperfect solution is to require every mutex created to be accompanied by some pthread_atfork, but libraries don't do that unless forking is specifically requested. In other words, if you don't control the library you can't fork.
by kccqzy
3/9/2026 at 5:26:00 PM
Fork-then-thread works, does it not?
by philsnow
3/9/2026 at 5:40:26 PM
If you have enough discipline to make sure you only create threads after all the forking is done, then sure. But having such discipline is harder than just forbidding fork or forbidding threads in your program. It turns a careful analysis of timing and causality into just banning a few functions.
by kccqzy
3/9/2026 at 8:40:08 PM
Can't you check what threads are active at the time you fork?
by josefx
3/9/2026 at 10:30:52 PM
And what do you do with that information? Refuse to fork after you detect more than one thread running? I haven't seen any code that gracefully handles the unable-to-fork scenario. When people write fork-based code, especially in Python, they always expect forking to succeed.
by kccqzy
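Such a check is easy to write (`safe_fork` here is a hypothetical helper, not a stdlib function), though as noted above it just turns the fork into a failure the caller has to handle somehow:

```python
import os
import threading


def safe_fork() -> int:
    # Best-effort guard: refuse to bare-fork once any other thread
    # is alive. It cannot see threads started by C extensions that
    # bypass the threading module.
    if threading.active_count() > 1:
        raise RuntimeError(
            f"refusing to fork with {threading.active_count()} threads alive"
        )
    return os.fork()


# A background thread (here parked on an Event that never fires)
# is enough to trip the guard.
t = threading.Thread(target=threading.Event().wait, daemon=True)
t.start()
try:
    safe_fork()
    forked = True
except RuntimeError:
    forked = False
print(forked)  # → False: a live background thread blocked the fork
```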
3/9/2026 at 5:30:26 PM
But not the reverse: if it's a bare fork and the code isn't strictly mutex- and shared-resource-free (which is hard), there are little or no warning lights to indicate that this is a terrible idea that fails in really unpredictable and hard-to-debug ways.
by rpcope1
3/9/2026 at 8:33:43 PM
I'm replying to a person who scales Python by running several containers instead of one container with several Python processes.
by LtWorf