Should have funded the entire GIL-removal effort by selling carbon credits. Here's an industry waiting to happen: issue carbon credits for optimizing CPU and GPU resource usage in established libraries.
I'll take all the credits for migrating Electron apps.
I wonder about the total energy cost of apps like Teams, Slack, Discord, etc. Hundreds of millions of users, an app running constantly in the background. I wouldn't be surprised if the global power consumption on the client side reached a gigawatt. Add the increased wear on components, the cost of hardware upgrades, etc.
All that to avoid hiring a few developers to build optimized native clients for the most popular platforms. Popular apps and websites should lose or earn carbon credits based on optimization. What is negligible for a small project becomes important when millions of users are involved, especially for background apps.
If we go by Microsoft's 2020 figure of 1 billion devices running Windows 10 [0], and assume all of those run some kind of Electron app (or several), you easily get your gigawatt by saving just 1 watt per device on average. I suspect you'd land higher than 1 gigawatt, though I'm not sure it stretches to another order of magnitude. Then again, the noisy fan on my notebook begs to differ; maybe the 10 GW mark is doable...
[0] https://news.microsoft.com/apac/2020/03/17/windows-10-poweri...
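The back-of-envelope above, spelled out (the 1 W average saving is of course an assumption):

```python
# Toy estimate: aggregate power saved across a device fleet.
devices = 1_000_000_000      # Microsoft's 2020 Windows 10 figure
watts_saved = 1.0            # assumed average saving per device

total_gw = devices * watts_saved / 1e9          # watts -> gigawatts
twh_per_year = total_gw * 8760 / 1000           # GW * h/year -> TWh

print(total_gw, round(twh_per_year, 2))  # 1.0 GW, 8.76 TWh per year
```

A continuous gigawatt is roughly one large power plant's output, which is why per-device watts matter at this scale.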
There are 30,000 different cross-platform GUI frameworks and they all share two attributes: (1) they look embarrassingly bad compared to Electron or native apps, and (2) they are mostly terrible to program for.
I feel like I'm never wasting my time when I learn how to do things with the web platform, because it turns out the app I made for desktop and tablet also works on my VR headset. Sure, if you pay me 2x the market rate and it's a sure thing, you might interest me in learning Swift and writing iOS apps; but for a personal project, or even a moneymaking project where I'm taking some financial risk, no way. The price of learning to write apps for Android is that I also have to learn to write apps for iOS, Windows, and macOS, and to decide which is the least-bad widget set for Linux and learn to program for it too.
Every time I do a shoot-out of Electron alternatives Electron wins and it is not even close.
One thing I'm curious about here is the operational impact.
In production systems we often see Python services scaling horizontally because of the GIL limitations. If true parallelism becomes common, it might actually reduce the number of containers/services needed for some workloads.
But that also changes failure patterns — concurrency bugs, race conditions, and deadlocks might become more common in systems that were previously "protected" by the GIL.
It will be interesting to see whether observability and incident tooling evolve alongside this shift.
For big things the current way works fine. Having a separate container/deployment for Celery, the web server, etc. is nice so you can deploy and scale them separately. Mostly it works fine, but there are of course some drawbacks: for example, Prometheus scraping a worker that can't also run a web server in parallel is clunky to work around.
And for smaller projects it's such an annoyance. Having a simple project running, then having to muck around to get cron jobs and background/async tasks working nicely, is one of the reasons I never reach for Python in these cases. I hope removing the GIL makes it better, but I'm also afraid it will open a whole can of worms where lots of apps, tools, and frameworks weren't written with this possibility in mind.
A lot of that has already been solved by scaling workers to cores, along with techniques like greenlets/eventlets that provide concurrency without true multithreading to make better use of CPU capacity.
I have a suspicion that this paper is basically a summary with some benchmarks, done with LLMs.
Your suspicion could have easily been cleared by reading the paper.
If you're short on time: the paper reads a bit dry, but falls within the norm for academic writing. The GitHub repo [0] shows work over months in 2024 (leading up to the release of 3.13) and some rush from Dec 2025 to Jan 2026, probably to wrap things up for the release of this paper. All commits on the repo are from the author, but I didn't look through the code to check for Copilot intervention.
[0] https://github.com/Joseda8/profiler
Our experience on memory usage, in comparison, has been generally positive.
Previously we had to use ProcessPoolExecutor, which meant maintaining multiple copies of the runtime and of shared data in memory and paying high IPC costs; being able to switch to ThreadPoolExecutor was hugely beneficial in terms of both speed and memory.
It almost feels like programming in a modern (circa 1996) environment like Java.
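A minimal sketch of that switch, with a hypothetical workload: the threads all read the same list, so nothing is pickled or copied per worker, and on a free-threaded build the CPU-bound work can genuinely run in parallel (on a GIL build this shape only pays off for I/O-bound tasks):

```python
from concurrent.futures import ThreadPoolExecutor

def count_even(chunk):
    # Worker over a shared, read-only list; with threads there is
    # no pickling and no per-worker copy of the data.
    data, lo, hi = chunk
    return sum(1 for x in data[lo:hi] if x % 2 == 0)

data = list(range(1000))
chunks = [(data, i, i + 250) for i in range(0, 1000, 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(count_even, chunks))

print(total)  # 500
```

Because ThreadPoolExecutor and ProcessPoolExecutor share the same Executor interface, the migration is often just swapping the class, which is exactly what makes the new failure modes discussed below easy to walk into.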
Swapping ProcessPoolExecutor for ThreadPoolExecutor gives real memory and IPC wins, but it trades process isolation for new failure modes because many C extensions and native libraries still assume the GIL and are not thread safe.
Measure aggressively and test under real concurrency: use tracemalloc to find memory hotspots, py-spy or perf to profile contention, and fuzz C-extension paths with stress tests so bugs surface in the lab, not in production. Watch per-thread stack overhead and GC behavior, design shared state to be immutable or sharded, keep critical sections tiny, and if process-level isolation is still required, stick with ProcessPoolExecutor or expose large datasets via read-only mmap.
I thought libraries had to explicitly opt in to no GIL via a macro or constant or something in C
GP is a clanker spouting off a lot of random nonsense.
Says someone who has "bot234" in his name ...
Might be worth noting that this seems to be just running some tests using the current implementation, and these are not necessarily general implications of removing the GIL.
There might also be many optimization opportunities that still have to be seized.
Sections 5.4 and 5.5 are the interesting ones.
5.4: Energy consumption going down because of parallelism over multiple cores seems odd. What were those cores doing before? Better utilization causing some spinlocks to be used less or something?
5.5: Fine-grained lock contention significantly hurts energy consumption.
I'm not sure of the exact relationship, but power consumption increases faster than linearly with clock speed. If you have 4 cores running at the same time, thermal throttling is more likely → lower clock speeds → lower energy consumption.
Greater power draw though; remember that energy is the integral of power over time.
By running more tasks in parallel across different cores, they can each run at a lower clock speed and potentially still finish before a single core at a higher clock speed can execute them sequentially.
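A toy model of that point, assuming dynamic power scales roughly with f³ (since voltage tracks frequency, P ~ C·V²·f) and throughput scales with f; all numbers are arbitrary units, purely illustrative:

```python
def dynamic_power(freq, ncores=1):
    # Toy model: per-core dynamic power grows ~ f^3 because
    # voltage must rise roughly with frequency (P ~ C * V^2 * f).
    return ncores * freq ** 3

WORK = 4.0  # total work units; assume throughput per core ~ freq

# One core at 4x clock: finishes in WORK / 4 = 1 time unit.
t_single = WORK / 4.0
e_single = dynamic_power(4.0) * t_single            # 64 * 1 = 64

# Four cores at 1x clock, work split evenly: also 1 time unit.
t_quad = (WORK / 4) / 1.0
e_quad = dynamic_power(1.0, ncores=4) * t_quad      # 4 * 1 = 4

print(e_single, e_quad)  # 64.0 4.0: same wall time, ~16x less energy
```

Under these assumptions, spreading the same work across four slow cores finishes in the same wall time for a fraction of the energy, which is consistent with the paper's observation that parallelism can reduce energy consumption.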
Title shortened - Original title:
Unlocking Python’s Cores: Hardware Usage and Energy Implications of Removing the GIL
I'm curious about the choice of the NumPy workload, given its more limited dependence on CPython performance.
[flagged]
Thanks ChatGPT, good of you to let us know.
There are so many ChatGPT responses in this thread, it’s giving me a headache.
Yep. Real "dead internet theory" vibes, really sad to see.
It’s been very noticeable for about a year now, but the last few months is absolutely terrible. I wonder if clawdbot has anything to do with it.
my hypothesis is that chatgpt was trained on the internet, and useful technical answers on the internet were posted by autistic people. who else would spend their time learning and then rushing to answer such things the moment they get their chance to shine? so chatgpt is basically pure distilled autism, which is why it sounds so familiar.
I'm curious what makes that obviously LLM? As far as I can tell it was a short and fairly benign statement with little scope to give away LLM-ness.
It's just the equivalent of that one student restating what the teacher just said with no added value
Just as bad if it's human. No information has been shared. The writer has turned idle wondering into prose:
> Once threads actually run concurrently, libraries (which?) that never needed locking (contradiction?) could (will they or won't they?) start hitting race conditions in surprising (go on, surprise me) places.
The obvious solution is to require libraries that are no-GIL safe to declare that, and for all other libraries implicitly wrap them with GIL locks.
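For what it's worth, PEP 703 already specifies something close to this: extension modules declare free-threading support via the `Py_mod_gil` slot (`Py_MOD_GIL_NOT_USED`), and importing a module that doesn't declare it makes a free-threaded interpreter re-enable the GIL at runtime (with a warning), unless overridden via `PYTHON_GIL=0`. From Python you can check the outcome; a small sketch, guarding for the fact that `sys._is_gil_enabled` is a private API that only exists on 3.13+:

```python
import sys

def gil_status():
    """Report whether this interpreter is currently running with the
    GIL. sys._is_gil_enabled() was added in 3.13; older builds
    (which always have a GIL) don't expose it at all."""
    probe = getattr(sys, "_is_gil_enabled", None)
    if probe is None:
        return "enabled (pre-3.13 build)"
    return "enabled" if probe() else "disabled"

print(gil_status())
```

Running this after importing your dependency stack is a quick way to tell whether some legacy extension silently pulled the GIL back in.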
> Across all workloads, energy consumption is proportional to execution time
Race-to-idle used to be the best path before multicore. Now it's trickier to determine how to clock the device, especially in battery-powered cases. This is why all modern CPU manufacturers are looking into heterogeneous compute (efficiency vs. performance cores).
Put differently, I don't think we should be killing ourselves over this on the software side. If you are actually concerned about raw energy consumption, you should move your workloads from AMD/Intel to ARM/Apple. Everything else is noise compared to that.
Programs whose performance is dominated by array operations, as is the case for most scientific/technical/engineering applications, achieve much better energy efficiency on AMD or Intel CPUs with good AVX-512 support (e.g. Zen 5 Ryzen or Epyc CPUs, or Granite Rapids Xeons) than on almost all ARM-based CPUs, including all Apple CPUs. (The only ARM-based CPUs with good energy efficiency for such applications are made by Fujitsu, but they are unobtainium.)
So if you want maximum energy efficiency, you should choose your CPU well; a prejudice like believing that ARM-based CPUs are always better is guaranteed to lead to incorrect decisions.
Apple CPUs have exceptional, unmatched energy efficiency in single-threaded applications, but their energy efficiency in multi-threaded applications is no better than that of Intel/AMD CPUs made on the same TSMC CMOS fabrication process. So Apple can have only a temporary advantage, when they are first to use a process that competitors don't yet have access to.
Outside of personal computers, the energy efficiency that matters is that of multi-threaded applications, so there Apple has nothing special to offer.
This is a very silly take. CPU ISA is at most a 2x difference, and software has plenty of 100x differences. Most of the difference between Windows and macOS isn't the chips; OS and driver bloat is a much bigger factor.
CPU ISA is at most a 2x difference for programs that use only the general-purpose registers and operations.
For applications that use vector or matrix operations and may need specific features, it is common to see 4x to 10x better performance, or even more, when moving from a badly designed ISA to a well designed one, e.g. from Intel AVX to Intel AVX-512.
Moreover, some ISAs are guilty of blunders that lower performance many times over. For instance, if an ISA lacks rotation instructions, an application whose performance depends heavily on such operations may run up to 3x slower than on an ISA that has them.
Even greater slowdowns happen on ISAs that lack good means of detecting errors, e.g. when running on RISC-V a program that must be reliable and therefore has to check for integer overflows.
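For a concrete sense of the rotation point: without a hardware rotate instruction, a 32-bit rotate-left has to be emulated with shifts, an OR, and masking, i.e. several instructions where an ISA with rotates needs one. Sketched in Python:

```python
def rotl32(x, n):
    # Software emulation of a 32-bit rotate-left: two shifts, an OR,
    # and a mask. On an ISA with a rotate instruction this whole
    # expression is a single hardware operation.
    n &= 31
    return ((x << n) | (x >> (32 - n) if n else 0)) & 0xFFFFFFFF

print(hex(rotl32(0x80000001, 1)))  # 0x3: the top bit wraps around
```

Hash functions and ciphers (e.g. ChaCha, SHA-2) lean heavily on rotates, which is why a missing rotate instruction can cost an integer factor on such workloads.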