3/31/2025 at 10:34:12 PM
Every perf guide recommends minimizing allocations to reduce GC times, but if you look at a pprof profile of a Go app, the GC mark phase is what takes time, not GC sweep. GC mark always starts with the known live roots (goroutine stacks, globals, etc) and traverses references from there, colouring every pointer. To minimize GC time it is best to avoid _long-lived_ allocations. Short-lived allocations, those which the GC mark phase will never reach, have an almost negligible effect on GC times. Allocations of any kind do cause GC to trigger earlier, but in real apps it is almost hopeless to avoid GC, except for very carefully written programs with no dependencies, and if GC happens, then reducing GC mark times gives a bigger bang for the buck.
by nopurpose
4/1/2025 at 3:38:20 AM
It's worth calling out that abstractions can kill you in unexpected ways with Go. Anytime you use an interface it forces a heap allocation, even if the object is only used read-only and within the same scope. That includes calls to things like fmt.Printf(), so a for loop that prints the value of i forces the integer backing i to be heap-allocated, along with every other value that you printed. So if you helpfully make every API in your library use an interface, you are forcing the callers to use heap allocations for every single operation.
by liquidgecka
4/1/2025 at 4:12:37 AM
I thought surely an integer could be inlined into the interface; I thought Go used to do that. But I tried it on the playground, and it heap-allocates it:
by slashdev
4/1/2025 at 4:16:04 AM
Go did use to do that; it was removed years ago, in 1.4: https://go.dev/doc/go1.4#runtime
by masklinn
4/1/2025 at 1:36:40 PM
Basically, anything that isn't a thin pointer (*T, chan, map) gets boxed nowadays. The end result is that both words of an interface value are always pointers [1], which is very friendly to the garbage collector (setting aside the extra allocations when escape analysis fails). I've seen some tricks in the standard library to avoid boxing, e.g. how strings and times are handled by log/slog [2].
[1]: https://github.com/teh-cmc/go-internals/blob/master/chapter2...
[2]: https://cs.opensource.google/go/go/+/refs/tags/go1.24.1:src/...
by kbolino
4/1/2025 at 9:36:42 PM
slog.Value looks incredibly useful. Just imagine a day where database/sql doesn't generate a tonne of garbage, because it moves to use something like that?
by ncruces
4/1/2025 at 1:59:53 PM
go1.15 re-added small integer packing into interfaces: https://go.dev/doc/go1.15#runtime
by ominous_prime
4/1/2025 at 2:35:00 PM
It didn't, actually. Instead Go 1.15 has a static array of the integers 0 through 255, and when it needs to box one for an interface it gets a pointer into that array instead: https://go-review.googlesource.com/c/go/+/216401/4/src/runti...
This array is also used for single-byte strings (which previously had their own array): https://go-review.googlesource.com/c/go/+/221979/3/src/runti...
by masklinn
4/1/2025 at 5:06:00 PM
It didn't, do what? I would consider the first 256 integers to be "small integers" ;)
> Converting a small integer value into an interface value no longer causes allocation
I forgot that it can also be used for single-byte strings. That's not an optimization I ever encountered being useful, but it's there!
by ominous_prime
4/1/2025 at 6:11:50 PM
> It didn't, do what?
Reintroduce “packing into interfaces”.
It did a completely different thing. Small integers remain not inlined.
by masklinn
3/31/2025 at 10:40:48 PM
Are you including in this analysis the amount of time/resources it takes to allocate? GC isn't the only thing you want to minimize when you're making a high-performance system.
by MarkMarine
4/1/2025 at 7:14:53 AM
From that perspective it boils down to "do less", which any perf guide already includes; allocation is no different from anything else an app does. My comment is more about the "reduce allocations to reduce GC pressure" advice seen everywhere. It doesn't tell the whole story. Short-lived allocations don't introduce much GC pressure: you'll be hard-pressed to even see the GC sweep phase on pprof without zooming. People take this advice, spend time and energy hunting down allocations, just to see that total GC time remained the same after all that effort, because they were focusing on the wrong type of allocations.
by nopurpose
4/1/2025 at 3:20:19 PM
Yeah, I understand what you're saying, but my point is you're doing the opposite side of the same coin: not doing a full perf analysis and saying this one method works (yours is to reduce GC mark time, ignoring allocation; others are trying to reduce allocation time, ignoring GC time; or all these other methods listed in this doc).
by MarkMarine
4/1/2025 at 11:31:22 AM
Side note: see https://tip.golang.org/doc/gc-guide for more on how the Go GC works and what triggers it.
GC frequency is directly driven by allocation rate (in terms of bytes) and live heap size. Some examples:
- If you halve the allocation rate, you halve the GC frequency.
- If you double the live heap size, you halve the GC frequency (barring changes away from the default `GOGC=100`).
> ...but if you look at pprof of a Go app, GC mark phase is what takes time, not GC sweep.
It is true that sweeping is a lot cheaper than marking, which makes your next statement:
> Short-lived allocations, those which the GC mark phase will never reach, have an almost negligible effect on GC times.
...technically correct. Usually, this is the best kind of correct, but it omits two important considerations:
- If you generate a ton of short-lived allocations instead of keeping them around, the GC will trigger more frequently.
- If you reduce the live heap size (by not keeping anything around), the GC will trigger more frequently.
So now you have cheaper GC cycles, but many more of them. On top of that, you have vastly increased allocation costs. It is not a priori clear to me this is a win. In my experience, it isn't.
by aktau
4/1/2025 at 8:42:23 PM
Interesting, thank you. But I think those points are not correlated that much. For example, if I create unnecessary wrappers in a loop, I might double the allocation rate, but I will not halve the live heap size, because I did not have those wrappers outside the loop before.
Basically, I'm trying to come up with a real-world example of a style change (like creating wrappers for every error, or using naked integers instead of time.Time) to estimate its impact. And my feeling is that any such example would affect one of your points way more than the other, so we can still argue that e.g. "creating short-lived iterators is totally fine".
by deepsun
4/1/2025 at 3:04:29 PM
I enjoyed your detailed response, it adds value to this discussion, but I feel you missed the point of my comment.
I am against blanket statements like "reduce allocations to reduce GC pressure", which lead people the wrong way: they compare libraries based on "allocs/op" from go bench, they trust ridiculous (who allocates 8KB per iteration in a tight loop??) microbenchmarks of sync.Pool like in the article above, hoping to resolve their GC problem. They spend a considerable amount of effort just to find that they barely moved the needle on GC times.
If we generalize, then my "avoid long-lived allocations" or your "reduce allocation rate in terms of bytes" are much more useful in practice than what this and many other articles preach.
by nopurpose
4/1/2025 at 2:04:15 AM
Pretty similar story in .NET. Make sure your inner loops are allocation-free, then ensure allocations are short-lived, then clean up the long tail of large allocations.
by zmj
4/1/2025 at 3:55:15 AM
.NET is far more tolerant of high allocation traffic since its GC is generational and overall more sophisticated (even if at the cost of tail latency, although that is workload-dependent). Doing huge allocations which go to the LOH is quite punishing, but even substantial inter-generational traffic won't kill it.
by neonsunset
4/1/2025 at 9:51:06 PM
The runtime also forces a GC every 2 minutes. So yeah, a lot of long-lived allocations can stress the GC, even if you don't allocate often. That's why Discord moved from Go to Rust for their Read States server.
by kgeist
4/1/2025 at 10:06:33 AM
The point is not to avoid GC entirely, but to reduce allocation pressure. If you can avoid allocs in a hot loop, it definitely pays to do so. If you can't for some reason, and can use sync.Pool there, measure it.
Cutting allocs in half may not matter much, but if you can cut them by 99%, because you were allocating in every iteration of a 1-million-iteration loop and now aren't, it will make a difference, even if all those allocs die instantly.
I've gotten better than twofold performance increases on real code with both techniques.
by ncruces
4/1/2025 at 2:18:09 AM
Eh, kind of. If you are allocating in a hot loop it's going to suck regardless. Object pools are really key if you want high perf, because the general-purpose allocator is way less efficient in comparison.
by zbobet2012
4/1/2025 at 10:30:00 AM
Agree that mark phase is the expensive bit. Disagree that it’s not worth reducing short-lived allocations. I spend a lot of time analyzing Go program performance, and reducing bytes allocated per second is always beneficial.
by bboreham
4/1/2025 at 2:19:23 PM
+1. In particular, []byte slice allocations are often a significant driver of GC pace while also being relatively easy to optimize (e.g. via sync.Pool reuse).
by felixge
4/1/2025 at 12:38:59 AM
You might wanna look at a system profiler too, pprof doesn't show everything.
by raggi
3/31/2025 at 11:23:18 PM
Aren't allocations themselves pretty expensive regardless of GC?
by Capricorn2481
3/31/2025 at 11:48:58 PM
Go allocations aren't that bad. A few years ago I benchmarked them at about 4x as expensive as a bump allocation. That is slow enough to make an arena beneficial in high-allocation situations, but fast enough to not make it worth it most of the time.
by nu11ptr
4/1/2025 at 10:29:08 AM
Comparing with a fairly optimized malloc at $COMPANY, the Go allocator is (both in terms of relative cycles and fraction of cycles of all Go programs) significantly more expensive than the C/C++ counterpart (3-4x IIRC). For one, it has to do more work, like setting up GC metadata and zeroing.
There have recently been some optimizations to `runtime.mallocgc`, which may have decreased that 3-4x estimate a bit.
by aktau
4/1/2025 at 12:48:49 PM
How can that be true? If it is 3-4x more expensive than malloc, then per my measurements your malloc is a bump allocator, and that simply isn't true for any real-world malloc implementation (typically a modified free-list allocator, AFAIK). `mallocgc` may not be fast, but I simply did not find it as slow as you are saying. My guess is it is about as fast as most decent malloc functions, but I have not measured, and it would be interesting to see a comparison (tough to do, as you'd need to call malloc via cgo, or write one loop in C and one in Go and trust the looping is roughly the same cost).
by nu11ptr
4/2/2025 at 8:38:24 AM
I should correct and clarify: I meant 3-4x more expensive in relative terms. Meaning:
- For C++ programs, the allocator (allocating+freeing) consumes roughly 5% of cycles.
- For Go programs, the allocator (runtime.mallocgc) used to consume ~20% of cycles (this is the data I referenced). I checked, and recently it's become closer to 15%, thanks to optimizations.
I have not tested the performance differential on a per-byte level (though that will also differ with object structure in Go).
by aktau
4/1/2025 at 1:24:13 AM
No. If you have a moving multi-generational GC, allocation is literally just an increment for short-lived objects.
by epcoa
4/1/2025 at 3:01:11 AM
This is about Go, not Java. Go makes different tradeoffs and does not have a moving multi-generational GC.
by burch45
4/1/2025 at 3:54:47 AM
If you have a moving, generational GC, then all the benefits of fast allocation are lost due to data moving and costly memory barriers.
by pebal
4/1/2025 at 9:25:46 AM
Not at all. Most objects die young and thus are never moved. Also, the time before an object is moved is very long compared to CPU operations, so it is only statistically relevant (very good throughput, rare, longer tail on latency graphs). Also, write barriers alone don't have that big of an overhead.
by gf000
by gf000
4/1/2025 at 3:17:46 PM
It doesn't matter if objects die young — the other objects on the heap are still moved around periodically, which reduces performance. When you're using a moving GC, you also have additional read barriers that non-moving GCs don't require.
by pebal
4/2/2025 at 6:45:39 AM
Is that period really that big of a concern, when your threads in any language might be context-switched away by the OS? It's not a common occurrence on a CPU timeline at all.
Also, it's no accident that every high-performance GC runtime went the moving, generational way.
by gf000
by gf000
4/2/2025 at 7:45:31 AM
That time may seem negligible, since the OS can context switch threads anyway, but it’s still additional time during which your code isn’t doing its actual work.Generations are used almost exclusively in moving GCs — precisely to reduce the negative performance impact of data relocation. Non-moving GCs are less invasive, which is why they don’t need generations and can be fully concurrent.
by pebal
4/2/2025 at 8:57:21 AM
I would rather say that generations are a further improvement upon a moving collector, improving space usage and decreasing the length of the "mark" phase.And which GC is fully concurrent? I don't think that's possible (though I will preface that I am no expert, only read into the topic on a hobby level) - I believe the most concurrent GC out there is ZGC, which does read barriers and some pointer tricks to make the stop-the-world time independent of the heap size.
by gf000
4/2/2025 at 9:24:40 AM
Java currently has no fully concurrent GC, and due to the volume of garbage it manages and the fact that it moves objects, a truly fully concurrent GC for this language is unlikely to ever exist.Non-moving GCs, however, can be fully concurrent — as demonstrated by the SGCL project for C++.
In my opinion, the GC for Go is the most likely to become fully concurrent in the future.
by pebal
4/2/2025 at 11:20:27 AM
Is SGCL your project?
In that case, are you doing atomic writes for managed pointers/the read flag on them? I have read a few of your comments on Reddit, and your flags seem to be per memory page? Still, the synchronization on them may or may not have a more serious performance impact than alternative methods, and without a good way to compare it to something like Java, which is the state of the art in GC research, we can't really comment much on whether it's a net benefit.
Also, have you perhaps tried modeling your design in something like TLA+?
by gf000
4/2/2025 at 1:45:22 PM
Yes, SGCL is my project.
You can't write concurrent code without atomic operations — you need them to ensure memory consistency, and concurrent GCs for Java also rely on them. However, atomic loads and stores are cheap, especially on x86. What’s expensive are atomic counters and CAS operations — and SGCL uses those only occasionally.
Java’s GCs do use state-of-the-art technology, but it's technology specifically optimized for moving collectors. SGCL is optimized for non-moving GC, and some operations can be implemented in ways that are simply not applicable to Java’s approach.
I’ve never tried modeling SGCL's algorithms in TLA+.
by pebal
4/1/2025 at 9:15:29 PM
It’s uncharitable to say the benefits are lost - I’d reframe it as creating tradeoffs.
by pgwhalen
4/1/2025 at 8:33:56 PM
Interesting, and I think that is not specific to Go; other mark-and-sweep GCs (Java, C#) should behave the same. Which means that creating short-lived objects (like iterators for loops, or some wrappers) is OK.
by deepsun
4/1/2025 at 11:57:38 PM
Not entirely. Go still doesn't have a generational collector, so high allocation rates cause more GCs that must examine long-lived objects. As such, short-lived objects have little impact in Java (thank god for that!). They will have second-order effects in Go.
by ted_dunning
4/1/2025 at 10:36:21 PM
It should be noted that in C#, at least, the standard pattern is to use value types for enumerators, precisely so as to avoid heap allocations. This is the case for all (non-obsolete) collections in the .NET stdlib - e.g. List&lt;T&gt;.Enumerator is a struct.
by int_19h
4/1/2025 at 5:00:04 AM
Is it worth making short-lived allocations just to please the GC? You might just end up with too many allocations, which will slow things down even more.
by nurettin
4/1/2025 at 11:34:03 AM
It is not. Please see my answer (https://news.ycombinator.com/item?id=43545500).
by aktau