1/10/2025 at 5:43:38 PM
The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.
And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes reporting the same tags, OTEL clobbers). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we couldn’t vertically scale. Other factors on that project mean that’s now unlikely to happen but it grates.
OTEL is actively hostile to any language that uses one process per core. What a joke.
Just go with Prometheus. It’s not like there are other contenders out there.
by hinkley
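A minimal sketch of the kind of in-house aggregation the comment above describes: several worker processes count under the same metric name and tags, and one aggregator sums them before anything is exported. The metric names, tag strings, and the `merge_worker_counts` helper are hypothetical and not part of any OTel or StatsD API.

```python
from collections import Counter


def merge_worker_counts(worker_reports):
    """Sum per-worker counters that share the same (metric, tags) key."""
    merged = Counter()
    for report in worker_reports:
        # Counter.update adds to existing counts instead of overwriting them.
        merged.update(report)
    return merged


# Hypothetical reports from two worker processes on the same host.
worker_a = {
    ("http.requests", ("route=/users",)): 120,
    ("http.errors", ("route=/users",)): 3,
}
worker_b = {("http.requests", ("route=/users",)): 95}

merged = merge_worker_counts([worker_a, worker_b])
for (metric, tags), value in sorted(merged.items()):
    print(metric, ",".join(tags), value)
```

Summing before export means only one series per tag set leaves the host, so identical tags from different processes cannot overwrite each other.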
1/10/2025 at 10:50:01 PM
I'm fairly convinced that OTEL is in a form of 'vendor capture', i.e. because the only way to get a standard was to compromise with various bigcorps and sloppy startups to glue-gun it all together.
I tried doing a simple otel setup in .NET and after a few hours of trying to grok the documentation of the vendor my org has chosen, hopped into a discord run by a colleague that has part of their business model around 'pay for the good otel on the OSS product' and immediately stated that whatever it cost, it was worth the money.
I'd rather build another reliable event/pubsub library without prior experience than try to implement OTEL.
by to11mtm
1/10/2025 at 10:22:48 PM
> It’s not like there are other contenders out there.
Apache Skywalking might be worth a look in some circumstances, doesn't eat too many resources, is fairly straightforward to set up and run, admittedly somewhat jank (not the most polished UI or docs), but works okay: https://skywalking.apache.org/
Also I quite liked that a minimal setup is indeed pretty minimal: a web UI, a server instance and a DB that you already know https://skywalking.apache.org/docs/main/latest/en/setup/back...
In some ways, it's a lot like Zabbix in the monitoring space - neither will necessarily impress anyone, but both have a nice amount of utility.
by KronisLV
1/10/2025 at 6:22:21 PM
This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.
by sethops1
1/10/2025 at 7:27:30 PM
Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.
How would you build the "holy grail" map that shows a trace of every sub component in a transaction broken down by start/stop time etc... for instance show the load balancer see a request, the request get handled by middlewares etc, then go onto some kind of handler/controller, the sub-queries inside of that like database calls or cache calls. I don't think that is possible with prometheus?
by whalesalad
1/10/2025 at 7:49:59 PM
> Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.
Correct. Prometheus is just metrics.
The main argument for oTel is that instead of one proprietary vendor SDK or importing prometheus and jaeger and whatever you want to use for logging, just import oTel and all that will be done with a common / open data format.
I still believe in that dream but it's clear that the whole project needs some time/resources to mature a bit more.
If anybody remembers the Terraform/ToFu drama, it's been really wild to see how much support everybody pledged for ToFu but all the traditional observability providers have just kinda tolerated oTel :/
by baby_souffle
1/10/2025 at 9:08:40 PM
Yeah part of the problem is it’s called Opentelemetry and half of you are only talking about tracing, not metrics. Telemetry is metrics. It’s been metrics since at least the Mercury Program.
Metrics in OTEL is about three years old and it’s garbage for something that’s been in development for three years.
by hinkley
1/11/2025 at 1:03:45 PM
It looks like a hassle to implement, ngl
by tonyhart7
1/10/2025 at 8:45:22 PM
Code traces are metrics. Run times per function calls metrics, count of specific function call metrics.
Otel is an attempt to package such arithmetic.
Web apps have added so many layers of syntax sugar and semantic wank, we’ve lost sight that it’s all just the same old math operations relative to different math objects. Sets are not triangles but both are tested, quantified, and compared with the same old mathematical ops we learn by middle school.
by niftaystory
1/10/2025 at 8:55:58 PM
No, code traces are not just metrics; and while you can knit together something approximating traces from metrics, you'll quickly run into the reason why traces are a distinct thing. First, in a distributed system, you'll discover that you can't rely on clocks to get the timing of subsecond events correct. Second, you'll be contextless about code paths. So, you might independently reinvent the idea of passing along a context - and now you're just making your own tracing system but without any of the benefit of building on years of existing discoveries in this field.
OTel does feel a little bit heavy, unless you're already used to e.g. New Relic, Dynatrace, etc. where you have to run an agent process and instrument your code to some extent; it's never going to be free to audit every function call! This is why (a) you sample down and don't keep every trace, and (b) unless your company is extremely flush with cash you probably don't run tracing in every environment. If you can get away with it just in a staging or perf test env you can reap most of the benefit without the production impact and cost.
by mikestorrent
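To make the "sample down" point in the comment above concrete, here is a minimal sketch of deterministic sampling keyed on the trace ID, so that every service seeing the same trace reaches the same keep/drop decision without coordinating. This is a generic hash-based approach written for illustration, not OpenTelemetry's own sampler; `keep_trace` and the trace ID strings are hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide from the trace ID alone whether to keep a trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace when its bucket falls below the requested fraction.
    return bucket < int(sample_rate * 2**64)


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, 0.1)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Because the decision depends only on the ID, a trace is either kept in full or dropped in full, which avoids trees with missing spans.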
1/10/2025 at 9:01:40 PM
All those things you describe are computable metrics. They have to be or Otel itself would not be able to compute them for consumption. All you described are cherry picked semantic indirections to obfuscate that it’s all just a computer computing metrics of its own memory states.
Sorry for knowing how computers actually work (EE grad not a CS grad). I know that can frustrate CS grads who think their preferred OS and favorite programming language is how a computer works. You’re describing how contemporary SWEs view their day job.
Edit: teleMETRY …what’s in a name? Oh right …meaning.
by niftaystory
1/11/2025 at 12:27:49 AM
To be a smart-ass, one has to be smart first. Quit this.
by mathfailure
1/10/2025 at 9:47:38 PM
As a no grad to EE grad: traces mean a bundle of metrics that varies in structure hence you can't store and process them as effective as a list of counters unless you have a distinct bin for each possible trace, combinatorial explosion y'know.
by chupasaurus
1/10/2025 at 11:55:21 PM
You know the conversation is going well for you when you resort to citing the "meaning" of a name instead of, you know, base reality. Who needs the territory, I've got my map right here.
Speaking of meaning, the best I can make of your point is that you're using a much broader definition of "metrics" than the rest of this conversation, and in particular broader than Prometheus (remember context? very important for "meaning"!) supports. That or you really just don't know what a "trace" is (in this context).
by andrewflnr
1/10/2025 at 11:52:28 PM
OpenTelemetry's traces are trees of spans. You cannot represent this efficiently without a combinatorial explosion of labels.
You may be thinking of metrics in the sense of counters and gauges, but that's not the data model that OpenTelemetry (and before it, Zipkin, Jaeger, and OpenCensus) uses for traces.
The data model for tracing is to emit events that provide a span ID and an optional parent span ID. The event collector can piece these together into a tree after the fact, which will work as long as the parent structure is maintained.
Prometheus is absolutely not suitable for this.
Quibbling about the word "telemetry" doesn't really help here. OpenTelemetry supports three completely different subsets of functionality: Metrics (counters, gauges, histograms), traces (span events in a tree structure), and logging (structured log events). They each have completely different client interfaces.
by atombender
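The span model described in the comment above can be sketched in a few lines: each event carries a span ID and an optional parent span ID, and a collector rebuilds the tree afterwards regardless of arrival order. The `SpanEvent` class, field names, and span names here are illustrative assumptions, not OpenTelemetry's actual types.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanEvent:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str


def build_tree(events):
    """Group spans by parent ID; arrival order does not matter."""
    children = defaultdict(list)
    for event in events:
        children[event.parent_id].append(event)
    return children


def render(children, parent_id=None, depth=0):
    """Return one indented line per span, depth-first from the root."""
    lines = []
    for span in sorted(children.get(parent_id, []), key=lambda s: s.span_id):
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Events arrive out of order, as they might from different services.
events = [
    SpanEvent("c", "b", "db query"),
    SpanEvent("a", None, "load balancer"),
    SpanEvent("d", "b", "cache lookup"),
    SpanEvent("b", "a", "request handler"),
]
tree_lines = render(build_tree(events))
print("\n".join(tree_lines))
```

The shape of the tree differs per request, which is why it does not map onto a fixed set of counters and labels.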
1/10/2025 at 8:59:28 PM
huh? I've always heard and read and experienced that "logs, traces, metrics" are the 3 legs of the observability stool.
by chrisweekly
1/10/2025 at 9:12:55 PM
Open teleMETRY
Any guesses as to etymology?
by niftaystory
1/10/2025 at 9:55:51 PM
By this logic, you can say that logging, metrics and tracing are all fundamentally just different kinds of data and we should be calling it just plain databases and CRUD.
They're related, but people have a very specific idea and concept of what each is, you haven't actually provided a good argument why we should throw out these distinctions just because they somewhat resemble each other if you ignore a few details
by ffsm8
1/10/2025 at 9:28:25 PM
Prometheus is good, but let's be clear...you don't get tracing.
by paulddraper
1/10/2025 at 11:07:23 PM
For tracing FOSS: Grafana Tempo.
by PeterCorless
1/10/2025 at 11:25:22 PM
Tempo's a backend/sink for traces, but if you click through to the Tempo docs and find out how to generate tracing data[1], you learn that you have two options: OpenTelemetry, which they recommend, and Zipkin, which they do not recommend.
[1] https://grafana.com/docs/tempo/latest/getting-started/instru...
by flurie
1/11/2025 at 10:11:21 PM
"I don't want solutions, I want to be mad."
by paulddraper
1/11/2025 at 2:03:24 AM
Tempo is a traces server. Prometheus is a metrics server.
Grafana, the same company that develops and sells Tempo created a horizontally scalable version of Prometheus called Mimir.
OpenTelemetry is an ecosystem, not just 1 app. It’s protocols, libraries, specs, a Collector (which acts as a clearinghouse for metrics+traces+logs data). It’s bigger than just Tempo. The intention of OTel seems to be to decouple the protocol from the app by having adapters for all of the pieces.
by thephyber
1/11/2025 at 7:48:31 AM
Prometheus is not only a metrics server, it's also become the de-facto metrics exposition format.
by Too
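For readers who have not seen the exposition format mentioned in the comment above, here is a minimal sketch that renders one counter in the Prometheus text format (`# HELP` and `# TYPE` lines followed by one line per labelled sample). The `render_counter` helper and the metric shown are hypothetical; real applications would normally use a client library rather than hand-rolling this.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render a counter as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


output = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "get", "status": "200"}, 1027)],
)
print(output, end="")
```

A scraper fetches text like this over HTTP at each interval, which is the pull model discussed earlier in the thread.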
1/11/2025 at 12:09:18 AM
You probably don't understand what Otel is if you think that Prometheus is an alternative.
by Thaxll
1/11/2025 at 3:21:43 AM
You'd do better to point out which distinction you think the parent poster is missing.
My guess is that Prometheus cannot do distributed tracing, while OpenTelemetry can. Is that what you meant?
by MathMonkeyMan
1/11/2025 at 12:25:38 PM
Otel is a spec. You can create your own clients/aggregators/etc. The problem is that if nobody does it, there will be no tooling. So Otel created some tooling (and yes, it's bad) for people to use.
Some companies (e.g. Datadog) are contributing to the tooling but I think most companies would rather spend dev time on their own platforms than something that anybody (competitor) can use.
by csomar
1/11/2025 at 5:20:49 PM
From the user side, a spec isn't helpful unless it has implementations. And the official implementations are complicated compared to prometheus.
by bluesnews
1/11/2025 at 8:14:42 PM
I worked on a team that produced a distributed tracing library. We were tasked with interoperating with OpenTelemetry, or at least figuring out what that means.
My teammate said that at a previous job he wanted to add OpenTelemetry tracing to some C++ code he was working on. He took one look at the reference implementation for C++ OpenTelemetry and decided instead to write his own tracing library that sends gRPC to the OpenTelemetry collector.
It's also worth noting that, at least last time I checked, the reference implementations per programming language are less like reference implementations of some specification, and more like "this is the code you use to do OpenTelemetry in this language."
by MathMonkeyMan
1/11/2025 at 2:38:33 AM
Why Otel compared to prometheus+syslog+(favorite way to do request tagging, eg: MDC in slf4j)+grep?
Syslog is kinda a pain, but it's an hour of work and log aggregation is set up. Is the difference the pain of doing simple things with elastic compute and kubernetes?
by seadan83
1/11/2025 at 4:50:01 AM
Typically this is a subset of OTel that's being compared. Almost everything (aside from Datadog's proprietary stuff) is just smaller than OTel is, which is why it's often chosen for many different needs.
In my experience, it's often folks who have experience setting up metrics or log collection with something smaller (e.g., StatsD) and sometimes for purposes with less scope, who have the most frustration with OTel. All the concepts are different, carry different names, have different configs, have different quirks, etc. There's often an expectation that things will largely be the same as before and that they can carry over the cursed knowledge they had from the other toolset.
by phillipcarter
1/10/2025 at 6:30:46 PM
Simpler near-term, but more painful long term when you want to switch vendors/stacks.
by bushbaba
1/10/2025 at 7:56:21 PM
Nine times out of ten, I've got more valuable problems to solve than a theoretical future change of our vendor/stack for telemetry. I'll gladly borrow from my future self's time if it means I can focus on something more important right now.
by kemitche
1/10/2025 at 9:16:11 PM
I did our migration from StatsD to OTEL because our third party StatsD service was getting flaky. The first person from Ops to get to me pushed OTEL. The rest were fine with Prometheus and it was late in the process before they realized what had happened. I believe if we had gone straight to Prometheus I would have been done in half the time and solved half the problems I had to solve anyway for OTEL. If someone had to replace it again in the future I fully believe it would have taken cumulatively as much time to go StatsD->Prometheus->OTEL as it took to go StatsD->OTEL, especially when you consider that OTEL is not quite baked.
Meanwhile functionality to retain and recruit new customers sat in the backlog.
Edit to add: also regarding the perf issues I saw: do you really want to pay for an extra server or half a server in your cluster just in case some day comes? These decisions were much fuzzier when you ordered hardware once every two years and just had to live with the capacity you got.
by hinkley
1/10/2025 at 6:41:30 PM
And switching log implementations can be a pain in the butt. Ask me how I know.
But I’d rather do that three more times before I want to see OpenTelemetry again.
Also Prometheus is getting OTEL interop.
by hinkley
1/10/2025 at 7:02:28 PM
Is this the same scam as "standard SQL"? Switching database products is never straightforward in practice, despite any marketing copy or wishful thinking.
Prometheus ecosystem is very interoperable, by the way.
by pphysch
1/11/2025 at 6:51:30 PM
It's not a "scam", the protocols and clients are 100% scrutable. Not sure why you used that word.
by pdimitar
1/10/2025 at 9:47:44 PM
Using otel from C++ side... To have cumulative metrics from multiple applications (i.e. not "statsd/delta") I create a relatively low cardinality process.vpid integer (and somehow coordinate this number to be unique as long as the app emitting it is still alive) - you can use some global object to coordinate it.
Then you can have something that sums, and removes the attribute.
With statsd/delta if you lose sending a signal - then all data gets skewed, with cumulation - you only lose precision.
edit... forgot to say - my use case is "push based" metrics as these are coming from "batch" tools, not long running processes that can be scraped.
by malkia
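The delta versus cumulative trade-off in the comment above can be shown with a small simulation: one report is lost in transit, and the two reporting styles recover differently. The helper names and numbers are made up for illustration, and the last step assumes per-process series keyed by a `process.vpid`-style integer as the comment describes.

```python
def total_from_deltas(deltas):
    """Delta reporting: the receiver adds up every increment it sees."""
    return sum(d for d in deltas if d is not None)


def total_from_cumulative(samples):
    """Cumulative reporting: the latest sample received is the total."""
    received = [s for s in samples if s is not None]
    return received[-1] if received else 0


# One process counts 5, 3, 7, 2 events over four intervals.
increments = [5, 3, 7, 2]
running = []
total = 0
for inc in increments:
    total += inc
    running.append(total)  # running totals: 5, 8, 15, 17

# Simulate losing the second report in transit (None = lost).
lost_delta = [5, None, 7, 2]
lost_cumulative = [running[0], None, running[2], running[3]]

print(total_from_deltas(lost_delta))           # 14: the lost 3 never comes back
print(total_from_cumulative(lost_cumulative))  # 17: only one data point is missing

# Summing the latest cumulative value per process and dropping the
# per-process attribute gives a single fleet-wide series.
latest_by_vpid = {0: 17, 1: 4}
fleet_total = sum(latest_by_vpid.values())
print(fleet_total)  # 21
```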
1/10/2025 at 6:13:53 PM
This matches my experience. Very difficult to understand what I needed to get the effect I wanted.
by mkeedlinger
1/10/2025 at 8:07:39 PM
I wonder what your experience is with Sentry? Not just for error reporting but especially also their support for traces.
Also open-source & self-hostable.
by Xeago
1/11/2025 at 3:20:57 AM
Likely only a handful of people care, but Sentry hasn't been open source in quite a while https://github.com/getsentry/sentry/blob/24.12.1/LICENSE.md (I'd have to do tag-spelunking to find the last Apache 2 version)
Glitchtip is the Sentry compatible open source (MIT) one https://gitlab.com/glitchtip/glitchtip-backend/-/blob/v4.2.2... with the extra advantage that it doesn't require like 12 containers to deploy (e.g. https://github.com/getsentry/self-hosted/blob/24.12.1/docker... )
by mdaniel
1/11/2025 at 12:32:10 AM
Sentry is not horizontally scalable, thus ~ not-scalable at all, if your company is big.
by mathfailure
1/11/2025 at 11:50:35 AM
That's a fair point, but scaling it vertically can take you very far in my experience.
by Fidelix
1/10/2025 at 9:49:58 PM
Quota/pricing.
by malkia
1/11/2025 at 12:06:50 AM
Same. I implemented Otel once and exactly once. I wouldn't wish it on any company.
Otel is a design by committee garbage pile of half baked ideas.
by silisili
1/10/2025 at 5:59:54 PM
There are a lot of Java programmers working on it.
(And some Go tbf.)
by paulddraper
1/10/2025 at 6:13:59 PM
Yeah and a blind man can see this, it’s so loud.
by hinkley