I got OpenTelemetry to work. But why was it so complicated?

1/10/2025 at 5:43:38 PM

The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.

And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes reporting the same tags, OTEL clobbers). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we couldn’t vertically scale. Other factors on that project mean that’s now unlikely to happen but it grates.

OTEL is actively hostile to any language that uses one process per core. What a joke.

Just go with Prometheus. It’s not like there are other contenders out there.

by hinkley
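
For readers wondering what "doing our own aggregation" might look like, here is a minimal sketch, not the poster's actual code. It assumes counters only, and the class name, tag allow-list, and metric names are all hypothetical. The idea is to sum increments in-process and drop high-cardinality tags, so the exporter sees one point per series per interval instead of one call per event.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

# Hypothetical names throughout; this is not the StatsD or OpenTelemetry API.
Key = Tuple[str, FrozenSet[Tuple[str, str]]]


class CounterAggregator:
    """Sums counter increments in-process so one point per series is exported."""

    def __init__(self, allowed_tags: Iterable[str]) -> None:
        # Tags outside this allow-list are dropped to limit series proliferation.
        self._allowed = frozenset(allowed_tags)
        self._totals: Dict[Key, float] = defaultdict(float)

    def increment(self, name: str, value: float = 1.0, **tags: str) -> None:
        kept = frozenset((k, v) for k, v in tags.items() if k in self._allowed)
        self._totals[(name, kept)] += value

    def flush(self) -> Dict[Key, float]:
        """Return the accumulated totals and reset, e.g. once per export interval."""
        snapshot = dict(self._totals)
        self._totals.clear()
        return snapshot


agg = CounterAggregator(allowed_tags=["route", "status"])
agg.increment("http.requests", route="/users", status="200", user_id="42")
agg.increment("http.requests", route="/users", status="200", user_id="43")
agg.increment("http.requests", route="/users", status="500", user_id="42")
points = agg.flush()  # two series remain; user_id was dropped
```

Whether this actually lowers CPU use depends on where the cost sits in a given SDK, so treat it as an illustration of the approach rather than a measured fix.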

1/10/2025 at 10:50:01 PM

I'm fairly convinced that OTEL is in a form of 'vendor capture', i.e. because the only way to get a standard was to compromise with various bigcorps and sloppy startups to glue-gun it all together.

I tried doing a simple otel setup in .NET and after a few hours of trying to grok the documentation of the vendor my org has chosen, hopped into a discord run by a colleague that has part of their business model around 'pay for the good otel on the OSS product' and immediately stated that whatever it cost, it was worth the money.

I'd rather build another reliable event/pubsub library without prior experience than try to implement OTEL.

by to11mtm

1/10/2025 at 10:22:48 PM

> It’s not like there are other contenders out there.

Apache SkyWalking might be worth a look in some circumstances: it doesn't eat too many resources and is fairly straightforward to set up and run. It's admittedly somewhat jank (not the most polished UI or docs), but works okay: https://skywalking.apache.org/

Also I quite liked that a minimal setup is indeed pretty minimal: a web UI, a server instance and a DB that you already know https://skywalking.apache.org/docs/main/latest/en/setup/back...

In some ways, it's a lot like Zabbix in the monitoring space - neither will necessarily impress anyone, but both have a nice amount of utility.

by KronisLV

1/10/2025 at 6:22:21 PM

This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.

by sethops1

1/10/2025 at 7:27:30 PM

Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.

How would you build the "holy grail" map that shows a trace of every sub component in a transaction broken down by start/stop time etc... for instance show the load balancer see a request, the request get handled by middlewares etc, then go onto some kind of handler/controller, the sub-queries inside of that like database calls or cache calls. I don't think that is possible with prometheus?

by whalesalad

1/10/2025 at 7:49:59 PM

> Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.

Correct. Prometheus is just metrics.

The main argument for oTel is that instead of one proprietary vendor SDK or importing prometheus and jaeger and whatever you want to use for logging, just import oTel and all that will be done with a common / open data format.

I still believe in that dream but it's clear that the whole project needs some time/resources to mature a bit more.

If anybody remembers the Terraform/ToFu drama, it's been really wild to see how much support everybody pledged for ToFu but all the traditional observability providers have just kinda tolerated oTel :/

by baby_souffle

1/10/2025 at 9:08:40 PM

Yeah, part of the problem is it’s called OpenTelemetry and half of you are only talking about tracing, not metrics. Telemetry is metrics. It’s been metrics since at least the Mercury Program.

Metrics in OTEL is about three years old and it’s garbage for something that’s been in development for three years.

by hinkley

1/11/2025 at 1:03:45 PM

It looks like a hassle to implement, ngl.

by tonyhart7

1/10/2025 at 8:45:22 PM

Code traces are metrics. Run times per function calls metrics, count of specific function call metrics.

Otel is an attempt to package such arithmetic.

Web apps have added so many layers of syntax sugar and semantic wank, we’ve lost sight of the fact that it's all just the same old math operations relative to different math objects. Sets are not triangles but both are tested, quantified, and compared with the same old mathematical ops we learn by middle school.

by niftaystory

1/10/2025 at 8:55:58 PM

No, code traces are not just metrics; and while you can knit together something approximating traces from metrics, you'll quickly run into the reason why traces are a distinct thing. First, in a distributed system, you'll discover that you can't rely on clocks to get the timing of subsecond events correct. Second, you'll be contextless about code paths. So, you might independently reinvent the idea of passing along a context - and now you're just making your own tracing system but without any of the benefit of building on years of existing discoveries in this field.

OTel does feel a little bit heavy, unless you're already used to e.g. New Relic, Dynatrace, etc. where you have to run an agent process and instrumentize your code to some extent; it's never going to be free to audit every function call! This is why (a) you sample down and don't keep every trace, and (b) unless your company is extremely flush with cash you probably don't run tracing in every environment. If you can get away with it just in a staging or perf test env you can reap most of the benefit without the production impact and cost.

by mikestorrent
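
The "sample down" point can be sketched as follows. This is a minimal illustration of ratio-based sampling keyed on the trace ID, assuming 128-bit hex trace IDs; `should_sample` is a hypothetical helper and not any SDK's sampler. Because the decision is a pure function of the ID, every service that sees the same trace makes the same keep/drop choice without coordinating.

```python
import secrets


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep roughly `ratio` of traces, deciding from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the ID as a uniformly distributed number.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_bits < int(ratio * (1 << 64))


# Simulate 10,000 random trace IDs and keep about 10% of them.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, 0.10)]
```

This is head sampling: the decision is made before the trace finishes, so slow or failing requests are dropped at the same rate as healthy ones.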

1/10/2025 at 9:01:40 PM

All those things you describe are computable metrics. They have to be or Otel itself would not be able to compute them for consumption. All you described are cherry picked semantic indirections to obfuscate it’s all just a computer computing metrics of its own memory states.

Sorry for knowing how computers actually work (EE grad not a CS grad). I know that can frustrate CS grads who think their preferred OS and favorite programming language is how a computer works. You’re describing how contemporary SWEs view their day job.

Edit: teleMETRY …what’s in a name? Oh right …meaning.

by niftaystory

1/11/2025 at 12:27:49 AM

To be a smart-ass, one has to be smart first. Quit this.

by mathfailure

1/10/2025 at 9:47:38 PM

As a no grad to EE grad: traces mean a bundle of metrics that varies in structure, hence you can't store and process them as effectively as a list of counters unless you have a distinct bin for each possible trace, combinatorial explosion y'know.

by chupasaurus

1/10/2025 at 11:55:21 PM

You know the conversation is going well for you when you resort to citing the "meaning" of a name instead of, you know, base reality. Who needs the territory, I've got my map right here.

Speaking of meaning, the best I can make of your point is that you're using a much broader definition of "metrics" than the rest of this conversation, and in particular broader than Prometheus (remember context? very important for "meaning"!) supports. That or you really just don't know what a "trace" is (in this context).

by andrewflnr

1/10/2025 at 11:52:28 PM

OpenTelemetry's traces are trees of spans. You cannot represent this efficiently without a combinatorial explosion of labels.

You may be thinking of metrics in the sense of counters and gauges, but that's not the data model that OpenTelemetry (and before it, Zipkin, Jaeger, and OpenCensus) uses for traces.

The data model for tracing is to emit events that provide a span ID and an optional parent span ID. The event collector can piece these together into a tree after the fact, which will work as long as the parent structure is maintained.

Prometheus is absolutely not suitable for this.

Quibbling about the word "telemetry" doesn't really help here. OpenTelemetry supports three completely different subsets of functionality: metrics (counters, gauges, histograms), traces (span events in a tree structure), and logging (structured log events). They each have completely different client interfaces.

by atombender
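
The span model described above (each event carries a span ID and an optional parent span ID, and the collector rebuilds the tree afterwards) can be sketched roughly like this. The field names and span names are made up for illustration and are not the OTLP schema.

```python
from collections import defaultdict
from typing import Dict, List, Optional, TypedDict


class SpanEvent(TypedDict):
    span_id: str
    parent_id: Optional[str]
    name: str


def build_tree(events: List[SpanEvent]) -> Dict[Optional[str], List[SpanEvent]]:
    """Index spans by parent ID; root spans are listed under None."""
    children: Dict[Optional[str], List[SpanEvent]] = defaultdict(list)
    for event in events:
        children[event["parent_id"]].append(event)
    return children


def render(children, parent_id=None, depth=0) -> List[str]:
    """Depth-first walk producing one indented line per span."""
    lines: List[str] = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span["name"])
        lines.extend(render(children, span["span_id"], depth + 1))
    return lines


# Events may arrive in any order; only the parent links matter.
events: List[SpanEvent] = [
    {"span_id": "c", "parent_id": "b", "name": "SELECT users"},
    {"span_id": "a", "parent_id": None, "name": "GET /users"},
    {"span_id": "d", "parent_id": "b", "name": "cache lookup"},
    {"span_id": "b", "parent_id": "a", "name": "UsersController.index"},
]
tree_lines = render(build_tree(events))
```

Note that the shape of the tree is data, not a label set, which is why it does not map cleanly onto counters and gauges.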

1/10/2025 at 8:59:28 PM

huh? I've always heard and read and experienced that "logs, traces, metrics" are the 3 legs of the observability stool.

by chrisweekly

1/10/2025 at 9:12:55 PM

Open teleMETRY

Any guesses as to etymology?

by niftaystory

1/10/2025 at 9:55:51 PM

By this logic, you can say that logging, metrics and tracing are all fundamentally just different kinds of data and we should be calling it just plain databases and CRUD.

They're related, but people have a very specific idea and concept of what each is, you haven't actually provided a good argument why we should throw out these distinctions just because they somewhat resemble each other if you ignore a few details

by ffsm8

1/10/2025 at 9:28:25 PM

Prometheus is good, but let's be clear...you don't get tracing.

by paulddraper

1/10/2025 at 11:07:23 PM

For tracing FOSS: Grafana Tempo.

https://grafana.com/oss/tempo/

by PeterCorless

1/10/2025 at 11:25:22 PM

Tempo's a backend/sink for traces, but if you click through to the Tempo docs and find out how to generate tracing data[1], you learn that you have two options: OpenTelemetry, which they recommend, and Zipkin, which they do not recommend.

[1] https://grafana.com/docs/tempo/latest/getting-started/instru...

by flurie

1/11/2025 at 10:11:21 PM

"I don't want solutions, I want to be mad."

by paulddraper

1/11/2025 at 2:03:24 AM

Tempo is a traces server. Prometheus is a metrics server.

Grafana, the same company that develops and sells Tempo created a horizontally scalable version of Prometheus called Mimir.

OpenTelemetry is an ecosystem, not just 1 app. It’s protocols, libraries, specs, a Collector (which acts as a clearinghouse for metrics+traces+logs data). It’s bigger than just Tempo. The intention of OTel seems to be to decouple the protocol from the app by having adapters for all of the pieces.

by thephyber

1/11/2025 at 7:48:31 AM

Prometheus is not only a metrics server, it's also become the de-facto metrics exposition format.

by Too
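
For anyone who hasn't looked at it, the exposition format is plain text that a scraper fetches over HTTP. Below is a rough sketch of rendering a counter in that style, assuming only the basic `# HELP` / `# TYPE` / sample lines; `render_counter` is a hypothetical helper, not the Prometheus client library, and it does not escape the help text.

```python
from typing import Dict, List, Tuple


def escape_label_value(value: str) -> str:
    # The text format escapes backslash, double quote, and newline in label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str,
                   samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one counter family as text-format lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "code": "200"}, 1027.0),
     ({"method": "POST", "code": "500"}, 3.0)],
)
```

In practice you would use the official client for your language; the point is only that the wire format is simple enough to read by eye.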

1/11/2025 at 12:09:18 AM

You probably don't understand what Otel is if you think that Prometheus is an alternative.

by Thaxll

1/11/2025 at 3:21:43 AM

You'd do better to point out which distinction you think the parent poster is missing.

My guess is that Prometheus cannot do distributed tracing, while OpenTelemetry can. Is that what you meant?

by MathMonkeyMan

1/11/2025 at 12:25:38 PM

Otel is a spec. You can create your own clients/aggregators/etc.. The problem is that if nobody does it, there will be no tooling. So Otel created some tooling (and yes, it's bad) for people to use.

Some companies (ie: Datadog) are contributing to the tooling but I think most companies would rather spend dev time on their own platforms than something that anybody (competitor) can use.

by csomar

1/11/2025 at 5:20:49 PM

From the user side, a spec isn't helpful unless it has implementations. And the official implementations are complicated compared to prometheus.

by bluesnews

1/11/2025 at 8:14:42 PM

I worked on a team that produced a distributed tracing library. We were tasked with interoperating with OpenTelemetry, or at least figuring out what that means.

My teammate said that at a previous job he wanted to add OpenTelemetry tracing to some C++ code he was working on. He took one look at the reference implementation for C++ OpenTelemetry and decided instead to write his own tracing library that sends gRPC to the OpenTelemetry collector.

It's also worth noting that, at least last time I checked, the reference implementations per programming language are less like reference implementations of some specification, and more like "this is the code you use to do OpenTelemetry in this language."

by MathMonkeyMan

1/11/2025 at 2:38:33 AM

Why Otel compared to prometheus+syslog+(favorite way to do request tagging, eg: MDC in slf4j)+grep?

Syslog is kinda a pain, but it's an hour of work and log aggregation is set up. Is the difference the pain of doing simple things with elastic compute and kubernetes?

by seadan83

1/11/2025 at 4:50:01 AM

Typically this is a subset of OTel that's being compared. Almost everything (aside from Datadog's proprietary stuff) is just smaller than OTel is, which is why it's often chosen for many different needs.

In my experience, it's often folks who have experience setting up metrics or log collection with something smaller (e.g., StatsD) and sometimes for purposes with less scope, who have the most frustration with OTel. All the concepts are different, carry different names, have different configs, have different quirks, etc. There's often an expectation that things will largely be the same as before and that they can carry over the cursed knowledge they had from the other toolset.

by phillipcarter

1/10/2025 at 6:30:46 PM

Simpler near-term, but more painful long term when you want to switch vendors/stacks.

by bushbaba

1/10/2025 at 7:56:21 PM

Nine times out of ten, I've got more valuable problems to solve than a theoretical future change of our vendor/stack for telemetry. I'll gladly borrow from my future self's time if it means I can focus on something more important right now.

by kemitche

1/10/2025 at 9:16:11 PM

I did our migration from StatsD to OTEL because our third party StatsD service was getting flaky. The first person from Ops to get to me pushed OTEL. The rest were fine with Prometheus and it was late in the process before they realized what had happened. I believe if we had gone straight to Prometheus I would have been done in half the time and solved half the problems I had to solve anyway for OTEL. If someone had to replace it again in the future I fully believe it would have taken cumulatively as much time to go StatsD->Prometheus->OTEL as it took to go StatsD->OTEL, especially when you consider that OTEL is not quite baked.

Meanwhile functionality to retain and recruit new customers sat in the backlog.

Edit to add: also regarding the perf issues I saw: do you really want to pay for an extra server or half a server in your cluster just in case some day comes? These decisions were much fuzzier when you ordered hardware once every two years and just had to live with the capacity you got.

by hinkley

1/10/2025 at 6:41:30 PM

And switching log implementations can be a pain in the butt. Ask me how I know.

But I’d rather do that three more times before I want to see OpenTelemetry again.

Also Prometheus is getting OTEL interop.

by hinkley

1/10/2025 at 7:02:28 PM

Is this the same scam as "standard SQL"? Switching database products is never straightforward in practice, despite any marketing copy or wishful thinking.

Prometheus ecosystem is very interoperable, by the way.

by pphysch

1/11/2025 at 6:51:30 PM

It's not a "scam", the protocols and clients are 100% scrutable. Not sure why you used that word.

by pdimitar

1/10/2025 at 9:47:44 PM

Using otel from C++ side... To have cumulative metrics from multiple applications (e.g. not "statsd/delta") I create a relatively low cardinality process.vpid integer (and somehow coordinate this number to be unique as long as the app emitting it is still alive) - you can use some global object to coordinate it.

Then you can have something that sums, and removes the attribute.

With statsd/delta, if you lose a signal in transit then all the data gets skewed; with cumulative reporting you only lose precision.

edit... forgot to say - my use case is "push based" metrics as these are coming from "batch" tools, not long running processes that can be scraped.

by malkia
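
A toy illustration of the cumulative-versus-delta point above, under stated assumptions: each process reports under its own `vpid`, counters never reset, and all function names are hypothetical. With cumulative values a lost report is repaired by the next one; with deltas the lost increment is gone for good.

```python
from typing import Dict, List, Optional, Tuple

# Each report is (vpid, cumulative_count); the names are hypothetical.
Report = Tuple[int, int]


def total_from_cumulative(reports: List[Report]) -> int:
    """Keep the latest cumulative value per vpid, then sum and drop the vpid."""
    latest: Dict[int, int] = {}
    for vpid, value in reports:
        # Assumes counters only grow (no process restarts reusing a vpid).
        latest[vpid] = max(value, latest.get(vpid, 0))
    return sum(latest.values())


def total_from_deltas(deltas: List[Optional[int]]) -> int:
    """Sum delta reports; a lost report (None) is simply never counted."""
    return sum(d for d in deltas if d is not None)


# Two processes each count 30 events and report three times.
# One of process 1's reports is lost in transit in both schemes.
cumulative_reports = [(1, 10), (2, 10), (2, 20), (1, 30), (2, 30)]  # (1, 20) lost
delta_reports = [10, 10, None, 10, 10, 10]                          # one delta lost

cumulative_total = total_from_cumulative(cumulative_reports)  # 60, the true total
delta_total = total_from_deltas(delta_reports)                # 50, undercounted
```

The cost of the cumulative scheme is the extra per-process attribute, which is why it needs to stay low cardinality and be summed away downstream.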

1/10/2025 at 6:13:53 PM

This matches my experience. Very difficult to understand what I needed to get the effect I wanted.

by mkeedlinger

1/10/2025 at 8:07:39 PM

I wonder what your experience is with Sentry? Not just for error reporting but especially also their support for traces.

Also open-source & self-hostable.

by Xeago

1/11/2025 at 3:20:57 AM

Likely only a handful of people care, but Sentry hasn't been open source in quite a while https://github.com/getsentry/sentry/blob/24.12.1/LICENSE.md (I'd have to do tag-spelunking to find the last Apache 2 version)

Glitchtip is the Sentry compatible open source (MIT) one https://gitlab.com/glitchtip/glitchtip-backend/-/blob/v4.2.2... with the extra advantage that it doesn't require like 12 containers to deploy (e.g. https://github.com/getsentry/self-hosted/blob/24.12.1/docker... )

by mdaniel

1/11/2025 at 12:32:10 AM

Sentry is not horizontally scalable, thus ~ not-scalable at all, if your company is big.

by mathfailure

1/11/2025 at 11:50:35 AM

That's a fair point, but scaling it vertically can take you very far in my experience.

by Fidelix

1/10/2025 at 9:49:58 PM

Quota/pricing.

by malkia

1/11/2025 at 12:06:50 AM

Same. I implemented Otel once and exactly once. I wouldn't wish it on any company.

Otel is a design by committee garbage pile of half baked ideas.

by silisili

1/10/2025 at 5:59:54 PM

There are a lot of Java programmers working on it.

(And some Go tbf.)

by paulddraper

1/10/2025 at 6:13:59 PM

Yeah and a blind man can see this, it’s so loud.

by hinkley

1/10/2025 at 7:32:54 PM

Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDK’s, agents and API’s. This is what Otel wants to solve and I think the people behind it are doing a great job. Also kudos to grafana for adopting OpenTelemetry as a first class citizen of their ecosystem.

I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid size company and large enterprises. So as years passed and OpenTelemetry API’s and SDK’s stabilized it became our standard for application observability.

To be honest the documentation could be better overall and the onboarding docs differ per programming language, which is not ideal.

My current team is on a NodeJS/Typescript stack and we’ve created a set of packages and an example Grafana stack to get started with OpenTelemetry real quick. Maybe it’s useful to anyone here: https://github.com/zonneplan/open-telemetry-js

by rtuin

1/10/2025 at 10:52:52 PM

> Otel seems complicated because different observability vendors make implementing observability super easy with their proprietary SDK’s, agents and API’s. This is what Otel wants to solve and I think the people behind it are doing a great job.

Wait... so, the problem is that everyone makes it super easy, and so this product solves that by being complicated? ;P

by saurik

1/10/2025 at 10:59:32 PM

The problem is that they make it super easy in very hacky ways and it becomes painful to improve things without startup money.

Also, per the hackiness, it tends to have visible perf impact. I know with dynatrace agent we had 0-1ms metrics pop up to 5-10ms (this service had a lot of traffic so it added up) and I'm pretty sure on .NET side there are issues around general performance of OTEL. I also know some of the work/'fun' colleagues have had to endure to make OTEL performant for their libs, in spite of the fact it was a message passing framework where that should be fairly simple...

by to11mtm

1/11/2025 at 5:34:04 AM

Well let's be fair. You can't get the type of telemetry Dynatrace provides "for free". You have to pay for it somewhere. Pretty sure you can exclude the agent from instrumenting performance critical parts of the code, if that is your concern.

by laichzeit0

1/10/2025 at 10:56:33 PM

> I’ve been pushing the use of Datadog for years but their pricing is out of control for anyone between mid size company and large enterprises

Not a fan of datadog vs just good metric collection. OTOH while I see the value of OTEL vs what I prefer to do... in theory.

My biggest problem with all of the APM vendors, once you have kernel hooks via your magical agent all sorts of fun things come up that developers can't explain.

My favorite example: At another shop we eventually adopted Dynatrace. Thankfully our app already had enough built-in metrics that a lead SRE considered it a 'model' for how to do instrumentation... I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts as well as a directly measured drop in performance. [0]

Ironically, the metrics saved us from grief, yet nobody had an idea how to fix it. ;_;

[0] - Curiously, the 'worst' one was MSSQL failovers on update somehow polluting our ADO.NET connection pools in a bad way...

by to11mtm

1/10/2025 at 11:01:02 PM

> I say that because, as soon as Dynatrace agents got installed on the app hosts, we started having various 'heisenbugs' requiring node restarts

Our containers regularly fail due to vague LD_PRELOAD errors. Nobody has invested the time to figure out what the issue is because it usually goes away after restarting; the issue is intermittent and non-blocking, yet constant.

It's miserable.

by richbell

1/11/2025 at 2:40:33 AM

We do at least one rolling restart a day because it’s the best way to GC. And we’re not using any APM yet

by a012

1/11/2025 at 1:08:11 AM

Thank you! I'm very interested in that.

by EdwardDiego

1/10/2025 at 5:36:57 PM

It is as complicated as you want or need it to be. You can avoid any magic and stick to a subset that is easy to reason about and brings the most value in your context.

For our team, it is very simple:

* we use a library to send traces and traces only[0]. They bring the most value for observing applications and can contain all the data the other types can contain. Basically hash-maps vs strings and floats.

* we use manual instrumentation as opposed to automatic - we are deliberate in what we observe and have a great understanding of what emits the spans. We have naming conventions that match our code organization.

* we use two different backends - an affordable 3rd party service and an all-in-one Jaeger install (just run 1 executable or docker container) that doesn't save the spans on disk for local development. The second is mostly for peace of mind of team members that they are not going to flood the third party service.

[0] We have a previous setup to monitor infrastructure and in our case we don't see a lot of value of ingesting all the infrastructure logs and metrics. I think it is early days for OTEL metrics and logs, but the vendors don't tell you this.

by dimitar

1/10/2025 at 5:54:59 PM

It's as complicated as you want, but it's not as easy as I want. The floor is pretty high.

I'm still looking for an endpoint just to send simple one-off metrics to from parts of infrastructure that's not scrapable.

by madeofpalk

1/10/2025 at 5:58:16 PM

You can just send metrics via JSON to any otlphttp collector: https://github.com/open-telemetry/opentelemetry-proto/blob/v...

by pat2man
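
A rough sketch of what that JSON could look like for a single one-off gauge. The field names follow the OTLP JSON mapping as I understand it, so check them against the proto linked above. The service name, metric name, and scope name are made up, and the endpoint assumes a Collector listening locally on the default OTLP/HTTP port.

```python
import json
import time
import urllib.request
from typing import Any, Dict


def build_gauge_payload(service: str, metric: str, value: float) -> Dict[str, Any]:
    """Build an OTLP/JSON metrics payload holding a single gauge data point."""
    return {
        "resourceMetrics": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service}}
                ]
            },
            "scopeMetrics": [{
                "scope": {"name": "manual-one-off"},  # hypothetical scope name
                "metrics": [{
                    "name": metric,
                    "gauge": {
                        "dataPoints": [{
                            "asDouble": value,
                            # 64-bit integers are strings in the JSON encoding.
                            "timeUnixNano": str(time.time_ns()),
                        }]
                    },
                }],
            }],
        }]
    }


def send(payload: Dict[str, Any],
         endpoint: str = "http://localhost:4318/v1/metrics") -> int:
    """POST the payload; the default endpoint is an assumption about your setup."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


payload = build_gauge_payload("backup-job", "backup.duration_seconds", 12.5)
# send(payload)  # uncomment when a collector is listening on the assumed endpoint
```

You still need something on the other end to receive it, which is the part the vendor-style "send data" guides hide.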

1/10/2025 at 6:02:21 PM

Shame none of this comes up whenever I search for it!

Top Google result, for me, for 'send metrics to otel' is https://opentelemetry.io/docs/specs/otel/metrics/. If I go through to the Language APIs & SDKs, it's more of a whole bunch of useless junk: https://opentelemetry.io/docs/languages/js/

Compare to the InfluxDB "send data" getting started https://docs.influxdata.com/influxdb/cloud/api-guide/client-... which gives you exactly it in a few lines.

by madeofpalk

1/10/2025 at 6:39:28 PM

There's an excellent article on how to implement OpenTelemetry Tracing in 200 lines of code.

https://jeremymorrell.dev/blog/minimal-js-tracing/

"It might help to go over a non-exhaustive list of things the official SDK handles that our little learning library doesn’t:

- Buffer and batch outgoing telemetry data in a more efficient format. Don’t send one-span-per-http request in production. Your vendor will want to have words.

- Gracefully handle errors, wrap this library around your core functionality at your own peril"

You can solve them of course, if you can

by blue_pants
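
The two gaps quoted above (batching, and not letting telemetry errors take down the app) can be sketched in a few lines. This is an illustration under assumptions, not the official SDK's batch processor: `BatchingExporter` and `send_batch` are hypothetical names, and there is no timer, retry, or size limit on the payload.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]


class BatchingExporter:
    """Buffers spans and hands them to `send_batch` in groups."""

    def __init__(self, send_batch: Callable[[List[Span]], None],
                 max_batch_size: int = 100) -> None:
        self._send_batch = send_batch
        self._max = max_batch_size
        self._buffer: List[Span] = []

    def export(self, span: Span) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._max:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call on a timer and at shutdown too."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        try:
            self._send_batch(batch)
        except Exception:
            # Telemetry failures must not break the application; drop the batch.
            pass


sent_batches: List[List[Span]] = []
exporter = BatchingExporter(sent_batches.append, max_batch_size=3)
for i in range(7):
    exporter.export({"name": f"span-{i}"})
exporter.flush()  # flush the remainder at shutdown
```

Dropping a failed batch silently is a deliberate simplification; a production exporter would at least count the drops.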

1/10/2025 at 6:09:20 PM

Maybe the confusion here is in comparing different things.

The InfluxData docs you're linking to are similar to Observability vendor docs, which do indeed amount to "here's the endpoint, plug it in here, add this API key, tada".

But OpenTelemetry isn't an observability vendor. You can send to an OpenTelemetry Collector (and the act of sending is simple), but you also need to stand that thing up and run it yourself. There's a lot of good reasons to do that, but if you don't need to run infrastructure right now then it's a lot simpler to just send directly to a backend.

Would it be more helpful if the docs on OTel spelled this out more clearly?

by phillipcarter

1/11/2025 at 12:40:47 PM

The problem is ecosystem wide - the documentation starts at 8/10 and is written for observability nerds where easy things are hard, and hard things are slightly harder.

I understand the role that all the different parts of OTel play in the ecosystem vs InfluxDB, but if you pay attention to that documentation page, it starts off with the easiest thing (here's how you manually send one metric), and then ramps up the capabilities and functionality from there. OTel docs slam you straight into "here's a complete observability stack for logs, metrics, and traces for your whole k8s deployment".

by madeofpalk

1/11/2025 at 4:15:04 PM

The equivalent page for this is the Get Started pages in OTel, e.g. https://opentelemetry.io/docs/languages/js/getting-started/n...

However, since OTel is not a backend, there's no pluggable endpoint + API key you can just start sending to. Since you were comparing the relative difficulties of sending data to a backend, that's why I responded in kind.

I do agree that it's more complicated, there's no argument there. And the docs have a very long way to go to highlight easier ways to do things and ramp up in complexity. There's also a lot more to document since OTel is for a wider audience of people, many of whom have different priorities. A group not talked about much in this thread is ops folks who are more concerned with getting a base level of instrumentation across a fleet of services, normalizing that data centrally, pulling in from external sources, and making sure all the right keys for common fields are named the right way. OTel has robust tools for (and must document) these use cases as well. And since most of us who work on it do so in spare time, or a part-time capacity at work, it's difficult to cover it all.

by phillipcarter

1/11/2025 at 7:02:34 PM

https://github.com/openobserve/openobserve, more or less.

First time it takes 5 minutes to setup locally, from then on you just run the command in a separate terminal tab (or Docker container, they have an image too).

by pdimitar

1/10/2025 at 5:49:43 PM

I did not find that manual instrumentation made things simpler. You’re trading a learning curve that now starts way before you can demonstrate results for a clearer understanding of the performance penalties of using this Rube Goldberg machine.

Otel may be okay for a green field project but turning this thing on in a production service that already had telemetry felt like replacing a tire on a moving vehicle.

by hinkley

1/10/2025 at 9:18:06 PM

I've not used otel for anything not greenfield, but I just wanted to say

> felt like replacing a tire on a moving vehicle.

Some people do this as a joke / dare. I mean literally replacing a car tire on a moving vehicle.

You Saudi drift up onto one side, and have people climb out of the side in the air, and then swap the tire while the car is driving on two wheels.

It's pretty insane stuff: https://youtu.be/Str7m8xV7W8?si=KkjBh6OvFoD0HGoh

by dmoy

1/10/2025 at 9:31:47 PM

That was the image I had in my head.

My whole career I’ve been watching people on greenfield projects looking down on devs on already successful products for not using some tool they’ve discovered, missing the fact that their tool only functions if you build your whole product around the exact mental model of the tool (green field).

Wisdom is learning to watch for people obviously working on brownfield projects espousing a tool. Like moving from VMs to Docker. Ansible to Kubernetes (maybe not the best example). They can have a faster adoption cycle and more staying power.

by hinkley

1/10/2025 at 11:18:46 PM

SAS Institute used that exact same analogy & even this video in their talk about implementing ScyllaDB back in 2020 (check out 0:35 in the video):

https://www.scylladb.com/2020/05/28/sas-institute-changing-a...

Seems like moving to OTel might even be a bit more complex for some brownfield folks.

by PeterCorless

1/10/2025 at 5:43:16 PM

Mind sharing what that affordable 3rd party service is?

by buzzdenver

1/10/2025 at 11:24:58 PM

honeycomb

by dimitar

1/10/2025 at 8:57:14 PM

Very sane advice. Most folks will already have something for metrics and logs and unless there's ROI on changing it out, why bother?

by mikestorrent

1/10/2025 at 8:49:51 PM

>You can avoid any magic and stick to a subset...

... if (and only if) all the libraries you use also stick to that subset, yea. That is overwhelmingly not true in my experience. And the article shows a nice concrete example of why.

For green-field projects which use nothing but otel and no non-otel frameworks, yea. I can believe it's nice. But I definitely do not live in that world yet.

by Groxx

1/10/2025 at 10:09:10 PM

One of my biggest problems was the local development story. I wanted logs, traces and metrics support locally but didn’t want to spin up a multitude of Docker images just to get that to work. I wanted logs to be able to check what my metrics, traces, baggage and activity spans look like before I deploy.

Recently, the .NET team launched .NET Aspire and it’s awesome. Super easy to visualize everything in one place in my local development stack and it acts as an orchestrator as code.

Then when we deploy to k8s we just point the OTEL endpoint at the DataDog Agent and everything just works.

We just avoid the DataDog custom trace libraries and SDK and stick with OTEL.

Now it’s a really nice development experience.

https://learn.microsoft.com/en-us/dotnet/aspire/fundamentals...

https://docs.datadoghq.com/opentelemetry/#overview

by junto

1/11/2025 at 1:57:50 PM

> I wanted logs, traces and metrics support locally but didn’t want to spin up a multitude of Docker images just to get that to work.

This project is really nice for that https://github.com/grafana/docker-otel-lgtm

by rochacon

1/11/2025 at 2:21:10 AM

There's also https://github.com/CtrlSpice/otel-desktop-viewer

by masterj

1/11/2025 at 6:58:10 PM

Just use https://github.com/openobserve/openobserve.

Takes 5 minutes to set it up locally on your dev machine the first time, from then on you can just have a separate terminal tab where you simply run `/path/to/openobserve` and that's it. They also offer a Docker image for local and remote running as well, if you don't want to have the grand complexity of a single statically-linked binary. :P

It's an all-in-one fully compliant OpenTelemetry backend with pretty graphs. I love it for my projects, hasn't failed me in any detectable way yet.

by pdimitar

1/12/2025 at 5:02:46 AM

I'm not convinced by .NET Aspire. It solves a small problem (service discovery and orchestration for local development of multi service projects). But it solves this by making service discovery and orchestration an application level concern. With Aspire you needlessly add complexity at the app level and get locked into a narrow ecosystem. There are many proven alternatives like docker compose for local development. Aspire is not even that much if at all easier than using docker compose and env vars.

by WuxiFingerHold

1/11/2025 at 12:15:13 AM

There are official all-in-one Docker images that have everything.

by Thaxll

1/11/2025 at 12:02:13 AM

If you are doing otel with python, use Logfire's client... even if you don't use their offering.

It's foss, and you can point it to any otel compat endpoint. Plus the client that the pydantic team made is 10 times better and simpler than the official otel lib.

Samuel Colvin has a cool interview where he explains how he got there: https://www.bitecode.dev/p/samuel-colvin-on-logfire-mixing-p...

by BiteCode_dev

1/10/2025 at 6:47:44 PM

Definitely can relate, this is why I started an open-source project that focuses on making OpenTelemetry adoption as easy as running a single command line: https://github.com/odigos-io/odigos

by edenfed

1/10/2025 at 6:00:34 PM

A lot of web frameworks etc do most of the instrumentation for you these days. For instance using opentelemetry-js and self hosting something like https://signoz.io should take less than an hour to get spun up and you get a ton of data without writing any custom code.

by pat2man

1/10/2025 at 6:10:58 PM

Agree. Here's the repo for SigNoz if you want to check it out - https://github.com/signoz/signoz

by pranay01

1/10/2025 at 6:12:46 PM

Context propagation isn't trivial on a multi-threaded async runtime. There are several ways to do it, but JVM agents that instrument bytecode are popular because they work transparently.

by hocuspocus

1/10/2025 at 6:57:14 PM

While that’s true, if you’ve already solved punching correlation-IDs and A/B testing (feature flags per request) through then you can use the same solution for all three. In fact you really should.

Ours was old so based on domain <dry heaving sounds>, but by the time I left the project there were just a few places left where anyone touched raw domains directly and you could switch to AsyncLocalStorage in a reasonable amount of time.

The simplest thing that could work is to pass the original request or response context everywhere but that… has its own struggles. It’s hell on your function signatures (so I sympathize with my predecessors not doing that but goddamn) and you really don’t want an entire sequence diagram being able to fire the response. That’s equivalent to having a function with 100 return statements in it.
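A minimal sketch of the "ambient context" approach described above, written in Python with `contextvars` as a stand-in for Node's AsyncLocalStorage. The names `correlation_id`, `handle_request` and `query_database` are made up for illustration; the point is only that the ID is set once at the edge and read deep in the call stack without appearing in any function signature.

```python
import asyncio
import contextvars

# Hypothetical name: holds the correlation ID of the request being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="unset")


async def query_database() -> str:
    # Deep in the call stack: no request object in the signature,
    # yet the ID set by the handler is visible here.
    await asyncio.sleep(0)
    return f"db call tagged with {correlation_id.get()}"


async def handle_request(request_id: str) -> str:
    # Set once at the edge. Each asyncio task runs in its own copy of the
    # context, so concurrent requests do not overwrite each other's ID.
    correlation_id.set(request_id)
    return await query_database()


async def main() -> list:
    # Two concurrent "requests", each wrapped in its own task by gather().
    return await asyncio.gather(handle_request("req-1"), handle_request("req-2"))


results = asyncio.run(main())
```

The same variable could carry feature flags or a trace/span ID, which is the "one solution for all three" idea; whether that fits a given codebase depends on how consistently the async boundaries preserve context.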

by hinkley

1/10/2025 at 5:50:34 PM

Same thing. OpenTelemetry grew up from Traces, but Metrics and Logs are much better left to specialized solutions.

Feels like a "leaky abstraction" (or "leaky framework") issue. If we wanted to put everything under one umbrella, then well, an SQL database can also do all these things at the same time! Doesn't mean it should.

by deepsun

1/10/2025 at 11:33:33 PM

Cramer wants to get traces out of OTel. Which is ironic because he's one of the creators of OpenTracing.

https://cra.mr/the-problem-with-otel/

by PeterCorless

1/13/2025 at 9:44:43 PM

He also started Sentry, so must know a thing or two on the topic.

by deepsun

1/10/2025 at 5:54:47 PM

I think giving metrics and logging a location in a trace is really useful.

But I still dislike OTel every time I have to deal with it.

by incangold

1/10/2025 at 6:01:26 PM

You can’t do fine grained tracing in OTEL because if you hit 500 spans in a single trace it starts dropping the trace. Basically a toy solution for brownfield work.

by hinkley

1/10/2025 at 7:33:16 PM

This is just not true. We have traces with hundreds of thousands of spans. Those are not very readable but that’s another problem.

by IneffablePigeon

1/10/2025 at 11:35:40 PM

How are you storing them, and what do you use to read/visualize/analyze them? I'd imagine just putting them up in a UI becomes a needle-in-a-haystack issue. Are you programmatically analyzing them?

by PeterCorless

1/11/2025 at 10:19:19 AM

Honeycomb. For shorter traces (most of them), a waterfall view is great. For those long ones, we try to split them up if it makes sense but you can also just run queries scoped to that trace to answer questions about it (how many of the spans are db queries, how many are this query, are they quick, etc etc)

by IneffablePigeon

1/10/2025 at 6:27:51 PM

As mentioned by philip below, 500 spans is a very small amount. I have seen customers send 1000s of spans in a trace very easily

by pranay01

1/10/2025 at 6:10:29 PM

...huh? I work with customers who (through a mistake) have created literally multi-million span traces using OTel. Are you referring to a particular backend?

by phillipcarter

1/10/2025 at 6:13:31 PM

AWS

by hinkley

1/10/2025 at 6:25:22 PM

Well that's a shame, I'm going to ask some folks about that. 500 spans per trace is ridiculously small and I can't imagine any good reason to have that limitation since it's just not that big of a footprint.

OTel doesn't define any limits on the # of spans in a trace (nor the # of attributes on a span!) but it will be bound by the limits of whatever backend you use. In the case of the one I work for, we do limit the total size of a span to be 1MB or less with 64KB per attribute before truncation. Other backends have different limitations. This is the first I've heard of such a small limitation on the total number of spans in a trace though. Traces are just (basically) collections of structured logs with in-built correlation IDs. I can't imagine why you'd limit them like this.
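To make the "structured logs with built-in correlation IDs" description concrete, here is a rough Python sketch. It is not the OTel data model or SDK: `SpanRecord` and its fields are hypothetical, the attribute key is just an example, and the 64 KB truncation simply mirrors the per-attribute figure quoted above for one backend.

```python
import json
import secrets
from dataclasses import asdict, dataclass, field
from typing import Optional

# Per-attribute limit quoted above; other backends will differ.
MAX_ATTRIBUTE_BYTES = 64 * 1024


@dataclass
class SpanRecord:
    """Hypothetical span: a log record plus IDs that tie it to a trace."""
    name: str
    trace_id: str                      # shared by every span in the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None    # None marks the root span
    attributes: dict = field(default_factory=dict)

    def set_attribute(self, key: str, value: str) -> None:
        # Truncate oversized values instead of rejecting the whole span.
        encoded = value.encode("utf-8")[:MAX_ATTRIBUTE_BYTES]
        self.attributes[key] = encoded.decode("utf-8", errors="ignore")

    def to_log_line(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


trace_id = secrets.token_hex(16)
root = SpanRecord(name="GET /orders", trace_id=trace_id)
child = SpanRecord(name="SELECT orders", trace_id=trace_id, parent_id=root.span_id)
child.set_attribute("db.query.text", "x" * (MAX_ATTRIBUTE_BYTES + 10))
lines = [root.to_log_line(), child.to_log_line()]
```

Nothing in this shape limits how many records share a `trace_id`, which is consistent with the span-count cap being a backend decision rather than something the format requires.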

by phillipcarter

1/10/2025 at 6:35:40 PM

That was two years ago (we tried spans before metrics), so it’s fuzzy. I believe the collector sidecar was fine with it but the backend was not, which complicated debugging. There’s not a clear feedback path in OpenTelemetry that we could find. I completely forgot to mention the tendency toward silent failures. That’s a cardinal sin for telemetry. I would take it out back and shoot it for that fact alone.

The other problem I noticed looking at the wire protocol was that the data for the parent trace doesn’t seem to get sent until the trace closes. That seems like a bookkeeping nightmare to me. There should be a start of trace packet and an update at the end. I shouldn’t have finished spans showing up before the parent trace has been registered. And that’s what it looked like in the dumps my OPs people sent me to debug.
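A toy Python illustration of the ordering described above, assuming the usual model in which a span is handed to the exporter only when it ends. This is not the OTel SDK; `span`, `exported` and `_stack` are hypothetical names, and `exported` stands in for what would show up on the wire.

```python
from contextlib import contextmanager

exported = []  # stands in for the wire: spans appear in the order they are sent
_stack = []    # names of currently open spans, innermost last


@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    try:
        yield
    finally:
        _stack.pop()
        # Exported only once the span has ended, so children always go first.
        exported.append({"name": name, "parent": parent})


with span("handle_request"):
    with span("load_user"):
        pass
    with span("render_page"):
        pass

order = [s["name"] for s in exported]
```

Here `order` ends with the root span, so a receiver sees children that reference a parent it has not been told about yet. A "start of trace" message would need a separate event at span start, which this export-on-end model does not have.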

by hinkley

1/11/2025 at 3:31:45 AM

Practically a given outcome, then; we could knock their Managed Prometheus offering off the Internet on the regular. It was just laughable for a company that prides itself in one trillion IAM transactions to 429 some metric ingest

by mdaniel

1/10/2025 at 7:00:47 PM

If you get to the end you find that the pain was all self-inflicted. I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.

by BugsJustFindMe

1/10/2025 at 7:55:45 PM

> I found it to be very easy in Python with standard stacks (mysql, flask, redis, requests, etc), because you literally just do a few imports at the top of your service and it automatically hooks itself up to track everything without any fuss.

Yes, but only if everything in your stack is supported by their auto instrumentation. Take `aiohttp` for example. The latest version is 3.11.X and ... their auto instrumentation claims to support `3.X` [0] but results vary depending on how new your `aiohttp` is versus the auto instrumentation.

It's _magical_ when it all just works, but that ends up being a pretty narrow needle to thread!

[0]: https://github.com/open-telemetry/opentelemetry-python-contr...

by baby_souffle

1/10/2025 at 9:56:32 PM

> their auto instrumentation claims to support `3.X`

Semver should never be treated as anything more than some tired programmer's shrug and prayer that nobody else notices the breakages they didn't notice themselves. Pin strict dependencies instead of loose ones, and upgrade only after integration testing.

There are only two kinds of updates, ones that intend to break something and ones that don't intend to break something, and neither one guarantees that the intent matches the outcome.

by BugsJustFindMe

1/11/2025 at 2:04:46 AM

> Semver should never be treated as anything more than some tired programmer's shrug and prayer that nobody else notices the breakages they didn't notice themselves.

That's precisely my point, but you said it better :).

I have had _mixed_ results getting auto instrumentation working reliably with packages that are - technically - supported.

by baby_souffle

1/10/2025 at 7:42:53 PM

So recently I needed to set this up for a very simple flask app. We're running otel-collector-contrib, jaeger-all-in-one, and prometheus on a single server with docker compose (has to be all within the corpo intranet for reasons..)

Traces work, and I have the spanmetrics exporter set up, and I can actually see the spanmetrics in prometheus if I query directly, but they won't show up in the jaeger "monitor" tab, no matter what I do.

I spent 3 days on this before my boss is like "why don't we just manually instrument and send everything to the SQL server and create a grafana dashboard from that" and agh I don't want to do that either.

Any advice? It's literally the simplest usecase but I can't get it to work. Should I just add grafana to the pile?

by verall

1/11/2025 at 7:15:35 PM

Try https://github.com/openobserve/openobserve, it's extremely easy to self-host and it's an all-in-one solution, dashboards included (though admittedly I've seen prettier ones).

by pdimitar

1/10/2025 at 7:57:53 PM

Yeah the biggest trouble really is on the dashboarding side of things, not the sending side, and is why there are popular SaaS products like datadog. If you're amenable to saas, datadog is probably the best way. Otherwise, look into SigNoz for a one-stop solution with minimal effort even if there are some rough edges still.

by BugsJustFindMe

1/10/2025 at 10:43:59 PM

We absolutely have to run it ourselves (...corporate reasons...), it's a lightweight service with only a few hundred users so we haven't had to worry much about perf (yet).

SigNoz does look interesting, I may give this a shot, thank you. I'm a bit concerned about it conflicting with other things going on in our docker-compose but it doesn't look too bad..

by verall

1/10/2025 at 7:11:45 PM

Until you run your server behind something like gunicorn and all of the auto imports stop working and you have to do it all yourself.

by etimberg

1/10/2025 at 9:58:46 PM

I found this to work fine https://opentelemetry-python.readthedocs.io/en/latest/exampl...

by jdsleppy

1/13/2025 at 2:35:20 PM

...but with manually running autoinstrumentation in the post fork hook.

I guess there is a lot of undocumented magic in OTel...
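For reference, a sketch of what a post-fork hook can look like in a `gunicorn.conf.py`, in the spirit of the docs example linked above. It assumes the `opentelemetry-sdk` and OTLP gRPC exporter packages are installed; the service name and the collector endpoint are placeholders, and it only sets up the tracer provider per worker. Any library instrumentation calls would presumably go in the same hook and are not shown.

```python
# gunicorn.conf.py (sketch)


def post_fork(server, worker):
    # Deferred imports: everything OTel-related is created inside the worker,
    # after the fork, so the batch processor's background thread lives in
    # this process rather than being lost in the parent.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Placeholder service name.
    resource = Resource.create({"service.name": "my-service"})
    provider = TracerProvider(resource=resource)

    # Assumed endpoint: a collector listening on the default OTLP/gRPC port.
    exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    server.log.info("OTel tracing initialised in worker pid %s", worker.pid)
```

The reason this tends to be needed with pre-fork servers is that threads started in the master process do not carry over into forked workers, so an exporter created before the fork may never flush.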

by jdsleppy

1/10/2025 at 7:45:00 PM

It works with uwsgi just fine though.

by BugsJustFindMe

1/10/2025 at 5:12:57 PM

It's complicated because it's designed for the companies selling Otel compatible software, not the engineers implementing it

by nimish

1/10/2025 at 11:01:49 PM

Not sure about this, I think the vendors were happy with their own proprietary code, agents and backends because the lock-in ensures that switching costs (in terms of writing all new code) are very high.

by andy800

1/10/2025 at 9:29:27 PM

That hasn't been what I've seen from the contributors.

If anything I think the backends were kinda slow to adopt.

by paulddraper

1/10/2025 at 5:19:51 PM

this is going to come off as being fussy, but 'implement' used to refer to the former activity, not the latter. which is fine, meanings change, it's just amusing that we no longer have a word we can use for 'sitting down and writing software to match a specification' and only 'taking existing software and deploying it on servers'

by convolvatron

1/10/2025 at 7:47:32 PM

This has been the case for ages. Sysadmins use "implement" to mean "install software on servers and keep it running", coders use "implement" to mean "code stuff that matches a spec/interface". It's just two worlds accidentally using the same term for a different thing. No meanings are changing. Two MS certified sysadmins in 1999 could talk about how they were "Implementing Exchange across the whole company".

by skrebbel

1/10/2025 at 5:51:46 PM

Operational versus builder jargon.

by hinkley

1/10/2025 at 7:05:04 PM

It's still implementing. Someone has taken the specifications and implemented the software, and then someone else has taken the software and implemented a solution with it.

by stronglikedan

1/10/2025 at 10:52:10 PM

Author is trying to do something difficult with a non-batteries-included open source (free to them) product. Seems quite uncomplicated given the circumstances. The whole point of OTel is to not get bent over backwards by one of the SaaS "logging/tracing/telemetry" companies, and as such it's going to incur some cost/pain of its own, but typically the bargain is worth taking.

by dboreham

1/10/2025 at 8:10:08 PM

I have implemented OTEL over numerous projects to retrieve traces. It's just a total pain and I'd 500% skip it for anything else.

by 6r17

1/10/2025 at 5:42:42 PM

For those looking for tracing but less complexity check out eBPF based solutions such as Coroot or Odigos

by PeterZaitsev

1/10/2025 at 9:44:29 PM

Isn't Odigos just OTel with simpler setup?

by jensensbutton

1/10/2025 at 11:39:25 PM

Yes. edenfed posted a comment linking to the project above. Here it is again, though:

https://github.com/odigos-io/odigos

by PeterCorless

1/10/2025 at 5:04:09 PM

I agree. I tried to get it to work recently with Datadog, but there were so many hiccups. I ended up having to use Datadog's solution mostly. The documentation across everything is also kind of confusing.

by cglan

1/10/2025 at 5:20:02 PM

imo Datadog is pretty hostile to OTel too. Ever since https://github.com/open-telemetry/opentelemetry-collector-co... was nearly killed by them I never felt like they fully supported the standard (perhaps for good reasons)

OTel is a bear though. I think the biggest advantage it gives you is the ability to move across tracing providers

by SomaticPirate

1/10/2025 at 6:14:15 PM

> the ability to move across tracing providers

It's a nice dream. At Google Cloud Next last year, the vendors kind of came in two buckets. Datadog, and everyone trying to replace Datadog's outrageous bills.

by rikthevik

1/10/2025 at 9:46:09 PM

Pretty sure Datadog is literally one of the top contributors to OTel.

by jensensbutton

1/10/2025 at 5:28:45 PM

I worry that vision is not going to become reality if the large observability vendors don't want to support the standard.

by bebop

1/10/2025 at 6:03:39 PM

FWIW the "datadog doesn't like otel" thing is kind of old hat, and the story was a little more complicated at the time too.

Nowadays they're contributing more to the project directly and have built some support to embed the collector into their DD agent. Other vendors (splunk, dynatrace, new relic, grafana, honeycomb, sumo logic, etc.) contribute to the project a bunch and typically recommend using OTel to start instead of some custom stuff from before.

by phillipcarter

1/10/2025 at 6:43:33 PM

They support ingesting via otel (ie competing with other vendors for their customers) but won't support ingesting via their SDKs (they still try very hard to lock you in to their tooling).

by arccy

1/10/2025 at 6:38:24 PM

Yeah their agent will accept traces from the standard Otel SDK but there is no way to change their SDK to send the traces to anyone other than Datadog when I last checked a couple(?) of years ago.

I mean I understand why they did that but it really removes one of the most compelling parts about Otel. We ended up doing the hard work of using the standard Otel libraries. I had to contribute a PR or two to get it all to work with our services but am glad that's the route we went because now we can switch vendors if needed (which is likely in the not too distant future in our case).

by hangonhn

1/10/2025 at 6:25:26 PM

part of the reason for that experience is also because DataDog is not open telemetry native and all their docs and instructions encourage use of their own agents. Using DataDog with Otel is like trying to hold your nose round over your head

You should try Otel native observability platforms like SigNoz, Honeycomb, etc. your life will be much simpler

Disclaimer : i am one of the maintainers at SigNoz

by pranay01

1/10/2025 at 5:53:43 PM

The biggest barrier to setting up oTel for me is the development experience. Having a single open specification is fantastic, especially for portability, but the SDKs are almost overwhelmingly abstract and therefore difficult to intuit.

I used to really like Datadog for being a one-stop observability shop and even though the experience of integrating with it is still quite simple, I think product and pricing wise they've jumped the shark.

I'm much happier these days using a collection of small time services and self-hosting other things, and the only part of that which isn't joyful is the boilerplate and not really understanding when and why you should, say, use gRPC over HTTP, and stuff like that.

by ljm

1/11/2025 at 7:21:29 PM

You are generally correct but I've used https://github.com/openobserve/openobserve for several projects for dev-only complete OTel stack (dashboards included) and I liked it. There are better dashboards out there for sure, but for what I needed locally it did the job fantastically well. Zero complaints.

It's extremely easy to self-host, either on a dev machine, a VPS, or in any Docker-based PaaS.

by pdimitar

1/11/2025 at 4:02:11 AM

And having to rebuild a golang binary based on this horseshit just to get a bugfixed collector is some horseshit: https://github.com/open-telemetry/opentelemetry-collector/tr... which is required (as best I can tell) because they text/template in the deps https://github.com/open-telemetry/opentelemetry-collector/bl...

Heaven help you if it's a contrib collector bugfix

by mdaniel

1/10/2025 at 5:58:16 PM

I still don’t understand what OTEL is. What problem is it solving? If it’s a standard what is the change for the end user? Is it not just a matter of continuing to use whatever (Prometheus, Grafana, etc) with the option to swap components out?

by cedws

1/10/2025 at 6:35:06 PM

For the tracing part of Otel, neither Prometheus nor Grafana are capable of doing that. Tracing is the most mature part of Otel and the most compelling use case for it. For metrics, we've stayed with Prometheus and AWS Cloudwatch Metrics. The metrics part feels very under developed at the moment.

by hangonhn

1/10/2025 at 7:12:46 PM

When I last looked 9 months ago, there were libraries on the metrics side of the tree still marked as experimental that you couldn’t successfully send metrics without using. And a huge memory leak in the JS implementation that was only fixed 15 months ago: https://github.com/open-telemetry/opentelemetry-js/issues/41...

Things, especially crosscutting concerns, you want to use in production should have stopped experiencing basic growing pains like this long before you touch them. It’s not baked yet. Come back in a year. Or two.

by hinkley

1/10/2025 at 7:42:52 PM

Everything is either in development or stable. There aren't statuses like alpha, beta, release candidate, etc. except for individual library releases. Metric clients will be marked as "development" until it goes "stable" [0]. Consequently it can be hard to determine the actual maturity level of any given implementation.

Tracing is very mature, with metric and logging implementations stable for a number of popular languages [1].

the "experimental" status was renamed "development"

[0] https://opentelemetry.io/docs/specs/otel/versioning-and-stab...

[1] https://opentelemetry.io/docs/languages/#status-and-releases

by barake

1/10/2025 at 9:25:36 PM

> the "experimental" status was renamed "development"

That doesn’t really change things now does it. It’s still a bunch of people sitting around saying “MMMM” loudly while eating half-raw cookies.

by hinkley

1/10/2025 at 6:04:29 PM

The point of OTel is interoperability.

For example the author of the software instruments it with OTel -- either language interface or wire protocol -- and the operator of the software uses the backend of choice.

Otherwise, you have a combinatorial matrix of supported options.

(Naturally, this problem is moot if the author and operator are the same.)
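A small Python sketch of the interoperability argument, under the assumption that a shared interface sits between instrumented code and backends. `TelemetrySink`, `InMemorySink`, `PrintSink` and `fetch_user` are all hypothetical names and not part of any OTel API.

```python
from typing import Protocol


class TelemetrySink(Protocol):
    """Hypothetical shared interface that every backend adapter implements."""

    def export(self, span: dict) -> None: ...


class InMemorySink:
    def __init__(self) -> None:
        self.spans = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class PrintSink:
    def export(self, span: dict) -> None:
        print(span)


def fetch_user(user_id: int, sink: TelemetrySink) -> dict:
    # Library code is instrumented once, against the interface only.
    sink.export({"name": "fetch_user", "user_id": user_id})
    return {"id": user_id}


def integrations_needed(libraries: int, backends: int) -> tuple:
    # Without a shared interface every library targets every backend;
    # with one, each side integrates exactly once.
    return libraries * backends, libraries + backends


backend = InMemorySink()
user = fetch_user(7, backend)
pairwise, shared = integrations_needed(libraries=5, backends=8)
```

With 5 libraries and 8 backends that is 40 pairwise integrations against 13 through a shared interface, and swapping `InMemorySink` for `PrintSink` needs no change to `fetch_user`.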

by paulddraper

1/10/2025 at 11:41:23 PM

This is worthy of a bookmark. (If HN supported bookmarks.)

by PeterCorless

1/11/2025 at 3:37:53 AM

https://news.ycombinator.com/item?id=42658095#:~:text=favori... and then https://news.ycombinator.com/favorites?id=PeterCorless&comme...

by mdaniel

1/15/2025 at 12:05:13 AM

Thanks! TIL.

by PeterCorless

1/11/2025 at 7:24:19 PM

You can favorite comments. (And posts.)

by pdimitar

1/10/2025 at 7:21:50 PM

Interoperability with what?

Where are the three existing, successful solutions it is trying to abstract over?

It doesn’t know what it is because it’s violating the Rule of Three.

by hinkley

1/11/2025 at 3:43:43 AM

Aside from what others have mentioned (being interop on the send and receive side of things) I find a great deal of value in the interop schema, so you don't have to re-learn what every joker wants to name their special unicorn flavor of kubernetes pod name or container id or http response code when exploring metrics or making dashboards

https://opentelemetry.io/docs/specs/semconv/general/attribut...

https://opentelemetry.io/docs/specs/semconv/hardware/common/

https://opentelemetry.io/docs/specs/semconv/system/container...

https://opentelemetry.io/docs/specs/semconv/system/k8s-metri...

https://opentelemetry.io/docs/specs/semconv/http/http-metric...

https://opentelemetry.io/docs/specs/semconv/cloud-providers/...

by mdaniel

1/10/2025 at 8:00:29 PM

interoperability between vendors, so your business isn't stuck with a vendor who can raise prices because their SDKs are deeply embedded in your codebase, so open source libraries / products have a common point to hook into without needing to integrate with each vendor.

by arccy

1/10/2025 at 7:58:51 PM

Application Insights, Data Dog, New Relic, etc…

APM products in general.

by jiggawatts

1/10/2025 at 9:04:19 PM

How to send Prometheus data to New Relic: https://docs.newrelic.com/docs/infrastructure/prometheus-int...

How to send StatsD data to Datadog: https://docs.datadoghq.com/developers/dogstatsd/?tab=hostage...

Places like datadog and posthog are selling you their ability to ingest your existing data. I call bullshit. It’s a problem looking for a solution. It’s an excuse for engineers to build moats around a moderately difficult problem by making it inscrutable.

by hinkley

1/10/2025 at 11:05:21 PM

It simplifies the issue for applications and libraries that can support Otel and quickly integrate with whatever is being used to collect the data.

by wbl

1/11/2025 at 3:11:45 AM

The whole idea of OTel is not to be a moat.

It's the exact opposite.

I can export traces (or metrics or logs) to whatever backend I want, and change easily.

by paulddraper

1/11/2025 at 9:09:21 AM

You’re talking about a data moat, I’m talking about intellectual moats and intentional complexity, as opposed to accidental or essential complexity.

by hinkley

1/11/2025 at 2:39:52 PM

You're ignoring the oldest and most mature part of OTEL which is traces.

If you look up how to send traces to any popular vendor the options are either a) use our proprietary format and proprietary agents and SDKs, or b) use otel

Iirc metrics in OTEL are very similar to Prometheus. Haven't looked at logging but realistically logging becomes an afterthought with a good tracing setup.

by nijave

1/10/2025 at 9:27:12 PM

> Interoperability with what?

For the backend?

Datadog, New Relic, Grafana, Sentry, Azure Monitor, Splunk, Dynatrace, Honeycomb

by paulddraper

1/10/2025 at 7:26:34 PM

Interoperability with the other things your Otel Vendor is selling you. No two implementations are even remotely compatible, but they can all mostly scrape data from your Prometheus endpoints, so it's easy to migrate from useful software to their walled garden.

by GauntletWizard

1/10/2025 at 9:05:46 PM

> to their walled garden

Am I detecting sarcasm or did I just bring my own?

by hinkley

1/10/2025 at 9:40:42 PM

I don't think there's any sarcasm there; Perhaps a wistful hope that behind one of those walls somebody's actually got a garden instead of just a seedbed of false promises.

by GauntletWizard

1/10/2025 at 10:54:22 PM

Yarp.

by hinkley

1/10/2025 at 7:58:15 PM

i can report the same traces to jaeger if i want open source or i switch out the provider and it can go to aws x-ray (paid). without any code or config changes. pretty useful. yes, a tad clumsy to set up the first time.

by dionian

1/12/2025 at 11:51:44 AM

Adopting OpenTelemetry does not have to be hard for common use-cases. On Kubernetes, the Dash0 operator (https://artifacthub.io/packages/search?repo=dash0-operator) automatically instruments Node.js and Java workloads (and soon other runtimes) with just a custom resource created in a namespace. It works with all OpenTelemetry backends I know of.

Disclaimer: I am one of the authors of the Dash0 operator and work on Dash0 (https://www.dash0.com/), an OpenTelemetry-native observability platform.

Automatic instrumentation on Kubernetes is also provided by the community OpenTelemetry Operator (https://github.com/open-telemetry/opentelemetry-operator).

I am certainly biased here because OpenTelemetry and Prometheus have been at the core of my professional life for the past half decade, but I think that the biggest challenge is that there are many different ways to get you to a good setup, and people get lost in the discovery of the available options.

by mmanciop

1/10/2025 at 6:37:09 PM

This was exactly my reaction to OpenTelemetry.

Creating an HTTP endpoint that publishes metrics in a Prometheus-scrape-able format? Easy! Some boolean/float key-value-pairs with appropriate annotations (basically: is this a counter or a gauge?), and done! And that led (and leads!) to some very usable Grafana dashboards-created-by-actual-users and therefore much joy.

Then, I read up on how to do things The Proper Way, and was initially very much discouraged, but decided to ignore All that Noise due to the existing solutions working so well. No complaints so far!
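For a sense of how little is involved, here is a stdlib-only Python sketch of such a scrape endpoint. The metric names, values, port and the `serve` helper are made up for illustration; the output follows the Prometheus text exposition format with `# HELP` and `# TYPE` annotations.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics: name -> (type, help text, current value)
METRICS = {
    "app_requests_total": ("counter", "Requests handled since start.", 1027.0),
    "app_queue_depth": ("gauge", "Jobs currently waiting.", 3.0),
}


def render_metrics(metrics: dict) -> str:
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")  # counter or gauge
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    # Blocks forever; point a Prometheus scrape job at http://127.0.0.1:<port>/metrics
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()


sample = render_metrics(METRICS)
```

A real service would more likely use an existing Prometheus client library, which also handles labels, histograms and escaping; the sketch only shows the shape of the format.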

by antithesis-nl

1/10/2025 at 10:17:01 PM

Glad I'm not the only one that feels this way. For a small application when you just want some metrics and observability, it's a big burden to get it all working.

On my own projects, I send the metrics I care about out through the logs and have another project I run collect and aggregate them from the logs. Probably “wrong” but it works and it's easy to set up.
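A rough Python sketch of the metrics-through-logs approach, assuming a made-up convention where metric lines start with a `METRIC ` prefix followed by JSON. The prefix, field names and both helper functions are hypothetical and not taken from the commenter's project.

```python
import json
from collections import defaultdict

PREFIX = "METRIC "  # hypothetical marker that separates metrics from normal logs


def metric_line(name: str, value: float, kind: str = "counter") -> str:
    """What the application writes to its log stream."""
    return PREFIX + json.dumps({"name": name, "value": value, "kind": kind})


def aggregate(log_lines) -> dict:
    """What the separate collector does: sum counters, keep the last gauge value."""
    totals = defaultdict(float)
    for line in log_lines:
        if not line.startswith(PREFIX):
            continue  # ordinary log line, ignore
        record = json.loads(line[len(PREFIX):])
        if record["kind"] == "counter":
            totals[record["name"]] += record["value"]
        else:
            totals[record["name"]] = record["value"]
    return dict(totals)


log = [
    "INFO starting up",
    metric_line("jobs_done", 2),
    metric_line("queue_depth", 5, kind="gauge"),
    metric_line("jobs_done", 3),
    metric_line("queue_depth", 1, kind="gauge"),
]
summary = aggregate(log)
```

The trade-off is that metric volume now rides on the logging pipeline, which is usually fine for a small application and gets expensive at higher event rates.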

by ejs

1/10/2025 at 5:46:09 PM

Gee whiz, is this person in for a treat when they discover the joys of OpAMP https://github.com/open-telemetry/opamp-spec/blob/main/speci...

Turtles all the way down.

by lexh

1/10/2025 at 6:11:35 PM

Blech.

If you already have reloadable configuration infrastructure, or plan to add it in the future, this is just spreading out your configuration capture. No thank you (and by “no thank you” I mean fuck right off).

If you want to improve your bus number for production triage, you have to make it so anyone (senior) can first identify and then reproduce the configuration and dependencies of the production system locally without interrupting any of the point people to do so. If you cannot see you cannot help.

Just because you’re one of k people who usually discover the problem quickly doesn’t mean you’ll always do it quickly. You have bad days. You have PTO. People release things or flip feature toggles that escape your notice. If you stop to entertain other people’s queries or theories you are guaranteed to be in for a long triage window, and a potential SLA violation. But if you never accept other perspectives then your blind spots can also make for SLA violations.

Let people putter on their own and they can help with the Pareto distributions. Encourage them to do so and you can build your bus number.

by hinkley

1/10/2025 at 8:56:06 PM

I spent altogether too much time trying to get the Rust otel libs working in a useful and concise way. After a few hours I junked it and went back to a direct use of a jaeger client sending off to the otel collector.

there's some gold here, but most of it is over in the consultant/vendor space today, I fear.

by pnathan

1/10/2025 at 10:54:43 PM

I'm literally porting some code to Otel now and here is what I landed on, even before this article: It is confusing because it's a topic that uses vague terminology that means different things in different domains. For example, I'm looking at one OTel ui and "Traces" are the individual http requests to a service. In another UI, against the same data, "Traces" are the log messages from code in the service, and "Requests" are the individual http requests. To wire up in code, there's yet other terminology.

I haven't decided exactly what to blame for this. In some ways, it's necessary to have vague, inconsistent terminology to cover various use cases. And, to be fair some of the UIs predate OTel.

by shireboy

1/10/2025 at 11:24:47 PM

I gave up on opentelemetry when I was on the 5th Rust crate that I had to wire together based on little to no documentation.

Loki works great.

by Alex-Programs

1/12/2025 at 4:48:37 PM

Interesting. We're trying to cut costs on APM so we've been moving toward opensource alternatives. Setting up OTEL is definitely tedious, especially for traces and DT wasn't making it easier. I've been checking out a few alts, Signoz, Odigos, Chronosphere... there are a few others too but these guys stood out. As much as we want to build out OTEL ourselves, looking for a solution to make the transition easy seems like the way to go.

by vzbl9293

1/13/2025 at 4:41:44 AM

Thanks for the shout out to SigNoz. Do reach out to us in our slack community if you need any help setting things up - https://signoz.io/slack

For others checking out this thread, here's our github repo - https://github.com/signoz/signoz

PS: I am one of the maintainers at SigNoz

by pranay01

1/11/2025 at 9:55:56 AM

OTEL always seems way too complicated to use to me. Especially if you want to understand what it is doing. The code has a lot of abstractions and indirection (at least in Go).

And reading this it seems a lot of people agree. Hope that can be fixed at some point. Tracing should be simple.

See for example this project: https://github.com/jmorrell/minimal-nodejs-otel-tracer

I think it is more a POC but it shows that all this complexity is not needed IMO.

by Cwizard

1/11/2025 at 4:28:50 PM

Go with OTel is, unfortunately, known to be challenging ergonomics-wise. The OTel project doesn't really define an ergonomics standard, and leaves it up to the groups for each sub-project (e.g., each of the 11 language groups) to define how they package things up, what convenience wrapper APIs they offer, etc.

In Go, currently it is a deliberate choice to be both very granular and specific, so that end-users can ultimately depend on only the exact packages they need and have nothing standing between them and any customization of the SDK they need to do for their organizations.

There's some ways to isolate this kind of setup, which we document like so: https://opentelemetry.io/docs/languages/go/getting-started/#...

Stuff that into an otel.go file and then the rest of your code is usually pretty okay. From there your application code usually looks like this:

https://gist.github.com/cartermp/f37b6702109bbd7401be8a1cab8...

The main thing people sometimes struggle with at this point is threading context through all their calls if they haven't done that yet. It's annoying, but unfortunately running into a limitation of the Go language here. Most of the other languages (Java, .NET, Ruby, etc.) keep context implicitly available for you at all times because the languages provide that affordance.

by phillipcarter

1/11/2025 at 1:36:59 AM

So much pain related to context tracking. I'm growing more and more convinced that solving that problem will be the next big thing in PLs, probably in the form of effect systems.

by andrewflnr

1/10/2025 at 7:14:17 PM

What OTel really needs to succeed, at least in the Python space, is something as easy and straightforward as DataDog's ddtrace command.

by etimberg

1/10/2025 at 8:48:07 PM

Yeah... this is about how well every OTel migration goes, from what I've seen.

Docs are an absolute monstrosity that rival Bazel's for utility, but are far less complete. Implementations vary widely in their support for the basics. Getting X to work with OTel often requires exactly what they did here: reverse-engineering X to figure out where it does something slightly abnormal... which is normal, almost every library does something similar, because it's so hard to push custom data through these systems in a type-safe way, and many decent systems want type safety and will spend a lot of effort to get it.

It feels kinda like OAuth 2 tbh. Lots of promise, obvious desirable goals, but completely failing at everything involving consistent and standardized implementation.

by Groxx

1/10/2025 at 8:30:28 PM

OpenTelemessy

by gpi

1/11/2025 at 2:41:53 PM

finally someone got it running

by vednig

1/13/2025 at 3:44:16 AM

In addition to OTEL, there are many other products, including Odigos, Beyla, Kubeshark, Malcolm, Falco, DDosify, Deepflow, Tetragon, and Retina. Deepflow is a free and open source product.

by almaight

1/10/2025 at 6:18:02 PM

I literally gave a lightning talk on this at KubeCon NA last year. Here's the YouTube video [1]; it might help you get some perspective.

tl;dr

While there are certainly many areas for the project to improve, here are some reasons why it could seem complicated:

Extensibility by Design: Flexibility in defining meters and signals ensures diverse use cases are supported.

It's still a relatively new technology (~3 years old), so growing pains are expected. OpenTelemetry is still the most advanced open standard handling all three signals together.

[1] https://www.youtube.com/watch?v=xEu8_Aeo_-o

by pranay01

1/10/2025 at 9:50:16 PM

It's getting close to k8s in terms of activity so at least there are a lot of people working on it.

by jensensbutton

1/11/2025 at 8:58:46 AM

https://github.com/anacrolix/notel?tab=readme-ov-file#what-a...

by anacrolix

1/11/2025 at 7:37:29 PM

That section reads like trashing OTel; basically an annoyed rant. A clear debunking of your text exists and is easy to find: for example, both OpenObserve and SigNoz can be trivially self-hosted, and you will not "be charged extraordinary amounts of money" for it. Both take no more than 5 minutes: just set a few env vars, run a single command -- and you're done.

I can see the value in smaller software -- I fought for it many times, in fact -- but you will have to do better when making a case for your program. Just giving one semi-informed dismissive take reads like a beer-infused dismissal.

by pdimitar

1/11/2025 at 9:15:45 AM

I wish I could move off NewRelic. Every time I post about it (seriously, check my post history) over the years, HN commenters try to convince me that OTel does automated metrics almost as well, or just as well, or even better.

Once in a while I try to spin up OTel like they say. Every single time it sucks. I'll keep trying, though. NewRelic's pricing is so brutal that I hold out hope. Unfortunately, NR's product really is that good...

by icelancer

1/11/2025 at 7:39:46 PM

Have you tried OpenObserve or SigNoz? Both are trivial to start locally and to self-host, be it in VPS-es or in any Docker-based PaaS.

by pdimitar

1/10/2025 at 5:19:40 PM

Have you considered Kamon instead? From personal experience it's really the best tracing solution for Akka and other libraries using Scala Futures. I haven't tried it, but it does have built-in Spring support as well.

https://kamon.io

Edit: I wonder why suggesting JVM instrumentation that is much more polished than the OTel and Lightbend agents gets me downvoted?

by hocuspocus

1/11/2025 at 7:42:23 PM

I have not downvoted you but you seem to recommend a very specific and tailor-made product for a specific tech stack.

OpenTelemetry is universal. As long as you can send the right network packets to one of a number of ingesting programs, you can have pretty dashboards and a lot of insights, regardless of the programming language of the program that originated the metric / trace / log.
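
As a rough illustration of what "sending the right data" can look like, here is a minimal sketch in Python using only the standard library. It builds one span in the OTLP/HTTP JSON encoding and posts it to a collector. The function names, the scope name `manual-example`, and the assumption that a collector is listening on `localhost:4318` (the default OTLP/HTTP port) are mine, not something from this thread.

```python
import json
import secrets
import urllib.request

def build_otlp_trace_payload(service_name: str, span_name: str,
                             start_ns: int, end_ns: int) -> dict:
    """Build an OTLP JSON trace export body containing a single span."""
    span = {
        "traceId": secrets.token_hex(16),  # 16 random bytes, hex-encoded
        "spanId": secrets.token_hex(8),    # 8 random bytes, hex-encoded
        "name": span_name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        # 64-bit integers are written as strings in the JSON encoding
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [span],
            }],
        }]
    }

def send_payload(payload: dict,
                 endpoint: str = "http://localhost:4318/v1/traces") -> int:
    """POST the payload to a collector; assumes one is listening at endpoint."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be something like `send_payload(build_otlp_trace_payload("my-service", "do-work", start_ns, end_ns))` with timestamps from `time.time_ns()`. Any language that can produce this JSON and make an HTTP POST can do the same.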

by pdimitar

1/13/2025 at 2:13:17 PM

Kamon supports OTel.

by hocuspocus

1/13/2025 at 2:20:26 PM

Thank you. It was not obvious from a quick glance.

by pdimitar
