Async DNS | alt.hn

12/12/2025 at 5:50:34 PM

The first linked article was recently discussed here: RIP pthread_cancel (https://news.ycombinator.com/item?id=45233713)

In that discussion, most of the same points as in this article were already discussed, specifically some async DNS alternatives.

See also the discussion here: https://github.com/crystal-lang/crystal/issues/13619

by albertzeyer

12/12/2025 at 6:39:29 PM

I am always amused when folks rediscover the bad idea that is `pthread_cancel()` — it’s amazing that it was ever part of the standard.

We knew it was a bad idea at the time it was standardized in the 1990s, but politics — and the inevitable allure of a very convenient sounding (but very bad) idea — meant that the bad idea won.

Funny enough, while Java has deprecated their version of thread cancellation for the same reasons, Haskell still has theirs. When you’re writing code in IO, you have to be prepared for async cancellation anywhere, at any time.

This leads to common bugs in the standard library that you really wouldn’t expect from a language like Haskell; e.g. https://github.com/haskell/process/issues/183 (withCreateProcess async exception safety)

by frumplestlatz

12/12/2025 at 7:27:10 PM

What's crazy is that it's almost good. All they had to do was make the next syscall return ECANCELED (already a defined error code!) rather than terminating the thread.

Musl has an undocumented extension that does exactly this: PTHREAD_CANCEL_MASKED passed to pthread_setcancelstate.

It's great and it should be standardized.

by AndyKelley

12/12/2025 at 9:07:45 PM

That would have been fantastic. My worry is if we standardized it now, a lot of library code would be unexpectedly dealing with ECANCELED from APIs that previously were guaranteed to never fail outside of programmer error, e.g. `pthread_mutex_lock()`.

Looking at some of my shipping code, there's a fair bit that triggers a runtime `assert()` if `pthread_mutex_lock()` fails, as that should never occur outside of a locking bug of my own making.

by frumplestlatz

12/12/2025 at 9:19:09 PM

You can sort of emulate that with pthread_kill and EINTR, but you need to control all code that can call interruptible syscalls so that it returns correctly without retrying (or longjmp/throw from the signal handler, but then we are back in pthread_cancel territory).

by gpderetta

12/12/2025 at 11:04:24 PM

There's a second problem here that musl also solves. If the signal is delivered in between checking for cancelation and the syscall machine code instruction, the interrupt is missed. This can cause a deadlock if the syscall was going to wait indefinitely and the application relies on cancelation for interruption.

Musl solves this problem by inspecting the program counter in the interrupt handler and checking if it falls specifically in that range, and if so, modifying registers such that when it returns from the signal, it returns to instructions that cause ECANCELED to be returned.

Blew my mind when I learned this last month.

by AndyKelley

12/13/2025 at 4:04:25 AM

Introspection windows from an interrupting context are a neat technique. You can use it to implement “atomic transaction” guarantees for the interruptee as long as you control all potential interrupters. You can also implement “non-interruption” sections and bailout logic.

by Veserv

12/13/2025 at 3:10:02 AM

In particular you need to control the signal handlers. You can't do that easily in a library.

by cryptonector

12/13/2025 at 3:09:30 AM

`pthread_cancel()` was meant for interrupting long computations, not I/O.

by cryptonector

12/13/2025 at 3:21:31 AM

It always surprised me that in the path of so many glibc functions are calls to open() items in /etc and then parse their output into some kind of value to use or possibly return.

The initialization of these objects should have been separate and then used as a parameter to the functions that operate on them. Then you could load the /etc/gai.conf configuration, parse it, then pass that to getaddrinfo(). The fact that multiple cancellation points are discreetly buried in the paths of these functions is an element of unfortunate design.

by themafia

12/12/2025 at 7:47:33 PM

It’s extremely easy to write application code in Haskell that handles async cancellation correctly without even thinking about it. The async library provides high-level abstractions. However, your point is still valid: if you write library code at a low level of abstraction (as the standard library must), it is just as error-prone as in Java or C.

by kccqzy

12/13/2025 at 3:03:26 AM

`pthread_cancel()` is necessary _only_ to interrupt compute-only code without killing the entire process. That's it. The moment you try to use it to interrupt _I/O_ you lose -- you lose BIG.

by cryptonector

12/14/2025 at 7:54:57 PM

there is a better way - in any unbounded compute loop, add some code to check for cancellation. it can be very very very cheap

this is not possible if you are calling third party code that you can't modify. in this case it's probably a better idea to run it on another process and use shared memory to communicate back results. this can even be done in an airtight sandboxed manner (browsers do this for example), something that can't really be done with threads
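The in-loop check from the first paragraph could be sketched as follows. This is only an illustration in Python, assuming a `threading.Event` as the cancellation flag; `count_primes` and `check_every` are hypothetical names, not from any library.

```python
import threading

def count_primes(limit, cancel, check_every=1024):
    """Compute-bound loop that polls a cancellation flag periodically."""
    found = 0
    for n in range(2, limit):
        # Poll the flag only every `check_every` iterations to keep it cheap.
        if n % check_every == 0 and cancel.is_set():
            return None  # cancelled: the caller gets no partial result
        # Trial division; all() over an empty range is True for n = 2 and 3.
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            found += 1
    return found
```

Another thread calls `cancel.set()` to request a stop. The loop exits at its next check, at a point the code chose, so no locks or half-updated state are left behind.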

by nextaccountic

12/16/2025 at 4:57:08 AM

Right, and then you can kill it, but that's essentially what `pthread_cancel()` is. `pthread_cancel()` is just fine as long as that's all you use it for. The moment you go beyond interruption of 100% compute-bound work, you're in for a world of hurt.

by cryptonector

12/12/2025 at 7:43:29 PM

IO can fail at any point though, so that’s not particularly bad.

by paulddraper

12/13/2025 at 1:41:25 AM

It's particularly bad because thread interruptions are funneled into the same system as IO errors, so it's easy to consume them by mistake.

Java has that same issue.
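As an analogy only (this is Python's asyncio, not Haskell or Java), the "consumed by mistake" failure mode might look like the sketch below. `worker` and `main` are hypothetical names; the over-broad `except` is the bug being illustrated.

```python
import asyncio

async def worker():
    try:
        await asyncio.sleep(60)      # cancellation is delivered at this await
    except BaseException:            # too broad: also catches CancelledError
        return "swallowed"

async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)           # let the worker start and block
    task.cancel()                    # request cancellation
    result = await task              # no CancelledError reaches the caller
    return result, task.cancelled()
```

Here the task finishes normally with a result even though it was asked to cancel, because the handler treated the cancellation like any other error.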

by marcosdumay

12/12/2025 at 8:10:54 PM

I was able in an afternoon to implement a pretty decent completely async Swift DNS resolver client for my app. DNS clients are simple enough to build that rolling your own async is not a big deal anymore.

Yes, there is separate work to discern what DNS server the system is currently using: on macOS this requires a call to an undocumented function in libSystem - that both Chromium and Tailscale use!

by dweekly

12/12/2025 at 8:15:31 PM

A lot of folks think this, but did you also implement EDNS0?

The golang team also thought DNS clients were simple, and it led to almost ten years of difficult-to-debug panics in Docker, Mesos, Terraform, Consul, Heroku, Weave and countless other services and CLI tools written in Go. (Search "cannot unmarshal DNS message" and marvel at the thousands of forum threads and GitHub issues that all bottom out at Go implementing the original DNS spec and not following later updates.)
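For readers who haven't met EDNS0 (RFC 6891): on the query side it amounts to appending an OPT pseudo-record to the additional section, whose CLASS field advertises the largest UDP response the client will accept. A minimal sketch, assuming a hypothetical `build_query` helper, a fixed transaction ID, and 1232 bytes as the advertised size (a commonly suggested value, chosen here as an assumption):

```python
import struct

def build_query(name, qtype=1, txid=0x1234, udp_payload=1232):
    """Encode a DNS query with an EDNS0 OPT pseudo-record (RFC 6891)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 1)
    # QNAME: length-prefixed labels terminated by the zero-length root label.
    qname = b"".join(
        bytes([len(label)]) + label
        for label in name.rstrip(".").encode("ascii").split(b".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    # OPT RR: root name, TYPE=41, CLASS=UDP payload size,
    # TTL=extended RCODE/version/flags (all zero here), RDLEN=0.
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_payload, 0, 0)
    return header + question + opt
```

Building the query is the easy half. A client also has to parse larger responses, handle truncation, and fall back to TCP, which is where hand-rolled resolvers tend to go wrong.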

by AaronFriel

12/12/2025 at 10:47:58 PM

nsswitch cough

by formerly_proven

12/12/2025 at 9:43:41 PM

Even once you use the private `dns_config*()` APIs on macOS, you need to put in heavy lifting to correctly handle scoped, service-specific providers, supplemental matching rules, etc -- none of which is documented, and can change in the future.

Since you're not using the system resolver, you won't benefit from mDNSResponder's built-in DNS caching and mDNS resolution/caching/service registration, so you're going to need to reimplement all of that, too. And don't forget about nsswitch on BSD/Linux/Solaris/etc -- there's no generic API that lets you plug into that cleanly, so for a complete implementation there, you need to:

- Reimplement built-in modules like `hosts` (for `/etc/hosts`), `cache` (query a local `nscd` cache, etc), and more.

- Parse the nsswitch.conf configuration file, including the rule syntax for defining whether to continue/return on different status codes.

- Reimplement rule-based dispatch to both the built-in modules and custom, dynamically loaded modules (like `nss_mdns` for mDNS resolution).
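To give a sense of just the parsing step, here is a rough sketch that reads the `hosts` rule. It assumes the glibc-style syntax (`hosts: files mdns4_minimal [NOTFOUND=return] dns`); `parse_hosts_rule` is a hypothetical helper, and negated criteria such as `[!UNAVAIL=return]` are stored under the literal key `!UNAVAIL` rather than interpreted.

```python
import re

def parse_hosts_rule(conf_text):
    """Return [(service, {STATUS: action}), ...] for the 'hosts' database."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if not line.lower().startswith("hosts:"):
            continue
        rule = []
        # Tokens are either a bracketed action list or a service name.
        for token in re.findall(r"\[[^\]]*\]|\S+", line[len("hosts:"):]):
            if token.startswith("["):
                if not rule:
                    raise ValueError("action list before any service")
                # Actions apply to the service that precedes them.
                for criterion in token[1:-1].split():
                    status, _, action = criterion.partition("=")
                    rule[-1][1][status.upper()] = action.lower()
            else:
                rule.append((token, {}))
        return rule
    return []
```

This covers none of the dispatch, module loading, or per-OS differences, which is where most of the work is.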

Each OS has its own set of built-ins, and private/incompatible interfaces for interacting with things like the `nscd` cache daemon. Plus, the nsswitch APIs and config files themselves differ across operating systems. And we haven't even discussed Windows yet.

Re-implementing all of this correctly, thoroughly, and keeping it working across OS changes is extremely non-trivial.

The simplest and most correct solution is to just:

- Use OS-specific async APIs when available; e.g. `CFHostStartInfoResolution()` on macOS, `DnsQueryEx()` on Windows, `getaddrinfo_a()` on glibc (although that spawns a thread, too), etc.

- If you have a special use-case where you absolutely need better performance, and do not need to support all the system resolver functionality above (i.e. server-side, controlled deployment environment), use an event-based async resolver library.

- Otherwise, issue a blocking call to `getaddrinfo()` on a new thread. If you're very worried about unbounded resource consumption, use a size-limited thread pool.
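The last option in the list above could be sketched in Python roughly like this. The pool size of 4 and the `resolve_async` name are assumptions for illustration; the lookup itself still goes through the system resolver via `socket.getaddrinfo()`.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# A bounded pool caps how many lookups can be blocked at the same time.
_resolver_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="dns")

def resolve_async(host, port):
    """Submit a blocking getaddrinfo() call; returns a Future immediately."""
    return _resolver_pool.submit(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
```

A caller would use `resolve_async("example.com", 443).result(timeout=5)`. Note that a timeout only stops the caller from waiting; the worker thread stays blocked until `getaddrinfo()` returns, which is exactly the limitation that makes people reach for cancellation.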

by frumplestlatz

12/12/2025 at 11:26:03 PM

Good points, all - there is a lot of subtlety here.

CFHostStartInfoResolution is deprecated, no? https://developer.apple.com/documentation/cfnetwork/cfhostst...:)

That leaves us with DNSServiceGetAddrInfo? https://developer.apple.com/documentation/dnssd/dnsservicege...:) or some kinda convoluted use of Network and NWEndpoint/NWConnection with continuations could do the same?

by dweekly

12/12/2025 at 11:34:13 PM

Oh yes, good catch. Yeah, you want to use `NWConnection` (or one of the other higher-level supported networking APIs), which raises another issue with doing custom DNS resolution. You need those APIs' connect-by-name semantics to get VPN-on-Demand:

https://developer.apple.com/documentation/technotes/tn3151-c...

by frumplestlatz

12/13/2025 at 8:18:29 PM

Browsers don't care about the nsswitch though. There are apps where that complexity can be avoided.

by cryptonector

12/13/2025 at 8:19:41 AM

Doesn't Linux run systemd-resolved locally? You just send the request there and it handles hosts, cache and whatnot.

by GoblinSlayer

12/12/2025 at 7:31:29 PM

For those using it in Python, Gevent provides a pluggable set of DNS resolvers that monkey-patch the standard library's functions for async/cooperative use, including one built on c-ares: https://www.gevent.org/dns.html

by btown

12/12/2025 at 8:08:20 PM

gevent. Man that's a blast from the past

by petcat

12/12/2025 at 9:23:52 PM

Still alive and kicking in production for us! For situations where many requests are bound by external HTTP requests to third-party suppliers, it's an amazing way to allow for practically unlimited concurrency with limited cores.

by btown

12/12/2025 at 6:09:01 PM

It's weird to me that event-based DNS using epoll or similar doesn't have a battle-tested implementation. I know it's harder to do in C than in Rust but I'm pretty sure that's what Hickory does internally.

by 01HNNWZ0MV43FF

12/12/2025 at 6:44:17 PM

It’s a weird problem, in that (1) DNS is hard, and (2) you really need the upstream vendor to solve the problem, because correct applications want to use the system resolver.

If you don’t use the system resolver, you have to glue into the system’s configuration mechanism for resolvers somehow … which isn’t simple — for example, there’s a lot of complex logic on macOS around handling which resolver to use based on what connections, VPNs, etc, are present.

And then there’s nsswitch and other plugin systems that are meant to allow globally configured hooks to plug into the name resolution path.

by frumplestlatz

12/12/2025 at 8:07:25 PM

> (1) DNS is hard

It's really not.

Just because some systems took something fundamentally simple and wrapped a bunch of unnecessary complexity around it does not make it hard.

At its core, it's an elegant, minimal protocol.

by AndyKelley

12/12/2025 at 8:50:32 PM

DNS falls into the category of things most people think they understand, like JavaScript or elections, but the devil is in the details. I can tell you that, at least for DNS (and Dutch elections), it's kind of tricky; see fun cases like https://github.com/internetstandards/Internet.nl/issues/1370 for an example. I thought the same before I had my current job, which involves quite a lot of tricky DNS stuff (and we also sometimes encounter bugs in unbound, see https://github.com/internetstandards/Internet.nl/issues/1803 )

But maybe DNSSEC is the 'unnecessary complexity' for you (I think it's kind of fundamental to secure DNS). Also, even without DNSSEC, RFCs like https://datatracker.ietf.org/doc/html/rfc8020 were needed to clarify fundamentals (the same goes for https://datatracker.ietf.org/doc/html/rfc8482 to fix stuff).

by bwblabs

12/13/2025 at 8:22:52 PM

Dutch elections? How do they come into this?

by cryptonector

12/15/2025 at 11:04:56 PM

There is a list of things tech people think they understand (DNS, JavaScript), and more commonly you see the same with everyday people and things like elections: the basic concept is clear and understandable, but the devil/complexity is in the details, such as how to handle certain exceptions. I was employed by the Election Management Body of the Netherlands for a few years, so I can only vouch for the complexity of that relatively simple election system, but I'm pretty sure it holds for just about every country ;)

by bwblabs

12/12/2025 at 11:06:58 PM

You and GP are talking about completely different things. Yes, DNS at its core is an elegant, minimal protocol. But all the complexity comes from client-side configuration before the protocol is even involved.

We have complexity like different kinds of VPNs, from network-level VPNs to app-based VPNs to MDM-managed VPNs, possibly coexisting. We have on-demand VPNs that only start when a particular domain is being visited: yes, a VPN starting because of DNS. We have user-provided or admin-provided hardcoded responses in /etc/hosts. We have user-specified resolver overrides (for example, the user wants to use 8.8.8.8, not the ISP resolver). We have multiple sources of network-provided resolvers, from RDNSS to DHCPv6 O mode.

It is non-trivial to determine which resolver to even start sending datagrams with that elegant minimal protocol.

by kccqzy

12/12/2025 at 10:45:40 PM

Lots of elegant, minimal things are hard to use effectively.

by tptacek

12/14/2025 at 11:25:20 AM

Many async frameworks (e.g. libevent [1]) have a DNS client. But it's not easy to use unless your program uses that specific framework (say libevent) for all network I/O. The problem is not that it's hard to do in C but that there is no single async framework everyone uses.

[1] https://libevent.org/libevent-book/Ref9_dns.html

by citrin_ru

12/13/2025 at 5:35:43 AM

I use Hickory a lot and have contributed to it. It does have a pretty robust async DNS implementation, and it's helpfully split into multiple crates so you can pick your entry point into the stack. For instance, it offers a recursive resolver, but you can also just import the protocol library and build your own with tokio.

by leshow

12/13/2025 at 8:19:43 PM

Link?

by cryptonector

12/14/2025 at 12:51:49 PM

I'm one of the Hickory maintainers, although I mainly work on the server-side code.

https://github.com/hickory-dns/hickory-dns is our Git repo

Documentation for the resolver including an example: https://docs.rs/hickory-resolver/latest/hickory_resolver/ind...

by marcusb

12/14/2025 at 5:56:30 PM

Thank you!

by cryptonector

12/12/2025 at 6:04:21 PM

Just curious how you approached performance bottlenecks — anything surprising you discovered while testing?

by javantanna

12/12/2025 at 6:18:10 PM

Another related article: https://ziglang.org/devlog/2025/#2025-10-15

by benatkin

12/13/2025 at 3:02:02 AM

I'm digging dns.c and asr. I might get dns.c building and use it.

by cryptonector

12/12/2025 at 7:23:58 PM

Who can fix getaddrinfo?

by brcmthrowaway

12/12/2025 at 7:47:35 PM

There are steps that three different parties can take, which do not depend on other parties to cooperate:

POSIX can specify a new version of DNS resolution.

libcs can add extensions, allowing applications to detect when they are targeting those systems and use them.

Applications on Linux and Windows can bypass libc.

by AndyKelley

12/12/2025 at 8:28:25 PM

What about macOS?

by brcmthrowaway

12/12/2025 at 8:37:41 PM

they already have CFHostStartInfoResolution / CFHostCancelInfoResolution

by AndyKelley

12/13/2025 at 12:09:16 AM

libuv? libevent?

by jupp0r