5/10/2026 at 1:44:02 AM
Anyone trying to do this... the first thing you do is avoid lex/yacc/bison/antlr. You do not need all this ceremony. A recursive descent parser that uses Pratt parsing will work for a vast majority of cases.
The lexer/parser is never the bottleneck. In fact, you can write those two by hand over a single weekend for a largish language. With LLMs, it takes 15 minutes if you have an unambiguous spec.
The biggest time sink, and the reason you will fail for sure, is the inability to restrict the scope of the project. You start with a limited feature set and produce the entire compiler/vm toolchain. Then you get greedy and fiddle with the type system, adding features that you have never used and probably never will. And now you have to change every single phase from start to end.
I mostly give up at this stage.
by sieve
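A minimal sketch of the recursive descent plus Pratt parsing approach described in the comment above, in Python. Everything here is an assumption made for illustration: the names (`tokenize`, `Parser`, `parse`), the operator set, the binding-power values, and the tuple-based tree are not from the thread. The tokenizer is a deliberate shortcut so the example stays focused on the expression loop.

```python
# Binding powers are illustrative; a real language would define its own table.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20, "^": 30}
RIGHT_ASSOC = {"^"}
UNARY_MINUS_BP = 25  # assumed: binds tighter than * and /, looser than ^


def tokenize(src):
    # Shortcut for the sketch: pad every symbol with spaces, then split.
    for sym in "()+-*/^":
        src = src.replace(sym, f" {sym} ")
    return src.split() + ["<eof>"]


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def parse_expr(self, min_bp=0):
        # Prefix position: a number, a parenthesised expression, or unary minus.
        tok = self.advance()
        if tok.isdecimal():
            left = int(tok)
        elif tok == "(":
            left = self.parse_expr(0)
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
        elif tok == "-":
            left = ("neg", self.parse_expr(UNARY_MINUS_BP))
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

        # Infix loop: keep consuming operators that bind at least as tightly
        # as the caller requires.
        while True:
            op = self.peek()
            bp = BINDING_POWER.get(op)
            if bp is None or bp < min_bp:
                break
            self.advance()
            # Left-associative operators demand a strictly tighter right side.
            next_bp = bp if op in RIGHT_ASSOC else bp + 1
            left = (op, left, self.parse_expr(next_bp))
        return left


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.parse_expr()
    if parser.peek() != "<eof>":
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With these assumed binding powers, `parse("1 + 2 * 3")` returns `("+", 1, ("*", 2, 3))` and `parse("2 ^ 3 ^ 2")` returns `("^", 2, ("^", 3, 2))`. Adding an operator means adding one table entry, which is a large part of why this style stays small.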
5/10/2026 at 2:40:40 AM
Jonathan Blow wrote his own game engine, and for that he wrote his own programming language.
He went with a straight recursive descent parser and said the same thing.
I think compiler courses teach from yacc, bison etc that's where this whole thing came from but in practice people discovered that hand-written recursive descent parsers are all you need.
by wg0
5/10/2026 at 3:10:15 AM
> I think compiler courses teach from yacc, bison etc that's where this whole thing came from
Very true. I have a shelf full of books on compiler development and optimization. I have read them selectively, a chapter here, a chapter there. But that shelf is useless for a vast majority of people.
You might find it useful if you are developing a production-level compiler/vm (I cannot make this statement with a straight face while Python rules the world). But a simple and sensible architecture that uses recursive-descent parsing takes you a long way.
Most hobbyist compilers (and even some production ones) are written as a heavy front-end compiling down to C or LLVM. Very few people actually write their own backend.
by sieve
5/10/2026 at 7:10:59 AM
> You might find it useful if you are developing a production-level compiler/vm
Not any of the ones I have worked on, nor the ones I know about: they all use hand-written parsers. In practice, error reporting and recovery tend to be tedious and/or difficult with a generated parser, which is a serious issue for practical tools.
Parsing has turned out to be simpler, in practice, than the computing pioneers expected it to be, because simpler grammars are easier for both machines and humans to reason about. Instead of using sophisticated parser generators, we just design dumb grammars: that works out better all around.
by marssaxman
5/10/2026 at 7:33:20 AM
Yeah. I added the caveat because I haven't looked at the source of the major production compilers and didn't want to overreach. The hobbyist ones mostly stick to hand-rolled recursive descent.
by sieve
5/10/2026 at 3:49:44 AM
Re: bison and yacc. It came from the dragon book, which for the longest time was the way to learn to write languages.
by tehologist
5/10/2026 at 3:51:13 AM
I learned to do this about 2 years ago (pre LLM). I have been developing software for ~30 years and somehow doing something like this was a major mental obstacle, mostly created by the perception of "the dragon book", as in this topic being full of mystical unobtainable incantations, so I never even dared venture into this space. Silly, I know. However, after diving into this and learning to write a recursive descent parser for a DSL I wanted to write, it felt like I'd acquired a superpower. Totally understand that there are many more layers to all of this, layers that can get very complex, but just learning that first bit...
by pan69
5/10/2026 at 4:47:36 AM
I wish people would start with Nystrom's https://www.craftinginterpreters.com/ and avoid the dragon etc unless they really, really need it. Almost everything I have learnt about compiler/vm development, I have done so by reading random blogs and articles on various aspects and small tutorials on writing parsers and vms.
Even stuff like Crenshaw's Let's Build a Compiler was more useful to me than all these books that do lexical analysis using regular expressions. I have written lexers and parsers hundreds of times for all kinds of DSLs and config languages and not once have I used regular expressions to scan the text.
by sieve
5/10/2026 at 8:41:13 AM
Isn't using regex in this space kinda shunned, when you can easily write a grammar and parse things more reliably that way? Surprised to read that any books do that.
by zelphirkalt
5/10/2026 at 9:33:48 AM
Every single book starts with regexes and DFA/NFA for lexical analysis. Too much ceremony for something you can write in 30 minutes and 300 lines.
by sieve
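A rough sketch of what a hand-written scanner with no regular expressions can look like, in Python. The token kinds, the keyword set, the operator tables, and the `#` comment syntax are all assumptions invented for this example; they do not describe any language mentioned in the thread.

```python
from dataclasses import dataclass

KEYWORDS = {"let", "if", "else", "while", "return"}  # hypothetical keyword set
TWO_CHAR_OPS = {"==", "!=", "<=", ">=", "->"}
ONE_CHAR_OPS = set("+-*/=<>!(){},;:")


@dataclass(frozen=True)
class Token:
    kind: str  # "num", "ident", "kw", "str", "op" or "eof"
    text: str
    line: int


def scan(src):
    tokens, i, line, n = [], 0, 1, len(src)
    while i < n:
        ch = src[i]
        if ch == "\n":
            line += 1
            i += 1
        elif ch in " \t\r":
            i += 1
        elif ch == "#":  # comment runs to the end of the line
            while i < n and src[i] != "\n":
                i += 1
        elif ch.isdecimal():
            start = i
            while i < n and src[i].isdecimal():
                i += 1
            tokens.append(Token("num", src[start:i], line))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (src[i].isalnum() or src[i] == "_"):
                i += 1
            text = src[start:i]
            tokens.append(Token("kw" if text in KEYWORDS else "ident", text, line))
        elif ch == '"':
            start = i + 1
            i += 1
            while i < n and src[i] not in '"\n':
                i += 1
            if i >= n or src[i] != '"':
                raise SyntaxError(f"line {line}: unterminated string")
            tokens.append(Token("str", src[start:i], line))
            i += 1  # skip the closing quote
        elif src[i:i + 2] in TWO_CHAR_OPS:  # try the longer operator first
            tokens.append(Token("op", src[i:i + 2], line))
            i += 2
        elif ch in ONE_CHAR_OPS:
            tokens.append(Token("op", ch, line))
            i += 1
        else:
            raise SyntaxError(f"line {line}: unexpected character {ch!r}")
    tokens.append(Token("eof", "", line))
    return tokens
```

Calling `scan('let x = 42')` yields a keyword, an identifier, an operator, a number, and a final `eof` token. The whole thing is one loop over character classes, so extending it (escapes in strings, floats, nested comments) is a matter of adding branches.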
5/10/2026 at 1:27:40 PM
To add, another reason to hand-write your parser is that it gives you much, much better opportunities for adding helpful diagnostics.
Grammar rules are not meaningful to your users - they don’t have the same mental model as whatever parser generator you’re using. If you want to be helpful to them, you probably want some pattern matching in the error path as well.
Best in class here is probably rustc, which has incredible UX compared to most compilers.
by simonask
5/10/2026 at 8:25:02 AM
Probably the most fun I’ve had with LLMs has been slowly making a programming language as a side project.
I used to give up somewhere around the type system, too, but this time I’m approaching something vaguely useful. It even has a basic LSP.
It’s been both enjoyable and enlightening, and LLMs turn out to be an excellent pair designer as (in addition to implementation) they’re really good at summarising the impact of various decisions.
> the reason you will fail for sure, is the inability to restrict the scope of the project
This will be the reason, for sure. But then the scope of every project like this tends towards building an OS with it then replacing every piece of software, including all embedded devices :)
by barnabee
5/10/2026 at 9:30:52 AM
> slowly
I cannot do slow. It is either burn the candle at both ends, or do nothing at all.
I am using LLMs this time as well, but I spent close to 400 hours over a period of 6-7 weeks on my project before I put it to the side temporarily (got bored once the thinking part was done). About 300 of those were spent on iterating over the language and VM specs and eliminating all ambiguities and needless features. The remaining 100 were used to produce the code --- the VM, the assembler and the compiler --- and to repeatedly rewrite it to conform to my way of doing things.
LLMs have let me become extremely choosy about which code I am willing to keep.
by sieve
5/10/2026 at 10:24:32 AM
I've taken the approach of writing and even directly reviewing almost no code for this, otherwise I'd simply not have time for it as another side project. It's also interesting to see how far I can push this "vibe engineering" approach, and although it's not perfect, the answer is much further than I'd have expected going in.
I've managed to get OpenCode set up such that I can have a productive discussion about the design or an issue / change then leave the LLM iterating for long periods while I do other work. It's instructed to maintain test coverage and treat quality very seriously - as a result there are over 5000 tests (some I suspect are useless...) and it's pretty rare to get a regression.
I'm pretty sure there are plenty of significant bugs and gaps, but it also seems that, once found, all of them will be fixed pretty quickly by the LLM.
I just have to avoid looking at the code...
by barnabee
5/10/2026 at 1:23:48 PM
The part where most people give up is useful error messages and parsing recovery. Parsing a correct document with a grammar is really easy, to the point that nobody should be under the illusion that it is difficult. The moment you add error recovery, the things that are obvious have n ways to go wrong and you have to come up with test cases for each of them.
by imtringued
5/10/2026 at 3:21:14 PM
Recovery is simple depending on your syntax (and whether your language supports exceptions). You pick a few places where you trap exceptions, add the error to your log and continue. My current project is the first one that might become public. So it is the first time I have worked a little bit on usability. So it reports multiple errors at a time, not just one.
by sieve
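One way the "trap exceptions at a few places, log, continue" idea might look, sketched in Python. The toy statement form (`name = number ;`), the `ParseError` class, and the choice of `;` as the resynchronisation point are assumptions made up for this example, not details from the comment above.

```python
class ParseError(Exception):
    pass


def parse_program(tokens):
    """Parse `name = number ;` statements, collecting every error found."""
    pos, statements, errors = 0, [], []

    def expect(pred, what):
        nonlocal pos
        tok = tokens[pos] if pos < len(tokens) else "<eof>"
        if not pred(tok):
            raise ParseError(f"token {pos}: expected {what}, found {tok!r}")
        pos += 1
        return tok

    def statement():
        name = expect(str.isidentifier, "a name")
        expect(lambda t: t == "=", "'='")
        value = expect(str.isdecimal, "a number")
        expect(lambda t: t == ";", "';'")
        return (name, int(value))

    while pos < len(tokens):
        try:
            statements.append(statement())
        except ParseError as err:
            # The one place where errors are trapped: log, then resynchronise
            # by skipping to just past the next ';' and carry on parsing.
            errors.append(str(err))
            while pos < len(tokens) and tokens[pos] != ";":
                pos += 1
            pos += 1
    return statements, errors
```

For the input `"x = 1 ; y 2 ; z = ; w = 4 ;".split()` this returns the two good statements, `("x", 1)` and `("w", 4)`, along with two separate error messages. How well this works depends on the syntax having an obvious synchronisation token, which matches the "depending on your syntax" caveat.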
5/10/2026 at 8:27:14 AM
Many projects wish they had a proper grammar. When a project turns useful and people want to port it, or support it on other platforms, a grammar makes that job much easier.
I am not quite sure what you mean by having a recursive descent parser, because you can write one manually, or you can generate one from a grammar, which would have the additional benefits of having a grammar. I recommend having a grammar.
by zelphirkalt
5/10/2026 at 8:43:56 AM
I like writing parsers, and nowadays just use handwritten recursive descent functions, using a couple of simple utility functions. It is easy to reason about and flexible. I do start each parsing function with a comment stating the informal grammar the function should parse (and LLM autocomplete usually types the rest of the function).
With regard to portability: I've found cross-language parser generators especially unpleasant to work with. Instead, I just implement the parser in a language that runs on all platforms I care about.
by vanviegen
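A small illustration of the style described in the comment above: a couple of utility functions, and each parsing function opening with a comment that states the informal grammar it handles. This is a sketch in Python under invented assumptions; the `Tokens` helper, its `peek`/`accept`/`expect` methods, and the nested-list grammar are hypothetical and not taken from the commenter's code.

```python
class Tokens:
    """The small set of utilities every parsing function shares."""

    def __init__(self, items):
        self.items = items
        self.pos = 0

    def peek(self):
        return self.items[self.pos] if self.pos < len(self.items) else None

    def accept(self, text):
        # Consume the next token only if it matches.
        if self.peek() == text:
            self.pos += 1
            return True
        return False

    def expect(self, text):
        if not self.accept(text):
            raise SyntaxError(f"expected {text!r}, found {self.peek()!r}")


def parse_value(ts):
    # value := NUMBER | list
    tok = ts.peek()
    if tok == "[":
        return parse_list(ts)
    if tok is not None and tok.isdecimal():
        ts.pos += 1
        return int(tok)
    raise SyntaxError(f"expected a value, found {tok!r}")


def parse_list(ts):
    # list := "[" [ value { "," value } ] "]"
    ts.expect("[")
    values = []
    if not ts.accept("]"):
        values.append(parse_value(ts))
        while ts.accept(","):
            values.append(parse_value(ts))
        ts.expect("]")
    return values
```

`parse_list(Tokens("[ 1 , [ 2 , 3 ] , [ ] ]".split()))` returns `[1, [2, 3], []]`. The grammar comments double as lightweight documentation, so the informal grammar stays next to the code that implements it.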
5/10/2026 at 9:49:36 AM
Yep. I started out using ANTLR for one project of mine. I ended up spending loads of time fighting its syntax to do really quite simple things, and it was slow! I probably wasn't holding it right. In the end, I wrote a simple lexer and recursive descent parser (with a small amount of lookahead) in a weekend. The code was easy to read, easy to extend, and fast.
by MattPalmer1086
5/10/2026 at 2:14:02 AM
I wrote a few of these due to an interest in compilers and hardware.
The easiest syntax to copy if you’re looking for a high level language is Smalltalk.
But most of the time, I wouldn’t even use that. Simple imperative languages that look like BASIC work pretty well in most domains. If you simplify the syntax a little, it’s very easy to understand the compiler and use it when, say, you want users to input code into existing systems.
by true_religion
5/10/2026 at 2:53:11 AM
I have written compilers for two families over the years: C and ML. My current preference is Python. I am currently working on a statically typed language that is inspired by Python (minus objects and OOP) that runs on a register VM.
Syntax is a minor issue but something that people are very opinionated about. You could technically build multiple front ends that share the typechecking, CFG validation, optimization, register allocation and byte code emission phases. But it is too much work for what is presently a personal project.
by sieve
5/10/2026 at 11:16:38 AM
Are they public? Can we study from them? I got into compilers late and I'm trying a little bit of everything.
by FattiMei
5/10/2026 at 3:17:29 PM
The older ones, no. The current project will be. I am developing in private and occasionally rewriting the jj tree to make each commit self-contained. So it won't contain all the false-starts and bad code, only the cleaned up version.
"ut" (https://github.com/s-i-e-v-e/ut) exists, but it is more of a POC for the syntax than anything. So lexer+parser+typechecker. Did this during COVID in TypeScript but did not finish.
by sieve
5/10/2026 at 11:33:08 AM
There are many open source compiler and interpreter projects on GitHub.
also:
https://github.com/BaseMax/AwesomeInterpreter
and probably there is one for compilers too.
by fuzztester
5/10/2026 at 2:56:10 PM
I feel the need to reference the Crafting Interpreters design note on parsing here.[1]
To quote an excerpt: "If you’re just trying to get your parser done, pick one of the bog-standard techniques, use it, and move on. Recursive descent, Pratt parsing, and the popular parser generators like ANTLR or Bison are all fine."
In my opinion, parser generators are great if you just want to write a grammar and be done. Especially if that's what you learned in college and it's all you know.
[1]: https://craftinginterpreters.com/compiling-expressions.html#...
by Levitating
5/10/2026 at 5:54:24 AM
I agree. I have written a lexer/parser for my language twice (for compiler0 and for a self-hosted compiler). It's a very dumb task requiring almost no mental load.
Profiling results show that the amount of time spent lexing/parsing is negligible - less than 1% of the total compilation time.
by Panzerschrek