12/30/2025 at 7:57:10 PM
I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads. ~41K pages/sec peak throughput.
Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.
~5,000 lines, no dependencies, compiles in <2s.
Why it's fast:
- Memory-mapped file I/O (no read syscalls)
- Zero-copy parsing where possible
- SIMD-accelerated string search for finding PDF structures
- Parallel extraction across pages using Zig's thread pool
- Streaming output (no intermediate allocations for extracted text)
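To make the first and third bullets concrete, here is a minimal sketch in Python (not the library's Zig code): it memory-maps a file and searches backwards for the `startxref` keyword that, per the PDF file layout, precedes the byte offset of the last cross-reference section. The function name `find_startxref` and the tiny synthetic file are assumptions for illustration; Python's `mmap.rfind` stands in for the SIMD search.

```python
import mmap
import os
import tempfile


def find_startxref(path):
    """Return the integer that follows the last `startxref` keyword, or None."""
    with open(path, "rb") as f:
        # Map the whole file read-only; the search below touches pages on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            keyword = b"startxref"
            pos = mm.rfind(keyword)
            if pos == -1:
                return None
            # The offset is the decimal number right after the keyword.
            tail = mm[pos + len(keyword):pos + len(keyword) + 32]
            tokens = tail.split()
            if tokens and tokens[0].isdigit():
                return int(tokens[0])
            return None


# Demo on a tiny synthetic file (not a valid PDF, just the trailer shape).
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.7\n% body omitted\nstartxref\n1234\n%%EOF\n")

offset = find_startxref(tmp.name)
os.unlink(tmp.name)
print(offset)
```

Searching from the end matters because an incrementally updated file can contain several `startxref` entries, and the last one is the entry point.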
What it handles:
- XRef tables and streams (PDF 1.5+)
- Incremental PDF updates (/Prev chain)
- FlateDecode, ASCII85, LZW, RunLength decompression
- Font encodings: WinAnsi, MacRoman, ToUnicode CMap
- CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
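As an illustration of the last bullet, a ToUnicode mapping can target a UTF-16BE hex string, and code points above U+FFFF arrive as a surrogate pair. The sketch below is Python rather than Zig, and the helper name `decode_utf16be_hex` is hypothetical; it only shows how a pair is combined into one code point.

```python
def decode_utf16be_hex(hex_string):
    """Decode a hex string of UTF-16BE code units, combining surrogate pairs."""
    data = bytes.fromhex(hex_string)
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # High surrogate carries the top 10 bits, low surrogate the bottom 10.
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            chars.append(chr(code_point))
            i += 2
        else:
            chars.append(chr(unit))
            i += 1
    return "".join(chars)


print(decode_utf16be_hex("00410042"))   # two BMP code units
print(decode_utf16be_hex("D83DDE00"))   # one surrogate pair
```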
by lulzx
12/31/2025 at 2:50:27 AM
FWIW - mupdf is simply not fast. I've done lots of pdf indexing apps, and mupdf is by far the slowest and least able to open valid pdfs when it came to text extraction. It also takes tons of memory.
A better speed comparison would be either multi-process pdfium (since pdfium was forked from foxit before multi-thread support, you can't thread it), multi-threaded foxit, or something like syncfusion (which is quite fast and supports multiple threads). Or even single thread pdfium vs single thread your-code.
These were always the fastest/best options. I can (and do) achieve 41k pages/sec or better on these options.
The other thing you don't appear to mention is whether you handle putting the words in reading order (i.e. how they appear on the page), or only stream order (which varies in its relation to appearance order).
If it's only stream order, sure, that's really fast to do. But also not anywhere near as helpful as reading order, which is what other text-extraction engines do.
Looking at the code, it looks like the code to do reading order exists, but is not what is being benchmarked or used by default?
If so, this is really comparing apples and oranges.
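To show why the distinction matters, here is a deliberately naive sketch of turning stream-order spans into reading order: group spans whose baselines are close into lines, sort lines top to bottom, then sort each line left to right. The `(x, y, text)` tuple shape, the `line_tolerance` parameter, and the function name are assumptions for illustration; real engines also deal with columns, rotation, and right-to-left text.

```python
def reading_order(spans, line_tolerance=2.0):
    """Group (x, y, text) spans into lines and return the lines top to bottom.

    Assumes PDF-style coordinates: origin at the bottom-left, y grows upward.
    """
    # Highest baseline first, so the top of the page comes out first.
    by_height = sorted(spans, key=lambda span: -span[1])
    lines = []
    for span in by_height:
        if lines and abs(lines[-1][0][1] - span[1]) <= line_tolerance:
            lines[-1].append(span)   # same baseline (within tolerance)
        else:
            lines.append([span])     # start a new line
    # Within each line, order spans left to right.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda span: span[0]))
        for line in lines
    ]


# Spans as they might appear in the content stream, out of visual order.
stream_order = [
    (100, 700.0, "world"),
    (50, 650.0, "second"),
    (50, 700.5, "Hello"),
    (110, 650.0, "line"),
]
print(reading_order(stream_order))
```

Emitting `stream_order` as-is would give "world second Hello line", which is the gap being described.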
by DannyBee
12/30/2025 at 9:24:25 PM
What kind of performance are you seeing with/without SIMD enabled?
From https://github.com/Lulzx/zpdf/blob/main/src/main.zig it looks like the help text cites an unimplemented "-j" option to enable multiple threads.
There is a "--parallel" option, but that is only implemented for the "bench" command.
by tveita
12/30/2025 at 9:38:09 PM
I have now made parallel the default and added an option to enable multiple threads.
I haven't tested without SIMD.
by lulzx
12/30/2025 at 9:28:35 PM
You've released quite a few projects lately, very impressive.
Are you using LLMs for parts of the coding?
What's your work flow when approaching a new project like this?
by cheshire_cat
12/30/2025 at 9:59:47 PM
> Are you using LLMs for parts of the coding?
I can't talk about the code, but the readme and commit messages are most likely LLM-generated.
And when you take into account that the first commit happened just three hours ago, it feels like the entire project has been vibe coded.
by littlestymaar
12/30/2025 at 10:24:52 PM
Hard disagree. Initial commit was 6k LOC. Author could've spent years before committing. Ill-advised but not impossible.
by Neywiny
12/30/2025 at 10:44:04 PM
Why would you make Claude write your commit message for a commit you've spent years working on though?
by littlestymaar
12/30/2025 at 10:56:38 PM
1. Be not good at or a fan of git when coding
2. Be not good at or a fan of git when committing
Not sure what the disconnect is.
Now if it were vibecoded, I wouldn't be surprised. But benefit of the doubt
by Neywiny
12/31/2025 at 12:56:29 AM
We're well beyond benefit of the doubt these days. If it looks like a duck... For me there wasn't any doubt, the author's first top comment here was evidence enough, then seeing the readme + random code + random commit message, it's all obvious LLM-speak to me.
I don't particularly care, though, and I'm more positive about LLMs than negative even if I don't (yet?) use them very much. I think it's hilarious that a few people asked for Python bindings and then bam, done, and one person is like "..wha?" Yes, LLMs can do that sort of grunt work now! How cool, if kind of pointless. Couldn't the cycles have just been spent on trying to make MuPDF better? Though I see they're in C and AGPL, I suppose either is motivation enough to do a rewrite instead. (This is MIT Licensed though it's still unclear to me how 100% or even large-% vibe-coded code deserves any copyright protection, I think all such should generally be under the Unlicense/public domain.)
If the intent of "benefit of the doubt" is to reduce people having a freak out over anyone who dares use these tools, I get that.
by Jach
12/31/2025 at 1:10:54 AM
I have updated the licence to WTFPL.
I'll try my best to make it a really good one!
by lulzx
12/31/2025 at 11:41:00 AM
> I have updated the licence to WTFPL.
You still have no basis for claiming copyright protection, hence you cannot set a license on that code.
Instead of the WTFPL you should just write a disclaimer that, due to being machine generated and devoid of creative work, the work is not protected by copyright and is free to be used without any license.
by littlestymaar
12/31/2025 at 11:47:26 AM
Hasn't the world moved on from these things already?
by lulzx
1/1/2026 at 9:27:12 AM
Has the world moved on from copyright? Or expecting other people to behave ethically and fairly?
No, and god I hope not.
But it's a real dick move to set up your CI the way you have. Zig explicitly requests using one of the many mirrors for CI instead of hammering the main ziglang.org site itself. Perhaps you've moved on from trying to be ethical?
by grayhatter
1/1/2026 at 9:56:12 AM
That's good to know, I wasn't aware of it. I have updated to using a GitHub action they recommend (https://github.com/marketplace/actions/setup-zig-compiler).
For the copyright thing, I understand that there's legit ongoing debate around all this AI-assisted coding and copyrightability.
In this case of zpdf, while Claude Code did a lot of the heavy lifting on implementation, there was a real effort in architecture decisions, iterative prompting/refinement, debugging, testing, benchmarking.
My intent is zero restrictions: use it, fork it, sell it, whatever. WTFPL captures that spirit perfectly for me. It's as permissive as legally possible while being upfront about not caring.
The goal is just to make a useful tool freely available.
Edit: I have changed it to CC0.
by lulzx
12/30/2025 at 10:01:04 PM
Claude Code.
by lulzx
12/30/2025 at 9:57:47 PM
What's fast about mmap?
by jeffbee
12/31/2025 at 4:25:22 AM
Two big advantages:
You avoid an unnecessary copy. A normal read system call gets the data from disk hardware into the kernel page cache and then copies it into the buffer you provide in your process memory. With mmap, the page cache is mapped directly into your process memory, no copy.
All running processes share the mapped copy of the file.
There are a lot of downsides to mmap: you lose explicit error handling and fine-grained control of when exactly I/O happens. Consult the classic article on why sophisticated systems like DBMSs do not use mmap: https://db.cs.cmu.edu/mmap-cidr2022/
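A small Python sketch of the first advantage, under the assumption of a throwaway temp file: `read()` hands back a private copy of the bytes, while a `memoryview` over an `mmap` object slices the mapped pages without copying until you ask for `bytes`. It shows the API shape only and measures nothing.

```python
import mmap
import os
import tempfile

# Throwaway file: a header, some padding, and a trailer.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"header" + b"\0" * 4096 + b"trailer")

with open(tmp.name, "rb") as f:
    # read(): data is copied from the page cache into a private bytes object.
    copied = f.read()

    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        view = memoryview(mm)
        # Slicing a memoryview does not copy; it refers to the mapped pages.
        window = view[-7:]
        same = bytes(window) == copied[-7:]
        # Views must be released before the mapping can be closed.
        window.release()
        view.release()

os.unlink(tmp.name)
print(same, len(copied))
```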
by kennethallen
12/31/2025 at 5:44:30 PM
> Consult the classic article on why sophisticated systems like DBMSs do not use mmap: https://db.cs.cmu.edu/mmap-cidr2022/
Sqlite does (or can optionally) use mmap. How come?
Is sqlite with mmap less reliable or anything?
by nextaccountic
1/1/2026 at 12:05:27 AM
If an I/O error happens with read()/write(), you get back an error code, which SQLite can deal with and pass back up to the application, perhaps accompanied by a reasonable error message. But if you get an I/O error with mmap, you get a signal. SQLite itself ought not be setting signal handlers, as that is the domain of the application and SQLite is just a lowly library. And even if SQLite could set signal handlers, it would be difficult to associate a signal with a particular I/O operation. So there isn't a good way to deal with I/O errors when using mmap(). With mmap(), you just have to assume that the filesystem/mass-storage works flawlessly and never runs out of space.
SQLite can use mmap(). That is a tested and supported capability. But we don't advocate it because of the inability to precisely identify I/O errors and report them back up into the application.
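The read() half of that contrast is easy to demonstrate. The sketch below is Python, not SQLite code, and `read_or_report` is a hypothetical helper: a failing read surfaces as an error value tied to that specific call, which a library can hand back to its caller. A closed descriptor stands in for a real device error, which is an assumption made so the example fails predictably.

```python
import errno
import os


def read_or_report(fd, nbytes):
    """Return (data, None) on success or (None, error_name) on failure."""
    try:
        return os.read(fd, nbytes), None
    except OSError as exc:
        # The failure is attached to this exact call, so it can be reported upward.
        return None, errno.errorcode.get(exc.errno, str(exc.errno))


# Provoke a predictable failure: read from a descriptor that was just closed.
read_end, write_end = os.pipe()
os.close(read_end)
data, err = read_or_report(read_end, 16)
os.close(write_end)
print(data, err)
```

With a mapping there is no equivalent call site to wrap: the failure shows up at whatever memory access happened to touch the bad page.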
by SQLite
1/1/2026 at 6:07:55 PM
Thanks for the response. I am more worried about losing already committed data due to an error.
https://www.sqlite.org/mmap.html
> The operating system must have a unified buffer cache in order for the memory-mapped I/O extension to work correctly, especially in situations where two processes are accessing the same database file and one process is using memory-mapped I/O while the other is not. Not all operating systems have a unified buffer cache. In some operating systems that claim to have a unified buffer cache, the implementation is buggy and can lead to corrupt databases.
What are those OSes with buggy unified buffer caches? More importantly, is there a list of platforms where the use of mmap in sqlite can lead to data loss?
by nextaccountic
12/31/2025 at 6:14:07 PM
I know that the spirit of HN will strike me down for this, but sqlite is not a "sophisticated system". It assumes the hardware is lawful neutral. Real hardware is chaotic. Sqlite has a good reputation because it is very easy to use. In fact this is the same reason programmers like mmap: it is a hell of a shortcut.
by jeffbee
1/1/2026 at 5:37:15 PM
I think the main thing is whether mmap will make sqlite lose data or otherwise corrupt already committed data... it will if two programs open the same sqlite, one with mmap, and another without https://www.sqlite.org/mmap.html - at least "in some operating systems" (no mention of which ones)
https://www.sqlite.org/mmap.html
> The operating system must have a unified buffer cache in order for the memory-mapped I/O extension to work correctly, especially in situations where two processes are accessing the same database file and one process is using memory-mapped I/O while the other is not. Not all operating systems have a unified buffer cache. In some operating systems that claim to have a unified buffer cache, the implementation is buggy and can lead to corrupt databases.
Sqlite is otherwise rock solid and won't lose data as easily
by nextaccountic
12/31/2025 at 8:24:29 AM
> you lose explicit error handling
I've never had to use mmap but this has always been the issue in my head. If you're treating I/O as memory pages, what happens when you read a page and it needs to "fault" by reading the backing storage but the storage fails to deliver? What can be said at that point, or does the program crash?
by commandersaki
1/2/2026 at 12:34:09 AM
If you fail to load an mmapped page because of an I/O error, Unix-like OSes interrupt your program with SIGBUS/SIGSEGV. It might be technically possible to write a program that would handle those signals and recover, but it seems like a lot more work and complexity than just checking errno after a read system call.
by kennethallen
12/31/2025 at 5:31:24 AM
This is a very interesting link. I didn't expect mmap to be less performant than read() calls.
I now wonder which use cases mmap would suit better - if any...
> All running processes share the mapped copy of the file.
So something like building linkers that deal with read only shared libraries "plugins" etc ..?
by saidinesh5
1/2/2026 at 12:54:13 AM
mmap is better when:
* You want your program to crash on any I/O error because you wouldn't handle them anyway
* You value the programming convenience of being able to treat a file on disk as if the entire thing exists in memory
* The performance is good enough for your use. As the article showed, sequential scan performance is as good as direct I/O until the page cache fills up *from a single SSD*, and random access performance is as good as direct I/O until the page cache fills up *if you use MADV_RANDOM*. If your data doesn't fit in memory, or is across multiple storage devices, or you don't correctly advise the OS about your access patterns, mmap will probably be much slower
To be clear, normal I/O still benefits from the OS's shared page cache, where files that other processes have loaded will probably still be in memory, avoiding waiting on the storage device. But each normal I/O process incurs the space and time cost of a copy into its private memory, unlike mmap.
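For the MADV_RANDOM point, a minimal sketch of advising the OS about the access pattern from Python. `mmap.madvise` and `mmap.MADV_RANDOM` exist in Python 3.8+ on platforms that support them, so the call is guarded; the helper name `map_for_random_access` and the 16 KiB temp file are assumptions for illustration.

```python
import mmap
import os
import tempfile


def map_for_random_access(path):
    """Map a file read-only and hint that accesses will be random."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        # Tells the kernel not to bother with sequential readahead.
        mm.madvise(mmap.MADV_RANDOM)
    return mm


# 16 KiB test file where the byte at offset k is k % 256.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 64)

mm = map_for_random_access(tmp.name)
samples = [mm[offset] for offset in (0, 5000, 16383)]
mm.close()
os.unlink(tmp.name)
print(samples)
```

The hint changes readahead behaviour only; it does nothing about the error-handling tradeoff described above.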
by kennethallen
12/31/2025 at 10:07:34 AM
One reason to use shared memory mmap is to ensure that even if your process crashes, the memory stays intact. Another is to communicate between different processes.
by squirrellous
12/30/2025 at 11:43:10 PM
It allows the program to reference memory without having to manage it in the heap space. It would make the program faster in a memory-managed language; otherwise it would reduce the memory footprint consumed by the program.
by rishabhaiover
12/30/2025 at 11:53:23 PM
You mean it converts an expression like `buf[i]` into a baroque sequence of CPU exception paths, potentially involving a trap back into the kernel.
by jeffbee
12/31/2025 at 12:08:59 AM
I don't fully understand the under-the-hood mechanics of mmap, but I can sense that you're trying to convey that mmap shouldn't be used as a blanket optimization technique, as there are tradeoffs in terms of page fault overheads (being at the mercy of OS page cache mechanics).
by rishabhaiover
12/31/2025 at 3:56:54 AM
Tradeoffs such as "if an I/O error occurs, the program immediately segfaults." Also, I doubt you're I/O bound to the point where mmap is noticeably better than read, but I guess it's fine for an experiment.
by StilesCrisis
12/31/2025 at 8:53:33 PM
An I/O error on a mmapped file causes a SIGBUS, which the program can catch and report.
And I/O bound programs are I/O bound whereas programs that aren't, aren't, so it really isn't meaningful to talk about whether "you" are I/O bound to the point that it's significant--maybe you are, maybe you aren't. I agree about experimentation.
by jibal
12/31/2025 at 1:16:02 AM
I think he's conveying that he doesn't know what he's talking about. buf[i] generates the same code regardless of whether mmap is being used. The first access to a page will cause a trap that loads the page into memory, but this is also true if the memory is read into.
by jibal
12/30/2025 at 10:27:32 PM
What’s the fidelity like compared to tika?
by jonstewart
12/30/2025 at 10:39:23 PM
The accuracy difference is marginal (1-2%) but the speed difference is massive.
by lulzx
12/31/2025 at 8:11:08 AM
> I built
You didn't. Claude did. Just like it wrote this comment.
And you didn't even bother testing it before submitting, which is insulting to everyone.
by littlestymaar
12/31/2025 at 12:01:07 PM
tools are tools.
by lulzx