Show HN: 22 GB of Hacker News in SQLite

12/30/2025 at 7:27:22 PM

Don't miss how this works. It's not a server-side application - this code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching a full 22GB database it instead uses a clever hack that retrieves just "shards" of the SQLite database needed for the page you are viewing.

I watched it in the browser network panel and saw it fetch:

  https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz

As I paginated to previous days.

It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, this one uses sharded files instead.

The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against, there are 1636 total.

by simonw

12/30/2025 at 11:48:45 PM

A read-only VFS doing this can be really simple, with the right API…

This is my VFS: https://github.com/ncruces/go-sqlite3/blob/main/vfs/readervf...

And using it with range requests: https://pkg.go.dev/github.com/ncruces/go-sqlite3/vfs/readerv...

And having it work with a Zstandard compressed SQLite database, is one library away: https://pkg.go.dev/github.com/SaveTheRbtz/zstd-seekable-form...

by ncruces

12/31/2025 at 9:29:06 AM

Your page is served over sqlitevfs with Range queries? Let's try this here.

by keepamovin

12/31/2025 at 9:33:34 PM

I did a similar VFS in Go. It doesn't run client-side in a browser.

But you can use it (e.g.) in a small VPS to access a multi-TB database directly from S3.

by ncruces

1/1/2026 at 5:14:07 AM

That is cool. Maybe i look at the go code

by keepamovin

12/31/2025 at 5:07:48 AM

this does not caches the data right? it would always fetch from network? by any chance do you know of solution/extension that caches the data it would make it so much more efficient.

by pdyc

12/31/2025 at 5:56:57 AM

The package I'm using in the HTTP example can be configured to cache data: https://github.com/psanford/httpreadat?tab=readme-ov-file#ca...

But, also, SQLite caches data; you can simply increase the page cache.

by ncruces

12/31/2025 at 4:45:34 AM

Thanks! I'm glad you enjoyed the sausage being made. There's a little easter egg if you click on the compact disc icon.

And I just now added a 'me' view. Enter your username and it will show your comments/posts on any day. So you can scrub back through your 2006 - 2025 retrospective using the calendar buttons.

by keepamovin

12/31/2025 at 2:34:32 PM

I almost got tricked into trying to figure out what was Easter eggy about August 9 2015 :-) There's a clarifying tooltip on the link, but it is mostly obscured by the image's "Archive" title attribute.

by oblosys

12/31/2025 at 2:37:56 PM

Oh, shit that was the problem! You solved the bug! I was trying to figure out why the right tooltip didn't display. A linked wrapped in an image wrapped in an easter egg! Or something. Ha, thank you. Will fix :)

edit: Fixed! Also I just pushed a new version with a Dec 29th Data Dump, so ... updates - yay!

by keepamovin

12/31/2025 at 8:34:55 PM

Happy to help!

by oblosys

12/30/2025 at 9:33:29 PM

Is there anything more production grade built around the same idea of HTTP range requests like that sqlite thing? This has so much potential

by nextaccountic

12/30/2025 at 10:25:06 PM

Yes — PMTiles is exactly that: a production-ready, single-file, static container for vector tiles built around HTTP range requests.

I’ve used it in production to self-host Australia-only maps on S3. We generated a single ~900 MB PMTiles file from OpenStreetMap (Australia only, up to Z14) and uploaded it to S3. Clients then fetch just the required byte ranges for each vector tile via HTTP range requests.

It’s fast, scales well, and bandwidth costs are negligible because clients only download the exact data they need.

https://docs.protomaps.com/pmtiles/

by Humphrey

12/30/2025 at 10:32:30 PM

PMTiles is absurdly great software.

by simonw

12/30/2025 at 10:36:49 PM

I know right! I'd never heard of HTTP Range requests until PMTiles - but gee it's an elegant solution.

by Humphrey

12/31/2025 at 8:25:20 AM

Hadn't seen PMTiles before, but that matches the mental model exactly! I chose physical file sharding over Range Requests on a single db because it felt safer for 'dumb' static hosts like CF. - less risk of a single 22GB file getting stuck or cached weirdly. Maybe it would work

by keepamovin

12/31/2025 at 9:20:36 AM

My only gripe is that the tile metadata is stored as JSON, which I get is for compatibility reasons with existing software, but for e.g. a simple C program to implement the full spec you need to ship a JSON parser on top of the PMTiles parser itself.

by hyperbolablabla

12/31/2025 at 12:56:25 PM

A JSON parser is less than a thousand lines of code.

by seg_lol

12/31/2025 at 5:50:33 PM

And where most of CPU time will be wasted in, if you care about profiling/improving responsiveness.

by Diti

1/1/2026 at 6:44:41 AM

At that point you're just io bound, no? I can easily parse json at 100+GB/s on commodity hardware, but I'm gonna have a much harder time actually delivering that much data to parse.

by monerozcash

1/1/2026 at 6:27:08 AM

What's a better way?

by keepamovin

12/31/2025 at 3:22:12 PM

How would you store it?

by keepamovin

12/31/2025 at 1:23:56 AM

That's neat, but.. is it just for cartographic data?

I want something like a db with indexes

by nextaccountic

12/31/2025 at 6:56:14 AM

Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata about the set it contains that can be used to prune out data that doesn’t need to be scanned by the query engine. https://duckdb.org/docs/stable/guides/performance/indexing

LanceDB has a similar mechanism for operating on remote vector embeddings/text search.

It’s a fun time to be a dev in this space!

by jtbaker

1/2/2026 at 1:44:45 AM

> Look into using duckdb with remote http/s3 parquet files. The parquet files are organized as columnar vectors, grouped into chunks of rows. Each row group stores metadata about the set it contains that can be used to prune out data that doesn’t need to be scanned by the query engine. https://duckdb.org/docs/stable/guides/performance/indexing

But, when using this on frontend, are portions of files fetched specifically with http range requests? I tried to search for it but couldn't find details

by nextaccountic

1/4/2026 at 3:37:58 AM

Yes, you should be able to see the byte range requests and 206 responses from an s3 compatible bucket or http server that supports those access patterns.

by jtbaker

12/30/2025 at 9:37:23 PM

There was a UK government GitHub repo that did something interesting with this kind of trick against S3 but I checked just now and the repo is a 404. Here are my notes about what it did: https://simonwillison.net/2025/Feb/7/sqlite-s3vfs/

Looks like it's still on PyPI though: https://pypi.org/project/sqlite-s3vfs/

You can see inside it with my PyPI package explorer: https://tools.simonwillison.net/zip-wheel-explorer?package=s...

by simonw

12/30/2025 at 10:30:47 PM

I recovered it from https://archive.softwareheritage.org/browse/origin/directory... and pushed a fresh copy to GitHub here:

https://github.com/simonw/sqlite-s3vfs

This comment was helpful in figuring out how to get a full Git clone out of the heritage archive: https://news.ycombinator.com/item?id=37516523#37517378

Here's a TIL I wrote up of the process: https://til.simonwillison.net/github/software-archive-recove...

by simonw

12/30/2025 at 11:12:06 PM

I also have a locally cloned copy of that repo from when it was on GitHub. Same latest commit as your copy of it.

From what I see in GitHub in your copy of the repo, it looks like you don’t have the tags.

Do you have the tags locally?

If you don’t have the tags, I can push a copy of the repo to GitHub too and you can get the tags from my copy.

by QuantumNomad_

12/30/2025 at 11:23:38 PM

I don't have the tags! It would be awesome if you could push that.

by simonw

12/30/2025 at 11:29:57 PM

Uploaded here:

https://github.com/Quantum-Nomad/sqlite-s3vfs

by QuantumNomad_

12/30/2025 at 11:33:57 PM

Thanks for that, though actually it turns out I had them after all - I needed to run:

  git push --tags origin

by simonw

12/30/2025 at 11:35:53 PM

All the better :)

by QuantumNomad_

12/31/2025 at 1:00:38 AM

Doing all this in an hour is such a good example of how absurdly efficient you can be with LLMs.

by bspammer

12/31/2025 at 6:35:48 PM

From reading the TIL, it doesn't appear as if Simon used LLM for a large portion of what he did; only the initial suggestion to check the archive, and the web tool to make his process reproducible. Also, if you read the script from his chat with Claude code, the prompt really does the heavy lifting.

Sure, the LLM fills in all the boilerplate and makes an easy-to-use, reproducible tool with loads of documentation, and credit for that. But is it not more accurate to say that Simon is absurdly efficient, LLM or sans LLM? :)

by socialcommenter

12/30/2025 at 10:46:42 PM

didn't you do something similar for Datasette, Simon?

by AceJohnny2

12/30/2025 at 11:02:07 PM

Nothing smart with HTTP range requests yet - I have https://lite.datasette.io which runs the full Python server app in the browser via WebAssembly and Pyodide but it still works by fetching the entire SQLite file at once.

by simonw

12/30/2025 at 11:04:57 PM

oh! I must've been confused with your TIL where you linked to an explainer of this technique

https://simonwillison.net/2021/May/2/hosting-sqlite-database...

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

https://news.ycombinator.com/item?id=27016630

by AceJohnny2

12/31/2025 at 3:13:32 AM

i played around with this a while back. you can see a demo here. it also lets you pull new WAL segments in and apply them to the current database. never got much time to go any further with it than this.

https://just.billywhizz.io/sqlite/demo/#https://raw.githubus...

by billywhizz

12/30/2025 at 9:43:42 PM

This is somewhat related to a large dataset browsing service a friend and I worked on a while back - we made index files, and the browser ran a lightweight query planner to fetch static chunks which could be served from S3/torrents/whatever. It worked pretty well, and I think there’s a lot of potential for this style of data serving infra.

by ericd

12/31/2025 at 3:05:12 AM

gdal vsis3 dynamically fetches chunks of rasters from s3 using range requests. It is the underlying technology for several mapping systems.

There is also a file format to optimize this https://cogeo.org/

by __turbobrew__

12/31/2025 at 1:33:48 AM

I tried to implement something similar to optimize sampling semi-random documents from (very) large datasets on Huggingface, unfortunately their API doesn't support range requests well.

by omneity

12/31/2025 at 4:04:00 AM

This is pretty much well what is so remarkable about parquet files; not only do you get seekable data, you can fetch only the columns you want too.

I believe that there are also indexing opportunities (not necessarily via eg hive partitioning) but frankly - am kinda out of my depth pn it.

by mootothemax

12/31/2025 at 12:01:12 AM

I want to see a bittorrent version :P

by 6510

12/31/2025 at 1:21:07 AM

Maybe webtorrent-based?

by nextaccountic

12/31/2025 at 4:28:59 AM

Parquet/iceberg

by tlarkworthy

12/31/2025 at 6:57:07 PM

A recent change is I added date spans to the shard checboxes on query view so it's easier to zero dates you want if you have that in mind. Because if your copy isn't local all those network pulls take a while.

The sequence of shards you saw when you paginated to days is faciliated by the static-manifest which maps HN item ID ranges to shards, and since IDs are increasing and a pretty good proxy of time (a "HN clock"), we can also map the shards that we cut up by ID to the time spans their items cover. An in memory table sorted by time is created from the manifest on load so we can easily look up which shard we need when you pick a day.

Funnily enough, this system was thrown off early on by a handful of "ID/timestamp" outliers in the data: items with weird future timestamps (offset by a couple years), or null timestamps. To cleanse our pure data from this noise, and restore proper adjacent-in-time shard cuts we just did a 1/99 percentile grouping and discarded the outliers leaving shards with sensible 'effective' time spans.

Sometimes we end up fetching two shards when you enter a new day because some items' comments exist "cross shard". We needed another index for that and it lives in cross-shard-index.bin which is just a list of 4-byte item IDs that have children in more than 1 shard (2-bytes), which occurs when people have the self-indulgence to respond to comments a few days after a post has died down ;)

Thankfully HN imposes a 2 week horizon for replies so there aren't that many cross-shard comments (those living outside the 2-3 days span of most, recent, shards). But I think there's still around 1M or so, IIRC.

by keepamovin

12/31/2025 at 3:32:24 AM

I am curios why they don't use a single file and HTTP Range Requests instead. PMTiles (a distribution of OpenStreetMap) uses that.

by maxloh

12/31/2025 at 4:48:12 AM

This would be a neat idea to try. Want to add a PR? Bench different "hackends" to see how DuckDB, SQLite shards, or range queries perform?

by keepamovin

12/31/2025 at 2:18:03 AM

I love this so much, on my phone this is much faster than actual HN (I know it's only a read-only version).

Where did you get the 22GB figure from? On the site it says:

> 46,399,072 items, 1,637 shards, 8.5GB, spanning Oct 9, 2006 to Dec 28, 2025

by meander_water

12/31/2025 at 2:57:17 AM

> Where did you get the 22GB figure from?

The HN post title (:

by amitmahbubani

12/31/2025 at 4:46:18 AM

22GB is non-gzipped.

by keepamovin

12/31/2025 at 2:59:48 AM

Hah, well that's embarrassing

by meander_water

12/31/2025 at 3:13:21 AM

The GitHub page is no longer available, which is a shame because I'm really interested in how this works.

How was the entirety of HN stored in a single SQLite database? In other words, how was the data acquired? And how does the page load instantly if there's 22GB of data having to be downloaded to the browser?

by sodafountan

12/31/2025 at 4:50:32 AM

You can see it now, forgot to make it public.

- 1. download_hn.sh - bash script that queries BigQuery and saves the data to *.json.gz

- 2. etl-hn.js - does the sharding and ID -> shard map, plus the user stats shards.

- 3. Then either npx serve docs or upload to CloudFlare Pages.

The ./toool/s/predeploy-checks.sh script basically runs the entire pipeline. You can do it unattended with AUTO_RUN=true

by keepamovin

12/31/2025 at 6:47:00 AM

Awesome, I'll take a look

by sodafountan

12/31/2025 at 10:28:10 PM

Is it possible to implement search this way?

by dzhiurgis

12/30/2025 at 7:47:57 PM

Vfs support is amazing.

by tehlike

12/30/2025 at 9:17:46 PM

I wonder how much smaller it could get with some compression. You could probably encode "This website hijacks the scrollbar and I don't like it" comments into just a few bits.

by yread

12/30/2025 at 10:27:56 PM

The hard-coded dictionary wouldn't be much stranger than Brotli's:

https://news.ycombinator.com/item?id=27160590

by Rendello

12/31/2025 at 5:45:14 AM

You can use a BPE variant like SentencePiece to identify these patterns rather than hard coding them.

by maxbond

12/30/2025 at 10:23:11 PM

That's at least 45%, then you can leave out all of my comments and you're left with only 5!

by jacquesm

12/31/2025 at 2:31:36 AM

It might be a neat experiment to use ai to produce canonicalized paraphrasings of HN arguments so they could be compared directly and compress well.

by hamburglar

1/1/2026 at 11:22:14 PM

Dear it is already compressed using G zip – nine for every SQLlight shard and manifest

22 GB is uncompressed and compressed the entire things about 9 GB

by keepamovin

12/31/2025 at 6:45:03 AM

Guilty.

by rossant

12/31/2025 at 12:58:22 AM

It'd be great if you could add it to Kiwix[1] somehow (not sure what the process is for that but 100rabbits figured it out for their site) - I use it all the time now that I have a dumb phone - I have the entirety of wikipedia, wiktionary and 100rabbits all offline.

https://kiwix.org/en/

by kamranjon

12/31/2025 at 3:43:32 AM

I love that you have 100r.ca on that short list.

by codazoda

12/31/2025 at 5:15:40 AM

what dumb phone do you use?

and why do you want wikipedia in your pocket, but not a smartphone? where do you draw the line?

(doing a lot of work in that area, so i am asking to learn from someone who might think alike)

by endofreach

12/31/2025 at 5:34:01 AM

I use the Mudita Kompakt specifically cause it allows sideloading so I can still have a few extras. Right now I have Kiwix and Libby. It works really well.

I have a $10 a month plan from US cellular with only 2gigs so I try to keep everything offline that I can.

Honestly it's mostly the news... so I draw the line at browser, I'll never install a browser, that's basically something I can do when I sit down at a PC. I read quite a bit and I like to have the ability to look up a word or a historical event or some reference from something I read using Kiwix and it's been great for that, just needed to add a 512gb micro sd card. And Libby I just use at the gym when I'm on the treadmill.

by kamranjon

1/1/2026 at 9:32:46 PM

interesting. thank you. any way i can reach out to you regardibg a project i am working on?

your input would be very valuable.

by endofreach

1/1/2026 at 10:00:21 PM

This is a good idea — we should do it.

I also want to make sure we can build this in CI. My goal is to have this updated every day using the BigQuery update process, so it becomes a 1–2 day delayed static archive of the current state of Hacker News, which is honestly very cool.

I can probably run the build for free on GitHub Actions runners, as long as the runner has about 40 GB of disk space available. If needed, I can free up space on the runner before the build starts.

I’ll also write to GitHub support and ask if they can sponsor the cost of a larger runner, mainly because I need the extra disk space to run the build reliably.

by keepamovin

12/30/2025 at 7:41:47 PM

Similar to Single-page applications (SPA), single-table application (STA) might become a thing. Just a shard a table on multiple keys and serve the shards as static files, provided that the data is Ok to share, similar to sharing static html content.

by zkmon

12/30/2025 at 8:58:53 PM

[The Baked Data architectural pattern](https://simonwillison.net/2021/Jul/28/baked-data/)

by jhd3

12/30/2025 at 8:23:23 PM

do you mean single database? it'd be quite hard if not impossible to make applications using a single table (no relations). reddit did it though, they have a huge table of "things" iirc.

by jesprenj

12/30/2025 at 9:09:31 PM

That is a common misconception.

> Next, we've got more than just two tables. The quote/paraphrase doesn't make it clear, but we've got two tables per thing. That means Accounts have an "account_thing" and an "account_data" table, Subreddits have a "subreddit_thing" and "subreddit_data" table, etc.

https://www.reddit.com/r/programming/comments/z9sm8/comment/...

by mburns

12/30/2025 at 10:19:39 PM

And the important lesson from that the k/v-like aspect of it. That the "schema" is horizontal (is that a thing?) and not column-based. But I actually only read it on their blog IIRC and never even got the full details - that there's still a third ID column. Thanks for the link.

by rplnt

12/30/2025 at 7:17:24 PM

That's pretty neat!

I did something similar. I build a tool[1] to import the Project Arctic Shift dumps[2] of reddit into sqlite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build a FTS5 index and import without WAL (--unsafe-mode), import of every reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.

SQLite offers a lot of cool json features that would let you store the raw json and operate on that, but I eschewed them in favor of parsing only once at load time. THat also lets me normalize the data a bit.

I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.

[1] https://github.com/Paul-E/Pushshift-Importer

[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...

by Paul-E

12/30/2025 at 7:57:10 PM

You could check out SQLite's auto_vacuum which reclaims space without rebuilding the entire db https://sqlite.org/pragma.html#pragma_auto_vacuum

by s_ting765

12/30/2025 at 11:38:29 PM

I haven't tested that, so I'm not sure if it would work. The import only inserts rows, it doesn't delete, so I don't think that is the cause of fragmentation. I suspect this line in the vacuum docs:

> The VACUUM command may change the ROWIDs of entries in any tables that do not have an explicit INTEGER PRIMARY KEY.

means SQLite does something to organize by rowid and that this is doing most of the work.

Reddit post/comment IDs are 1:1 with integers, though expressed in a different base that is more friendly to URLs. I map decoded post/comment IDs to INTEGER PRIMARY KEYs on their respective tables. I suspect the vacuum operation sorts the tables by their reddit post ID and something about this sorting improves tables scans, which in turn helps building indices quickly after standing up the DB.

by Paul-E

12/31/2025 at 8:39:47 AM

Holy cow, I didn't know getting reddit was that straightforward. I am building public readonly-SQL+vector databases optimized for exploring high-quality public commons with Claude Code (https://exopriors.com/scry), I so cannot wait until some funding source comes in and I can upgrade to a $1500/month Hetzner server and pay the ~$1k to embed all that.

by Xyra

12/30/2025 at 6:37:46 PM

That repo is throwing up a 404 for me.

Question - did you consider tradeoffs between duckdb (or other columnar stores) and SQLite?

by carbocation

12/30/2025 at 6:46:41 PM

No, I just went straight to sqlite. What is duckdb?

by keepamovin

12/30/2025 at 7:31:31 PM

One interesting feature of DuckDB is that it can run queries against HTTP ranges of a static file hosted via HTTPS, and there's an official WebAssembly build of it that can do that same trick.

So you can dump e.g. all of Hacker News in a single multi-GB Parquet file somewhere and build a client-side JavaScript application that can run queries against that without having to fetch the whole thing.

You can run searches on https://lil.law.harvard.edu/data-gov-archive/ and watch the network panel to see DuckDB in action.

by simonw

12/31/2025 at 4:23:30 AM

In that case, then using duckdb might be even more performant than using what we’re doing here.

It would be an interesting experiment to add the duckdb hackend

by keepamovin

12/30/2025 at 7:05:14 PM

DuckDB is an open-source column-oriented Relational Database Management System (RDBMS). It's designed to provide high performance on complex queries against large databases in embedded configuration.

It has transparent compression built-in and has support for natural language queries. https://buckenhofer.com/2025/11/agentic-ai-with-duckdb-and-s...

"DICT FSST (Dictionary FSST) represents a hybrid compression technique that combines the benefits of Dictionary Encoding with the string-level compression capabilities of FSST. This approach was implemented and integrated into DuckDB as part of ongoing efforts to optimize string storage and processing performance." https://homepages.cwi.nl/~boncz/msc/2025-YanLannaAlexandre.p...

by fsiefken

12/30/2025 at 7:00:46 PM

It is very similar to SQLite in that it can run in-process and store its data as a file.

It's different in that it is tailored to analytics, among other things storage is columnar, and it can run off some common data analytics file formats.

by cess11

12/30/2025 at 11:05:19 PM

"What is duckdb?"

duckdb is a 45M dynamically-linked binary (amd64)

sqlite3 1.7M static binary (amd64)

DuckDB is a 6yr-old project

SQLite is a 25yr-old project

by 1vuio0pswjnm7

12/31/2025 at 1:51:35 AM

I like SQLite

by 1vuio0pswjnm7

12/30/2025 at 10:29:18 PM

Maybe it got nuked by MS? The rest of their repo's are up.

by jacquesm

12/31/2025 at 6:49:36 AM

Hey jacquesm! No, I just forgot to make it public.

BUT I did try to push the entire 10GB of shards to GitHub (no LFS, no thanks, money), and after the 20 minutes compressing objects etc, "remote hang up unexpectedly"

To be expected I guess. I did not think GH Pages would be able to do this. So have been repeating:

  wrangler pages deploy docs --project-name static-news --commit-dirty=true

on changes and first time CF Pages user here, much impressed!

by keepamovin

12/31/2025 at 8:28:05 AM

Pretty neat project. I never thought you could do this in the first place, very much inspiring. I've made a little project that stores all of its data locally but still runs in the browser to protect against take downs and because I don't think you should store your precious data online more than you have to, eventually it all rots away. Your project takes this to the next level.

by jacquesm

12/31/2025 at 9:59:10 AM

Thanks, bud, that means a lot! Would like to see your versions of the data stored offline idea, it's very cool.

by keepamovin

12/31/2025 at 3:44:16 PM

pianojacq.com

It's super simple, really, far less impressive than what you've built there.

by jacquesm

1/2/2026 at 12:58:39 PM

That's really cool, man. The music notation is beautiful. I hit play but couldn't get it to progress past the first note. Maybe I need to plug in a midi keyboard? It would be cool if I could "play" with my ASCII keyboard.

Listen was nice. That's really cool, actually. I encourage you to do it.

by keepamovin

12/30/2025 at 6:50:05 PM

While I suspect DuckDB would compress better, given the ubiquity of SQLite, it seems a fine standard choice.

by 3eb7988a1663

12/31/2025 at 8:42:37 AM

the data is dominated by big unique TEXT columns, unsure how that can much compress better when grouped - but would be interesting to know

by peheje

12/31/2025 at 4:07:28 PM

I was thinking more the numeric columns which have pre-built compression mechanisms to handle incrementing columns or long runs of identical values. For sure less total data than the text, but my prior is that the two should perform equivalently on the text, so the better compression on numbers should let duckdb pull ahead.

I had to run a test for myself, and using sqlite2duckdb (no research, first search hit), and using randomly picked shard 1636, the sqlite.gz was 4.9MB, but the duckdb.gz was 3.7MB.

The uncompressed sizes favor sqlite, which does not make sense to me, so not sure if duckdb keeps around more statistics information. Uncompressed sqlite 12.9MB, duckdb 15.5MB

by 3eb7988a1663

12/30/2025 at 6:47:23 PM

Not the author here. I’m not sure about DuckDB, but SQLite allows you to simply use a file as a database and for archiving, it’s really helpful. One file, that’s it.

by linhns

12/30/2025 at 6:52:56 PM

DuckDB does as well. A super simplified explanation of duckdb is that it’s sqlite but columnar, and so is better for analytics of large datasets.

by cobolcomesback

12/30/2025 at 7:02:25 PM

The schema is this: items(id INTEGER PRIMARY KEY, type TEXT, time INTEGER, by TEXT, title TEXT, text TEXT, url TEXT

Doesn't scream columnar database to me.

by formerly_proven

12/30/2025 at 7:05:15 PM

At a glance, that is missing (at least) a `parent` or `parent_id` attribute which items in HN can have (and you kind of need if you want to render comments), see http://hn.algolia.com/api/v1/items/46436741

by embedding-shape

12/30/2025 at 7:22:03 PM

Edges are a separate table

by agolliver

12/31/2025 at 4:21:27 AM

i forgot to set repo to public. Fixed now

by keepamovin

12/30/2025 at 11:13:44 PM

I tried "select * from items limit 10" and it is slowly iterating through the shards without returning. I got up to 60 shards before I stopped. Selecting just one shard makes that query return instantly. As mentioned elsewhere I think duckdb can work faster by only reading the part of a parquet file it needs over http.

I was getting an error that the users and user_domains tables aren't available, but you just need to change the shard filter to the user stats shard.

by kristianp

12/30/2025 at 11:40:58 PM

That's odd. If it was a VFS, that's not what I'd expect would happen. Maybe it's not a VFS?

by ncruces

12/31/2025 at 6:46:55 AM

What is a VFS?

by keepamovin

12/31/2025 at 9:31:48 PM

https://sqlite.org/vfs.html

While designed for OS portability, you can use it to convince SQLite to read from something other than a file on disk.

by ncruces

12/31/2025 at 2:23:08 AM

Doesn't `LIMIT` just limit the amount of rows returned, rather than the amount read & processed?

by piperswe

12/31/2025 at 10:47:22 AM

That depends on the query. SQLite tries to use LIMIT to restrict the amount of reading that it does. It is often successful at that. But some queries, by their very nature, logically require reading the whole input in order to compute the correct answer, regardless of whether or not there is a LIMIT clause.

by SQLite

12/31/2025 at 2:46:01 AM

That's what it does, but if I'm not mistaken (at least in my experience with MariaDB) it'll also return immediately once it ran up to the limit and not try to process further rows. If you have an expensive subquery in the SELECT (...) AS `column_name`, it won't run that for every row before returning the first 10 (when using LIMIT 10) unless you ORDERed BY that column_name. Other components like the WHERE clause might also require that it reads every row before finding the ten matches. So mostly yes but not necessarily

by lucb1e

12/31/2025 at 5:57:04 AM

The limit clause isn't official/standard ansi sql, so it's up to the rdbms to implement. Your assumption is true for bigquery (infamously) but not true for things like snowflake, duckdb, etc.

by faxmeyourcode

12/30/2025 at 10:40:35 PM

Looks like the repo was taken down (404).

That's too bad, I'd like to see the inner-working with a subset of data, even with placeholders for the posts and comments.

by m-p-3

12/30/2025 at 10:54:52 PM

That was fast. I was looking into recent HN datasets, and they are impossible find.

by 3abiton

12/30/2025 at 10:58:26 PM

Complete and continuously updated: https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...

by xnx

12/31/2025 at 2:41:43 AM

If the last story on HN was at December 26, that is.

by gettingoverit

12/31/2025 at 6:51:18 AM

Continuously updated != instantly updated

by rolymath

12/31/2025 at 12:18:33 PM

Continuously would suggest to me that the data is never far out of date, and a few days might be considered far in this case.

Perhaps “regularly updated” would be less contentious wordage?

by dspillett

12/31/2025 at 3:17:31 AM

It's available on BigQuery and is updated frequently enough(daily I think).

by scsh

12/31/2025 at 3:52:49 AM

But why would they take it down?

by octoberfranklin

12/31/2025 at 4:24:02 AM

Sorry i just forgot to set it to public! It’s there now

by keepamovin

12/31/2025 at 10:34:54 AM

threw some heatmaps together of post volume and average score by day and time (15min intervals)

story volume (all time): https://ibb.co/pBTTRznP

average score (all time): https://ibb.co/KcvVjx8p

story volume (since 2020): https://ibb.co/cKC5d7Pp

average score (since 2020): https://ibb.co/WpN20kfh

by WadeGrimridge

12/31/2025 at 10:49:45 AM

added median too.

median score (all time): https://ibb.co/gZV5QVMG

median score (since 2020): https://ibb.co/Gfv8T7k8

by WadeGrimridge

12/31/2025 at 12:33:26 PM

This is fascinating, I love this! What do you create these heat maps with? Did you use the BigQuery data or download from the site?

by keepamovin

12/31/2025 at 1:27:56 PM

thanks! i downloaded them and used python to make these

by WadeGrimridge

12/31/2025 at 2:21:26 PM

So cool! Would it be impossible fro me to use them on the Archive stats page (https://hackerbook.dosaygo.com/?view=archive) ? If you're okay with that any links/credit line details?

Totally cool if not, just super interesting!

by keepamovin

12/31/2025 at 6:48:10 PM

yes, please, that'd be cool! you can link my site grimridge.net if you'd like. here are the numbers so you can plot graphs that fit the site's style:

mean (all time): https://katb.in/yutupojerux

mean (since 2020): https://katb.in/omoyibisava

median (all time): https://katb.in/kilopofivet

median (since 2020): https://katb.in/ukefetuyuhi

by WadeGrimridge

1/1/2026 at 1:27:03 PM

Added in the latest: https://hackerbook.dosaygo.com/?view=archive

by keepamovin

1/9/2026 at 12:31:09 PM

just saw this, thanks!

by WadeGrimridge

1/20/2026 at 7:16:19 AM

[dead]

by mergisi

1/1/2026 at 5:12:15 AM

Ok i will put them in the next update! Thank you

by keepamovin

12/31/2025 at 11:01:17 AM

can confirm, i'm usually very generous with upvotes on sundays at noon

by setnone

12/30/2025 at 6:49:30 PM

The query tab looks quite complex with all these content shards: https://hackerbook.dosaygo.com/?view=query

I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...

by zX41ZdbW

12/30/2025 at 7:03:57 PM

Does your database also runs offline/locally in the browser? Seems to be the reason for the large number of shards.

by embedding-shape

12/31/2025 at 4:21:49 AM

You can run it locally, but it is a client-server architecture, which means that something has to run behind the browser.

by zX41ZdbW

12/31/2025 at 7:16:58 AM

Similar in spirit to a recent tool I recently posted Show HN on, https://exopriors.com/scry. You can use Claude Code to SQL+vector query HackerNews and many other high quality public commons sites, exceptionally well-indexed and usually 5+ minute query timeout limits, so you can run seriously large research queries, to rapidly refine your worldview (particular because you can do easily to EXHAUSTIVE exploration).

by Xyra

12/31/2025 at 12:58:13 PM

This looks cool but can you make a "Google Search Box" page where I don't have to sign in but can use it? It's just a bit of friction and I feel unbothered to overcome it. It's not personal to you - it's just how I feel about anything that looks unknown and interesting I just want to try, not have to sign up. For now. You know?

by keepamovin

12/31/2025 at 11:55:01 AM

I like your concept of indexing high quality sources for RAG. For many queries we might not need the usual search engines.

by visarga

12/30/2025 at 6:44:35 PM

Is this updated regularly? 404 on GitHub as the other comment.

With all due respect it would be great if there is an official HN public dump available (and not requiring stuff such as BigQuery which is expensive).

by wslh

12/31/2025 at 3:25:22 AM

The BQ dataset is only ~17GB and the free tier of BQ lets you query 1TB per month. If you're not doing select * on every query you should be able to do a lot with that.

by scsh

12/30/2025 at 7:31:41 PM

What a reminder on how text is so much more efficient than video, its crazy! Could you imagine the same amount of knowledge (or dribble) but in video form? I wonder how large that would be.

by sieep

12/30/2025 at 10:24:55 PM

That's what's so sad about youtube. 20 minute videos to encode a hundred words of usable content to get you to click on a link. The inefficiency is just staggering.

by jacquesm

12/30/2025 at 10:54:23 PM

Youtube can be excellent for explanations. A picture's worth a thousand words, and you can fit a lot of decent pictures in a 20 minute video. The signal-to-noise can be high, of course.

by Rendello

12/31/2025 at 5:39:04 PM

Unfortunately even the videos that do contain helpful imagery are still dominated by huge sections of low entropy.

For example, one of the most useful applications of video over text is appliance or automotive repair, but the ideal format would be an article interspersed with short video sections, not a video with a talking head and some ~static shaky cam taking up most of the time as the individual drones on about mostly unrelated topics or unimportant details yet you can’t skip past it in case there is something actually pertinent covered in that time.

by ComputerGuru

12/31/2025 at 6:42:00 PM

Ay, there's the rub. Professional video makes tend to be pushed into making videos for a more general audience, and niche topics are left to first-timers who haven't developed video-making skills and (tend to) go on and on.

I've produced a few videos, and I was shocked at how difficult it was to be clear. I have the same problem with writing, but at least it's restricted in a way video making isn't. There's so many ways to make a video about something, and most of them are wrong!

by Rendello

12/30/2025 at 7:58:22 PM

Average high quality 1080p60 video has bitrate of 5Mbps, which is equivalent to 120k English words per second. With average English speech being 150wpm, we end up with text being 50 thousand times more space efficient.

Converting 22GB of uncompressed text into video essay lands us at ~1PB or 1000TB.

by ivanjermakov

12/31/2025 at 10:50:52 PM

Cheers to you for doing the math, I hope 2026 is excellent for you!

by sieep

12/31/2025 at 4:25:57 AM

Right? 20 years, probably 10s millions of human hours of interactions, and it’s only as much as a couple DVDs.

by keepamovin

12/30/2025 at 8:33:41 PM

one could use a video llm to generate the video, diagrams or the stills automatically based on the text. except when it's boardgames playthroughs or programming i just transcribe to text, summarise and read youtube video's.

by fsiefken

12/30/2025 at 9:02:47 PM

How do you read youtube videos? Very curious as I have been wanting to watch PDF's scroll by slowly on a large TV. I am interested in the workflow of getting a pdf/document into a scrolling video format. These days NotebookLM may be an option but I am curious if there is something custom. If I can get it into video form (mp4) then I can even deliver it via plex.

by deskamess

12/31/2025 at 1:16:48 AM

I use yt-dlp to download the transcript, and if it's not available i can get the audio file and run it through parakeet locally. Then I have the plain text, which could be read out loud (kind of defeating the purpose), but perhaps at triple speed with a computer voice that's still understandble at that speed. I could also summarize it with an llm. With pandoc or typst I can convert to single column or mult column pdf to print or watch on tv or my smart glasses. If I strip the vowels and make the font smaller I can fit more!

One could convert the Markdown/PDF to a very long image first with pandoc+wkhtml, then use ffmpeg to crop and move the viewport slowly over the image, this scrolls at 20 pixels per second for 30s - with the mpv player one could change speed dynamically through keys.

ffmpeg -loop 1 -i long_image.png -vf "crop=iw:ih/10:0:t*20" -t 30 -pix_fmt yuv420p output.mp4

Alternatively one could use a Rapid Serial Visual Presentation / Speedreading / Spritz technique to output to mp4 or use dedicated rsvp program where one can change speed.

One could also output to a braille 'screen'.

Scrolling mp4 text on the the TV or Laptop to read is a good idea for my mother and her macula degeneration, or perhaps I should make use of an easier to see/read magnification browser plugin tool.

by fsiefken

12/30/2025 at 8:47:26 PM

Can be nice to pull a raw transcript and have it formatted as HTML (formatting/punctuation fixes applied).

Best locally of course to avoid “I burned a lake for this?” guilt.

by Barbing

12/31/2025 at 1:12:44 AM

yes, yt-dlp can download the transcript, and if it's not available i can get the audio file and run it through parakeet locally.

by fsiefken

12/30/2025 at 7:12:13 PM

Wonder if you could turn this into a .zim file for offline browsing with an offline browser like Kiwix, etc. [0]

I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).

[0] https://kiwix.org/en/the-new-kiwix-library-is-available/

by abixb

12/31/2025 at 4:41:07 AM

Oh that's a cool idea. If you want to take a crack at writing the script, the repo is open!

by keepamovin

12/30/2025 at 8:42:06 PM

Oh this should TOTALLY be available to those who are scrolling through sources on the Kiwix app!

by Barbing

12/30/2025 at 9:57:55 PM

Site does not load on Firefox console error says 'Uncaught (in promise) TypeError: can't access property "wasm", sqlite3 is null'

Guess its common knowledge that SharedArrayBuffer (SQLite wasm) does not work with FF due to Cross-Origin Attacks (i just found out ;).

Once the initial chunk of data loads the rest load almost instantly on Chrome. Can you please fix the GitHub link (current 404) would like to peak at the code. Thank you!

by Sn0wCoder

12/31/2025 at 4:25:06 AM

Damn. Will try to fix for FF.

edit: I just tested with FF latest, seems to be working.

by keepamovin

12/31/2025 at 3:02:35 PM

Strange now the first few days load (getting a new error) 'Ignoring inability to install OPFS sqlite3_vfs: Cannot install OPFS: Missing SharedArrayBuffer and/or Atomics. The server must emit the COOP/COEP response headers to enable those. See https://sqlite.org/wasm/doc/trunk/persistence.md#coop-coep'

But when go back to the 26th none of the shards will load, error out.

Using Windows 11, FF 146.0.1

Since you tested it seems its just a me problem and thanks for fixing the GitHub link

by Sn0wCoder

12/31/2025 at 4:24:04 PM

No I've seen that error too, on Safari. I think it's related to the wasm being sent with wrong headers. CF pages _headers file should be ensuring correctness. Can you try busting your cache (or wait for a new Dec 29 Data dump version coming in a couple minutes), or from incognito to see if that fixes the issue? It's possible an earlier version had stale headers or sth. Idk.

by keepamovin

12/30/2025 at 11:20:44 PM

link no workie: https://github.com/DOSAYGO-STUDIO/HackerBook

by diyseguy

12/31/2025 at 5:18:18 AM

Fixed now. Forgot to make public. I also added a script:

  ./toool/download-site.mjs --help

To let you download the entire site over HTTPS so you don't need to "build it" by running the pipeline.

That way it's truly offline.

by keepamovin

12/30/2025 at 7:34:55 PM

The link seems to be down, was it taken down?

by tevon

12/30/2025 at 7:50:33 PM

Probably just forgot to make it public.

by scsh

1/4/2026 at 11:46:42 AM

I added a weekly update, so it should remain current with the actual site BigQuery does so. It's funny that it now archives itself: https://hackerbook.dosaygo.com/?view=item&id=46435308

by keepamovin

12/31/2025 at 11:43:03 AM

Absolutely love this!! I have been analyzing a lot of HN data lately [1] so I backtested my hypothesis on your dataset and ran some stats: https://philippdubach.com/standalone/hackerbook-stats/

[1]https://news.ycombinator.com/item?id=46434575

by 7777777phil

12/31/2025 at 12:02:51 PM

That is so cool! I love that this passion fun project inspired and helped your paper. So cool!

by keepamovin

12/30/2025 at 10:15:50 PM

Is there a public dump of the data anywhere that this is based upon, or have they scraped it themselves?

Such as DB might be entertaining to play with, and the threadedness of comments would be useful for beginners to practise efficient recursive queries (more so than the StackExchange dumps, for instance).

by dspillett

12/30/2025 at 10:27:10 PM

While not a dump per se, there is an API where you can get HN data programmatically, no scraping needed.

https://github.com/HackerNews/API

by thomasmarton

12/31/2025 at 4:27:53 AM

Yes, you can see the download HN bash script in the repository now that simply extract the data to your local machine from BigQuery and saves it as a series of gzip JSON files

by keepamovin

12/31/2025 at 12:16:25 PM

Ah, the repo was 404ing for me last time I checked (seems fine now) so I couldn't inspect that. I'll have a play later.

by dspillett

12/30/2025 at 6:51:04 PM

1 hour passed and it's already nuked?

Thank you btw

by yupyupyups

12/31/2025 at 1:45:18 AM

It's really a shame that comment scores are hidden forever. Would the admins consider publishing them after stories are old enough that voting is closed? It would be great to have them for archives and search indices and projects like this.

by modeless

12/31/2025 at 4:13:06 AM

I wrote to hn@ and asked for this as a feature request:

"1. Delayed Karma Display. I understand why comment karma was hidden. I don't see the harm in un-hiding karma after some time. If not 24 hours, then 72-168 hours. This would help me read through threads with 1300 comments."

This was last January. While I asked for a few more features, it is the only one that seems essential as HN grows with massive threads.

by pilingual

12/31/2025 at 4:26:30 AM

Fear not. I have a collaborative project designed to address this.

by keepamovin

12/31/2025 at 4:56:06 AM

They're referring to scores on individual COMMENTS - this information isn't available via the HN Firebase API.

The only way you could theoretically extract everyone's comment scores (at least the top level ones) would be like this if you're a complete madman:

1. Wait 48 hours so the article is effectively dead

2. Post a new comment using an account called ThePresident

3. Create a swarm of a thousand shill user accounts called Voter1, Voter2, etc.

4. Use a single account at a time and upvote ThePresident

5. Recheck the page to see if ThePresident has moved above a user(s) post

6. Record the score for that user and assign it to the tracked story's history

7. Repeat from (4)

by vunderba

12/31/2025 at 4:57:14 AM

I know that! I have a collaborative project to make it sort of available.

But the idea I have is not like that at all - it's much nicer on everyone's ethics. Stay tuned! :)

by keepamovin

12/31/2025 at 12:14:45 PM

Is it a thing that the design is almost unusable on a mobile phone? The tech making this possible is beyond cool, but it's just presented in such a brutal way for phone users, even though fixing it would be super simple.

by adamszakal

12/31/2025 at 12:18:44 PM

Really? Let me know how I can help. What would you like to see fixed?

by keepamovin

12/31/2025 at 6:40:44 PM

Just following the ordinary guidelines when doing responsive designs, like increasing the text size and sizes of buttons and inputs, so my fat fingers don't missklick every other try. HN has gotten better, but is still below average, hence why I thought it was some kind of aesthetic choice.

by adamszakal

1/1/2026 at 5:13:11 AM

I tried to keep to HN look. But I’m not a mobile design expert. I will give it a once over see if i can make it better for you

by keepamovin

1/5/2026 at 7:40:57 AM

You rock dude, hope it didn't come out as condescending! It's an awesome project.

by adamszakal

12/31/2025 at 1:41:17 AM

Did anyone get a copy of this before it was pulled? If GitHub is not keen, could it be uploaded to HuggingFace or some other service which hosts large assets?

I have always known I could scrape HN, but I would much rather take a neat little package.

by 3eb7988a1663

12/30/2025 at 9:40:17 PM

This is pretty neat! The calendar didn't work well for me. I could only seem to navigate by month. And when I selected the earliest day (after much tapping), nothing seemed to be updated.

Nonetheless, random access history is cool.

by spit2wind

12/31/2025 at 4:41:53 AM

Cna you let me know? I'm sure there's some weirdness lurking there and I want to smooth it out. Calendar is essential.

by keepamovin

12/31/2025 at 5:42:21 PM

Awesome work.

Minor bug/suggestion: right-aligned text inputs (eg the username input on the “me” page) aren’t ideal since they are often obscured by input helpers (autocomplete or form fill helper icons).

by ComputerGuru

12/31/2025 at 10:12:25 AM

Neat. I keep wanting to build something like this for GitHub audit logs, but at ~5 tb, probably a little much

by RyJones

12/30/2025 at 10:12:32 PM

Apparently the comment counts are only the top-level comments?

It would be nice for the thread pages to show a comment count.

by layer8

12/31/2025 at 4:28:44 AM

Yes, because comments in a thread can span shards. It’s just a bit too heavy to add comment counts of an entire thread. So I give a low bound ha ha

by keepamovin

12/31/2025 at 3:26:00 AM

Suddenly occurs to me that it would be neat to pair a small LLM (3-7B) with an HN dataset

by fouc

12/31/2025 at 3:48:07 AM

Does the SQLite version of this already exist somewhere? The github link on the footer of the page fails for me.

by codazoda

12/31/2025 at 1:40:39 PM

Nice. I wonder if there’s any way to quickly get a view for a whole year.

by rcarmo

12/31/2025 at 4:24:45 PM

You mean like stuff ranked per year?

Edit: Good idea! I implemented a "year" selector so all main views (front/show/ask/jobs) will be from that entire year rather than just a single day.

by keepamovin

12/30/2025 at 11:26:56 PM

Link appears broken

by joshcsimmons

12/30/2025 at 11:32:42 PM

confirmed - I wonder what happened?

by ra

12/30/2025 at 7:33:39 PM

This would be awesome as a cross platform app.

by sirjaz

12/31/2025 at 4:42:46 AM

Good idea. HN.exe

by keepamovin

12/30/2025 at 10:45:31 PM

How do I download it? That repo is a 404.

by KomoD

12/30/2025 at 9:19:38 PM

22gb for mostly text? tried loading the site, it's pretty slow. curious how the query performance is with this much data in sqlite

by dmarwicke

12/30/2025 at 11:12:30 PM

Beautiful !

2026 prayer: for all you AI junkies—please don’t pollute H/N with your dirty AI gaming.

Don’t bot posts, comments, or upvote/downvote just to maximize karma. Please.

We can’t identify anymore who’s a bot and who’s human. I just want to hang out with real humans here.

by solarized

12/31/2025 at 7:57:46 AM

Hahaha, now you can be prepared for the apocalypse when the internet disappears. ;)

by DenisDolya

12/30/2025 at 7:24:20 PM

[flagged]

by fao_

12/30/2025 at 8:20:59 PM

I doubt it. "hacker news" spelled lowercase? comma after "beauty"? missing "in" after "it's"? i doubt an LLM would make such syntax mistakes. it's just good writing, that's also possible these days.

by jesprenj

12/31/2025 at 4:16:01 PM

> it's just good writing, that's also possible these days.

As someone reskilling into being a writer, I really do not think that is "good writing".

by fao_

12/30/2025 at 7:39:52 PM

There's a thing in soccer at the moment where a tackle looks fine in realtime but when the video referee shows it to the onpitch referee, they show the impact in slo-mo over and over again and it always looks way worse.

I wonder if there's something like this going on here. I never thought it was LLM on first read, and I still don't, but when you take snippets and point at them it makes me think maybe they are

by walthamstow

12/30/2025 at 9:17:44 PM

Even if so, would it have mattered? The point is showing off the SQLite DB.

But it didn’t read LLM generated IMO.

by Insanity

12/30/2025 at 7:32:58 PM

Why do you say that?

by rantingdemon

12/30/2025 at 7:34:34 PM

Because anything that even slightly differs from the standard American phrasing of something must be "LLM generated" these days.

by sundarurfriend

12/30/2025 at 7:52:25 PM

Sometimes I want to write more creatively, but then worry I’ll be accused of being an LLM. So I dumb it down. Remove the colorful language. Conform.

by deadbabe

12/30/2025 at 11:58:57 PM

Fuck 'em.

Always write what you want, however you want to write it. If some reader somewhere decides to be judgemental because of — you know — an em dash or an X/Y comparison or a complement or some other thing that they think pins you down as being a bot, then that's entirely their own problem. Not yours.

They observe the reality that they deserve.

by ssl-3

12/31/2025 at 1:49:58 AM

You’re absolutely right. It’s not my problem, it’s their problem.

by deadbabe

12/30/2025 at 7:39:51 PM

With the em dashes I see you. But at this point idrc so long as it reads well. Everyone uses spell check…

by JavGull

12/30/2025 at 8:31:09 PM

I add em dashes to everything I write now, solely to throw people who look for them off. Lots of editors add them automatically when you have two sequential dashes between words — a common occurrence, like that one. And this is is Chrome on iOS doing it automatically.

Ooh, I used “sequential”, ooh, I used an em dash. ZOMG AI IS COMING FOR US ALL

by naikrovek

12/31/2025 at 1:40:28 AM

Anyone demonstrating above a high-school vocabulary/reading level is obviously a machine.

by 3eb7988a1663

12/30/2025 at 8:44:39 PM

Ya—in fact, globally replaced on iOS (sent from Safari)

Also for reference: “this shortcut can be toggled using the switch labeled 'Smart Punctuation' in General > Keyboard settings.”

by Barbing

12/31/2025 at 4:15:20 PM

I also use Em-Dashes, this is about how weird the thing is tonally

by fao_

12/31/2025 at 4:14:35 PM

It feels like a LLM doing it's usual "gushing out appreciations"

by fao_

12/30/2025 at 8:16:18 PM

> I'm really sorry to have to ask this, but this really feels like you had an LLM write it?

Ending a sentence with a question mark doesn’t automatically make your sentence a question. You didn’t ask anything. You stated an opinion and followed it with a question mark.

If you intended to ask if the text was written by AI, no, you don’t have to ask that.

I am so damn tired of the “that didn’t happen” and the “AI did that” people when there is zero evidence of either being true.

These people are the most exhausting people I have ever encountered in my entire life.

by naikrovek

12/30/2025 at 10:27:08 PM

You're right. Unfortunately they are also more and more often right.

by jacquesm

12/30/2025 at 5:37:00 PM

How much space is needed? ...for the data .... Im wondering if it would work on a tablet? ....

by asdefghyk

12/30/2025 at 5:47:23 PM

~9GB gzipped.

by keepamovin

12/31/2025 at 3:27:46 AM

FYI I did NOT see the size info in the title. Impossible to edit / delete my comment now ........

by asdefghyk

12/30/2025 at 10:34:13 PM

Alas, HN does not belong to us, and the existence of projects like this are subject to the whims of the legal owners of HN.

From the terms of use [0]:

"""

Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.

"""

[0] https://www.ycombinator.com/legal/#tou

by abetusk

12/31/2025 at 1:08:29 AM

But is this really a commercial use? There doesn’t seem to be any intention of monetising this so I guess it doesn’t as specify commercial?

by tom1337