alt.hn

2/26/2026 at 3:31:39 PM

Hardwood: A New Parser for Apache Parquet

https://www.morling.dev/blog/hardwood-new-parser-for-apache-parquet/

by rmoff

3/1/2026 at 7:30:56 PM

Cool! I definitely felt the pain of the current options when I added Parquet support to Planetiler to process Overture data. I ended up using parquet-floor to trim the dependencies, but it's a bit of a hacky approach. If there's a way to use the lower-level utilities from my own threads without Hardwood spawning its own, then I'll have to give it a shot.

by zylepe

3/1/2026 at 1:01:49 PM

This sounds great. parquet-java is extremely unpleasant to use: it has a massive fan-out of dependencies and an awkward API that exposes them, causing them to bleed into a user's code base. The Hadoop stuff is particularly annoying given the relatively poor quality (IMO) of the Hadoop code base and the amount of class-name sharing with built-in Java types (like File, FileSystem, etc.). And the performance of parquet-java is very poor compared to the libraries available in other languages.

by derriz

3/1/2026 at 2:17:58 PM

Thanks! The heavy dependency footprint of parquet-java was the main driver for kicking off this project. Hardwood doesn't have any mandatory dependencies; any libraries for the compression codecs in use can be added by the user as needed (most of them are single JARs with no further transitive dependencies). The same goes for log bindings (Hardwood uses the JDK's System.Logger abstraction).
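For readers who haven't come across it: System.Logger is the logging facade that has shipped with the JDK since Java 9, so a library can log without pulling in SLF4J or Log4j. A minimal sketch of the platform API (not Hardwood's code, just the JDK mechanism it relies on):

```java
public class SystemLoggerDemo {
    public static void main(String[] args) {
        // System.Logger lives in java.base (Java 9+): a library that logs
        // through it needs no third-party logging dependency at all.
        System.Logger logger = System.getLogger("hardwood.demo");

        // The actual backend is chosen at runtime via a LoggerFinder
        // service binding; with none on the classpath, the JDK falls
        // back to java.util.logging.
        logger.log(System.Logger.Level.DEBUG, "reading {0} row groups", 3);

        System.out.println(logger.getName());
    }
}
```

A user who wants the output routed to their framework of choice just drops the matching LoggerFinder binding on the classpath; the library code stays unchanged.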

by gunnarmorling

3/1/2026 at 9:56:51 AM

Respect for doing this. I recently implemented a Parquet reader in Swift using parquet-java as a reference, and it was by a long way the hardest bit of coding I've done. Your bit unpacking is interesting; is it faster than the 74 KLOC parquet-java bit unpacker?
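(For context on what's being unpacked: Parquet's RLE/bit-packed hybrid encoding stores runs of values at a fixed bit width, packed from the least significant bit of each byte. A minimal, unoptimized unpacker — my own sketch, not Hardwood's or parquet-java's code — looks like this:)

```java
public class BitUnpack {
    // Unpack `count` values of `bitWidth` bits each, packed LSB-first
    // as in Parquet's RLE/bit-packed hybrid encoding.
    static int[] unpack(byte[] data, int bitWidth, int count) {
        int[] out = new int[count];
        long buffer = 0;       // bits read from `data` but not yet consumed
        int bitsInBuffer = 0;
        int byteIdx = 0;
        int mask = (1 << bitWidth) - 1;
        for (int i = 0; i < count; i++) {
            // Refill the buffer until it holds at least one full value.
            while (bitsInBuffer < bitWidth) {
                buffer |= (long) (data[byteIdx++] & 0xFF) << bitsInBuffer;
                bitsInBuffer += 8;
            }
            out[i] = (int) (buffer & mask);
            buffer >>>= bitWidth;
            bitsInBuffer -= bitWidth;
        }
        return out;
    }

    public static void main(String[] args) {
        // The values 0..7 at bit width 3, packed into 3 bytes
        // (0x88, 0xC6, 0xFA) in Parquet's LSB-first layout.
        byte[] packed = { (byte) 0x88, (byte) 0xC6, (byte) 0xFA };
        StringBuilder sb = new StringBuilder();
        for (int v : unpack(packed, 3, 8)) sb.append(v).append(' ');
        System.out.println(sb.toString().trim()); // 0 1 2 3 4 5 6 7
    }
}
```

The generated code in parquet-java exists to avoid exactly this per-value loop: it emits a specialized unpacker per bit width with the shifts and masks hard-coded.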

by willtemperley

3/1/2026 at 5:27:35 PM

Love to see Gunnar continuing to produce great stuff!

by jrjeksjd8d

3/1/2026 at 7:08:00 PM

Excited to see this. Have some upcoming work projects that involve Parquet and Java. Fingers crossed I can get approval to use Java 21.

by coredog64

3/1/2026 at 7:47:43 AM

Great! I will give it a try. I found that using DuckDB to select from the Parquet files and the Apache Arrow API to read the results is also a very fast method.

by uwemaurer

3/1/2026 at 2:15:30 PM

Yes, absolutely, DuckDB is great. But I think there's a space and need for a pure Java library.

by gunnarmorling

3/1/2026 at 1:05:57 PM

Sounds great. No benchmarks?

by xnx

3/1/2026 at 2:13:38 PM

We have some first benchmarks here: https://github.com/hardwood-hq/hardwood/blob/main/performanc....

From the post:

> As an example, the values of three out of 20 columns of the NYC taxi ride data set (a subset of 119 files overall, ~9.2 GB total, ~650M rows) can be summed up in ~2.7 sec using the row reader API with indexed access on my MacBook Pro M3 Max with 16 CPU cores. With the column reader API, the same task takes ~1.2 sec.

In my measurements, this is significantly faster than parquet-java for the same task (which is not surprising, as Hardwood is multi-threaded), but I want to be sure I am setting up and configuring parquet-java correctly before publishing any comparisons. The test above is also hooked up to run parquet-java (and there's a set-up for PyArrow, too), so you could run it yourself on your machine if you wanted to.

So far, we've spent most of our time optimizing for flat (non-nested) data sets that are fully parsed (either all columns, or with projections), and I think it's faring really well for those. There's no support for predicate push-down yet, so right now Hardwood isn't optimal for use cases with high query selectivity; that's the next thing on the roadmap, though.

by gunnarmorling