We have some initial benchmarks here: https://github.com/hardwood-hq/hardwood/blob/main/performanc.... From the post:
> As an example, the values of three out of 20 columns of the NYC taxi ride data set (a subset of 119 files overall, ~9.2 GB total, ~650M rows) can be summed up in ~2.7 sec using the row reader API with indexed access on my MacBook Pro M3 Max with 16 CPU cores. With the column reader API, the same task takes ~1.2 sec.
In my measurements, this is significantly faster than parquet-java for the same task (which is not surprising, as Hardwood is multi-threaded), but I want to be sure I am setting up and configuring parquet-java correctly before publishing any comparisons. The test above is also hooked up to run parquet-java (and there's a setup for PyArrow, too), so you can run it yourself on your machine if you want.
So far, we've spent most of our time optimizing for flat (non-nested) data sets which are fully parsed (either all columns, or with projections), and I think it fares really well there. There's no support for predicate push-down yet, so right now Hardwood isn't optimal for use cases with high query selectivity; that's the next thing on the roadmap, though.