I've done some testing of polars, duckdb, and datafusion. Anecdotally, these are my experiences:
DuckDB (last used maybe 7-8 months ago):
- Very nice for very fast local queries (against parquet files; I ignored their homegrown file format)
- Most pleasant cli
- Seems to have the best out of core experience
- As far as I can tell, seems to be closest to state of the art in terms of algorithms/overall design, though honestly everyone is within spitting distance of each other
- Spark api seems exciting
Datafusion (last used 1.5y ago):
- Most pleasant to build/extend on top of (in rust)
- Is to OLAP DBMSs what LLVM is to compilers (stole this quote from Andrew Lamb)
- Could be wrong, but in terms of core engineering discipline they are the most rigorous/thoughtful (no shade thrown to the other libraries, which are all awesome libraries/tools too)
- Seems to be the most foundational to many other tools (and is most ubiquitously embedded)
- Their python dataframe centric workflow isn't as nice as polars (this is rapidly improving afaict)
- Docs are lagging behind polars
- Very exciting future (ray datafusion, improvements to python bindings, ballista, datafusion-comet)
Polars (last used this week):
- The most pleasant api by far for a programmatic user
- Pretty good interop with python ecosystem
- Rust crate is a second class citizen
- Python is a first class citizen
- Probably the best for advanced ETL use cases
- Fastest library for querying hive partitioned parquet data in an object store
- Wide end-user adoption (less so as a query engine)
- Moves very fast (I do get more bugs/regressions in polars version to version, but on the flip side, they move fast to fix issues and release very often)
- Exciting distributed cloud solution coming (is proprietary though)
- New streaming engine based on morsel-driven parallelism (same architecture as duckdb afaict?) should greatly improve polars OOC capabilities
- Much nicer to test/compose/build re-usable queries/functions on top of than SQL-based ETL tools
- Error messages/debuggability/observability are still immature
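For anyone unfamiliar with the "morsel driven parallelism" mentioned in the polars list, here is a toy sketch of the idea as I understand it, in plain Python. The input is cut into small fixed-size chunks ("morsels"), workers pull the next morsel from a shared queue whenever they are free, and the per-worker partial results are merged at the end. This is only an illustration and not how polars or duckdb actually implement it. The names `morsel_driven_sum`, `morsel_size`, and `num_workers` are made up for this example.

```python
import queue
import threading


def morsel_driven_sum(values, morsel_size=4, num_workers=3):
    """Sum `values` by letting workers pull fixed-size morsels from a shared queue."""
    # Split the input into morsels and put them on a shared work queue.
    morsels = queue.Queue()
    for start in range(0, len(values), morsel_size):
        morsels.put(values[start:start + morsel_size])

    partials = []            # one partial sum per worker
    lock = threading.Lock()  # guards `partials`

    def worker():
        local = 0
        while True:
            try:
                # Grab the next available morsel; a fast worker simply takes more of them.
                morsel = morsels.get_nowait()
            except queue.Empty:
                break
            local += sum(morsel)
        with lock:
            partials.append(local)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge step: combine the per-worker partial aggregates.
    return sum(partials)


total = morsel_driven_sum(list(range(100)))
```

The point of the design is that work is handed out dynamically in small pieces rather than being partitioned up front, so a slow worker doesn't hold up the whole query. Real engines apply this to whole operator pipelines, not just a sum.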
All three are awesome tools. The OLAP space is really heating up.
Things I still see lacking in the OLAP end-user space are:
- Unified batch/streaming dataframe-centric workflows: nothing is truly high throughput/low latency/pleasant to use/mature/robust. I've only really seen arroyo and risingwave, and neither seems mature or usable yet.
- Nothing is quite at the robustness level of something like sqlite
- Despite native query engines, datalake implementations are mostly lagging behind their java equivalents (iceberg/delta)
Some questions for other users:
- I'm curious if anyone uses Ibis in prod. I found that it wasn't very usable as an end user.