Apache DataFusion

1/16/2025 at 2:28:11 AM

Of interest and relevance: This past semester, Andy Pavlo's DB seminar at CMU explored a number of projects under the heading 'Database Building Blocks', starting with DataFusion and several of its applications. Take a listen!

https://www.youtube.com/playlist?list=PLSE8ODhjZXjZc2AdXq_Lc...

by kristjansson

1/16/2025 at 9:03:10 AM

There's also Andrew Lamb's series - https://www.youtube.com/watch?v=NVKujPxwSBA

by GardenLetter27

1/16/2025 at 2:58:30 AM

The lectures / seminars that CMU's DB lab produces are always a treat!

by PartiallyTyped

1/16/2025 at 12:31:05 AM

There is a Cambrian explosion in data processing engines: DataFusion, Polars, DuckDB, Feldera, Pathway, and more than I can remember.

It reminds me of 15 years ago, when there was JDBC/ODBC for data. Then when data volumes increased, specialized databases became viable - graph, document, JSON, key-value, etc.

I don't see SQL and Spark hammers keeping their ETL monopolies for much longer.

by jamesblonde

1/16/2025 at 4:35:23 AM

Spark for sure I view with suspicion and avoid as much as possible at work.

SQL, though, is going the distance. Feldera, for example, is SQL-based stream processing and uses DataFusion under the hood for some data wrangling. DuckDB is also very SQL.

I have my quibbles with SQL as a language but I would prefer SQL embedded in $myLanguage to needing to use Python or (shudder) Scala to screw around with data.

by jitl

1/16/2025 at 6:41:22 AM

Absolutely agree. Spark is the same garbage as Hadoop but in-memory.

by hipadev23

1/16/2025 at 9:56:49 AM

Just out of curiosity, why do you say that Spark is "in-memory"? I see a lot of people claiming that, including several that I've interviewed in the past few years, but that's not very accurate (at least in the default case). Spark SQL execution uses a bog standard volcano-ish iterator model (with a pretty shitty codegen operator merging part) built on top of their RDD engine. The exchange (shuffle) is disk based by default (both for SQL queries and lower level RDD code); unless you mount the shuffle directory in a ramdisk, I would say that Spark is disk based. You can try it out in the Spark shell:

  spark.sql("SELECT explode(sequence(0, 10000))").write.parquet("sample_data")
  spark.read.parquet("sample_data").groupBy($"col").count().count()

After running the code, you should see a /tmp/blockmgr-{uuid} directory that holds the exchange data.
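For readers unfamiliar with the term, a Volcano-style iterator model can be sketched in a few lines of Python. This is only a toy illustration, not Spark's implementation: the operator names (`scan`, `filter_rows`, `count_rows`) are made up here, and the point is simply that each operator pulls rows one at a time from its child.

```python
from typing import Callable, Iterable, Iterator


def scan(rows: Iterable[dict]) -> Iterator[dict]:
    """Leaf operator: yields rows one at a time from a source."""
    for row in rows:
        yield row


def filter_rows(child: Iterator[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Pulls from its child and passes on only rows matching the predicate."""
    for row in child:
        if predicate(row):
            yield row


def count_rows(child: Iterator[dict]) -> int:
    """Root operator: drains its child and returns the number of rows seen."""
    return sum(1 for _ in child)


# Plan: count(filter(scan(data))). Nothing executes until the root pulls.
data = [{"col": i} for i in range(10)]
plan = filter_rows(scan(data), lambda row: row["col"] % 2 == 0)
result = count_rows(plan)
print(result)
```

Because every row passes through one function call per operator, real engines usually add batching or code generation on top of this basic shape.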

by ignoreusernames

1/16/2025 at 10:24:01 AM

Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations, which used to be a point of comparison to MapReduce specifically. Not ground-breaking nowadays but when I was doing this stuff 10+ years ago we didn't have all the open-source horizontally scalable SQL databases you get now - Oracle could do it and RedShift was new hotness.

by bdndndndbve

1/16/2025 at 10:57:32 AM

> Spark is "in-memory" in the sense that it isn't forced to spill results to disk between operations

I see your point, but that's only true within a single stage. Any operator that requires partitioning (groupBys and joins for example) requires writing to disk
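To make the stage-boundary point concrete, here is a minimal sketch in plain Python of a disk-based shuffle for a group-by count. It is a toy model, not Spark code: the function names (`map_side_write`, `reduce_side_count`), the JSON file layout, and the temporary directory standing in for the shuffle directory are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import Counter


def map_side_write(task_id, keys, num_partitions, shuffle_dir):
    """Hash-partition keys and write one file per reduce partition."""
    buckets = [[] for _ in range(num_partitions)]
    for key in keys:
        buckets[hash(key) % num_partitions].append(key)
    for part, bucket in enumerate(buckets):
        path = os.path.join(shuffle_dir, f"map{task_id}_part{part}.json")
        with open(path, "w") as f:
            json.dump(bucket, f)


def reduce_side_count(part, num_map_tasks, shuffle_dir):
    """Read this partition's file from every map task and count per key."""
    counts = Counter()
    for task_id in range(num_map_tasks):
        path = os.path.join(shuffle_dir, f"map{task_id}_part{part}.json")
        with open(path) as f:
            counts.update(json.load(f))
    return counts


map_inputs = [[1, 2, 3, 1], [2, 3, 3, 4]]  # the inputs of two map tasks
num_partitions = 2

with tempfile.TemporaryDirectory() as shuffle_dir:
    # Stage 1: every map task writes its partitioned output to disk.
    for task_id, keys in enumerate(map_inputs):
        map_side_write(task_id, keys, num_partitions, shuffle_dir)
    # Stage 2: each reduce task reads back only its own partition.
    totals = Counter()
    for part in range(num_partitions):
        totals.update(reduce_side_count(part, len(map_inputs), shuffle_dir))

print(dict(totals))
```

Within a stage, rows can stay in memory; it is the repartitioning between the two stages that forces the write and re-read in this sketch.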

> [...] which used to be a point of comparison to MapReduce specifically.

So each mapper in Hadoop wrote partial results to disk? LOL, this was way worse than I remember, then. It's been a long time since I've dealt with Hadoop.

> Not ground-breaking nowadays but when I was doing this stuff 10+ years

I would say that it wouldn't have been ground breaking 20 years ago. I feel like Hadoop's influence held up our entire field for years. Most of the stuff that Arrow made mainstream and is being used by a bunch of engines mentioned in this thread has been known for a long time. It's like, as a community, we had blindfolds on. Sorry about the rant, but I'm glad the Hadoop fog is finally dissipating.

by ignoreusernames

1/17/2025 at 8:09:28 AM

Because that was the central point in the original whitepaper [1]: Hadoop is slow because it’s disk-only where Spark uses memory and caching to speed things up. I understand Spark isn’t 100% in-memory the way say Redis is, but it was still the major selling point vs. Hadoop.

[1] https://people.csail.mit.edu/matei/papers/2010/hotcloud_spar...

by hipadev23

1/16/2025 at 4:04:58 PM

Apache DataFusion Comet is a high-performance accelerator for Apache Spark, built on top of the powerful Apache DataFusion query engine: https://datafusion.apache.org/comet/user-guide/overview.html

by 62951413

1/16/2025 at 11:56:00 AM

As someone that comfortably ignored the NoSQL hype, I am not worried.

by pjmlp

1/16/2025 at 3:11:35 PM

I don't think SQL is going anywhere. There might be abstractions that use these engines, but you'll write SQL (à la dbt) before people get used to 10 APIs for the same thing.

What Spark has going for it is its ecosystem. Things like Delta and Iceberg are being written for Spark first. Look at PyIceberg for example

by francocalvo

1/16/2025 at 8:18:37 AM

Did you really put SQL and Spark in the same basket?

by jurgenaut23

1/13/2025 at 12:53:51 AM

I feel like I'm not the target audience for this. When I have large data, then I directly write SQL queries and run them against the database. It's impossible to improve performance when you have to go out to the DB anyway; might as well have it run the query too. Certainly the server ops and db admins have loads more money to spend on making the DB fast compared with my anti-virus laden corporate laptop.

When I have small data that fits on my laptop, Pandas is good enough.

Maybe 10% of the time I have stuff that's annoyingly slow to run with Pandas; then I might choose a different library, but needing this is rare. Even then, of that 10% you can solve 9% of that by dropping down to numpy and picking a better algorithm...

by krapht

1/16/2025 at 4:26:24 AM

Your large db doesn’t sound very large. If I want to run a query that requires visiting every row of my biggest table, I will need to run the query a total of 480 times across 96 different Postgres databases. Just `select id from block` will take days to weeks.

But, I can visit most rows in that dataset in about 4 hours if I use an OLAP data warehouse thing, the kind of thing you build on top of DataFusion.

by jitl

1/15/2025 at 11:54:34 PM

You’re right it isn’t for you.

It’s largely for companies who can’t put everything in a single database because (a) they don’t control the source schema e.g. it’s a daily export from a SaaS app, (b) the ROI is not high enough to do so and (c) it’s not in a relational format e.g. JSON, Logs, Telemetry etc.

And with the trend toward SaaS apps it’s a situation that is becoming more common.

by threeseed

1/16/2025 at 8:43:02 AM

Or when the data is massive - so even BigQuery would be crazy expensive.

by GardenLetter27

1/13/2025 at 2:09:02 AM

I agree. The main reason I shared it is because I find it interesting as a library. I actually use it behind the scenes to build https://telemetry.sh. Essentially, I ingest JSON, infer a Parquet schema, store the data in S3 with a lookaside cache on disk, and then use DataFusion for querying.

by thebuilderjr

1/16/2025 at 3:09:11 AM

How do you infer your Parquet schemas?

by Hugsun

1/16/2025 at 4:12:30 AM

You infer the types of the source data.

For example you can go through say 1% of your data and for each column see if you can coerce all of the values to a float, int, date, string etc. And then from there you can set the Parquet schema with proper types.
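The sampling approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code behind any product mentioned in the thread: values are assumed to arrive as strings, the candidate type list and the name `infer_schema` are made up here, and mapping the resulting type names to actual Parquet types is left out.

```python
import random
from datetime import date


def _can_parse(parser, value):
    """Return True if `parser` accepts `value` without raising."""
    try:
        parser(value)
        return True
    except (ValueError, TypeError):
        return False


# Candidate types, tried from narrowest to widest; "string" is the fallback.
CANDIDATES = [
    ("int", int),
    ("float", float),
    ("date", date.fromisoformat),
]


def infer_schema(rows, sample_fraction=0.01, seed=0):
    """Infer a type name per column from a random sample of the rows."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)
    columns = sorted({name for row in sample for name in row})
    schema = {}
    for name in columns:
        values = [row[name] for row in sample if row.get(name) is not None]
        schema[name] = "string"
        for type_name, parser in CANDIDATES:
            # A type is chosen only if every sampled value coerces to it.
            if values and all(_can_parse(parser, v) for v in values):
                schema[name] = type_name
                break
    return schema


rows = [
    {"id": str(i), "price": f"{i}.5", "day": "2025-01-16", "note": f"row {i}"}
    for i in range(200)
]
schema = infer_schema(rows)
print(schema)
```

One caveat of sampling: a value outside the sample can still fail the inferred type, so a real pipeline would need a fallback (for example, widening the column to string) when that happens.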

by threeseed

1/16/2025 at 7:24:39 AM

> It's impossible to improve performance when you have to go out to the DB anyway;

That's not right. There are many queries that run far faster in duckdb/datafusion than (say) postgres, even with the overhead of pulling whole large tables prior to running the query. (Or use like pg_duckdb).

For certain types of queries these engines can be 100x faster.

More here: https://postgres.fm/episodes/pg_duckdb

by RobinL

1/16/2025 at 4:02:50 AM

> When I have large data, then I directly write SQL queries and run them against the database

What database is that? For example, PgSQL will be XX-XXX times slower on OLAP queries than duckdb/polars/datafusion, for various reasons.

by riku_iki

1/16/2025 at 3:32:07 AM

Maybe your data is stored in a multi-PB pile of HDF5.

by rch

1/16/2025 at 1:55:49 AM

Why would this be useful over DuckDb? (earnest question)

by netcraft

1/16/2025 at 2:26:31 AM

They’re similar, but DuckDb is more of a batteries-included database whereas DataFusion is an embeddable query engine. You can use DuckDb in embedded-ish scenarios, but it’s not primarily targeting that use case. To put it another way, DataFusion is sometimes described as “the LLVM of databases.”

Another difference is that DuckDb is written in C++ whereas DataFusion is in Rust, so all the usual memory-safety and performance arguments apply. In fact DataFusion has recently overtaken DuckDb in Clickbench results after a community push last year to optimize its performance.

by chatmasta

1/16/2025 at 4:36:55 AM

We tried both about 8 months ago. At the time, DuckDB’s Node driver leaked memory and segfaulted, and DataFusion was missing some features we wanted. But they are both improving rapidly.

by jitl

1/16/2025 at 5:49:26 AM

> DataFusion has recently overtaken DuckDb in Clickbench results after a community push last year

Really? I don't see it near the top.

[CH benchmarks](https://benchmark.clickhouse.com/#eyjzexn0zw0ionsiqwxsb3leqi...)

by geysersam

1/16/2025 at 5:24:21 PM

Specifically, DataFusion is faster when querying parquet directly.

Most of the leaderboard of ClickBench is for database specific file formats (that you first have to load the data into)

by alamb

1/16/2025 at 8:43:51 AM

You might need to adjust the filters to do an apples-to-apples comparison.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQi...

by kalendos

1/16/2025 at 9:03:32 PM

It's not clear why someone would need to give up on the native DuckDB format if it is much faster.

by riku_iki

1/16/2025 at 9:57:42 PM

Because it means you need to keep another copy of your data in a special format just for DuckDb. The point of Parquet is that it’s an open format queryable by multiple tools. You don’t need to wait to load every table into a new format, you don’t need to retain multiple copies, and you don’t need to keep them in sync.

If DuckDb is the only query engine in your analytics stack, then it makes sense to use its specialized format. But that’s not the typical Lakehouse use case.

by chatmasta

1/16/2025 at 10:07:21 PM

> But that’s not the typical Lakehouse use case.

That benchmark is also not a typical lakehouse use case, since all data is hosted locally, so it doesn't test a significant component of the stack.

by riku_iki

1/16/2025 at 10:16:37 PM

Yeah, that’s one of many issues with Clickbench. It’s also one table so it can’t test joins.

TPC-H is okay but not Lakehouse specific. I’m not aware of any benchmarks that specifically test performance of engines under common setups like external storage or scalable compute. It would be hard to design one that’s easily reproducible. (And in fairness to Clickbench, it’s intentionally simple for that exact reason - to generate a baseline score for any query engine that can query tabular data).

by chatmasta

1/16/2025 at 5:23:24 PM

I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user defined functions (which are quite easy to write in DataFusion and are very fast), but things like:

- custom file formats (e.g. Spiral or Lance)

- custom query languages / SQL dialects

- custom catalogs (e.g. other than a local file or prebuilt DuckDB connectors)

- custom indexes (read only parts of Parquet files based on extra information you store)

- etc.
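As an illustration of the "custom indexes" idea, here is a minimal sketch in plain Python of pruning with per-row-group min/max statistics kept in a side index. It does not use DataFusion's API: the `RowGroupStats` record and `row_groups_to_read` helper are hypothetical names, and the statistics are hard-coded rather than read from real Parquet files.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical side-index entry: value range of one column in one row group."""
    row_group: int
    min_value: int
    max_value: int


def row_groups_to_read(index, lo, hi):
    """Return the row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.row_group for s in index if s.max_value >= lo and s.min_value <= hi]


# Extra information stored alongside the file: one entry per row group.
index = [
    RowGroupStats(row_group=0, min_value=0, max_value=99),
    RowGroupStats(row_group=1, min_value=100, max_value=199),
    RowGroupStats(row_group=2, min_value=200, max_value=299),
]

# A filter like "value BETWEEN 150 AND 220" only needs row groups 1 and 2.
selected = row_groups_to_read(index, 150, 220)
print(selected)
```

The engine then decodes only the selected row groups, which is where the savings come from when the data is sorted or clustered on the indexed column.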

If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat

Disclaimer: I am the PMC chair of DataFusion

There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html

by alamb

1/13/2025 at 2:53:35 AM

How does this compare/contrast to polars? Seems pretty similar, anybody tried both?

by bionhoward

1/13/2025 at 3:27:42 AM

DataFusion and Polars are like two sides of the same Rust coin: DataFusion is built for distributed, SQL-based analytics at scale, serving as the backbone for data systems and enabling complex query execution across clusters. Polars, on the other hand, is laser-focused on blazing-fast, single-node data manipulation, offering a Python-like DataFrame API that feels intuitive for exploratory analysis and in-memory processing.

by thebuilderjr

1/13/2025 at 3:39:51 AM

And the thing is - single node can still scale ridiculously high without the orchestration overheads of distributed stuff.

You can do dual AMD 192 core CPUs (384 cores / 768 threads) with 9 TB of memory and a 24 disk SSD array in a 2U box.

by donor20

1/16/2025 at 3:23:45 PM

The vast majority of businesses will never need more than single node architecture. Hardware advances are continually increasing that percentage.

SPARK and its modern counterpart Databricks are essentially obsolete for these organizations. Whatever justification they may have had in the past is no longer true.

I’ve recently closed down several in house SPARK clusters and replaced them with single nodes.

In addition to the simplicity of the design and reduction in cost there was a massive increase in performance. I expect this will become more common in the future; leaving distributed architecture for a small and increasingly niche group.

by spratzt

1/14/2025 at 4:42:44 AM

Exactly, DataFusion implies the batteries-included Apache big data ecosystem. Polars is chasing the Python Pandas crowd and uses Python syntax, handy if you're already comfortable with IPython.

by elasticventures

1/16/2025 at 1:43:49 AM

Can't you use DataFusion single node/without any Apache ecosystem stuff? They have a Python library and DataFusion is "just" a query engine. (If anything, I'd call Pandas the batteries included option...)

I think the difference is more that DataFusion is built as a library so you can plug it into the product you're building (e.g. Comet, which plugs it into Spark, or pg_lakehouse, which plugs it into Postgres). Polars could be used that way, but it's also a functional package you can pip install and use as a Pandas alternative right now.

by lidavidm

1/16/2025 at 6:10:53 AM

"pg_analytics (formerly named pg_lakehouse) puts DuckDB inside Postgres" https://github.com/paradedb/pg_analytics

by Epa095

1/16/2025 at 6:50:52 AM

It used to use DataFusion. https://www.paradedb.com/blog/iceberg_lakehouse

by lidavidm

1/16/2025 at 7:41:08 PM

That's true. We have some more ideas for DataFusion in the works, though... Stay tuned!

by philippemnoel

1/16/2025 at 3:58:52 PM

When I started on my path to get my application off of Spark in late 2023, I started with Polars because it seemed to have more community velocity and seemed more approachable. Unfortunately, at least at that time, the lazy evaluation Rust API was very much a WIP and didn't work for my use case. Switching to DataFusion enabled me to port/rewrite my application into Rust and drastically improve its performance.

by Omega359

1/16/2025 at 11:54:37 AM

Highly recommend this video for a deeper dive. This is an actual example in practice: https://www.youtube.com/watch?v=VLAvZw0ZEwI&list=PLSE8ODhjZX... Enjoy!

by pickinrust

1/16/2025 at 10:23:33 PM

I've done some testing of polars, duckdb, and datafusion.

Anecdotally, these are my experiences:

DuckDB (last used maybe 7-8 months ago):

- Very nice for very fast local queries (against parquet files, i ignored their homegrown file format)

- Most pleasant cli

- Seems to have the best out of core experience

- As far as I can tell, seems to be closest to state of the art in terms of algorithms/overall design, though honestly everyone is within spitting distance of each other

- Spark api seems exciting

Datafusion (last used 1.5y ago):

- Most pleasant to build/extend on top of (in rust)

- Is to OLAP DBMS's what LLVM is to compilers (stole this quote off Andrew Lamb)

- Could be wrong, but in terms of core engineering discipline they are the most rigorous/thoughtful (no shade thrown to the other libraries, which are all awesome libraries/tools too)

- Seems to be the most foundational to many other tools (and is most ubiquitously embedded)

- Their python dataframe centric workflow isn't as nice as polars (this is rapidly improving afaict)

- Docs are lagging behind polars

- Very exciting future (ray datafusion, improvements to python bindings, ballista, datafusion-comet)

Polars (last used this week):

- The most pleasant api by far for a programmatic user

- Pretty good interop with python ecosystem

- Rust crate is a second class citizen

- Python is a first class citizen

- Probably the best for advanced ETL use cases

- Fastest library for querying hive partitioned parquet data in an object store

- Wide end-user adoption (less so as a query engine)

- Moves very fast (I do get more bugs/regressions in polars version to version, but on the flip side, they move fast to fix issues and release very often)

- Exciting distributed cloud solution coming (is proprietary though)

- New streaming engine based off morsel driven parallelism (same architecture as duckdb afaict?) should greatly improve polars OOC capabilities

- Much nicer to test/compose/build re-usable queries/functions on top of than SQL based ETL tools

- Error messages/debuggability/observability are still immature
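For context on the morsel-driven parallelism mentioned in the list above, here is a rough sketch of the general idea in plain Python: the input is cut into small chunks ("morsels") and worker threads pull the next morsel from a shared queue whenever they are free. This is not how Polars or DuckDB implement it; the function name and parameters are made up, and CPython threads will not give a real speedup for this CPU-bound sum.

```python
import queue
import threading


def run_morsel_driven(values, morsel_size=1000, num_workers=4):
    """Sum `values` by letting workers pull small morsels from a shared queue."""
    morsels = queue.Queue()
    for start in range(0, len(values), morsel_size):
        morsels.put(values[start:start + morsel_size])

    partials = []
    lock = threading.Lock()

    def worker():
        local_sum = 0
        while True:
            try:
                # Work is claimed dynamically, so fast workers take more morsels.
                morsel = morsels.get_nowait()
            except queue.Empty:
                break
            local_sum += sum(morsel)
        with lock:
            partials.append(local_sum)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Combine the per-worker partial aggregates into the final result.
    return sum(partials)


total = run_morsel_driven(list(range(10_000)))
print(total)
```

Because only a handful of morsels are in flight at a time, the same scheduling shape also suits inputs that do not fit in memory, which is presumably why it is relevant to out-of-core execution.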

All three are awesome tools. The OLAP space is really heating up.

Things I still see lacking in the OLAP end-user space are:

- Unified batch/streaming dataframe centric workflows; nothing is truly high throughput/low latency/pleasant to use/mature/robust. I've only really seen arroyo and risingwave, and neither seems very mature or usable yet.

- Nothing is quite at the robustness level of something like sqlite

- Despite native query engines, datalake implementations are mostly lagging behind their java equivalents (iceberg/delta)

Some questions for other users:

- I'm curious if anyone uses Ibis in prod, I found that it wasn't very usable as an end user

by theLiminator