alt.hn

3/22/2026 at 9:55:23 AM

Data Manipulation in Clojure Compared to R and Python

https://codewithkira.com/2024-07-18-tablecloth-dplyr-pandas-polars.html

by tosh

3/24/2026 at 6:55:31 PM

I’ve built many different kinds of software (backend, frontend, 3D games, cli tools, code editor, and more) with Clojure and have been using it for over a decade now.

I can confidently say that, among the list I mentioned, it’s the best for data manipulation/transformation. Thanks to the author for presenting it clearly and showing how the libraries and code look across different languages, all of which do a great job.

But Clojure has its own special place (maybe in my heart as well :). I think Clojure should be used more in the data science space. Thanks to the JVM, it can be very performant (I’m looking at you, Python).

by ertucetin

3/24/2026 at 11:13:03 PM

There was XLISP-STAT before R, but the scientists have spoken. They don't like the parentheses.

by hatmatrix

3/24/2026 at 10:54:37 PM

Seems like it's going to be a tough sell to get people to want to write

    (tc/select-rows ds #(> (% "year") 2008))

instead of

    filter(ds, year > 2008)

They seem to ignore the existence of Spark, so even if you specifically want to use the JVM, this feels clearer and simpler:

    ds.filter(r => r.year > 2008)

by zmmmmm

3/25/2026 at 2:41:30 AM

You're right, that is longer! I get why though; `filter` is a clojure.core function name people don't necessarily feel comfortable shadowing, and the Clojure and Spark versions make it clear what's a symbol in local scope versus a field in the dataset. I don't think it'd be hard to make a little wrapper for this sort of thing though! Here's an example which turns any symbols not in local scope into field lookups on an implicit row variable.

    (require '[clojure.walk :refer [postwalk]])

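    (defmacro filter
      [ds & anaphoric-pred]
      (let [row-name (gensym 'row)
            ;; Treat a symbol as a field name unless it is a special form, a
            ;; local at the call site (&env), or resolvable in this namespace.
            field?   (fn [form]
                       (and (symbol? form)
                            (not (special-symbol? form))
                            (not (contains? &env form))
                            (nil? (resolve form))))
            pred     (postwalk (fn [form]
                                 (if (field? form)
                                   `(get ~row-name ~(str form))
                                   form))
                               anaphoric-pred)]
        ;; ~ds splices in the caller's dataset expression.
        `(tc/select-rows ~ds (fn [~row-name] ~@pred))))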

Now you can write

    (filter ds (> year 2008))

And it'll expand to the tc form (namespace qualifiers omitted for readability):

    (pprint (macroexpand '(filter ds (> year 2008))))
    => (tc/select-rows ds (fn [row2411] (> (get row2411 "year") 2008)))

by aphyr

3/25/2026 at 2:34:21 AM

In my experience, the advantage comes when you have a few more lines of code.

The Clojure pipelining makes code much more readable. Granted, dplyr has pipes too, but tidyverse pipes always felt like a hack on top of R (though my experience is dated here), while in Clojure I always feel like I'm playing with the fundamental language data types and protocols. I can extend things in any way I want.

by geokon

3/24/2026 at 11:13:47 PM

Couldn't agree more. R and dplyr's ability to pass column names as unquoted objects reduces cognitive load for new people so much (pure anecdata, nothing to back this up except lots of teaching people).

And that's on top of the vastly simpler syntax compared to what's being shown here

by condwanaland

3/24/2026 at 11:31:43 PM

All the comparisons are between scripting, untyped languages, presumably chosen for faster development and a more intuitive ecosystem that increases developer productivity.

In the age of IntelliSense, auto-completion, and AI-assisted coding, is choosing a scripting, untyped language still justifiable for the productivity gain, at the expense of safety and reliability?

If you're building a data system that isn't just for exploration, surely modern compiled, typed systems languages like Rust and D make more sense for safety and reliability for the end users?

Even more so with D, where you can even have scripting capability for the exploratory and prototyping stages with its REPL facility [1],[2]. This is feasible due to its very fast compile times, unlike Rust. It has a more intuitive "Pythonic" syntax compared to other typed languages [3]. You can also program with the GC on by default if you choose to. Apparently, you can have your cake and eat it too.

[1] drepl:

https://github.com/dlang-community/drepl

[2] Why I use the D programming language for scripting:

https://opensource.com/article/21/1/d-scripting

[3] All in on DLang: Why I pivoted to D for web, teaching, and graphics in 2025 and beyond! [PDF]

https://dconf.org/2025/slides/shah.pdf

by teleforce

3/25/2026 at 8:32:59 AM

One general challenge with statically, strongly typed languages is that one can quickly reach a local optimum that lacks some flexibility needed later on, which is only discovered after some usage and after seeing many use cases. Then a big refactoring lies ahead, possibly even of the project's core types. If that refactoring is allowed and the flexibility is introduced, expressing it in types often becomes quite complex, which, without a lot of care, will impact the users of the project. They need to adhere to the same types, and there may then be quite some ceremony around constructing something of the correct type in order to use it with the project.

It is safer, but it is not without its downsides. It demands a careful design to make something people will enjoy using.

by zelphirkalt

3/25/2026 at 2:40:41 AM

It's a bit apples to oranges.

If you're building a data system that isn't just for exploration, then you're probably not going to be using any of the presented options. However, in my experience Clojure has an ecosystem where it is very easy to transition from exploring/playing with data at the REPL to a more robust "pro" setup that's designed to scale, handle failures, etc.

by geokon

3/25/2026 at 4:45:46 AM

I understand the sentiment, but I disagree with the approach. It's probably efficient for exploratory work but not effective for everything else, including prototyping and systems development.

For any engineering work, including software engineering, you choose the best tool for the job. In D you have a high-performance tool capable of bit shifting, string processing, and array manipulation (to name a few), scaling from scripts to highly concurrent low-latency applications (see the presentation in ref [3] above by Prof. Shah from Yale).

It's a shame that proper typed programming languages are being ignored just because of programmers' locally sub-optimal preferences and limited exposure. The productivity gain from typical scripting languages, including Python, is diminishing every day with the proliferation of IntelliSense, auto-complete, and AI-assisted coding.

As for production code, systems based on scripting languages, if they ever make it to production (and many do, e.g. Airbnb, Twitter, Shopify, GitHub), will be a maintenance headache and a user nightmare if the support is not great and the company is not a unicorn start-up. The last thing you want is for the saved e-claim form that you spent many hours preparing to disappear completely because the system cannot recall the saved version. Granted, this can happen for many reasons, but most problematic production systems are written in scripting languages, including Python, because these are the only languages the programmers know and are familiar with. Adding insult to injury, the readily available so-called "batteries included" libraries are convenient but, ironically, written in other compiled but unsafe systems languages, C/C++.

by teleforce

3/25/2026 at 5:09:38 AM

I think you're going to have trouble convincing people that a compile-loop language is going to be on par with a REPL/interactive setup. You can look at an extreme example like MATLAB. With all your tools, you're never going to reach the same level of interactive productivity with D for the subset of problems it addresses.

You can have all your tools dump out and rewrite the oodles of boilerplate your typed languages require, but at the end of the day you have to read all that junk... or not? And just vibecode and #yolo it? But then you're back to "safety and reliability" problems and you haven't won anything.

Also, "safety and reliability" are just non-goals in a lot of contexts. My shitty plotting script doesn't care about "safety". It's not sitting on the network. It's reliable enough for the subset of inputs I provide it. I don't need to handle every conceivable corner case. I have other things to do.

> Adding insult to injury, the readily available so-called "batteries included" libraries are convenient but, ironically, written in other compiled but unsafe systems languages, C/C++.

No one cares if you leak memory in some corner case with some esoteric inputs. And no one is worried your BLAS bindings are going to leak your secrets. These are just not objectives.

by geokon

3/25/2026 at 5:44:59 AM

My point is that Dlang scales from beginner to expert, from scripting to highly concurrent low-latency applications. Why settle for sub-optimal scripting languages if you can have the real deal with much better performance and freely available open source?

In the automotive world, if you can afford it, you need a daily driver for work and supermarket runs, a weekend supercar for fun and showing off, and an off-road 4x4 for overnight camping. But in the software world D can cater for almost everything, with free open-source compilers, minimal productivity overhead, and much cheaper hosting as well [1].

Funny you mentioned BLAS, since D's BLAS implementation has also surpassed the run-of-the-mill high-performance BLAS libraries, something these scripting languages can only dream of (MATLAB calling third-party Fortran code, no less) [2].

[1] Saving Money by Switching from PHP to D:

https://dlang.org/blog/2019/09/30/saving-money-by-switching-...

[2] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...

by teleforce

3/25/2026 at 3:09:15 AM

The Clojure tablecloth performance numbers here are pretty surprising; I usually see Python/polars dominating these benchmarks. I've been running similar transformations on transit data feeds, and polars consistently outperforms pandas by 3x-5x on the group-by operations, but I hadn't considered Clojure for the pipeline. Is anyone actually using tablecloth in production data workflows?

by manudaro

3/24/2026 at 8:19:05 PM

Having "NA" being treated as nil/null/None by default seems like it would cause the Namibia problem!
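
To make the concern concrete, here is a minimal sketch in plain Python (not tablecloth's actual parser) of a reader that treats the string "NA" as missing by default. The `read_rows` helper, its `na_tokens` parameter, and the sample data are all hypothetical names made up for illustration.

```python
import csv
import io

# Hypothetical sample data: "NA" is Namibia's ISO 3166-1 alpha-2 code.
CSV_TEXT = "country,iso2\nNamibia,NA\nNorway,NO\nUnknown,\n"


def read_rows(text, na_tokens=("", "NA")):
    """Parse CSV text into dicts, mapping any cell found in na_tokens to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {key: (None if value in na_tokens else value) for key, value in row.items()}
        for row in reader
    ]


# Default behaviour: Namibia's country code is silently read as missing.
lossy = read_rows(CSV_TEXT)

# Dropping "NA" from the tokens keeps the code; empty cells are still missing.
safe = read_rows(CSV_TEXT, na_tokens=("",))
```

With the default tokens, the Namibia row comes back with no country code at all, which is why an explicit, per-column choice of missing-value tokens tends to be the safer default.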

by olivia-banks

3/24/2026 at 7:12:10 PM

Good pandas and polars code should also be written in an immutable way...

by __mharrison__

3/24/2026 at 7:17:12 PM

Good python code can exist, but python makes it so easy to write bad code that good python rarely exists.

by epgui

3/24/2026 at 7:18:11 PM

Agree. While it is common to see code like these pandas examples, it is very possible to write these manipulations so that they return a new frame or view without changing the inputs.
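
A rough sketch of what that can look like, assuming pandas is installed. The sample data and the helper names `add_mass_kg` and `after_2008` are hypothetical; only `assign`, `loc`, and `pipe` are pandas calls.

```python
import pandas as pd

# Hypothetical sample data standing in for a real dataset.
raw = pd.DataFrame({"year": [2007, 2008, 2009], "mass_g": [3750, 3800, 4100]})


def add_mass_kg(df):
    # assign returns a new DataFrame; df itself is left untouched.
    return df.assign(mass_kg=lambda d: d["mass_g"] / 1000)


def after_2008(df):
    # Boolean indexing with .loc returns a new object as well.
    return df.loc[df["year"] > 2008]


# pipe chains the steps without mutating the input frame.
result = raw.pipe(add_mass_kg).pipe(after_2008)
```

After this runs, `raw` still has its original two columns and three rows, so each step can be rerun or reordered without worrying about hidden state.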

by nxpnsv

3/24/2026 at 9:22:28 PM

I always wished Incanter took off.

by thrawa8387336

3/24/2026 at 6:52:46 PM

Clojure never got the data science crowd even though the language is genuinely good for it. Always felt like a distribution problem more than a technical one.

by soumyaskartha

3/24/2026 at 7:35:51 PM

In this very post you can see why: the dplyr code is just so much more readable. Like a lot of python, dplyr reads almost like pseudocode: take this dataset, select the columns that start with "bill", then filter so that bill_length is less than 30. So simple and so little fluff!
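
For comparison, the same three steps can be sketched in pandas (assuming pandas is installed). The data below is made up, and the column names are shortened to match the description above rather than taken from the real dataset.

```python
import pandas as pd

# Hypothetical penguin-like data; column names are illustrative only.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "bill_length": [29.5, 46.1, 38.9],
    "bill_depth": [18.7, 13.2, 17.8],
})

short_bills = (
    penguins
    .filter(regex="^bill")      # keep columns whose names start with "bill"
    .query("bill_length < 30")  # keep rows where bill_length is below 30
)
```

The column name inside `query` is a string rather than a bare symbol, which is the main thing dplyr's unquoted column names avoid.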

by levocardia

3/24/2026 at 11:15:55 PM

Julia's Tidier.jl ecosystem is getting there too. It uses macros to mimic this 'special' evaluation framework of R, so the code is also readable in a similar way.

by hatmatrix

3/24/2026 at 7:50:09 PM

> is just so much more readable

I thought that too before I learned Clojure, now I find them equally readable.

by erichocean

3/24/2026 at 11:49:57 PM

I'm very familiar with Clojure, but even I can't make a good argument that:

    (tc/select-rows ds #(> (% "year") 2008))

is more, or at least as, intuitive as:

    filter(ds, year > 2008)

as cited above. I think there's a good argument to be made that Clojure's data processing abilities, particularly around immutable data, make a compelling case in spite of the syntax. The REPL is great too, and the JVM is fast. But I still to this day imagine infix comparisons in my head and then mentally move the comparator to the front of the list to make sure I get it right.

by lemming

3/25/2026 at 7:31:36 AM

How about this?

    (filter ds (> year 2008))

That's a trivial Clojure macro to make work if it's what you find "intuitive."

by erichocean

3/25/2026 at 12:55:07 AM

I am really not in data science, and I have decent Clojure experience. Is there a reason anyone would pick Clojure over something like K? From what I understand, those array languages are really good for writing safe but efficient code on rectangular data.

by Capricorn2481

3/24/2026 at 7:05:45 PM

Unfortunately, having to mess around with a JVM is a tough sell for a lot of data analysis folks. I'm not saying it's rational or right, but a lot of people hear "JVM" and they go "no thank you". Personally I think it's a non-issue, but you have to meet people where they are.

by asa400

3/24/2026 at 8:16:38 PM

The irony, given the mess of Python setup, where there are companies whose business is to solve Python tooling.

by pjmlp

3/25/2026 at 1:55:54 AM

Oh, I completely agree. Like I said, it's not rational, but it is what it is.

by asa400

3/24/2026 at 8:43:39 PM

I dunno, if you can slog through the Python ecosystem then the JVM is starting to look not so bad. Plus with Clojure you don't need to deal with the headache and heartache that is Maven.

by cmiles74

3/25/2026 at 3:19:13 AM

I think that's true for only a limited subset of programs, though. The Clojure lib ecosystem is nowhere near the size of the broader Java ecosystem, so you frequently end up pulling Maven deps to plug holes anyway.

by KingMob

3/25/2026 at 6:59:52 AM

That is the goal of a polyglot runtime, and why Clojure was designed to be a hosted language that embraces the platform, unlike others that make their tiny island.

by pjmlp

3/24/2026 at 7:36:13 PM

Meanwhile, I find it very annoying to deal with the litany of Python versions and the distinction between global packages and user packages, and needing to manage virtual environments just to run scripts. That being said, I am not an expert but that's always been my experience when I need to do anything Python related.

by famicom0

3/24/2026 at 8:49:47 PM

idk, I don't think I've had to do anything beyond install the JVM to work with Clojure. I'm not really a fan of the clj command's flag choices though (-M, -X, etc. all make no sense).

by packetlost

3/25/2026 at 3:15:53 AM

It's unfortunate, but people's associations with Java the lang bleed into their beliefs about the JVM, one of the most heavily-optimized VMs on the planet.

There's some historical cruft (especially the memory model), but picking the JVM as a target is a great decision (especially with Graal offering even more options).

by KingMob

3/25/2026 at 7:04:12 AM

Exactly, especially because there isn't THE JVM, rather a bunch of versions each with their own approaches to GC, JIT, JIT caches, ahead of time compilation.

Only .NET follows up on it at scale.

by pjmlp

3/24/2026 at 9:01:52 PM

Interesting perspective. Clojure’s immutable, functional approach makes data wrangling feel very different from the more imperative style of R and Python.

by QubridAI