4/21/2025 at 11:31:26 PM
The author keeps calling it "pipelining", but I think the right term is "method chaining". Compare with a simple pipeline in bash:
grep needle < haystack.txt | sed 's/foo/bar/g' | xargs wc -l
Each of those components executes in parallel, with the intermediate results streaming between them. You get a similar effect with coroutines.

Compare Ruby:
  data = File.readlines("haystack.txt")
    .map(&:strip)
    .grep(/needle/)
    .map { |i| i.gsub('foo', 'bar') }
    .map { |i| File.readlines(i).count }
In that case, each line is processed sequentially, with a complete array being created between each step. Nothing actually gets pipelined.

Despite being clean and readable, I don't tend to do it any more, because it's harder to debug. More often these days, I write things like this:
  data = File.readlines("haystack.txt")
  data = data.map(&:strip)
  data = data.grep(/needle/)
  data = data.map { |i| i.gsub('foo', 'bar') }
  data = data.map { |i| File.readlines(i).count }
It's ugly, but you know what? I can set a breakpoint anywhere and inspect the intermediate states without having to edit the script in prod. Sometimes ugly and boring is better.
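For example, one way to do it (a sketch using binding.irb from the standard library; the exact breakpoint spot is arbitrary):

  data = File.readlines("haystack.txt")
  data = data.map(&:strip)
  binding.irb   # drop into an interactive IRB session here and inspect `data`
  data = data.grep(/needle/)
  data = data.map { |i| i.gsub('foo', 'bar') }
  data = data.map { |i| File.readlines(i).count }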
by invalidator
4/21/2025 at 11:49:34 PM
> The author keeps calling it "pipelining", but I think the right term is "method chaining". [...] You get a similar effect with coroutines.

The inventor of the shell pipeline, Douglas McIlroy, always understood the equivalence between pipelines and coroutines; it was deliberate. See https://www.cs.dartmouth.edu/~doug/sieve/sieve.pdf

It goes even deeper than it appears, too. The way pipes were originally implemented in the Unix kernel was that when the pipe buffer was filled[1] by the writer, the kernel continued execution directly in the blocked reader process without bouncing through the scheduler. Effectively, arguably literally, coroutines: one process calls the write function and execution continues with a read call returning the data.
Interestingly, Solaris Doors operate the same way by design--no bouncing through the scheduler--unlike pipes today, where I think most Unix kernels long ago moved away from direct execution switching to better support multiple readers, etc.
[1] Or even on the first write? I'd have to double-check the source again.
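To make the coroutine view concrete, here is a toy Ruby sketch (my illustration, not McIlroy's): each stage is a Fiber that pulls exactly one line from the stage upstream, so nothing is buffered beyond the item in flight.

  # Toy pipeline built from coroutines: a reader stage and a grep-like stage,
  # each pulling one line at a time from the stage before it.
  reader = Fiber.new do
    File.foreach("haystack.txt") { |line| Fiber.yield(line.strip) }
    nil
  end

  grepper = Fiber.new do
    while (line = reader.resume)
      Fiber.yield(line) if line.match?(/needle/)
    end
    nil
  end

  while (line = grepper.resume)
    puts line.gsub("foo", "bar")
  end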
by wahern
4/22/2025 at 5:16:40 AM
I don’t find your “seasoned developer” version ugly at all. It just looks more mature and relaxed. It also has the benefits that you can actually do error handling and have space to add comments. Maybe people don’t like it because of the repetition of “data =”, but in fact you could use descriptive new variable names, making the code even more readable (auto-documenting). I’ve always felt method chaining looks “cramped”, if that’s the right word. Like a person drawing on paper but only using the upper left corner. However, this surely is also a matter of preference, or what you’re used to.
by marhee
4/22/2025 at 7:19:25 AM
I have a lot of code like this. The reason I prefer pipelines now is the mental overhead of understanding the intermediate step variables.

Something like
  lines = File.readlines("haystack.txt")
  stripped_lines = lines.map(&:strip)
  needle_lines = stripped_lines.grep(/needle/)
  transformed_lines = needle_lines.map { |line| line.gsub('foo', 'bar') }
  line_counts = transformed_lines.map { |file_path| File.readlines(file_path).count }
is a hell to read and understand later, imo. You have to read a lot of intermediate variables that do not matter anywhere else in the code after you set them up, but you do not necessarily know in advance which matter and which don't unless you read and understand all of it. Also, it pollutes your workspace with too much stuff, so while this makes it easier to debug, it makes it harder to read some time later. Moreover, it becomes even clunkier if you need to repeat the code: you probably need to define a function block then, which just moves the clunkiness there.

What I do now is start by defining the transformation in each step as a pure function, then chain them once everything works, plus enclose the whole thing in an error handler so that I depend on breakpoint debugging less.
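Roughly like this (a sketch; the step names and the error handling are made up for illustration):

  strip_lines  = ->(lines) { lines.map(&:strip) }
  keep_needles = ->(lines) { lines.grep(/needle/) }
  fix_names    = ->(lines) { lines.map { |l| l.gsub("foo", "bar") } }
  count_lines  = ->(paths) { paths.map { |p| File.foreach(p).count } }

  begin
    result = File.readlines("haystack.txt")
      .then(&strip_lines)
      .then(&keep_needles)
      .then(&fix_names)
      .then(&count_lines)
  rescue Errno::ENOENT => e
    warn "missing file: #{e.message}"
  end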
There is certainly a trade-off, but as a codebase grows larger and deals with more cases where the same code needs to be applied, the benefits of a concise yet expressive notation show.
by freehorse
4/22/2025 at 7:28:03 AM
Code in this "named-pipeline" style is already self-documenting: using the same variable name makes it clear that we are dealing with a pipeline/chain. Using more descriptive names for the intermediate steps hides this, making each line more readable (and even then you're likely to end up with `dataStripped = data.map(&:strip)`) at the cost of making the block as a whole less readable.
by deredede
4/23/2025 at 8:16:40 PM
> Maybe people don’t like it because of the repetition of “data =”

Eh, at first glance it looks "amateurish" due to all the repeated stuff. Chaining explicitly eliminates redundant operations - a more minimal representation of data flow - so it looks more "professional". But I also know better than to act on that impulse. ;)
That said, it really depends on the language at play. Some will compile all the repetition of `data =` away such that the variable's memory isn't re-written until after the last operation in that list; it'll hang out in a register or on the stack somewhere. Others will run the code exactly as written, bouncing data between the heap, stack, and registers - inefficiencies and all.
IMO, a comment like "We wind up debugging this a lot, please keep this syntax" would go a long way to help the next engineer. Assuming that the actual processing dwarfs the overhead present in this section, it would be even better to add discrete exception handling and post-conditions to make it more robust.
by pragma_x
4/22/2025 at 2:41:32 AM
In most debuggers I have used, if you put a breakpoint on the first line of the method chain, you can "step over" each function in the chain until you get to the one you want.

Bit annoying, but serviceable. Though there's nothing wrong with your approach either.
by ehnto
4/22/2025 at 12:32:55 PM
Debuggers can take it even further if they want that UX. In Firefox, given a chain of foo().bar().baz(), you can set a breakpoint on any of 'em.

https://gist.github.com/user-attachments/assets/3329d736-70f...
by grimgrin
4/22/2025 at 4:29:35 PM
> The author keeps calling it "pipelining", but I think the right term is "method chaining".

Allow me, too, to disagree. I think the right term is "function composition".
Instead of writing
h(g(f(x)))
as a way to say "first apply f to x, after which g is applied to the result of this, after which h is applied to the result of this", we can use function composition to compose f, g and h, and then "stuff" the value x into this "pipeline of composed functions".

We can use whatever syntax we want for that, but I like Elm syntax, which would look like:
x |> f >> g >> h
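Ruby can express roughly the same thing with Proc composition (a small sketch; the lambdas are placeholders):

  f = ->(x) { x + 1 }
  g = ->(x) { x * 2 }
  h = ->(x) { x.to_s }

  pipeline = f >> g >> h   # compose the functions first...
  pipeline.call(3)         # ...then "stuff" the value in => "8"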
by runeks
4/22/2025 at 12:37:35 AM
If you add in a call to ".lazy" it won't create all the intermediate arrays. It's been there since at least 2.7: https://ruby-doc.org/core-2.7.0/Enumerator/Lazy.html
by billdueber
4/22/2025 at 1:10:47 PM
I do the same with Python, replacing multilevel comprehensions with intermediary steps of generator expressions, which are lazy and therefore do not impact performance and memory usage.
by dorfsmay
4/22/2025 at 1:59:08 AM
Ultimately it will depend on the functions being chained. If they can work with one part of the result, or a subset of parts, then they might not block; otherwise they will still need to get a complete result, and laziness cannot help.
by zelphirkalt
4/22/2025 at 6:24:16 AM
Not much different from having a `sort` in a shell pipeline, I guess?
by hbogert
4/22/2025 at 1:42:45 AM
Shouldn’t modern debuggers be able to handle that easily? You can step in, step out, until you get where you want, or you could set a breakpoint in the method you want to debug instead of at the call site.
by ses1984
4/22/2025 at 2:19:09 PM
Even if your debugger can't do that, an AI agent can easily change the code for you to add intermediate output.
by abirch
4/22/2025 at 2:56:13 PM
...an AI agent can independently patch your debugger to modify the semantics? Wow, that's crazy.

Incidentally, have you ever considered investing in real estate? I happen to own an interest in a lovely bridge which, for personal reasons, I must suddenly sell at a below-market price.
by bccdee
4/23/2025 at 10:11:20 PM
I think the best term is "function composition", but with a particular syntax, so pipelining seems alright. Method chaining is a common case, where some base object is repeatedly modified by some action and then the object reference is returned by the "method", thus allowing the "chaining", but what if you're not dealing with objects and methods? The pipelined composition pattern is more general than method chaining imho.

You make an interesting point about debugging, which is something I have also encountered in practice. There is an interesting tension here which I am unsure about how to best resolve.
In PRQL we use the pipelining approach by using the output of the last step as the implicit last argument of the next step. In M Lang (MS Power BI/Power Query), which is quite similar in many ways, they use the second approach, in that each step has to be named. This is very useful for debugging, as you point out, but also a lot more verbose and can be tedious. I like both but prefer the ergonomics of PRQL for interactive work.
Update: Actually, PRQL has a decent answer to this. Say you have a query like:
  from invoices
  filter total > 1_000
  derive invoice_age = @2025-04-23 - invoice_date
  filter invoice_age > 3months
and you want to figure out why the result set is empty. You can pipe the results into an intermediate reference like so:

  from invoices
  filter total > 1_000
  into tmp

  from tmp
  derive invoice_age = @2025-04-23 - invoice_date
  filter invoice_age > 3months
So, good ergonomics on the happy path and a simple enough workaround when you need it. You can try these out in the PRQL Playground btw: https://prql-lang.org/playground/
by snthpy
4/22/2025 at 2:11:30 AM
> Despite being clean and readable, I don't tend to do it any more, because it's harder to debug. More often these days, I write things like this:

  data = File.readlines("haystack.txt")
  data = data.map(&:strip)
  data = data.grep(/needle/)
  data = data.map { |i| i.gsub('foo', 'bar') }
  data = data.map { |i| File.readlines(i).count }
Hard disagree. It's less readable, the intent is unclear (where does it end?), and the variables are rewritten on every step and everything is named "data" (and please don't call them data_1, data_2, ...), so now you have to run a debugger to figure out what even is going on, rather than just... reading the code.
by refactor_master
4/22/2025 at 4:46:10 AM
The person you are quoting already conceded that it is less readable, but that the ability to set a breakpoint easily (without having to stop the process and modify the code) is more important.

I myself agree, and find myself doing that too, especially in frontend code that executes in a browser. Debuggability is much more important than marginally-better readability, for production code.
by veidr
4/22/2025 at 3:03:57 PM
> Debuggability is much more important than marginally-better readability, for production code.

I find this take surprising. I guess it depends on how much weight you give to "marginally-better", but IMHO readability is the single most important factor when it comes to writing code in most code-bases. You write code once; it may need to be debugged (by yourself or others) on rare occasions. However, anytime anyone needs to understand the code (to update it, debug it, or just make changes in adjacent code) they will have to read it. In a shared code-base your code will be read many more times than it will be updated/debugged.
by jlkuester7
4/23/2025 at 1:48:03 PM
Yeah, part of it is that I do find

  const foo = something()
    .hoge()
    .hige()
    .hage();

better, sure, but not actually significantly harder to read than:

  let foo = something();
  foo = foo.hoge();
  foo = foo.hige();
  foo = foo.hage();
But, while reading is more common than debugging, debugging a production app is often more important. I guess I am mostly thinking about web apps, because that is the area where I have mainly found the available debuggers lacking. Although they are getting better, I believe, I've frequently seen problems where they can't debug into some standard language feature because it's implemented in C++ native code, or they just don't expose the implicit temporary variables in a useful way.

(I also often see similar-ish problems in languages where the debuggers just aren't that advanced, due to lack of popularity, or whatever.)
Particularly with web apps, though, we often want to attach to the current production app for initial debugging instead of modifying the app and running it locally, usually because somebody has reported a bug that happens in production (but how to reproduce it locally is not yet clear).
Alternatively stated, I guess, I believe readability is important, and maybe the "second most important thing", but nevertheless we should not prefer fancy/elegant code that feels nice to us to write and read, but makes debugging more difficult (with the prevailing debuggers) in any significant way.
In an ideal world, a difference like the above wouldn't be harder to debug, in which case I would also prefer the first version.
(And probably in the real world, the problems would be with async functions less conducive to the pithy hypothetical example. I'm a stalwart opponent of libraries like RxJs for the sole reason that you pay back with interest all of the gains you realized during development, the first time you have to debug something weird.)
by veidr
4/22/2025 at 10:12:02 AM
> Each of those components executes in parallel, with the intermediate results streaming between them. You get a similar effect with coroutines.

Processes run in parallel, but they process the data in a strict sequential order: «grep» must produce a chunk of data before «sed» can proceed, and «sed» must produce another chunk of data before «xargs» can do its part. «xargs» in no way can ever pick up the output of «grep» and bypass the «sed» step. If the preceding step is busy crunching the data and is not producing output, the subsequent step will be blocked (the process will fall asleep). So it is both a pipeline and a chain.
It is actually a directed data flow graph.
Also, if you replace «haystack.txt» with a /dev/haystack, i.e.
grep needle < /dev/haystack | sed 's/foo/bar/g' | xargs wc -l
and /dev/haystack is waiting on the device it is attached to to yield a new chunk of data, all three, «grep», «sed» and «xargs», will block.
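A quick Ruby sketch of the same blocking behaviour (illustrative only; it uses fork, so Unix-like systems):

  # The reader blocks inside `gets` until the writer produces a line,
  # just like a downstream process blocked on an empty pipe.
  r, w = IO.pipe

  pid = fork do
    w.close
    puts "reader got: #{r.gets.inspect}"   # blocks here
  end

  r.close
  sleep 2            # simulate a slow producer, like /dev/haystack
  w.puts "needle"
  w.close
  Process.wait(pid)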
by inkyoto
4/22/2025 at 12:54:06 AM
> The author keeps calling it "pipelining", but I think the right term is "method chaining".

I believe the correct definition for this concept is the Thrush combinator[0]. In some ML-based languages[1], such as F#, the |> operator is defined[2] for the same purpose:
[1..10] |> List.map (fun i -> i + 1)
Other functional languages have libraries which also provide this operator, such as the Scala Mouse[3] project.

0 - https://leanpub.com/combinators/read#leanpub-auto-the-thrush
1 - https://en.wikipedia.org/wiki/ML_(programming_language)
2 - https://fsharpforfunandprofit.com/posts/defining-functions/
by AdieuToLogic
4/22/2025 at 2:53:10 AM
I'm not sure that's right; method chaining is just immediately acting on the return of the previous function, directly. It doesn't pass the return into the next function like a pipeline. The method must exist on the returned object. That is different to pipelines or thrush operators. Evaluation happens in the order it is written.

Unless I misunderstood the author; because method chaining is super common whereas I feel thrush operators are pretty rare, I would be surprised if they meant the latter.
by ehnto
4/22/2025 at 2:58:01 PM
They cite Gleam explicitly, which has a thrush operator in place of method chaining.

I get the impression (though I haven't checked) that the thrush operator is a backport of OOP-style method chaining to functional languages that don't support dot-method notation.
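To put the difference in Ruby terms (a rough sketch, not tied to Gleam): dot-method chaining needs each step to be a method on the previous result, while a thrush-style pipe only needs a callable.

  # Method chaining: each step must be a method defined on the previous value.
  "  needle in foo  ".strip.downcase.sub("foo", "bar")

  # Thrush-style piping with Object#then: each step is just a block,
  # so it works even when no suitable method exists on the object itself.
  "  needle in foo  "
    .then { |s| s.strip }
    .then { |s| s.downcase }
    .then { |s| s.sub("foo", "bar") }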
by bccdee
4/22/2025 at 1:53:20 AM
For debugging method chains you can just use `tap`.
by dzuc
4/22/2025 at 1:23:29 PM
Isn't the difference between a pipeline and a method chain that a pipeline doesn't have to wait for the previous process to complete in order to send results to the next step? Grep sends lines as it finds them to sed, and sed on to xargs, which acts as a sink to collect the data (and is necessary, otherwise wc -l would write out a series of ones).

Given File.readlines("haystack.txt"), the entire file must be resident in memory before .grep(/needle/) is performed, which may cause unnecessary utilization. Iirc, in frameworks like Polars, the collect() chain-ending method tells the compiler that the previous methods will be performed as a stream and thus not require pulling the entirety into memory in order to perform an operation on a subset of the corpus.
by adolph
4/22/2025 at 3:03:13 PM
Yeah, I've always heard this called method chaining. It's widespread in C#, particularly with Linq (which was explicitly designed to leverage it).

I've only ever heard the term 'pipelining' in reference to GPUs, or as an abstract umbrella term for moving data around.
by mystified5016
4/22/2025 at 2:11:08 PM
In Python, steps like map() and filter() would execute concurrently, without large intermediate arrays. It lacks the chaining syntax for them, too.

Java streams are the closest equivalent, both by the concurrent execution model and syntactically. And yes, the Java debugger can show you the state of the intermediate streams.
by nine_k
4/22/2025 at 3:57:39 PM
> would execute concurrently

Iterators are not (necessarily) concurrent. I believe you mean lazily.
by maleldil
4/22/2025 at 4:03:17 PM
Concurrent, not parallel.

That is, iterators' execution flow is interspersed, with the `yield` statement explicitly giving control to another coroutine, and then continuing the current coroutine at another yield point, like the call to next(). This is very similar to JS coroutines implemented via promises, with `await` yielding control.
Even though there is only one thread of execution, the parts of the pipeline execute together in lockstep, not sequentially, so there's no need for a previous part to completely compute a large list before the following part can start iterating over it.
by nine_k
4/22/2025 at 3:36:56 AM
If you work with I/O, where you can have all sorts of wrong/invalid data and I/O errors, the chaining is a nightmare, as each step in the chain can have numerous different errors/exceptions.

The chaining really only works if your language is strongly typed and you are somewhat guaranteed that variables will be of the expected type.
by slt2021
4/22/2025 at 2:41:49 AM
I have to object to reusing the 'data' var. Make up a new name for each assignment, in particular when types and data structures change (like the last step switching from strings to ints).

Other than that I think both styles are fine.
by 3np
4/22/2025 at 9:39:12 AM
I agree with this comment: https://news.ycombinator.com/item?id=43759814 that this pollutes the current scope, which is especially bad if scoping is not that narrow (the case in Python, where if-branches do not define their own scope; I don't know about Ruby).

Another problem of having different names for each step is that you can no longer quickly comment out a single step to try things out, which you can if you either have the pipeline or a single variable name.
by hiq
4/21/2025 at 11:52:05 PM
Syntactic sugar can sometimes fool us into thinking the underlying process is more efficient or streamlined. As a new programmer, I probably would have assumed that "storing" `data` at each step would be more expensive.
by axblount
4/24/2025 at 8:43:56 PM
It depends on the language you're using.

For my Ruby example, each of those method calls will allocate an Array on the heap, where it will persist until all references are removed and the GC runs again. The extra overhead of the named reference is somewhere between Tiny and Zero, depending on your interpreter. No extra copies are made; it's just a reference.
In most compiled languages: the overhead is exactly zero. At runtime, nothing even knows it's called "data" unless you have debug symbols.
If these are going to be large arrays and you actually care about memory usage, you wouldn't write the code the way I did. You might use lazy enumerators, or just flatten it out into a simple procedure; either of those would process one line at a time, discarding all the intermediate results as it goes.
Also, "File.readlines(i).count" is an atrocity of wasted memory. If you care about efficiency at all, that's the first part to go. :)
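For what it's worth, a lazy variant of the earlier example might look something like this (a sketch; File.foreach also avoids the readlines waste mentioned above):

  # Streams one line at a time; no intermediate arrays are materialized
  # until the final to_a.
  counts = File.open("haystack.txt") do |f|
    f.each_line.lazy
     .map(&:strip)
     .grep(/needle/)
     .map { |line| line.gsub("foo", "bar") }
     .map { |path| File.foreach(path).count }
     .to_a
  end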
by invalidator
4/22/2025 at 12:31:41 AM
It absolutely becomes very inefficient, though the threshold data set size varies according to context. Most languages don't have lightweight coroutines as an alternative (but see Lua!), so the convenient alternatives have a larger fixed cost. Plus, cache locality means cache utilization might be just as helpful, or even better, as opposed to switching back and forth for every data element, though coroutine-based approaches can also use buffering strategies, which not coincidentally is how pipes work.

But, yes, naive call chaining like that is sometimes a significant performance problem in the real world. For example, in the land of JavaScript. One of the more egregious examples I've personally seen was a Bash script that used Bash arrays rather than pipelines, though in that case it had to do with the loss of concurrency, not data churn.
by wahern
4/22/2025 at 1:33:36 PM
Reading this, I am so happy that my first language was a Scheme where I could see the result of the first optimization passes.

This helped me quickly develop a sense for how code is optimized and what code is eventually executed.
by bjoli
4/22/2025 at 10:52:49 AM
Exactly that. It looks nice but it's annoying to debug.

I do it in a similar way to what you mentioned.
by raverbashing
4/22/2025 at 1:56:03 AM
I think updating the former to the latter when you are actually debugging something isn't that big of a deal.

But with actually checked-in code, the tradeoff in readability is pretty substantial.
by jjfoooo4