4/20/2026 at 8:06:47 AM
It wouldn't surprise me one bit if many of these things can be attributed to Excel usage. I'm a "power user" of excel, and when working on larger problems with tens of sheets, smaller mistakes can easily carry on. Even more so if you're not a proficient user.One of my first jobs as an analyst was to clean up messy spreadsheets made by people, even very senior employees, who never bothered to learn excel properly.
by TrackerFF
4/20/2026 at 8:22:02 AM
Yes, immediately thought the same. CSV alone is a footgun and a half on any computer which doesn't have . as the decimal separator.Let alone column sorting and joining of data.
by jcattle
4/20/2026 at 11:45:46 AM
CSV occupies, even years after moving away from more raw data work, way too much of my brain is still dedicated to "ways of dealing with CSV from random places".I can already hear people who like CSV coming in now, so to get some of my bottled up anger about CSV out and to forestall the responses I've seen before
* It's not standardised
* Yes I know you found an RFC from long after many generators and parsers were written. It's not a standard, is regularly not followed, doesn't specify allowing UTF-8 (lmao, in 2005 no less) or other character sets as just files. I have learned about many new character sets from submitted data from real users. I have had to split up files written in multiple different character sets because users concatenated files.
* "You can edit it in a text editor" which feels like a monkeys-paw wish "I want to edit the file easily" "Granted - your users can now edit the files easily". Users editing the files in text editors results in broken CSV files because your text editor isn't checking it's standards compliant or typed correctly, and couldn't even if it wanted to.
* Errors are not even detectable in many cases.
* Parsers are often either strict and so fail to deal with real world cases or deal with real world cases but let through broken files.
* Literally no types. Nice date field you have there, shame if someone were to add a mixture of different dd/mm/yy and mm/dd/yy into it.
* You can blame excel for being excel, but at some point if that csv file leaves an automated data handling system and a user can do something to it, it's getting loaded into excel and rewritten out. Say goodbye to prefixed 0s, a variety of gene names, dates and more in a fully unrecoverable fashion.
* "ah just use tabs" no your users will put tabs in. "That's why I use pipes" yes pipes too. I have written code to use actual data separators and actual record separators that exist in ASCII and still users found some way of adding those in mid word in some arbitrary data. The only three places I've ever seen these characters are 1. lists of ascii characters where I found them, 2. my code, 3. this users data. It must have been crafted deliberately to break things.
This, excel and other things are enormous issues. The fact that there any are manual steps along the path for this introduces so many places for errors. People writing things down then entering them into excel/whatever. Moving data between files. You ran some analysis and got graphs, are those the ones in the paper? Are they based on the same datasets? You later updated something, are all the downstream things updated?
This occurs in all kinds of papers, I've seen clear and obvious issues over datasets covering many billions of spending, in aggregate trillions. I can only assume the same is true in many other fields as well as those processes exist there too.
There is so much scope to improve things, and yet so much of this work is done by people who don't know what the options are and often are working late hours in personal time to sort that it's rarely addressed. My wife was still working on papers for a research position she left and was not being paid for any more years after, because the whole process is so slow for research -> publication. What time is there then for learning and designing a better way of tracking and recording data and teaching all the other people how to update & generate stats? I built things which helped but there's only so much of the workflow I could manage.
by IanCal
4/20/2026 at 2:21:33 PM
While I appreciate a good rant just as much as the next person, most of these points have nothing to do with CSV. They are a general problem with underspecifying data, which is exactly what happens when you move data between systems.The amount of hours I have wasted on unifying character sets across single database tables is horrifying to even think about. And the months it took before an important national dataset that supposedly many people use across several types of businesses was staggering. That fact that that XML came with a DTD was apparently not a hindrance to doing unspeakable horrors with both attributes and cdata constructs.
Sure, you can specify MM/DD/YY in a table, but it people put DD/MM/YY in there, what are you going to do about it? And that's exactly what happens in the real world when people move data across systems. That's why mojibake is still a thing in 2026.
by xorcist
4/20/2026 at 12:37:19 PM
You're blaming a lot of normal ETL problems on DSVs.Like, specifying date as a type for a field in JSON isn't going to ensure that people format it correctly and uniformly. You still have parsing issues, except now you're duplicating the ignored schema for every data point. The benefit you get for all of that overhead is more useful for network issues than ensuring a file is well formed before sending it. The people who send garbage will be more likely to send garbage when the format isn't tabular.
There are types and there is a spec WHEN YOU DEFINE IT.
You define a spec. You deal with garbage that doesn't match the spec. You adjust your tools if the garbage-sending account is big. You warn or fire them if they're small. You shit-talk the garbage senders after hours to blow off steam. That's what ETL is.
DSVs aren't the problem. Or maybe they are for you because you're unable to address problems in your process, so you need a heavy unreadable format that enforces things that could be handled elsewhere.
by mcdonje
4/20/2026 at 1:16:40 PM
I would kind of disagree.We are talking here in the context of scientific datasets. Of course ETL plays a part here. However here it is really more the interplay of Excel with CSV which is often outputted by scientific instruments or scientific assistants.
You get your raw sensor data as a csv, just want to take a look in excel, it understandably mangles the data in attempt to infer column types, because of course it does, its's CSV! Then you mistakenly hit save and boom, all your data on disk is now an unrecoverable mangled mess.
Of course this is also the fault of not having good clean data practices, but with CSV and Excel it is just so, so easy to hold it wrong, simply because there is no right.
> so you need a heavy unreadable format
I prefer human unreadable if it means I get machine readable without any guesswork.
by jcattle
4/20/2026 at 1:49:18 PM
This is an excellent rant, thanks for sharing. I didn’t have to work with csv s mich as you, but what experience I had I share your sentiment.by adrianN
4/20/2026 at 12:15:50 PM
Completely agree. If CSVs stay read only and are not user-submitted but computer generated they can be okay at best.Anything else? Nope nope nope!
by jcattle
4/20/2026 at 11:46:14 AM
I think that's a little unfair. It really comes down to parsing text and you'll have similar issues even if you use a database or whatever you think the "real" solution is. I have a project I'm working on right now that stores dates, phone numbers, and website links. Cleaning/parsing has been 90% of the work and I still have edge cases that aren't fully solved. Every time I think I'm done, I find something else I haven't thought about. Local AI models have been a huge help though for sanitizing.by tempaccount5050
4/20/2026 at 12:16:07 PM
that's a little unfair. It really comes down to parsing text and you'll have similar issues even if you use a database or whateverFeel free to show a real-world example of a database or whatever that takes the input string "IGF1 SEPT2 PRX3 MARCH1" and writes that into storage as ["IGF1", "2026-09-02", "PRX3", "2026-03-01"].
Also with Excel, an inadvertent click+drag can move data between cells, and since the cells are uniform it's hard to see that anything unintended happened. I've seen people lose files in Windows Explorer the same way: double-click with a shaky hand can easily move a file into a subdirectory.
by tremon
4/20/2026 at 1:38:20 PM
You still have to push and pull from the db. Meaning transforms still need to happen in either direction. I get what you're saying but it's just as easy to screw up a regex in either direction. Or making assumptions about how your language of choice will handle dates etc.by tempaccount5050