alt.hn

2/13/2026 at 9:35:52 PM

Show HN: Data Engineering Book – An open source, community-driven guide

https://github.com/datascale-ai/data_engineering_book/blob/main/README_en.md

by xx123122

2/14/2026 at 5:50:24 PM

Thank you so much for this book! I'm finding the translation to be of very high quality.

I am a complete novice at training LLMs and have been trying to train a novel architecture for Python code generation on Apple Silicon.

I've been a bit frustrated, to be honest, that the data tools don't seem to have any focus on code; their modalities are generic text and images. And for synthetic data generation I would love to use EBNF-constrained outputs, but SGLang is not available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I could just tune the curriculum/filters to feed into training.
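
To illustrate the kind of knob-tuning I mean: a rough, untested sketch using the Hugging Face datasets library against The Stack, where the data_dir layout and the filter thresholds are just my assumptions:

    # Untested sketch: stream a code corpus and apply tunable curriculum
    # filters. Assumes the Hugging Face `datasets` library and The Stack's
    # per-language directory layout; thresholds are placeholders.
    from datasets import load_dataset

    ds = load_dataset(
        "bigcode/the-stack-dedup",
        data_dir="data/python",  # assumed per-language subdirectory
        split="train",
        streaming=True,          # avoid downloading the whole corpus
    )

    def keep(example):
        code = example["content"]
        lines = code.splitlines()
        if not lines:
            return False
        # Placeholder heuristics; exactly the knobs I'd want exposed.
        longest = max(len(line) for line in lines)
        alnum_frac = sum(c.isalnum() for c in code) / len(code)
        return longest < 200 and 0.3 < alnum_frac < 0.9

    filtered = ds.filter(keep)  # lazily applied while streaming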

by fudged71

2/14/2026 at 4:03:40 AM

I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.

by esafak

2/14/2026 at 9:29:10 AM

That's a great point. I completely agree: 'Data Engineering for LLMs' is much more accurate given the content. I'll pass this feedback on to the project lead immediately. Thanks for the suggestion.

by xx123122

2/14/2026 at 1:48:37 PM

Project lead? The post gives the impression that you are a student open sourcing your learning notes.

by dotancohen

2/15/2026 at 1:20:10 PM

Let me clarify the team structure to avoid any misunderstanding.

We are actually three first-year Master's students. This project is indeed a summary of our learning from this past semester, which we rushed to wrap up right before the Chinese New Year break.

When I mentioned 'Project Lead,' I was referring to a senior PhD candidate in our lab. He acts as a mentor, reviewing our code and ensuring quality control, but the learning and implementation are very much ours. And yes, to move fast and polish the English, we did use LLMs during the writing process.

by xx123122

2/15/2026 at 1:31:58 PM

You guys are doing an excellent job. Happy New Year, with wishes for prosperity!

by dotancohen

2/14/2026 at 5:08:22 PM

No, I think it gives the impression that the author is using an LLM without much supervision to not only write the submitted content, but to reply to posts here.

by gtowey

2/15/2026 at 1:22:23 PM

I promise there is a human behind the keyboard! QAQ

My English is not good, so I use GPT to help translate and polish my replies to be polite. Maybe it made them sound too robotic. I am reading every comment myself. Sorry for the wrong impression.

by xx123122

2/14/2026 at 5:59:12 AM

I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence:

> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure

https://github.com/datascale-ai/data_engineering_book/blob/m...

Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...

Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...

by hliyan

2/14/2026 at 9:22:43 AM

Appreciate the honest feedback.

by xx123122

2/15/2026 at 5:08:38 AM

It's important for a book treating an emerging field (data engineering for LLMs) to mention emerging categories related to it, such as storage formats purpose-built for the full ML lifecycle.

Lance[1] (the format, not just LanceDB) is a great example: columnar storage optimized for both analytical operations and vector workloads, together with built-in versioning for dataset iteration.

Plus (very importantly) random access, which matters for sampling and efficient filtering during curation, but also for working with multimodal data, e.g. videos.

Lance is not alone: Vortex[2] is another, Nimble[3] from Meta yet another, and I might be missing a few more.
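
To make the random-access point concrete, a minimal sketch using the pylance bindings (API names written from memory, so treat them as approximate):

    # Minimal sketch of random access and versioning in Lance,
    # using the pylance bindings; API names written from memory.
    import lance
    import pyarrow as pa

    # Toy corpus; in practice these would be your training shards.
    table = pa.table({
        "id": list(range(1000)),
        "text": [f"sample {i}" for i in range(1000)],
    })
    lance.write_dataset(table, "corpus.lance")

    ds = lance.dataset("corpus.lance")
    # Random access by row index: cheap here, expensive in plain Parquet.
    rows = ds.take([3, 141, 653])

    # Built-in versioning: each write produces a new dataset version,
    # so you can pin the exact snapshot a model was trained on.
    lance.write_dataset(table, "corpus.lance", mode="append")
    v1 = lance.dataset("corpus.lance", version=1)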

[1] https://github.com/lance-format/lance

[2] https://vortex.dev

[3] https://github.com/facebookincubator/nimble

by cpard

2/13/2026 at 10:58:13 PM

English version: https://github.com/datascale-ai/data_engineering_book/blob/m...

by joshuaissac

2/13/2026 at 11:13:23 PM

Oh thanks! I've switched the top URL to that now. Submitted URL was https://github.com/datascale-ai/data_engineering_book.

I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!

Edit: they did, and I've moved that post to the toptext.

by dang

2/14/2026 at 9:40:00 AM

Huge thanks, dang! I really appreciate you rescuing the post from the filter and switching the URL to the English version. And thanks for pinning the context comment; it helps a lot since the project is quite extensive. We're thrilled it struck a chord.

by xx123122

2/14/2026 at 9:41:23 AM

Thanks for sharing the direct link! Much appreciated.

by xx123122

2/14/2026 at 5:15:45 AM

This is great and I bookmarked it so I can read it later. I'm just curious though: was the README written by ChatGPT? I can't tell if I'm paranoid, thinking everything is written by ChatGPT.

by osamabinladen

2/14/2026 at 9:26:46 AM

Yes, you are right. We are a team from China and used GPT to help with the English translation. We didn't realize it came across as 'fake warmth.' We appreciate the feedback and will work on making the tone more neutral and concise.

by xx123122

2/14/2026 at 8:46:52 AM

I think it was. It's a wall of information with lots of summary tables and fake warmth, and it has that LLM smell to it. I'd be very surprised if this wasn't generated text.

Whether it's GPT or not, it needs rewriting.

by nimonian

2/14/2026 at 11:30:14 AM

> "Data is the new oil, but only if you know how to refine it."

Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil; you need to refine it"?

[0]: https://en.wikipedia.org/wiki/Petroleum

by baalimago

2/14/2026 at 3:09:50 PM

The 'Vector DB vs Keyword Search' section caught my eye. In your testing for RAG pipelines, where do you draw the line?

We've found keyword search (BM25) often beats semantic search for specific entity names/IDs, while vectors win on concepts. Do you cover hybrid search patterns/re-ranking in the book? That seems to be where most production systems end up.

by 13pixels

2/14/2026 at 7:03:33 PM

Great question. In our production experience, the hybrid approach (BM25 + vector) typically wins for most use cases, with around a 70/30 split favoring keyword for exact matches. The key insight is that reranking becomes critical; without it, you're just concatenating results and hoping.

We typically use cross-encoder rerankers (like Cohere or custom fine-tuned models) to score the combined results. The break-even point for pure semantic search is usually when queries are abstract and concept-heavy, not entity-specific.
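
For a feel of the shape, an untested sketch with rank_bm25 and sentence-transformers; the model names are arbitrary examples, and the score fusion here is a toy (production systems usually use reciprocal rank fusion):

    # Untested sketch of hybrid retrieval + cross-encoder reranking.
    # Model names are arbitrary examples; the max() fusion is a toy,
    # real systems usually use reciprocal rank fusion (RRF).
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    docs = [
        "Invoice INV-20931 was voided by finance.",
        "Our refund policy covers annual subscriptions.",
        "Vector indexes trade recall for query latency.",
    ]
    query = "what happened to INV-20931"

    # Keyword side (BM25): strong on exact entity names and IDs.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    kw_scores = bm25.get_scores(query.lower().split())

    # Semantic side: strong on abstract, concept-heavy queries.
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_emb = embedder.encode(docs, convert_to_tensor=True)
    q_emb = embedder.encode(query, convert_to_tensor=True)
    sem_scores = util.cos_sim(q_emb, doc_emb)[0]

    # Merge candidates from both sides, then let a cross-encoder
    # score them, instead of concatenating results and hoping.
    candidates = sorted(
        range(len(docs)),
        key=lambda i: max(kw_scores[i], float(sem_scores[i])),
        reverse=True,
    )[:2]
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, docs[i]) for i in candidates])
    best = docs[candidates[scores.argmax()]]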

by eshaham78

2/14/2026 at 3:34:37 PM

Thanks for the insight. We definitely plan to cover these patterns in future updates. Please excuse a slight delay, as our team is currently celebrating the Chinese New Year. We'll be back to shipping code right after the holidays. OWO

by xx123122

2/14/2026 at 12:54:16 AM

The figures in the different chapters are in English (that's not the case for the image in README_en.md).

by guillem_lefait

2/14/2026 at 4:47:42 AM

Thanks for the heads-up! We noticed that discrepancy as well and have just updated the README_en.md with the correct English diagram. It should be displaying correctly now.

by xx123122

2/14/2026 at 8:08:30 AM

Parquet alone is not enough for modern data engineering. Delta and Iceberg should be on the list.
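
For example, the transaction log and time travel Delta gives you over plain Parquet, sketched from memory with the deltalake (delta-rs) Python bindings:

    # Sketch of what a table format adds over bare Parquet files,
    # using the deltalake (delta-rs) bindings; written from memory.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    df = pd.DataFrame({"id": [1, 2], "split": ["train", "eval"]})
    write_deltalake("my_table", df)                 # creates version 0
    write_deltalake("my_table", df, mode="append")  # version 1

    # ACID log plus time travel, which raw Parquet cannot give you.
    current = DeltaTable("my_table").to_pandas()
    as_of_v0 = DeltaTable("my_table", version=0).to_pandas()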

by alexott

2/14/2026 at 9:37:52 AM

Thanks for the feedback! I've flagged this for the team member working on that section. We are taking a short break for the Chinese New Year, so updates might be a bit slower than usual. QAQ

Thanks for understanding, and Happy New Year!

by xx123122

2/14/2026 at 12:32:28 AM

If you are interested in (2026-era) internet-scale data engineering challenges (e.g. processing 10-100s of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at d+data@krea.ai!

by dvrp

2/13/2026 at 11:16:56 PM

Thank you.

How is it possible that a Chinese publication gets to the top of HN?

by rafavargascom

2/14/2026 at 4:49:57 AM

Thanks for the support! We believe that code and engineering challenges are universal languages.

We are pleasantly surprised by the warm reception. We know the project (and our English localization) is still a work in progress, but we are committed to improving it to meet the high standards of the HN community. We'll keep shipping updates!

by xx123122

2/14/2026 at 12:38:02 PM

Just sprinkle a little LLM on top and it gets there in no time.

by heliumtera

2/13/2026 at 11:24:10 PM

Nevermind.

by rafavargascom
