5/22/2025 at 10:40:32 PM
Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project.
In the end I found that the Python trifatura library extracted the best-quality content, with accurate metadata.
You might want to compare your implementation to trifatura to see if there is room for improvement.
by tmpfs
5/23/2025 at 3:41:16 AM
> ...it being Javascript didn't suit my project.
If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.
by acrophobic
5/23/2025 at 5:47:21 PM
this is what i came here to see, thanks!
by breadchris
5/23/2025 at 1:16:06 AM
reference to the library: https://trafilatura.readthedocs.io/en/latest/
for the curious: Trafilatura means "extrusion" in Italian.
| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce.
search maccheroni trafilati vs maccheroni lisci :)
(btw I think you meant trafilatura not trifatura)
by fabmilo
5/23/2025 at 3:47:24 AM
Been using it since day one but development has stalled quite a bit since 2.0.0.
by thm
5/23/2025 at 3:07:46 PM
It's a bit old, but I benchmarked a number of the web extraction tools years ago (https://github.com/Nootka-io/wee-benchmarking-tool); resiliparse-plain was my clear winner at the time.
by winddude