Show HN: Context-aware Japanese furigana using Sudachi and ModernBERT

5/29/2026 at 5:34:55 PM

Got an incorrect result on my first try. Input was 振り仮名変換器の性能が如何程か試してみよう. It returned 如何(どう)程(ほど) instead of 如何(いか)程(ほど).

Regardless, I'm impressed with the tool!

by uasi

5/29/2026 at 5:44:11 PM

Thanks, this kind of report is very useful.

如何 is context-dependent, and I hadn’t come across this case yet. I’ll add it to the model soon. Really appreciate the report and the kind words.

by epitrochoid413

5/31/2026 at 7:37:26 AM

I don't want to discourage you, but is this mostly pattern matching? Most of the comments here seem to be about corner cases being "added to the model", which doesn't feel that novel. Naturally, context-awareness is all about the corner cases.

今日 as こんにちは being denylisted was the biggest point of concern here.

by pjjpo

5/29/2026 at 3:16:52 PM

Fantastic tool and love the delivery; no sign up required. Interested to hear how you pulled that off.

Also interested to hear if you plan to eventually support an option to add pitch accent; I've never seen what training material exists for that or how that is supported in Unicode.

by bluechair

5/29/2026 at 4:00:10 PM

Thanks, I really appreciate that.

No signup was a deliberate choice, to keep the barrier to trying it as low as possible.

I’ve thought about pitch accent, but it feels like a whole separate beast. The datasets are less comprehensive, and pitch can be even more context-dependent. I’d like to look into it eventually though.

by epitrochoid413

5/29/2026 at 2:35:47 PM

It really works. Very cool. I’ve been looking for this kind of service for a long time since I started learning Japanese, and I’ve rarely been satisfied with the available services.

by altilunium

5/31/2026 at 7:44:13 AM

It has been 20 years now, but rikaikun / rikaichan were great at the time, and I suspect they still are for learning. You hover over the words you can't read, and there are no real corner cases since it shows all the possible readings and meanings, not just one. I would say that extra context is useful for learning the word completely, not just for being able to read some content (you can just fully translate an article for that).

The best part is the feeling of hovering less over time until finally removing the extension.

by pjjpo

5/29/2026 at 3:01:10 PM

Thank you, that means a lot. I’ve been working on it for about a year now, so it’s really encouraging to hear it’s useful. I’m hoping to keep pushing the accuracy further, especially on the remaining hard cases like rendaku and person/place names.

by epitrochoid413

5/29/2026 at 4:07:09 PM

I’m Japanese. I was surprised that it was able to answer correctly even when I entered commonly seen difficult-to-read place names. However, there seem to be cases where it may incorrectly read “今日” when it should be read as “こんにち.” Example: 今日の日本社会では、少子高齢化が大きな課題となっている。

Also, it’s disappointing that Japanese does not appear even when I select it.

Please let me know if there’s anything I can do to help.

by k-taro56

5/29/2026 at 4:20:31 PM

Thank you, this is very helpful, especially from a native speaker.

今日 is a tradeoff I made intentionally: I disabled the fallback model for it because most cases are きょう, while こんにち is much rarer. But yes, this is one of the cases that gets lost with that choice.
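
One way to reason about this kind of tradeoff, as a rough sketch: always emitting the majority reading is correct as often as that reading occurs, so routing a word to a model only pays off if the model's accuracy on that word is higher than the majority share. The function name and the numbers below are hypothetical and are not measurements from this project.

```python
def model_beats_default(majority_share: float, model_accuracy: float) -> bool:
    """Return True if a model is expected to make fewer errors on a word
    than always emitting its most common reading.

    majority_share: fraction of occurrences that take the most common reading.
    model_accuracy: fraction of occurrences the model reads correctly.
    Both are assumed to be measured on the same sample of sentences.
    """
    for value in (majority_share, model_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("shares and accuracies must be between 0 and 1")
    return model_accuracy > majority_share


# Made-up numbers for illustration: if one reading covers 97% of occurrences
# and the model is right 93% of the time, the fixed default wins.
print(model_beats_default(majority_share=0.97, model_accuracy=0.93))  # False
```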

And agreed on Japanese dictionary support. I plan to add Japanese soon. Thanks again.

by epitrochoid413

5/29/2026 at 4:35:15 PM

Certainly, you’re right that reading it as “konnichi” is very rare. It’s rare enough that I only sometimes misread it that way when it appears in the news or similar contexts. So I think that was a good decision.

By the way, I tried testing it further while thinking back to the kinds of tests I had when I was in school. The accuracy is still excellent. My guess is that “一日” and “分別” are being handled in a similar way to “今日.” “分別” is very rare, but I don’t think “一日” is all that uncommon.

by k-taro56

5/30/2026 at 2:31:36 AM

[dead]

by epitrochoid413

5/29/2026 at 4:11:38 PM

Uh, in 田中さんは今何をしている, 今何 comes out as こんなに.

by sollniss

5/29/2026 at 4:15:37 PM

Thanks, good catch. This is a known class of edge case I am trying to improve: adjacent tokens sometimes get over-merged into a phrase reading when they should stay separate. 今 and 何 should be handled separately here, not as こんなに. I appreciate this concrete example. That helps a lot.
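
A minimal sketch of one possible guard against over-merging, assuming tokens arrive as (surface, reading) pairs: adjacent tokens get a single phrase reading only when the pair is on an explicit allow-list, and otherwise keep their own readings. The allow-list entry and the function name are hypothetical, and this is not necessarily how the tool handles it.

```python
# Hypothetical allow-list: adjacent surfaces that should share one reading.
PHRASE_READINGS = {("今", "日"): "きょう"}


def assign_readings(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge two adjacent tokens only if the pair is explicitly allow-listed.

    tokens: (surface, reading) pairs, e.g. as produced by a tokenizer.
    """
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(surface for surface, _ in tokens[i:i + 2])
        if len(pair) == 2 and pair in PHRASE_READINGS:
            # Allow-listed phrase: emit one merged token with the phrase reading.
            result.append(("".join(pair), PHRASE_READINGS[pair]))
            i += 2
        else:
            # Anything else keeps its per-token reading.
            result.append(tokens[i])
            i += 1
    return result


# 今 and 何 are not on the allow-list, so they stay separate.
print(assign_readings([("今", "いま"), ("何", "なに"), ("を", "を")]))
```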

by epitrochoid413

5/29/2026 at 12:25:22 PM

I built a context-aware furigana converter for Japanese text, files, and web pages.

The main problem I wanted to solve was that simple dictionary-based furigana works well for common cases, but breaks on words where the reading depends on context:

* 市場: いちば or しじょう

* 大分: おおいた or だいぶ

* 人気: にんき or ひとけ

* 最中: さいちゅう or さなか or もなか

* 方: かた or ほう

The engine is a hybrid system:

* Sudachi for tokenization, base forms, POS, and candidate readings

* Expanded dictionary coverage for compounds and fixed expressions

* Custom rules for counters, suffixes, rendaku patterns, and phrase overrides

* ModernBERT fallback for 144 especially context-dependent target words
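
The layering above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Token` type, the override table, the target-word set, and `toy_classifier` are all hypothetical stand-ins, and no Sudachi or ModernBERT call is made here. Candidate readings are passed in directly rather than coming from a tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    surface: str
    candidates: list[str]  # candidate readings, default (most common) first


# Hypothetical data, for illustration only.
PHRASE_OVERRIDES = {"一日中": "いちにちじゅう"}  # fixed expressions
CONTEXT_TARGETS = {"市場", "人気", "最中"}  # words routed to the fallback model

Classifier = Callable[[Token, str], str]


def choose_reading(token: Token, sentence: str,
                   classifier: Optional[Classifier] = None) -> str:
    # 1. Rule layer: fixed expressions win outright.
    if token.surface in PHRASE_OVERRIDES:
        return PHRASE_OVERRIDES[token.surface]
    # 2. Fallback model: only for listed, genuinely ambiguous words.
    ambiguous = len(token.candidates) > 1
    if classifier is not None and ambiguous and token.surface in CONTEXT_TARGETS:
        picked = classifier(token, sentence)
        if picked in token.candidates:  # ignore readings the dictionary lacks
            return picked
    # 3. Dictionary default: the first candidate reading.
    return token.candidates[0]


def toy_classifier(token: Token, sentence: str) -> str:
    # Stand-in for a neural model: a single keyword heuristic.
    if token.surface == "市場" and "魚" in sentence:
        return "いちば"
    return token.candidates[0]


market = Token("市場", ["しじょう", "いちば"])
print(choose_reading(market, "魚を市場で買った", toy_classifier))
print(choose_reading(market, "株式市場が下落した", toy_classifier))
```

Restricting the classifier to a fixed set of target words keeps the dictionary default in charge everywhere else, so a model mistake cannot affect words that were never ambiguous.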

I have been testing it against an LLM-assisted benchmark of 7,500 Japanese lines. On the current benchmark, it gets about 12 wrong readings per 1,000 tokens. I treat that as a practical regression benchmark rather than a formal academic evaluation, but it has been useful for comparing versions and catching regressions.
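
For reference, 12 wrong readings per 1,000 tokens corresponds to a token-level accuracy of 1 - 12/1000 = 98.8%. A minimal sketch of how such a metric might be computed, assuming predicted and reference readings are already aligned token by token (the function name and the toy data are hypothetical):

```python
def errors_per_thousand(predicted: list[str], gold: list[str]) -> float:
    """Wrong readings per 1,000 tokens, given aligned reading lists."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    wrong = sum(p != g for p, g in zip(predicted, gold))
    return 1000 * wrong / len(gold)


# Toy data: 1,000 tokens, 12 of them read incorrectly.
gold = ["きょう"] * 1000
predicted = ["こんにち"] * 12 + ["きょう"] * 988
rate = errors_per_thousand(predicted, gold)
print(rate, 1 - rate / 1000)
```

Comparing this number between two versions on the same benchmark lines is what makes it usable as a regression check, even if the absolute value depends on how the benchmark was built.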

The hardest remaining cases are personal names, place names, rendaku, rare vocabulary, and domain-specific terms.

I would especially appreciate examples where it gets the reading wrong, since those are the most useful for improving the system.

by epitrochoid413

5/29/2026 at 3:29:53 PM

Very, very nice.

Since January, I’ve been having Claude build a static Japanese-English dictionary in which all of the kanji and jukugo can be displayed either with or without furigana:

https://www.tkgje.jp/index.html

I haven’t spotted any mistakes in the furigana myself, though there must be some. I have a scheduled routine running multiple times a day to have Claude check and polish existing entries; it should be correcting most of whatever furigana mistakes might be in the data. At some point, I will set up an agent to use a different LLM to run a similar set of checks to try to reduce the error rate even more.

As you note, the readings of Japanese words depend on the context, so producing accurate furigana cannot be done entirely programmatically. Sentences must be interpreted semantically.

I am releasing all of the dictionary data into the public domain, and anyone is free to fork it or adapt it however they like:

https://github.com/tkgally/je-dict-1

by tkgally

5/29/2026 at 3:55:28 PM

Thanks for sharing this. It looks like a really cool project, and making the data public domain is especially generous.

I especially like the dictionary + example sentence format. I haven’t found a really good Japanese-English dictionary for learners, and yours looks promising.

I’m curious how token-intensive the repeated Claude polishing runs are.

by epitrochoid413

5/29/2026 at 4:03:06 PM

Quite token-intensive. I pay for the Max plan, and the regular dictionary runs consume maybe 20 percent of my weekly quota.

by tkgally

5/29/2026 at 2:55:44 PM

Nice work. I just gave it a quick pass, but it seems to work well!

(Also: vouched, your comment was dead FYI)

by fenomas

5/29/2026 at 3:02:56 PM

Thanks, that’s great to hear. Thanks for the vouch too, I didn’t realize the comment was dead.

by epitrochoid413