Building a Korean ambiguity solver fast enough to skip the GPU: 7,300 words/sec

The problem

For people who don’t know what Kimchi Reader is: it’s an immersion tool for learning Korean, and the core problem I’m solving is lemmatization, finding the dictionary base form of a word as it appears in real text.

Korean makes this genuinely hard. It’s agglutinative and heavily conjugated, so a single surface form can be the result of many different stems plus stacked particles and endings. My core lemmatizer is written in Rust with a rule-based strategy: it explores every valid way to decompose a word and reports back every possible lemma. It’s reliable, and it’s fast. Easily north of 100k words/s multithreaded.

The one downside: sometimes more than one decomposition is valid, and I end up with ambiguity.

In 음악을 들어요, the 들어요 could reduce to 듣다 (to listen) or 들다 (to hold). — In 음악을 들어요, the 들어요 could reduce to 듣다 (to listen) or 들다 (to hold), two candidate lemmas for one surface form. A human picks the right one instantly from context; a rule engine cannot.

From the user’s perspective this is fine. My users are smart enough to pick the right lemma from context, and I’ve already saved them the work by narrowing it down to a couple of options. The popup shows both; they move on.

The problem is the stats. They suffer from ambiguity at every level, and I have to come up with some different options on how to deal with ambiguity when it happens. This is the main potential gain this model would get us, and that would be crazy good.

Two constraints that shaped everything

Before any modeling, two decisions framed the whole search space:

It has to be absurdly fast. We can’t just solve it on-the-fly. Comprehension stats mean resolving every word in an entire book/movie/novel, ahead of time. A model that’s accurate but slow is useless to me here.
The model only ever suggests, on top of the deterministic rule engine. I never wanted to replace the rule-based lemmatizer, only to add a layer that picks among the candidates it already produced. This is the design decision that pays off later: the model is handed a closed set of real candidates and forced to choose one. It can’t hallucinate a lemma that doesn’t exist, and since it’s only a suggestion, sub-100% accuracy is fine by design.

The full adventure

2023 to 2024: dreaming about it

In 2023 a friend pointed me at course.fast.ai, Jeremy Howard’s course. At the time I just wanted to learn and have fun. That’s where I got into ML at all, and ran my first finetune. Excellent course, by the way. I’d still recommend it to anyone in 2026.

2023, me taking the fastai course and classifying gigachads. 41.7% confident, apparently.

By 2024 I’d started dreaming, on and off, about a model that could solve ambiguity for my lemmatizer. This is the earliest screenshot I can find of the idea.

Me, 2024, long before any of this worked.

Attempt #1 (2025): finetune Gemma 3 1B as a seq2seq task

In 2025 I finally started seriously messing around to see what was possible. Real open models had landed; Gemma 3 was out.

I don’t fully recall how I got there (probably brainstorming with a chatbot), but I started framing disambiguation as a seq2seq task. It felt elegant: seq2seq is just translation. One sentence in language A, the same sentence out in language B. Except here the “languages” were Korean → lemmatized Korean. Then I’d reconcile the output against my lemmatizer wherever they matched.

Same shape as translation: Korean in, lemmatized Korean out.

I knew an LLM would be slow for this, I wasn’t deluding myself. But I just wanted a starting point, LLMs were the easiest one to reach for, and I wanted to see how far I could push it.

The plan was a classic distillation pipeline:

Hand-build a small golden dataset.
Finetune a big teacher (Gemma 3 27B) on it.
Use that teacher to generate ~8M synthetic seq2seq sentences.
Finetune the small Gemma 3 1B on the synthetic data.

I rented GPUs on vast.ai (awesome service for quick experiments, I love it), built the pipeline, and tried a few model families, Qwen too. Gemma 3 gave me the least headache and the best results at the time.

I don’t have the exact numbers saved, but I was missing roughly one to two orders of magnitude on both accuracy and speed versus what I needed. Even with batched offline inference on vLLM on a 4090, it solved about 1,500 sentences/s.

RTX 4090, Gemma 3 1B, batch-512 offline inference.

Honestly? Not bad at all. But this is a small feature of my product, and it implied a ~$500/month GPU server, for which a real production workload would never hit the ideal batched-offline throughput anyway. And the accuracy wasn’t there yet either. So I decided to conclude that experiment there.

I did ship something out of it, though: a quantized ONNX build of the Gemma 1B running on CPU, behind an experimental flag, at a sluggish ~1 to 2 seconds per sentence. That was the first version my users ever touched.

Attempt #2 (2025): sentence similarity and embeddings

Around then I started learning about embeddings, and how you can find similar things via context (i.e. attention).

I thought: why not reframe it entirely? Precompute a vector for every definition in the dictionary, then at runtime parse a sentence once, embed it, and pick the candidate whose definition vector is closest.

The rule lemmatizer narrows the whole dictionary to 2 candidates. Encode the sentence once, then cosine-match the target's vector against only those 2 precomputed definition vectors. A closed set, so there's always exactly one winner.

I love this approach because I only ever compared against the two candidates my rule lemmatizer had already given me. A closed set, always one winner, no way to invent a lemma, unlike attempt #1.

And so I built it but it got worse accuracy and worse speed than attempt #1, so I didn’t push it far. I gotta admit this one I only tested quickly and got immediately discouraged by the inference speed and the difficulty to finetune those model (skill issue mostly tbh).

Trying random off-the-shelf embedding models for a baseline. About 75% of ambiguities solved.

It wasn’t a total loss, though. What I learned there is exactly what later shipped (2026) as semantic search for the dictionary, hanja and grammar pages. No reranking, just raw cosine distance, lol.

Attempt #3 (2025): train a nano Gemma 3 from scratch, char-level tokenizer

Note: at the time I had this idea, Gemma 3 270M wasn’t out yet (it dropped shortly after, while I was mid-build).

After the first two attempts I still liked #1 the most. I just wanted it faster and runnable on cheaper GPUs. So, clueless me: why can’t I just train my own model from scratch? Let’s try and see. I skipped pretraining entirely and threw the 8M synthetic seq2seq samples from attempt #1 straight at it. And while I was at it, a character-level tokenizer felt more intuitive than BPE for this.

I had a lot of fun here because I decided to be a serious boy and do both training and inference in Rust, with burn-rs, which looked very cool. I reimplemented the Gemma 3 architecture in burn-rs (gist) with help from some LLMs (reminder: no agentic coding existed yet, only web chat UIs, and that was pain).

The training dashboard. Loss dropping over a run.

And it actually worked quite well! Against my ~200-sample handwritten test set, after many attempts, I got a 15M-param model with the Gemma 3 architecture to beat Gemma 3 1B on this task.

I also realized I could do speculative decoding. It was now trivial to make a second, smaller draft model. Lots of ideas. But I was tired, I needed a break, and I needed to ship actual product features instead of toying around forever. So I shipped the 15M model for that tiny feature, finally retiring the sluggish Gemma 1B ONNX build I’d been running on CPU since attempt #1. It took ~100ms to parse one or two sentences. Fine for a flag, nowhere near the “whole book” bar.

Attempt #4 (2026): Claude Code happened, so I tried again

Early 2026 I started using Claude, my first AI subscription after the GitHub Copilot autocomplete era, slightly before Opus 4.6 shipped. Long story short: fully converted. I started vibe-coding more and more. The economics shifted; code got cheaper, so you can experiment a lot, much faster.

I was still busy building features, but around early April I got distracted. Instead of doing what I should have been doing (marketing), I told myself I’d give the ambiguity problem one more try, 1 to 2 weeks max (spoiler: it was two months), just to test the water. Now that I had Claude, maybe things would go differently?

And yeah, damn, it went differently.

”Jarvis, organize all the documents and write a proper problem-statement .md with every edge case in full”

Actually it was our boy Claude.

That was step one. Having taken a few runs at this problem already, I knew what I wanted more precisely. So I dumped everything and had Claude write a complete problem statement: every sub-problem, like detecting names in text (another hard one), and handling subwords like 금-도끼 (“golden axe”), which my lemmatizer splits into two separate vocabulary words.

Then I asked what approaches it would consider. At some point it suggested this, for any BERT-like model:

My first reaction was huh, interesting, but aren’t BERT models ancient? But testing was cheap, so I had Claude build a minimal version to check whether it was viable on speed and accuracy.

And it was the best of every world I’d tried. A single forward pass, scoring one target word at a time instead of the whole sentence, and constrained to the same closed candidate list as attempt #2, so it structurally cannot drift or emit a lemma that isn’t real.

After a lot of toying, progressively adding data, building pipelines, spending about 50$ in DeepSeek V4 Flash for synthetic dataset (awesome model by the way), I got more and more confident in the approach.

For the backbone I kept swapping models as I went, and the clear winner was monologg/koelectra-small-v3-discriminator. I genuinely don’t know why, but this 5-year-old model is just so good on this task. Thank you, monologg. It worked well enough that I never even felt the need to train one from scratch this time.

Progress, and realizing small workloads run fine on CPU

As things came together, I started thinking about deployment. For the past year I’d been checking Hetzner’s GPU servers every few weeks. I already run my CPU server at Hetzner, so a GPU box on the same network would be convenient. I was eyeing an NVIDIA RTX 4000 SFF Ada, but it wasn’t exactly exciting.

I also looked at RunPod serverless and how their cold starts work, and thought about streaming payloads over the network so the GPU stays busy (my lemmatizer is much faster than the model, so I’d buffer to keep it fed).

No matter how I turned it over, I was unhappy. I could not find a good solution. So I decided to stop caring, stay CPU-only, cap how large a parse could be, and deal with GPU hardware problem later.

Then, a few days later, I noticed the model didn’t actually run that badly on CPU.

So I immediately had Claude build a playground and start squeezing CPU inference as hard as it could. I don’t recall the exact number, but early toying already hit something like 800 ambiguities/s. Pleasantly surprising. My plan shifted to a hybrid: short content on CPU, large documents and nightly jobs on a rented GPU.

That plan lasted right up until I tried to make CPU go faster. I started quantizing to int8, and it climbed to roughly 4,000 ambiguities/s (multithreaded, 16 cores). That was the moment I started seriously believing I might not need a GPU at all. No async. No extra hardware. At that point I was running the finetuned, int8-quantized KoELECTRA via ort (the Rust ONNX Runtime crate), and had Claude optimize the ONNX graph on top.

Road to CPU-only, making it go brrr

Then I got a bold idea.

My lemmatizer and the model run on a single box: Hetzner AX102, a 7950X3D. Which happens to be the exact same CPU I have in my desktop, so I can profile and optimize locally and get the same numbers in production. This is a very hot path and matters a lot, so any optimization is worth it. And since it only ever runs on one CPU architecture, why even support more than one? Just target Zen 4: -C target-cpu=znver4

“Claude, write me an inference crate for my model. Zero new deps, pure Rust, SIMD, unsafe is OK.”

I started from a naive baseline with one target: beat ort (the Rust bindings to ONNX Runtime, whose C++ SIMD kernels were already pretty damn fast). Then I put Claude in a loop for ~30 rounds of optimization while giving some hint like fitting in the L3 cache.

And it worked. Here is the nightly job that rebuilds the whole recommendation system (about 295M words at the time of writing), straight off the production CPU graph.

Grafana CPU-usage view of two nightly runs of the recs-rebuild job. — Grafana CPU-usage view, two nightly job, refreshing the entire recommendation system (295M words). The tall spikes (1, 3) are the rule-based lemmatizer alone. The long blocks are with the model inference. 2 is the old ort engine, 4 is the new pure-Rust crate.

The results

The model

The shipped model is a finetuned 14M params int8 quantized and it run 100% on cpu now. I never bought any GPU hardware.

Throughput

I was adding something on top of my speedy rule based lemmatizer, so ideally it needed to be as fast as possible, enough to make the nightly recommendation system job possible (see brrr graph).

Measured on my PC with the 7950X3D over a 362k-word text:

Pipeline	1 core	16 cores (production)	32 threads (doesn’t scale well)
Rule lemmatizer only	~7,700 w/s	~108,000 w/s	~130,000 w/s
Rule + Ambiguity Solver	~1,400 w/s	~18,500 w/s	~18,700 w/s

Production runs at 16 cores. The model never touches every word anyway: only ~40% are ambiguous (more than one candidate lemma). It solves ~7,300 ambiguities per seconds (~1.47 ms per call single-threaded, the rest is the 16 cores).

Better stats everywhere

The throughput is what made it runnable, but the point of building this model was to make stats better. Are they now better? Hell yeah. For example, the word frequency list got some fixes:

Word	Old rank	New rank
하다 (to do)	8	1
이거 (this)	50,351	12
주다 (to give)	61,314	30
여기 (here)	57,601	41
이제 (now)	78,431	44
왜 (why)	21,827	49
갑자기 (suddenly)	95,504	292

The old list was built by excluding anything ambiguous, so the most common words, which are also the most ambiguous, were exactly the ones it got most wrong. The grammar and idiom lists are also affected positively.

The metric that actually matters

Two questions, really: did users enjoy it, and could I actually run it on production? Yeah and yeah. The ones who tried the experimental flag loved it, and it fits on the CPU I already pay for, nightly job done in a night. No GPU bought, no new box. Me happy.