Building a Korean ambiguity solver fast enough to skip the GPU: 7,300 words/sec
How Kimchi Reader's Korean ambiguity solver, a 14M-parameter KoELECTRA quantized to int8, ended up running server-side on a plain CPU with no GPU, resolving thousands of word-sense ambiguities per second. The four attempts it took to get there.
The problem
For people who don’t know what Kimchi Reader is: it’s an immersion tool for learning Korean, and the core problem I’m solving is lemmatization, finding the dictionary base form of a word as it appears in real text.
Korean makes this genuinely hard. It’s agglutinative and heavily conjugated, so a single surface form can be the result of many different stems plus stacked particles and endings. My core lemmatizer is written in Rust with a rule-based strategy: it explores every valid way to decompose a word and reports back every possible lemma. It’s reliable, and it’s fast. Easily north of 100k words/s multithreaded.
The one downside: sometimes more than one decomposition is valid, and I end up with ambiguity.
From the user’s perspective this is fine. My users are smart enough to pick the right lemma from context, and I’ve already saved them the work by narrowing it down to a couple of options. The popup shows both; they move on.
But everything aggregated suffers from it. Telling a user they know “x% of this book”, building the recommendation engine, generating frequency lists: all of these have to commit to one lemma per word, and the ambiguity gets in the way every single time.
Two constraints that shaped everything
Before any modeling, two decisions framed the whole search space:
- It has to be absurdly fast. On-the-fly disambiguation isn’t enough. Comprehension stats mean resolving every word in an entire book/movie/novel, ahead of time. A model that’s accurate but slow is useless to me here.
- The model only ever suggests, on top of the deterministic rule engine. I never wanted to replace the rule-based lemmatizer, only to add a layer that picks among the candidates it already produced. This is the design decision that pays off later: the model is handed a closed set of real candidates and forced to choose one. It can’t hallucinate a lemma that doesn’t exist, and since it’s only a suggestion, sub-100% accuracy is fine by design.
The full adventure
2023 to 2024: dreaming about it
In 2023 a friend pointed me at course.fast.ai, Jeremy Howard’s course. At the time I just wanted to learn and have fun. That’s where I got into ML at all, and ran my first finetune. Excellent course, by the way. I’d still recommend it to anyone in 2026.
By 2024 I’d started dreaming, on and off, about a model that could solve ambiguity for my lemmatizer. This is the earliest screenshot I can find of the idea.
Attempt #1 (2025): finetune Gemma 3 1B as a seq2seq task
In 2025 I finally started seriously messing around to see what was possible. Real open models had landed; Gemma 3 was out.
I don’t fully recall how I got there (probably brainstorming with a chatbot), but I started framing disambiguation as a seq2seq task. It felt elegant: seq2seq is just translation. One sentence in language A, the same sentence out in language B. Except here the “languages” were Korean → lemmatized Korean. Then I’d reconcile the output against my lemmatizer wherever they matched.
I knew an LLM would be slow for this, I wasn’t deluding myself. But I just wanted a starting point, LLMs were the easiest one to reach for, and I wanted to see how far I could push it.
The plan was a classic distillation pipeline:
- Hand-build a small golden dataset.
- Finetune a big teacher (Gemma 3 27B) on it.
- Use that teacher to generate ~8M synthetic seq2seq sentences.
- Finetune the small Gemma 3 1B on the synthetic data.
I rented GPUs on vast.ai (awesome service for quick experiments, I love it), built the pipeline, and tried a few model families, Qwen too. Gemma 3 gave me the least headache and the best results at the time.
I don’t have the exact numbers saved, but I was missing roughly one to two orders of magnitude on both accuracy and speed versus what I needed. Even with batched offline inference on vLLM on a 4090, it solved about 1,500 sentences/s.
Honestly? Not bad at all. But this is a small feature of my product, and it implied a ~$500/month GPU server, for which a real production workload would never hit the ideal batched-offline throughput anyway. And the accuracy wasn’t there yet either. So I decided to conclude that experiment there.
I did ship something out of it, though: a quantized ONNX build of the Gemma 1B running on CPU, behind an experimental flag, at a sluggish ~1 to 2 seconds per sentence. That was the first version my users ever touched.
Attempt #2 (2025): sentence similarity and embeddings
Around then I started learning about embeddings, and how you can find similar things via context (i.e. attention).
I thought: why not reframe it entirely? Precompute a vector for every definition in the dictionary, then at runtime parse a sentence once, embed it, and pick the candidate whose definition vector is closest.
One thing here was actually good, though, and it’s what the diagram above shows: I only ever compared against the two candidates my rule lemmatizer had already given me. A closed set, always one winner, no way to invent a lemma, unlike attempt #1.
I built it. It got worse accuracy and worse speed than attempt #1, so I didn’t push it far. The LLM approach also had a quiet advantage I didn’t appreciate until I lost it: you can read the output. Garbage is obvious, iteration is fast. With embeddings I was just staring at cosine distances. I also just didn’t know how to finetune a sentence-similarity model well, and building the dataset wasn’t intuitive. In hindsight, probably a skill issue on my end.
It wasn’t a total loss, though. What I learned there is exactly what later shipped (2026) as semantic search for the dictionary, hanja and grammar pages. No reranking, just raw cosine distance, lol.
Attempt #3 (2025): train a nano Gemma 3 from scratch, char-level tokenizer
Note: at the time I had this idea, Gemma 3 270M wasn’t out yet (it dropped shortly after, while I was mid-build).
After the first two attempts I still liked #1 the most. I just wanted it faster and runnable on cheaper GPUs. So, clueless me: why can’t I just train my own model from scratch? Let’s try and see. I skipped pretraining entirely and threw the 8M synthetic seq2seq samples from attempt #1 straight at it. And while I was at it, a character-level tokenizer felt more intuitive than BPE for this.
I had a lot of fun here because I decided to be a serious boy and do both training and inference in Rust, with burn-rs, which looked very cool. I reimplemented the Gemma 3 architecture in burn-rs (gist) with help from some LLMs (reminder: no agentic coding existed yet, only web chat UIs, and that was pain).
And it actually worked quite well! Against my ~200-sample handwritten test set, after many attempts, I got a 15M-param model with the Gemma 3 architecture to beat Gemma 3 1B on this task.
I also realized I could do speculative decoding. It was now trivial to make a second, smaller draft model. Lots of ideas. But I was tired, I needed a break, and I needed to ship actual product features instead of toying around forever. So I shipped the 15M model for that tiny feature, finally retiring the sluggish Gemma 1B ONNX build I’d been running on CPU since attempt #1. It took ~100ms to parse one or two sentences. Fine for a flag, nowhere near the “whole book” bar.
Attempt #4 (2026): Claude Code happened, so I tried again
Early 2026 I started using Claude, my first AI subscription after the GitHub Copilot autocomplete era, slightly before Opus 4.6 shipped. Long story short: fully converted. I started vibe-coding more and more. The economics shifted; code got cheaper, so you can experiment a lot, much faster.
I was still busy building features, but around early April I got distracted. Instead of doing what I should have been doing (marketing), I told myself I’d give the ambiguity problem one more try, 1 to 2 weeks max (spoiler: it was two months), just to test the water. Now that I had Claude, maybe things would go differently?
And yeah, damn, it went differently.
”Jarvis, organize all the documents and write a proper problem-statement .md with every edge case in full”
That was step one. Having taken a few runs at this problem already, I knew what I wanted more precisely. So I dumped everything and had Claude write a complete problem statement: every sub-problem, like detecting names in text (another hard one), and handling subwords like 금-도끼 (“golden axe”), which my lemmatizer splits into two separate vocabulary words.
Then I asked what approaches it would consider. At some point it suggested this, for any BERT-like model:
My first reaction was huh, interesting, but aren’t BERT models ancient? But testing was cheap, so I had Claude build a minimal version to check whether it was viable on speed and accuracy.
And it was the best of every world I’d tried. A single forward pass, scoring one target word at a time instead of the whole sentence, and constrained to the same closed candidate list as attempt #2, so it structurally cannot drift or emit a lemma that isn’t real.
After a lot of toying (progressively adding data, building pipelines to generate more of it with LLMs, hi DeepSeek V4 Flash, awesome model by the way), I got more and more confident in the approach.
For the backbone I kept swapping models as I went, and the clear winner was monologg/koelectra-small-v3-discriminator. I genuinely don’t know why, but this 5-year-old model is just so good on this task. Thank you, monologg. It worked well enough that I never even felt the need to train one from scratch this time.
Progress, and realizing small workloads run fine on CPU
As things came together, I started thinking about deployment. For the past year I’d been checking Hetzner’s GPU servers every few weeks. I already run my CPU server at Hetzner, so a GPU box on the same network would be convenient. I was eyeing an NVIDIA RTX 4000 SFF Ada, but it wasn’t exactly exciting.
I also looked at RunPod serverless and how their cold starts work, and thought about streaming payloads over the network so the GPU stays busy (my lemmatizer is much faster than the model, so I’d buffer to keep it fed).
But every one of these options has the same catch: any GPU that isn’t physically in this box lives across a network, so I’d have to buffer payloads to it and drag async into my otherwise-synchronous lemmatizer. The only way to dodge that would be a GPU sitting in the same machine.
No matter how I turned it over, I was unhappy. So I decided to punt: stay CPU-only, cap how large a parse could be, and deal with GPU hardware later.
Then, a few days later, I noticed the model didn’t actually run that badly on CPU.
So I immediately had Claude build a playground and start squeezing CPU inference as hard as it could. I don’t recall the exact number, but early toying already hit something like 800 ambiguities/s. Pleasantly surprising. My plan shifted to a hybrid: short content on CPU, large documents and nightly jobs on a rented GPU.
That plan lasted right up until I tried to make CPU go faster. I started quantizing to int8, and it climbed to roughly 4,000 ambiguities/s (multithreaded, 16 cores). That was the moment I started seriously believing I might not need a GPU at all. No async. No extra hardware. At that point I was running the finetuned, int8-quantized KoELECTRA via ort (the Rust ONNX Runtime crate), and had Claude optimize the ONNX graph on top.
Road to CPU-only, making it go brrr
Then I got a bold idea.
My lemmatizer and the model run on a single box: Hetzner AX102, a 7950X3D. Which happens to be the exact same CPU I have in my desktop, so I can profile and optimize locally and get the same numbers in production. This is a very hot path and matters a lot, so any optimization is worth it. And since it only ever runs on one CPU architecture, why even support more than one? Just target Zen 4: -C target-cpu=znver4, AVX-512 + VNNI, the whole thing.
I also gave Claude hardware hints to optimize against, mainly that the X3D’s huge 96 MB L3 cache swallows the whole ~14 MB int8 model with room to spare.
“Claude, write me an inference crate for my model. Zero new deps, pure Rust, SIMD, unsafe is OK.”
I started from a naive baseline with one target: beat ort (the Rust bindings to ONNX Runtime, whose C++ SIMD kernels were already pretty damn fast). Then I put Claude in a loop for ~30 rounds of optimization.
And it worked. Here is the nightly job that rebuilds the whole recommendation system, straight off the production CPU graph.
The results
The model
The shipped model is a finetuned 14M-parameter KoELECTRA-small-v3 discriminator with a small listwise scoring head, matmul weights packed to int8, running on the hand-rolled Rust AVX-512 + VNNI crate (target-cpu=znver4).
Throughput
The original goal was almost naive: keep the rule engine’s raw speed. Impossible, of course. I’m stacking a neural net on top of it. So the real target was getting close enough to stay viable for the nightly recommendation rebuild, the final boss of this whole thing.
Measured on my PC with the 7950X3D over a 362k-word text:
| Pipeline | 1 core | 16 cores (production) | 32 threads (doesn’t scale well) |
|---|---|---|---|
| Rule lemmatizer only | ~7,700 w/s | ~108,000 w/s | ~130,000 w/s |
| Rule + int8 disambiguator | ~1,400 w/s | ~18,500 w/s | ~18,700 w/s |
Production runs at 16 cores. The model never touches every word anyway: only ~40% are ambiguous (more than one candidate lemma), and that’s its real workload: ~7,300 ambiguities/s (~1.47 ms per call single-threaded, the rest is the 16 cores).
The payoff: the community insights dashboard counts about 295M words across the library, all re-lemmatized by the nightly job. At ~18,500 words/s that’s the whole corpus in about 4.5 hours on one CPU, no GPU, no second machine. It’s the long block 4 in the graph above.
Better stats everywhere
The throughput is what made it runnable, but the point of running it was to fix everything downstream, like the word frequency list. A few of the changes:
| Word | Old rank | New rank |
|---|---|---|
| 하다 (to do) | 8 | 1 |
| 이거 (this) | 50,351 | 12 |
| 주다 (to give) | 61,314 | 30 |
| 여기 (here) | 57,601 | 41 |
| 이제 (now) | 78,431 | 44 |
| 왜 (why) | 21,827 | 49 |
| 갑자기 (suddenly) | 95,504 | 292 |
The old list was built by excluding anything ambiguous, so the most common words, which are also the most ambiguous, were exactly the ones it got most wrong. The grammar and idiom lists are rebuilt by the same nightly job and shifted too, just far more gently.
The metric that actually matters
Two questions, really: did users enjoy it, and could I actually run it on production? Yeah and yeah. The ones who tried the experimental flag loved it, and it fits on the CPU I already pay for, nightly job done in a night. No GPU bought, no new box. Me happy.
The learnings
Two takeaways and a bet.
The framing was the hardest part, not the training or the SIMD. Finding the right angle (a listwise choice over my rule-based candidates) took four attempts and looks obvious only in hindsight.
Datasets are hard, and they matter most. Every real jump came from better data, not a fancier model. Validation is even harder: when your labels are partly synthetic, a high score mostly proves you match the labeler, not reality.
The bet: in a year the next Claude or Codex might just crack this differently, with ideas that don’t exist yet. So rather than grind out the last few percent of accuracy now, I’ll ship what I have and let the ecosystem hand me the next leap. Not laziness, just plenty else to build.