Bigram Crunch
Planted April 12, 2026
I have been working on a Monkeytype mode I call Bigram Crunch. The basic idea is simple: instead of only telling me that I made mistakes, the mode should notice which pairs of letters are weak and then give me more words containing those pairs.
This started as something I built for myself while switching keyboard layouts to Colemak. The first version lived on my own Monkeytype deployment at typing.martinnn.com. It was not serious at all. It mostly watched for bigrams I mistyped and then tried to throw more of those back at me. That was useful enough to prove the idea, but it was also close to a mistake tracker, and I did not want to push that version to a wider audience.
The part I wanted to revisit was the scoring algorithm. If the score is wrong, the whole feature feels wrong: the mode either keeps repeating words that are not useful, or misses the letter pairs that are actually slowing me down.
The small experiment repo for this work is here:
github.com/kopachef/bigram-crunch-lab
What I wanted Bigram Crunch to do
I wanted the mode to find two kinds of weak bigrams:
- bigrams I mistype
- bigrams I type correctly but slowly
The second point is what makes this different from a normal weak-spot mode. If I
type th correctly every time but always pause on it, that still feels like a
useful thing to practise. A mistake-only score would never notice it.
So the algorithm needs to store a small history for each bigram:
type Stats struct {
    Attempts  int     // accepted updates recorded for this bigram
    Misses    int     // how many of those attempts were mistyped
    AverageMS float64 // running average key-to-key timing, in ms
    Score     float64 // cached combined score
}
Every time a word is typed, each adjacent letter pair in that word updates its
own history. For thing, the bigrams are:
th
hi
in
ng
For now I only model adjacent ASCII a-z pairs. This means words like cafe
work normally, but scripts that do not use those alphabetic characters should be
treated as unsupported for this feature. I think that is a reasonable first
limitation. Adding language-specific scoring for every writing system would make
the feature much larger before the core idea has settled.
The score
The current score combines mistakes, timing, and confidence:
missRate = misses / attempts
timingPenalty = clamp(
(averageMS - timingBaselineMS) / timingBaselineMS,
0,
2
)
confidence = min(1, attempts / confidenceAttempts)
score = confidence * (
missRate * missRateWeight +
timingPenalty * timingWeight
)
The values I am currently using are:
missRateWeight = 0.65
timingWeight = 0.35
confidenceAttempts = 10
timingBaselineMS = 180
maxTimingMs = 900
explorationRate = 0.05
The confidence term is there because one early mistake should not dominate the ranking forever. If I have only seen a bigram twice, I do not really know much about it. Once I have seen it around 10 times, I am more willing to trust the score.
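Put together, the score is a few lines of Go. This is a sketch using the constants above; the helper names are mine:

```go
package main

import "fmt"

const (
	missRateWeight     = 0.65
	timingWeight       = 0.35
	confidenceAttempts = 10.0
	timingBaselineMS   = 180.0
)

func clamp(x, lo, hi float64) float64 {
	if x < lo {
		return lo
	}
	if x > hi {
		return hi
	}
	return x
}

// score combines miss rate, timing penalty, and confidence
// as described in the pseudocode above.
func score(attempts, misses int, averageMS float64) float64 {
	if attempts == 0 {
		return 0
	}
	missRate := float64(misses) / float64(attempts)
	timingPenalty := clamp((averageMS-timingBaselineMS)/timingBaselineMS, 0, 2)
	confidence := float64(attempts) / confidenceAttempts
	if confidence > 1 {
		confidence = 1
	}
	return confidence * (missRate*missRateWeight + timingPenalty*timingWeight)
}

func main() {
	fmt.Printf("%.4f\n", score(30, 5, 145)) // miss-heavy bigram: 0.1083
	fmt.Printf("%.4f\n", score(30, 0, 260)) // slow but correct: 0.1556
}
```

The first value prints as 0.1083 rather than the 0.1084 worked out later in the examples, only because the worked example rounds the miss rate to 0.1667 before multiplying.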
Large timing samples are filtered out. If I pause because I got distracted,
that should not turn a normal bigram into a permanent practice target. With the
current settings, timings above 900ms skip the whole bigram update: no
attempt, miss, timing average, or score is changed.
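The skip is easiest to see in the update path itself. A sketch, assuming the Stats struct above; the incremental running-average formula is my choice of averaging, not something the mode's code specifies:

```go
package main

import "fmt"

const maxTimingMS = 900.0

type Stats struct {
	Attempts  int
	Misses    int
	AverageMS float64
}

// update records one typed instance of a bigram. Samples above
// maxTimingMS are treated as pauses and skip the update entirely.
func (s *Stats) update(missed bool, timingMS float64) {
	if timingMS > maxTimingMS {
		return // outlier: no attempt, miss, or timing change
	}
	s.Attempts++
	if missed {
		s.Misses++
	}
	// incremental running average over accepted samples only
	s.AverageMS += (timingMS - s.AverageMS) / float64(s.Attempts)
}

func main() {
	var st Stats
	st.update(false, 140)
	st.update(false, 150)
	st.update(false, 1700) // skipped as an outlier
	fmt.Println(st.Attempts, st.AverageMS) // 2 145
}
```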
A few small examples
These are the kinds of cases I wanted the scoring function to handle.
Equal histories
If all bigrams have the same history, none of them should be promoted:
th: 20 attempts, 0 misses, 145ms average
he: 20 attempts, 0 misses, 145ms average
ng: 20 attempts, 0 misses, 145ms average
mb: 20 attempts, 0 misses, 145ms average
All of those should end up with the same score. If one rises to the top anyway, the scoring function is inventing weakness where there is none.
A miss-heavy bigram
Now suppose ng has mistakes:
ng: 30 attempts, 5 misses, 145ms average
The miss rate is:
5 / 30 = 0.1667
The timing penalty is zero because 145ms is below the 180ms baseline.
So the score is roughly:
1.0 * (0.1667 * 0.65 + 0 * 0.35) = 0.1084
That should move ng above clean bigrams.
A slow but correct bigram
Now consider a bigram with no mistakes, but a slower average:
mb: 30 attempts, 0 misses, 260ms average
The miss rate is zero. The timing penalty is:
(260 - 180) / 180 = 0.4444
So the score is:
1.0 * (0 * 0.65 + 0.4444 * 0.35) = 0.1556
This is intentional. If a pair is typed correctly but slowly, I still want it to be visible. That is the main reason timing is part of the algorithm.
A timing outlier
This is the case I wanted to avoid:
he: 30 attempts, 0 misses, 145ms average
he: one later sample at 1700ms
That 1700ms sample is more likely to be a pause than a real signal about the
he bigram. With maxTimingMs = 900, the whole update is skipped. The attempt
count, miss count, average, and score all stay where they were.
Word selection
Once the bigrams have scores, the mode still has to decide which words to show.
The indexed selector builds a map from bigrams to words:
th -> thing, think, other, three
ng -> thing, bring, strong, language
mb -> bemba, mumba, chomba
qu -> quick, quiet, quote, query
When it needs a new word, it looks at the highest-scoring bigrams and pulls candidate words from this index. A word gets a score from the bigrams it contains. To stop one long word from winning just because it contains many scored pairs, the word score only uses the top three bigram scores in the word.
There is also a short recent-word buffer. If two words are similarly useful, the selector should avoid repeating the word I just typed.
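The top-three rule can be sketched as a small function. I am assuming here that the word score is the sum of its three highest bigram scores; the text above only says the top three are used, so the sum is my reading:

```go
package main

import (
	"fmt"
	"sort"
)

// wordScore sums the three highest bigram scores in the word,
// so a long word cannot win just by containing many scored pairs.
func wordScore(word string, bigramScore map[string]float64) float64 {
	var scores []float64
	for i := 0; i+1 < len(word); i++ {
		scores = append(scores, bigramScore[word[i:i+2]])
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(scores)))
	total := 0.0
	for i, s := range scores {
		if i >= 3 {
			break
		}
		total += s
	}
	return total
}

func main() {
	scores := map[string]float64{"ng": 0.11, "th": 0.03, "mb": 0.16}
	// "thing" contains two scored pairs, "three" only one
	fmt.Println(wordScore("thing", scores) > wordScore("three", scores)) // true
}
```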
Why not just be greedy?
One tempting version is:
- find the highest-scoring bigram
- pick a word containing that bigram
- repeat forever
This is too greedy. In my local experiment, greedy indexed selection found some
weak bigrams but then over-focused on those early discoveries. In one run it had
recall@10 = 0.80, while the indexed selector with a little random exploration
had recall@10 = 1.00.
The final version keeps explorationRate = 0.05. So about 5 percent of the
time, it picks a random word instead of exploiting the current weak-bigram
ranking. That gives the algorithm a chance to discover bigrams it has not seen
enough yet.
Directions I decided not to pursue
Mistakes only
The first version was basically this:
score = misses / attempts
That is easy to understand, but it misses the most interesting case for me: letter pairs that are technically correct but slow. It also makes the feature feel too close to a normal weak-spot mode.
Raw timing only
The opposite version would be to rank by timing:
score = averageMS
This also has problems. It is too sensitive to pauses, distractions, and random input noise. It also ignores the fact that an actual mistake is still a strong signal. I ended up with a combined score instead.
Per-language scoring
I considered making the cache language-specific. That would be cleaner in some
ways, but it also adds a lot of moving parts. For now, I think the simpler rule
is better: Bigram Crunch only supports alphabetic a-z languages, and should be
disabled for languages where that model does not make sense.
Machine learning
I also looked at work around keystroke dynamics. A lot of it is about authentication, not typing practice. For example, Killourhy and Maxion’s Comparing Anomaly-Detection Algorithms for Keystroke Dynamics compares different ways to decide whether a typing sample belongs to a user. There is also an implementation repo here:
github.com/vinigarg/Comparing-Anomaly-Detection-Algorithms-for-Keystroke-Dynamics
Some of the ideas are adjacent, especially the idea of using timing features and evaluating different algorithms under the same conditions. But the task here is not to classify a user or detect an impostor. I just need to rank letter pairs and choose useful practice words. A small scoring function felt like the more direct thing to try first.
The most relevant paper I found was Dhakal, Feit, Kristensson, and Oulasvirta’s Observations on Typing from 136 Million Keystrokes. That work made timing between adjacent keys feel like a reasonable signal to try. Bigram Crunch is much simpler than their analysis, but it pushed me away from looking only at mistakes.
The small lab repo
To make the tuning less hand-wavy, I wrote a small Go program that simulates a typist with known weak bigrams.
The synthetic typist has:
- base miss rate
- weak-bigram miss rate
- base key-to-key timing
- extra timing delay for weak bigrams
- pause/outlier rate
Then the program compares scoring and selection variants. The important metric
is recall@10: how many known weak bigrams appear in the top 10 ranked
bigrams.
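The metric is simple enough to write down. A sketch, with `planted` standing in for the simulator's known weak-bigram set:

```go
package main

import "fmt"

// recallAtK reports what fraction of the planted weak bigrams
// appear among the top-k ranked bigrams.
func recallAtK(ranked []string, planted map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	found := 0
	for _, b := range ranked[:k] {
		if planted[b] {
			found++
		}
	}
	return float64(found) / float64(len(planted))
}

func main() {
	ranked := []string{"ng", "mb", "th", "he", "in"}
	planted := map[string]bool{"ng": true, "mb": true}
	fmt.Println(recallAtK(ranked, planted, 10)) // 1
}
```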
The final local run I used was:
go run . -sessions 100 -words 120 -search-top 10 -top 0
The best candidate was:
grid-m0.65-t0.35-c10-max900-x0.05
objective=91.75 recall@10=1.00 false+10=5 meanRank=3.00 repeatWords=0.0002
That candidate means:
missRateWeight = 0.65
timingWeight = 0.35
confidenceAttempts = 10
maxTimingMs = 900
explorationRate = 0.05
Several candidates were very close, so I do not read this as “these numbers are universally optimal”. I read it as “these numbers did well under the assumptions I tested, and they are a better starting point than the hard-coded values I had before”.
Why I think this version works better
The final version is still deliberately simple:
- track adjacent bigrams
- update miss count and timing average
- skip bigram updates for timing outliers
- combine miss rate and timing penalty
- apply a confidence ramp
- choose words from a bigram-to-word index
- keep a small amount of exploration
The reason I like this shape is that each part has a job.
The miss rate catches obvious mistakes. The timing penalty catches correct but slow pairs. The confidence ramp stops tiny samples from overreacting. The timing cap stops pauses from becoming permanent weak spots. The indexed selector makes word choice feel targeted, and the exploration rate stops the selector from getting stuck too early.
It is not a full model of typing. It does not try to understand hands, fingers, rollover, or keyboard layout geometry. I may revisit those later. For now, this feels like the right level of complexity for the feature: enough structure to be useful, but still simple enough to explain and debug.