Bigram Crunch
Planted April 12, 2026
I have been working on a Monkeytype mode I call Bigram Crunch. The basic idea is simple: instead of only telling me that I made mistakes, the mode should notice which pairs of letters are weak and then give me more words containing those pairs.
This started as something I built for myself while switching keyboard layouts to Colemak. The first version lived on my own Monkeytype deployment at typing.martinnn.com. It was not serious at all. It mostly watched for bigrams I mistyped and then tried to throw more of those back at me. That was useful enough to prove the idea, but it was also close to a mistake tracker, and I did not want to push that version to a wider audience.
The part I wanted to revisit was the scoring algorithm. If the score is wrong, the whole feature feels wrong: the mode either keeps repeating words that are not useful, or misses the letter pairs that are actually slowing me down.
The small experiment repo for this work is here:
github.com/kopachef/bigram-crunch-lab
What I wanted Bigram Crunch to do
I wanted the mode to find two kinds of weak bigrams:
- bigrams I mistype
- bigrams I type correctly but slowly
The second point is what makes this different from a normal weak-spot mode. If I
type th correctly every time but always pause on it, that still feels like a
useful thing to practise. A mistake-only score would never notice it.
So the algorithm needs to store a small history for each bigram:
type Stats struct {
    Attempts  int     // accepted updates recorded for this bigram
    Misses    int     // how many of those attempts were mistyped
    AverageMS float64 // running average key-to-key timing, in ms
    Score     float64 // cached combined score
}
Every time a word is typed, each adjacent letter pair in that word updates its
own history. For thing, the bigrams are:
th
hi
in
ng
For now I only model adjacent ASCII a-z pairs. This means words like cafe
work normally, but scripts that do not use those alphabetic characters should be
treated as unsupported for this feature. I think that is a reasonable first
limitation. Adding language-specific scoring for every writing system would make
the feature much larger before the core idea has settled.
The score
The current score combines mistakes, timing, and confidence:
missRate = misses / attempts
timingPenalty = clamp(
(averageMS - timingBaselineMS) / timingBaselineMS,
0,
2
)
confidence = min(1, attempts / confidenceAttempts)
score = confidence * (
missRate * missRateWeight +
timingPenalty * timingWeight
)
The values I am currently using are:
missRateWeight = 0.65
timingWeight = 0.35
confidenceAttempts = 10
timingBaselineMS = 180
maxTimingMs = 900
explorationRate = 0.05
The confidence term is there because one early mistake should not dominate the ranking forever. If I have only seen a bigram twice, I do not really know much about it. Once I have seen it around 10 times, I am more willing to trust the score.
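Put together, the score is a few lines of Go. This is a sketch using the constants above; the helper names are mine:

```go
package main

import "fmt"

const (
	missRateWeight     = 0.65
	timingWeight       = 0.35
	confidenceAttempts = 10.0
	timingBaselineMS   = 180.0
)

func clamp(x, lo, hi float64) float64 {
	if x < lo {
		return lo
	}
	if x > hi {
		return hi
	}
	return x
}

// score combines miss rate, timing penalty, and confidence
// as described in the pseudocode above.
func score(attempts, misses int, averageMS float64) float64 {
	if attempts == 0 {
		return 0
	}
	missRate := float64(misses) / float64(attempts)
	timingPenalty := clamp((averageMS-timingBaselineMS)/timingBaselineMS, 0, 2)
	confidence := float64(attempts) / confidenceAttempts
	if confidence > 1 {
		confidence = 1
	}
	return confidence * (missRate*missRateWeight + timingPenalty*timingWeight)
}

func main() {
	fmt.Printf("%.4f\n", score(30, 5, 145)) // miss-heavy bigram: 0.1083
	fmt.Printf("%.4f\n", score(30, 0, 260)) // slow but correct: 0.1556
}
```

The first value prints as 0.1083 rather than the 0.1084 worked out later in the examples, only because the worked example rounds the miss rate to 0.1667 before multiplying.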
Large timing samples are filtered out. If I pause because I got distracted,
that should not turn a normal bigram into a permanent practice target. With the
current settings, timings above 900ms skip the whole bigram update: no
attempt, miss, timing average, or score is changed.
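The skip is easiest to see in the update path itself. A sketch, assuming the Stats struct above; the incremental running-average formula is my choice of averaging, not something the mode's code specifies:

```go
package main

import "fmt"

const maxTimingMS = 900.0

type Stats struct {
	Attempts  int
	Misses    int
	AverageMS float64
}

// update records one typed instance of a bigram. Samples above
// maxTimingMS are treated as pauses and skip the update entirely.
func (s *Stats) update(missed bool, timingMS float64) {
	if timingMS > maxTimingMS {
		return // outlier: no attempt, miss, or timing change
	}
	s.Attempts++
	if missed {
		s.Misses++
	}
	// incremental running average over accepted samples only
	s.AverageMS += (timingMS - s.AverageMS) / float64(s.Attempts)
}

func main() {
	var st Stats
	st.update(false, 140)
	st.update(false, 150)
	st.update(false, 1700) // skipped as an outlier
	fmt.Println(st.Attempts, st.AverageMS) // 2 145
}
```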
A few small examples
These are the kinds of cases I wanted the scoring function to handle.
Equal histories
If all bigrams have the same history, none of them should be promoted:
th: 20 attempts, 0 misses, 145ms average
he: 20 attempts, 0 misses, 145ms average
ng: 20 attempts, 0 misses, 145ms average
mb: 20 attempts, 0 misses, 145ms average
All of those should end up with the same score. If one rises to the top anyway, the scoring function is inventing weakness where there is none.
A miss-heavy bigram
Now suppose ng has mistakes:
ng: 30 attempts, 5 misses, 145ms average
The miss rate is:
5 / 30 = 0.1667
The timing penalty is zero because 145ms is below the 180ms baseline.
So the score is roughly:
1.0 * (0.1667 * 0.65 + 0 * 0.35) = 0.1084
That should move ng above clean bigrams.
A slow but correct bigram
Now consider a bigram with no mistakes, but a slower average:
mb: 30 attempts, 0 misses, 260ms average
The miss rate is zero. The timing penalty is:
(260 - 180) / 180 = 0.4444
So the score is:
1.0 * (0 * 0.65 + 0.4444 * 0.35) = 0.1556
This is intentional. If a pair is typed correctly but slowly, I still want it to be visible. That is the main reason timing is part of the algorithm.
A timing outlier
This is the case I wanted to avoid:
he: 30 attempts, 0 misses, 145ms average
he: one later sample at 1700ms
That 1700ms sample is more likely to be a pause than a real signal about the
he bigram. With maxTimingMs = 900, the whole update is skipped. The attempt
count, miss count, average, and score all stay where they were.
Word selection
Once the bigrams have scores, the mode still has to decide which words to show.
The indexed selector builds a map from bigrams to words:
th -> thing, think, other, three
ng -> thing, bring, strong, language
mb -> bemba, mumba, chomba
qu -> quick, quiet, quote, query
When it needs a new word, it looks at the highest-scoring bigrams and pulls candidate words from this index. A word gets a score from the bigrams it contains. To stop one long word from winning just because it contains many scored pairs, the word score only uses the top three bigram scores in the word.
There is also a short recent-word buffer. If two words are similarly useful, the selector should avoid repeating the word I just typed.
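The top-three rule can be sketched as a small function. I am assuming here that the word score is the sum of its three highest bigram scores; the text above only says the top three are used, so the sum is my reading:

```go
package main

import (
	"fmt"
	"sort"
)

// wordScore sums the three highest bigram scores in the word,
// so a long word cannot win just by containing many scored pairs.
func wordScore(word string, bigramScore map[string]float64) float64 {
	var scores []float64
	for i := 0; i+1 < len(word); i++ {
		scores = append(scores, bigramScore[word[i:i+2]])
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(scores)))
	total := 0.0
	for i, s := range scores {
		if i >= 3 {
			break
		}
		total += s
	}
	return total
}

func main() {
	scores := map[string]float64{"ng": 0.11, "th": 0.03, "mb": 0.16}
	// "thing" contains two scored pairs, "three" only one
	fmt.Println(wordScore("thing", scores) > wordScore("three", scores)) // true
}
```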
Why not just be greedy?
One tempting version is:
- find the highest-scoring bigram
- pick a word containing that bigram
- repeat forever
This is too greedy. In my local experiment, greedy indexed selection found some
weak bigrams but then over-focused on those early discoveries. In one run it had
recall@10 = 0.80, while the indexed selector with a little random exploration
had recall@10 = 1.00.
The final version keeps explorationRate = 0.05. So about 5 percent of the
time, it picks a random word instead of exploiting the current weak-bigram
ranking. That gives the algorithm a chance to discover bigrams it has not seen
enough yet.
Directions I decided not to pursue
Mistakes only
The first version was basically this:
score = misses / attempts
That is easy to understand, but it misses the most interesting case for me: letter pairs that are technically correct but slow. It also makes the feature feel too close to a normal weak-spot mode.
Raw timing only
The opposite version would be to rank by timing:
score = averageMS
This also has problems. It is too sensitive to pauses, distractions, and random input noise. It also ignores the fact that an actual mistake is still a strong signal. I ended up with a combined score instead.
Per-language scoring
I considered making the cache language-specific. That would be cleaner in some
ways, but it also adds a lot of moving parts. For now, I think the simpler rule
is better: Bigram Crunch only supports alphabetic a-z languages, and should be
disabled for languages where that model does not make sense.
Machine learning
I also looked at work around keystroke dynamics. A lot of it is about authentication, not typing practice. For example, Killourhy and Maxion’s Comparing Anomaly-Detection Algorithms for Keystroke Dynamics compares different ways to decide whether a typing sample belongs to a user. There is also an implementation repo here:
github.com/vinigarg/Comparing-Anomaly-Detection-Algorithms-for-Keystroke-Dynamics
Some of the ideas are adjacent, especially the idea of using timing features and evaluating different algorithms under the same conditions. But the task here is not to classify a user or detect an impostor. I just need to rank letter pairs and choose useful practice words. A small scoring function felt like the more direct thing to try first.
The most relevant paper I found was Dhakal, Feit, Kristensson, and Oulasvirta’s Observations on Typing from 136 Million Keystrokes. That work made timing between adjacent keys feel like a reasonable signal to try. Bigram Crunch is much simpler than their analysis, but it pushed me away from looking only at mistakes.
The small lab repo
To make the tuning less hand-wavy, I wrote a small Go program that simulates a typist with known weak bigrams.
The synthetic typist has:
- base miss rate
- weak-bigram miss rate
- base key-to-key timing
- extra timing delay for weak bigrams
- pause/outlier rate
Then the program compares scoring and selection variants. The important metric
is recall@10: how many known weak bigrams appear in the top 10 ranked
bigrams.
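The metric is simple enough to write down. A sketch, with `planted` standing in for the simulator's known weak-bigram set:

```go
package main

import "fmt"

// recallAtK reports what fraction of the planted weak bigrams
// appear among the top-k ranked bigrams.
func recallAtK(ranked []string, planted map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	found := 0
	for _, b := range ranked[:k] {
		if planted[b] {
			found++
		}
	}
	return float64(found) / float64(len(planted))
}

func main() {
	ranked := []string{"ng", "mb", "th", "he", "in"}
	planted := map[string]bool{"ng": true, "mb": true}
	fmt.Println(recallAtK(ranked, planted, 10)) // 1
}
```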
The final local run I used was:
go run . -sessions 100 -words 120 -search-top 10 -top 0
The best candidate was:
grid-m0.65-t0.35-c10-max900-x0.05
objective=91.75 recall@10=1.00 false+10=5 meanRank=3.00 repeatWords=0.0002
That candidate means:
missRateWeight = 0.65
timingWeight = 0.35
confidenceAttempts = 10
maxTimingMs = 900
explorationRate = 0.05
Several candidates were very close, so I do not read this as “these numbers are universally optimal”. I read it as “these numbers did well under the assumptions I tested, and they are a better starting point than the hard-coded values I had before”.
Why I think this version works better
The final version is still deliberately simple:
- track adjacent bigrams
- update miss count and timing average
- skip bigram updates for timing outliers
- combine miss rate and timing penalty
- apply a confidence ramp
- choose words from a bigram-to-word index
- keep a small amount of exploration
The reason I like this shape is that each part has a job.
The miss rate catches obvious mistakes. The timing penalty catches correct but slow pairs. The confidence ramp stops tiny samples from overreacting. The timing cap stops pauses from becoming permanent weak spots. The indexed selector makes word choice feel targeted, and the exploration rate stops the selector from getting stuck too early.
It is not a full model of typing. It does not try to understand hands, fingers, rollover, or keyboard layout geometry. I may revisit those later. For now, this feels like the right level of complexity for the feature: enough structure to be useful, but still simple enough to explain and debug.