Bigram Crunch

Planted April 12, 2026

I have been working on a Monkeytype mode I call Bigram Crunch. The basic idea is simple: instead of only telling me that I made mistakes, the mode should notice which pairs of letters are weak and then give me more words containing those pairs.

This started as something I built for myself while switching keyboard layouts to Colemak. The first version lived on my own Monkeytype deployment at typing.martinnn.com. It was not serious at all. It mostly watched for bigrams I mistyped and then tried to throw more of those back at me. That was useful enough to prove the idea, but it was also close to a mistake tracker, and I did not want to push that version to a wider audience.

The part I wanted to revisit was the scoring algorithm. If the score is wrong, the whole feature feels wrong: it either keeps repeating words that are not useful, or misses the letter pairs that are actually slowing me down.

The small experiment repo for this work is here:

github.com/kopachef/bigram-crunch-lab

What I wanted Bigram Crunch to do

I wanted the mode to find two kinds of weak bigrams:

  • bigrams I mistype
  • bigrams I type correctly but slowly

The second point is what makes this different from a normal weak-spot mode. If I type th correctly every time but always pause on it, that still feels like a useful thing to practise. A mistake-only score would never notice it.

So the algorithm needs to store a small history for each bigram:

type Stats struct {
    Attempts  int
    Misses    int
    AverageMS float64
    Score     float64
}

Every time a word is typed, each adjacent letter pair in that word updates its own history. For thing, the bigrams are:

th
hi
in
ng

For now I only model adjacent ASCII a-z pairs. This means words like cafe work normally, but scripts that do not use those alphabetic characters should be treated as unsupported for this feature. I think that is a reasonable first limitation. Adding language-specific scoring for every writing system would make the feature much larger before the core idea has settled.
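Under that a-z assumption, extracting bigrams from a word is a short function. This is a sketch of my understanding, not code from the actual mode (the function name is mine):

```go
package main

import (
	"fmt"
	"strings"
)

// bigrams returns the adjacent letter pairs in a word, keeping only
// pairs where both bytes are ASCII a-z. Bytes outside that range
// (accents, punctuation, other scripts) are simply skipped.
func bigrams(word string) []string {
	word = strings.ToLower(word)
	var out []string
	for i := 0; i+1 < len(word); i++ {
		a, b := word[i], word[i+1]
		if a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z' {
			out = append(out, string([]byte{a, b}))
		}
	}
	return out
}

func main() {
	fmt.Println(bigrams("thing")) // [th hi in ng]
}
```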

The score

The current score combines mistakes, timing, and confidence:

missRate = misses / attempts

timingPenalty = clamp(
  (averageMS - timingBaselineMS) / timingBaselineMS,
  0,
  2
)

confidence = min(1, attempts / confidenceAttempts)

score = confidence * (
  missRate * missRateWeight +
  timingPenalty * timingWeight
)

The values I am currently using are:

missRateWeight = 0.65
timingWeight = 0.35
confidenceAttempts = 10
timingBaselineMS = 180
maxTimingMs = 900
explorationRate = 0.05

The confidence term is there because one early mistake should not dominate the ranking forever. If I have only seen a bigram twice, I do not really know much about it. Once I have seen it around 10 times, I am more willing to trust the score.

Large timing samples are filtered out. If I pause because I got distracted, that should not turn a normal bigram into a permanent practice target. With the current settings, timings above 900ms skip the whole bigram update: no attempt, miss, timing average, or score is changed.
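Put together, one bigram update looks roughly like this. It is a sketch of my reading of the rules above, not the repo's actual code, reusing the Stats struct and the constants from this post:

```go
package main

import "fmt"

// Tunable constants from the post.
const (
	missRateWeight     = 0.65
	timingWeight       = 0.35
	confidenceAttempts = 10.0
	timingBaselineMS   = 180.0
	maxTimingMS        = 900.0
)

type Stats struct {
	Attempts  int
	Misses    int
	AverageMS float64
	Score     float64
}

// Update folds one typed sample into the bigram's history. Timings
// above maxTimingMS are treated as pauses: the whole update is
// skipped and no field changes.
func (s *Stats) Update(missed bool, ms float64) {
	if ms > maxTimingMS {
		return
	}
	// Running mean over the accepted samples only.
	s.AverageMS = (s.AverageMS*float64(s.Attempts) + ms) / float64(s.Attempts+1)
	s.Attempts++
	if missed {
		s.Misses++
	}
	s.Score = s.score()
}

func (s *Stats) score() float64 {
	missRate := float64(s.Misses) / float64(s.Attempts)
	// Timing penalty, clamped to [0, 2] as in the formula above.
	penalty := (s.AverageMS - timingBaselineMS) / timingBaselineMS
	if penalty < 0 {
		penalty = 0
	} else if penalty > 2 {
		penalty = 2
	}
	// Confidence ramps from 0 to 1 over the first 10 attempts.
	confidence := float64(s.Attempts) / confidenceAttempts
	if confidence > 1 {
		confidence = 1
	}
	return confidence * (missRate*missRateWeight + penalty*timingWeight)
}

func main() {
	// The miss-heavy ng example: 30 attempts, 5 misses, 145ms each.
	s := &Stats{}
	for i := 0; i < 30; i++ {
		s.Update(i < 5, 145)
	}
	fmt.Printf("%.4f\n", s.Score) // close to the 0.1084 worked example below
}
```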

A few small examples

These are the kinds of cases I wanted the scoring function to handle.

Equal histories

If all bigrams have the same history, none of them should be promoted:

th: 20 attempts, 0 misses, 145ms average
he: 20 attempts, 0 misses, 145ms average
ng: 20 attempts, 0 misses, 145ms average
mb: 20 attempts, 0 misses, 145ms average

All of those should end up with the same score. If one rises to the top anyway, the scoring function is inventing weakness where there is none.

A miss-heavy bigram

Now suppose ng has mistakes:

ng: 30 attempts, 5 misses, 145ms average

The miss rate is:

5 / 30 = 0.1667

The timing penalty is zero because 145ms is below the 180ms baseline.

So the score is roughly:

1.0 * (0.1667 * 0.65 + 0 * 0.35) = 0.1084

That should move ng above clean bigrams.

A slow but correct bigram

Now consider a bigram with no mistakes, but a slower average:

mb: 30 attempts, 0 misses, 260ms average

The miss rate is zero. The timing penalty is:

(260 - 180) / 180 = 0.4444

So the score is:

1.0 * (0 * 0.65 + 0.4444 * 0.35) = 0.1556

This is intentional. If a pair is typed correctly but slowly, I still want it to be visible. That is the main reason timing is part of the algorithm.

A timing outlier

This is the case I wanted to avoid:

he: 30 attempts, 0 misses, 145ms average
he: one later sample at 1700ms

That 1700ms sample is more likely to be a pause than a real signal about the he bigram. With maxTimingMs = 900, the whole update is skipped. The attempt count, miss count, average, and score all stay where they were.

Word selection

Once the bigrams have scores, the mode still has to decide which words to show.

The indexed selector builds a map from bigrams to words:

th -> thing, think, other, three
ng -> thing, bring, strong, language
mb -> bemba, mumba, chomba
qu -> quick, quiet, quote, query

When it needs a new word, it looks at the highest-scoring bigrams and pulls candidate words from this index. A word gets a score from the bigrams it contains. To stop one long word from winning just because it contains many scored pairs, the word score only uses the top three bigram scores in the word.

There is also a short recent-word buffer. If two words are similarly useful, the selector should avoid repeating the word I just typed.
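The top-three rule can be sketched in a few lines. Again this is my reconstruction under the assumptions above, with the bigram scores assumed to come from the Stats cache:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// wordScore ranks a candidate word by the sum of its top three bigram
// scores, so a long word cannot win just because it contains many
// scored pairs.
func wordScore(word string, bigramScore map[string]float64) float64 {
	word = strings.ToLower(word)
	var scores []float64
	for i := 0; i+1 < len(word); i++ {
		if s, ok := bigramScore[word[i:i+2]]; ok {
			scores = append(scores, s)
		}
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(scores)))
	if len(scores) > 3 {
		scores = scores[:3]
	}
	total := 0.0
	for _, s := range scores {
		total += s
	}
	return total
}

func main() {
	// Hypothetical scores for the bigrams in "thing".
	scores := map[string]float64{"th": 0.11, "hi": 0.02, "in": 0.04, "ng": 0.16}
	fmt.Printf("%.2f\n", wordScore("thing", scores)) // 0.16 + 0.11 + 0.04 = 0.31
}
```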

Why not just be greedy?

One tempting version is:

  1. find the highest-scoring bigram
  2. pick a word containing that bigram
  3. repeat forever

This is too greedy. In my local experiment, greedy indexed selection found some weak bigrams but then over-focused on those early discoveries. In one run it had recall@10 = 0.80, while the indexed selector with a little random exploration had recall@10 = 1.00.

The final version keeps explorationRate = 0.05. So about 5 percent of the time, it picks a random word instead of exploiting the current weak-bigram ranking. That gives the algorithm a chance to discover bigrams it has not seen enough yet.

Directions I decided not to pursue

Mistakes only

The first version was basically this:

score = misses / attempts

That is easy to understand, but it misses the most interesting case for me: letter pairs that are technically correct but slow. It also makes the feature feel too close to a normal weak-spot mode.

Raw timing only

The opposite version would be to rank by timing:

score = averageMS

This also has problems. It is too sensitive to pauses, distractions, and random input noise. It also ignores the fact that an actual mistake is still a strong signal. I ended up with a combined score instead.

Per-language scoring

I considered making the cache language-specific. That would be cleaner in some ways, but it also adds a lot of moving parts. For now, I think the simpler rule is better: Bigram Crunch only supports alphabetic a-z languages, and should be disabled for languages where that model does not make sense.

Machine learning

I also looked at work around keystroke dynamics. A lot of it is about authentication, not typing practice. For example, Killourhy and Maxion’s Comparing Anomaly-Detection Algorithms for Keystroke Dynamics compares different ways to decide whether a typing sample belongs to a user. There is also an implementation repo here:

github.com/vinigarg/Comparing-Anomaly-Detection-Algorithms-for-Keystroke-Dynamics

Some of the ideas are adjacent, especially the idea of using timing features and evaluating different algorithms under the same conditions. But the task here is not to classify a user or detect an impostor. I just need to rank letter pairs and choose useful practice words. A small scoring function felt like the more direct thing to try first.

The most relevant paper I found was Dhakal, Feit, Kristensson, and Oulasvirta’s Observations on Typing from 136 Million Keystrokes. That work made timing between adjacent keys feel like a reasonable signal to try. Bigram Crunch is much simpler than their analysis, but it pushed me away from looking only at mistakes.

The small lab repo

To make the tuning less hand-wavy, I wrote a small Go program that simulates a typist with known weak bigrams.

The synthetic typist has:

  • a base miss rate
  • a weak-bigram miss rate
  • a base key-to-key timing
  • an extra timing delay for weak bigrams
  • a pause/outlier rate

Then the program compares scoring and selection variants. The important metric is recall@10: how many known weak bigrams appear in the top 10 ranked bigrams.
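The metric itself is tiny. This is my reconstruction of recall@k, not the lab's exact code:

```go
package main

import "fmt"

// recallAtK returns the fraction of known weak bigrams that appear
// in the top k entries of the ranked list.
func recallAtK(ranked []string, weak map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	found := 0
	for _, bg := range ranked[:k] {
		if weak[bg] {
			found++
		}
	}
	return float64(found) / float64(len(weak))
}

func main() {
	weak := map[string]bool{"ng": true, "mb": true}
	ranked := []string{"ng", "th", "mb", "he", "in"}
	fmt.Println(recallAtK(ranked, weak, 10)) // both weak bigrams ranked in the top 10 -> 1
}
```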

The final local run I used was:

go run . -sessions 100 -words 120 -search-top 10 -top 0

The best candidate was:

grid-m0.65-t0.35-c10-max900-x0.05
objective=91.75 recall@10=1.00 false+10=5 meanRank=3.00 repeatWords=0.0002

That candidate means:

missRateWeight = 0.65
timingWeight = 0.35
confidenceAttempts = 10
maxTimingMs = 900
explorationRate = 0.05

Several candidates were very close, so I do not read this as “these numbers are universally optimal”. I read it as “these numbers did well under the assumptions I tested, and they are a better starting point than the hard-coded values I had before”.

Why I think this version works better

The final version is still deliberately simple:

  • track adjacent bigrams
  • update miss count and timing average
  • skip bigram updates for timing outliers
  • combine miss rate and timing penalty
  • apply a confidence ramp
  • choose words from a bigram-to-word index
  • keep a small amount of exploration

The reason I like this shape is that each part has a job.

The miss rate catches obvious mistakes. The timing penalty catches correct but slow pairs. The confidence ramp stops tiny samples from overreacting. The timing cap stops pauses from becoming permanent weak spots. The indexed selector makes word choice feel targeted, and the exploration rate stops the selector from getting stuck too early.

It is not a full model of typing. It does not try to understand hands, fingers, rollover, or keyboard layout geometry. I may revisit those later. For now, this feels like the right level of complexity for the feature: enough structure to be useful, but still simple enough to explain and debug.