What's My Level? - How the Algorithm Works

What's My Level? is an adaptive English level test launched in 2026 that estimates a student's CEFR grade (A1–C2) from a short series of questions. This page explains how the algorithm works and why it was designed the way it was.

The Short Version

The test asks 12–16 questions and adapts as you answer. If you're doing well, the questions get harder; if you're finding them difficult, they get easier. Your final level is wherever you've settled by the end.

The algorithm is deliberately cautious: it gives more weight to wrong answers than right ones, so it tends to grade slightly low rather than slightly high. If your result feels a touch conservative, that's by design. We'd rather point you towards materials that build your confidence than ones that leave you out of your depth.

If you're curious about the technical details (how we tested the algorithm, what it was tested against, and why we made the choices we did), read on.

Why a New Level Test?

This test replaces the original English Level Test, which has been on the site since the early days. That test served its purpose well for many years, but it has limitations: it's a single block of 35 gap-fill grammar questions, graded on a simple score threshold. A student who scores 26 or above is "Intermediate", below 18 is "Elementary", and so on. It tests one skill, written grammar, in one format, and it's vulnerable to near-miss answers: a student who writes "has been stolen" scores a point, but "it has been stolen" or "has been stoled" scores zero, even though those answers reveal very different things about the student's ability.

What's My Level? takes a different approach. It's adaptive: the questions get harder or easier depending on how you're doing. It tests grammar, vocabulary, and listening across multiple question types: multiple choice, word matching, sentence reordering, image-based questions, and audio comprehension. And it uses the CEFR scale (A1 to C2), which is the standard used by language schools, universities, and employers worldwide.

The original test is still available at English Level Test for anyone who prefers it.

The Problem: Grading with Limited Data

Estimating a student's CEFR level from 12–16 questions is inherently imprecise, but then, so is every other method of language assessment.

A formal placement test might use 60–100 questions, but it's still a snapshot of a single sitting. A student having a good day, a bad day, or simply getting lucky on a handful of questions will shift the result. Even a one-hour conversation with an experienced teacher is subject to the teacher's own biases; some grade generously, some grade harshly, and the topics that happen to come up will favour some students over others. Any written test is also only as good as the quality of its questions: a poorly worded item, an ambiguous option, or a question that rewards test technique over genuine ability can skew a result regardless of how many questions you ask.

The honest truth is that no assessment method - short or long, human or automated - produces a perfectly objective CEFR grade. What matters is finding an approach that lands on the right answer as often as possible and, when it does get it wrong, gets it wrong by the smallest margin and in the least harmful direction.

That's what this page is about: how we chose the algorithm behind What's My Level? and why we believe it's a sensible approach to a fundamentally difficult problem.

How the Test Works

What's My Level? is an adaptive test: it doesn't ask every student the same fixed list of questions. If the student performs well, they are served tougher questions, and vice versa - a weaker student will spend most of the test answering lower-level questions.

The test starts at A2 and adapts as the student answers. After enough questions, the level you've settled at becomes your result.

The critical design choice is how the algorithm responds to right and wrong answers: how quickly it moves up, how quickly it moves down, and how much cushion it provides against a single unlucky answer derailing the result.

The test also draws on a wider range of question types than most online level tests, which tend to rely almost entirely on gap-fill and multiple choice. What's My Level? includes listening comprehension, word matching, sentence reordering, image-based questions, and yes/no judgements alongside traditional multiple choice. This variety matters because real language ability isn't just knowing which word fills a blank: it's hearing natural speech, recognising word relationships, and understanding meaning in context.

The Half-Step Ladder

CEFR has six levels: A1, A2, B1, B2, C1, C2. A naive algorithm might move the student up or down one full level after each answer, but this is brittle and produces wild swings: one right answer followed by one wrong answer and you're back where you started. Over a short test, this volatility means the final result depends more on the order of right and wrong answers than on the student's actual ability.

Our algorithm uses a finer internal scale. Each CEFR level is split into a lower half and an upper half, creating a twelve-step ladder. The student never sees this; the badge always shows the full CEFR level (B1, B2, etc.).
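
To make the ladder concrete, here is a rough sketch in Python. It is purely illustrative - the names and layout are ours, not the production code - but it shows the twelve internal half-steps and the full CEFR badge the student actually sees.

    CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

    # Internal steps 0-11: 0 = low A1, 1 = high A1, 2 = low A2, ... 11 = high C2.
    LADDER = [(level, half) for level in CEFR_LEVELS for half in ("low", "high")]

    def displayed_level(step):
        """Map an internal half-step (0-11) to the CEFR badge the student sees."""
        return LADDER[step][0]

    print(displayed_level(5))   # internally "high B1", shown simply as B1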

Why a wrong answer should weigh more than a right answer

On this twelve-step ladder, a correct answer moves the student up one half-step, while a wrong answer moves them down two half-steps. This asymmetry is deliberate and central to how the algorithm works.
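
In code, the weighted update is a one-line rule. The sketch below is again illustrative, using the step numbering from the ladder sketch above: plus one half-step for a right answer, minus two for a wrong one, clamped to the ends of the ladder.

    def update_step(step, answered_correctly):
        """Move up one half-step for a right answer, down two for a wrong one."""
        move = 1 if answered_correctly else -2
        return max(0, min(11, step + move))   # stay on the twelve-step ladder

    # Example: a student at low B1 (step 4) answers right, right, wrong.
    step = 4
    for correct in (True, True, False):
        step = update_step(step, correct)
    print(step)   # 4 -> 5 -> 6 -> 4: one wrong answer undoes two right ones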

Consider what a right answer actually tells you. A student can get a question right through good guessing, partial knowledge, process of elimination, or simply recognising a familiar pattern without truly understanding the grammar behind it. Multiple choice questions with four options give you a 25% baseline just from random chance. A right answer is encouraging, but it's weak evidence.

A wrong answer tells you much more. If a student gets a question wrong at their own level - a level where they should be comfortable - that's a meaningful signal. It suggests the question touched a genuine gap in their knowledge, not just bad luck. The algorithm respects this by treating wrong answers as more significant than right answers, and this principle holds true well beyond our little test: in language assessment generally, what a student gets wrong reveals more about their level than what they get right.

The net result is an algorithm that tends to grade slightly low rather than slightly high when it's uncertain. This is a deliberate choice. A student told they are B1 who is actually a strong B1 will feel validated and motivated. A student told they are B2 who is actually B1 may feel out of their depth when they encounter B2 materials. Erring low is the safer, kinder direction.

Testing the Algorithm

The question

The design work above is all theory. The real test is simple: if we know a student is B1, what does the algorithm actually give them? And how often does it get it right?

This is where theory meets the real world. A B1 student won't get every B1 question right: they'll get roughly 70% of them right, guess correctly on some harder ones, and occasionally slip up on easier ones. Over 15 questions, there's genuine randomness at play. Could a weak student fluke their way to a higher grade? Yes, just as you could flip a coin and get six heads in a row. But how often? That's the question that matters, and the only way to answer it is to run the test thousands of times and look at the distribution of results.

Probability model

Each simulated student has a "true" CEFR level, and their chance of getting a question right depends on how far that question is from their true ability:

Question level vs. student's true level | Chance of answering correctly
Two or more levels below | 95%
One level below | 90%
At the student's true level | 70%
One level above | 40%
Two or more levels above | 10%

The 70% at-level figure is key. It means even at their correct level, students get roughly three out of ten questions wrong. This is realistic - language ability is fuzzy, not binary - but it's exactly where poorly designed algorithms fall apart. An algorithm that panics on every wrong answer will drag a perfectly competent student down; one that's too forgiving will let a lucky guesser sail through.

These probability figures are reasonable estimates, not empirically measured values. Different assumptions would shift the exact numbers in the results below, but the relative performance of the algorithms we tested was consistent across a range of reasonable assumptions.
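
Expressed as a function (a sketch using the estimated figures from the table above, with levels passed as CEFR indices from 0 for A1 to 5 for C2), the model is:

    def p_correct(question_level, true_level):
        """Chance of a correct answer, given CEFR levels as indices 0 (A1) to 5 (C2)."""
        diff = question_level - true_level   # positive = question is above the student
        if diff <= -2:
            return 0.95
        if diff == -1:
            return 0.90
        if diff == 0:
            return 0.70
        if diff == 1:
            return 0.40
        return 0.10   # two or more levels above

    print(p_correct(3, 2))   # a B2 question (index 3) for a B1 student (index 2) -> 0.4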

What we tested against

We compared the weighted half-step algorithm against three simpler approaches: full-level jumps (too volatile), a system requiring two consecutive correct answers to move up (too punishing: it creates a persistent downward drag), and an unweighted half-step system where right and wrong answers move the ladder equally (accurate, but it tends to grade slightly high, because lucky guesses carry the same weight as genuine mistakes). Each was tested across 1,000 simulated sessions per student profile.
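
The simulation harness itself is conceptually simple. The sketch below is our simplified reconstruction rather than the exact code we ran: it plays 1,000 sessions per true level using the probability model above and counts how often the weighted half-step algorithm lands on the correct badge. Because details such as how questions are pitched within a level are simplified here, the figures will vary from run to run and won't match the table below precisely.

    import random

    CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]

    def p_correct(question_level, true_level):
        """Same estimated figures as the probability model table above."""
        diff = question_level - true_level
        if diff <= -2:
            return 0.95
        if diff == -1:
            return 0.90
        if diff == 0:
            return 0.70
        if diff == 1:
            return 0.40
        return 0.10

    def simulate_session(true_level, num_questions=15):
        """One simulated sitting of the weighted half-step algorithm."""
        step = 2                                    # every test starts at low A2
        for _ in range(num_questions):
            question_level = step // 2              # questions pitched at the current CEFR level
            correct = random.random() < p_correct(question_level, true_level)
            step = max(0, min(11, step + (1 if correct else -2)))
        return step // 2                            # final badge, as a CEFR index

    def hit_rate(true_level, sessions=1000):
        hits = sum(simulate_session(true_level) == true_level for _ in range(sessions))
        return hits / sessions

    for name in ("A2", "B1", "C2"):
        print(name, hit_rate(CEFR.index(name)))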

Results

We ran 1,000 simulated tests for each of three student profiles: a true A2 student, a true B1 student, and a true C2 student. The question is: across a thousand attempts, how often does each student land on their correct level?

Student's true level | Hit rate (weighted half-step) | Hit rate (best alternative) | Average result
A2 | 65% | 50% | High A2
B1 | 64% | 60% | Mid B1
C2 | 62% | 61% | Low C2

A true A2 student lands on A2 in 65 out of 100 tests, and virtually all of the remaining 35 land on either A1 or B1, just one level off. A true B1 student gets B1 64% of the time, with misses almost entirely falling on A2. These are the students who matter most: A2 and B1 learners make up the bulk of traffic on any English-learning website, and for them, the algorithm performs at its strongest.

The trade-off is at C2, where the algorithm's cautious weighting slows the climb: 62% hit rate versus 61% for the unweighted version. A C2 student has to travel the entire ladder from A2 to C2 in about 15 questions, and the heavier wrong-answer penalty makes that harder. But genuine C2 students are rare among online test-takers, and the average result of 'low C2' shows the algorithm is still doing its job well.

To put it another way: could an A2 student fluke their way to B2? It's technically possible, in the same way you could flip a coin and get six heads in a row, but in a thousand simulated tests of a true A2 student it happened roughly 1% of the time. The algorithm doesn't eliminate luck, but it makes luck very expensive.
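
If you want to check that figure yourself, a few lines added to the simulation sketch above will do it (this reuses simulate_session and CEFR from that sketch, so it isn't standalone, and the exact number depends on the simplified assumptions):

    a2 = CEFR.index("A2")
    results = [simulate_session(a2) for _ in range(1000)]
    flukes = sum(level >= CEFR.index("B2") for level in results)   # final badge of B2 or higher
    print(flukes / 1000)   # typically somewhere around 0.01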

Honest Limitations

No 15-question test can match the precision of a comprehensive assessment, and we don't pretend otherwise. Here's what to keep in mind:

If we wanted higher accuracy, we could ask 30 questions, where the hit rate climbs noticeably, or 100, where it becomes near-certain. But few students will sit through a hundred-question level test, and rightly so. The real design question isn't "how accurate can we get?" but "what's the smallest number of questions at which accuracy is good enough to stop asking?". We believe 12–16 is that sweet spot: enough data for the algorithm to land reliably, short enough that students actually finish.

The result is a best estimate, not a diagnosis. Two students with the same true level might get different results on different days, just as they might score differently on any test. The algorithm finds the right level roughly two-thirds of the time, and when it misses, it's almost always by a single level. And as noted earlier, question quality matters as much as the algorithm: no update rule can rescue an ambiguous or poorly pitched question.

The algorithm grades cautiously by design. If you feel the result is a touch low, it may well be. This is preferable to the alternative: an algorithm that flatters students with an inflated grade serves nobody well.

Analysis based on 1,000 simulated test sessions per algorithm per student profile (12,000 total simulations). Probability model uses estimated difficulty curves; results are indicative of relative algorithm performance rather than absolute prediction of real-world outcomes.
