April 2026

Your model is 90% confident. It's guessing.

Confidence scores collapse three fundamentally different situations into one number. Here's what they hide — and what to do instead.

A loan application arrives. The model says 90% — approve. The loan officer approves. The borrower defaults.

The model wasn't malfunctioning. Its overall accuracy was 78%, standard for the German Credit dataset. The problem wasn't the model; it was the answer. Not a wrong answer: the wrong kind of answer. A model doesn't become more accurate by returning a better number. It becomes useful by saying what kind of answer it's giving.

Three situations, one number

A confidence score answers: how sure are you? This collapses three fundamentally different situations into a single axis.

Situation 1. Seven rules fire for class A. One fires for class B. The evidence converges on one answer. This isn't just high confidence — it's a case where one answer has been singled out by the evidence. Score: 92%.

Situation 2. Seven rules fire for class A, five for class B, one for class C. No decisive winner, but the field has been narrowed to two. The useful information — it's A or B, not C — is present in the evidence and destroyed by the score. Score: 62%.

Situation 3. Two rules fire total. One for A, one for B. The model has almost nothing to work with. This isn't ambiguity — it's absence. Score: 59%.

Situations 2 and 3 are three points apart. An operator with a 60% threshold treats them identically. But one warrants a shortlist and the other warrants escalation to a human. The score can't distinguish them. It wasn't designed to.
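The collapse is easy to reproduce with a toy scorer. Below, confidence is computed as the winning class's share of fired rules, a deliberately simplified stand-in for whatever the real scorer does, so the numbers won't match the percentages quoted above; the point is only that the narrowed and sparse cases land within a few points of each other while the converging case stands apart.

```python
def confidence(votes):
    """Top class's share of total rule votes, as a rounded percentage.
    A hypothetical scorer for illustration, not any model's actual one."""
    total = sum(votes.values())
    return round(100 * max(votes.values()) / total)

# Hypothetical rule-vote counts mirroring the three situations above.
situations = {
    "converging": {"A": 7, "B": 1},          # Situation 1: evidence converges
    "narrowed":   {"A": 7, "B": 5, "C": 1},  # Situation 2: field narrowed to two
    "sparse":     {"A": 1, "B": 1},          # Situation 3: almost no evidence
}

for name, votes in situations.items():
    print(name, confidence(votes))
# converging 88, narrowed 54, sparse 50
```

The narrowed and sparse cases score four points apart here, yet one carries a usable shortlist and the other carries nothing. The score preserves neither fact.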

The cost

So operators invent thresholds. Above 70%, auto-approve. Below 40%, deny. Between: review. The thresholds are drawn the same way regardless of whether the model's 65% reflects genuine ambiguity between two options or total evidential poverty. In credit risk, this means loans approved on confident guesses — the threshold can't distinguish a model that has narrowed the field from a model that is guessing.
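The threshold policy above can be written in a few lines, which makes the failure concrete. The cutoffs are the hypothetical ones from the text:

```python
def route(score):
    """Threshold routing as described above (illustrative cutoffs)."""
    if score >= 0.70:
        return "auto-approve"
    if score < 0.40:
        return "deny"
    return "review"

print(route(0.62))  # Situation 2, genuine ambiguity -> "review"
print(route(0.59))  # Situation 3, evidential poverty -> "review"
```

Both cases land in the same bucket. The router has no input that could tell them apart, because the score already threw that information away.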

Types, not scores

We built a classifier that returns a different kind of answer. Not a better number — a type.

CERTAIN: the evidence singles out one class. The margin is decisive. Act on it.

PARTIAL: the evidence narrows the field but doesn't decide. Here's the shortlist — a human picks.

UNCERTAIN: the evidence is insufficient. The model has nothing to offer. Escalate.

These aren't thresholds on a score. They're structural properties of the evidence — determined by how the evidence distributes across classes, whether it converges, narrows, or fails to discriminate. The classifier builds interpretable rules from the training data and evaluates each new sample against all of them. The type comes from the pattern of agreement and conflict across rules, not from a probability cutoff.
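The idea of typing from vote structure can be sketched as follows. The criteria here (min_rules, decisive_margin, shortlist_margin) are illustrative assumptions, not Tercet's actual rules; the point is that the output is a type plus a payload, derived from how votes distribute, not a cutoff on one number.

```python
def classify_type(votes, min_rules=3, decisive_margin=0.5, shortlist_margin=0.2):
    """Return (type, payload) from per-class rule-vote counts.
    Thresholds are hypothetical, chosen only to illustrate the mechanism."""
    total = sum(votes.values())
    if total < min_rules:
        return "UNCERTAIN", None           # too little evidence to say anything
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    top_class, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if (top - runner_up) / total >= decisive_margin:
        return "CERTAIN", top_class        # evidence singles out one class
    # keep every class whose vote share is close to the leader's
    shortlist = [c for c, v in ranked if (top - v) / total <= shortlist_margin]
    if len(shortlist) < len(votes):
        return "PARTIAL", shortlist        # field narrowed; a human picks
    return "UNCERTAIN", None               # votes don't discriminate at all

print(classify_type({"A": 7, "B": 1}))          # ('CERTAIN', 'A')
print(classify_type({"A": 7, "B": 5, "C": 1}))  # ('PARTIAL', ['A', 'B'])
print(classify_type({"A": 1, "B": 1}))          # ('UNCERTAIN', None)
```

Run against the three situations from earlier, the sketch recovers exactly the distinctions the score destroyed: a decision, a shortlist, and an escalation.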

What this looks like

We ran this on the German Credit dataset. The model with typed uncertainty didn't achieve higher overall accuracy. It achieved a different kind of result: when it committed — when it returned CERTAIN — it was right 95.7% of the time.

The 43 applications that a standard classifier approved with high confidence and that later defaulted? Every one of them was flagged as PARTIAL or UNCERTAIN.

Not by a better model. By a model that could say "I don't know this one" instead of "probably yes." The overall accuracy is the same. What changes is that the model now tells you when it's operating outside its range.

Try it

Tercet is a production API. Every prediction returns a type. Upload your own dataset and see what the evidence structure looks like — no signup required.

Try it with your data

Built on research from Symbolics Lab.