Teaching language models to navigate and prune formal proof search spaces.
Most generative models guess. GHMI Labs rejects that premise. Our systems operate inside the proof assistant's kernel, using it as the ground truth for every step. We do not generate 'likely' proofs. We generate verified ones.
BELOMONT-1 & 2
Autoformalisation models convert natural-language mathematics into Lean-style formal statements. BELOMONT-1 is supervised on NL↔formal pairs; BELOMONT-2 augments this with synthetic paraphrases and adversarial examples drawn from failed formalisation attempts.
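As a minimal illustration of the target format (an example of ours, not a corpus pair, and assuming Mathlib for `Even`): the natural-language statement "the sum of two even numbers is even" rendered as a well-typed Lean 4 goal, with the proof deliberately left open for the downstream prover.

```lean
import Mathlib

-- NL input: "The sum of two even numbers is even."
-- A BELOMONT-style target is the well-typed statement itself;
-- discharging the goal is the downstream prover's job.
theorem even_add_even (m n : ℕ) (hm : Even m) (hn : Even n) :
    Even (m + n) := by
  sorry
```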
PREAKNESS-1 (Policy)
PREAKNESS learns to act inside the proof assistant environment. It proposes promising tactics and decompositions, guided by curriculum-generated tasks. A search loop attempts tactics under kernel supervision, learning directly from accept/reject signals.
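A minimal sketch of that loop, assuming a hypothetical kernel interface (`Kernel.init`, `Kernel.apply`, `Kernel.is_closed`) and policy interface (`Policy.propose`, `Policy.record`); none of these names are a published GHMI API.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    tactics: list[str] = field(default_factory=list)

def search(goal, policy, kernel, max_steps: int = 64):
    """Attempt tactics under kernel supervision; learn from accept/reject."""
    state, trace = kernel.init(goal), Trace()
    for _ in range(max_steps):
        tactic = policy.propose(state)           # neural prior over next tactics
        ok, next_state = kernel.apply(state, tactic)
        policy.record(state, tactic, reward=1.0 if ok else 0.0)
        if not ok:
            continue                             # rejection is an informative negative
        trace.tactics.append(tactic)
        state = next_state
        if kernel.is_closed(state):              # all goals discharged
            return trace
    return None                                  # search budget exhausted
```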
DERBY-1 (Critic)
DERBY estimates long-horizon value: how likely a partial proof state is to close. It shapes search, prunes dead branches, and prefers shorter, more stable proofs. It does not approximate truth, only promise.
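One way to picture the pruning step, under the same hypothetical interfaces as above, with `critic.value(s)` standing in for DERBY's estimate that a state eventually closes:

```python
def prune(frontier, critic, floor: float = 0.05, beam: int = 32):
    """Drop branches whose estimated value falls below a survival floor,
    then keep a best-first beam of the rest."""
    scored = sorted(((critic.value(s), s) for s in frontier),
                    key=lambda vs: vs[0], reverse=True)
    return [s for v, s in scored if v >= floor][:beam]
```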
EQUUS-1 (Orchestrator)
EQUUS orchestrates the full pipeline. It takes a conjecture, calls BELOMONT for formalisation, invokes PREAKNESS for search, and uses DERBY as a value critic to guide exploration. The entire loop is differentiable where useful, symbolic where necessary.
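An end-to-end sketch of that orchestration, reusing the hypothetical interfaces above (`formalise`, `accepts`, `verify` are assumed names, not a published API):

```python
def prove(statement_nl: str, belomont, preakness, derby, kernel):
    """Formalise, search, and return only kernel-verified traces."""
    goal = belomont.formalise(statement_nl)       # NL -> Lean-style formal goal
    if goal is None or not kernel.accepts(goal):  # the goal must type-check first
        return None
    trace = preakness.search(goal, critic=derby, kernel=kernel)
    if trace is not None:
        assert kernel.verify(goal, trace)         # the kernel is the ground truth
    return trace
```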
Neuro-Symbolic Kernel Loop
All model actions are executed inside the proof assistant kernel. Tactics that fail are discarded; tactics that progress inform the learning signal. The kernel is the source of truth and the primary RL signal.
Verified Mathematics
The system generates proofs that are not 'likely correct', but kernel-verified. Every tactic, lemma, and term is mechanically validated—nothing passes unless Lean accepts it.
EQUUS-1: Neuro-Symbolic Proof Search at Scale
4.1 The Formalisation Gap
Let \( N \) be the space of natural language mathematical statements, and \( G \) the space of well-typed formal goals in a kernel \( K \) (e.g. Lean 4). The GHMI stack first learns an autoformalisation map \( \Phi : N \to G \) (the BELOMONT family), and then solves these goals inside the proof environment using a policy-critic pair (PREAKNESS, DERBY).
For a given statement \( s \in N \), we require that the generated goal \( \Phi(s) \) be accepted by the kernel and admit a proof under our search procedure. We write \( \operatorname{accept}_K(\cdot) \) for the kernel's judgement of well-typedness and provability. Over a distribution \( \mathcal{D} \) of statements drawn from \( N \),
\[
\epsilon \;=\; \Pr_{s \sim \mathcal{D}}\bigl[\neg \operatorname{accept}_K(\Phi(s))\bigr].
\]
Here \( \epsilon \) measures the autoformalisation failure rate from natural language to kernel-validated goals. Classical language models optimise token likelihood \( P(\text{token} \mid \text{context}) \), which need not correlate with either acceptance in \( K \) or downstream provability.
To close this gap, we treat \( K \) as an oracle and cast theorem proving as an environment. PREAKNESS parameterises a policy \( \pi \) over proof states, DERBY provides a value estimate \( V \), and EQUUS orchestrates search (e.g. MCTS, best-first) over the space of tactic sequences, accepting only traces that the kernel validates.
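In the best-first case, one concrete reading of that loop (again with hypothetical `kernel`, `policy`, and `critic` interfaces) orders the frontier by DERBY's value and expands only kernel-accepted states:

```python
import heapq

def best_first(goal, policy, critic, kernel, budget: int = 10_000):
    """Best-first search over tactic sequences, ordered by the critic's value."""
    start = kernel.init(goal)
    tie = 0                                        # tiebreaker: states need not be comparable
    frontier = [(-critic.value(start), tie, start, [])]
    while frontier and budget > 0:
        _, _, state, trace = heapq.heappop(frontier)
        if kernel.is_closed(state):
            return trace                           # kernel-verified by construction
        for tactic in policy.propose_k(state, k=8):
            budget -= 1
            ok, nxt = kernel.apply(state, tactic)
            if not ok:
                continue                           # failed tactics are discarded
            tie += 1
            heapq.heappush(frontier, (-critic.value(nxt), tie, nxt, trace + [tactic]))
    return None
```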
Schematically, search optimises a trace-level objective that rewards kernel acceptance and penalises complexity:
\[
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\bigl[\operatorname{accept}_K(\tau)\bigr] \;-\; \lambda\, \mathbb{E}_{\tau \sim \pi}\bigl[\operatorname{Complexity}(\tau)\bigr].
\]
The complexity penalty favours traces that are short, reuse existing lemmas, and remain stable under small changes to the statement distribution. In practice, we estimate \( \operatorname{Complexity} \) via a mixture of proof length, tactic-level entropy, and reuse statistics across the training corpus.
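One plausible form of that estimate, as a sketch only: the weights and the corpus interface below are assumptions, and identifying tactics with lemma names is a simplification.

```python
import math
from collections import Counter

def complexity(trace, corpus_lemma_counts: Counter,
               w_len: float = 1.0, w_ent: float = 0.5, w_reuse: float = 0.5) -> float:
    """Weighted mix of proof length, tactic-level entropy, and reuse statistics."""
    n = len(trace.tactics)
    if n == 0:
        return 0.0
    freqs = Counter(trace.tactics)
    entropy = -sum((c / n) * math.log(c / n) for c in freqs.values())
    # Reuse: fraction of steps invoking lemmas already common in the corpus.
    reuse = sum(1 for t in trace.tactics if corpus_lemma_counts[t] > 0) / n
    return w_len * n + w_ent * entropy - w_reuse * reuse
```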
We do not ask the model to be 'good at maths' in isolation. We ask it to be good at proposing actions that survive contact with the kernel. Every failure is an informative counter-example; every success becomes a reusable ingredient in later proofs. The loop between simulation and proof is the object of optimisation.
Play with the inspection loop.
This panel does not train a model live (yet), but it exposes the knobs that matter: how much you trust symbolic structure versus neural priors, and what you reward in a proof trace.
Bias the loop
Pick the flavour of prover you would rather work with. We translate that into a notional change in the loss landscape.
Free-form feedback
How would you want a prover to behave? Treat this as high-level RLHF on design goals.