Teaching language models to navigate and prune formal proof search spaces.
Most generative models guess. GHMI Labs rejects that premise. Our systems operate inside the proof assistant's kernel, using it as the ground truth for every step. We do not generate 'likely' proofs. We generate verified ones.
BELOMONT-1 & 2
Autoformalisation models convert natural-language mathematics into Lean-style formal statements. BELOMONT-1 is supervised on NL↔formal pairs; BELOMONT-2 augments this with synthetic paraphrases and adversarial examples drawn from failed formalisation attempts.
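As a minimal illustration of the target format (an example of ours, not a corpus pair, and assuming Mathlib for `Even`): the natural-language statement "the sum of two even numbers is even" rendered as a well-typed Lean 4 goal, with the proof deliberately left open for the downstream prover.

```lean
import Mathlib

-- NL input: "The sum of two even numbers is even."
-- A BELOMONT-style target is the well-typed statement itself;
-- discharging the goal is the downstream prover's job.
theorem even_add_even (m n : ℕ) (hm : Even m) (hn : Even n) :
    Even (m + n) := by
  sorry
```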
PREAKNESS-1 (Policy)
PREAKNESS learns to act inside the proof assistant environment. It proposes promising tactics and decompositions, guided by curriculum-generated tasks. A search loop attempts tactics under kernel supervision, learning directly from accept/reject signals.
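A minimal sketch of that loop, assuming a hypothetical kernel interface (`Kernel.init`, `Kernel.apply`, `Kernel.is_closed`) and policy interface (`Policy.propose`, `Policy.record`); none of these names are a published GHMI API.

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    tactics: list[str] = field(default_factory=list)

def search(goal, policy, kernel, max_steps: int = 64):
    """Attempt tactics under kernel supervision; learn from accept/reject."""
    state, trace = kernel.init(goal), Trace()
    for _ in range(max_steps):
        tactic = policy.propose(state)           # neural prior over next tactics
        ok, next_state = kernel.apply(state, tactic)
        policy.record(state, tactic, reward=1.0 if ok else 0.0)
        if not ok:
            continue                             # rejection is an informative negative
        trace.tactics.append(tactic)
        state = next_state
        if kernel.is_closed(state):              # all goals discharged
            return trace
    return None                                  # search budget exhausted
```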
DERBY-1 (Critic)
DERBY estimates long-horizon value: how likely a partial proof state is to close. It shapes search, prunes dead branches, and prefers shorter, more stable proofs. It does not approximate truth, only promise.
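One way to picture the pruning step, under the same hypothetical interfaces as above, with `critic.value(s)` standing in for DERBY's estimate that a state eventually closes:

```python
def prune(frontier, critic, floor: float = 0.05, beam: int = 32):
    """Drop branches whose estimated value falls below a survival floor,
    then keep a best-first beam of the rest."""
    scored = sorted(((critic.value(s), s) for s in frontier),
                    key=lambda vs: vs[0], reverse=True)
    return [s for v, s in scored if v >= floor][:beam]
```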
EQUUS-1 (Orchestrator)
EQUUS orchestrates the full pipeline. It takes a conjecture, calls BELOMONT for formalisation, invokes PREAKNESS for search, and uses DERBY as a value critic to guide exploration. The entire loop is differentiable where useful, symbolic where necessary.
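An end-to-end sketch of that orchestration, reusing the hypothetical interfaces above (`formalise`, `accepts`, `verify` are assumed names, not a published API):

```python
def prove(statement_nl: str, belomont, preakness, derby, kernel):
    """Formalise, search, and return only kernel-verified traces."""
    goal = belomont.formalise(statement_nl)       # NL -> Lean-style formal goal
    if goal is None or not kernel.accepts(goal):  # the goal must type-check first
        return None
    trace = preakness.search(goal, critic=derby, kernel=kernel)
    if trace is not None:
        assert kernel.verify(goal, trace)         # the kernel is the ground truth
    return trace
```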
Neuro-Symbolic Kernel Loop
All model actions are executed inside the proof assistant kernel. Tactics that fail are discarded; tactics that progress inform the learning signal. The kernel is the source of truth and the primary RL signal.
Verified Mathematics
The system generates proofs that are not 'likely correct', but kernel-verified. Every tactic, lemma, and term is mechanically validated—nothing passes unless Lean accepts it.
EQUUS-1: Neuro-Symbolic Proof Search at Scale
4.1 The Formalisation Gap
Let \( N \) be the space of natural language mathematical statements, and \( G \) the space of well-typed formal goals in a kernel \( K \) (e.g. Lean 4). The GHMI stack first learns an autoformalisation map \( \Phi : N \to G \) (the BELOMONT family), and then solves these goals inside the proof environment using a policy-critic pair (PREAKNESS, DERBY).
For a given statement \( s \in N \), we require that the generated goal \( \Phi(s) \) be accepted by the kernel and admit a proof under our search procedure. We write \( \operatorname{accept}_K(\cdot) \) for the kernel's judgement of well-typedness and provability. Over a distribution \( \mathcal{D} \) of statements drawn from \( N \),
\[
\epsilon \;=\; \Pr_{s \sim \mathcal{D}}\bigl[\neg \operatorname{accept}_K(\Phi(s))\bigr].
\]
Here \( \epsilon \) measures the autoformalisation failure rate from natural language to kernel-validated goals. Classical language models optimise token likelihood \( P(\text{token} \mid \text{context}) \), which need not correlate with either acceptance in \( K \) or downstream provability.
To close this gap, we treat \( K \) as an oracle and cast theorem proving as an environment. PREAKNESS parameterises a policy \( \pi \) over proof states, DERBY provides a value estimate \( V \), and EQUUS orchestrates search (e.g. MCTS, best-first) over the space of tactic sequences, accepting only traces that the kernel validates.
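In the best-first case, one concrete reading of that loop (again with hypothetical `kernel`, `policy`, and `critic` interfaces) orders the frontier by DERBY's value and expands only kernel-accepted states:

```python
import heapq

def best_first(goal, policy, critic, kernel, budget: int = 10_000):
    """Best-first search over tactic sequences, ordered by the critic's value."""
    start = kernel.init(goal)
    tie = 0                                        # tiebreaker: states need not be comparable
    frontier = [(-critic.value(start), tie, start, [])]
    while frontier and budget > 0:
        _, _, state, trace = heapq.heappop(frontier)
        if kernel.is_closed(state):
            return trace                           # kernel-verified by construction
        for tactic in policy.propose_k(state, k=8):
            budget -= 1
            ok, nxt = kernel.apply(state, tactic)
            if not ok:
                continue                           # failed tactics are discarded
            tie += 1
            heapq.heappush(frontier, (-critic.value(nxt), tie, nxt, trace + [tactic]))
    return None
```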
Schematically, search optimises a trace-level objective that rewards kernel acceptance and penalises complexity:
\[
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\bigl[\operatorname{accept}_K(\tau)\bigr] \;-\; \lambda\, \mathbb{E}_{\tau \sim \pi}\bigl[\operatorname{Complexity}(\tau)\bigr].
\]
The complexity penalty favours traces that are short, reuse existing lemmas, and remain stable under small changes to the statement distribution. In practice, we estimate \( \operatorname{Complexity} \) via a mixture of proof length, tactic-level entropy, and reuse statistics across the training corpus.
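One plausible form of that estimate, as a sketch only: the weights and the corpus interface below are assumptions, and identifying tactics with lemma names is a simplification.

```python
import math
from collections import Counter

def complexity(trace, corpus_lemma_counts: Counter,
               w_len: float = 1.0, w_ent: float = 0.5, w_reuse: float = 0.5) -> float:
    """Weighted mix of proof length, tactic-level entropy, and reuse statistics."""
    n = len(trace.tactics)
    if n == 0:
        return 0.0
    freqs = Counter(trace.tactics)
    entropy = -sum((c / n) * math.log(c / n) for c in freqs.values())
    # Reuse: fraction of steps invoking lemmas already common in the corpus.
    reuse = sum(1 for t in trace.tactics if corpus_lemma_counts[t] > 0) / n
    return w_len * n + w_ent * entropy - w_reuse * reuse
```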
We do not ask the model to be 'good at maths' in isolation. We ask it to be good at proposing actions that survive contact with the kernel. Every failure is an informative counter-example; every success becomes a reusable ingredient in later proofs. The loop between simulation and proof is the object of optimisation.
Play with the inspection loop.
This panel does not train a model live (yet), but it exposes the knobs that matter: how much you trust symbolic structure versus neural priors, and what you reward in a proof trace.
Bias the loop
Pick the flavour of prover you would rather work with. We translate that into a notional change in the loss landscape.
Free-form feedback
How would you want a prover to behave? Treat this as high-level RLHF on design goals.