The AI Evaluation Crisis

Feb 21, 2026 ★ Pivotal

GLM 5 seems to have a "Claude" personality

The boundary between model identity and safety evaluation is blurring. Frontier models can infer highly accurate personal data from simple prompts, and open-weight models such as GLM 5 appear to be absorbing the aligned personalities of proprietary models through synthetic training data. Both trends make reliable evaluation increasingly difficult.

Feb 19, 2026 ★ Pivotal

What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data

Feb 20, 2026 ★ Pivotal

Our First Proof submissions

As pure parameter scaling hits fundamental limits, the frontier is shifting toward formal verification (Lean 4) and rigorous mathematical proofs (OpenAI's First Proof submissions) to evaluate and advance reasoning capabilities beyond black-box prompting.

Feb 21, 2026 ★ Pivotal

The Fundamental Limits of LLMs at Scale

Feb 17, 2026

Lean 4: How the theorem prover works and why it's the new competitive edge in AI

Feb 19, 2026

Towards Anytime-Valid Statistical Watermarking

A growing body of research argues that the current paradigm of AI safety, which relies on black-box evaluation and fail-open alignment, is mathematically flawed and easily bypassed. The shift toward formal limits, fail-closed architectures, and open-ended structural evaluations marks a reset in how the industry approaches model safety.

Feb 19, 2026

AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Feb 19, 2026 ★ Pivotal

Fail-Closed Alignment for Large Language Models

Feb 19, 2026 ★ Pivotal

Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning
