🏟️ Intelligence Arena
Empirical I-Vector scoring across frontier AI models
I = (I_L, I_M, I_S, I_K, I_N, I_A, I_P, I_IE; plus I_Pr, I_Σ, I_μ, I_E for humans) ∈ ℝⁿ⁽ᵉ⁾, where n(e) is the number of dimensions for entity e (n = 12 for humans, variable for other entities)
How It Works
Every intelligence, human or artificial, can be measured as a vector in a variable-dimensional cognitive manifold. Each dimension captures a distinct cognitive capability. The Arena scores frontier AI models on these dimensions using published benchmarks and live testing.
Scores run from 1 to 10. A score of 10 means "best currently observed." Scores are relative, not absolute: as models improve, the scale shifts.
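One way to read the relative scale is as a rescaling against the best result observed so far. The sketch below is an assumption, not the Arena's published method: the function name `relative_score`, the linear min-max mapping, and the sample numbers are all hypothetical.

```python
def relative_score(raw: float, best_observed: float, worst_observed: float = 0.0) -> float:
    """Map a raw benchmark result onto the 1-10 scale, where 10 is the best observed.

    Assumption: linear min-max rescaling; the Arena does not state its formula.
    """
    if best_observed <= worst_observed:
        raise ValueError("best_observed must exceed worst_observed")
    # Clamp so results outside the observed range still land within 1-10.
    clamped = min(max(raw, worst_observed), best_observed)
    fraction = (clamped - worst_observed) / (best_observed - worst_observed)
    return round(1.0 + 9.0 * fraction, 1)


# Hypothetical raw accuracies (percent) on a single benchmark.
raw_results = {"model_a": 92.0, "model_b": 69.0, "model_c": 46.0}
best = max(raw_results.values())
worst = min(raw_results.values())
scores = {name: relative_score(value, best, worst) for name, value in raw_results.items()}
```

Under this assumed mapping, `model_a` holds a 10 only while 92.0 is the best observed result. If a later model reached 100.0, the same raw result would rescale to 8.7, which is the sense in which the scale shifts.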
Scoring Dimensions
I_L — Linguistic · Language generation, translation, style, rhetoric
I_M — Mathematical · Proofs, computation, formal reasoning
I_S — Spatial · Vision, geometry, spatial reasoning
I_K — Kinesthetic · Embodied/robotics tasks (limited for LLMs)
I_N — Naturalistic · Pattern recognition, taxonomy, classification
I_A — Abstract · Algorithmic thinking, code, architecture
I_P — Interpersonal · Social reasoning, theory of mind
I_IE — Interoceptive · Emotional reasoning, affect modeling
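The eight dimensions above can be held in a small record type. The following is a minimal sketch, assuming a Python dataclass in which `None` marks a dimension that has not been scored yet; the class name `IVector`, its helper methods, and the example values are hypothetical and not part of the Arena.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class IVector:
    """Eight Arena dimensions; None marks a dimension that is not yet scored."""
    I_L: Optional[float] = None   # Linguistic
    I_M: Optional[float] = None   # Mathematical
    I_S: Optional[float] = None   # Spatial
    I_K: Optional[float] = None   # Kinesthetic
    I_N: Optional[float] = None   # Naturalistic
    I_A: Optional[float] = None   # Abstract
    I_P: Optional[float] = None   # Interpersonal
    I_IE: Optional[float] = None  # Interoceptive

    def __post_init__(self) -> None:
        # Enforce the 1-10 range on every dimension that has a score.
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None and not 1 <= value <= 10:
                raise ValueError(f"{f.name} must be within 1-10, got {value}")

    def scored(self) -> Dict[str, float]:
        """Return only the dimensions that have a score."""
        return {f.name: getattr(self, f.name) for f in fields(self)
                if getattr(self, f.name) is not None}

    def unscored(self) -> List[str]:
        """Return the names of dimensions still waiting for a benchmark."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Hypothetical values for illustration only; these are not Arena results.
example = IVector(I_L=9, I_M=8, I_A=9)
```

Keeping unscored dimensions as `None` rather than 0 avoids confusing "no benchmark yet" with "scored at the bottom of the scale".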
Model Rosters
🇨🇳 Chinese Models
V4 imminent — expected to be a major leap. Open-weight, MIT license.
Most downloaded model family on Hugging Face. Powers 90K+ enterprises. ~60% cheaper than predecessor.
~1/7 Opus price. Hundreds of sequential tool calls. Best open model by some rankings. Exceptional writing quality.
Approaches Opus 4.5 on coding benchmarks (Zhipu claim). Publicly traded in Hong Kong.
🇺🇸 Western Models
Scoring pending
| Model | Developer | Role in Network | Est. Strengths |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Builder, wiki, LaTeX | I_L, I_A, I_M |
| Gemini | Google DeepMind | Adversarial math review | I_M, I_S |
| o3 / GPT-4 | OpenAI | Abstract reasoning | I_M, I_A |
| Grok | xAI | Fast iteration, search | I_L, I_N |
Full I-vector scoring coming soon.
Gaps & Next Steps
Dimensions not yet fully scorable for any model:
- I_K (Kinesthetic): Requires embodiment or robotics benchmarks. Alibaba's RynnBrain and Google's robotics models may unlock this.
- I_P (Interpersonal): Needs social reasoning / theory-of-mind test suites.
- I_IE (Interoceptive): Needs emotional reasoning benchmarks beyond sentiment analysis.
- I_S (Spatial): Partially scorable via vision tasks. Needs dedicated spatial reasoning benchmarks.
Planned:
- Score Western frontier models with full I-vector profiles
- Run head-to-head arena matches (same prompt, blind scoring)
- Score DeepSeek V4 on release
- Design proxy tasks for I_K (kinesthetic via code-based robotics?)
- Build interactive radar chart visualization
- Publish methodology paper
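The planned head-to-head matches (same prompt, blind scoring) could be prepared along the following lines. This is a sketch under stated assumptions rather than the Arena's design: `blind_match`, `reveal`, the "Response A/B" labels, and the model names are all hypothetical.

```python
import random
from typing import Dict, Optional, Tuple


def blind_match(prompt: str, responses: Dict[str, str],
                rng: Optional[random.Random] = None) -> Tuple[dict, Dict[str, str]]:
    """Hide model names behind shuffled labels for one match.

    `responses` maps model name -> response text for the same prompt.
    Returns (ballot, key): the ballot is what the scorer sees, and the
    key maps each anonymous label back to its model.
    """
    rng = rng or random.Random()
    models = list(responses)
    rng.shuffle(models)  # random order, so a label reveals nothing about the model
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(models))]
    key = dict(zip(labels, models))
    ballot = {
        "prompt": prompt,
        "responses": {label: responses[model] for label, model in key.items()},
    }
    return ballot, key


def reveal(winning_label: str, key: Dict[str, str]) -> str:
    """Translate the scorer's chosen label back into a model name."""
    return key[winning_label]


# Hypothetical match with placeholder model names and responses.
answers = {"model_x": "Answer from X", "model_y": "Answer from Y"}
ballot, key = blind_match("Explain the pigeonhole principle.", answers, random.Random(0))
```

The scorer receives only `ballot`; `key` stays with the match organizer until scoring is complete, and `reveal` is called afterwards.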