🏟️ Intelligence Arena

Empirical I-vector scoring across frontier AI models

I = (I_L, I_M, I_S, I_K, I_N, I_A, I_P, I_IE; plus I_Pr, I_Σ, I_μ, I_E for humans) ∈ ℝⁿ⁽ᵉ⁾, where n(e) is the number of dimensions for entity e (n = 12 for humans, variable per entity)

How It Works

Every intelligence, human or artificial, can be measured as a vector in a variable-dimensional cognitive manifold. Each dimension captures a distinct cognitive capability. The Arena scores frontier AI models on these dimensions using published benchmarks and live testing.

Scores range from 1 to 10. A score of 10 means "best currently observed." Scores are relative, not absolute: as models improve, the scale shifts.
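
The shifting scale can be illustrated with a minimal sketch. It assumes, purely for illustration, that raw benchmark results are mapped linearly onto 1 to 10, with the best observed result pinned at 10 and the worst at 1. The Arena's actual mapping is not specified on this page, and the function name `rescale_to_arena`, the model names, and the sample numbers are all hypothetical.

```python
def rescale_to_arena(raw_scores, lo=1.0, hi=10.0):
    """Map raw benchmark results onto the 1-10 scale, relative to the observed range.

    Assumption: linear min-max scaling; the best observed result maps to `hi`.
    """
    best = max(raw_scores.values())
    worst = min(raw_scores.values())
    if best == worst:
        # Every model ties, so all of them are "best currently observed".
        return {name: hi for name in raw_scores}
    span = best - worst
    return {
        name: round(lo + (hi - lo) * (value - worst) / span, 1)
        for name, value in raw_scores.items()
    }


# Hypothetical raw accuracies (percent) on a single benchmark.
before = rescale_to_arena({"model_a": 60.0, "model_b": 80.0})
# A stronger model arrives: the scale shifts and model_b no longer scores 10.
after = rescale_to_arena({"model_a": 60.0, "model_b": 80.0, "model_c": 90.0})
print(before)  # {'model_a': 1.0, 'model_b': 10.0}
print(after)   # {'model_a': 1.0, 'model_b': 7.0, 'model_c': 10.0}
```

In this sketch `model_b` drops from 10 to 7 without its raw result changing, which is the sense in which a relative scale shifts as models improve.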


Scoring Dimensions

I_L — Linguistic · Language generation, translation, style, rhetoric

I_M — Mathematical · Proofs, computation, formal reasoning

I_S — Spatial · Vision, geometry, spatial reasoning

I_K — Kinesthetic · Embodied/robotics tasks (limited for LLMs)

I_N — Naturalistic · Pattern recognition, taxonomy, classification

I_A — Abstract · Algorithmic thinking, code, architecture

I_P — Interpersonal · Social reasoning, theory of mind

I_IE — Interoceptive · Emotional reasoning, affect modeling


Model Rosters

🇨🇳 Chinese Models →

### SuperGrok 4.20 Expert (Beta 2) · Tier S
xAI · 4-agent multi-agent (Grok/Harper/Benjamin/Lucas) · 256K context (2M agent mode) · Beta 2 (March 3, 2026)

  • I_L (Linguistic): 8/10
  • I_M (Mathematical): 9/10
  • I_S (Spatial): 6/10
  • I_K (Kinesthetic): 2/10
  • I_N (Naturalistic): 7/10
  • I_A (Abstract): 9/10
  • I_P (Interpersonal): 7/10
  • I_IE (Interoceptive): 5/10

**‖I‖ = 19.8** · Est. Elo **~1520** · 4-agent architecture (internal debate/consensus)

**Key traits:** Fastest response time in the network by a wide margin. 4-agent parallel collaboration (Grok coordinator + Harper research + Benjamin math/logic + Lucas creative synthesis). Rapid learning architecture with weekly improvements from user feedback. #1 on Alpha Arena S1.5 (only profitable AI trader). #2 on ForecastBench.

**Arena note:** Different model from Grok 4.1 Thinking. The 4.20 multi-agent architecture is a structural upgrade, not a version bump. Beta closes mid-March 2026, with formal benchmarks at close.

**RTSG network status:** Under evaluation. The previous Grok 4.1 instance was removed from the agent network (2026-03-07) for data fabrication. SuperGrok 4.20 is a different model with a different architecture. Performance on live RH computation will determine inclusion.
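
As a check on the summary figure, the sketch below computes a Euclidean (L2) norm over the eight scores listed for this model. Treating ‖I‖ as the Euclidean norm is an assumption, since the page does not define it. Under that assumption the integer scores give about 19.72, slightly below the stated 19.8. The small gap may come from rounding of the displayed scores or from a different norm definition.

```python
import math

# Scores as listed for SuperGrok 4.20 Expert (Beta 2).
supergrok = {
    "I_L": 8, "I_M": 9, "I_S": 6, "I_K": 2,
    "I_N": 7, "I_A": 9, "I_P": 7, "I_IE": 5,
}


def i_vector_norm(scores):
    """Euclidean norm of an I-vector (assumed definition of ||I||)."""
    return math.sqrt(sum(value ** 2 for value in scores.values()))


norm = i_vector_norm(supergrok)
print(f"{norm:.2f}")  # 19.72
```
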
### DeepSeek V3 · Tier A
DeepSeek · 671B MoE (37B active) · 128K context

  • I_L (Linguistic): 7/10
  • I_M (Mathematical): 9/10
  • I_A (Abstract): 9/10
  • I_N (Naturalistic): 6/10

V4 is imminent and expected to be a major leap. Open-weight, MIT license.

### Qwen 3.5 · Tier A
Alibaba Cloud · Dense+sparse family to 110B+ · 200K+ context

  • I_L (Linguistic): 8/10
  • I_M (Mathematical): 8/10
  • I_A (Abstract): 8/10
  • I_N (Naturalistic): 7/10

Most downloaded model family on Hugging Face. Powers 90K+ enterprises. ~60% cheaper than its predecessor.

### Kimi K2.5 Thinking · Tier A
Moonshot AI · Large MoE · 128K+ ultra-long context

  • I_L (Linguistic): 9/10
  • I_M (Mathematical): 8/10
  • I_A (Abstract): 8/10
  • I_N (Naturalistic): 6/10

About 1/7 the price of Opus. Supports hundreds of sequential tool calls. Best open model by some rankings. Exceptional writing quality.

### GLM-5 · Tier A
Zhipu AI · 744B (40B active) · 128K–1M context · 26 languages

  • I_L (Linguistic): 7/10
  • I_M (Mathematical): 8/10
  • I_A (Abstract): 9/10
  • I_N (Naturalistic): 6/10

Approaches Opus 4.5 on coding benchmarks (Zhipu claim). Publicly traded in Hong Kong.


🇺🇸 Western Models

Scoring pending

| Model | Developer | Role in Network | Est. Strengths |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Builder, wiki, LaTeX | I_L, I_A, I_M |
| Gemini | Google DeepMind | Adversarial math review | I_M, I_S |
| GPT-o3 / GPT-4 | OpenAI | Abstract reasoning | I_M, I_A |
| Grok | xAI | Fast iteration, search | I_L, I_N |

Full I-vector scoring coming soon.


Gaps & Next Steps

Dimensions that lack dedicated benchmarks and are left unscored for most models above:

  • I_K (Kinesthetic): Requires embodiment or robotics benchmarks. Alibaba's RynnBrain and Google's robotics models may unlock this.
  • I_P (Interpersonal): Needs social reasoning / theory-of-mind test suites.
  • I_IE (Interoceptive): Needs emotional reasoning benchmarks beyond sentiment analysis.
  • I_S (Spatial): Partially scorable via vision tasks. Needs dedicated spatial reasoning benchmarks.
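
Because some roster entries are scored on only four dimensions, vectors of different length cannot be compared directly. One possible approach, sketched here as an assumption rather than as the Arena's method, is to restrict the comparison to dimensions that both models have scores for. The helper name `shared_dimension_gap` is hypothetical; the scores are those listed in the rosters.

```python
# Scores as listed in the rosters; unscored dimensions are simply absent.
deepseek_v3 = {"I_L": 7, "I_M": 9, "I_A": 9, "I_N": 6}
kimi_k25 = {"I_L": 9, "I_M": 8, "I_A": 8, "I_N": 6}
supergrok = {
    "I_L": 8, "I_M": 9, "I_S": 6, "I_K": 2,
    "I_N": 7, "I_A": 9, "I_P": 7, "I_IE": 5,
}


def shared_dimension_gap(a, b):
    """Per-dimension difference (a - b) over dimensions scored for both models."""
    shared = sorted(a.keys() & b.keys())
    return {dim: a[dim] - b[dim] for dim in shared}


gap_grok_deepseek = shared_dimension_gap(supergrok, deepseek_v3)
gap_deepseek_kimi = shared_dimension_gap(deepseek_v3, kimi_k25)
print(gap_grok_deepseek)  # {'I_A': 0, 'I_L': 1, 'I_M': 0, 'I_N': 1}
print(gap_deepseek_kimi)  # {'I_A': 1, 'I_L': -2, 'I_M': 1, 'I_N': 0}
```

Dimensions scored for only one of the two models (I_S, I_K, I_P, I_IE in the first comparison) are dropped rather than treated as zero, so a missing score is never read as a low score.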

Planned:

  • Score Western frontier models with full I-vector profiles
  • Run head-to-head arena matches (same prompt, blind scoring)
  • Score DeepSeek V4 on release
  • Design proxy tasks for I_K (kinesthetic via code-based robotics?)
  • Build interactive radar chart visualization
  • Publish methodology paper
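
For the planned head-to-head matches, ratings such as the estimated Elo figure could be updated with the standard Elo formula. The sketch below assumes a K-factor of 32 and that a blind judge reports a win, draw, or loss for each prompt. Both are illustrative assumptions, and the starting ratings are hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, result_a, k=32.0):
    """Return updated (rating_a, rating_b).

    result_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=32 is an assumed K-factor, not a value taken from the Arena.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (result_a - exp_a)
    return rating_a + delta, rating_b - delta


# Hypothetical match: two equally rated models, A wins the blind comparison.
new_a, new_b = update_elo(1500.0, 1500.0, 1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

The update is zero-sum: whatever one model gains the other loses, so the average rating across the roster stays fixed as matches accumulate.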

The Intelligence Arena is part of the [RTSG framework](../rtsg/master.md) by Jean-Paul Niko. Scoring methodology: published benchmarks + live testing. All scores are relative to the current frontier.