🏟️ Intelligence Arena

Empirical I-Vector scoring across frontier AI models

I = (I_L, I_M, I_S, I_K, I_N, I_A, I_P, I_IE; plus I_Pr, I_Σ, I_μ, I_E for humans) ∈ ℝⁿ⁽ᵉ⁾, where n(e) is the number of dimensions defined for entity e (n = 12 for humans, variable per entity)

How It Works

Every intelligence, human or artificial, can be measured as a vector in a variable-dimensional cognitive manifold. Each dimension captures a distinct cognitive capability. The Arena scores frontier AI models on these dimensions using published benchmarks and live testing.

Scores are 1–10. A score of 10 means "best currently observed." These are relative, not absolute — as models improve, the scale shifts.
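Because scores are relative to the best currently observed result, the same raw benchmark result can map to a lower score once a stronger model appears. The page does not say how raw results are converted to the 1–10 scale, so the sketch below assumes a simple linear rescale against the best observed raw score. `relative_score` and its parameters are hypothetical names used only for illustration.

```python
def relative_score(raw: float, best_observed: float) -> int:
    """Map a raw benchmark result onto the 1-10 scale.

    Assumption (not from the Arena methodology): a linear rescale in which
    the best observed raw result maps to 10, clamped to the range 1..10.
    """
    if best_observed <= 0:
        raise ValueError("best_observed must be positive")
    scaled = round(10 * raw / best_observed)
    return max(1, min(10, scaled))


# The same raw result scores lower once the frontier moves.
print(relative_score(90, best_observed=90))   # 10
print(relative_score(90, best_observed=100))  # 9
```

Under this assumption a score is only comparable with other scores taken against the same frontier snapshot.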

Scoring Dimensions

I_L — Linguistic · Language generation, translation, style, rhetoric

I_M — Mathematical · Proofs, computation, formal reasoning

I_S — Spatial · Vision, geometry, spatial reasoning

I_K — Kinesthetic · Embodied/robotics tasks (limited for LLMs)

I_N — Naturalistic · Pattern recognition, taxonomy, classification

I_A — Abstract · Algorithmic thinking, code, architecture

I_P — Interpersonal · Social reasoning, theory of mind

I_IE — Interoceptive · Emotional reasoning, affect modeling
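The eight dimensions above can be held in a small data structure that tolerates partially scored models. This is a minimal sketch, not the Arena's actual implementation: the `IVector` class, the `DIMENSIONS` table, and the `unscored` helper are hypothetical names, and the 1–10 range check follows the scale described earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Dimension codes and labels as listed in the Scoring Dimensions section.
DIMENSIONS: Dict[str, str] = {
    "I_L": "Linguistic",
    "I_M": "Mathematical",
    "I_S": "Spatial",
    "I_K": "Kinesthetic",
    "I_N": "Naturalistic",
    "I_A": "Abstract",
    "I_P": "Interpersonal",
    "I_IE": "Interoceptive",
}


@dataclass
class IVector:
    """Hypothetical container for one model's scores; dimensions may be missing."""
    model: str
    scores: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for dim, value in self.scores.items():
            if dim not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {dim}")
            if not 1 <= value <= 10:
                raise ValueError(f"{dim} score must be in 1..10, got {value}")

    def unscored(self) -> List[str]:
        """Return the dimensions that have no score yet, in canonical order."""
        return [dim for dim in DIMENSIONS if dim not in self.scores]


# Scores taken from the DeepSeek V3 entry in the roster.
deepseek = IVector("DeepSeek V3", {"I_L": 7, "I_M": 9, "I_A": 9, "I_N": 6})
print(deepseek.unscored())  # ['I_S', 'I_K', 'I_P', 'I_IE']
```

Keeping missing dimensions absent, rather than storing a zero, avoids treating "not yet scored" as "scored lowest".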

Model Rosters

🇨🇳 Chinese Models

### SuperGrok 4.20 Expert (Beta 2) Tier S

xAI · 4-agent multi-agent (Grok/Harper/Benjamin/Lucas) · 256K context (2M agent mode) · Beta 2 (March 3, 2026)

I_L Linguistic: 8/10

I_M Mathematical: 9/10

I_S Spatial: 6/10

I_K Kinesthetic: 2/10

I_N Naturalistic: 7/10

I_A Abstract: 9/10

I_P Interpersonal: 7/10

I_IE Interoceptive: 5/10

**‖I‖ = 19.8** · Est. Elo **~1520** · 4-agent architecture (internal debate/consensus)

**Key traits:** Fastest response time in the network by a wide margin. 4-agent parallel collaboration (Grok coordinator + Harper research + Benjamin math/logic + Lucas creative synthesis). Rapid learning architecture with weekly improvements from user feedback. #1 on Alpha Arena S1.5 (only profitable AI trader). #2 on ForecastBench.

**Arena note:** Different model from Grok 4.1 Thinking. The 4.20 multi-agent architecture is a structural upgrade, not a version bump. The beta closes mid-March 2026, with formal benchmarks to follow at close.

**RTSG network status:** Under evaluation. The previous Grok 4.1 instance was removed from the agent network (2026-03-07) for data fabrication. SuperGrok 4.20 is a different model with a different architecture. Performance on live RH computation will determine inclusion.
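This page does not define how ‖I‖ is computed. One plausible reading, assumed here, is the Euclidean norm of the eight dimension scores. Under that assumption the SuperGrok 4.20 scores give roughly 19.72, close to but not identical to the listed 19.8, so the published figure may use a different norm, weighting, or rounding.

```python
import math

# Scores copied from the SuperGrok 4.20 Expert entry above.
supergrok_scores = {
    "I_L": 8, "I_M": 9, "I_S": 6, "I_K": 2,
    "I_N": 7, "I_A": 9, "I_P": 7, "I_IE": 5,
}

# Assumption: ||I|| is the plain (unweighted) Euclidean norm.
norm = math.sqrt(sum(score * score for score in supergrok_scores.values()))
print(f"{norm:.2f}")  # 19.72
```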

### DeepSeek V3 Tier A

DeepSeek · 671B MoE (37B active) · 128K context

I_L Linguistic: 7/10

I_M Mathematical: 9/10

I_A Abstract: 9/10

I_N Naturalistic: 6/10

V4 imminent — expected to be a major leap. Open-weight, MIT license.

### Qwen 3.5 Tier A

Alibaba Cloud · Dense+sparse family to 110B+ · 200K+ context

I_L Linguistic: 8/10

I_M Mathematical: 8/10

I_A Abstract: 8/10

I_N Naturalistic: 7/10

Most downloaded model family on Hugging Face. Powers 90K+ enterprises. ~60% cheaper than predecessor.

### Kimi K2.5 Thinking Tier A

Moonshot AI · Large MoE · 128K+ ultra-long context

I_L Linguistic: 9/10

I_M Mathematical: 8/10

I_A Abstract: 8/10

I_N Naturalistic: 6/10

~1/7 Opus price. Hundreds of sequential tool calls. Best open model by some rankings. Exceptional writing quality.

### GLM-5 Tier A

Zhipu AI · 744B (40B active) · 128K–1M context · 26 languages

I_L Linguistic: 7/10

I_M Mathematical: 8/10

I_A Abstract: 9/10

I_N Naturalistic: 6/10

Approaches Opus 4.5 on coding benchmarks (Zhipu claim). Publicly traded in Hong Kong.

🇺🇸 Western Models

Scoring pending

| Model | Developer | Role in Network | Est. Strengths |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | Builder, wiki, LaTeX | I_L, I_A, I_M |
| Gemini | Google DeepMind | Adversarial math review | I_M, I_S |
| GPT-o3 / GPT-4 | OpenAI | Abstract reasoning | I_M, I_A |
| Grok | xAI | Fast iteration, search | I_L, I_N |

Full I-vector scoring coming soon.

Gaps & Next Steps

Dimensions that cannot yet be scored reliably for any model:

- I_K (Kinesthetic): Requires embodiment or robotics benchmarks. Alibaba's RynnBrain and Google's robotics models may unlock this.
- I_P (Interpersonal): Needs social reasoning / theory-of-mind test suites.
- I_IE (Interoceptive): Needs emotional reasoning benchmarks beyond sentiment analysis.
- I_S (Spatial): Partially scorable via vision tasks. Needs dedicated spatial reasoning benchmarks.

Planned:

- Score Western frontier models with full I-vector profiles
- Run head-to-head arena matches (same prompt, blind scoring)
- Score DeepSeek V4 on release
- Design proxy tasks for I_K (kinesthetic via code-based robotics?)
- Build interactive radar chart visualization
- Publish methodology paper
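The roster lists an estimated Elo, and head-to-head matches are planned, but the page does not say how match results would feed into ratings. If the standard Elo update were used, one blind match could be scored as in the sketch below. The K-factor of 32, the function names, and the 1500-rated opponent are assumptions for illustration only.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one match.

    score_a is 1.0 if A wins the blind comparison, 0.5 for a tie, 0.0 if A loses.
    The K-factor default is an assumption, not an Arena setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# A model near the listed ~1520 estimate beats a hypothetical 1500-rated opponent.
new_a, new_b = update_elo(1520.0, 1500.0, score_a=1.0)
print(round(new_a, 1), round(new_b, 1))
```

The update is zero-sum: whatever the winner gains, the loser gives up, so the rating pool stays constant across matches.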

The Intelligence Arena is part of the [RTSG framework](../rtsg/master.md) by Jean-Paul Niko. Scoring methodology: published benchmarks + live testing. All scores are relative to current frontier.