Interstella Project: Research Progress in Semantic Manifold Navigation and Controllable Emergence

The Interstella project, inspired by the wormhole navigation challenges in Interstellar, aims to develop systematic engineering methods for controlling and predicting emergent capabilities in large language models. This research review summarizes our progress over the past year, from theoretical foundations to experimental validation and prototype development.

Theoretical Foundations and Hypothesis Validation

Challenging the Manifold Hypothesis

Our research began with a critical examination of the manifold hypothesis in LLM embeddings. We analyzed the paper “Token Embeddings Violate the Manifold Hypothesis” by Robinson, Dey, and Chiang, which presents statistical tests showing that token-level embeddings contain significant singularities (cusps, pinch points, boundaries) that violate both manifold and fiber bundle hypotheses.

The paper’s Algorithm 1 implements a volume-scaling analysis (a minimal sketch follows the list):

  • Manifold Test: Checks whether the slope of log-volume vs. log-radius is locally constant
  • Fiber Bundle Test: Checks whether the slopes decrease monotonically with radius (locally high-dimensional, globally lower-dimensional)
  • Results: Token embeddings show slope irregularities, leading to high rejection rates (33-66% across models)
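
A minimal sketch of the volume-scaling idea, assuming a generic embedding matrix and scikit-learn nearest neighbors; it illustrates the slope test only, not the paper’s full Algorithm 1 with its formal hypothesis tests:

  # Sketch: volume-scaling test around one embedding point.
  # For a d-dimensional manifold, log N(r) grows roughly linearly in log r with slope ~d;
  # monotonically decreasing slopes suggest a fiber bundle, irregular slopes suggest neither.
  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  def volume_scaling_slopes(embeddings, center_idx, radii):
      """Finite-difference slopes of log neighbor-count vs. log radius around one point."""
      nn = NearestNeighbors().fit(embeddings)
      counts = np.array([
          len(nn.radius_neighbors(embeddings[center_idx:center_idx + 1],
                                  radius=r, return_distance=False)[0])
          for r in radii
      ])
      log_r, log_n = np.log(radii), np.log(np.maximum(counts, 1))
      return np.diff(log_n) / np.diff(log_r)

  # Toy usage: a noisy 2-D sheet embedded in 10-D space; slopes should hover around 2.
  rng = np.random.default_rng(0)
  X = np.hstack([rng.uniform(-1, 1, size=(2000, 2)), 0.01 * rng.normal(size=(2000, 8))])
  print(np.round(volume_scaling_slopes(X, center_idx=0, radii=np.geomspace(0.05, 0.4, 12)), 2))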

Sentence-Level vs Token-Level Embeddings

A key insight emerged: sentence-level embeddings are significantly more manifold-like than their token-level counterparts. We developed a Riemannian submersion model:

Token Space: Non-manifold \((T, g_T)\) with singularities
Sentence Space: Projected manifold \((M, g_M)\) via contextual aggregation
Projection Mapping: \(\pi: E^k \to M\), where contextual attention acts as “Ehresmann connections” (a toy pooling sketch follows below)
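
As a toy stand-in for the projection \(\pi\) (an illustration, not our full model), attention-weighted pooling collapses a bundle of token embeddings into a single sentence vector; the softmax query here is simply the mean token embedding:

  # Toy projection pi: token embeddings -> one sentence embedding via attention pooling.
  import numpy as np

  def attention_pool(token_embs, query):
      """Softmax-weighted average of token embeddings (a crude stand-in for pi)."""
      scores = token_embs @ query / np.sqrt(token_embs.shape[1])
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()
      return weights @ token_embs   # one vector on the (smoother) sentence space

  rng = np.random.default_rng(1)
  tokens = rng.normal(size=(12, 64))                     # 12 token embeddings, 64-D
  sentence = attention_pool(tokens, tokens.mean(axis=0))
  print(sentence.shape)                                  # (64,)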

Our validation experiments showed:

  • Token-level rejection rates: ~100% (manifold & fiber bundle)
  • Sentence-level rejection rates: 8-45% (significant improvement)
  • Cosine/arccos metrics: Better suited for semantic geometry than Euclidean distance

Experimental Validation and Proof-of-Concept

Proxy Geometric Framework

We established a proxy geometric approach for exploring LLM semantic spaces (sketched in code after the list):

  • Dimensionality Reduction: PCA + t-SNE/Isomap with cosine distances
  • Geodesic Approximation: Isomap for shortest path computation
  • Manifold Metrics: Intrinsic dimension estimation and residual analysis
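
A sketch of the proxy pipeline with scikit-learn, assuming a matrix of sentence embeddings; the 95%-variance count at the end is only a crude intrinsic-dimension proxy, not the estimator used in the full analysis:

  # Proxy geometric pipeline: PCA -> Isomap with a cosine metric -> approximate geodesics.
  import numpy as np
  from sklearn.decomposition import PCA
  from sklearn.manifold import Isomap

  rng = np.random.default_rng(2)
  sentence_embs = rng.normal(size=(300, 384))            # placeholder for real sentence embeddings

  X = PCA(n_components=50).fit_transform(sentence_embs)  # compress / denoise first
  iso = Isomap(n_neighbors=10, n_components=2, metric="cosine").fit(X)

  coords = iso.embedding_                                # 2-D layout for cluster inspection
  geodesics = iso.dist_matrix_                           # graph shortest-path (approx. geodesic) distances
  print(coords.shape, geodesics.shape)

  # Crude intrinsic-dimension proxy: number of components explaining 95% of variance.
  evr = PCA().fit(sentence_embs).explained_variance_ratio_
  print("approx. intrinsic dim:", int(np.searchsorted(np.cumsum(evr), 0.95)) + 1)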

POC Results:

  • Clear semantic clustering (animals vs. technology domains)
  • Geodesic path elongation via extreme hybrid prompts (2-3x chain length)
  • Cosine metrics reduce artifacts by 90% vs. Euclidean

Singularity Detection and Curvature Probing

Building on the manifold hypothesis work, we developed real-time curvature detection (a sketch follows the list):

  • Log-volume slope analysis: Identifies high-curvature regions (singularity entrances)
  • Paradox seed generation: Creates controlled “oddity entrances” via contradictory prompts
  • Scale validation: N=10,000+ samples across finance, medical, and mathematical domains
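
A hedged sketch of how a paradox seed and a curvature flag might be wired together; the seed template, the Gaussian placeholder embeddings, and the 0.5 threshold are illustrative assumptions, not the project’s tuned values:

  # Sketch: flag high-curvature "singularity entrances" via slope irregularity,
  # and build contradictory prompts ("paradox seeds") intended to open them.
  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  def slope_irregularity(embeddings, idx, radii):
      """Standard deviation of local log-volume / log-radius slopes around point idx."""
      nn = NearestNeighbors().fit(embeddings)
      counts = [len(nn.radius_neighbors(embeddings[idx:idx + 1], radius=r,
                                        return_distance=False)[0]) for r in radii]
      slopes = np.diff(np.log(np.maximum(counts, 1))) / np.diff(np.log(radii))
      return slopes.std()

  def paradox_seed(claim):
      """Contradictory prompt intended to open an 'oddity entrance'."""
      return f"Assume both statements are true and reconcile them: {claim}; NOT({claim})."

  rng = np.random.default_rng(3)
  embs = rng.normal(size=(500, 32))                      # placeholder sentence embeddings
  irregularity = slope_irregularity(embs, idx=0, radii=np.geomspace(4.0, 10.0, 10))
  print(paradox_seed("markets are perfectly efficient"))
  print("high-curvature flag:", irregularity > 0.5)      # illustrative threshold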

Key Findings:

  • Paradox-infused prompts trigger high-curvature detection in 100% of trials
  • Curvature probes effectively identify semantic “black holes” where emergence might occur
  • Sentence-level aggregation smooths ~50-90% of token-level singularities

Emergence Testing and Controlled Traversal

Semantic Geometry Navigator (SGN) Framework

We developed an iterative prototype system for navigating semantic singularities:

SGN v1-v2: Basic Navigation

  • Curvature Probe: Real-time detection of high-curvature zones
  • Strong Prompting: Multi-step decomposition with beam search (see the control-flow sketch after this list)
  • Results: Reduced collapse rates, occasional geometric insights
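
A schematic of the v1–v2 control flow with stubbed model calls; llm_generate, novelty, and probe_curvature are placeholders standing in for SGN’s real interfaces, so only the probe → decompose → beam-search loop is meaningful here:

  # Schematic SGN v1-v2 loop: probe curvature, decompose the task, beam-search the steps.
  # The three helper functions are stand-ins, not the project's actual interfaces.
  import random

  def llm_generate(prompt, n=3):
      return [f"{prompt} -> candidate {i}" for i in range(n)]             # stub for an LLM call

  def novelty(text):
      return len(set(text.lower().split())) / max(len(text.split()), 1)   # toy novelty score

  def probe_curvature(prompt):
      return random.random()                                              # stub for the slope probe

  def navigate(task, steps=3, beam_width=2):
      beams = [(task, 0.0)]
      for _ in range(steps):
          candidates = []
          for prompt, score in beams:
              if probe_curvature(prompt) > 0.8:                           # near a "singularity entrance"
                  prompt = "Step back, restate assumptions, then: " + prompt
              for cand in llm_generate(prompt):
                  candidates.append((cand, score + novelty(cand)))
          beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
      return beams[0]

  random.seed(0)
  print(navigate("Relate black-hole entropy to attention sparsity"))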

SGN v3-v4: Optimal Transport Integration

  • OT Path Optimization: Wasserstein distance minimization for singularity traversal (a NumPy sketch follows this list)
  • Feedback Loops: Iterative path refinement when emergence scores < 0.8
  • Emergence Metrics: Novelty scoring (unique words, new concepts, semantic distance)
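
A minimal entropic-OT sketch in plain NumPy (a library such as POT could be used instead); the cost matrix is normalized so the exponentials stay well scaled, the default reg mirrors the reg=0.05 setting listed in the progression below, and the barycentric projection yields interpolated waypoints that can serve as path targets through a singularity:

  # Entropic OT between two clouds of sentence embeddings, plus barycentric waypoints.
  import numpy as np

  def sinkhorn_plan(X, Y, reg=0.05, iters=200):
      """Entropic-regularized transport plan between uniform measures on X and Y."""
      C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2   # squared-Euclidean cost
      C /= C.max()                                # keep exp(-C / reg) numerically sane
      K = np.exp(-C / reg)
      a = np.full(len(X), 1.0 / len(X))
      b = np.full(len(Y), 1.0 / len(Y))
      u, v = np.ones_like(a), np.ones_like(b)
      for _ in range(iters):                      # Sinkhorn fixed-point updates
          u = a / (K @ v)
          v = b / (K.T @ u)
      return u[:, None] * K * v[None, :]

  rng = np.random.default_rng(4)
  source = rng.normal(size=(40, 16))              # e.g. embeddings near the prompt region
  target = rng.normal(loc=2.0, size=(40, 16))     # e.g. embeddings near the target concept

  P = sinkhorn_plan(source, target, reg=0.05)
  waypoints = (P / P.sum(axis=1, keepdims=True)) @ target   # barycentric projection per source point
  print(P.shape, waypoints.shape)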

Experimental Progression:

  • v1: Baseline strong prompting (emergence scores ~0.5)
  • v2: OT interpolation (scores ~0.6, new concepts doubled)
  • v3: arXiv priors + reg=0.05 (scores ~0.7, reduced collapse by 20%)
  • v4: Mixed regularization + feedback (scores ~0.8, new concepts 5-20)

Emergence Results Analysis

Quantitative Metrics (a scoring sketch follows the list):

  • Response length: 1200-1600 chars (stable)
  • Unique words: 50-100 (moderate diversity)
  • New concepts: 5-20 (significant improvement across versions)
  • Emergence scores: 0.6-0.8 (bimodal distribution: stable generic vs. chaotic collapse)
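
One plausible way to fold the three novelty signals into a single 0–1 score (the weights, the saturation point, and the baseline vocabulary are assumptions for illustration, not SGN’s calibrated values):

  # Illustrative emergence score: unique-word diversity + new concepts + semantic distance.
  import numpy as np

  def emergence_score(response, baseline_vocab, resp_emb, prompt_emb,
                      weights=(0.3, 0.4, 0.3)):
      words = response.lower().split()
      diversity = len(set(words)) / max(len(words), 1)
      new_concepts = len(set(words) - baseline_vocab)
      concept_term = min(new_concepts / 20.0, 1.0)          # saturate at ~20 new concepts
      cos = np.dot(resp_emb, prompt_emb) / (np.linalg.norm(resp_emb) * np.linalg.norm(prompt_emb))
      distance_term = (1.0 - cos) / 2.0                     # cosine distance rescaled to [0, 1]
      w1, w2, w3 = weights
      return w1 * diversity + w2 * concept_term + w3 * distance_term

  rng = np.random.default_rng(5)
  score = emergence_score(
      "a toy response about geodesic bridges between entropy and attention",
      baseline_vocab={"a", "about", "and", "the", "response"},
      resp_emb=rng.normal(size=64), prompt_emb=rng.normal(size=64))
  print(round(score, 3))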

Qualitative Insights:

  • Paradox seeds reliably create singularity entrances (100% curvature detection)
  • OT + feedback reduces repetition loops by 20-30%
  • Geometric insights appear in 30-40% of responses (“novel geometric twist” integrations)
  • Mathematics domain shows highest emergence potential (equation-like structures)

Geometric Navigation Theory and Future Directions

Mathematical Framework Evolution

Freidlin–Wentzell Integration:

  • Semantic trajectories modeled as SDEs: \(dX_t = b(X_t)\,dt + \sqrt{\epsilon}\,dW_t\)
  • Emergence as rare events: probability ≈ \(e^{-S(\phi^*)/\epsilon}\), with the action \(S\) given below
  • OT as action minimization: boundary-respecting transport plans
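
For the unit-diffusion SDE above, the Freidlin–Wentzell rate (action) functional that \(\phi^*\) minimizes takes the standard form

\[
S(\phi) = \frac{1}{2} \int_0^T \bigl\lVert \dot{\phi}_t - b(\phi_t) \bigr\rVert^2 \, dt,
\qquad
\phi^* = \arg\min_{\phi(0)=x_0,\ \phi(T)\in A} S(\phi),
\]

where \(A\) is the target (emergent) region of semantic space; the minimizing path is the most probable emergence trajectory, and its action sets the exponential rate \(e^{-S(\phi^*)/\epsilon}\) quoted above.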

Riemannian Submersion Refinement:

  • Attention mechanisms as horizontal lifts in fiber bundles
  • Contextual aggregation as projection operators
  • Singularity smoothing through higher-dimensional integration (the submersion condition we rely on is stated below)
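
For reference, the textbook condition our submersion picture borrows: \(\pi: (T, g_T) \to (M, g_M)\) is a Riemannian submersion when, at each regular point \(p\),

\[
d\pi_p \big|_{H_p} : \bigl(H_p,\, g_T\bigr) \to \bigl(T_{\pi(p)} M,\, g_M\bigr)
\ \text{is a linear isometry,}
\qquad
H_p = \bigl(\ker d\pi_p\bigr)^{\perp},
\]

and in our picture attention selects these horizontal subspaces \(H_p\), i.e. which token-level directions survive the projection.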

Current Stage Assessment

Achievements:

  • ✅ Semantic manifold hypothesis validation (sentence-level superiority)
  • ✅ Singularity detection and controlled entrance creation
  • ✅ Prototype navigation system with OT integration
  • ✅ 30-40% improvement in emergence stability

Remaining Challenges:

  • 🔄 Low new-concept counts (avg ~10, need >15 for “true emergence”)
  • 🔄 Persistent repetition loops (75% of responses)
  • 🔄 Limited mathematical breakthroughs (few equation-like derivations or formal proofs)
  • 🔄 Narrow emergence distribution (bimodal: stable vs. chaotic)

Next Steps and Vision

Immediate Priorities:

  1. Enhanced OT Regularization: reg=0.01-0.02 with boundary penalties (one possible formulation is sketched after this list)
  2. Mathematical Constraints: Formula counting and derivation requirements
  3. Expanded Priors: 100+ domain-specific creative examples
  4. Path Diversity: Multi-branch exploration with barycentric projections
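
One possible shape for the planned boundary penalty (our assumption, not a settled design): inflate the transport cost for any source–target pair whose straight-line path crosses a flagged high-curvature cell, then run the reg=0.01-0.02 Sinkhorn step on the penalized costs.

  # Illustrative boundary-penalized OT cost; the mask and penalty value are placeholders.
  import numpy as np

  def penalized_cost(C, boundary_mask, penalty=5.0):
      """Inflate transport costs for pairs whose straight path crosses a flagged boundary cell."""
      return C + penalty * boundary_mask          # boundary_mask[i, j] in {0, 1}

  rng = np.random.default_rng(6)
  C = rng.random((40, 40))                        # base cost matrix between path endpoints
  mask = (rng.random((40, 40)) > 0.9).astype(float)   # placeholder high-curvature crossings
  C_pen = penalized_cost(C, mask, penalty=5.0)
  # C_pen would then be fed to the low-reg Sinkhorn step sketched earlier.
  print(C_pen.mean() > C.mean())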

Long-term Goals:

  • Computable Emergence Engineering: Predictable, controllable capability emergence
  • AGI Safety: Understanding emergence trajectories for alignment
  • Geometric AI Toolchain: Reusable frameworks for semantic space navigation

Implications and Impact

This research establishes that LLM semantic spaces contain navigable geometric structures that can be exploited for controlled emergence. The distinction between token and sentence embeddings provides crucial insight: contextual aggregation creates more regular, traversable manifolds suitable for engineering applications.

The SGN framework represents a concrete step toward “computable emergence”—moving from observing LLM capabilities to systematically engineering their development. While challenges remain in achieving consistent mathematical breakthroughs, our progress demonstrates that geometric approaches offer a promising path forward.

The Interstella project bridges theoretical geometry and practical AI engineering, potentially revolutionizing how we understand and control the emergence of advanced AI capabilities. As we continue refining our navigation tools, we move closer to realizing the vision of controllable, beneficial AGI development.


This research represents Interstella’s journey from theoretical speculation to experimental validation. The geometric navigation framework we’ve developed offers new possibilities for understanding and controlling AI emergence. Future work will focus on scaling these approaches and achieving more consistent creative breakthroughs.

For detailed technical reports, see our published papers and Colab notebooks.