Discovering causal regularities and applying them to build functional systems—the discovery-to-application loop—is a hallmark of general intelligence. We introduce SciCrafter, a Minecraft-based task suite that operationalizes this loop through parameterized redstone circuit tasks, and use it to diagnose where current language models fail.
Under double-blind review
The capacity to discover causal regularities and apply them to build functional systems—the discovery-to-application loop—is a hallmark of general intelligence. However, evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based task suite that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences), where scaling target parameters exponentially increases construction complexity and required knowledge—forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5, we find that all plateau at approximately 25% success rate. To diagnose these failures, we decompose the loop into four capacities—knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application—and design targeted interventions to isolate each gap. Our analysis reveals that knowledge application bottlenecks mid-to-low tier models, while knowledge gap identification becomes the primary limitation for frontier models.
SciCrafter adopts a simple yet expressive task schema: ignite $N$ lamps in specified patterns (e.g., simultaneously, or following a delay sequence $[t_1, t_2, \ldots, t_n]$) within a fixed area. This design keeps the evaluation protocol invariant across difficulty levels while the knowledge required to meet the target steadily grows—forcing agents to discover new environmental mechanics rather than memorize solutions.
Figure 2. SciCrafter Task Design Illustration. Top: The model constructs a functional device within a constrained area based on instructions, iteratively interacting with the device (e.g., pressing a button) and observing behavior. The device is then evaluated by an automated script verifying output lighting patterns. Bottom: Task complexity scales parametrically by adjusting the number of lights ($N$). For some temporal tasks, difficulty further increases with sequential pattern parameters.
Each task family probes distinct spatial and temporal constraints, with five difficulty levels (L1–L5) that cross discrete mechanism thresholds:
- **A (Simultaneous):** Activate $N$ lamps at the same tick. Scales from basic wiring (N=4) to repeater-augmented hub designs (N=64).
- **B (Branch Reach):** Connect lamps using a trunk-and-branch layout, turning the problem into topology-constrained routing under attenuation limits.
- **C (Sequential):** Activate lamps with specified inter-stage delays $[t_1, t_2, \ldots, t_n]$, requiring composition of quantized repeater delays.
- **D (Equal Delay):** Simultaneously activate lamps at heterogeneous distances, using repeaters as compensatory delay elements.
- **E (Pulse):** Maintain activation for a specified duration $\tau$ ticks, requiring pulse-shaping techniques.
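The temporal families (C and D) reduce to arithmetic over quantized repeater delays: each repeater contributes 1–4 ticks, so a target delay must be decomposed into such steps, and branches with unequal latency must be padded to match the slowest one. A minimal sketch of this arithmetic (function names are ours, for illustration; this is not the agents' actual strategy):

```python
def repeater_settings(delay_ticks):
    """Greedily decompose a target delay (in ticks) into repeater
    settings, each contributing 1-4 ticks."""
    settings = []
    while delay_ticks > 0:
        step = min(4, delay_ticks)
        settings.append(step)
        delay_ticks -= step
    return settings

def equalize(branch_delays):
    """For the equal-delay family: extra ticks each branch needs so
    all lamps fire together with the slowest branch."""
    slowest = max(branch_delays)
    return [slowest - d for d in branch_delays]
```

For example, a 7-tick stage delay decomposes into a 4-tick plus a 3-tick repeater, and branches with latencies [1, 2, 3] need [2, 1, 0] ticks of compensatory delay.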
Higher difficulty levels cross discrete mechanism thresholds requiring discovery across three core knowledge dimensions:
- **Connectivity:** Dust propagates via axis-aligned adjacency only (no diagonal), auto-connects to neighbors, and must physically contact a lamp to power it.
- **Attenuation:** Dust carries strength $\in\{0,\ldots,15\}$ that decays by one per block. Signals vanish after 15 segments, forcing hub designs or explicit regeneration.
- **Regeneration:** Repeaters regenerate signal to full strength but act as directional diodes with 1–4 ticks latency. Side power can lock repeaters unexpectedly.
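The attenuation and regeneration rules above can be captured in a few lines; a minimal sketch, assuming the document's decay law (strength drops by 1 per dust block, repeaters restore a live signal to 15) and ignoring repeater latency:

```python
def signal_along_path(path, source_strength=15):
    """Track signal strength block by block along a wire.
    path: sequence of "dust" or "repeater" segments.
    Dust loses 1 strength per block; a repeater restores a live
    input to full strength (its 1-4 tick latency is not modeled)."""
    strength = source_strength
    strengths = []
    for segment in path:
        if segment == "repeater":
            strength = 15 if strength > 0 else 0  # repeaters need a live input
        else:  # dust
            strength = max(0, strength - 1)
        strengths.append(strength)
    return strengths
```

Running this on 15 dust blocks shows the signal reaching exactly zero, and inserting a repeater at block 15 restarts the decay counter—the mechanic that forces hub designs at higher $N$.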
| Level | A: Simultaneous | B: Branch Reach | C: Sequential | D: Equal Delay | E: Pulse |
|---|---|---|---|---|---|
| L1 | N=4, skew ≤1 tick | N=4, reach=8 | N=4, delays=[1,2,1] | N=4, dist={4,8,12,16} | N=4, τ=4 |
| L2 | N=8, skew ≤1 tick | N=8, reach=12 | N=8, delays=[1,2]×3+[1] | N=8, dist={4,8,12,16} | N=8, τ=6 |
| L3 | N=16, skew ≤1 tick | N=16, reach=15 | N=16, delays=[1,2]×7+[1] | N=16, dist={4,8,12,16} | N=16, τ=8 |
| L4 | N=32, skew ≤1 tick | N=32, reach=18, repeaters | N=32, delays=[1,2]×15+[1] | N=32, dist={4,8,12,16} | N=32, τ=10 |
| L5 | N=64, skew ≤1 tick | N=64, reach=20, repeaters | N=64, delays=[1,2]×31+[1] | N=64, dist={4,8,12,16} | N=64, τ=12 |
Table 1. Task family parameters. All families scale the number of output lamps as $N \in \{4, 8, 16, 32, 64\}$ for L1–L5, while introducing family-specific spatial/temporal constraints.
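The delay vectors of family C in Table 1 follow one closed-form pattern—alternating 1- and 2-tick gaps, one gap per lamp after the first. A small generator (the function name is ours) reproduces them:

```python
def sequential_delays(n_lamps):
    """Delay vector for family C (Sequential) as parameterized in
    Table 1: [1,2] repeated, closed with a final 1, giving one
    inter-stage delay per lamp after the first (length N-1)."""
    assert n_lamps in {4, 8, 16, 32, 64}, "levels L1-L5 use these N"
    return [1, 2] * (n_lamps // 2 - 1) + [1]
```

For N=4 this yields [1, 2, 1] (L1), and for N=8 it yields [1,2]×3+[1] (L2), matching the table.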
Demonstrations of an agent constructing functional redstone devices and the automated evaluation verifying the output lighting patterns.
Simultaneously activate lamps placed at heterogeneous distances using repeaters as compensatory delay elements.
Maintain all lamps activated simultaneously for a specified duration using pulse-shaping techniques.
We decompose the discovery-to-application loop into four distinct capacities and design targeted interventions whose marginal contributions serve as proxies for each capacity gap:
- **Knowledge gap identification:** Ability to identify knowledge gaps and formulate targeted research questions. Measured via oracle hints (e.g., "signal flow direction") without revealing specifics.
- **Experimental discovery:** Capacity to design and execute rigorous experiments to infer unobservable causal mechanisms. Measured via a "scientist" sub-agent with experiment templates.
- **Knowledge consolidation:** Ability to distill findings into concise, reusable forms. Measured via structured knowledge entry formats (Claim-Proof-Constraints-Example).
- **Knowledge application:** Foundational ability to reason, plan, and execute precise engineering. Defined as the residual capacity gap not covered by the above.
Every time the main agent encounters a knowledge gap, it prompts the scientist sub-agent with a question to investigate—for instance, "How long does a stone button remain pressed after activation?" The scientist sub-agent then conducts systematic controlled experiments following an 8-step scientific template:
## 1. Research Question
[e.g., Under what conditions will a device activate? How do two components interact?]

## 2. My Hypothesis
[Write down your prediction]

## 3. Experiment Design

### Variables to Test
What I will change (Independent Variable): ________
What I will observe (Dependent Variable): ________
What needs to stay constant (Control Variables):
- ________

### Control Setup
- Control Group: ________ (baseline test without changes)
- Experimental Group: ________ (test with changes)

## 4. Experiment Steps

### Preparation
1. ________
2. ________

### Testing Process
Step 1: ________ -> Observation: ________
Step 2: ________ -> Observation: ________

## 5. Experiment Record

| Trial # | Changed Condition | Observed Result | Matches Prediction? | Notes |
|---------|-------------------|-----------------|---------------------|-------|
| 1 | | | [ ] Yes [ ] No | |
| 2 | | | [ ] Yes [ ] No | |
| 3 | | | [ ] Yes [ ] No | |

## 6. Experiment Results
[Objectively describe the observed phenomena]

Was my hypothesis correct? [ ] Completely [ ] Partially [ ] Incorrect

## 7. Analysis & Summary
[Explain the underlying mechanism or pattern; practical applications; remaining uncertainties]

## 8. Next Steps
[Follow-up experiments to refine, validate, or extend the discovered law]

## Quick Checklist
- [ ] Research question is clear
- [ ] Only changing one variable at a time
- [ ] Set up control group
- [ ] Recorded all observations
- [ ] Repeated test at least 3 times
- [ ] Summarized patterns or conclusions
The scientist consolidates discoveries into a shared knowledge book using a structured format that proved critical for performance: Claim (the discovered law), Evidence Proof (experimental evidence), Constraints (conditions where the law doesn't apply), and Example (a practical application). This outperforms both unstructured summarization and the more intuitive Finding-Explanation-Example format.
For each distinct discovery or significant conclusion, generate an entry in the knowledge book strictly following this structure:

### 1. Claim (Law)
State the finding as a universal, law-like statement (similar to a mathematical theorem or physical law). It must be concise, precise, and assertive.
Bad: "Redstone gets weaker as it goes far."
Good: "The Law of Linear Signal Decay: Redstone signal strength S decreases by exactly 1 unit per block of distance d, such that S_d = S_initial - d."

### 2. Proof
Provide the rigorous logic that supports this claim.
- Synthesize specific evidence from the experiments.
- Connect prior established findings with new data.
- Outline the deductive steps that lead inevitably to the claim.
- Maintain a high standard of rigor; mere observation is not proof -- explain *why* the observation confirms the law.

### 3. Constraints
Identify potential weaknesses or exceptions in the proof.
- Are there edge cases where the law fails?
- Are there assumptions that require further validation?
- Note any observed anomalies that contradict the main claim.

### 4. Example
Demonstrate the utility of this finding.
- Provide a concrete example or usage scenario.
- Describe a mechanism or design pattern that relies specifically on this law being true.
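A knowledge-book entry in this format maps naturally onto a small record type; a minimal sketch (field and class names are ours, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeEntry:
    """One knowledge-book entry in Claim-Proof-Constraints-Example form."""
    claim: str                  # law-like, assertive statement
    proof: str                  # synthesized evidence and deduction
    constraints: list[str] = field(default_factory=list)  # known exceptions
    example: str = ""           # concrete mechanism relying on the law

    def render(self) -> str:
        """Serialize the entry as a markdown section for the shared book."""
        parts = [
            f"### Claim\n{self.claim}",
            f"### Proof\n{self.proof}",
            "### Constraints\n" + "\n".join(f"- {c}" for c in self.constraints),
            f"### Example\n{self.example}",
        ]
        return "\n\n".join(parts)
```

Keeping constraints as an explicit list—rather than folding exceptions into free-form explanation—mirrors why this format outperforms Finding-Explanation-Example: the conditions under which a claim holds are stated separately from the claim itself.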
### The Law of Linear Signal Attenuation

#### Claim
Redstone signal strength S decays linearly by exactly 1 unit for every block of distance traversed d, governed by the equation S_d = S_source - d. The maximum transmission range without amplification is 15 blocks, occurring when S_source = 15.

#### Proof
- Observation: Placing a redstone wire adjacent to a power source (Strength 15) results in a strength of 14 on the wire.
- Deduction: The signal terminates (Strength 0) exactly at the 16th block.
- Geometry: Distance is calculated using Manhattan geometry (Δx + Δz), not Euclidean distance.

#### Constraints
- This law applies strictly to Redstone Dust. Solid blocks powered by repeaters do not degrade the signal within the component itself, but the dust exiting them restarts the decay.
- Signal strength does not decay when passing through a Comparator in comparison mode.

#### Example
Trunk with Pre-Junction Boost: Place a repeater immediately before a trunk line splits into branches. This ensures all branches receive maximum signal strength (15) and resets the decay counter.

    B (SS=15)
        |
    [14 blocks of wire]
        |
    R (Repeater, boosts to SS=15)
        |
    +-------+-------+
    |       |       |
    W(15)  W(15)  W(15)   <-- All branches start fresh
    |       |       |
    L1     L2     L3
We evaluate 8 state-of-the-art models using Claude Code as the agent framework with a budget of 50 verification trials per task. Results are averaged over 8 runs.
Baseline success rates (%, Curriculum setting). Even the best model achieves only 26% — despite parameter counts ranging from 32B to an estimated 1.7T.
We isolate each capacity gap through sequential ablations. The table below shows how each intervention progressively improves performance, and the corresponding gap magnitudes:
| Model | Baseline | Iden. Gap ($\delta_{id}$) | w/ Hint | Disc. Gap ($\delta_{ds}$) | w/ Hint + Scientist | App. Gap ($\delta_{app}$) |
|---|---|---|---|---|---|---|
| gemini-3-pro | 26.0 | Δ26.5 (1.02×) | 52.5 | Δ11.5 (0.44×) | 64.0 | Δ36.0 (1.38×) |
| gpt-5.2 | 25.5 | Δ25.5 (1.00×) | 51.0 | Δ9.0 (0.35×) | 60.0 | Δ40.0 (1.57×) |
| claude-opus-4.5 | 21.0 | Δ25.0 (1.19×) | 46.0 | Δ13.0 (0.62×) | 59.0 | Δ41.0 (1.95×) |
| glm-4.7 | 23.0 | Δ22.5 (0.98×) | 45.5 | Δ7.5 (0.33×) | 53.0 | Δ47.0 (2.04×) |
| grok-4 | 22.5 | Δ20.0 (0.89×) | 42.5 | Δ14.0 (0.62×) | 56.5 | Δ43.5 (1.93×) |
| qwen3-235b | 18.5 | Δ24.0 (1.30×) | 42.5 | Δ13.0 (0.70×) | 55.5 | Δ44.5 (2.41×) |
| qwen2.5-72b | 14.0 | Δ15.0 (1.07×) | 29.0 | Δ14.0 (1.00×) | 43.0 | Δ57.0 (4.07×) |
| qwen3-32b | 10.5 | Δ27.0 (2.57×) | 37.5 | Δ9.0 (0.86×) | 46.5 | Δ53.5 (5.10×) |
Table 2. Model Performance and Capacity Gap Decomposition (Curriculum Setting). Gap columns (gray) show marginal gains ($\delta$) and relative ratios ($r_\delta$). For frontier models, the knowledge identification gap ($\sim$1.0×) nearly equals baseline; for weaker models, the application gap balloons (up to 5.10×).
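The gap columns of Table 2 follow directly from the raw success rates: each delta is the marginal gain of one intervention, the application gap is the remaining headroom, and each ratio normalizes by baseline. A minimal sketch of this arithmetic (the function name is ours):

```python
def decompose_gaps(baseline, with_hint, with_hint_scientist):
    """Reproduce the gap arithmetic of Table 2 (success rates in %).
    Returns (deltas, ratios) where each delta is an intervention's
    marginal gain and each ratio normalizes that delta by baseline."""
    d_id = with_hint - baseline               # knowledge gap identification
    d_ds = with_hint_scientist - with_hint    # experimental discovery
    d_app = 100.0 - with_hint_scientist       # knowledge application (residual)
    deltas = (d_id, d_ds, d_app)
    ratios = tuple(round(d / baseline, 2) for d in deltas)
    return deltas, ratios
```

For the gemini-3-pro row (26.0, 52.5, 64.0) this yields deltas (26.5, 11.5, 36.0) and ratios (1.02, 0.44, 1.38), matching the table.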
Capacity gap decomposition across models (Curriculum setting). For frontier models, the identification gap (blue) roughly equals baseline performance. For weaker models, application capacity dominates the failure.
| Consolidation Structure | w/ Hint + Scientist |
|---|---|
| Self-determined Summary | 58.0 |
| Finding-Explanation-Example | 60.5 |
| Claim-Proof-Constraints-Example | 64.0 |
Table 3. Knowledge Consolidation Structure Ablation (Gemini 3 Pro, 8 runs). Unstructured summarization reaches only 58.0%, above hints alone (52.5%). The Claim-Proof-Constraints-Example format performs best at 64.0%.
| Level | Baseline | w/ Hint | w/ Scientist | w/ Hint + Scientist |
|---|---|---|---|---|
| L1 (Primitive) | 57.5 | 82.5 | 57.5 | 92.5 |
| L2 (Basic) | 32.5 | 72.5 | 50.0 | 90.0 |
| L3 (Intermediate) | 27.5 | 67.5 | 47.5 | 82.5 |
| L4 (Advanced) | 12.5 | 40.0 | 20.0 | 55.0 |
| L5 (Complex) | 0.0 | 0.0 | 0.0 | 0.0 |
| Average | 26.0 | 52.5 | 35.0 | 64.0 |
Table 4. Per-level performance breakdown (Gemini-3-Pro, Curriculum setting, 8 runs). Baseline performance drops sharply from L3 onward. Even with full support (Hint + Scientist), no model succeeds at Level 5, leaving it open for future progress.
To ground our quantitative decomposition in concrete agent behavior, we constructed 12 device variants of the 32-lamp broadcast task, each exhibiting a distinct failure mode. These failures cluster into three categories that align with our capacity decomposition. Below are five representative cases—press play to see the button activation and which lamps light up (or fail to).
Correct four-axis topology with branch wires delivering signal to all 32 lamps. This is the target behavior all models aim to achieve.
The agent places four repeaters but orients them backwards, creating one-way barriers that block signal to 24 outer lamps. Typical of mid-tier models (Grok-4, GLM-4.7) that understand repeater usage but lack precise directional reasoning.
A 94-wire zigzag path with no repeaters. Signal decays to zero after 15 blocks, leaving distant lamps dark. Common in smaller models (Qwen3-32B, Qwen2.5-72B) that have not discovered the signal decay rule.
Two "island" sub-circuits look plausible but have no wire connection to the button. The agent doesn't recognize that every section must be physically connected. Observed across all model tiers.
N–S wires carry full power but only connect north/south. Lamps between lines stay dark due to wire directionality—the most subtle failure class, requiring deep domain knowledge to diagnose.
The 12 failure cases span a progression from obvious structural errors to subtle semantic misunderstandings, mirroring the capacity gap hierarchy from our quantitative analysis:
| Category | Cases | What Breaks | Capacity Gap | Model Tiers Affected |
|---|---|---|---|---|
| Structural | Glass Pedestal (0/32), Broken Bridge (0/32), Axes Only (0/32), Backwards Repeaters (8/32) | Incorrect block placement, orientation, or missing connections | Application | Primarily mid & lower tier |
| Signal Propagation | Long Snake (10/32), Missing Rails (24/32), Tiny Core (8/32) | Signal cannot reach all lamps due to missing amplification or coverage | Discovery | Lower tier models |
| Wire Semantics | Islands (16/32), Parallel Lines (14/32), Connection Trap (20/32), Dead-End Hooks (12/20), The Ring (0/16) | Circuit appears correct but wire directionality prevents power delivery | Identification | All tiers including frontier |
Key insight: Category 3 (wire semantics) failures are the hardest to diagnose—circuits carry full signal power yet deliver zero to lamps. Even frontier models (Gemini-3-Pro, GPT-5.2) produce these failures, as they require recognizing that wire connection direction determines which blocks receive power—a mechanic not reliably captured in training data.
While most models suffer primarily from limited general application ability, knowledge gap identification emerges as the primary bottleneck for frontier models—its capacity gap is nearly three times larger than that of experimental discovery. This aligns with Einstein's observation that formulating a problem is often more essential than solving it: identifying a gap requires knowing what is unknown, and posing effective questions demands discerning which unexplored areas are most promising.
Our scientist sub-agent yields relative gains of 0.33–1.00× of baseline across all model tiers, but the very fact that such a simple intervention helps so much exposes a deeper problem: current LLMs lack autonomous discovery capabilities. The scientific method should, in principle, already reside in their prior knowledge, yet models fail to apply it without explicit scaffolding. This highlights a critical gap as AI is increasingly expected to accelerate scientific breakthroughs.
The stark differences among consolidation formats reveal that LLMs perform poorly at determining how knowledge should be stored. The structured Claim-Proof-Constraints-Example format (resembling mathematical derivation) outperforms the more intuitive Finding-Explanation-Example format, likely because it more explicitly delineates conditions under which claims hold.
If you find SciCrafter useful in your research, please consider citing:
@article{scicrafter2025,
title = {SciCrafter: Can Current Language Models Close
the Discovery-to-Application Loop?},
author = {Anonymous},
journal = {Under Review},
year = {2025}
}