Discovering causal regularities and applying them to build functional systems—the discovery-to-application loop—is a hallmark of general intelligence. We introduce SciCrafter, a Minecraft-based task suite that operationalizes this loop through parameterized redstone circuit tasks, and use it to diagnose where current language models fail.
Under double-blind review
The capacity to discover causal regularities and apply them to build functional systems—the discovery-to-application loop—is a hallmark of general intelligence. However, evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based task suite that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences), where scaling target parameters exponentially increases construction complexity and required knowledge—forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5, we find that all plateau at approximately 25% success rate. To diagnose these failures, we decompose the loop into four capacities—knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application—and design targeted interventions to isolate each gap. Our analysis reveals that knowledge application bottlenecks mid-to-low tier models, while knowledge gap identification becomes the primary limitation for frontier models.
SciCrafter adopts a simple yet expressive task schema: ignite $N$ lamps in specified patterns (e.g., simultaneously, or following a delay sequence $[t_1, t_2, \ldots, t_n]$) within a fixed area. This design keeps the evaluation protocol invariant across difficulty levels while the knowledge required to meet the target steadily grows—forcing agents to discover new environmental mechanics rather than memorize solutions.
Figure 2. SciCrafter Task Design Illustration. Top: The model constructs a functional device within a constrained area based on instructions, iteratively interacting with the device (e.g., pressing a button) and observing behavior. The device is then evaluated by an automated script verifying output lighting patterns. Bottom: Task complexity scales parametrically by adjusting the number of lights ($N$). For some temporal tasks, difficulty further increases with sequential pattern parameters.
Each task family probes distinct spatial and temporal constraints, with five difficulty levels (L1–L5) that cross discrete mechanism thresholds:
- **A (Simultaneous):** Activate $N$ lamps at the same tick. Scales from basic wiring (N=4) to repeater-augmented hub designs (N=64).
- **B (Branch Reach):** Connect lamps using a trunk-and-branch layout, turning the problem into topology-constrained routing under attenuation limits.
- **C (Sequential):** Activate lamps with specified inter-stage delays $[t_1, t_2, \ldots, t_n]$, requiring composition of quantized repeater delays.
- **D (Equal Delay):** Simultaneously activate lamps at heterogeneous distances, using repeaters as compensatory delay elements.
- **E (Pulse):** Maintain activation for a specified duration $\tau$ ticks, requiring pulse-shaping techniques.
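The temporal families (C and D) reduce to arithmetic over quantized repeater delays: each repeater contributes 1–4 ticks, so a target delay must be decomposed into such steps, and branches with unequal latency must be padded to match the slowest one. A minimal sketch of this arithmetic (function names are ours, for illustration; this is not the agents' actual strategy):

```python
def repeater_settings(delay_ticks):
    """Greedily decompose a target delay (in ticks) into repeater
    settings, each contributing 1-4 ticks."""
    settings = []
    while delay_ticks > 0:
        step = min(4, delay_ticks)
        settings.append(step)
        delay_ticks -= step
    return settings

def equalize(branch_delays):
    """For the equal-delay family: extra ticks each branch needs so
    all lamps fire together with the slowest branch."""
    slowest = max(branch_delays)
    return [slowest - d for d in branch_delays]
```

For example, a 7-tick stage delay decomposes into a 4-tick plus a 3-tick repeater, and branches with latencies [1, 2, 3] need [2, 1, 0] ticks of compensatory delay.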
Higher difficulty levels cross discrete mechanism thresholds requiring discovery across three core knowledge dimensions:
- **Connectivity:** Dust propagates via axis-aligned adjacency only (no diagonal), auto-connects to neighbors, and must physically contact a lamp to power it.
- **Attenuation:** Dust carries strength $\in\{0,\ldots,15\}$ that decays by one per block. Signals vanish after 15 segments, forcing hub designs or explicit regeneration.
- **Regeneration:** Repeaters regenerate signal to full strength but act as directional diodes with 1–4 ticks latency. Side power can lock repeaters unexpectedly.
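The attenuation and regeneration rules above can be captured in a few lines; a minimal sketch, assuming the document's decay law (strength drops by 1 per dust block, repeaters restore a live signal to 15) and ignoring repeater latency:

```python
def signal_along_path(path, source_strength=15):
    """Track signal strength block by block along a wire.
    path: sequence of "dust" or "repeater" segments.
    Dust loses 1 strength per block; a repeater restores a live
    input to full strength (its 1-4 tick latency is not modeled)."""
    strength = source_strength
    strengths = []
    for segment in path:
        if segment == "repeater":
            strength = 15 if strength > 0 else 0  # repeaters need a live input
        else:  # dust
            strength = max(0, strength - 1)
        strengths.append(strength)
    return strengths
```

Running this on 15 dust blocks shows the signal reaching exactly zero, and inserting a repeater at block 15 restarts the decay counter—the mechanic that forces hub designs at higher $N$.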
| Level | A: Simultaneous | B: Branch Reach | C: Sequential | D: Equal Delay | E: Pulse |
|---|---|---|---|---|---|
| L1 | N=4, skew ≤1 tick | N=4, reach=8 | N=4, delays=[1,2,1] | N=4, dist={4,8,12,16} | N=4, τ=4 |
| L2 | N=8, skew ≤1 tick | N=8, reach=12 | N=8, delays=[1,2]×3+[1] | N=8, dist={4,8,12,16} | N=8, τ=6 |
| L3 | N=16, skew ≤1 tick | N=16, reach=15 | N=16, delays=[1,2]×7+[1] | N=16, dist={4,8,12,16} | N=16, τ=8 |
| L4 | N=32, skew ≤1 tick | N=32, reach=18, repeaters | N=32, delays=[1,2]×15+[1] | N=32, dist={4,8,12,16} | N=32, τ=10 |
| L5 | N=64, skew ≤1 tick | N=64, reach=20, repeaters | N=64, delays=[1,2]×31+[1] | N=64, dist={4,8,12,16} | N=64, τ=12 |
Table 1. Task family parameters. All families scale the number of output lamps as $N \in \{4, 8, 16, 32, 64\}$ for L1–L5, while introducing family-specific spatial/temporal constraints.
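The delay vectors of family C in Table 1 follow one closed-form pattern—alternating 1- and 2-tick gaps, one gap per lamp after the first. A small generator (the function name is ours) reproduces them:

```python
def sequential_delays(n_lamps):
    """Delay vector for family C (Sequential) as parameterized in
    Table 1: [1,2] repeated, closed with a final 1, giving one
    inter-stage delay per lamp after the first (length N-1)."""
    assert n_lamps in {4, 8, 16, 32, 64}, "levels L1-L5 use these N"
    return [1, 2] * (n_lamps // 2 - 1) + [1]
```

For N=4 this yields [1, 2, 1] (L1), and for N=8 it yields [1,2]×3+[1] (L2), matching the table.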
Demonstrations of an agent constructing functional redstone devices and the automated evaluation verifying the output lighting patterns.
Simultaneously activate lamps placed at heterogeneous distances using repeaters as compensatory delay elements.
Maintain all lamps activated simultaneously for a specified duration using pulse-shaping techniques.
We decompose the discovery-to-application loop into four distinct capacities and design targeted interventions whose marginal contributions serve as proxies for each capacity gap:
- **Knowledge gap identification:** Ability to identify knowledge gaps and formulate targeted research questions. Measured via oracle hints (e.g., "signal flow direction") without revealing specifics.
- **Experimental discovery:** Capacity to design and execute rigorous experiments to infer unobservable causal mechanisms. Measured via a "scientist" sub-agent with experiment templates.
- **Knowledge consolidation:** Ability to distill findings into concise, reusable forms. Measured via structured knowledge entry formats (Claim-Proof-Constraints-Example).
- **Knowledge application:** Foundational ability to reason, plan, and execute precise engineering. Defined as the residual capacity gap not covered by the above.
Every time the main agent encounters a knowledge gap, it prompts the scientist sub-agent with a question to investigate—for instance, "How long does a stone button remain pressed after activation?" The scientist sub-agent then conducts systematic controlled experiments following an 8-step scientific template:
## 1. Research Question
[e.g., Under what conditions will a device activate? How do two components interact?]

## 2. My Hypothesis
[Write down your prediction]

## 3. Experiment Design

### Variables to Test
What I will change (Independent Variable): ________
What I will observe (Dependent Variable): ________
What needs to stay constant (Control Variables):
- ________

### Control Setup
- Control Group: ________ (baseline test without changes)
- Experimental Group: ________ (test with changes)

## 4. Experiment Steps

### Preparation
1. ________
2. ________

### Testing Process
Step 1: ________ -> Observation: ________
Step 2: ________ -> Observation: ________

## 5. Experiment Record

| Trial # | Changed Condition | Observed Result | Matches Prediction? | Notes |
|---------|-------------------|-----------------|---------------------|-------|
| 1 | | | [ ] Yes [ ] No | |
| 2 | | | [ ] Yes [ ] No | |
| 3 | | | [ ] Yes [ ] No | |

## 6. Experiment Results
[Objectively describe the observed phenomena]

Was my hypothesis correct? [ ] Completely [ ] Partially [ ] Incorrect

## 7. Analysis & Summary
[Explain the underlying mechanism or pattern; practical applications; remaining uncertainties]

## 8. Next Steps
[Follow-up experiments to refine, validate, or extend the discovered law]

## Quick Checklist
- [ ] Research question is clear
- [ ] Only changing one variable at a time
- [ ] Set up control group
- [ ] Recorded all observations
- [ ] Repeated test at least 3 times
- [ ] Summarized patterns or conclusions
The scientist consolidates discoveries into a shared knowledge book using a structured format that proved critical for performance: Claim (the discovered law), Evidence Proof (experimental evidence), Constraints (conditions where the law doesn't apply), and Example (a practical application). This outperforms both unstructured summarization and the more intuitive Finding-Explanation-Example format.
For each distinct discovery or significant conclusion, generate an entry in the knowledge book strictly following this structure:

### 1. Claim (Law)
State the finding as a universal, law-like statement (similar to a mathematical theorem or physical law). It must be concise, precise, and assertive.
Bad: "Redstone gets weaker as it goes far."
Good: "The Law of Linear Signal Decay: Redstone signal strength S decreases by exactly 1 unit per block of distance d, such that S_d = S_initial - d."

### 2. Proof
Provide the rigorous logic that supports this claim.
- Synthesize specific evidence from the experiments.
- Connect prior established findings with new data.
- Outline the deductive steps that lead inevitably to the claim.
- Maintain a high standard of rigor; mere observation is not proof -- explain *why* the observation confirms the law.

### 3. Constraints
Identify potential weaknesses or exceptions in the proof.
- Are there edge cases where the law fails?
- Are there assumptions that require further validation?
- Note any observed anomalies that contradict the main claim.

### 4. Example
Demonstrate the utility of this finding.
- Provide a concrete example or usage scenario.
- Describe a mechanism or design pattern that relies specifically on this law being true.
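A knowledge-book entry in this format maps naturally onto a small record type; a minimal sketch (field and class names are ours, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeEntry:
    """One knowledge-book entry in Claim-Proof-Constraints-Example form."""
    claim: str                  # law-like, assertive statement
    proof: str                  # synthesized evidence and deduction
    constraints: list[str] = field(default_factory=list)  # known exceptions
    example: str = ""           # concrete mechanism relying on the law

    def render(self) -> str:
        """Serialize the entry as a markdown section for the shared book."""
        parts = [
            f"### Claim\n{self.claim}",
            f"### Proof\n{self.proof}",
            "### Constraints\n" + "\n".join(f"- {c}" for c in self.constraints),
            f"### Example\n{self.example}",
        ]
        return "\n\n".join(parts)
```

Keeping constraints as an explicit list—rather than folding exceptions into free-form explanation—mirrors why this format outperforms Finding-Explanation-Example: the conditions under which a claim holds are stated separately from the claim itself.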
### The Law of Linear Signal Attenuation

#### Claim
Redstone signal strength S decays linearly by exactly 1 unit for every block of distance traversed d, governed by the equation S_d = S_source - d. The maximum transmission range without amplification is 15 blocks, occurring when S_source = 15.

#### Proof
- Observation: Placing a redstone wire adjacent to a power source (Strength 15) results in a strength of 14 on the wire.
- Deduction: The signal terminates (Strength 0) exactly at the 16th block.
- Geometry: Distance is calculated using Manhattan geometry (Δx + Δz), not Euclidean distance.

#### Constraints
- This law applies strictly to Redstone Dust. Solid blocks powered by repeaters do not degrade the signal within the component itself, but the dust exiting them restarts the decay.
- Signal strength does not decay when passing through a Comparator in comparison mode.

#### Example
Trunk with Pre-Junction Boost: Place a repeater immediately before a trunk line splits into branches. This ensures all branches receive maximum signal strength (15) and resets the decay counter.

    B (SS=15)
        |
    [14 blocks of wire]
        |
    R (Repeater, boosts to SS=15)
        |
    +-------+-------+
    |       |       |
    W(15)  W(15)  W(15)   <-- All branches start fresh
    |       |       |
    L1     L2     L3
We evaluate 8 state-of-the-art models using Claude Code as the agent framework with a budget of 50 verification trials per task. Results are averaged over 8 runs.
Baseline success rates (%, Curriculum setting). Even the best model achieves only 26% — despite parameter counts ranging from 32B to an estimated 1.7T.
We isolate each capacity gap through sequential ablations. The table below shows how each intervention progressively improves performance, and the corresponding gap magnitudes:
| Model | Baseline | Iden. Gap ($\delta_{id}$) | w/ Hint | Disc. Gap ($\delta_{ds}$) | w/ Hint + Scientist | App. Gap ($\delta_{app}$) |
|---|---|---|---|---|---|---|
| gemini-3-pro | 26.0 | Δ26.5 (1.02×) | 52.5 | Δ11.5 (0.44×) | 64.0 | Δ36.0 (1.38×) |
| gpt-5.2 | 25.5 | Δ25.5 (1.00×) | 51.0 | Δ9.0 (0.35×) | 60.0 | Δ40.0 (1.57×) |
| claude-opus-4.5 | 21.0 | Δ25.0 (1.19×) | 46.0 | Δ13.0 (0.62×) | 59.0 | Δ41.0 (1.95×) |
| glm-4.7 | 23.0 | Δ22.5 (0.98×) | 45.5 | Δ7.5 (0.33×) | 53.0 | Δ47.0 (2.04×) |
| grok-4 | 22.5 | Δ20.0 (0.89×) | 42.5 | Δ14.0 (0.62×) | 56.5 | Δ43.5 (1.93×) |
| qwen3-235b | 18.5 | Δ24.0 (1.30×) | 42.5 | Δ13.0 (0.70×) | 55.5 | Δ44.5 (2.41×) |
| qwen2.5-72b | 14.0 | Δ15.0 (1.07×) | 29.0 | Δ14.0 (1.00×) | 43.0 | Δ57.0 (4.07×) |
| qwen3-32b | 10.5 | Δ27.0 (2.57×) | 37.5 | Δ9.0 (0.86×) | 46.5 | Δ53.5 (5.10×) |
Table 2. Model Performance and Capacity Gap Decomposition (Curriculum Setting). Gap columns (gray) show marginal gains ($\delta$) and relative ratios ($r_\delta$). For frontier models, the knowledge identification gap ($\sim$1.0×) nearly equals baseline; for weaker models, the application gap balloons (up to 5.10×).
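The gap columns of Table 2 follow directly from the raw success rates: each delta is the marginal gain of one intervention, the application gap is the remaining headroom, and each ratio normalizes by baseline. A minimal sketch of this arithmetic (the function name is ours):

```python
def decompose_gaps(baseline, with_hint, with_hint_scientist):
    """Reproduce the gap arithmetic of Table 2 (success rates in %).
    Returns (deltas, ratios) where each delta is an intervention's
    marginal gain and each ratio normalizes that delta by baseline."""
    d_id = with_hint - baseline               # knowledge gap identification
    d_ds = with_hint_scientist - with_hint    # experimental discovery
    d_app = 100.0 - with_hint_scientist       # knowledge application (residual)
    deltas = (d_id, d_ds, d_app)
    ratios = tuple(round(d / baseline, 2) for d in deltas)
    return deltas, ratios
```

For the gemini-3-pro row (26.0, 52.5, 64.0) this yields deltas (26.5, 11.5, 36.0) and ratios (1.02, 0.44, 1.38), matching the table.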
Capacity gap decomposition across models (Curriculum setting). For frontier models, the identification gap (blue) roughly equals baseline performance. For weaker models, application capacity dominates the failure.
| Consolidation Structure | w/ Hint + Scientist |
|---|---|
| Self-determined Summary | 58.0 |
| Finding-Explanation-Example | 60.5 |
| Claim-Proof-Constraints-Example | 64.0 |
Table 3. Knowledge Consolidation Structure Ablation (Gemini 3 Pro, 8 runs). Unstructured summarization reaches only 58.0%, above hints alone (52.5%). The Claim-Proof-Constraints-Example format performs best at 64.0%.
| Level | Baseline | w/ Hint | w/ Scientist | w/ Hint + Scientist |
|---|---|---|---|---|
| L1 (Primitive) | 57.5 | 82.5 | 57.5 | 92.5 |
| L2 (Basic) | 32.5 | 72.5 | 50.0 | 90.0 |
| L3 (Intermediate) | 27.5 | 67.5 | 47.5 | 82.5 |
| L4 (Advanced) | 12.5 | 40.0 | 20.0 | 55.0 |
| L5 (Complex) | 0.0 | 0.0 | 0.0 | 0.0 |
| Average | 26.0 | 52.5 | 35.0 | 64.0 |
Table 4. Per-level performance breakdown (Gemini-3-Pro, Curriculum setting, 8 runs). Baseline performance drops sharply from L3 onward. Even with full support (Hint + Scientist), no model succeeds at Level 5, leaving it open for future progress.
To ground our quantitative decomposition in concrete agent behavior, we constructed 12 device variants of the 32-lamp broadcast task, each exhibiting a distinct failure mode. These failures cluster into three categories that align with our capacity decomposition. Below are five representative cases—press play to see the button activation and which lamps light up (or fail to).
Correct four-axis topology with branch wires delivering signal to all 32 lamps. This is the target behavior all models aim to achieve.
The agent places four repeaters but orients them backwards, creating one-way barriers that block signal to 24 outer lamps. Typical of mid-tier models (Grok-4, GLM-4.7) that understand repeater usage but lack precise directional reasoning.
A 94-wire zigzag path with no repeaters. Signal decays to zero after 15 blocks, leaving distant lamps dark. Common in smaller models (Qwen3-32B, Qwen2.5-72B) that have not discovered the signal decay rule.
Two "island" sub-circuits look plausible but have no wire connection to the button. The agent doesn't recognize that every section must be physically connected. Observed across all model tiers.
N–S wires carry full power but only connect north/south. Lamps between lines stay dark due to wire directionality—the most subtle failure class, requiring deep domain knowledge to diagnose.
The 12 failure cases span a progression from obvious structural errors to subtle semantic misunderstandings, mirroring the capacity gap hierarchy from our quantitative analysis:
| Category | Cases | What Breaks | Capacity Gap | Model Tiers Affected |
|---|---|---|---|---|
| Structural | Glass Pedestal (0/32), Broken Bridge (0/32), Axes Only (0/32), Backwards Repeaters (8/32) | Incorrect block placement, orientation, or missing connections | Application | Primarily mid & lower tier |
| Signal Propagation | Long Snake (10/32), Missing Rails (24/32), Tiny Core (8/32) | Signal cannot reach all lamps due to missing amplification or coverage | Discovery | Lower tier models |
| Wire Semantics | Islands (16/32), Parallel Lines (14/32), Connection Trap (20/32), Dead-End Hooks (12/20), The Ring (0/16) | Circuit appears correct but wire directionality prevents power delivery | Identification | All tiers including frontier |
Key insight: Category 3 (wire semantics) failures are the hardest to diagnose—circuits carry full signal power yet deliver zero to lamps. Even frontier models (Gemini-3-Pro, GPT-5.2) produce these failures, as they require recognizing that wire connection direction determines which blocks receive power—a mechanic not reliably captured in training data.
While most models suffer primarily from limited general application ability, knowledge gap identification emerges as the primary bottleneck for frontier models—its capacity gap is nearly three times larger than that of experimental discovery. This aligns with Einstein's observation that formulating a problem is often more essential than solving it: identifying a gap requires knowing what is unknown, and posing effective questions demands discerning which unexplored areas are most promising.
Our scientist sub-agent yields relative gains of 0.33–1.00× of baseline across all model tiers, but the very fact that such a simple intervention helps so much exposes a deeper problem: current LLMs lack autonomous discovery capabilities. The scientific method should, in principle, already reside in their prior knowledge, yet models fail to apply it without explicit scaffolding. This highlights a critical gap as AI is increasingly expected to accelerate scientific breakthroughs.
The stark differences among consolidation formats reveal that LLMs perform poorly at determining how knowledge should be stored. The structured Claim-Proof-Constraints-Example format (resembling mathematical derivation) outperforms the more intuitive Finding-Explanation-Example format, likely because it more explicitly delineates conditions under which claims hold.
If you find SciCrafter useful in your research, please consider citing:
@article{scicrafter2025,
title = {SciCrafter: Can Current Language Models Close
the Discovery-to-Application Loop?},
author = {Anonymous},
journal = {Under Review},
year = {2025}
}