# Observatory Run — 2026-05-23

**3 prompts · 4 models · 12 calls · $0.0335 total cost**

## Top movers

| Prompt | Model | Score change | Direction |
|--------|-------|-------------|-----------|
| code-explain-recursive | gpt-4o | −2 | Degraded |

## Broken (newly failing)

- **code-explain-recursive** ("Explain a recursive function in plain English") / gpt-4o: was passing (score 7), now failing (score 6). The response overused the word "recursion" and included inline code notation, although the rubric requires plain-English output.
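The "newly failing" label comes from comparing each prompt/model pair's pass state in this run with its state in the prior run. Below is a minimal Python sketch of that comparison. It assumes a single pass threshold, `PASS_THRESHOLD = 7`, which is hypothetical and chosen only because a score of 7 passed and a score of 6 failed here. The names `classify` and `Status` are also illustrative. This report does not state the Observatory's actual rule.

```python
from enum import Enum

# Hypothetical cutoff: assumed only because score 7 passed and score 6 failed in this run.
PASS_THRESHOLD = 7


class Status(Enum):
    NEWLY_FAILING = "broken (newly failing)"
    NEWLY_PASSING = "newly passing"
    STILL_PASSING = "still passing"
    STILL_FAILING = "still failing"


def passes(score: float) -> bool:
    """A result passes if it meets the (assumed) threshold."""
    return score >= PASS_THRESHOLD


def classify(prior_score: float, current_score: float) -> Status:
    """Compare pass/fail state across two runs for one prompt/model pair."""
    was, now = passes(prior_score), passes(current_score)
    if was and not now:
        return Status.NEWLY_FAILING
    if not was and now:
        return Status.NEWLY_PASSING
    return Status.STILL_PASSING if now else Status.STILL_FAILING


if __name__ == "__main__":
    # Prior and current scores reported for code-explain-recursive / gpt-4o.
    print(classify(7, 6).value)  # broken (newly failing)
```

The sketch keys the label on crossing the threshold, not on the size of the score change. A pair can therefore appear under "Top movers" without being broken, or be broken after a small drop.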

## Newly passing

None this run.

## Judge flags

None this run.

## Model summary

| Model | Prompts passed | Prompts failed |
|-------|---------------|----------------|
| claude-opus-4-7 | 3 | 0 |
| claude-sonnet-4-6 | 3 | 0 |
| gpt-4o | 2 | 1 |
| gemini-2.5-pro | 3 | 0 |
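The model summary tallies one pass/fail result per prompt for each model: 3 prompts × 4 models gives the run's 12 calls. Below is a small Python sketch of that aggregation. It assumes results arrive as hypothetical `(prompt_id, model, passed)` tuples. Only `code-explain-recursive` is named in this report, so `prompt-b` and `prompt-c` are placeholders.

```python
# Models in this run, in the order the summary table lists them.
MODELS = ["claude-opus-4-7", "claude-sonnet-4-6", "gpt-4o", "gemini-2.5-pro"]
# Hypothetical prompt IDs: only code-explain-recursive appears in the report.
PROMPTS = ["code-explain-recursive", "prompt-b", "prompt-c"]
# The single failing pair reported in this run.
FAILURES = {("code-explain-recursive", "gpt-4o")}

# One (prompt_id, model, passed) record per call.
RESULTS = [(p, m, (p, m) not in FAILURES) for m in MODELS for p in PROMPTS]


def summarize(results: list[tuple[str, str, bool]]) -> dict[str, tuple[int, int]]:
    """Count passed and failed prompts per model, keeping first-seen model order."""
    summary: dict[str, list[int]] = {}
    for _prompt, model, ok in results:
        counts = summary.setdefault(model, [0, 0])
        counts[0 if ok else 1] += 1
    return {model: (p, f) for model, (p, f) in summary.items()}


if __name__ == "__main__":
    print("| Model | Prompts passed | Prompts failed |")
    for model, (p, f) in summarize(RESULTS).items():
        print(f"| {model} | {p} | {f} |")
```

Running it reproduces the rows above, with gpt-4o at 2 passed and 1 failed.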

## Notes

First run against the seed corpus of 3 prompts. GPT-4o degraded on the code-explain-recursive task: its response used "recursion" multiple times and included inline code notation in what should have been a plain-English explanation. All Claude and Gemini variants held or improved relative to the prior run.
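A deterministic pre-check could catch both problems behind the gpt-4o failure before judging: overusing "recursion" and putting inline code in a plain-English answer. Below is a hypothetical Python heuristic for that, not the Observatory's judge. It counts case-insensitive whole-word matches of "recursion" against an assumed limit, `MAX_MENTIONS = 2`, since the rubric gives no number. It also flags any Markdown inline code span delimited by single backticks.

```python
import re

MAX_MENTIONS = 2  # Hypothetical limit; the rubric in this report does not specify one.
RECURSION_WORD = re.compile(r"\brecursion\b", re.IGNORECASE)
INLINE_CODE = re.compile(r"`[^`\n]+`")  # Markdown inline code such as `f(n)`


def plain_english_violations(response: str) -> list[str]:
    """Return heuristic rubric violations found in a response."""
    violations = []
    mentions = len(RECURSION_WORD.findall(response))
    if mentions > MAX_MENTIONS:
        violations.append(f'"recursion" used {mentions} times')
    if INLINE_CODE.search(response):
        violations.append("inline code notation present")
    return violations


if __name__ == "__main__":
    flagged = (
        "Recursion means a function calls itself. With recursion, `f(n)` calls "
        "`f(n - 1)`; recursion stops at a base case."
    )
    clean = (
        "A recursive function solves a problem by calling itself on a smaller "
        "piece until it reaches a simple case."
    )
    print(plain_english_violations(flagged))
    print(plain_english_violations(clean))
```

This is a keyword and regex screen. It cannot judge whether an explanation is actually clear, so it would complement a rubric judge rather than replace one.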