# Observatory Run — 2026-05-23

**3 prompts · 4 models · 12 calls · $0.0335 total cost**
## Top movers

| Prompt | Model | Score change | Direction |
|--------|-------|--------------|-----------|
| code-explain-recursive | gpt-4o | −2 | Degraded |

## Broken (newly failing)

- **code-explain-recursive** ("Explain a recursive function in plain English") / gpt-4o: was passing (score 7), now failing (score 6). The response overused the word "recursion" and included inline code notation, although the rubric specifies plain-English output.

## Newly passing

None this run.

## Judge flags

None this run.

## Model summary

| Model | Prompts passed | Prompts failed |
|-------|----------------|----------------|
| claude-opus-4-7 | 3 | 0 |
| claude-sonnet-4-6 | 3 | 0 |
| gpt-4o | 2 | 1 |
| gemini-2.5-pro | 3 | 0 |

## Notes

First run against the seed corpus of 3 prompts. gpt-4o degraded on the code-explain-recursive task for the reasons listed under "Broken (newly failing)". All Claude and Gemini variants held or improved from the prior run.
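
The "Broken" and "Newly passing" categories in this report can be read as transitions across a pass threshold between two runs. The following is a minimal sketch of that reading, not the tool's actual implementation. The threshold of 7 is an assumption inferred only from the one reported case (score 7 passing, score 6 failing), and the names `Result`, `classify`, and `PASS_THRESHOLD` are hypothetical.

```python
from dataclasses import dataclass

# Assumption: inferred from the report, where score 7 passed and score 6 failed.
PASS_THRESHOLD = 7


@dataclass(frozen=True)
class Result:
    """One prompt/model pair with its score in the prior and current run."""
    prompt: str
    model: str
    previous_score: int
    current_score: int


def classify(result: Result, threshold: int = PASS_THRESHOLD) -> str:
    """Label a result by how it moved relative to the pass threshold."""
    was_passing = result.previous_score >= threshold
    is_passing = result.current_score >= threshold
    if was_passing and not is_passing:
        return "broken"
    if not was_passing and is_passing:
        return "newly passing"
    return "unchanged"


# The one newly failing case from this run, using the scores quoted above.
example = Result("code-explain-recursive", "gpt-4o", previous_score=7, current_score=6)
print(classify(example))
```

Under these assumptions the gpt-4o case lands in the "broken" bucket, while a pair that stays on the same side of the threshold is reported in neither list.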