Notes from a meal-planning AI model shootout

Two AI models produced the same passing meal plan for the same test households. One model cost 30x less than the other, while being 2.5x faster.

The surprise isn't that the cheap model won - it's that we could only see it because we'd already decided, in writing, what "a good plan" meant.

One of the use cases that tests Nornilo's capabilities is Madklar, an AI-supported meal-planning app for busy families.

Every plan the user sees is generated by an LLM picking a week's worth of dinners from a household's profile and preferences, available ingredients in the pantry, and recent history.

Which LLM, though? "Try a few and see" doesn't scale, especially when prices and capabilities shift so fast. So we built a small shootout - a controlled, reproducible way to compare models on the actual task.

Defining a "contract" for what counts as success

The first thing we had to write was a contract: what does "a good plan" actually mean?

If we can't decide that before any model runs, we can't compare them.

The contract for make_meal_plan v0.5 covers a single JSON output, checked by eight hard rules and five soft scorers. Hard rules are pass/fail:

  1. The output is valid JSON
  2. It matches a specific envelope schema
  3. It contains exactly the right number of days
  4. The dates are consecutive from the requested start
  5. Each day has exactly one dinner
  6. Recipe IDs are well-formed slugs and unique within the plan
  7. The headcount on each meal matches the resolved per-day attendance
  8. No meal violates a member's declared dietary restrictions
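
A minimal sketch of how several of these rules could be checked, not Madklar's actual validator. The envelope shape (`days`, `date`, `dinner`, `recipe_id`, `headcount`), the slug pattern, and the function name `check_plan` are assumptions made for illustration. Rule 8 is left out because it needs recipe and member data.

```python
import json
import re
from datetime import date, timedelta

SLUG = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")  # assumed slug shape


def check_plan(raw: str, start: date, attendance: list[int]) -> list[str]:
    """Return hard-rule violations; an empty list means the plan passes."""
    try:
        plan = json.loads(raw)  # rule 1: fenced or chatty replies fail here
    except json.JSONDecodeError:
        return ["not valid JSON"]
    days = plan.get("days") if isinstance(plan, dict) else None
    if not isinstance(days, list):
        return ["envelope has no 'days' list"]  # crude stand-in for rule 2

    errors = []
    if len(days) != len(attendance):
        errors.append("wrong number of days")  # rule 3
    seen = set()
    for i, (day, expected) in enumerate(zip(days, attendance)):
        dinner = day.get("dinner") if isinstance(day, dict) else None
        if not isinstance(dinner, dict):
            errors.append(f"day {i}: no dinner")  # rule 5
            continue
        if day.get("date") != (start + timedelta(days=i)).isoformat():
            errors.append(f"day {i}: date not consecutive")  # rule 4
        rid = str(dinner.get("recipe_id", ""))
        if not SLUG.match(rid) or rid in seen:
            errors.append(f"day {i}: bad or repeated recipe id")  # rule 6
        seen.add(rid)
        if dinner.get("headcount") != expected:
            errors.append(f"day {i}: headcount mismatch")  # rule 7
    return errors
```

Returning a list of violations rather than a single boolean keeps the pass/fail semantics (empty list means pass) while still recording which rule a model tripped over.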

Soft scorers are 0.0–1.0 and just get logged:

  • Language match - if the household speaks Danish, does the model write Danish meal names?
  • No-repeats-from-history - does the plan avoid dishes the household just cooked?
  • Protein/starch variety - fish twice, chicken twice, vegetarian twice, etc.
  • Pantry utilisation - does the plan actually use what's already on the shelf?
  • Format strictness - raw JSON, or does the model add unwanted extra characters?
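
Three of these scorers are simple enough to sketch. The definitions below are assumptions chosen for illustration (the post does not say how each score is computed), and the function names are hypothetical.

```python
def no_repeats_score(plan_ids: list[str], history_ids: set[str]) -> float:
    """Share of planned dinners that do not appear in recent history."""
    if not plan_ids:
        return 0.0
    fresh = sum(1 for rid in plan_ids if rid not in history_ids)
    return fresh / len(plan_ids)


def pantry_utilisation(used: set[str], pantry: set[str]) -> float:
    """Share of pantry items that the plan's ingredients draw on."""
    if not pantry:
        return 1.0  # assumption: an empty shelf cannot be under-used
    return len(used & pantry) / len(pantry)


def format_strictness(raw: str) -> float:
    """1.0 if the reply, ignoring outer whitespace, is a bare {...} object."""
    text = raw.strip()
    return 1.0 if text.startswith("{") and text.endswith("}") else 0.0
```

Each scorer stays in the 0.0–1.0 range, so the results can be logged side by side without affecting the pass/fail verdict.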

The contract is the testable version of "a good plan." Everything else flows from it - the validators implement it, the runner calls models against it, the dashboard renders pass-rates against it.

[Figure: diagram of the Nornilo model evaluation loop: contract, fixtures, AI router, model slate, validator, analytics, and final model choice.]
The model itself is only one box in the loop. The contract, the test fixtures, and the validator are just as important for the outcome - and they're the parts that survive when a new model lands on the menu.

Test households

Real household data is a privacy minefield. So we built three synthetic ones for a start, plus our own:

  • time-pressed-parent-2-kids: the base case. Two parents, two kids, no restrictions, 30-min weeknight cap.
  • single-dad-three-kids: high-volume, age-banded, one picky eater (notes only, not a structured diet).
  • vegetarian-tree-nut-allergy: household-wide vegetarian + free of tree nuts + one Tuesday-only kid. Catches the "pesto with pine nuts" problem.
  • dravecky-skov: our family. A hard one. Six members, fluid attendance, Danish language, overlapping restrictions: lactose-free × 2, no-mushrooms × 2, no-tofu for one of the teens, no-soups-as-mains for another. Pescatarian household. Recent meal history with rated dishes and Danish notes about what worked and what didn't.

The dravecky-skov household is the one no model could solve in the v0.5 sweep. That's the point: solving the easy cases doesn't prove we can succeed in real life.
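
Fluid attendance is what makes the headcount rule hard, so here is a small sketch of resolving per-day attendance from a fixture. The member format (an optional `weekdays` set, Monday as 0) is a hypothetical one, not the fixture schema used in the shootout.

```python
from datetime import date, timedelta


def resolve_attendance(members: list[dict], start: date, days: int = 7) -> list[int]:
    """Headcount per day; a member without 'weekdays' attends every day."""
    counts = []
    for offset in range(days):
        weekday = (start + timedelta(days=offset)).weekday()  # Monday == 0
        counts.append(
            sum(1 for m in members if weekday in m.get("weekdays", range(7)))
        )
    return counts


# Two parents, one kid who is always home, one Tuesday-only kid.
household = [
    {"name": "parent-1"},
    {"name": "parent-2"},
    {"name": "kid-1"},
    {"name": "kid-2", "weekdays": {1}},
]
print(resolve_attendance(household, date(2025, 1, 6)))  # week starting on a Monday
```

A plan that cooks for four every night would fail the headcount rule on six of those seven days.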

The v0.5 baseline: seven AI model candidates, four households

The first run included seven AI models served through Cloudflare Workers AI. We ran each of them on all four households, with three replicates per cell: 84 calls, about 25 minutes of wall time.
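
The call count is just the size of the sweep grid. A sketch, with placeholder model names since the post does not list the seven candidates:

```python
from itertools import product


def build_calls(models: list[str], households: list[str], replicates: int) -> list[tuple]:
    """One (model, household, replicate) tuple per call in the sweep."""
    return list(product(models, households, range(replicates)))


models = [f"model-{i}" for i in range(1, 8)]  # placeholder names
households = [
    "time-pressed-parent-2-kids",
    "single-dad-three-kids",
    "vegetarian-tree-nut-allergy",
    "dravecky-skov",
]
calls = build_calls(models, households, replicates=3)
print(len(calls))  # 7 models x 4 households x 3 replicates = 84
```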

The headline: no model solved the dravecky-skov problem in the first sweep. Every candidate ran into something - markdown fences around the JSON, cooking for six on a day when only three were home, or suggesting cheese to lactose-free members.

The best model passed only 50% of cases, with a median latency of 15 seconds per plan. Some were much worse.

I was getting worried: if we couldn't move this number significantly, Madklar would never take off.

Prompt iteration - small models, literal models

For v0.6 we focused on the hardest household, the two best models from v0.5, and one question: is the failure a model problem or a prompt problem? We wrote nine variants, among them terse, schema-first, few-shot, constraint-first, fence-warning, OpenAI's literal small-model style, a hidden attendance ledger, and a "validator-mirror" that read back the hard rules.

Three findings did most of the work:

  • Format compliance is structural, not promptable. The variant that added three explicit "DO NOT use markdown fences" reminders did not move a single fence-wrapping model.
  • Short prompts beat long ones. The terse and schema-first variants were the surprise winners. Long explanations buried the per-day attendance clause, and the model would default to "everyone attends every meal."
  • OpenAI's "literal small-model" guidance held up, and so did a prompt phrased to mirror the validator directly.
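
To make the "validator-mirror" idea concrete, here is a sketch of a prompt builder that reads the hard rules back to the model as a numbered checklist. The wording of the prompt and the function name are invented for illustration; the actual variant used in the shootout is not shown in this post.

```python
HARD_RULES = [
    "The output is valid JSON",
    "It matches the envelope schema",
    "It contains exactly the right number of days",
    "The dates are consecutive from the requested start",
    "Each day has exactly one dinner",
    "Recipe IDs are well-formed slugs and unique within the plan",
    "The headcount on each meal matches the per-day attendance",
    "No meal violates a member's declared dietary restrictions",
]


def validator_mirror_prompt(task: str, rules: list[str]) -> str:
    """Build a short prompt that lists the checks the reply must pass."""
    lines = [task, "", "Your reply is rejected unless ALL of these checks pass:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules, start=1)]
    return "\n".join(lines)


print(validator_mirror_prompt("Plan 7 dinners starting 2025-01-06.", HARD_RULES))
```

Generating the checklist from the same rule list the validator uses keeps prompt and contract from drifting apart.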

Broad sweep, then precision pass

Once we added GPT-5.4-mini and Claude Haiku to the AI router's menu, we extended the sweep to four models. Then we added seven more candidates from Cloudflare's Workers AI menu. By the end we had 78 cells across 11 models.

The next precision pass was bigger and cleaner: 336 calls across four households, four prompt variants, seven models, and three replicates per cell. It produced 290 passes and one upstream error.

[Figure: heatmap of pass-rate per model and household. Across all models the dravecky-skov column has the lowest pass-rates; single-dad-three-kids and time-pressed-parent-2-kids are easiest. Claude Sonnet struggles most on dravecky-skov (5/12); GPT-5.4-mini scores 9/12 there, the strongest result.]
Pass-rate per model × household cell. The dravecky-skov column is the difficulty meter: six members, fluid attendance, overlapping restrictions, Danish language. Every model is weaker there than on the easy households - and the gap is where a model actually earns its place in production.

[Figure: heatmap of pass-rate per model and prompt variant (A, C, G, I) across the precision pass. GPT-5.4-mini and Claude Haiku score 11/12 or 12/12 across all four prompts; Opus is mostly 12/12; GPT-OSS-20B and Sonnet are mixed; nano is brittle on prompt A but clean on prompt I.]
Pass-rate per model × prompt cell across the precision pass (12 calls per cell). The cleaner the model, the more uniformly green the row. Brittle models swing wildly between prompts - nano is the most dramatic example.

Crucially, we don't need a model to work with every prompt. In production each model runs against the single prompt that suits it best. The right question is: how many model × prompt combinations hit a clean 12/12 across all four households - and what do they cost?

Nine combinations did. Plotted on cost and latency, they tell a sharper story than any average pass-rate:

[Figure: scatter plot of model + prompt combinations that passed unanimously across replicates, plotted by p95 latency (x-axis, 0–12.5 seconds) and cost per call (y-axis, dollars). GPT-5.4-mini and GPT-5.4-nano cluster in the cheap, fast bottom-left; Claude Opus sits in the slow, expensive top-right; open-source GPT-OSS models are cheap but slow on the right.]
Each dot is a model+prompt combination that passed every hard rule across all four households. Bottom-left is what you want: cheap and fast. GPT-5.4-mini · C/I and GPT-5.4-nano · I land there. Opus · A/C delivers the same result from the top of the chart.
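
The filter behind the chart can be sketched in a few lines, assuming the run log is available as one `(model, prompt, household, passed)` row per call. That row shape and the function name are assumptions, not the analytics code used here.

```python
from collections import Counter


def unanimous_combos(rows, households, replicates=3):
    """Return (model, prompt) pairs that passed every call in every household.

    rows: iterable of (model, prompt, household, passed) tuples, one per call.
    """
    passed_calls = Counter()
    for model, prompt, household, passed in rows:
        if passed and household in households:
            passed_calls[(model, prompt)] += 1
    needed = len(households) * replicates  # 4 households x 3 replicates = 12
    return sorted(combo for combo, n in passed_calls.items() if n == needed)
```

A combination with a single failed replicate anywhere drops out, which is a much stricter bar than a high average pass-rate.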

GPT-5.4-nano · I is the surprise winner. Once paired with the validator-mirror prompt, nano clears 12/12 across every household - including dravecky-skov - at a median cost of $0.0007 per call and ~3s p95 latency. Across all four prompts nano averaged only 35/48, but that average is the wrong frame: with the right prompt, this is the cheapest passing model on the menu and ~30x cheaper than Opus.

GPT-5.4-mini · C and · I are the safe pragmatic default. Two unanimous combinations, ~$0.0024 per call, ~2s median latency. Roughly 9x cheaper than Opus for the same structural result, and far less prompt-sensitive than nano if you only want to maintain one configuration.

Claude Opus · A and · C earn the premium only on the hardest edges. Two unanimous combinations at $0.0212 per call and 6.4s median latency. On the easy households Opus is paying ~30x for a result nano already gets right. Where it pulls ahead is the harder edges of dravecky-skov where nano stumbles on the wrong prompt.

Claude Haiku · A and · I are unanimous too, with a catch. Haiku wraps its JSON in markdown fences every single time. The evaluator was generous and stripped the fence; production would need a response-cleaning step, and that becomes part of the operating cost.
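
A response-cleaning step of that kind could look like the sketch below. It only handles the simple case of one fence wrapped around the whole reply, and the function name is hypothetical.

```python
import json


def strip_fence(raw: str) -> str:
    """Remove a single markdown fence wrapped around the whole reply, if any."""
    text = raw.strip()
    if text.startswith("```") and text.endswith("```"):
        first_newline = text.find("\n")
        # Drop the opening fence line (with any language tag) and the closing fence.
        text = text[first_newline + 1 : -3] if first_newline != -1 else text[3:-3]
    return text.strip()


fenced = '```json\n{"days": []}\n```'
print(json.loads(strip_fence(fenced)))
```

Unfenced replies pass through unchanged apart from whitespace trimming, so the step is safe to run for every model.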

The broader lesson is not "always use the smallest model." It is "measure the task." If a workflow is standardised, bounded, and backed by validators, you can often choose a model that is good enough rather than the largest one on the menu.
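
The cost ratios quoted above follow directly from the per-call figures. A quick check, where the 10,000-plan volume is a made-up number for scale only:

```python
# Per-call costs in USD, as quoted in this post.
COST = {
    "GPT-5.4-nano / I": 0.0007,
    "GPT-5.4-mini / C, I": 0.0024,
    "Claude Opus / A, C": 0.0212,
}


def times_cheaper(cost: float, baseline: float) -> float:
    """How many times cheaper `cost` is than `baseline`."""
    return baseline / cost


opus = COST["Claude Opus / A, C"]
for name, cost in COST.items():
    ratio = times_cheaper(cost, opus)
    print(f"{name}: {ratio:.1f}x vs Opus, ${cost * 10_000:.2f} per 10,000 plans")
```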

What the validator can't see yet

Every model in the shootout produced invented recipes - names that sound real but aren't tied to anything in the household's actual recipe library. A plan can score 100% on every validator and still be worthless to a user if "Polenta-Ost Gratin" isn't a recipe they have or want.

The next contract revision will add a recipe_grounded rule: every recipe ID in the plan must resolve to a real recipe in Madklar's (or the household's) library. Once we have grounding, additional criteria become possible - budget, prep time, culinary attractiveness - and the question shifts from "which model produces a structurally-valid plan?" to "which model produces a plan a household will actually cook and eat?"
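
A minimal sketch of what a recipe_grounded check might look like, assuming the library is available as a set of recipe IDs. The signature is a guess at a rule that does not exist yet.

```python
def recipe_grounded(plan_ids: list[str], library_ids: set[str]) -> list[str]:
    """Return the plan's recipe IDs that do not resolve to the library.

    An empty list means the rule passes.
    """
    return [rid for rid in plan_ids if rid not in library_ids]


library = {"fiskefrikadeller", "linsesuppe", "ovnbagt-laks"}
print(recipe_grounded(["ovnbagt-laks", "polenta-ost-gratin"], library))
```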

That progression is itself the point worth keeping. The validator became a product spec in disguise: once the rules were explicit enough to score a model, they were also explicit enough to describe what Madklar should guarantee in production - correct dates, real attendance, dietary safety, language match, grounded recipes. The shootout stopped being a side experiment and became a sharper definition of the product.

For today's question, GPT-5.4-nano paired with the validator-mirror prompt is our pick: fast, strict about JSON, ~30x cheaper than Opus, and still reliable on the hardest household. The model market will keep moving and that answer will change. The contract, the fixtures, and the validator won't, unless our business logic changes.

If you're running an AI-backed workflow today, three questions matter more than any benchmark someone else publishes:


  1. What does "a good output" mean, in writing?
  2. Which cases would expose every weakness of that definition?
  3. How can we find the best-fitting model and prompt and keep tracking its performance?

Get those three right and the model question answers itself every time the market moves.
