Ask HN: How do you choose a model for a task?

How do you decide a model is good enough for a given task? Right now I use Opus for planning and harder tasks and switch to sonnet for more defined tasks. But I feel like sonnet is kind of stupid and is introducing issues because it can’t grasp the larger context? Is there some definitive way to say a model is good enough for a task? Or is it all vibes?

10 points | by bix6 2 days ago

13 comments

PaulHoule 2 days ago
Evaluation is harder than you think because of statistics.
Like if you want to accurately know if one model is better than another you have to test it on hundreds if not thousands of examples which are carefully graded in difficulty, not in the training sets, etc.
Practically you might try model A and model B and use each one 2-3 times on different tasks and walk out with the impression that A is really good and B sux, but it could be model A got lucky because you asked it to do things it is good at or maybe it just got lucky and got the right answer anyway.
See https://arxiv.org/html/2410.12972v1 and https://arxiv.org/pdf/2505.14810 -- those papers are considering a general space of tasks but you could totally do the same kind of eval for the tasks you care about.
[-]
- bix6 2 days ago
  Have you implemented any of this in practice? Eg are you benchmarking models?
  [-]
  - PaulHoule 2 days ago
    I've done some for classification, ranking, and other sorts of non-generative tasks.
freedomben 2 days ago
This is a hard problem for me as well. Right now I've just been using the best model available (like Opus, or GPT 5.5, or Gemin Pro) but it's not ideal. My problem is anytime I step down the results are subtlely worse and sometimes I don't notice immediately depending on what I'm doing.
As far as Opus vs. GPT 5.5 etc, I generally decide with:
1. Code? -> Opus
2. Docs? -> GPT
3. Real-time or recent information needed? -> Gemini
It's far from perfect though. Would love to hear others thoughts.
[-]
- bix6 2 days ago
  Opus eats tokens so fast so I try to minimize it but compared to Sonnet I definitely see fewer issues in my larger projects. Sonnet has gone off the rails a few times.
- CanopyCoder 57 minutes ago
  [flagged]
kamscruz 1 day ago
I have been using Sonnet 4.6 for quite a long pretty much satisfied with it. My job involves preparing business plans and write-ups after doing a proper ground work and preliminary market analysis. I'm not sure how to decide which one is good for my task like Sonnet or Opus. And after I read all the comments about how lame opus latest version has become, I never touched it.
2ero_wf 22 hours ago
Sonnet 4.6 for coding, some initial research Deepseek(V4 pro, R1) for building actual product keyword search/summary, cheap model like deepseek v4-flash ( i'd do this if i know the task needs 0 reasoning )
other models are mostly for cross-validation
wontopos 1 day ago
Mostly vibes, but you can make the vibes more reliable. I set a “error budget” per task type - if I have to correct the model’s output more than once every ~5 runs, it’s not good enough for that task. Cheap to track, and it forces you to notice degradation instead of just feeling it.
noashavit 2 days ago
Gemini for recent search and google workspace automation
Perplexity for deep research
Claude Opus for coding, Sonnet for writing
Gemma4 for local AI overviews and analysis
Qwen coder for local prototyping
[-]
- abstract257 1 day ago
  Similar for me, just no perplexity and I try to utilize more of Qwen for my coding, instead of cursor/claude, but i see my lazyness. This brings the advantage of having to go back to the basics aswell e.g. Search and Replace instead of just a brainless prompt...
mikejulietbravo 1 day ago
The short answer is that it depends how well you define the boundaries of the task and the relative complexity. For example, smaller model is usually fine for something like summarization, but an "easier" coding task might still actually be quite difficult unless you eval it heavily like @paulhoule said
shouvik12 2 days ago
for short, stateless stuff,definitions, formatting, quick lookups I have never noticed a meaningful difference between models. But anything that requires reasoning across a lot of prior context, it's usually claude sonet or opus. But feels like the vibe will soon take me to codex
cyanydeez 1 day ago
I've using the same qwen3-coder-next for open code for everything. Models should be considered like an investment in quirks, constant fomo that you're missing some magic token experience robs you of familiarity with construction.
It's likely any of the SOTA will do what you want if you take the time to learn the prompts.
journal 1 day ago
Use that model in production that gives you acceptable answer 1000 times in a row.
jabeer 2 days ago
[flagged]
warren455 1 day ago
[flagged]
OutrageousTea 2 days ago
[flagged]