5 comments

  • zihotki 12 minutes ago
    I wonder if this benchmark brings any value. Models are already quite capable and reach high scores on it.
  • stared 46 minutes ago
    Thank you for sharing the benchmark. However, the results are selective.

    Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

    If there is some other criterion (e.g. models within a certain time frame or budget), great - just make it explicit.

    When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.

    • Flux159 32 minutes ago
      Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.

      Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.

  • dalberto 7 minutes ago
    A benchmark without Opus 4.6/4.7 feels incomplete.
  • iLoveOncall 3 minutes ago
    This is just a hallucination benchmark on a subset of outputs; not sure there's any value over general hallucination benchmarks?

    > Our goal is to be the best general model for deterministic tasks

    I'm sorry, but this simply doesn't make sense. If you want deterministic output, don't use an LLM.

  • alphainfo 14 minutes ago
    [flagged]