SwellJoe 1 day ago

I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performed poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.

https://swelljoe.com/post/will-it-mythos/

  • hedgehog 5 hours ago

    It would be really interesting to see how the Qwen 3.6 35B model compares to the 27B on your benchmark.

Balinares 17 hours ago

I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.

  • chid 16 hours ago

    I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.

  • juliangoldsmith 11 hours ago

    It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed it until I looked closely.

    In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it was fast and cheap enough it might.

    I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.

nzach 17 hours ago

Instead of training the model to directly answer questions, we trained the model to always write and execute the code that would solve the question?

If that is the case, isn't this just a fancy way to perform prompt optimization?