Qwen 3.6 35B (finetuned) is so good that it has become the standard open-weights model for everyday use. It's not far at all from proprietary models if you give it tools, skills and agents etc.; it can actually finish the job. (Thank you, Qwen team, appreciated.) With open source we can now definitely rely on it to design very complicated architectures from scratch and build the full package pretty fast. I wish to see European AI unleashed. Wake up.
Do you have a good resource on how to finetune a model like Qwen? I am curious to try it out.
Unsloth has good resources
Here is a dataset you can choose from: https://huggingface.co/datasets/Avtrkrb/combined-reasoning-o... Take 10,000 samples from it according to your needs and go for it. The key (in my opinion), among other things, is not cutting the sequence length. Any traditional finetuning repo will do; if your hardware supports it, Unsloth is faster.
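For anyone who wants to see the sampling step spelled out, here is a minimal sketch. It assumes "not cutting the sequence length" means keeping each example whole and skipping the ones that do not fit your training window, rather than truncating them; that reading is an assumption. The `text` field, the whitespace token count, and the function name are all hypothetical stand-ins. A real run would use the dataset's own columns and the model's tokenizer.

```python
import random


def sample_without_truncation(records, k=10_000, max_tokens=None,
                              count_tokens=None, seed=0):
    """Reservoir-sample up to k records in one pass.

    Records longer than max_tokens are skipped entirely instead of being
    truncated, so every kept example is complete.
    """
    rng = random.Random(seed)
    # Hypothetical stand-in: whitespace split instead of a real tokenizer.
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    reservoir = []
    seen = 0  # number of records that passed the length filter so far
    for rec in records:
        if max_tokens is not None and count_tokens(rec["text"]) > max_tokens:
            continue  # skip, never cut
        seen += 1
        if len(reservoir) < k:
            reservoir.append(rec)
        else:
            # Classic reservoir step: keep the new record with probability k/seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

Because it is a single pass over an iterable, the same function works on a streamed dataset that does not fit in memory.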
> Is not far at all from proprietary models if you give it tools, skills and agents etc,
I use Qwen 3.6 27B, the dense version of this model, which is slightly better.
I don't agree that it's close at all. Maybe for some small, easy tasks, but not for working on real codebases. It's amazing for something I can run at home, but the difference between it and Opus or GPT-5.5 is huge.
I've had the opposite experience, and have built multiple fantastic applications with Qwen3.6 27b. What quantization have you tested with?
Similarly I haven't seen Qwen 27B as remotely competitive with Opus, at least Q4 hooked up to Claude Code. What harness are you using?
Really, how so? We work with codebases daily, so can you give us a concrete example? In our case we work on consumer(ish) hardware with 10 million ctx (1 million output and 1 million input proven; sometimes it loops or breaks at over 500k ctx, but at ~17 tps linear). It can read the full codebase, unleash agents, and write to disk, editing and patching files, creating a full app in 3-4 minutes. It can do web search and RAG pretty fast, and it understands the user query and system prompts and adapts or fixes them on the fly if needed. I am wondering, what more do you do?
Edit: Forgot to mention that it can process images, PDFs, and hundreds of other files; it can even create presentations in code or Mermaid, SVG, Chart.js, etc. Here is a basic version of it: https://hugston.com/chat
For coding it’s really bad. Writing is ok, chat is good. It’ll get better but it’s not that close yet
I don't think I can handle another small model release by qwen, I'm still trying to find the limits of 3.6 27B and they are already threatening us with a new one?
But jokes aside, I love the fast iteration, these are most probably again finetunes on the 3.5 architecture that appear better in internal testing, which is still very nice to see. Putting more and more pressure on the bigger labs to perform better is always a good thing.
How good must their training pipelines be? Releasing publicly and at this rate has made them very efficient.
Finetuning takes few resources; base model training is the slow and expensive part. Architecturally, the 3.5 models are identical to their 3.6 counterparts, which is why there is a consensus that these are probably finetunes and not retrained from scratch, much like the finetunes you will see many people publish on Hugging Face.
Understood, but look at their larger cadence over the years and the breadth of models. They are clearly not all finetunes. Meta, for all its billions, doesn't have anything comparable.
I am very interested in seeing new Qwen models. Qwen3.6 27b is the first one that can do things, doesn't constantly lose "its mind", and can be run on a 3090 with a good context size. But it sometimes gets into a loop.
I had a flavor of an older version of Qwen (I forget which one, to be fair) that was coding along and then lost itself in a loop. I was so confused. It was just a random greenfield "let's see how it does" type of project anyway.
I've completely replaced GitHub Copilot using Sonnet 3.6 with OpenCode using Qwen3.6 27b, and it's been a great experience.
Similar, but I'm using the 35B A3B variant with experimental MTP support
OpenCode is pretty good too
A3B is especially nice; MoE really shines on memory-bandwidth-constrained platforms like the DGX Spark.
Is Sonnet 3.6 a typo? Claude Sonnet 3.6 (aka 3.5 New) is an ancient model from 2024
Look on Hugging Face; there is a template that is supposed to fix the updates for the Qwen models.
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
Maybe it will help you?
So glad they’re holding steady on open weights.
At least for now. Worried the Chinese team will change their mind once they have parity
https://xcancel.com/Alibaba_Qwen/status/2056403591464984753
> Qwen3.7 Preview lands on Arena !
> Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision.
> Can't wait to release Qwen3.7 series models!Stay tuned! @arena
Vision has become totally underappreciated, whereas I believe it brings important advantages to a model
Also, a big caveat in using Qwen models has always been their speech patterns. I do wonder how Google made the Gemma lineup so good at this
Let's hope Alibaba continues to open source its models
Agreed. Incidentally, in my testing, qwen models (qwen3.6-35b-a3b and earlier 3.5) are WAY better with vision than gemma4-26b-a4b. I would normally want to stick with gemma4 only (I use it for spam filtering), but it just doesn't cut it for vision work, and qwen models do.
God I love qwen3.6-35b-a3b especially Q8
I second this notion, I am impressed daily with what little Qwen can do
That has been my experience as well.
Qwen 3.5/3.6 are far better at vision. Even the 9B model beats Gemma 4 31B in my use case. They describe the scene more accurately and they focus on the important elements like a human would.
Gemma 4 frequently misses important elements, doesn't understand what things are, and is very coy even if you ask for lots of detail. You have to give it hints like "hey what's that round thing on the left" to get half decent answers.
(Yes I did set the min-tokens correctly. I also tested bf16 and Q8 to make sure it wasn't a quant issue.)
It's unfortunate because Gemma 4 is so so so much better at natural language interactions.
Can someone explain what the current state of model benchmarking is? If you try to look up what the best locally runnable model is, you get a bunch of random blog posts using idiosyncratic criteria to rank things seemingly based on one dude's opinion.
Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.
I love the llmfit project for seeing what will run on your hardware, but it would be nice to know what I'm missing out on by not having better hardware, which is why objective hardware-agnostic ratings would be helpful.
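A rough sketch of the three filters described above, assuming a leaderboard is nothing more than a list of scored entries. The `Entry` fields and the `rank` function are hypothetical, and `score` is whatever metric you happen to trust. Hardware requirements are deliberately left out of the record, which is what makes the ranking hardware-agnostic.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    name: str
    open_weights: bool   # True if the weights can be downloaded and run locally
    released: date
    score: float         # hypothetical: any single quality metric


def rank(entries, open_only=True, released_after=None):
    """Filter by openness and release date, then sort best-first by score."""
    kept = [
        e for e in entries
        if (e.open_weights or not open_only)
        and (released_after is None or e.released >= released_after)
    ]
    return sorted(kept, key=lambda e: e.score, reverse=True)
```

The filtering is the easy part. The hard part, as the replies point out, is getting a `score` column that has not been gamed.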
That would be nice, but it's not going to be possible.
Any open benchmark has a very short life, since it will be pulled in and DPO / RL trained quickly for benchmaxxing purposes. So, you'll need a private test to have a hope of something fair. (These also get leaked over time, btw, so even then there's a window of usability).
These are expensive to run.
Now consider that there might be 15-20 viable quants for a given open model release; someone would have to want to pay for these private evals to be run on them. Even then, a good read through unsloth's commits and blog posts will remind you that there's quite a lot of engineering work to be done to get model inference working properly, even for models released by frontier or near-frontier labs. So, you'd want to make sure that you have a replicable 'best engineered' deployment to evaluate, or at least one that's closest to your hardware and fits the bill.
Upshot - it's much faster to download and try out a model, and possibly cheaper too. Well, cheaper since hugging face is paying the bandwidth bills.
>I just want to know what the best model is. Let me worry about how I will afford to run it.
This is a very typical manager question that I suppose many people have who fail to see the simple truth: There is no "best" model. There are only best models for certain use-cases. Sometimes you'll find these in custom community leaderboards on platforms like huggingface, but for most business applications you'll probably have to come up with your own benchmark. Most common benchmarks are pretty worthless by now because all the usual ones are being gamed hard by model providers, to the point that there are now sometimes drastic differences between models that perform very similarly on common benchmarks.
The best thing I have come up with is to just make a bunch of prompts / tasks that I personally care about and need a model to know how to do. As an example, when qwen3.6 27B dropped, I ran it, kimi, claude and glm 5/5.1 on a bunch of LLM-architecture specific tasks (stuff like 'implement an incremental KV-cache for autoregressive transformer inference' or 'implement FlashAttention backward pass with D-optimization') and analyzed the results: who wrote tests, are the tests valid, does the implementation actually work or does the model only claim it does, that sort of thing.
It is a day's or a weekend's worth of work, but I think this is the best way to determine if the model fits your needs specifically. This is what led me to find out that qwen 27b outperformed even kimi on those tasks, and that opus tries gaslighting me when I give it a spec of something that has been proven but for which no published solution exists online. All other models gave their best shot at solving it; opus just said it's not possible (even when I gave it the finished working product that obviously works).
Especially for small models (but also big ones), I think the only way to know if a model will improve your workflow is this: personal benchmarks, expanded over time, run in private.
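A personal benchmark like this does not need much machinery. Below is a minimal sketch, assuming each model is wrapped as a plain function from prompt to answer and each task carries its own pass/fail check. The `Task` class, `run_benchmark`, and the stub models in the usage note are hypothetical names for illustration; real wrappers would call whatever local server or API you use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the answer is acceptable


def run_benchmark(models: Dict[str, Callable[[str], str]],
                  tasks: List[Task]) -> Dict[str, float]:
    """Run every task against every model and return the pass rate per model."""
    results = {}
    for model_name, generate in models.items():
        passed = 0
        for task in tasks:
            try:
                if task.check(generate(task.prompt)):
                    passed += 1
            except Exception:
                # A crash in the model wrapper or the check counts as a failure.
                pass
        results[model_name] = passed / len(tasks) if tasks else 0.0
    return results
```

Usage is a dict of wrappers plus a list of tasks, for example `run_benchmark({"stub": lambda p: "4"}, [Task("add", "2+2?", lambda a: a.strip() == "4")])`. For coding tasks the `check` would typically run the generated code against your own tests instead of trusting the model's claim that it works.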
There I was waiting on a smaller version of Qwen 3.6 to drop so I can run it on my Mac, and then bam, they drop this.
Today I learned Meta's new model is preferred to everything but claude. That is .. a real surprise! Congrats to the Meta team.
I love that open weight models are catching up so quickly. Also hilarious how far behind Grok is. I guess demand for Grok must be poor if Anthropic is able to rent resources from xAI.
To play devil's advocate I do feel like Grok has a unique "feel" to it. All the Chinese models feel like GPT or Claude distillations, but Grok has a certain unique way of saying and doing things. But that said, it also feels a year behind the state of the art.
Where's Grok 4.3 on the leaderboard?
88th
lmao at opus 4.7 being a downgrade