Hands on: How much can reinforcement learning – and a bit of extra verification – improve large language models, aka LLMs? Alibaba’s Qwen team aims to find out with its latest release, QwQ.
Despite having a fraction of DeepSeek R1’s claimed 671 billion parameters, Alibaba touts its comparatively compact 32-billion-parameter “reasoning” model as outperforming R1 in select math, coding, and function-calling benchmarks.
Much like R1, the Qwen team fine-tuned QwQ using reinforcement learning to improve its chain-of-thought reasoning for problem analysis and breakdown. This approach typically reinforces stepwise reasoning by rewarding models for correct answers, encouraging more accurate responses. However, for QwQ, the team also integrated a so-called accuracy verifier and a code execution server to ensure rewards were given only for correct math solutions and functional code.
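To make that concrete, here’s a toy sketch of what verifier-gated rewards look like in principle – emphatically not Alibaba’s actual pipeline, and the function names are our own invention – in which a completion only earns a reward when an external checker, an answer matcher or a code runner, signs off on it:

import subprocess
import tempfile

# Toy illustration of verifier-gated rewards: the model is rewarded only when an external
# checker confirms its output, not merely for producing plausible-looking reasoning.

def math_reward(completion: str, expected_answer: str) -> float:
    """Reward 1.0 only if the final line of the completion contains the ground-truth answer."""
    final_line = completion.strip().splitlines()[-1]
    return 1.0 if expected_answer in final_line else 0.0

def code_reward(completion: str, test_code: str) -> float:
    """Reward 1.0 only if the generated code actually runs and its accompanying tests pass."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0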
The result, the Qwen team claims, is a model that punches far above its weight class, achieving performance on par with and, in some cases, edging out far larger models.
However, AI benchmarks aren’t always what they seem to be. So, let’s take a look at how these claims hold up in the real world, and then we’ll show you how to get QwQ up and running so you can test it out for yourself.
How does it stack up?
We ran QwQ through a slate of test prompts ranging from general knowledge to spatial reasoning, problem solving, mathematics, and other questions known to trip up even the best LLMs.
Because the full model requires substantial memory, we ran our tests in two configurations to cater to those of you who have a lot of RAM and those of you who don’t. First, we evaluated the full model using the QwQ demo on Hugging Face. Then, we tested a 4-bit quantized version on a 24 GB GPU (Nvidia 3090 or AMD Radeon RX 7900XTX) to assess the impact of quantization on accuracy.
For most general knowledge questions, we found that QwQ performed similarly to DeepSeek’s 671-billion-parameter R1 and other reasoning models like OpenAI’s o3-mini, spending a few seconds composing its thoughts before spitting out the answer to the query.
Where the model stands out, perhaps unsurprisingly, is when it’s tasked with solving more complex logic, coding, or mathematics challenges, so we’ll focus on those before addressing some of its weak points.
Spatial reasoning
For fun, we decided to start with a relatively new spatial-reasoning test developed by the folks at Homebrew Research as part of their AlphaMaze project.
The test presents the model with a maze in the form of a text prompt, like the one below. The model’s objective is then to navigate from the origin “O” to the target “T.”
You are a helpful assistant that solves mazes. You will be given a maze represented by a series of tokens. The tokens represent:
- Coordinates: <|row-col|> (e.g., <|0-0|>, <|2-4|>)
- Walls: <|no_wall|>, <|up_wall|>, <|down_wall|>, <|left_wall|>, <|right_wall|>, <|up_down_wall|>, etc.
- Origin: <|origin|>
- Target: <|target|>
- Movement: <|up|>, <|down|>, <|left|>, <|right|>, <|blank|>

Your task is to output the sequence of movements (<|up|>, <|down|>, <|left|>, <|right|>) required to navigate from the origin to the target, based on the provided maze representation. Think step by step. At each step, predict only the next movement token. Output only the move tokens, separated by spaces.

MAZE:
<|0-0|><|up_left_right_wall|><|blank|><|0-1|><|up_down_left_wall|><|blank|><|0-2|><|up_down_wall|><|blank|><|0-3|><|up_wall|><|blank|><|0-4|><|up_right_wall|><|blank|>
<|1-0|><|down_left_wall|><|blank|><|1-1|><|up_right_wall|><|blank|><|1-2|><|up_left_wall|><|blank|><|1-3|><|down_right_wall|><|target|><|1-4|><|down_left_right_wall|><|blank|>
<|2-0|><|up_left_right_wall|><|blank|><|2-1|><|left_right_wall|><|blank|><|2-2|><|down_left_wall|><|blank|><|2-3|><|up_down_wall|><|blank|><|2-4|><|up_right_wall|><|blank|>
<|3-0|><|left_right_wall|><|blank|><|3-1|><|down_left_wall|><|origin|><|3-2|><|up_down_wall|><|blank|><|3-3|><|up_down_wall|><|blank|><|3-4|><|right_wall|><|blank|>
<|4-0|><|down_left_wall|><|blank|><|4-1|><|up_down_wall|><|blank|><|4-2|><|up_down_wall|><|blank|><|4-3|><|up_down_wall|><|blank|><|4-4|><|down_right_wall|><|blank|>
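If you’d like to throw the same prompt at your own local copy of QwQ, a minimal sketch along these lines will do the job. It assumes the Ollama deployment we walk through at the end of this article, listening on its default port with the qwq model already pulled:

import requests

# Paste the full maze prompt shown above into this string
maze_prompt = """You are a helpful assistant that solves mazes. ..."""

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwq", "prompt": maze_prompt, "stream": False},
    timeout=1200,  # reasoning models can take several minutes on these puzzles
)
print(resp.json()["response"])  # the model's full output, "thoughts" and move tokens included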
Both our locally hosted QwQ instance and the full-sized model were able to solve these puzzles successfully every time, though each run did take a few minutes to finish.
The same couldn’t be said of DeepSeek’s R1 and its 32B distill. Both models were able to solve the first maze, but R1 struggled to complete the second, while the 32B distill solved it correctly nine times out of ten. This level of variation isn’t too surprising considering R1 and the distill use completely different base models.
While QwQ outperformed DeepSeek in this test, we did observe some strange behavior with our 4-bit model, which required nearly twice as many “thought” tokens to complete the test. At first, it looked as though this might be due to quantization-related losses – a challenge we explored here. But, as it turned out, the quantized model was just broken out of the box. After adjusting the hyperparameters – don’t worry, we’ll show you how to fix those in a bit – and rerunning the tests, the problem disappeared.
A one-shot code champ?
Since its launch, QwQ has garnered a lot of interest from netizens curious as to whether the model can generate usable code on the first attempt in a so-called one-shot test. And this particular challenge certainly seems to be a bright spot for the model.
We asked the model to recreate a number of relatively simple games, namely Pong, Breakout, Asteroids, and Flappy Bird, in Python using the pygame library.
Pong and Breakout weren’t much of a challenge for QwQ. After a few minutes of work, the model spat out working versions of each.

In our testing, QwQ was able to recreate classic arcade games like Breakout in a single shot with relative ease
Tasked with recreating Asteroids, however, QwQ fell on its face. While the code ran, both the graphics and game mechanics were frequently distorted and buggy. By comparison, on its first attempt, R1 faithfully recreated the classic arcade shooter.
Some folks have even managed to get R1 and QwQ to one-shot code a minimalist version of Flappy Bird, which we can confirm also worked without issue. If you’re interested, you can find the prompt we tested here.
It has occurred to us that these models were trained on a huge set of openly available source code, which no doubt included reproductions of classic games. Aren’t the models therefore just remembering what they learned during training rather than independently figuring out game mechanics from scratch? That’s the whole illusion of these massive neural networks.
At least when it comes to recreating classic arcade games, QwQ performs well beyond what its parameter count might suggest, even if it can’t match R1 in every test. To borrow a phrase from the automotive world, there’s no replacement for displacement. This might explain why Alibaba isn’t stopping with QwQ 32B and has a “Max” version in the works. Not that we expect to be running that locally anytime soon.
With all that said, Alibaba’s decision to integrate a code execution server into its reinforcement learning pipeline may explain why QwQ has an edge over DeepSeek’s similarly sized R1 Qwen 2.5 32B distill in programming-related challenges.
Can it do math? Sure, but please don’t
Historically, LLMs have been really bad at mathematics – unsurprising given their language-focused training. While newer models have improved, QwQ still faces challenges, but not for the reasons you might think.
QwQ was able to solve all of the mathematics problems we threw at R1 in our earlier deep dive. So, QwQ can handle basic arithmetic and even some algebra – it just takes forever to do it. Asking an LLM to do math seems bonkers to us; calculators and direct computation still work in 2025.
For example, to solve a simple equation like what is 7*43?, QwQ required generating more than 1,000 tokens over about 23 seconds on an RTX 3090 Ti – all for a problem that would have taken less time to punch into a pocket calculator than to type the prompt.
And the inefficiency doesn’t stop there. To solve 3394*35979, a far more challenging multiplication problem beyond the capabilities of most non-reasoning models we’ve tested, our local instance of QwQ needed three minutes and more than 5,000 tokens to arrive at an answer.
That’s when it’s configured correctly. Before we applied the hyperparameter fix, that same equation needed nine minutes and nearly 12,000 tokens to solve.
The takeaway here is that just because a model can brute force its way to the right answer doesn’t mean it’s the right tool for the job. Instead, we recommend giving QwQ access to a Python calculator. If you’re new to LLM function calling, check out our guide here.
Tasked with solving the same 3394*35979 equation using tooling, QwQ’s response time dropped to eight seconds as the calculator handled all the heavy lifting.
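For the curious, here’s a minimal sketch of what that looks like using Ollama’s OpenAI-compatible endpoint and a single, hypothetical calculate tool. It’s deliberately simplified – a real setup would feed the tool’s result back to the model for a final answer – and it assumes the Ollama deployment described below:

import json
from openai import OpenAI

# Assumes Ollama is serving its OpenAI-compatible API on the default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# A hypothetical calculator tool the model can call instead of grinding through the arithmetic in tokens
tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression, e.g. '3394*35979'",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwq",
    messages=[{"role": "user", "content": "What is 3394*35979?"}],
    tools=tools,
)

message = resp.choices[0].message
if message.tool_calls:
    # The model asked for the calculator, so do the maths in Python rather than in tokens
    expression = json.loads(message.tool_calls[0].function.arguments)["expression"]
    print(eval(expression, {"__builtins__": {}}, {}))  # fine for a demo; never eval untrusted input
else:
    print(message.content)  # the model answered without using the tool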
But ‘wait’…
If you wade through QwQ’s “thoughts,” you’re bound to run into the word “wait” a lot, particularly on complex tasks or word problems, as the model checks its work against alternative outcomes.
This kind of behavior is common for reasoning models, but it’s particularly frustrating when QwQ generates the wrong answer – even after demonstrating that it recognized the correct one during its “thought” process.
We ran into this problem a fair bit with QwQ. However, one of the prompts that demonstrated it most clearly was AutoGen AI’s take on the wolf, goat, and cabbage problem – a spin on the classic transportation optimization puzzle.
The trick is that the answer is embedded in the prompt. With three secure compartments, the farmer can transport both animals and his produce in a single trip. But, because it mirrors the classic puzzle so closely, models often overlook the compartments.
In our testing, QwQ consistently got this puzzle wrong, and peering into its thought process showed it wasn’t because the model overlooked the three compartments. In fact, it acknowledged them, but decided that would be too easy:
Wait, if the farmer can take all three in one trip, then he can just do that and be done. But that would make the problem trivial, which is unlikely. So perhaps the compartments are separate but the boat can only carry two items plus the farmer?
Regardless of whether we ran this test on the full model in the cloud or locally on our machine, QwQ just couldn’t solve this consistently.
Hypersensitive hyperparameters
Compared to other models we’ve tested, we found QwQ to be particularly twitchy when it comes to its configuration. Initially, Alibaba recommended setting the following sampling parameters:
- Temperature: 0.6
- TopP: 0.95
- TopK: between 20 and 40
Since then, it’s updated its recommendations to also set:
- MinP: 0
- Presence Penalty: between 0 and 2
Due to what appears to be a bug in Llama.cpp’s handling of sampling parameters – we use Llama.cpp for running inference on models – we found it was also necessary to disable the repeat penalty by setting it to 1.
As we mentioned earlier, the results were pretty dramatic – more than halving the number of “thinking” tokens to arrive at an answer. However, this bug appears to be specific to GGUF-quantized versions of the model when running on the Llama.cpp inference engine, which is used by popular apps such as Ollama and LM Studio.
If you plan to use Llama.cpp, we recommend checking out Unsloth’s guide to correcting the sampling order.
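For what it’s worth, if you’re driving a GGUF build of QwQ from Python via the llama-cpp-python bindings rather than llama.cpp’s CLI, the same knobs are exposed as keyword arguments. The sketch below shows where they live – the model filename is a placeholder, and this doesn’t by itself address the sampler-ordering issue covered in Unsloth’s guide:

from llama_cpp import Llama

# The model path is a placeholder - point it at whichever 4-bit GGUF of QwQ you downloaded
llm = Llama(model_path="qwq-32b-q4_k_m.gguf", n_ctx=10240, n_gpu_layers=-1)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 7*43?"}],
    temperature=0.6,
    top_p=0.95,
    top_k=40,
    min_p=0.0,
    repeat_penalty=1.0,  # effectively disables the repeat penalty, per the workaround above
)
print(resp["choices"][0]["message"]["content"])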
Try it for yourself
If you’d like to try out QwQ for yourself, it’s fairly easy to get up and running in Ollama. Unfortunately, it does require a GPU with a fair bit of vRAM. We managed to get the model running on a 24 GB 3090 Ti with a large enough context window to be useful.
Technically speaking, you could run the model on your CPU and system memory, but unless you’ve got a high-end workstation or server lying around, there’s a good chance you’ll end up waiting half an hour or more for it to respond.
Prerequisites
- You’ll need a machine that’s capable of running medium-sized LLMs at 4-bit quantization. For this, we recommend a compatible GPU with at least 24 GB of vRAM. You can find a full list of supported cards here.
- For Apple Silicon Macs, we recommend one with at least 32 GB of memory.
This guide also assumes some familiarity with a Linux-world command-line interface as well as Ollama. If this is your first time using the latter, you can find our guide here.
Installing Ollama
Ollama is a popular model runner that provides an easy method for downloading and serving LLMs on consumer hardware. For those running Windows or macOS, head over to ollama.com and download and install it like any other application.
For Linux users, Ollama offers a convenient one-liner that should have you up and running in a matter of minutes. Alternatively, Ollama provides manual installation instructions for those who don’t want to run shell scripts straight from the source, which can be found here.
That one-liner to install Ollama on Linux is:
curl -fsSL https://ollama.com/install.sh | sh
Deploying QwQ
In order to deploy QwQ without running out of memory on our 24 GB card, we need to launch Ollama with a couple of additional flags. Start by closing Ollama if it’s already running. For Mac and Windows users, this is as simple as right clicking on the Ollama icon in the taskbar and clicking close.
Those running systemd-based operating systems, such as Ubuntu, should terminate it by running:
sudo systemctl stop ollama
From there, spin Ollama back up with our special flags by running the following command:
OLLAMA_FLASH_ATTENTION=true OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve
In case you’re wondering, OLLAMA_FLASH_ATTENTION=true enables a technology called Flash Attention for supported GPUs that helps to keep memory consumption in check when using large context windows. OLLAMA_KV_CACHE_TYPE=q4_0, meanwhile, compresses the key-value cache used to store our context to 4 bits.
Together, these flags should allow us to fit QwQ along with a decent-sized context window into less than 24 GB of vRAM.
Next, we’ll open a second terminal window and pull down our model by running the command below. Depending on the speed of your internet connection, this could take a few minutes, as the model files are around 20 GB in size.
ollama pull qwq
At this point, we’d normally be able to run the model in our terminal. Unfortunately, at the time of writing, only the model’s temperature parameter has been set correctly.
To resolve this, we’ll create a custom version of the model with a few tweaks that appear to correct the issue. To do this, create a new file in your home directory called Modelfile and paste in the following:
FROM qwq
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1
PARAMETER top_k 40
PARAMETER top_p 0.95
PARAMETER min_p 0
PARAMETER num_ctx 10240
This will configure QwQ to use the recommended sampling parameters and tell Ollama to run the model with a 10,240-token context window by default. If you run into issues with Ollama offloading part of the model onto the CPU, you may need to reduce this to 8,192 tokens.
A word on context length
If you’re not familiar with it, you can think of a model’s context window a bit like its short-term memory. Set it too low, and eventually the model will start forgetting details. This is problematic for reasoning models, as their “thought” or “reasoning” tokens can burn through the context window pretty quickly.
To remedy this, QwQ supports a fairly large 131,072 (128K) token context window. Unfortunately for anyone interested in running the model at home, you probably don’t have enough memory to get anywhere close to that.
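To put some rough numbers on that, here’s a back-of-the-envelope calculation. It assumes QwQ shares the published Qwen2.5-32B architecture – 64 transformer layers with eight key-value heads of dimension 128 – so treat the figures as ballpark estimates rather than gospel:

# Rough KV cache sizing, assuming 64 layers, 8 KV heads, and a head dimension of 128
layers, kv_heads, head_dim = 64, 8, 128
ctx = 131_072  # QwQ's full 128K context window

bytes_per_token = 2 * layers * kv_heads * head_dim * 2  # keys + values at 2 bytes (fp16) each
print(f"fp16 KV cache at 128K: ~{bytes_per_token * ctx / 2**30:.0f} GiB")      # roughly 32 GiB
print(f"q4_0 KV cache at 128K: ~{bytes_per_token * ctx / 4 / 2**30:.0f} GiB")  # roughly a quarter of that

Even with the cache squeezed down to 4 bits, that’s on top of roughly 20 GB of model weights – which is why we settle for a far more modest 10,240-token window on a 24 GB card.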
Still in your home directory, run the following command to generate a new model with the fixes applied. We’ll name the new model qwq-fixed:
ollama create qwq-fixed
We’ll then test it’s working by loading up the model and chatting with it in the terminal:
ollama run qwq-fixed
If you’d like to tweak any of the hyperparameters we set earlier – for example, the context length or top_k settings – you can do so after the model has loaded by entering /set parameter followed by the parameter’s name and new value, like so:
/set parameter top_k 40
Finally, if you’d prefer to use QwQ in a more ChatGPT-style user interface, we recommend checking out our retrieval-augmented generation guide, which will walk you through the process of deploying a model using Open WebUI in a Docker Container.
The Register aims to bring you more on using LLMs and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we’d love to hear about them in the comments section below. ®
Editor’s note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this. None of these vendors had any input as to the content of this or other articles.