Qwen 3.7 Ollama: Setup, Config & Honest Review (2026)

Q: How do I increase the Qwen 3.6 context window in Ollama?

Create a Modelfile with PARAMETER num_ctx 32768 (or up to 262144 for the full 256K that Qwen 3.6 supports). Then build a custom model with ollama create mymodel -f mymodelfile. Ollama defaults to a small context window regardless of what the model supports, so this step is required for any serious agent work.

If you searched "qwen 3.7 ollama" hoping to pull it and run it locally, here is the version of this post that does not waste your time with fake commands. Qwen 3.7 Max is an API-only model. No open weights exist. You cannot pull it from Ollama. Every guide giving you an ollama pull qwen3.7 command is making things up.

Run Qwen without managing Ollama.
Connect Qwen 3.7 Max by API or your own local endpoint, get the visual builder and 200+ verified skills. BYOK across 28+ providers, zero markup. Free forever, not a trial. Start free → No credit card · No Docker · No config files

This post covers what actually works. How to access Qwen 3.7 Max through the API. How to run Qwen 3.6 locally on Ollama as the best open-weight Qwen available. How to configure either option for agent workflows with copy-paste commands. And an honest review of what these models are genuinely good at and where they fall short.

Everything here has been verified against the Ollama library and Alibaba's documentation as of June 2026.

What Qwen 3.7 Is

Qwen 3.7 Max is a proprietary flagship model from Alibaba's Qwen team, announced May 20, 2026 at the Alibaba Cloud Summit in Hangzhou. The API rolled out one day earlier on May 19. It is the latest entry in a monthly release cadence: Qwen 3.5 in March, Qwen 3.6 in April, Qwen 3.7 in May.

What changed from Qwen 3.6 is significant across every axis. The context window went from 256K to 1 million tokens. Reasoning benchmarks jumped hard: CritPt went from 3.7% to 13.4% (almost a 4x improvement), Humanity's Last Exam climbed from 28.9% to 38.1%, and Terminal-Bench Hard crossed the 50% threshold for the first time at 50.8%. The model natively supports the Anthropic API protocol, which means tools built for Claude work directly with Qwen 3.7 without any adapter or translation layer.

Why this matters for agent builders specifically: Qwen 3.7 Max is designed for tasks that run for hours and involve hundreds or thousands of tool calls without losing context. Alibaba demonstrated a 35-hour autonomous kernel optimization run with over 1,158 tool calls, finishing with a 10x speedup over the reference implementation. The model scored 92.4% on GPQA Diamond (beating Claude Opus 4.6's 91.3%), 69.7 on Terminal-Bench 2.0, 60.6 on SWE-Bench Pro, and 76.4 on MCP-Atlas for tool use.

The pricing is aggressive: $2.50 per million input tokens and $7.50 per million output tokens through Alibaba Cloud Model Studio. For comparison, Claude Opus 4.7 costs $15/$75 and GPT-5.5 costs $10/$30. Qwen 3.7 Max delivers frontier-adjacent performance at a fraction of the price.

But none of that helps you if you want to run it locally. Which brings us to the honest answer.

Can You Run Qwen 3.7 on Ollama?

No. Not as of June 2026.

Qwen 3.7 Max is API-only in the cloud while Qwen 3.6 runs locally via Ollama as the best local alternative, hand-drawn pastel style

The Ollama library at ollama.com/library does not contain any Qwen 3.7 model. Running ollama pull qwen3.7 will fail with a "model not found" error. The HuggingFace Qwen organization at huggingface.co/Qwen shows only Qwen 3.6 and earlier weights. Direct checks of every plausible model card path return nothing.

Here is a quick reference for what Qwen models are actually available on Ollama right now:

qwen3.6 with 27B dense and 35B-A3B MoE variants, 256K context (the latest and best open-weight Qwen)
qwen3.5 with variants from 0.8B to 122B, 256K context, multimodal (text + image)
qwen3 (the original April 2025 family) with variants from 0.6B to 235B

Qwen 3.6 is the model you should run locally right now. It is strong, it is available, and it handles agent workflows well. We will set it up in the sections below.

Alibaba follows a consistent pattern: ship the API first, release open weights 3 to 4 weeks later. Qwen 3.6 API launched late March 2026, open weights dropped mid-April. By that pattern, Qwen 3.7 open weights could appear sometime in late June or July 2026. We will update this post when they land.

Hardware Requirements Table

Since Qwen 3.7 is not available locally, this table covers Qwen 3.6, the model you should actually be running. It gives you the real numbers instead of made-up specs for a model that does not exist on Ollama.

RAM	Quantization / Variant	Size on Disk	Speed Estimate	Pull Command	Verdict
8GB	`qwen3.5:4b` (fallback)	~3.4GB	15-25 tok/s	`ollama pull qwen3.5:4b`	Works well for simple tasks on limited hardware
16GB	`qwen3.6:35b-a3b-q4_K_M` (MoE)	~23GB	8-15 tok/s CPU, 20+ GPU	`ollama pull qwen3.6:35b-a3b-q4_K_M`	Sweet spot for personal agent use. Only 3B active params.
16GB	`qwen3.6` (27B dense, Q4)	~17GB	5-10 tok/s CPU	`ollama pull qwen3.6`	Good general option, tighter fit than MoE
24GB	`qwen3.6:27b-q8_0`	~30GB	20+ tok/s GPU	`ollama pull qwen3.6:27b-q8_0`	Full quality, fast generation with dedicated GPU
32GB+	`qwen3.6:27b-mtp-bf16`	~56GB	30+ tok/s	`ollama pull qwen3.6:27b-mtp-bf16`	Maximum quality with multi-token prediction

GPU VRAM requirements (separate from system RAM):

6GB VRAM: Minimum for partial offload of the 35B-A3B MoE variant at Q4. Some layers stay in system RAM.
12GB VRAM: Full offload of the 35B-A3B MoE model. All active parameters fit in GPU memory.
24GB VRAM: Full 27B dense model at Q8 quantization. Room for concurrent requests if you manage context carefully.

The 35B-A3B MoE variant is the smart pick for most people. It has 35 billion total parameters but only activates 3 billion per token thanks to the Mixture-of-Experts architecture. You get code quality approaching the full 27B dense model but at dramatically lower compute cost per forward pass. Ollama still loads all 35 billion parameters into memory, so the disk and RAM footprint matches a 35B model, but inference speed benefits from only running 3B active parameters. For a deeper look at what each hardware tier buys you, see our local model hardware guide.

RAM-based model speed and performance guide from 8 GB to 32 GB across qwen3.6, gemma4, llama3.3, mistral-small, and deepseek-r1, hand-drawn pastel style

Install and First Run

Three commands to get from nothing to a working local LLM:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3.6:35b-a3b-q4_K_M
ollama run qwen3.6:35b-a3b-q4_K_M "Read this error log and tell me what's broken: FATAL: could not open file 'pg_hba.conf': No such file or directory"

That third command does something useful immediately. Not "hello world." A real debugging prompt that shows the model working on an actual problem you might encounter. The model will explain that PostgreSQL cannot find its client authentication configuration file, suggest checking the data directory path, and offer a fix.

If you want the smaller 27B dense default:

ollama pull qwen3.6
ollama run qwen3.6 "Parse this JSON and extract all email addresses, return them as a flat array: {\"users\": [{\"name\": \"alice\", \"email\": \"alice@example.com\"}, {\"name\": \"bob\", \"contact\": \"bob@test.org\"}]}"

Mac users: Replace the first line with brew install ollama or download the app from ollama.com/download. Everything else stays the same.

Windows users: Download the installer from ollama.com/download. After installation, the commands work in PowerShell or Command Prompt.

Verifying GPU detection: After installation, run ollama ps while a model is loaded. If you see a GPU listed with VRAM usage, inference is hardware-accelerated. If you only see CPU, check that your NVIDIA CUDA drivers or AMD ROCm drivers are installed and up to date. On NVIDIA, run nvidia-smi to confirm your GPU is visible to the system.

The Modelfile Config

This is the bookmark section. The one you come back to when you need to tune the model for your specific workflow.

Create a file called qwen36-agent.modelfile with these contents:

FROM qwen3.6:35b-a3b-q4_K_M
PARAMETER num_ctx 32768
PARAMETER num_predict 4096
PARAMETER temperature 0.7

What each parameter does in one sentence:

num_ctx: Ollama's default context window is 2048 or 4096, which is way too small for agent tasks where you need multi-turn conversation history, file contents, and tool call results in context. 32768 gives room for serious work without consuming excessive RAM.

num_predict: Caps output length at 4096 tokens per response. This prevents runaway generation where the model keeps going until it hits the context limit. Increase to 8192 for long-form code generation. Decrease to 2048 for fast classification or extraction tasks.

temperature: 0.7 is the sweet spot for general agent use. Drop to 0.3 for classification, data extraction, and structured JSON output where you want deterministic, repeatable results. Push to 0.9 for creative writing or brainstorming where variety matters.

Build and run the custom model:

ollama create qwen36-agent -f qwen36-agent.modelfile
ollama run qwen36-agent "You are a code reviewer. Review this function for bugs and security issues: def login(user, pwd): return db.query(f'SELECT * FROM users WHERE name={user} AND pass={pwd}')"

The model will correctly identify the SQL injection vulnerability, the lack of password hashing, and the missing input validation. That is the kind of structured analysis that makes Qwen 3.6 useful for real development workflows. (If your context keeps filling up and truncating, see why in our context window guide.)

The three Modelfile parameters that matter: num_ctx for context window size, num_predict for output length, and temperature for creativity, hand-drawn pastel style

Thinking Mode vs Non-Thinking Mode

Qwen 3.6 (and the Qwen 3.5 series before it) supports toggling between thinking and non-thinking modes. This is a practical feature that changes how the model allocates compute.

Thinking mode: The model generates an internal chain of thought before producing the final answer. It reasons step by step, considers alternatives, and checks its work. This is slower because the model outputs thinking tokens before the actual response, but the accuracy improvement on complex tasks is significant.

When to use it: multi-step reasoning, debugging complex code, writing SQL queries with joins, any task where getting the answer right matters more than getting it fast.

Example:

/think
Given a PostgreSQL database with tables users(id, email, plan), orders(id, user_id, amount, created_at), and refunds(id, order_id, amount, reason), write a query that finds all Pro plan users who had more than $500 in orders last month but also had at least one refund. Include the total order amount and total refund amount per user.

Non-thinking mode: Direct answer without the chain-of-thought overhead. The model skips the reasoning tokens and jumps straight to the output. On the same hardware, this can be 2x to 3x faster.

When to use it: email triage, simple summarization, JSON extraction, classification, and any task where the answer is straightforward and speed matters.

Example:

/no_think
Classify this support ticket as billing, technical, or general: "I can't log into my account after changing my password yesterday"

The speed difference is real and measurable. A complex coding question might take 15 to 20 seconds in thinking mode but produce a correct, well-structured answer. The same question in non-thinking mode might return in 5 to 8 seconds but miss edge cases or produce less thorough output.

If you want dedicated model checkpoints that are specifically trained for one mode, Ollama has the 2507 split variants for the Qwen 3 family (these are older than Qwen 3.6 but demonstrate the concept):

# Fast, no chain-of-thought, optimized for direct answers
ollama pull qwen3:30b-a3b-instruct-2507

# Deep reasoner, optimized for step-by-step thinking
ollama pull qwen3:30b-a3b-thinking-2507

Qwen recommends temperature 0.7 with top_p 0.8 for the instruct variant and temperature 0.6 with top_p 0.95 for the thinking variant.

What Qwen 3.6 Is Good At (Honest Tier List)

No model is good at everything. Here is an honest assessment based on community testing and benchmark data, not marketing material.

S tier (genuinely excellent):

Agentic coding: writing, editing, and debugging code across multi-file projects
Tool calling: reliable function calling with structured JSON output
Structured output: generating valid JSON, XML, YAML consistently
Following complex multi-step instructions: the model tracks requirements well

A tier (strong, competitive with larger models):

Math and logic puzzles
Data extraction from unstructured text
Classification tasks
Repository-level code understanding with the 256K context window

B tier (usable, not best-in-class):

Summarization (competent but not as nuanced as Claude or GPT)
Chinese language tasks (strong heritage but Qwen 3.7 Max is significantly better)
Agent workflows with multiple tool handoffs
Translation between languages

C tier (works but competitors do it better):

Creative writing (tends to get repetitive after a few paragraphs, falls into patterns)
Nuanced tone matching (struggles with subtle voice differences)
Very long-form generation (quality degrades past 3000 to 4000 words)
Open-ended conversation (can feel mechanical compared to models optimized for chat)

Where specific competitors beat Qwen 3.6: Gemma 4 31B is stronger for general conversation and creative tasks on similar hardware. Claude Sonnet 4.6 is better at nuanced instruction following and produces more natural prose. DeepSeek V4 Flash is cheaper via API ($0.14/$0.28 per million tokens) for high-volume batch processing where you do not need local inference.

Where Qwen 3.6 beats them: On pure coding benchmarks for an open-weight model you can run locally, Qwen 3.6 is the strongest option in its class. The MoE architecture gives you near-frontier coding quality at a fraction of the compute cost. And the 256K context window (when configured properly in the Modelfile) is among the largest for local models.

Task performance tier list placing coding and reasoning in S tier, tool calling in A tier, tool use in B tier, and creative writing in C tier, hand-drawn pastel style

Quick Comparison Table

For local models in the same weight class, here is how Qwen 3.6 stacks up against the main alternatives:

	Qwen 3.6 35B-A3B	Gemma 4 26B-A4B	Qwen 3 30B-A3B (2507)	DeepSeek V4 Flash (API)
Best at	Coding, tool calling	General chat, multimodal	Math, reasoning	Cost-efficiency at scale
Active params per token	3B	4B	3B	13B
Context window	256K	256K	256K	1M
Speed (relative)	Fast (MoE)	Fast (MoE)	Fast (MoE)	API-dependent
Agent suitability	Excellent	Good	Good	Good
Native tool calling	Yes	Yes	Yes	Yes
Multimodal	Yes (text + image)	Yes (text + image)	Yes (text + image)	Text only
License	Apache 2.0	Apache 2.0	Apache 2.0	MIT
Disk size (Q4)	~23GB	~14GB (26B)	~20GB	N/A (API)

We go deeper on the open-weight matchup in our Gemma 4 27B vs Qwen 3.6 27B breakdown.

If you need the frontier Qwen experience and API pricing is acceptable, here is the API comparison:

	Qwen 3.7 Max (API)	Claude Sonnet 4.6 (API)	GLM 5.2 (API)	DeepSeek V4 Pro (API)
Context window	1M tokens	200K standard, 1M beta	1M tokens	1M tokens
Input price per 1M	$2.50	$3.00	$1.40	$0.435
Output price per 1M	$7.50	$15.00	$4.40	$0.87
Open weights	No	No	Yes (MIT)	Yes (MIT)
Best for	Long-horizon agents	General purpose, computer use	Coding on a budget	Cheapest frontier-class

Connecting to Agent Platforms

Three setups, each described in the minimum words needed to get it working.

Qwen 3.6 + Ollama + OpenClaw/Hermes: Set the model to qwen36-agent (or whatever you named your Modelfile build) in your agent configuration file. Set the API endpoint to http://localhost:11434. The Ollama API is OpenAI-compatible, so any tool that works with the OpenAI chat completions format works here without changes. For the full local-model walkthrough, see our OpenClaw Ollama guide.

Qwen 3.6 + Ollama + n8n: Add an Ollama node in your n8n workflow. Set the base URL to http://host.docker.internal:11434 if n8n runs in Docker, or http://localhost:11434 if it runs natively on the same machine. Select your model name from the dropdown.

Qwen 3.7 Max via BetterClaw BYOK: Paste your OpenRouter API key or Alibaba Cloud Model Studio key in the BYOK settings. Select the Qwen 3.7 Max model. Your agent is live. BetterClaw handles model routing across 28+ providers. You pick the one that fits your workload.

Task performance tier list placing coding and reasoning in S tier, tool calling in A tier, tool use in B tier, and creative writing in C tier, hand-drawn pastel style

Common Errors and Fixes

"model not found" when pulling: You are most likely trying ollama pull qwen3.7, which does not exist. Use ollama pull qwen3.6 or ollama pull qwen3.6:35b-a3b-q4_K_M. If you get this error on a model that should exist, run ollama update first to make sure your Ollama installation is current, then try the pull again.

"out of memory" during generation: Your model variant is too large for your available RAM. Options: switch to the 35B-A3B MoE variant (uses less compute per token despite higher total parameters), drop to a smaller model like qwen3.5:9b, or close other applications to free system memory. See the hardware table above for the right variant for your setup.

"slow generation" or output appears to freeze: This usually means Ollama fell back to CPU-only inference because it could not detect your GPU. Run ollama ps while a model is loaded. If no GPU shows up, check your drivers. For NVIDIA: run nvidia-smi. For AMD: verify ROCm is installed. On Mac with Apple Silicon, GPU acceleration should work automatically through Metal.

Responses cutting off mid-sentence: Your num_ctx or num_predict values are too low. The default context in Ollama is often 2048 or 4096, which fills up fast in multi-turn conversations. Create a Modelfile with PARAMETER num_predict 8192 for longer outputs, or increase num_ctx to 32768 or higher if the model is losing context from earlier turns.

First response after creating a custom model is very slow: This is normal. Building a custom model from a Modelfile triggers a cold load of the base weights. The first inference takes extra time. Subsequent runs load from cache and respond at normal speed.

For a more detailed troubleshooting walkthrough, especially for the "response truncated" issue, see our Hermes response truncation fix guide.

What Happens When Qwen 3.7 Open Weights Drop

Timeline showing Qwen 3.7 API launch in May 2026, now in June 2026, and expected open weights in late June or July 2026, hand-drawn pastel style

Alibaba's release pattern is predictable. When the open-weight Qwen 3.7 variants appear, here is how to know they are real and ready:

A repository appears at huggingface.co/Qwen with a model card, license text, and actual weight files (not just a placeholder)
Ollama adds a qwen3.7 entry in its library with proper tags for different quantizations
Community quantizations (GGUF files) start appearing from teams like Unsloth

Until all three happen, treat any "Qwen 3.7 local setup" guide with skepticism. Based on the release pattern, expect an MoE variant around 35B to 40B total parameters that fits on a 24GB GPU at Q4 quantization, with a likely jump in reasoning benchmarks compared to 3.6.

We will update this post with verified pull commands, hardware requirements, and Modelfile configs when the weights become available.

Start Building Agents Today

You do not need to wait for Qwen 3.7 open weights. Qwen 3.6 on Ollama is production-capable for local agent workflows right now. And if you need Qwen 3.7 Max performance immediately, the API is live through Alibaba Cloud Model Studio and OpenRouter.

BetterClaw gives you both paths in one place. Use BYOK to connect your local Ollama instance for complete data privacy. Or paste an OpenRouter API key to access Qwen 3.7 Max, Claude Sonnet 4.6, MiniMax M3, GLM 5.2, DeepSeek V4, and 28+ other providers through a single agent configuration.

Get started with BetterClaw for free. Free plan includes 1 agent with every feature. No credit card required. No time limit. For local Ollama setups alongside MiniMax M3, see our MiniMax M3 and Qwen 3.7 agent guide.

Frequently Asked Questions

How much RAM does Qwen 3.7 need?

Qwen 3.7 Max is API-only and cannot be run locally as of June 2026. For the best local Qwen experience, Qwen 3.6 35B-A3B (MoE) needs about 23GB of disk space and runs well on 16GB of system RAM at Q4_K_M quantization. The full 27B dense model at Q8 quantization needs around 30GB and works best with 24GB or more of system RAM.

Is Qwen 3.7 better than Gemma 4 for coding?

Qwen 3.7 Max scores higher on coding benchmarks than any Gemma 4 variant (60.6% SWE-Bench Pro, 69.7 Terminal-Bench 2.0). But Qwen 3.7 is API-only while Gemma 4 runs locally. For local coding work, Qwen 3.6 35B-A3B and Gemma 4 26B-A4B are the top open-weight contenders. Qwen 3.6 edges ahead on pure coding tasks. Gemma 4 is stronger for general reasoning and multimodal tasks.

How do I increase the Qwen 3.6 context window in Ollama?

Create a Modelfile with PARAMETER num_ctx 32768 (or up to 262144 for the full 256K that Qwen 3.6 supports). Then build a custom model with ollama create mymodel -f mymodelfile. Ollama defaults to a small context window regardless of what the model supports, so this step is required for any serious agent work.

Does Qwen 3.7 support tool calling?

Yes. Qwen 3.7 Max has native tool calling support and scored 76.4 on MCP-Atlas for tool use. It works natively with Claude Code, OpenClaw, Qwen Code, and custom tool-use frameworks. Qwen 3.6 also supports tool calling and function calling, though at a lower benchmark ceiling.

Can I run Qwen 3.7 on 8GB RAM?

No. Qwen 3.7 Max is not available for local use at all. If you have 8GB of RAM and want a Qwen model running locally, your best options are ollama pull qwen3.5:4b (3.4GB on disk, works well for simple tasks) or ollama pull qwen3:8b (about 6GB on disk, stronger but slower on 8GB RAM).

What is Qwen 3.7 thinking mode?

Qwen 3.7 Max has a native extended-thinking mode where the model generates an internal chain of thought before producing a final answer. This is optimized for high-difficulty reasoning, scientific computation, and complex multi-step tasks. The thinking mode is on by default for Max and tunable per request. Qwen 3.6 supports a similar toggle via /think and /no_think commands in Ollama, or through the dedicated 2507 thinking/instruct checkpoints in the Qwen 3 family.

Skip the Ollama setup entirely.
Connect Qwen by API or your own endpoint, deploy a managed agent in 60 seconds. BYOK across 28+ providers, zero markup. Free forever, not a trial. Start free →

Qwen 3.7 + Ollama: Full Setup, Best Config, and Honest Review (2026)

Your agent. Working. Not broken.

Run Qwen without managing Ollama.

What Qwen 3.7 Is

Can You Run Qwen 3.7 on Ollama?

Hardware Requirements Table

Install and First Run

The Modelfile Config

Thinking Mode vs Non-Thinking Mode

What Qwen 3.6 Is Good At (Honest Tier List)

Quick Comparison Table

Connecting to Agent Platforms

Common Errors and Fixes

What Happens When Qwen 3.7 Open Weights Drop

Start Building Agents Today

Frequently Asked Questions

Skip the Ollama setup entirely.

Want to skip the setup?

Related Articles

A2A vs MCP vs ACP: Which AI Agent Protocol Do You Actually Need?

AI Agent Assist: What It Is, How It Works, and When to Go Fully Autonomous

AI Agent Builder for Ecommerce: 5 Automations That Pay for Themselves in Week One