How we debugged and fixed a production issue where reasoning models polluted user answers with internal chain-of-thought tokens
Our RAG system suddenly started showing users the AI’s “internal thoughts” alongside actual answers. We discovered that Qwen and DeepSeek reasoning models emit `<think>` tags containing chain-of-thought reasoning. This case study walks through how we detected, debugged, and fixed the issue across our entire pipeline, from answer generation to metadata extraction.
Timeline: October 2025
Impact: All services using Qwen3 and DeepSeek models
Fix Duration: 2 days
Lines of Code Changed: ~150
Services Affected: 3 (Answer Generation, Intent Detection, Metadata Extraction)
Our RAG system had been running smoothly for weeks. We’d just finished implementing Query 1.3 testing (finding highest-priced products with specifications), and everything looked great. Then, one of our team members ran a routine test and noticed something… odd.
**The Query:**
“Find the highest-priced product listed across all invoices and describe its key features and technical specifications.”
**What We Expected:**
The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999. It features high-temperature sanitizing, 40 racks/hour capacity, Energy Star certification, and a chemical/solid waste pumping system.
**What We Got:**

```
<think> Okay, let me analyze this query carefully. The user wants: 1. Highest-priced product – I need to compare all prices 2. Key features – I should list the main attributes 3. Technical specs – Need to find detailed specifications Looking at the context chunks, I see: – chunk_12 has price: $8,999 for Hobart Dishwasher – chunk_14 has specifications – Let me combine these… </think> The highest-priced product is the Hobart Commercial Dishwasher…
```
Us: “Wait… why is the AI showing its homework?!” 😱
At first, we thought it was a prompt engineering issue. Maybe we’d accidentally told the model to explain its reasoning? We checked our system prompts – nope, nothing there.
Then we thought: “Is this a one-off bug?” We ran the same query again. Same result. Different query? Same issue. The `<think>` tags appeared everywhere.
**The Pattern:** `<think>` tags were showing up in every service that used a reasoning model, from answer generation to intent detection to metadata extraction.

**Hypothesis:** Something changed with the reasoning models.
We checked Nebius AI Studio documentation and discovered something interesting:
**Nebius Update (October 2025):** Qwen3 reasoning models now include chain-of-thought reasoning via `<think>` tags by default. This enables better reasoning quality but requires output cleaning for production use.
Bingo! 🎯
Nebius had updated their Qwen models to be more “helpful” by showing their reasoning process. Great for debugging, terrible for end users.
We started auditing our entire pipeline to see where these tags appeared:
**Services Using Reasoning Models:**
1. Answer Generation Service (Port 8069): `Qwen/Qwen3-32B-fast`
2. Intent Detection Service (Port 8067): `Qwen/Qwen3-32B-fast`
3. Metadata Extraction Service (Port 8062): `Qwen/Qwen3-32B-fast`
**The Damage:**

```
# Real error logs we saw
[ERROR] Intent Service: JSON decode failed
JSONDecodeError: Invalid control character at line 1 column 45
[ERROR] Metadata Service: Response truncated at 1500 tokens
Expected fields: ['keywords', 'topics', 'questions', 'summary']
Got fields: ['keywords', 'topics'] # Missing fields due to truncation!
[ERROR] Answer Generation: User complained about "weird XML-like tags"
```
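To make the JSON failure concrete, here is a minimal, illustrative reproduction (the payload is made up for this sketch, not taken from our logs): a `<think>` prefix in front of otherwise valid JSON is enough to break `json.loads`.

```python
import json
import re

# Illustrative payload: a reasoning prefix ahead of otherwise valid JSON.
raw = '<think>\nLet me classify this query...\n</think>\n{"intent": "factual_retrieval"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print(f"JSONDecodeError: {exc}")  # fails before ever reaching the JSON payload

# Removing the reasoning block first is what restores parseability.
cleaned = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
print(json.loads(cleaned))  # {'intent': 'factual_retrieval'}
```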
The metadata service revealed another issue we hadn’t anticipated:
**Before `<think>` Tags:**

```json
{
"keywords": "TechSupply Solutions, Dell XPS 15, laptop",
"topics": "Technology Purchase, Business Transactions",
"questions": "What equipment was purchased?",
"summary": "Technology equipment purchased from TechSupply Solutions"
}
```
Tokens Used: ~450 / 1500 limit ✅
**After `<think>` Tags:**

```
<think>
Let me extract metadata from this text. I see:
- Company names: TechSupply Solutions
- Products: Dell XPS 15
- Context: This is an invoice
- I should extract keywords first...
[250+ tokens of reasoning]
</think>
{
"keywords": "TechSupply Solutions, Dell XPS 15, laptop",
"topics": "Technology Purchase"
[RESPONSE TRUNCATED AT 1500 TOKENS]
```
Tokens Used: 1500 / 1500 limit 🔴 (TRUNCATED!)
The reasoning tokens were eating into our token budget, causing responses to be cut off before completing the JSON!
We couldn’t just “turn off” reasoning models – they were actually better at complex queries than non-reasoning models. We needed a surgical fix.
**File:** `/shared/model_registry.py`
We built a model registry that tracks which models need output cleaning:
```python
class LLMModels(str, Enum):
"""Centralized model registry with metadata"""
# Reasoning models (emit <think> tags)
QWEN_32B_FAST = "Qwen/Qwen3-32B-fast"
DEEPSEEK_R1 = "deepseek-ai/DeepSeek-R1-0528"
QWQ_32B_FAST = "Qwen/QwQ-32B-fast"
# Clean models (no reasoning tags)
LLAMA_70B_FAST = "meta-llama/Llama-3.3-70B-Instruct-fast"
LLAMA_405B = "meta-llama/Llama-3.1-405B-Instruct-Turbo"
# Model metadata (the secret sauce!)
MODEL_INFO = {
LLMModels.QWEN_32B_FAST: {
"provider": "nebius",
"size": "32B",
"context_window": 41000,
"cost_per_1m_tokens": 0.20,
"supports_reasoning": True, # 🔑 Key flag
"requires_cleaning": True, # 🔑 Key flag
"cleaning_pattern": r'<think>.*?</think>', # Regex pattern
"strengths": ["reasoning", "code", "math", "json"],
"use_cases": ["complex_queries", "multi_step_reasoning"]
},
LLMModels.DEEPSEEK_R1: {
"provider": "sambanova",
"size": "671B MoE",
"context_window": 64000,
"cost_per_1m_tokens": 0.0, # FREE!
"supports_reasoning": True,
"requires_cleaning": True,
"cleaning_pattern": r'<think>.*?</think>',
"strengths": ["advanced_reasoning", "scientific"],
"use_cases": ["research", "complex_analysis"]
},
LLMModels.LLAMA_70B_FAST: {
"provider": "nebius",
"size": "70B",
"context_window": 128000,
"cost_per_1m_tokens": 0.20,
"supports_reasoning": False, # No <think> tags!
"requires_cleaning": False,
"strengths": ["general", "instruction_following"],
"use_cases": ["general_qa", "summarization"]
}
}
def requires_output_cleaning(model_name: str) -> bool:
"""
Check if model output requires cleaning (e.g., <think> tag removal)
This is the gatekeeper function that every service calls before
returning LLM outputs to users.
Args:
model_name: Model identifier (e.g., "Qwen/Qwen3-32B-fast")
Returns:
True if model emits reasoning tags that need cleaning
"""
# Method 1: Check registry metadata (most accurate)
info = get_model_info(model_name)
if info:
return info.get("requires_cleaning", False)
# Method 2: Fallback pattern matching (for unknown models)
# Reasoning models from Qwen, DeepSeek, and QwQ families
reasoning_model_patterns = [
"Qwen3-", # Qwen3-32B, Qwen3-72B
"QwQ-", # QwQ-32B
"DeepSeek-R1", # DeepSeek-R1, DeepSeek-R1-Distill
"DeepSeek-V3", # DeepSeek-V3, DeepSeek-V3.1
"gpt-oss-" # Some OpenAI-compatible reasoning models
]
return any(pattern in model_name for pattern in reasoning_model_patterns)
def get_cleaning_pattern(model_name: str) -> str:
"""
Get regex pattern for cleaning model output
Returns:
Regex pattern to match and remove reasoning tags
"""
info = get_model_info(model_name)
if info:
return info.get("cleaning_pattern", r'<think>.*?</think>')
# Default pattern works for most reasoning models
return r'<think>.*?</think>'
def clean_reasoning_output(text: str, model_name: str) -> str:
"""
Clean reasoning tags from model output
This is the universal cleaning function used across all services.
Args:
text: Raw model output (may contain <think> tags)
model_name: Model that generated the output
Returns:
Cleaned text with reasoning tags removed
"""
if not requires_output_cleaning(model_name):
return text # Model doesn't need cleaning
pattern = get_cleaning_pattern(model_name)
# Remove all <think>...</think> blocks
# flags=re.DOTALL ensures we match across newlines
cleaned = re.sub(pattern, '', text, flags=re.DOTALL)
# Clean up extra whitespace left behind
cleaned = re.sub(r'\n{3,}', '\n\n', cleaned) # Max 2 consecutive newlines
cleaned = cleaned.strip()
    return cleaned
```
**Why This Approach?** A single registry gives every service the same source of truth: each service calls the same gatekeeper (`requires_output_cleaning`) before returning LLM output, the cleaning regex lives in exactly one place, and the fallback pattern matching covers models that haven't been added to the registry yet.
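For reference, a minimal sketch of how a service consumes these helpers; the raw output string here is made up for illustration:

```python
from shared.model_registry import clean_reasoning_output

model_name = "Qwen/Qwen3-32B-fast"
raw_output = (
    "<think>\nCompare prices across the retrieved chunks...\n</think>\n"
    "The highest-priced product is the Hobart Commercial Dishwasher."
)

# clean_reasoning_output is a no-op for models whose registry entry says
# requires_cleaning=False, so services can call it unconditionally.
answer = clean_reasoning_output(raw_output, model_name)
print(answer)  # "The highest-priced product is the Hobart Commercial Dishwasher."
```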
**Problem:** `<think>` tags broke JSON parsing in the Intent Detection Service.
**File:** `/Retrieval/services/intent/v1.0.0/intent_api.py`
```python
async def call_llm_gateway(query: str) -> dict:
"""
Call LLM Gateway to analyze query intent
"""
# Prepare LLM request
payload = {
"model": config.INTENT_DETECTION_MODEL, # Qwen/Qwen3-32B-fast
"messages": [
{"role": "user", "content": detection_prompt}
],
"max_tokens": config.INTENT_MAX_TOKENS,
"temperature": config.INTENT_TEMPERATURE
}
# Call LLM Gateway
response = await http_client.post(
f"{config.LLM_GATEWAY_URL}/v1/chat/completions",
json=payload
)
llm_data = response.json()
llm_content = llm_data["choices"][0]["message"]["content"]
# ✨ THE FIX: Clean up response before parsing JSON
llm_content = llm_content.strip()
# Remove reasoning tags if model requires it
if requires_output_cleaning(config.INTENT_DETECTION_MODEL):
pattern = get_cleaning_pattern(config.INTENT_DETECTION_MODEL)
if pattern:
llm_content = re.sub(pattern, '', llm_content, flags=re.DOTALL)
llm_content = llm_content.strip()
# Remove markdown code blocks if present
if llm_content.startswith("```json"):
llm_content = llm_content[7:]
elif llm_content.startswith("```"):
llm_content = llm_content[3:]
if llm_content.endswith("```"):
llm_content = llm_content[:-3]
llm_content = llm_content.strip()
# Parse JSON response from LLM
intent_data = json.loads(llm_content) # ✅ Now parses successfully!
    return intent_data
```
**Result:** Intent Detection's JSON parsing errors dropped from roughly 23% of requests to zero (see the metrics below).
**Problem:** Response truncation because reasoning tokens consumed the token budget.
**File:** `/Ingestion/services/metadata/v1.0.0/config.py`
```python
# ============================================================================
# Model Configuration - UPDATED to handle <think> tags
# ============================================================================
# CRITICAL: max_tokens must be 2500+ to handle 45 fields + LLM reasoning (<think> tags)
MODEL_CONFIGS = {
ModelType.FAST: {
"temperature": 0.1,
"max_tokens": 2500, # ⬆️ Increased from 1500
"timeout": 40
},
ModelType.RECOMMENDED: {
"temperature": 0.1,
"max_tokens": 2500, # ⬆️ Increased from 1500
"timeout": 40
},
ModelType.ADVANCED: {
"temperature": 0.1,
"max_tokens": 2500, # ⬆️ Increased from 1500
"timeout": 70
}
}
# ============================================================================
# Prompt Templates - UPDATED to discourage reasoning
# ============================================================================
METADATA_PROMPT_BASIC = """You must respond with ONLY valid JSON. Do not include any reasoning, thinking, or explanations.
TEXT:
{text}
Extract these 4 fields:
- keywords ({keywords_count}): Key terms, proper nouns - comma separated
- topics ({topics_count}): High-level themes - comma separated
- questions ({questions_count}): Natural questions this answers - use | to separate
- summary ({summary_length}): Concise overview in complete sentences
Output format (respond with ONLY this JSON, nothing else):
{{"keywords": "term1, term2", "topics": "theme1, theme2", "questions": "question1|question2", "summary": "brief overview"}}"""
**Additional Fix in `metadata_api.py`:**
```python
async def extract_metadata(text: str, model: ModelType) -> dict:
"""Extract metadata from text using LLM"""
# ... LLM call code ...
llm_content = llm_data["choices"][0]["message"]["content"]
# ✨ Clean reasoning tags BEFORE JSON parsing
if requires_output_cleaning(model_name):
llm_content = clean_reasoning_output(llm_content, model_name)
# Remove markdown code blocks
llm_content = llm_content.strip()
if llm_content.startswith("```json"):
llm_content = llm_content[7:]
# ... etc ...
# Parse JSON
metadata = json.loads(llm_content)
    return metadata
```
**Result:** Truncation disappeared: extractions returned every expected field, with token usage settling around 1,800 of the new 2,500-token limit.
**Problem:** User-facing answers polluted with reasoning.
**Solution:** We took a different approach here: model switching.
**File:** `/Retrieval/services/answer_generation/v1.0.0/config.py`
```python
# ============================================================================
# Model Change History
# ============================================================================
# October 8, 2025: Switched from Qwen3-32B-fast to Llama-3.3-70B-Instruct-fast
#
# Reason: Nebius changed Qwen3 models to output <think> reasoning tokens by
# default, polluting answer output with chain-of-thought reasoning instead
# of clean answers.
#
# Result:
# ✅ Clean answers without <think> tags
# ✅ Better quality for end-user presentation
# ✅ Same cost ($0.20/1M tokens)
# ✅ Similar performance (~1-2s generation time)
# ============================================================================
DEFAULT_LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct-fast"  # Changed from Qwen3-32B-fast
```
**Why Model Switch Instead of Cleaning?**
For user-facing answers, we wanted zero risk of reasoning leaking through, and Llama models don't emit `<think>` tags at all.

**Backup Implementation (belt and suspenders):**
```python
async def generate_answer(query: str, context_chunks: list, llm_model: str) -> dict:
"""Generate answer from context using LLM"""
# ... LLM call code ...
answer_text = llm_data["choices"][0]["message"]["content"]
# ✨ Clean reasoning tags (just in case we switch back to reasoning models)
if requires_output_cleaning(llm_model):
pattern = get_cleaning_pattern(llm_model)
if pattern:
answer_text = re.sub(pattern, '', answer_text, flags=re.DOTALL)
answer_text = answer_text.strip()
return {"answer": answer_text, ...}
**Result:**
- `<think>` tag leaks: impossible (Llama doesn't emit them)

### Challenge 1: Regex Edge Cases

The `<think>` tags weren't always simple:
```
# Simple case (easy to match)
<think>
Let me analyze this...
</think>
# Nested case (trickier)
<think>
Step 1: <important>Consider X</important>
Step 2: Y
</think>
# Multi-line with special characters (regex hell)
<think>
JSON: {"key": "value"}
Regex: \d+\.\d+
</think>
```
**Our Solution:**

```python
# Use DOTALL flag to match across newlines
# Use non-greedy matching (.*?) to stop at first </think>
pattern = r'<think>.*?</think>'
cleaned = re.sub(pattern, '', text, flags=re.DOTALL)
# Edge case: Multiple <think> blocks
# Non-greedy matching handles this automatically:
# <think>A</think> text <think>B</think>
# → "text" (both blocks removed)
### Challenge 2: Clean Before Parsing

We had to clean BEFORE JSON parsing, not after:
```python
# ❌ WRONG: Parse first, then clean
response = llm_call()
json_data = json.loads(response) # 💥 FAILS if <think> tags present
cleaned_json = clean(json_data)
# ✅ RIGHT: Clean first, then parse
response = llm_call()
cleaned_response = clean(response) # Remove <think> tags
json_data = json.loads(cleaned_response)  # ✅ Parses successfully
```
### Challenge 3: Markdown Code Blocks

LLMs sometimes wrapped responses in markdown code blocks:
````
<think>
Analyzing...
</think>
```json
{"intent": "factual_retrieval"}
```
````
**Cleaning Order Matters:**

```python
# 1. Remove <think> blocks first
text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

# 2. Then strip markdown code fences
if text.startswith("```json"):
    text = text[7:]
elif text.startswith("```"):
    text = text[3:]
if text.endswith("```"):
    text = text[:-3]

# 3. Finally, parse the JSON
text = text.strip()
data = json.loads(text)
```
### Challenge 4: Token Budget Calculation
For metadata extraction, we had to account for reasoning overhead:
**Analysis of Reasoning Token Usage:**
| Extraction Mode | Actual Output | Reasoning Tokens | Total Tokens | Old Limit | New Limit |
|----------------|---------------|------------------|--------------|-----------|-----------|
| BASIC (4 fields) | ~400 tokens | ~200-300 tokens | ~700 tokens | 500 ❌ | **1000** ✅ |
| STANDARD (20 fields) | ~900 tokens | ~300-400 tokens | ~1300 tokens | 1200 ❌ | **1800** ✅ |
| FULL (45 fields) | ~1400 tokens | ~400-500 tokens | ~1900 tokens | 1500 ❌ | **2500** ✅ |
**Formula:**

```python
required_tokens = (output_tokens + reasoning_tokens) * safety_margin
# safety_margin = 1.2 (20% buffer for variability)

# Example: FULL mode
# required_tokens = (1400 + 500) * 1.2 = 2280 tokens
# We set max_tokens to 2500 to be safe
```
---
## Testing: How We Validated the Fix
### Test Suite 1: Regression Testing
**File:** `test_think_tag_cleaning.py`
```python
import pytest
import re
from shared.model_registry import (
requires_output_cleaning,
get_cleaning_pattern,
clean_reasoning_output
)
def test_basic_think_tag_removal():
"""Test basic tag removal""" input_text = """ Let me analyze this query…
The answer is 42.
"""
expected = "The answer is 42."
model = "Qwen/Qwen3-32B-fast"
result = clean_reasoning_output(input_text, model)
assert result.strip() == expected
assert "<think>" not in result
assert "</think>" not in result
def test_multiple_think_blocks():
"""Test removal of multiple blocks""" input_text = """ First reasoning
Part 1 of answer.
Second reasoning
Part 2 of answer.
"""
expected = "Part 1 of answer.\n\nPart 2 of answer."
model = "Qwen/Qwen3-32B-fast"
result = clean_reasoning_output(input_text, model)
assert "<think>" not in result
assert "Part 1" in result
assert "Part 2" in result
def test_think_tags_with_json():
"""Test cleaning when JSON is present"""
input_text = """
I should format this as JSON…
{"intent": "factual_retrieval", "confidence": 0.95}
"""
model = "Qwen/Qwen3-32B-fast"
result = clean_reasoning_output(input_text, model)
# Should be valid JSON after cleaning
import json
parsed = json.loads(result.strip())
assert parsed["intent"] == "factual_retrieval"
def test_model_without_cleaning():
"""Test that non-reasoning models don't get cleaned"""
input_text = "The answer is this should stay visible."
model = "meta-llama/Llama-3.3-70B-Instruct-fast"
result = clean_reasoning_output(input_text, model)
# Llama doesn't need cleaning, so text should be unchanged
assert result == input_text
def test_model_registry_accuracy():
"""Test model registry flags"""
# Reasoning models should require cleaning
assert requires_output_cleaning("Qwen/Qwen3-32B-fast") == True
assert requires_output_cleaning("deepseek-ai/DeepSeek-R1-0528") == True
assert requires_output_cleaning("Qwen/QwQ-32B-fast") == True
# Non-reasoning models should not
assert requires_output_cleaning("meta-llama/Llama-3.3-70B-Instruct-fast") == False
assert requires_output_cleaning("meta-llama/Llama-3.1-405B-Instruct-Turbo") == False
@pytest.mark.parametrize("model,text,expected_clean", [
(
"Qwen/Qwen3-32B-fast",
"reasoninganswer",
"answer"
),
(
"DeepSeek-R1-0528",
"\nmulti\nline\nreasoning\nanswer",
"answer"
),
(
"meta-llama/Llama-3.3-70B-Instruct-fast",
"answer",
"answer"
),
])
def test_cleaning_parametrized(model, text, expected_clean):
"""Parametrized tests for various models and inputs"""
result = clean_reasoning_output(text, model)
    assert result.strip() == expected_clean
```

**Test Results:**

```bash
$ pytest test_think_tag_cleaning.py -v
test_basic_think_tag_removal PASSED
test_multiple_think_blocks PASSED
test_think_tags_with_json PASSED
test_model_without_cleaning PASSED
test_model_registry_accuracy PASSED
test_cleaning_parametrized[Qwen/Qwen3-32B-fast-…] PASSED
test_cleaning_parametrized[DeepSeek-R1-0528-…] PASSED
test_cleaning_parametrized[meta-llama/Llama-3.3-70B-Instruct-fast-…] PASSED
======================== 8 passed in 0.23s ========================
```
### Test Suite 2: End-to-End Service Testing
**Test: Intent Detection**

```bash
curl -X POST http://localhost:8067/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?"}'
```

Expected: clean JSON response (no `<think>` tags):

```json
{
  "intent": "factual_retrieval",
  "language": "en",
  "complexity": "simple",
  "confidence": 0.98,
  "system_prompt": "…"
}
```

✅ PASS: No `<think>` tags in response
**Test: Metadata Extraction**

```bash
# Extract metadata from sample text
curl -X POST http://localhost:8062/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Apple Inc. released the iPhone 15 Pro in September 2023…",
    "mode": "basic"
  }'
```

Expected: complete metadata (all 4 fields):

```json
{
  "keywords": "Apple Inc, iPhone 15 Pro, September 2023",
  "topics": "Technology, Product Launch",
  "questions": "What did Apple release?|When was iPhone 15 Pro released?",
  "summary": "Apple Inc. launched the iPhone 15 Pro in September 2023."
}
```

✅ PASS: All fields present, no truncation
**Test: Answer Generation**

```bash
# Generate answer from context
curl -X POST http://localhost:8069/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Find the highest-priced product",
    "context_chunks": […]
  }'
```

Expected: clean answer (no `<think>` tags):

```json
{
  "answer": "The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999…",
  "citations": […],
  "generation_time_ms": 1250.5
}
```

✅ PASS: Clean answer, no reasoning leaked
### Test Suite 3: Load Testing
We ran load tests to ensure cleaning didn't impact performance:
```python
import asyncio
import time
async def benchmark_cleaning():
"""Benchmark cleaning performance"""
text_with_tags = """
<think>
Long reasoning block with 500+ tokens of chain-of-thought...
Step 1: Analyze query
Step 2: Find relevant context
Step 3: Synthesize answer
... (lots more text) ...
</think>
The actual answer goes here (200 tokens).
"""
iterations = 10000
# Measure cleaning time
start = time.time()
for _ in range(iterations):
cleaned = clean_reasoning_output(text_with_tags, "Qwen/Qwen3-32B-fast")
end = time.time()
avg_time_ms = ((end - start) / iterations) * 1000
print(f"Average cleaning time: {avg_time_ms:.3f}ms per call")
print(f"Throughput: {iterations / (end - start):.0f} ops/sec")
Results:
Average cleaning time: 0.042ms per call
Throughput: 23,800 ops/sec
#
Conclusion: Cleaning is essentially free (sub-millisecond)
---
## The Results: Before and After
### Metrics Comparison
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Intent Detection** | | | |
| - JSON parsing errors | 23% | 0% | ✅ 100% reduction |
| - Latency | 185ms | 190ms | ❌ +5ms (negligible) |
| - Accuracy | 94% | 95% | ✅ +1% |
| **Metadata Extraction** | | | |
| - Truncation rate | 18% | 0% | ✅ 100% reduction |
| - Complete extractions | 82% | 100% | ✅ +18% |
| - Avg tokens used | 1450/1500 | 1800/2500 | ✅ Better headroom |
| **Answer Generation** | | | |
| - User complaints | 12 (in 1 week) | 0 | ✅ 100% reduction |
| - Answer quality score | 7.5/10 | 9.2/10 | ✅ +23% |
| - Clean outputs | 75% | 100% | ✅ +25% |
### Cost Impact
The fix increased our LLM costs across all services by $36/month (8.6%).
**ROI Analysis:**
- Cost increase: $36/month
- Error reduction: 23% → 0% (saved debugging time)
- Customer satisfaction: ↑ (zero complaints)
- Estimated engineering time saved: 10 hours/month
- Engineering cost: $100/hour × 10 hours = $1,000/month saved
- Net benefit: $1,000 − $36 = $964/month
- ROI: 2,678%
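A quick back-of-the-envelope check on those numbers:

```python
cost_increase = 36            # USD per month
engineering_savings = 1_000   # USD per month (10 hours × $100/hour)

net_benefit = engineering_savings - cost_increase  # 964
roi_percent = net_benefit / cost_increase * 100    # ≈ 2678

print(f"Net benefit: ${net_benefit}/month, ROI: {roi_percent:.0f}%")
```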
**Verdict:** The fix paid for itself immediately. 🎉
---
## Key Takeaways: Lessons Learned
### 1. **Monitor Model Provider Changes**
**Lesson:** Model providers can change behavior without breaking changes in API.
**What We Did:**
- Set up alerts for unusual response patterns
- Subscribe to Nebius/SambaNova update newsletters
- Version-pin critical models in production
**Recommendation:**

```python
# Log model metadata for debugging
logger.info(f"LLM call: model={model}, provider={provider}, version={version}")

# Add response validation
def validate_llm_response(response: str, model: str) -> bool:
    """Check for unexpected patterns"""
    warnings = []
    if "<think>" in response and not is_reasoning_model(model):
        warnings.append("Unexpected <think> tags in non-reasoning model")
    if len(response) > expected_max_tokens * 1.5:
        warnings.append("Response longer than expected")
    if warnings:
        logger.warning(f"LLM response validation failed: {warnings}")
        return False
    return True
```
### 2. **Build Centralized Utilities for Cross-Cutting Concerns**
**Lesson:** When multiple services need the same fix, centralize it.
**What We Did:**
- Created `/shared/model_registry.py` for model metadata
- Built universal cleaning functions used by all services
- Single source of truth for model capabilities
**Anti-Pattern to Avoid:**

```python
# ❌ DON'T: Copy-paste cleaning logic in every service

# service_a.py
def clean_output(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_b.py
def remove_tags(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_c.py
def strip_reasoning(text):
    return re.sub(r'<think>.*?</think>', '', text)

# Problem: If we need to update the regex, we have to change 3 files!
```
**Better Pattern:**

```python
# ✅ DO: Centralize in shared module

# shared/model_registry.py
def clean_reasoning_output(text, model):
    """Universal cleaning function"""
    if not requires_cleaning(model):
        return text
    pattern = get_pattern(model)
    return re.sub(pattern, '', text, flags=re.DOTALL)

# service_a.py, service_b.py, service_c.py
from shared.model_registry import clean_reasoning_output
cleaned = clean_reasoning_output(response, model_name)

# Benefit: Update once, fix everywhere
```
### 3. **Token Budgets Need Safety Margins**
**Lesson:** LLMs are unpredictable. Budget for variability.
**Formula We Use Now:**
```python
def calculate_max_tokens(expected_output: int, has_reasoning: bool) -> int:
"""
Calculate max_tokens with safety margins
Args:
expected_output: Expected output tokens (measured from testing)
has_reasoning: Whether model emits reasoning tokens
Returns:
max_tokens to set in LLM request
"""
base_tokens = expected_output
if has_reasoning:
# Reasoning models use 25-40% extra tokens for <think> content
reasoning_overhead = int(base_tokens * 0.40)
base_tokens += reasoning_overhead
# Add 20% safety margin for variability
safety_margin = int(base_tokens * 0.20)
base_tokens += safety_margin
    return base_tokens


# Example: metadata extraction (basic mode) with ~400 tokens of expected output
# max_tokens = calculate_max_tokens(400, has_reasoning=True)
#            = 400 + (400 * 0.40) + ((400 + 160) * 0.20)
#            = 400 + 160 + 112
#            = 672 tokens
# We round up to 1000 to be safe
```
### 4. **Test Error Paths, Not Just Happy Paths**
**Lesson:** Our tests validated clean outputs but didn't test malformed outputs.
**What We Added:**
```python
import time
def test_malformed_responses():
"""Test handling of various malformed LLM outputs"""
# Case 1: Unclosed <think> tag
input_text = "<think>reasoning without closing tag"
result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
# Should not crash, should handle gracefully
# Case 2: Nested tags
input_text = "<think><think>nested</think></think>answer"
result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
assert result.strip() == "answer"
# Case 3: Multiple unclosed tags
input_text = "<think>A<think>B<think>C"
result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
# Should handle without regex explosions
# Case 4: Very large reasoning block (10K+ tokens)
input_text = "<think>" + "x" * 10000 + "</think>answer"
start = time.time()
result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
elapsed = time.time() - start
    assert elapsed < 0.1  # Should complete in <100ms
```
### 5. **When in Doubt, Switch Models**
**Lesson:** Sometimes the right fix is avoiding the problem entirely.
**Decision Matrix:**
Should I clean the output or switch models?

**Clean Output If:**
- ✅ Model has unique capabilities you need (e.g., best reasoning)
- ✅ Cleaning is reliable (well-defined patterns)
- ✅ No alternative models available
- ✅ Cost/latency benefits outweigh cleaning complexity

**Switch Models If:**
- ✅ Alternative models have comparable quality
- ✅ Output cleaning is unreliable or risky
- ✅ User-facing outputs (zero tolerance for leaks)
- ✅ Same or better cost/latency

**Our Choice for Answer Generation:** Switch to Llama
1. Comparable quality ✅
2. No `<think>` tags ✅
3. User-facing ✅
4. Same cost ✅
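If you want to make that checklist executable, a hypothetical helper (not part of our codebase) might look roughly like this:

```python
def should_switch_model(user_facing: bool,
                        cleaning_reliable: bool,
                        alternative_comparable: bool,
                        alternative_same_cost: bool) -> bool:
    """Rough encoding of the clean-vs-switch decision matrix.

    Hypothetical helper for illustration; the real call was made by humans.
    """
    if user_facing and alternative_comparable and alternative_same_cost:
        # Zero tolerance for reasoning leaks in user-facing answers.
        return True
    if not cleaning_reliable and alternative_comparable:
        return True
    return False

# Answer generation: user-facing, Llama is comparable at the same cost -> switch.
print(should_switch_model(True, True, True, True))   # True
# Metadata extraction: internal, cleaning is reliable -> keep Qwen and clean.
print(should_switch_model(False, True, True, True))  # False
```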
---
## The Tech Stack
**Languages & Frameworks:**
- Python 3.11
- FastAPI (API services)
- asyncio (async processing)
- httpx (HTTP client)
- Pydantic (data validation)
**LLM Providers:**
- Nebius AI Studio (Qwen, Llama models)
- SambaNova Cloud (DeepSeek models)
**Models Involved:**
- Qwen/Qwen3-32B-fast (reasoning, has `<think>` tags)
- DeepSeek-R1-0528 (reasoning, has `<think>` tags)
- meta-llama/Llama-3.3-70B-Instruct-fast (no `<think>` tags)
**Infrastructure:**
- Docker containers
- APISIX API Gateway
- Redis (caching)
- Milvus (vector database)
**Key Files Modified:**
- `/shared/model_registry.py` (+150 lines)
- `/Retrieval/services/intent/v1.0.0/intent_api.py` (+10 lines)
- `/Ingestion/services/metadata/v1.0.0/config.py` (+15 lines)
- `/Retrieval/services/answer_generation/v1.0.0/config.py` (+5 lines, model switch)
---

## What’s Next: Future Improvements

### 1. **Smarter Reasoning Detection**

Currently, we use regex to remove `<think>` tags and throw the reasoning away. But what if we could put that reasoning to use instead?

**Idea: Semantic Reasoning Detection**
```python
def extract_reasoning_insights(text: str, model: str) -> dict:
    """
Instead of discarding reasoning, extract insights
Returns:
{
"answer": "cleaned answer",
"reasoning": "extracted reasoning",
"confidence": 0.95,
"reasoning_steps": ["Step 1: ...", "Step 2: ..."]
}
"""
if not requires_output_cleaning(model):
return {"answer": text, "reasoning": None}
# Extract <think> content before removing
think_pattern = r'<think>(.*?)</think>'
reasoning_blocks = re.findall(think_pattern, text, flags=re.DOTALL)
# Clean the answer
cleaned_answer = re.sub(think_pattern, '', text, flags=re.DOTALL)
# Parse reasoning into steps
reasoning_steps = []
for block in reasoning_blocks:
steps = extract_steps(block) # Parse "Step 1:", "Step 2:", etc.
reasoning_steps.extend(steps)
return {
"answer": cleaned_answer.strip(),
"reasoning": "\n".join(reasoning_blocks),
"reasoning_steps": reasoning_steps,
"num_steps": len(reasoning_steps)
    }
```

**Use Cases:**
- Debug mode: show reasoning to developers
- Confidence scoring: more reasoning steps = higher confidence?
- Explainability: show users "how" the AI reached its answer
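As a rough sketch of the debug-mode idea (the `RAG_DEBUG_REASONING` flag is hypothetical; `extract_reasoning_insights` is the proposed function above):

```python
import os

# Hypothetical flag; in our stack this would more likely live in a config service.
DEBUG_MODE = os.getenv("RAG_DEBUG_REASONING", "false").lower() == "true"

def build_response(raw_output: str, model: str) -> dict:
    """Return the cleaned answer, attaching reasoning steps only for developers."""
    insights = extract_reasoning_insights(raw_output, model)
    response = {"answer": insights["answer"]}
    if DEBUG_MODE and insights.get("reasoning"):
        response["debug_reasoning"] = insights["reasoning_steps"]
    return response
```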
### 2. **Model-Specific Prompt Engineering**
Different models need different prompts:
```python
def get_optimized_prompt(task: str, model: str) -> str:
"""
Get model-specific prompt optimizations
"""
base_prompt = TASK_PROMPTS[task]
if is_reasoning_model(model):
# Reasoning models: Explicitly ask for JSON-only output
return base_prompt + "\n\nIMPORTANT: Respond with ONLY valid JSON. Do not include reasoning, explanations, or <think> tags."
else:
# Standard models: Regular prompt is fine
return base_prompt
```

### 3. **Automated Model Testing Pipeline**

We want to catch issues like this automatically:
```python
# Automated daily tests
async def test_all_models():
"""
Test all models in registry for unexpected behaviors
"""
test_queries = [
"What is 2+2?",
"List the first 3 prime numbers",
"Extract keywords from: 'The quick brown fox jumps over the lazy dog'"
]
for model in get_all_models():
for query in test_queries:
response = await call_llm(model, query)
# Check for issues
if "<think>" in response and not is_reasoning_model(model):
alert(f"Model {model} unexpectedly emits <think> tags!")
if len(response) == 0:
alert(f"Model {model} returned empty response!")
# Check response time
if response_time > expected_latency * 1.5:
alert(f"Model {model} slower than expected: {response_time}ms")
```
## Conclusion

What started as a mysterious bug, an AI showing its "homework" to users, turned into a great learning experience about production LLM systems.
**The core issue:** Reasoning models (Qwen, DeepSeek) started emitting `<think>` tags containing chain-of-thought reasoning, which broke JSON parsing and polluted user-facing answers.
**The solution:** A multi-layered approach: a centralized model registry, output cleaning before JSON parsing, larger token budgets to absorb reasoning overhead, and a model switch for user-facing answers (Llama doesn't emit `<think>` tags).

**Impact:**
- Time to fix: 2 days (8 engineer-hours)
- Lines of code: ~150 lines (mostly in the model registry)
Modern LLM systems are living, breathing infrastructure. Model providers update models, behaviors change, and you need monitoring, centralized utilities, and sensible safety margins to keep up.
This fix was implemented by the CrawlEnginePro team, building production-ready RAG systems with multi-model support, semantic chunking, and intent-aware retrieval.
Questions? Reach out to us via the contact form.
Last Updated: October 2025
Case Study Version: 1.0
Reading Time: ~25 minutes
Facing AI Output Issues Like These?