The “think” Tag Mystery – When AI Started Showing Its Homework

How we debugged and fixed a production issue where reasoning models polluted user answers with internal chain-of-thought tokens
TL;DR
Our RAG system suddenly started showing users the AI’s “internal thoughts” alongside actual answers. We discovered that Qwen and DeepSeek reasoning models emit <think> tags containing chain-of-thought reasoning. This case study walks through how we detected, debugged, and fixed the issue across our entire pipeline – from answer generation to metadata extraction.
Timeline: October 2025
Impact: All services using Qwen3 and DeepSeek models
Fix Duration: 2 days
Lines of Code Changed: ~150
Services Affected: 3 (Answer Generation, Intent Detection, Metadata Extraction)
The Story Behind the Bug
It Started on a Tuesday Morning…
Our RAG system had been running smoothly for weeks. We’d just finished implementing Query 1.3 testing (finding highest-priced products with specifications), and everything looked great. Then, one of our team members ran a routine test and noticed something… odd.
The Query:
“Find the highest-priced product listed across all invoices and describe its key features and technical specifications.”
What We Expected:
The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999. It features high-temperature sanitizing, 40 racks/hour capacity, Energy Star certification, and a chemical/solid waste pumping system.
What We Got:
<think> Okay, let me analyze this query carefully. The user wants: 1. Highest-priced product – I need to compare all prices 2. Key features – I should list the main attributes 3. Technical specs – Need to find detailed specifications Looking at the context chunks, I see: – chunk_12 has price: $8,999 for Hobart Dishwasher – chunk_14 has specifications – Let me combine these… </think> The highest-priced product is the Hobart Commercial Dishwasher…
Us: “Wait… why is the AI showing its homework?!” 😱
The Initial Confusion
At first, we thought it was a prompt engineering issue. Maybe we’d accidentally told the model to explain its reasoning? We checked our system prompts – nope, nothing there.
Then we thought: “Is this a one-off bug?” We ran the same query again. Same result. Different query? Same issue. The <think> tags appeared everywhere.
The Pattern:
- ✅ Llama models: Clean outputs
- ❌ Qwen3-32B-fast: Outputs <think> tags
- ❌ DeepSeek-R1: Outputs <think> tags
- ❌ QwQ-32B: Outputs <think> tags
Hypothesis: Something changed with the reasoning models.
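Confirming the pattern only takes a few lines: send the same query to each model and grep the response for the opening tag. A minimal sketch, assuming an OpenAI-compatible gateway like the one our services call (the URL is a placeholder; the model IDs match our registry):

```python
import asyncio
import httpx

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"  # placeholder gateway URL
MODELS = [
    "meta-llama/Llama-3.3-70B-Instruct-fast",
    "Qwen/Qwen3-32B-fast",
    "deepseek-ai/DeepSeek-R1-0528",
    "Qwen/QwQ-32B-fast",
]

async def probe(model: str, query: str) -> None:
    """Send one query to one model and report whether <think> tags appear."""
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(GATEWAY_URL, json={
            "model": model,
            "messages": [{"role": "user", "content": query}],
            "max_tokens": 512,
        })
        content = resp.json()["choices"][0]["message"]["content"]
        status = "❌ emits <think>" if "<think>" in content else "✅ clean"
        print(f"{model}: {status}")

async def main():
    query = "Find the highest-priced product listed across all invoices."
    await asyncio.gather(*(probe(m, query) for m in MODELS))

if __name__ == "__main__":
    asyncio.run(main())
```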
The Investigation: Going Down the Rabbit Hole
Phase 1: Is This a Model Change?
We checked Nebius AI Studio documentation and discovered something interesting:
Nebius Update (October 2025): Qwen3 reasoning models now include chain-of-thought reasoning via <think> tags by default. This enables better reasoning quality but requires output cleaning for production use.
Bingo! 🎯
Nebius had updated their Qwen models to be more “helpful” by showing their reasoning process. Great for debugging, terrible for end users.
Phase 2: How Deep Does This Go?
We started auditing our entire pipeline to see where these tags appeared:
Services Using Reasoning Models:
1. Answer Generation Service (Port 8069)
   - Model: Qwen/Qwen3-32B-fast
   - Impact: User-facing answers polluted with reasoning
   - Severity: 🔴 CRITICAL
2. Intent Detection Service (Port 8067)
   - Model: Qwen/Qwen3-32B-fast
   - Impact: JSON parsing errors (<think> tags broke JSON structure)
   - Severity: 🔴 CRITICAL
3. Metadata Extraction Service (Port 8062)
   - Model: Qwen/Qwen3-32B-fast
   - Impact: Response truncation (reasoning consumed token budget)
   - Severity: 🟡 HIGH
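The audit itself was mostly log archaeology: scanning recent service logs for the opening tag surfaces every affected component. A minimal sketch under assumed log locations (the paths and service names below are hypothetical):

```python
from pathlib import Path

# Hypothetical log locations -- adjust to wherever your services write logs
SERVICE_LOGS = {
    "answer_generation": Path("/var/log/rag/answer_generation.log"),
    "intent_detection": Path("/var/log/rag/intent.log"),
    "metadata_extraction": Path("/var/log/rag/metadata.log"),
}

def count_think_tags(log_path: Path) -> int:
    """Count log lines containing the <think> marker."""
    if not log_path.exists():
        return 0
    with log_path.open(encoding="utf-8", errors="ignore") as fh:
        return sum(1 for line in fh if "<think>" in line)

for service, path in SERVICE_LOGS.items():
    hits = count_think_tags(path)
    flag = "❌" if hits else "✅"
    print(f"{flag} {service}: {hits} log lines with <think> tags")
```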
The Damage:
```
# Real error logs we saw
[ERROR] Intent Service: JSON decode failed
JSONDecodeError: Invalid control character at line 1 column 45

[ERROR] Metadata Service: Response truncated at 1500 tokens
Expected fields: ['keywords', 'topics', 'questions', 'summary']
Got fields: ['keywords', 'topics']  # Missing fields due to truncation!

[ERROR] Answer Generation: User complained about "weird XML-like tags"
```
Phase 3: The Token Budget Crisis
The metadata service revealed another issue we hadn’t anticipated:
Before <think> Tags:
```json
{
  "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
  "topics": "Technology Purchase, Business Transactions",
  "questions": "What equipment was purchased?",
  "summary": "Technology equipment purchased from TechSupply Solutions"
}
```

Tokens Used: ~450 / 1500 limit ✅
After <think> Tags:
```
<think>
Let me extract metadata from this text. I see:
- Company names: TechSupply Solutions
- Products: Dell XPS 15
- Context: This is an invoice
- I should extract keywords first...
[250+ tokens of reasoning]
</think>
{
  "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
  "topics": "Technology Purchase"
[RESPONSE TRUNCATED AT 1500 TOKENS]
```

Tokens Used: 1500 / 1500 limit 🔴 (TRUNCATED!)
The reasoning tokens were eating into our token budget, causing responses to be cut off before completing the JSON!
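One cheap way to catch this class of failure early: OpenAI-compatible chat-completions responses report a `finish_reason` per choice, and `"length"` means the model hit `max_tokens` mid-response. Assuming our gateway passes that field through, a small guard like this (the logger name is illustrative) would have flagged the truncation immediately:

```python
import json
import logging

logger = logging.getLogger("metadata_service")

def parse_llm_json(llm_response: dict) -> dict:
    """Parse a chat-completions response into JSON, flagging truncation."""
    choice = llm_response["choices"][0]
    content = choice["message"]["content"]

    # "length" means generation was cut off by max_tokens
    if choice.get("finish_reason") == "length":
        logger.warning("LLM response truncated at max_tokens; JSON is likely incomplete")

    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        logger.error("JSON parse failed (truncated or polluted output?): %s", exc)
        raise
```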
The Solution: Multi-Layered Fix Strategy
We couldn’t just “turn off” reasoning models – they were actually better at complex queries than non-reasoning models. We needed a surgical fix.
Strategy 1: Create a Centralized Cleaning System
File: /shared/model_registry.py
We built a model registry that tracks which models need output cleaning:
```python
import re
from enum import Enum


class LLMModels(str, Enum):
    """Centralized model registry with metadata"""

    # Reasoning models (emit <think> tags)
    QWEN_32B_FAST = "Qwen/Qwen3-32B-fast"
    DEEPSEEK_R1 = "deepseek-ai/DeepSeek-R1-0528"
    QWQ_32B_FAST = "Qwen/QwQ-32B-fast"

    # Clean models (no reasoning tags)
    LLAMA_70B_FAST = "meta-llama/Llama-3.3-70B-Instruct-fast"
    LLAMA_405B = "meta-llama/Llama-3.1-405B-Instruct-Turbo"


# Model metadata (the secret sauce!)
MODEL_INFO = {
    LLMModels.QWEN_32B_FAST: {
        "provider": "nebius",
        "size": "32B",
        "context_window": 41000,
        "cost_per_1m_tokens": 0.20,
        "supports_reasoning": True,   # 🔑 Key flag
        "requires_cleaning": True,    # 🔑 Key flag
        "cleaning_pattern": r'<think>.*?</think>',  # Regex pattern
        "strengths": ["reasoning", "code", "math", "json"],
        "use_cases": ["complex_queries", "multi_step_reasoning"]
    },
    LLMModels.DEEPSEEK_R1: {
        "provider": "sambanova",
        "size": "671B MoE",
        "context_window": 64000,
        "cost_per_1m_tokens": 0.0,  # FREE!
        "supports_reasoning": True,
        "requires_cleaning": True,
        "cleaning_pattern": r'<think>.*?</think>',
        "strengths": ["advanced_reasoning", "scientific"],
        "use_cases": ["research", "complex_analysis"]
    },
    LLMModels.LLAMA_70B_FAST: {
        "provider": "nebius",
        "size": "70B",
        "context_window": 128000,
        "cost_per_1m_tokens": 0.20,
        "supports_reasoning": False,  # No <think> tags!
        "requires_cleaning": False,
        "strengths": ["general", "instruction_following"],
        "use_cases": ["general_qa", "summarization"]
    }
}


def get_model_info(model_name: str) -> dict | None:
    """Look up registry metadata by model identifier string (helper used below)"""
    try:
        return MODEL_INFO.get(LLMModels(model_name))
    except ValueError:  # model not in the registry
        return None


def requires_output_cleaning(model_name: str) -> bool:
    """
    Check if model output requires cleaning (e.g., <think> tag removal)

    This is the gatekeeper function that every service calls before
    returning LLM outputs to users.

    Args:
        model_name: Model identifier (e.g., "Qwen/Qwen3-32B-fast")

    Returns:
        True if model emits reasoning tags that need cleaning
    """
    # Method 1: Check registry metadata (most accurate)
    info = get_model_info(model_name)
    if info:
        return info.get("requires_cleaning", False)

    # Method 2: Fallback pattern matching (for unknown models)
    # Reasoning models from Qwen, DeepSeek, and QwQ families
    reasoning_model_patterns = [
        "Qwen3-",       # Qwen3-32B, Qwen3-72B
        "QwQ-",         # QwQ-32B
        "DeepSeek-R1",  # DeepSeek-R1, DeepSeek-R1-Distill
        "DeepSeek-V3",  # DeepSeek-V3, DeepSeek-V3.1
        "gpt-oss-"      # Some OpenAI-compatible reasoning models
    ]
    return any(pattern in model_name for pattern in reasoning_model_patterns)


def get_cleaning_pattern(model_name: str) -> str:
    """
    Get regex pattern for cleaning model output

    Returns:
        Regex pattern to match and remove reasoning tags
    """
    info = get_model_info(model_name)
    if info:
        return info.get("cleaning_pattern", r'<think>.*?</think>')
    # Default pattern works for most reasoning models
    return r'<think>.*?</think>'


def clean_reasoning_output(text: str, model_name: str) -> str:
    """
    Clean reasoning tags from model output

    This is the universal cleaning function used across all services.

    Args:
        text: Raw model output (may contain <think> tags)
        model_name: Model that generated the output

    Returns:
        Cleaned text with reasoning tags removed
    """
    if not requires_output_cleaning(model_name):
        return text  # Model doesn't need cleaning

    pattern = get_cleaning_pattern(model_name)

    # Remove all <think>...</think> blocks
    # flags=re.DOTALL ensures we match across newlines
    cleaned = re.sub(pattern, '', text, flags=re.DOTALL)

    # Clean up extra whitespace left behind
    cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)  # Max 2 consecutive newlines
    cleaned = cleaned.strip()

    return cleaned
```
Why This Approach?
- Centralized Truth: One place to check if a model needs cleaning
- Easy to Extend: Add new models without changing service code
- Pattern Matching Fallback: Handles unknown models gracefully
- Metadata-Rich: Services can query model capabilities, costs, etc.
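With the registry in place, the per-service change is tiny: import the helper and pass the raw output plus the model name through it before doing anything else. A minimal usage sketch (the import path follows the /shared/model_registry.py file above):

```python
from shared.model_registry import clean_reasoning_output

raw_output = "<think>Compare prices across chunks...</think>The highest-priced product is the Hobart Commercial Dishwasher."

answer = clean_reasoning_output(raw_output, "Qwen/Qwen3-32B-fast")
print(answer)  # -> "The highest-priced product is the Hobart Commercial Dishwasher."

# Non-reasoning models pass through untouched
same = clean_reasoning_output("Plain answer.", "meta-llama/Llama-3.3-70B-Instruct-fast")
assert same == "Plain answer."
```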
Strategy 2: Implement Cleaning in Each Service
Service 1: Intent Detection (Port 8067)
Problem: <think> tags broke JSON parsing
File: /Retrieval/services/intent/v1.0.0/intent_api.py
```python
import json
import re

from shared.model_registry import requires_output_cleaning, get_cleaning_pattern


async def call_llm_gateway(query: str) -> dict:
    """
    Call LLM Gateway to analyze query intent
    """
    # Prepare LLM request (detection_prompt is built from `query` elsewhere in this module)
    payload = {
        "model": config.INTENT_DETECTION_MODEL,  # Qwen/Qwen3-32B-fast
        "messages": [
            {"role": "user", "content": detection_prompt}
        ],
        "max_tokens": config.INTENT_MAX_TOKENS,
        "temperature": config.INTENT_TEMPERATURE
    }

    # Call LLM Gateway
    response = await http_client.post(
        f"{config.LLM_GATEWAY_URL}/v1/chat/completions",
        json=payload
    )
    llm_data = response.json()
    llm_content = llm_data["choices"][0]["message"]["content"]

    # ✨ THE FIX: Clean up response before parsing JSON
    llm_content = llm_content.strip()

    # Remove reasoning tags if model requires it
    if requires_output_cleaning(config.INTENT_DETECTION_MODEL):
        pattern = get_cleaning_pattern(config.INTENT_DETECTION_MODEL)
        if pattern:
            llm_content = re.sub(pattern, '', llm_content, flags=re.DOTALL)
            llm_content = llm_content.strip()

    # Remove markdown code blocks if present
    if llm_content.startswith("```json"):
        llm_content = llm_content[7:]
    elif llm_content.startswith("```"):
        llm_content = llm_content[3:]
    if llm_content.endswith("```"):
        llm_content = llm_content[:-3]
    llm_content = llm_content.strip()

    # Parse JSON response from LLM
    intent_data = json.loads(llm_content)  # ✅ Now parses successfully!
    return intent_data
```
Result:
- ✅ JSON parsing errors: Eliminated
- ✅ Intent detection accuracy: Maintained at 95%+
- ✅ Latency impact: +5ms (negligible)
Service 2: Metadata Extraction (Port 8062)
Problem: Response truncation due to reasoning tokens consuming token budget
File: /Ingestion/services/metadata/v1.0.0/config.py
```python
# ============================================================================
# Model Configuration - UPDATED to handle <think> tags
# ============================================================================
# CRITICAL: max_tokens must be 2500+ to handle 45 fields + LLM reasoning (<think> tags)
MODEL_CONFIGS = {
    ModelType.FAST: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 40
    },
    ModelType.RECOMMENDED: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 40
    },
    ModelType.ADVANCED: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 70
    }
}

# ============================================================================
# Prompt Templates - UPDATED to discourage reasoning
# ============================================================================
METADATA_PROMPT_BASIC = """You must respond with ONLY valid JSON. Do not include any reasoning, thinking, or explanations.

TEXT:
{text}

Extract these 4 fields:
- keywords ({keywords_count}): Key terms, proper nouns - comma separated
- topics ({topics_count}): High-level themes - comma separated
- questions ({questions_count}): Natural questions this answers - use | to separate
- summary ({summary_length}): Concise overview in complete sentences

Output format (respond with ONLY this JSON, nothing else):
{{"keywords": "term1, term2", "topics": "theme1, theme2", "questions": "question1|question2", "summary": "brief overview"}}"""
```
Additional Fix in metadata_api.py:
```python
async def extract_metadata(text: str, model: ModelType) -> dict:
    """Extract metadata from text using LLM"""
    # ... LLM call code ...
    llm_content = llm_data["choices"][0]["message"]["content"]

    # ✨ Clean reasoning tags BEFORE JSON parsing
    if requires_output_cleaning(model_name):
        llm_content = clean_reasoning_output(llm_content, model_name)

    # Remove markdown code blocks
    llm_content = llm_content.strip()
    if llm_content.startswith("```json"):
        llm_content = llm_content[7:]
    # ... etc ...

    # Parse JSON
    metadata = json.loads(llm_content)
    return metadata
```
Result:
- ✅ Response truncation: Eliminated
- ✅ Metadata extraction completeness: 100% (all fields present)
- ✅ Token usage: ~1800/2500 (healthy headroom)
Service 3: Answer Generation (Port 8069)
Problem: User-facing answers polluted with reasoning
Solution: We took a different approach here – model switching.
File: /Retrieval/services/answer_generation/v1.0.0/config.py
```python
# ============================================================================
# Model Change History
# ============================================================================
# October 8, 2025: Switched from Qwen3-32B-fast to Llama-3.3-70B-Instruct-fast
#
# Reason: Nebius changed Qwen3 models to output <think> reasoning tokens by
# default, polluting answer output with chain-of-thought reasoning instead
# of clean answers.
#
# Result:
# ✅ Clean answers without <think> tags
# ✅ Better quality for end-user presentation
# ✅ Same cost ($0.20/1M tokens)
# ✅ Similar performance (~1-2s generation time)
# ============================================================================

DEFAULT_LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct-fast"  # Changed from Qwen3-32B-fast
```
Why Model Switch Instead of Cleaning?
For user-facing answers, we wanted zero risk of reasoning leaking through. Llama models:
- Don’t emit <think> tags at all
- Have excellent instruction-following
- Same pricing tier as Qwen
- Slightly better at natural language generation
Backup Implementation (belt and suspenders):
```python
async def generate_answer(query: str, context_chunks: list, llm_model: str) -> dict:
    """Generate answer from context using LLM"""
    # ... LLM call code ...
    answer_text = llm_data["choices"][0]["message"]["content"]

    # ✨ Clean reasoning tags (just in case we switch back to reasoning models)
    if requires_output_cleaning(llm_model):
        pattern = get_cleaning_pattern(llm_model)
        if pattern:
            answer_text = re.sub(pattern, '', answer_text, flags=re.DOTALL)
            answer_text = answer_text.strip()

    return {"answer": answer_text, ...}
```
Result:
- ✅ User complaints: Zero (after fix)
- ✅ Answer quality: Improved (Llama better at natural language)
- ✅ <think> tag leaks: Impossible (Llama doesn’t emit them)
The Technical Deep Dive: Why This Was Tricky
Challenge 1: Regex Pattern Complexity
The <think> tags weren’t always simple:
```
# Simple case (easy to match)
<think>
Let me analyze this...
</think>

# Nested case (trickier)
<think>
Step 1: <important>Consider X</important>
Step 2: Y
</think>

# Multi-line with special characters (regex hell)
<think>
JSON: {"key": "value"}
Regex: \d+\.\d+
</think>
```
Our Solution:
```python
# Use DOTALL flag to match across newlines
# Use non-greedy matching (.*?) to stop at first </think>
pattern = r'<think>.*?</think>'
cleaned = re.sub(pattern, '', text, flags=re.DOTALL)

# Edge case: Multiple <think> blocks
# Non-greedy matching handles this automatically:
#   <think>A</think> text <think>B</think>
#   → "text" (both blocks removed)
```
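The non-greedy quantifier is doing real work here: a greedy `.*` would swallow everything between the first `<think>` and the last `</think>`, including legitimate text in between. A quick comparison:

```python
import re

text = "<think>A</think> keep this <think>B</think> and this"

greedy = re.sub(r'<think>.*</think>', '', text, flags=re.DOTALL)
non_greedy = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

print(repr(greedy))      # ' and this' -- the middle text was swallowed
print(repr(non_greedy))  # ' keep this  and this' -- only the tagged blocks removed
```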
Challenge 2: JSON Parsing Timing
We had to clean BEFORE JSON parsing, not after:
```python
# ❌ WRONG: Parse first, then clean
response = llm_call()
json_data = json.loads(response)   # 💥 FAILS if <think> tags present
cleaned_json = clean(json_data)

# ✅ RIGHT: Clean first, then parse
response = llm_call()
cleaned_response = clean(response)        # Remove <think> tags
json_data = json.loads(cleaned_response)  # ✅ Parses successfully
```
Challenge 3: Markdown Code Block Confusion
LLMs sometimes wrapped responses in markdown code blocks:
````
<think>
Analyzing...
</think>
```json
{"intent": "factual_retrieval"}
```
````

**Cleaning Order Matters:**

```python
# Step 1: Remove <think> tags first
text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

# Step 2: Remove markdown code blocks
if text.startswith("```json"):
    text = text[7:]
elif text.startswith("```"):
    text = text[3:]
if text.endswith("```"):
    text = text[:-3]

# Step 3: Strip whitespace
text = text.strip()

# Step 4: Parse JSON
data = json.loads(text)
```
### Challenge 4: Token Budget Calculation
For metadata extraction, we had to account for reasoning overhead:
**Analysis of Reasoning Token Usage:**
| Extraction Mode | Actual Output | Reasoning Tokens | Total Tokens | Old Limit | New Limit |
|----------------|---------------|------------------|--------------|-----------|-----------|
| BASIC (4 fields) | ~400 tokens | ~200-300 tokens | ~700 tokens | 500 ❌ | **1000** ✅ |
| STANDARD (20 fields) | ~900 tokens | ~300-400 tokens | ~1300 tokens | 1200 ❌ | **1800** ✅ |
| FULL (45 fields) | ~1400 tokens | ~400-500 tokens | ~1900 tokens | 1500 ❌ | **2500** ✅ |
**Formula:**

```python
# required_tokens = (output_tokens + reasoning_tokens) * safety_margin
# safety_margin = 1.2  (20% buffer for variability)

# Example: FULL mode
required_tokens = (1400 + 500) * 1.2   # = 2280 tokens
# We set max_tokens to 2500 to be safe
```
---
## Testing: How We Validated the Fix
### Test Suite 1: Regression Testing
**File:** `test_think_tag_cleaning.py`
```python
import json
import re

import pytest

from shared.model_registry import (
    requires_output_cleaning,
    get_cleaning_pattern,
    clean_reasoning_output
)


def test_basic_think_tag_removal():
    """Test basic <think> tag removal"""
    input_text = """<think>
Let me analyze this query...
</think>
The answer is 42.
"""
    expected = "The answer is 42."
    model = "Qwen/Qwen3-32B-fast"

    result = clean_reasoning_output(input_text, model)

    assert result.strip() == expected
    assert "<think>" not in result
    assert "</think>" not in result


def test_multiple_think_blocks():
    """Test removal of multiple <think> blocks"""
    input_text = """<think>First reasoning</think>
Part 1 of answer.
<think>Second reasoning</think>
Part 2 of answer.
"""
    expected = "Part 1 of answer.\n\nPart 2 of answer."
    model = "Qwen/Qwen3-32B-fast"

    result = clean_reasoning_output(input_text, model)

    assert "<think>" not in result
    assert "Part 1" in result
    assert "Part 2" in result


def test_think_tags_with_json():
    """Test cleaning when JSON is present"""
    input_text = """<think>
I should format this as JSON...
</think>
{"intent": "factual_retrieval", "confidence": 0.95}
"""
    model = "Qwen/Qwen3-32B-fast"

    result = clean_reasoning_output(input_text, model)

    # Should be valid JSON after cleaning
    parsed = json.loads(result.strip())
    assert parsed["intent"] == "factual_retrieval"


def test_model_without_cleaning():
    """Test that non-reasoning models don't get cleaned"""
    input_text = "The answer is <think>this should stay</think> visible."
    model = "meta-llama/Llama-3.3-70B-Instruct-fast"

    result = clean_reasoning_output(input_text, model)

    # Llama doesn't need cleaning, so text should be unchanged
    assert result == input_text


def test_model_registry_accuracy():
    """Test model registry flags"""
    # Reasoning models should require cleaning
    assert requires_output_cleaning("Qwen/Qwen3-32B-fast") == True
    assert requires_output_cleaning("deepseek-ai/DeepSeek-R1-0528") == True
    assert requires_output_cleaning("Qwen/QwQ-32B-fast") == True

    # Non-reasoning models should not
    assert requires_output_cleaning("meta-llama/Llama-3.3-70B-Instruct-fast") == False
    assert requires_output_cleaning("meta-llama/Llama-3.1-405B-Instruct-Turbo") == False


@pytest.mark.parametrize("model,text,expected_clean", [
    (
        "Qwen/Qwen3-32B-fast",
        "<think>reasoning</think>answer",
        "answer"
    ),
    (
        "DeepSeek-R1-0528",
        "<think>\nmulti\nline\nreasoning\n</think>\nanswer",
        "answer"
    ),
    (
        "meta-llama/Llama-3.3-70B-Instruct-fast",
        "answer",
        "answer"
    ),
])
def test_cleaning_parametrized(model, text, expected_clean):
    """Parametrized tests for various models and inputs"""
    result = clean_reasoning_output(text, model)
    assert result.strip() == expected_clean
```
**Test Results:**
```bash
$ pytest test_think_tag_cleaning.py -v

test_basic_think_tag_removal PASSED
test_multiple_think_blocks PASSED
test_think_tags_with_json PASSED
test_model_without_cleaning PASSED
test_model_registry_accuracy PASSED
test_cleaning_parametrized[Qwen/Qwen3-32B-fast-…] PASSED
test_cleaning_parametrized[DeepSeek-R1-0528-…] PASSED
test_cleaning_parametrized[meta-llama/Llama-3.3-70B-Instruct-fast-…] PASSED

======================== 8 passed in 0.23s ========================
```
### Test Suite 2: End-to-End Service Testing
**Test: Intent Detection**
```bash
# Query 1: Simple factual retrieval
curl -X POST http://localhost:8067/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?"}'

# Expected: Clean JSON response (no <think> tags)
{
  "intent": "factual_retrieval",
  "language": "en",
  "complexity": "simple",
  "confidence": 0.98,
  "system_prompt": "…"
}

# ✅ PASS: No <think> tags in response
```
**Test: Metadata Extraction**
```bash
# Extract metadata from sample text
curl -X POST http://localhost:8062/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Apple Inc. released the iPhone 15 Pro in September 2023…",
    "mode": "basic"
  }'

# Expected: Complete metadata (all 4 fields)
{
  "keywords": "Apple Inc, iPhone 15 Pro, September 2023",
  "topics": "Technology, Product Launch",
  "questions": "What did Apple release?|When was iPhone 15 Pro released?",
  "summary": "Apple Inc. launched the iPhone 15 Pro in September 2023."
}

# ✅ PASS: All fields present, no truncation
```
**Test: Answer Generation**
```bash
# Generate answer from context
curl -X POST http://localhost:8069/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Find the highest-priced product",
    "context_chunks": […]
  }'

# Expected: Clean answer (no <think> tags)
{
  "answer": "The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999…",
  "citations": […],
  "generation_time_ms": 1250.5
}

# ✅ PASS: Clean answer, no reasoning leaked
```
### Test Suite 3: Load Testing
We ran load tests to ensure cleaning didn't impact performance:
```python
import time

from shared.model_registry import clean_reasoning_output


def benchmark_cleaning():
    """Benchmark cleaning performance"""
    text_with_tags = """<think>
Long reasoning block with 500+ tokens of chain-of-thought...
Step 1: Analyze query
Step 2: Find relevant context
Step 3: Synthesize answer
... (lots more text) ...
</think>
The actual answer goes here (200 tokens).
"""
    iterations = 10000

    # Measure cleaning time
    start = time.time()
    for _ in range(iterations):
        cleaned = clean_reasoning_output(text_with_tags, "Qwen/Qwen3-32B-fast")
    end = time.time()

    avg_time_ms = ((end - start) / iterations) * 1000
    print(f"Average cleaning time: {avg_time_ms:.3f}ms per call")
    print(f"Throughput: {iterations / (end - start):.0f} ops/sec")
```

Results:

```
Average cleaning time: 0.042ms per call
Throughput: 23,800 ops/sec

# Conclusion: Cleaning is essentially free (sub-millisecond)
```
---
## The Results: Before and After
### Metrics Comparison
| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Intent Detection** | | | |
| - JSON parsing errors | 23% | 0% | ✅ 100% reduction |
| - Latency | 185ms | 190ms | ❌ +5ms (negligible) |
| - Accuracy | 94% | 95% | ✅ +1% |
| **Metadata Extraction** | | | |
| - Truncation rate | 18% | 0% | ✅ 100% reduction |
| - Complete extractions | 82% | 100% | ✅ +18% |
| - Avg tokens used | 1450/1500 | 1800/2500 | ✅ Better headroom |
| **Answer Generation** | | | |
| - User complaints | 12 (in 1 week) | 0 | ✅ 100% reduction |
| - Answer quality score | 7.5/10 | 9.2/10 | ✅ +23% |
| - Clean outputs | 75% | 100% | ✅ +25% |
### Cost Impact
**Before Fix:**
Daily LLM Costs (all services):
- Intent Detection: $2.40 (12K requests × 200 tokens × $0.20/1M)
- Metadata Extraction: $3.60 (6K requests × 300 tokens × $0.20/1M)
- Answer Generation: $8.00 (4K requests × 1000 tokens × $0.20/1M)
Total: $14.00/day = $420/month
**After Fix:**
Daily LLM Costs (all services):
- Intent Detection: $2.40 (same – cleaning doesn’t add LLM calls)
- Metadata Extraction: $4.80 (+33% due to higher token limit)
- Answer Generation: $8.00 (same model, same usage)
Total: $15.20/day = $456/month
Cost increase: $36/month (8.6%)
ROI Analysis:
Cost increase: $36/month
Error reduction: 23% → 0% (saved debugging time)
Customer satisfaction: ↑ (zero complaints)
Estimated engineering time saved: 10 hours/month
Engineering cost: $100/hour × 10 hours = $1,000/month saved
Net benefit: $1,000 – $36 = $964/month
ROI: 2,678%
**Verdict:** The fix paid for itself immediately. 🎉
---
## Key Takeaways: Lessons Learned
### 1. **Monitor Model Provider Changes**
**Lesson:** Model providers can change behavior without breaking changes in API.
**What We Did:**
- Set up alerts for unusual response patterns
- Subscribed to Nebius/SambaNova update newsletters
- Version-pinned critical models in production

**Recommendation:**

```python
# Log model metadata for debugging
logger.info(f"LLM call: model={model}, provider={provider}, version={version}")

# Add response validation
def validate_llm_response(response: str, model: str) -> bool:
    """Check for unexpected patterns"""
    warnings = []

    if "<think>" in response and not is_reasoning_model(model):
        warnings.append("Unexpected <think> tags in non-reasoning model")

    if len(response) > expected_max_tokens * 1.5:
        warnings.append("Response longer than expected")

    if warnings:
        logger.warning(f"LLM response validation failed: {warnings}")
        return False
    return True
```
### 2. **Build Centralized Utilities for Cross-Cutting Concerns**
**Lesson:** When multiple services need the same fix, centralize it.
**What We Did:**
- Created `/shared/model_registry.py` for model metadata
- Built universal cleaning functions used by all services
- Single source of truth for model capabilities
**Anti-Pattern to Avoid:**
```python
# ❌ DON’T: Copy-paste cleaning logic in every service

# service_a.py
def clean_output(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_b.py
def remove_tags(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_c.py
def strip_reasoning(text):
    return re.sub(r'<think>.*?</think>', '', text)

# Problem: If we need to update the regex, we have to change 3 files!
```
**Better Pattern:**
```python
# ✅ DO: Centralize in shared module

# shared/model_registry.py
def clean_reasoning_output(text, model):
    """Universal cleaning function"""
    if not requires_cleaning(model):
        return text
    pattern = get_pattern(model)
    return re.sub(pattern, '', text, flags=re.DOTALL)

# service_a.py, service_b.py, service_c.py
from shared.model_registry import clean_reasoning_output
cleaned = clean_reasoning_output(response, model_name)

# Benefit: Update once, fix everywhere
```
### 3. **Token Budgets Need Safety Margins**
**Lesson:** LLMs are unpredictable. Budget for variability.
**Formula We Use Now:**
```python
def calculate_max_tokens(expected_output: int, has_reasoning: bool) -> int:
    """
    Calculate max_tokens with safety margins

    Args:
        expected_output: Expected output tokens (measured from testing)
        has_reasoning: Whether model emits reasoning tokens

    Returns:
        max_tokens to set in LLM request
    """
    base_tokens = expected_output

    if has_reasoning:
        # Reasoning models use 25-40% extra tokens for <think> content
        reasoning_overhead = int(base_tokens * 0.40)
        base_tokens += reasoning_overhead

    # Add 20% safety margin for variability
    safety_margin = int(base_tokens * 0.20)
    base_tokens += safety_margin

    return base_tokens


# Example: Metadata extraction (basic mode), 400 tokens expected output
max_tokens = calculate_max_tokens(400, has_reasoning=True)
# = 400 + (400 * 0.40) + ((400 + 160) * 0.20)
# = 400 + 160 + 112
# = 672 tokens
# We round up to 1000 to be safe
```
### 4. **Test Error Paths, Not Just Happy Paths**
**Lesson:** Our tests validated clean outputs but didn't test malformed outputs.
**What We Added:**
```python
import time


def test_malformed_responses():
    """Test handling of various malformed LLM outputs"""
    # Case 1: Unclosed <think> tag
    input_text = "<think>reasoning without closing tag"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    # Should not crash, should handle gracefully

    # Case 2: Nested tags (inner block removed; a stray closing tag may remain)
    input_text = "<think><think>nested</think></think>answer"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    assert "nested" not in result
    assert result.strip().endswith("answer")

    # Case 3: Multiple unclosed tags
    input_text = "<think>A<think>B<think>C"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    # Should handle without regex explosions

    # Case 4: Very large reasoning block (10K+ tokens)
    input_text = "<think>" + "x" * 10000 + "</think>answer"
    start = time.time()
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    elapsed = time.time() - start
    assert elapsed < 0.1  # Should complete in <100ms
```
### 5. **When in Doubt, Switch Models**
**Lesson:** Sometimes the right fix is avoiding the problem entirely.
**Decision Matrix:**
Should I clean the output or switch models?
Clean Output If:
✅ Model has unique capabilities you need (e.g., best reasoning)
✅ Cleaning is reliable (well-defined patterns)
✅ No alternative models available
✅ Cost/latency benefits outweigh cleaning complexity
Switch Models If:
✅ Alternative models have comparable quality
✅ Output cleaning is unreliable or risky
✅ User-facing outputs (zero tolerance for leaks)
✅ Same or better cost/latency
Our Choice for Answer Generation: Switch to Llama
1. Comparable quality ✅
2. No tags ✅
3. User-facing ✅
4. Same cost ✅
---
## The Tech Stack
**Languages & Frameworks:**
- Python 3.11
- FastAPI (API services)
- asyncio (async processing)
- httpx (HTTP client)
- Pydantic (data validation)
**LLM Providers:**
- Nebius AI Studio (Qwen, Llama models)
- SambaNova Cloud (DeepSeek models)
**Models Involved:**
- Qwen/Qwen3-32B-fast (reasoning, has `<think>` tags)
- DeepSeek-R1-0528 (reasoning, has `<think>` tags)
- meta-llama/Llama-3.3-70B-Instruct-fast (no `<think>` tags)
**Infrastructure:**
- Docker containers
- APISIX API Gateway
- Redis (caching)
- Milvus (vector database)
**Key Files Modified:**
/shared/model_registry.py (+150 lines)
/Retrieval/services/intent/v1.0.0/intent_api.py (+10 lines)
/Ingestion/services/metadata/v1.0.0/config.py (+15 lines)
/Retrieval/services/answer_generation/v1.0.0/config.py (+5 lines, model switch)
---

## What’s Next: Future Improvements

### 1. **Smarter Reasoning Detection**

Currently, we use regex to remove `<think>` tags and throw the reasoning away. But what if we could put that reasoning to use instead?

**Idea: Semantic Reasoning Detection**
```python
def extract_reasoning_insights(text: str, model: str) -> dict:
    """
    Instead of discarding reasoning, extract insights

    Returns:
        {
            "answer": "cleaned answer",
            "reasoning": "extracted reasoning",
            "confidence": 0.95,
            "reasoning_steps": ["Step 1: ...", "Step 2: ..."]
        }
    """
    if not requires_output_cleaning(model):
        return {"answer": text, "reasoning": None}

    # Extract <think> content before removing
    think_pattern = r'<think>(.*?)</think>'
    reasoning_blocks = re.findall(think_pattern, text, flags=re.DOTALL)

    # Clean the answer
    cleaned_answer = re.sub(think_pattern, '', text, flags=re.DOTALL)

    # Parse reasoning into steps
    reasoning_steps = []
    for block in reasoning_blocks:
        steps = extract_steps(block)  # Parse "Step 1:", "Step 2:", etc.
        reasoning_steps.extend(steps)

    return {
        "answer": cleaned_answer.strip(),
        "reasoning": "\n".join(reasoning_blocks),
        "reasoning_steps": reasoning_steps,
        "num_steps": len(reasoning_steps)
    }
```
Use Cases:
- Debug mode: Show reasoning to developers
- Confidence scoring: More reasoning steps = higher confidence?
- Explainability: Show users “how” the AI reached its answer
### 2. **Model-Specific Prompt Engineering**
Different models need different prompts:
```python
def get_optimized_prompt(task: str, model: str) -> str:
    """
    Get model-specific prompt optimizations
    """
    base_prompt = TASK_PROMPTS[task]

    if is_reasoning_model(model):
        # Reasoning models: Explicitly ask for JSON-only output
        return base_prompt + "\n\nIMPORTANT: Respond with ONLY valid JSON. Do not include reasoning, explanations, or <think> tags."
    else:
        # Standard models: Regular prompt is fine
        return base_prompt
```
### 3. **Automated Model Testing Pipeline**

We want to catch issues like this automatically:

```python
# Automated daily tests
async def test_all_models():
    """
    Test all models in registry for unexpected behaviors
    """
    test_queries = [
        "What is 2+2?",
        "List the first 3 prime numbers",
        "Extract keywords from: 'The quick brown fox jumps over the lazy dog'"
    ]

    for model in get_all_models():
        for query in test_queries:
            response = await call_llm(model, query)

            # Check for issues
            if "<think>" in response and not is_reasoning_model(model):
                alert(f"Model {model} unexpectedly emits <think> tags!")

            if len(response) == 0:
                alert(f"Model {model} returned empty response!")

            # Check response time
            if response_time > expected_latency * 1.5:
                alert(f"Model {model} slower than expected: {response_time}ms")
```
Conclusion
What started as a mysterious bug – AI showing its “homework” to users – turned into a great learning experience about production LLM systems.
The core issue: Reasoning models (Qwen, DeepSeek) started emitting <think> tags containing chain-of-thought reasoning, which broke JSON parsing and polluted user-facing answers.
The solution: Multi-layered approach:
- ✅ Built centralized model registry with cleaning utilities
- ✅ Implemented automatic tag cleaning in 3 services
- ✅ Switched Answer Generation to Llama (no <think> tags)
- ✅ Increased token budgets to prevent truncation
- ✅ Added comprehensive testing for edge cases
Impact:
- 23% → 0% error rate in Intent Detection
- 18% → 0% truncation rate in Metadata Extraction
- 12 → 0 user complaints per week
- $36/month additional cost, $964/month value (2,678% ROI)
Time to fix: 2 days (8 engineer-hours)
Lines of code: ~150 lines (mostly in model registry)
The Big Lesson
Modern LLM systems are living, breathing infrastructure. Model providers update models, behaviors change, and you need:
- 🔍 Monitoring for unusual patterns
- 🛠️ Centralized utilities for common fixes
- 🧪 Comprehensive testing including error cases
- 📊 Metrics to catch regressions early
- 🚀 Flexibility to switch models when needed
About the Team
This fix was implemented by the CrawlEnginePro team, building production-ready RAG systems with multi-model support, semantic chunking, and intent-aware retrieval.
Tech Stack:
- Python, FastAPI, asyncio
- Nebius AI Studio (Qwen, Llama models)
- SambaNova Cloud (DeepSeek models)
- Milvus (vector database)
- Redis (caching)
System Stats:
- 8 microservices
- 15+ LLM models supported
- 99.97% uptime
- <2s average query latency
Questions? Reach out to us via the contact form.

---
Last Updated: October 2025
Case Study Version: 1.0
Reading Time: ~25 minutes