The “think” Tag Mystery – When AI Started Showing Its Homework

How we debugged and fixed a production issue where reasoning models polluted user answers with internal chain-of-thought tokens


TL;DR

Our RAG system suddenly started showing users the AI’s “internal thoughts” alongside actual answers. We discovered that Qwen and DeepSeek reasoning models emit <think> tags containing chain-of-thought reasoning. This case study walks through how we detected, debugged, and fixed the issue across our entire pipeline – from answer generation to metadata extraction.

Timeline: October 2025
Impact: All services using Qwen3 and DeepSeek models
Fix Duration: 2 days
Lines of Code Changed: ~150
Services Affected: 3 (Answer Generation, Intent Detection, Metadata Extraction)


The Story Behind the Bug

It Started on a Tuesday Morning…

Our RAG system had been running smoothly for weeks. We’d just finished implementing Query 1.3 testing (finding highest-priced products with specifications), and everything looked great. Then, one of our team members ran a routine test and noticed something… odd.

The Query:

“Find the highest-priced product listed across all invoices and describe its key features and technical specifications.”

What We Expected:

The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999. It features high-temperature sanitizing, 40 racks/hour capacity, Energy Star certification, and a chemical/solid waste pumping system.

What We Got:

<think> Okay, let me analyze this query carefully. The user wants: 1. Highest-priced product – I need to compare all prices 2. Key features – I should list the main attributes 3. Technical specs – Need to find detailed specifications Looking at the context chunks, I see: – chunk_12 has price: $8,999 for Hobart Dishwasher – chunk_14 has specifications – Let me combine these… </think> The highest-priced product is the Hobart Commercial Dishwasher…

Us: “Wait… why is the AI showing its homework?!” 😱

The Initial Confusion

At first, we thought it was a prompt engineering issue. Maybe we’d accidentally told the model to explain its reasoning? We checked our system prompts – nope, nothing there.

Then we thought: “Is this a one-off bug?” We ran the same query again. Same result. Different query? Same issue. The <think> tags appeared everywhere.

The Pattern:

  • ✅ Llama models: Clean outputs
  • ❌ Qwen3-32B-fast: Outputs <think> tags
  • ❌ DeepSeek-R1: Outputs <think> tags
  • ❌ QwQ-32B: Outputs <think> tags

Hypothesis: Something changed with the reasoning models.


The Investigation: Going Down the Rabbit Hole

Phase 1: Is This a Model Change?

We checked Nebius AI Studio documentation and discovered something interesting:

Nebius Update (October 2025): Qwen3 reasoning models now include chain-of-thought reasoning via <think> tags by default. This enables better reasoning quality but requires output cleaning for production use.

Bingo! 🎯

Nebius had updated their Qwen models to be more “helpful” by showing their reasoning process. Great for debugging, terrible for end users.

Phase 2: How Deep Does This Go?

We started auditing our entire pipeline to see where these tags appeared:

Services Using Reasoning Models:

1. Answer Generation Service (Port 8069)

  • Model: Qwen/Qwen3-32B-fast
  • Impact: User-facing answers polluted with reasoning
  • Severity: 🔴 CRITICAL

2. Intent Detection Service (Port 8067)

  • Model: Qwen/Qwen3-32B-fast
  • Impact: JSON parsing errors (tags broke JSON structure)
  • Severity: 🔴 CRITICAL

3. Metadata Extraction Service (Port 8062)

  • Model: Qwen/Qwen3-32B-fast
  • Impact: Response truncation (reasoning consumed token budget)
  • Severity: 🟡 HIGH

The Damage:

# Real error logs we saw
[ERROR] Intent Service: JSON decode failed
JSONDecodeError: Invalid control character at line 1 column 45

[ERROR] Metadata Service: Response truncated at 1500 tokens
Expected fields: ['keywords', 'topics', 'questions', 'summary']
Got fields: ['keywords', 'topics']  # Missing fields due to truncation!

[ERROR] Answer Generation: User complained about "weird XML-like tags"
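
To make the failure mode concrete, here is a minimal, self-contained repro (illustrative only, not code from our services): `json.loads` expects the payload to start with a JSON value, so a leading `<think>` block kills parsing until it is stripped.

```python
import json
import re

raw = '<think>\nThe user wants intent classification...\n</think>\n{"intent": "factual_retrieval", "confidence": 0.95}'

try:
    json.loads(raw)  # fails: payload starts with '<', not a JSON value
except json.JSONDecodeError as exc:
    print(f"JSONDecodeError: {exc}")

# Strip the reasoning block first, then parsing succeeds
cleaned = re.sub(r'<think>.*?</think>', '', raw, flags=re.DOTALL).strip()
print(json.loads(cleaned))  # {'intent': 'factual_retrieval', 'confidence': 0.95}
```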

Phase 3: The Token Budget Crisis

The metadata service revealed another issue we hadn’t anticipated:

Before <think> Tags:

{
  "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
  "topics": "Technology Purchase, Business Transactions",
  "questions": "What equipment was purchased?",
  "summary": "Technology equipment purchased from TechSupply Solutions"
}

Tokens Used: ~450 / 1500 limit ✅

After <think> Tags:

<think>
Let me extract metadata from this text. I see:
- Company names: TechSupply Solutions
- Products: Dell XPS 15
- Context: This is an invoice
- I should extract keywords first...
[250+ tokens of reasoning]
</think>

{
  "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
  "topics": "Technology Purchase"
  [RESPONSE TRUNCATED AT 1500 TOKENS]

Tokens Used: 1500 / 1500 limit 🔴 (TRUNCATED!)

The reasoning tokens were eating into our token budget, causing responses to be cut off before completing the JSON!
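
One cheap guardrail worth adding here (a sketch, assuming the OpenAI-compatible response shape returned by the LLM Gateway; `detect_truncation` is a hypothetical helper name, not from our codebase): check `finish_reason`, which is `"length"` whenever the model ran out of `max_tokens` mid-response.

```python
def detect_truncation(llm_data: dict) -> bool:
    """Return True if the completion was cut off by the max_tokens limit."""
    choice = llm_data["choices"][0]
    if choice.get("finish_reason") == "length":
        usage = llm_data.get("usage", {})
        # Log how much of the budget was burned (reasoning tokens count too)
        print(f"Truncated: {usage.get('completion_tokens')} completion tokens used")
        return True
    return False
```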


The Solution: Multi-Layered Fix Strategy

We couldn’t just “turn off” reasoning models – they were actually better at complex queries than non-reasoning models. We needed a surgical fix.

Strategy 1: Create a Centralized Cleaning System

File: /shared/model_registry.py

We built a model registry that tracks which models need output cleaning:

class LLMModels(str, Enum):
    """Centralized model registry with metadata"""

    # Reasoning models (emit <think> tags)
    QWEN_32B_FAST = "Qwen/Qwen3-32B-fast"
    DEEPSEEK_R1 = "deepseek-ai/DeepSeek-R1-0528"
    QWQ_32B_FAST = "Qwen/QwQ-32B-fast"

    # Clean models (no reasoning tags)
    LLAMA_70B_FAST = "meta-llama/Llama-3.3-70B-Instruct-fast"
    LLAMA_405B = "meta-llama/Llama-3.1-405B-Instruct-Turbo"


# Model metadata (the secret sauce!)
MODEL_INFO = {
    LLMModels.QWEN_32B_FAST: {
        "provider": "nebius",
        "size": "32B",
        "context_window": 41000,
        "cost_per_1m_tokens": 0.20,
        "supports_reasoning": True,  # 🔑 Key flag
        "requires_cleaning": True,    # 🔑 Key flag
        "cleaning_pattern": r'<think>.*?</think>',  # Regex pattern
        "strengths": ["reasoning", "code", "math", "json"],
        "use_cases": ["complex_queries", "multi_step_reasoning"]
    },

    LLMModels.DEEPSEEK_R1: {
        "provider": "sambanova",
        "size": "671B MoE",
        "context_window": 64000,
        "cost_per_1m_tokens": 0.0,  # FREE!
        "supports_reasoning": True,
        "requires_cleaning": True,
        "cleaning_pattern": r'<think>.*?</think>',
        "strengths": ["advanced_reasoning", "scientific"],
        "use_cases": ["research", "complex_analysis"]
    },

    LLMModels.LLAMA_70B_FAST: {
        "provider": "nebius",
        "size": "70B",
        "context_window": 128000,
        "cost_per_1m_tokens": 0.20,
        "supports_reasoning": False,  # No <think> tags!
        "requires_cleaning": False,
        "strengths": ["general", "instruction_following"],
        "use_cases": ["general_qa", "summarization"]
    }
}
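
# Note: get_model_info() used below is the registry's lookup helper; a minimal
# sketch (assuming metadata is keyed by enum member as shown above) looks like:
def get_model_info(model_name: str) -> dict | None:
    """Return registry metadata for a model identifier, or None if unknown."""
    for model, info in MODEL_INFO.items():
        if model.value == model_name:
            return info
    return None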


def requires_output_cleaning(model_name: str) -> bool:
    """
    Check if model output requires cleaning (e.g., <think> tag removal)

    This is the gatekeeper function that every service calls before
    returning LLM outputs to users.

    Args:
        model_name: Model identifier (e.g., "Qwen/Qwen3-32B-fast")

    Returns:
        True if model emits reasoning tags that need cleaning
    """
    # Method 1: Check registry metadata (most accurate)
    info = get_model_info(model_name)
    if info:
        return info.get("requires_cleaning", False)

    # Method 2: Fallback pattern matching (for unknown models)
    # Reasoning models from Qwen, DeepSeek, and QwQ families
    reasoning_model_patterns = [
        "Qwen3-",           # Qwen3-32B, Qwen3-72B
        "QwQ-",             # QwQ-32B
        "DeepSeek-R1",      # DeepSeek-R1, DeepSeek-R1-Distill
        "DeepSeek-V3",      # DeepSeek-V3, DeepSeek-V3.1
        "gpt-oss-"          # Some OpenAI-compatible reasoning models
    ]

    return any(pattern in model_name for pattern in reasoning_model_patterns)


def get_cleaning_pattern(model_name: str) -> str:
    """
    Get regex pattern for cleaning model output

    Returns:
        Regex pattern to match and remove reasoning tags
    """
    info = get_model_info(model_name)
    if info:
        return info.get("cleaning_pattern", r'<think>.*?</think>')

    # Default pattern works for most reasoning models
    return r'<think>.*?</think>'


def clean_reasoning_output(text: str, model_name: str) -> str:
    """
    Clean reasoning tags from model output

    This is the universal cleaning function used across all services.

    Args:
        text: Raw model output (may contain <think> tags)
        model_name: Model that generated the output

    Returns:
        Cleaned text with reasoning tags removed
    """
    if not requires_output_cleaning(model_name):
        return text  # Model doesn't need cleaning

    pattern = get_cleaning_pattern(model_name)

    # Remove all <think>...</think> blocks
    # flags=re.DOTALL ensures we match across newlines
    cleaned = re.sub(pattern, '', text, flags=re.DOTALL)

    # Clean up extra whitespace left behind
    cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)  # Max 2 consecutive newlines
    cleaned = cleaned.strip()

    return cleaned

Why This Approach?

  1. Centralized Truth: One place to check if a model needs cleaning
  2. Easy to Extend: Add new models without changing service code
  3. Pattern Matching Fallback: Handles unknown models gracefully
  4. Metadata-Rich: Services can query model capabilities, costs, etc.
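
Here is roughly how a service consumes the registry (a minimal usage sketch built on the functions above, not a verbatim excerpt from any of our services):

```python
from shared.model_registry import (
    LLMModels,
    MODEL_INFO,
    clean_reasoning_output,
    requires_output_cleaning,
)

model_name = LLMModels.QWEN_32B_FAST.value  # "Qwen/Qwen3-32B-fast"
raw_output = "<think>Compare all prices first...</think>The Hobart dishwasher is the highest-priced item."

# One call handles both reasoning and non-reasoning models
answer = clean_reasoning_output(raw_output, model_name)
print(answer)  # "The Hobart dishwasher is the highest-priced item."

# Metadata is also available for routing and cost decisions
info = MODEL_INFO[LLMModels.QWEN_32B_FAST]
print(requires_output_cleaning(model_name), info["cost_per_1m_tokens"])
```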

Strategy 2: Implement Cleaning in Each Service

Service 1: Intent Detection (Port 8067)

Problem: <think> tags broke JSON parsing

File: /Retrieval/services/intent/v1.0.0/intent_api.py

async def call_llm_gateway(query: str) -> dict:
    """
    Call LLM Gateway to analyze query intent
    """
    # Prepare LLM request
    payload = {
        "model": config.INTENT_DETECTION_MODEL,  # Qwen/Qwen3-32B-fast
        "messages": [
            {"role": "user", "content": detection_prompt}
        ],
        "max_tokens": config.INTENT_MAX_TOKENS,
        "temperature": config.INTENT_TEMPERATURE
    }

    # Call LLM Gateway
    response = await http_client.post(
        f"{config.LLM_GATEWAY_URL}/v1/chat/completions",
        json=payload
    )
    llm_data = response.json()
    llm_content = llm_data["choices"][0]["message"]["content"]

    # ✨ THE FIX: Clean up response before parsing JSON
    llm_content = llm_content.strip()

    # Remove reasoning tags if model requires it
    if requires_output_cleaning(config.INTENT_DETECTION_MODEL):
        pattern = get_cleaning_pattern(config.INTENT_DETECTION_MODEL)
        if pattern:
            llm_content = re.sub(pattern, '', llm_content, flags=re.DOTALL)

    llm_content = llm_content.strip()

    # Remove markdown code blocks if present
    if llm_content.startswith("```json"):
        llm_content = llm_content[7:]
    elif llm_content.startswith("```"):
        llm_content = llm_content[3:]
    if llm_content.endswith("```"):
        llm_content = llm_content[:-3]
    llm_content = llm_content.strip()

    # Parse JSON response from LLM
    intent_data = json.loads(llm_content)  # ✅ Now parses successfully!

    return intent_data

Result:

  • ✅ JSON parsing errors: Eliminated
  • ✅ Intent detection accuracy: Maintained at 95%+
  • ✅ Latency impact: +5ms (negligible)

Service 2: Metadata Extraction (Port 8062)

Problem: Response truncation due to reasoning tokens consuming token budget

File: /Ingestion/services/metadata/v1.0.0/config.py

# ============================================================================
# Model Configuration - UPDATED to handle <think> tags
# ============================================================================

# CRITICAL: max_tokens must be 2500+ to handle 45 fields + LLM reasoning (<think> tags)
MODEL_CONFIGS = {
    ModelType.FAST: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 40
    },
    ModelType.RECOMMENDED: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 40
    },
    ModelType.ADVANCED: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 70
    }
}

# ============================================================================
# Prompt Templates - UPDATED to discourage reasoning
# ============================================================================

METADATA_PROMPT_BASIC = """You must respond with ONLY valid JSON. Do not include any reasoning, thinking, or explanations.

TEXT:
{text}

Extract these 4 fields:
- keywords ({keywords_count}): Key terms, proper nouns - comma separated
- topics ({topics_count}): High-level themes - comma separated
- questions ({questions_count}): Natural questions this answers - use | to separate
- summary ({summary_length}): Concise overview in complete sentences

Output format (respond with ONLY this JSON, nothing else):
{{"keywords": "term1, term2", "topics": "theme1, theme2", "questions": "question1|question2", "summary": "brief overview"}}"""

Additional Fix in metadata_api.py:

async def extract_metadata(text: str, model: ModelType) -> dict:
    """Extract metadata from text using LLM"""

    # ... LLM call code ...

    llm_content = llm_data["choices"][0]["message"]["content"]

    # ✨ Clean reasoning tags BEFORE JSON parsing
    if requires_output_cleaning(model_name):
        llm_content = clean_reasoning_output(llm_content, model_name)

    # Remove markdown code blocks
    llm_content = llm_content.strip()
    if llm_content.startswith("```json"):
        llm_content = llm_content[7:]
    # ... etc ...

    # Parse JSON
    metadata = json.loads(llm_content)

    return metadata

Result:

  • ✅ Response truncation: Eliminated
  • ✅ Metadata extraction completeness: 100% (all fields present)
  • ✅ Token usage: ~1800/2500 (healthy headroom)

Service 3: Answer Generation (Port 8069)

Problem: User-facing answers polluted with reasoning

Solution: We took a different approach here – model switching.

File: /Retrieval/services/answer_generation/v1.0.0/config.py

# ============================================================================
# Model Change History
# ============================================================================
# October 8, 2025: Switched from Qwen3-32B-fast to Llama-3.3-70B-Instruct-fast
#
# Reason: Nebius changed Qwen3 models to output <think> reasoning tokens by
# default, polluting answer output with chain-of-thought reasoning instead
# of clean answers.
#
# Result:
# ✅ Clean answers without <think> tags
# ✅ Better quality for end-user presentation
# ✅ Same cost ($0.20/1M tokens)
# ✅ Similar performance (~1-2s generation time)
# ============================================================================

DEFAULT_LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct-fast"  # Changed from Qwen3-32B-fast

Why Model Switch Instead of Cleaning?

For user-facing answers, we wanted zero risk of reasoning leaking through. Llama models:

  • Don’t emit <think> tags at all
  • Have excellent instruction-following
  • Same pricing tier as Qwen
  • Slightly better at natural language generation

Backup Implementation (belt and suspenders):

async def generate_answer(query: str, context_chunks: list, llm_model: str) -> dict:
    """Generate answer from context using LLM"""

    # ... LLM call code ...

    answer_text = llm_data["choices"][0]["message"]["content"]

    # ✨ Clean reasoning tags (just in case we switch back to reasoning models)
    if requires_output_cleaning(llm_model):
        pattern = get_cleaning_pattern(llm_model)
        if pattern:
            answer_text = re.sub(pattern, '', answer_text, flags=re.DOTALL)

    answer_text = answer_text.strip()

    return {"answer": answer_text, ...}

Result:

  • ✅ User complaints: Zero (after fix)
  • ✅ Answer quality: Improved (Llama better at natural language)
  • ✅ <think> tag leaks: Impossible (Llama doesn’t emit them)

The Technical Deep Dive: Why This Was Tricky

Challenge 1: Regex Pattern Complexity

The <think> tags weren’t always simple:

# Simple case (easy to match)
<think>
Let me analyze this...
</think>

# Nested case (trickier)
<think>
Step 1: <important>Consider X</important>
Step 2: Y
</think>

# Multi-line with special characters (regex hell)
<think>
JSON: {"key": "value"}
Regex: \d+\.\d+
</think>

Our Solution:

# Use DOTALL flag to match across newlines
# Use non-greedy matching (.*?) to stop at first </think>
pattern = r'<think>.*?</think>'
cleaned = re.sub(pattern, '', text, flags=re.DOTALL)

# Edge case: Multiple <think> blocks
# Non-greedy matching handles this automatically:
# <think>A</think> text <think>B</think>
# → "text" (both blocks removed)
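
One edge case the single pattern does not cover is an unclosed `<think>` block, which shows up when a response is truncated mid-reasoning. A small extension layered on top handles that plus stray closing tags from nesting (an illustrative sketch, not our exact production code):

```python
import re

def clean_think_blocks(text: str) -> str:
    """Remove closed, unclosed, and nested <think> reasoning blocks."""
    # Closed blocks: non-greedy, so each block is removed separately
    cleaned = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
    # Unclosed block (e.g. truncated mid-reasoning): drop from <think> to the end
    cleaned = re.sub(r'<think>.*\Z', '', cleaned, flags=re.DOTALL)
    # Stray closing tags left behind by nested blocks
    cleaned = cleaned.replace('</think>', '')
    # Collapse leftover blank runs, as in the main cleaner
    return re.sub(r'\n{3,}', '\n\n', cleaned).strip()
```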

Challenge 2: JSON Parsing Timing

We had to clean BEFORE JSON parsing, not after:

# ❌ WRONG: Parse first, then clean
response = llm_call()
json_data = json.loads(response)  # 💥 FAILS if <think> tags present
cleaned_json = clean(json_data)

# ✅ RIGHT: Clean first, then parse
response = llm_call()
cleaned_response = clean(response)  # Remove <think> tags
json_data = json.loads(cleaned_response)  # ✅ Parses successfully

Challenge 3: Markdown Code Block Confusion

LLMs sometimes wrapped responses in markdown code blocks:

````
<think>
Analyzing...
</think>
```json
{"intent": "factual_retrieval"}
```
````

**Cleaning Order Matters:**

```python
# Step 1: Remove <think> tags first
text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

# Step 2: Remove markdown code blocks
if text.startswith("```json"):
    text = text[7:]
elif text.startswith("```"):
    text = text[3:]
if text.endswith("```"):
    text = text[:-3]

# Step 3: Strip whitespace
text = text.strip()

# Step 4: Parse JSON
data = json.loads(text)
```
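
Putting Challenges 1-3 together, the parsing path boils down to one small helper (a consolidated sketch of the ordering above; the `parse_llm_json` name is hypothetical, not from our codebase):

```python
import json
import re

from shared.model_registry import get_cleaning_pattern, requires_output_cleaning

def parse_llm_json(raw: str, model_name: str) -> dict:
    """Clean reasoning tags, strip markdown fences, then parse JSON, in that order."""
    text = raw.strip()

    # Step 1: remove <think> blocks for reasoning models
    if requires_output_cleaning(model_name):
        text = re.sub(get_cleaning_pattern(model_name), '', text, flags=re.DOTALL).strip()

    # Step 2: remove markdown code fences
    if text.startswith("```json"):
        text = text[7:]
    elif text.startswith("```"):
        text = text[3:]
    if text.endswith("```"):
        text = text[:-3]

    # Steps 3-4: strip whitespace, then parse
    return json.loads(text.strip())
```
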
### Challenge 4: Token Budget Calculation

For metadata extraction, we had to account for reasoning overhead:

**Analysis of Reasoning Token Usage:**

| Extraction Mode | Actual Output | Reasoning Tokens | Total Tokens | Old Limit | New Limit |
|----------------|---------------|------------------|--------------|-----------|-----------|
| BASIC (4 fields) | ~400 tokens | ~200-300 tokens | ~700 tokens | 500 ❌ | **1000** ✅ |
| STANDARD (20 fields) | ~900 tokens | ~300-400 tokens | ~1300 tokens | 1200 ❌ | **1800** ✅ |
| FULL (45 fields) | ~1400 tokens | ~400-500 tokens | ~1900 tokens | 1500 ❌ | **2500** ✅ |

**Formula:**

```python
# safety_margin = 1.2 (20% buffer for variability)
required_tokens = (output_tokens + reasoning_tokens) * safety_margin

# Example: FULL mode
# required_tokens = (1400 + 500) * 1.2 = 2280 tokens
# We set max_tokens to 2500 to be safe
```

---

## Testing: How We Validated the Fix

### Test Suite 1: Regression Testing

**File:** `test_think_tag_cleaning.py`

```python
import pytest
import re

from shared.model_registry import (
    requires_output_cleaning,
    get_cleaning_pattern,
    clean_reasoning_output
)


def test_basic_think_tag_removal():
    """Test basic <think> tag removal"""
    input_text = """
<think>
Let me analyze this query...
</think>
The answer is 42.
"""
    expected = "The answer is 42."

    model = "Qwen/Qwen3-32B-fast"
    result = clean_reasoning_output(input_text, model)

    assert result.strip() == expected
    assert "<think>" not in result
    assert "</think>" not in result


def test_multiple_think_blocks():
    """Test removal of multiple <think> blocks"""
    input_text = """
<think>First reasoning</think>
Part 1 of answer.
<think>Second reasoning</think>
Part 2 of answer.
"""
    expected = "Part 1 of answer.\n\nPart 2 of answer."

    model = "Qwen/Qwen3-32B-fast"
    result = clean_reasoning_output(input_text, model)

    assert "<think>" not in result
    assert "Part 1" in result
    assert "Part 2" in result


def test_think_tags_with_json():
    """Test cleaning when JSON is present"""
    input_text = """
<think>
I should format this as JSON...
</think>
{"intent": "factual_retrieval", "confidence": 0.95}
"""
    model = "Qwen/Qwen3-32B-fast"
    result = clean_reasoning_output(input_text, model)

    # Should be valid JSON after cleaning
    import json
    parsed = json.loads(result.strip())
    assert parsed["intent"] == "factual_retrieval"


def test_model_without_cleaning():
    """Test that non-reasoning models don't get cleaned"""
    input_text = "The answer is <think>this should stay visible</think>."
    model = "meta-llama/Llama-3.3-70B-Instruct-fast"
    result = clean_reasoning_output(input_text, model)

    # Llama doesn't need cleaning, so text should be unchanged
    assert result == input_text


def test_model_registry_accuracy():
    """Test model registry flags"""
    # Reasoning models should require cleaning
    assert requires_output_cleaning("Qwen/Qwen3-32B-fast") == True
    assert requires_output_cleaning("deepseek-ai/DeepSeek-R1-0528") == True
    assert requires_output_cleaning("Qwen/QwQ-32B-fast") == True

    # Non-reasoning models should not
    assert requires_output_cleaning("meta-llama/Llama-3.3-70B-Instruct-fast") == False
    assert requires_output_cleaning("meta-llama/Llama-3.1-405B-Instruct-Turbo") == False


@pytest.mark.parametrize("model,text,expected_clean", [
    (
        "Qwen/Qwen3-32B-fast",
        "<think>reasoning</think>answer",
        "answer"
    ),
    (
        "DeepSeek-R1-0528",
        "<think>\nmulti\nline\nreasoning\n</think>answer",
        "answer"
    ),
    (
        "meta-llama/Llama-3.3-70B-Instruct-fast",
        "answer",
        "answer"
    ),
])
def test_cleaning_parametrized(model, text, expected_clean):
    """Parametrized tests for various models and inputs"""
    result = clean_reasoning_output(text, model)
    assert result.strip() == expected_clean
```

**Test Results:**

```bash
$ pytest test_think_tag_cleaning.py -v
test_basic_think_tag_removal PASSED
test_multiple_think_blocks PASSED
test_think_tags_with_json PASSED
test_model_without_cleaning PASSED
test_model_registry_accuracy PASSED
test_cleaning_parametrized[Qwen/Qwen3-32B-fast-…] PASSED
test_cleaning_parametrized[DeepSeek-R1-0528-…] PASSED
test_cleaning_parametrized[meta-llama/Llama-3.3-70B-Instruct-fast-…] PASSED
======================== 8 passed in 0.23s ========================
```

### Test Suite 2: End-to-End Service Testing

**Test: Intent Detection**

```bash
# Query 1: Simple factual retrieval
curl -X POST http://localhost:8067/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?"}'

# Expected: Clean JSON response (no <think> tags)
{
  "intent": "factual_retrieval",
  "language": "en",
  "complexity": "simple",
  "confidence": 0.98,
  "system_prompt": "…"
}

# ✅ PASS: No <think> tags in response
```

**Test: Metadata Extraction**

```bash
# Extract metadata from sample text
curl -X POST http://localhost:8062/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Apple Inc. released the iPhone 15 Pro in September 2023…",
    "mode": "basic"
  }'

# Expected: Complete metadata (all 4 fields)
{
  "keywords": "Apple Inc, iPhone 15 Pro, September 2023",
  "topics": "Technology, Product Launch",
  "questions": "What did Apple release?|When was iPhone 15 Pro released?",
  "summary": "Apple Inc. launched the iPhone 15 Pro in September 2023."
}

# ✅ PASS: All fields present, no truncation
```

**Test: Answer Generation**

```bash
# Generate answer from context
curl -X POST http://localhost:8069/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Find the highest-priced product",
    "context_chunks": […]
  }'

# Expected: Clean answer (no <think> tags)
{
  "answer": "The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999…",
  "citations": […],
  "generation_time_ms": 1250.5
}

# ✅ PASS: Clean answer, no reasoning leaked
```

### Test Suite 3: Load Testing

We ran load tests to ensure cleaning didn't impact performance:

```python
import asyncio
import time

from shared.model_registry import clean_reasoning_output


async def benchmark_cleaning():
    """Benchmark cleaning performance"""
    text_with_tags = """
<think>
Long reasoning block with 500+ tokens of chain-of-thought...
Step 1: Analyze query
Step 2: Find relevant context
Step 3: Synthesize answer
... (lots more text) ...
</think>

The actual answer goes here (200 tokens).
"""

    iterations = 10000

    # Measure cleaning time
    start = time.time()
    for _ in range(iterations):
        cleaned = clean_reasoning_output(text_with_tags, "Qwen/Qwen3-32B-fast")
    end = time.time()

    avg_time_ms = ((end - start) / iterations) * 1000

    print(f"Average cleaning time: {avg_time_ms:.3f}ms per call")
    print(f"Throughput: {iterations / (end - start):.0f} ops/sec")
```

**Results:**

```
Average cleaning time: 0.042ms per call
Throughput: 23,800 ops/sec
```

**Conclusion:** Cleaning is essentially free (sub-millisecond).

---

## The Results: Before and After

### Metrics Comparison

| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Intent Detection** | | | |
| - JSON parsing errors | 23% | 0% | ✅ 100% reduction |
| - Latency | 185ms | 190ms | ❌ +5ms (negligible) |
| - Accuracy | 94% | 95% | ✅ +1% |
| **Metadata Extraction** | | | |
| - Truncation rate | 18% | 0% | ✅ 100% reduction |
| - Complete extractions | 82% | 100% | ✅ +18% |
| - Avg tokens used | 1450/1500 | 1800/2500 | ✅ Better headroom |
| **Answer Generation** | | | |
| - User complaints | 12 (in 1 week) | 0 | ✅ 100% reduction |
| - Answer quality score | 7.5/10 | 9.2/10 | ✅ +23% |
| - Clean outputs | 75% | 100% | ✅ +25% |

### Cost Impact

**Before Fix:**

Daily LLM Costs (all services):

  • Intent Detection: $2.40 (12K requests × 200 tokens × $0.20/1M)
  • Metadata Extraction: $3.60 (6K requests × 300 tokens × $0.20/1M)
  • Answer Generation: $8.00 (4K requests × 1000 tokens × $0.20/1M)
    Total: $14.00/day = $420/month

**After Fix:**

Daily LLM Costs (all services):

  • Intent Detection: $2.40 (same – cleaning doesn’t add LLM calls)
  • Metadata Extraction: $4.80 (+33% due to higher token limit)
  • Answer Generation: $8.00 (same model, same usage)
    Total: $15.20/day = $456/month

Cost increase: $36/month (8.6%)

**ROI Analysis:**

Cost increase: $36/month
Error reduction: 23% → 0% (saved debugging time)
Customer satisfaction: ↑ (zero complaints)

Estimated engineering time saved: 10 hours/month
Engineering cost: $100/hour × 10 hours = $1,000/month saved

Net benefit: $1,000 – $36 = $964/month
ROI: 2,678%

**Verdict:** The fix paid for itself immediately. 🎉



## Key Takeaways: Lessons Learned

### 1. **Monitor Model Provider Changes**

**Lesson:** Model providers can change model behavior without any breaking change to the API.

**What We Did:**

- Set up alerts for unusual response patterns
- Subscribe to Nebius/SambaNova update newsletters
- Version-pin critical models in production

**Recommendation:**

```python
# Log model metadata for debugging
logger.info(f"LLM call: model={model}, provider={provider}, version={version}")

# Add response validation
def validate_llm_response(response: str, model: str) -> bool:
    """Check for unexpected patterns"""
    warnings = []

    if "<think>" in response and not is_reasoning_model(model):
        warnings.append("Unexpected <think> tags in non-reasoning model")

    if len(response) > expected_max_tokens * 1.5:
        warnings.append("Response longer than expected")

    if warnings:
        logger.warning(f"LLM response validation failed: {warnings}")
        return False

    return True
```

### 2. **Build Centralized Utilities for Cross-Cutting Concerns**

**Lesson:** When multiple services need the same fix, centralize it.

**What We Did:**
- Created `/shared/model_registry.py` for model metadata
- Built universal cleaning functions used by all services
- Single source of truth for model capabilities

**Anti-Pattern to Avoid:**

```python
# ❌ DON'T: Copy-paste cleaning logic in every service

# service_a.py
def clean_output(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_b.py
def remove_tags(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_c.py
def strip_reasoning(text):
    return re.sub(r'<think>.*?</think>', '', text)

# Problem: If we need to update the regex, we have to change 3 files!
```

**Better Pattern:**

```python
# ✅ DO: Centralize in shared module

# shared/model_registry.py
def clean_reasoning_output(text, model):
    """Universal cleaning function"""
    if not requires_cleaning(model):
        return text
    pattern = get_pattern(model)
    return re.sub(pattern, '', text, flags=re.DOTALL)

# service_a.py, service_b.py, service_c.py
from shared.model_registry import clean_reasoning_output
cleaned = clean_reasoning_output(response, model_name)

# Benefit: Update once, fix everywhere
```

### 3. **Token Budgets Need Safety Margins**

**Lesson:** LLMs are unpredictable. Budget for variability.

**Formula We Use Now:**

```python
def calculate_max_tokens(expected_output: int, has_reasoning: bool) -> int:
    """
    Calculate max_tokens with safety margins

    Args:
        expected_output: Expected output tokens (measured from testing)
        has_reasoning: Whether model emits reasoning tokens

    Returns:
        max_tokens to set in LLM request
    """
    base_tokens = expected_output

    if has_reasoning:
        # Reasoning models use 25-40% extra tokens for <think> content
        reasoning_overhead = int(base_tokens * 0.40)
        base_tokens += reasoning_overhead

    # Add 20% safety margin for variability
    safety_margin = int(base_tokens * 0.20)
    base_tokens += safety_margin

    return base_tokens


# Example: metadata extraction (basic mode), 400 tokens expected output
#
# max_tokens = calculate_max_tokens(400, has_reasoning=True)
#            = 400 + (400 * 0.40) + ((400 + 160) * 0.20)
#            = 400 + 160 + 112
#            = 672 tokens
#
# We round up to 1000 to be safe
```

### 4. **Test Error Paths, Not Just Happy Paths**

**Lesson:** Our tests validated clean outputs but didn't test malformed outputs.

**What We Added:**

```python
def test_malformed_responses():
    """Test handling of various malformed LLM outputs"""
    # Case 1: Unclosed <think> tag
    input_text = "<think>reasoning without closing tag"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    # Should not crash, should handle gracefully

    # Case 2: Nested tags
    input_text = "<think><think>nested</think></think>answer"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    assert result.strip() == "answer"

    # Case 3: Multiple unclosed tags
    input_text = "<think>A<think>B<think>C"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    # Should handle without regex explosions

    # Case 4: Very large reasoning block (10K+ tokens)
    input_text = "<think>" + "x" * 10000 + "</think>answer"
    start = time.time()
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    elapsed = time.time() - start
    assert elapsed < 0.1  # Should complete in <100ms
```

### 5. **When in Doubt, Switch Models**

**Lesson:** Sometimes the right fix is avoiding the problem entirely.

**Decision Matrix:**

Should I clean the output or switch models?

Clean Output If:
✅ Model has unique capabilities you need (e.g., best reasoning)
✅ Cleaning is reliable (well-defined patterns)
✅ No alternative models available
✅ Cost/latency benefits outweigh cleaning complexity

Switch Models If:
✅ Alternative models have comparable quality
✅ Output cleaning is unreliable or risky
✅ User-facing outputs (zero tolerance for leaks)
✅ Same or better cost/latency

Our Choice for Answer Generation: Switch to Llama
1. Comparable quality ✅
2. No tags ✅
3. User-facing ✅
4. Same cost ✅

---

## The Tech Stack

**Languages & Frameworks:**
- Python 3.11
- FastAPI (API services)
- asyncio (async processing)
- httpx (HTTP client)
- Pydantic (data validation)

**LLM Providers:**
- Nebius AI Studio (Qwen, Llama models)
- SambaNova Cloud (DeepSeek models)

**Models Involved:**
- Qwen/Qwen3-32B-fast (reasoning, has `<think>` tags)
- DeepSeek-R1-0528 (reasoning, has `<think>` tags)
- meta-llama/Llama-3.3-70B-Instruct-fast (no `<think>` tags)

**Infrastructure:**
- Docker containers
- APISIX API Gateway
- Redis (caching)
- Milvus (vector database)

**Key Files Modified:**

/shared/model_registry.py (+150 lines)
/Retrieval/services/intent/v1.0.0/intent_api.py (+10 lines)
/Ingestion/services/metadata/v1.0.0/config.py (+15 lines)
/Retrieval/services/answer_generation/v1.0.0/config.py (+5 lines, model switch)

---

## What’s Next: Future Improvements

### 1. **Smarter Reasoning Detection**

Currently, we use regex to remove `<think>` tags and throw the reasoning away. But what if we could put it to use?

**Idea: Semantic Reasoning Detection**

```python
def extract_reasoning_insights(text: str, model: str) -> dict:
    """
    Instead of discarding reasoning, extract insights

    Returns:
        {
            "answer": "cleaned answer",
            "reasoning": "extracted reasoning",
            "confidence": 0.95,
            "reasoning_steps": ["Step 1: ...", "Step 2: ..."]
        }
    """
    if not requires_output_cleaning(model):
        return {"answer": text, "reasoning": None}

    # Extract <think> content before removing
    think_pattern = r'<think>(.*?)</think>'
    reasoning_blocks = re.findall(think_pattern, text, flags=re.DOTALL)

    # Clean the answer
    cleaned_answer = re.sub(think_pattern, '', text, flags=re.DOTALL)

    # Parse reasoning into steps
    reasoning_steps = []
    for block in reasoning_blocks:
        steps = extract_steps(block)  # Parse "Step 1:", "Step 2:", etc.
        reasoning_steps.extend(steps)

    return {
        "answer": cleaned_answer.strip(),
        "reasoning": "\n".join(reasoning_blocks),
        "reasoning_steps": reasoning_steps,
        "num_steps": len(reasoning_steps)
    }
```

**Use Cases:**

- Debug mode: Show reasoning to developers
- Confidence scoring: More reasoning steps = higher confidence?
- Explainability: Show users “how” the AI reached its answer
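
The `extract_steps` helper referenced in the sketch above is left undefined; a minimal version (our assumption of what it would do) just splits on "Step N:" markers:

```python
import re

def extract_steps(reasoning_block: str) -> list[str]:
    """Split a <think> block into "Step N: ..." fragments (naive sketch)."""
    # Capture from each "Step N:" up to the next step marker or end of block
    pattern = r'(Step\s+\d+:.*?)(?=Step\s+\d+:|\Z)'
    return [step.strip() for step in re.findall(pattern, reasoning_block, flags=re.DOTALL)]
```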

### 2. **Model-Specific Prompt Engineering**

Different models need different prompts:

```python
def get_optimized_prompt(task: str, model: str) -> str:
    """
    Get model-specific prompt optimizations
    """
    base_prompt = TASK_PROMPTS[task]

    if is_reasoning_model(model):
        # Reasoning models: Explicitly ask for JSON-only output
        return base_prompt + "\n\nIMPORTANT: Respond with ONLY valid JSON. Do not include reasoning, explanations, or <think> tags."
    else:
        # Standard models: Regular prompt is fine
        return base_prompt
```

### 3. **Automated Model Testing Pipeline**

We want to catch issues like this automatically:

```python
# Automated daily tests
async def test_all_models():
    """
    Test all models in registry for unexpected behaviors
    """
    test_queries = [
        "What is 2+2?",
        "List the first 3 prime numbers",
        "Extract keywords from: 'The quick brown fox jumps over the lazy dog'"
    ]

    for model in get_all_models():
        for query in test_queries:
            response = await call_llm(model, query)

            # Check for issues
            if "<think>" in response and not is_reasoning_model(model):
                alert(f"Model {model} unexpectedly emits <think> tags!")

            if len(response) == 0:
                alert(f"Model {model} returned empty response!")

            # Check response time
            if response_time > expected_latency * 1.5:
                alert(f"Model {model} slower than expected: {response_time}ms")
```


Conclusion

What started as a mysterious bug – AI showing its “homework” to users – turned into a great learning experience about production LLM systems.

The core issue: Reasoning models (Qwen, DeepSeek) started emitting <think> tags containing chain-of-thought reasoning, which broke JSON parsing and polluted user-facing answers.

The solution: Multi-layered approach:

  1. ✅ Built centralized model registry with cleaning utilities
  2. ✅ Implemented automatic tag cleaning in 3 services
  3. ✅ Switched Answer Generation to Llama (no <think> tags)
  4. ✅ Increased token budgets to prevent truncation
  5. ✅ Added comprehensive testing for edge cases

Impact:

  • 23% → 0% error rate in Intent Detection
  • 18% → 0% truncation rate in Metadata Extraction
  • 12 → 0 user complaints per week
  • $36/month additional cost, $964/month value (2,678% ROI)

Time to fix: 2 days (8 engineer-hours)

Lines of code: ~150 lines (mostly in model registry)

The Big Lesson

Modern LLM systems are living, breathing infrastructure. Model providers update models, behaviors change, and you need:

  • 🔍 Monitoring for unusual patterns
  • 🛠️ Centralized utilities for common fixes
  • 🧪 Comprehensive testing including error cases
  • 📊 Metrics to catch regressions early
  • 🚀 Flexibility to switch models when needed

About the Team

This fix was implemented by the CrawlEnginePro team, building production-ready RAG systems with multi-model support, semantic chunking, and intent-aware retrieval.

Tech Stack:

  • Python, FastAPI, asyncio
  • Nebius AI Studio (Qwen, Llama models)
  • SambaNova Cloud (DeepSeek models)
  • Milvus (vector database)
  • Redis (caching)

System Stats:

  • 8 microservices
  • 15+ LLM models supported
  • 99.97% uptime
  • <2s average query latency

Questions? Reach out to us via our contact form.

---

Last Updated: October 2025
Case Study Version: 1.0
Reading Time: ~25 minutes