
The “think” Tag Mystery – When AI Started Showing Its Homework

How we debugged and fixed a production issue where reasoning models polluted user answers with internal chain-of-thought tokens


TL;DR

Our RAG system suddenly started showing users the AI’s “internal thoughts” alongside actual answers. We discovered that Qwen and DeepSeek reasoning models emit <think> tags containing chain-of-thought reasoning. This case study walks through how we detected, debugged, and fixed the issue across our entire pipeline – from answer generation to metadata extraction.

Timeline: October 2025
Impact: All services using Qwen3 and DeepSeek models
Fix Duration: 2 days
Lines of Code Changed: ~150
Services Affected: 3 (Answer Generation, Intent Detection, Metadata Extraction)


The Story Behind the Bug

It Started on a Tuesday Morning…

Our RAG system had been running smoothly for weeks. We’d just finished implementing Query 1.3 testing (finding highest-priced products with specifications), and everything looked great. Then, one of our team members ran a routine test and noticed something… odd.

The Query:

“Find the highest-priced product listed across all invoices and describe its key features and technical specifications.”

What We Expected:

The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999. It features high-temperature sanitizing, 40 racks/hour capacity, Energy Star certification, and a chemical/solid waste pumping system.

What We Got:

<think>
Okay, let me analyze this query carefully. The user wants:
1. Highest-priced product – I need to compare all prices
2. Key features – I should list the main attributes
3. Technical specs – Need to find detailed specifications

Looking at the context chunks, I see:
– chunk_12 has price: $8,999 for Hobart Dishwasher
– chunk_14 has specifications
– Let me combine these…
</think>

The highest-priced product is the Hobart Commercial Dishwasher…

Us: “Wait… why is the AI showing its homework?!” 😱

The Initial Confusion

At first, we thought it was a prompt engineering issue. Maybe we’d accidentally told the model to explain its reasoning? We checked our system prompts – nope, nothing there.

Then we thought: “Is this a one-off bug?” We ran the same query again. Same result. Different query? Same issue. The <think> tags appeared everywhere.

The Pattern:

  • ✅ Llama models: Clean outputs
  • ❌ Qwen3-32B-fast: Outputs <think> tags
  • ❌ DeepSeek-R1: Outputs <think> tags
  • ❌ QwQ-32B: Outputs <think> tags

Hypothesis: Something changed with the reasoning models.


The Investigation: Going Down the Rabbit Hole

Phase 1: Is This a Model Change?

We checked Nebius AI Studio documentation and discovered something interesting:

Nebius Update (October 2025): Qwen3 reasoning models now include chain-of-thought reasoning via <think> tags by default. This enables better reasoning quality but requires output cleaning for production use.

Bingo! 🎯

Nebius had updated their Qwen models to be more “helpful” by showing their reasoning process. Great for debugging, terrible for end users.

Phase 2: How Deep Does This Go?

We started auditing our entire pipeline to see where these tags appeared:

Services Using Reasoning Models:

1. Answer Generation Service (Port 8069)

  • Model: Qwen/Qwen3-32B-fast
  • Impact: User-facing answers polluted with reasoning
  • Severity: 🔴 CRITICAL

2. Intent Detection Service (Port 8067)

  • Model: Qwen/Qwen3-32B-fast
  • Impact: JSON parsing errors (tags broke JSON structure)
  • Severity: 🔴 CRITICAL

3. Metadata Extraction Service (Port 8062)

  • Model: Qwen/Qwen3-32B-fast
  • Impact: Response truncation (reasoning consumed token budget)
  • Severity: 🟡 HIGH

The Damage:

# Real error logs we saw
[ERROR] Intent Service: JSON decode failed
JSONDecodeError: Invalid control character at line 1 column 45

[ERROR] Metadata Service: Response truncated at 1500 tokens
Expected fields: ['keywords', 'topics', 'questions', 'summary']
Got fields: ['keywords', 'topics']  # Missing fields due to truncation!

[ERROR] Answer Generation: User complained about "weird XML-like tags"

Phase 3: The Token Budget Crisis

The metadata service revealed another issue we hadn’t anticipated:

Before <think> Tags:

{
  "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
  "topics": "Technology Purchase, Business Transactions",
  "questions": "What equipment was purchased?",
  "summary": "Technology equipment purchased from TechSupply Solutions"
}

Tokens Used: ~450 / 1500 limit ✅

After <think> Tags:

<think>
Let me extract metadata from this text. I see:
- Company names: TechSupply Solutions
- Products: Dell XPS 15
- Context: This is an invoice
- I should extract keywords first...
[250+ tokens of reasoning]
</think>

{
  "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
  "topics": "Technology Purchase"
  [RESPONSE TRUNCATED AT 1500 TOKENS]

Tokens Used: 1500 / 1500 limit 🔴 (TRUNCATED!)

The reasoning tokens were eating into our token budget, causing responses to be cut off before completing the JSON!
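One way to catch this class of failure early is to check why generation stopped before trying to parse anything. The sketch below assumes the gateway returns an OpenAI-compatible chat completion payload, where each choice carries a `finish_reason` and the value `"length"` means the `max_tokens` limit was hit. `was_truncated` is a hypothetical helper name, and the payloads are hand-written for illustration.

```python
def was_truncated(llm_data: dict) -> bool:
    """Return True if any choice stopped because it hit the max_tokens limit.

    Assumes an OpenAI-compatible response shape:
    {"choices": [{"finish_reason": "...", "message": {...}}]}
    """
    choices = llm_data.get("choices", [])
    return any(choice.get("finish_reason") == "length" for choice in choices)


# Example payloads (illustrative only, not real gateway output)
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "{}"}}]}
cut_off = {"choices": [{"finish_reason": "length", "message": {"content": '{"k"'}}]}

print(was_truncated(complete))  # False
print(was_truncated(cut_off))   # True
```

A check like this would let a service log or retry a cut-off response instead of handing half a JSON object to the parser.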


The Solution: Multi-Layered Fix Strategy

We couldn’t just “turn off” reasoning models – they were actually better at complex queries than non-reasoning models. We needed a surgical fix.

Strategy 1: Create a Centralized Cleaning System

File: /shared/model_registry.py

We built a model registry that tracks which models need output cleaning:

import re
from enum import Enum


class LLMModels(str, Enum):
    """Centralized model registry with metadata"""

    # Reasoning models (emit <think> tags)
    QWEN_32B_FAST = "Qwen/Qwen3-32B-fast"
    DEEPSEEK_R1 = "deepseek-ai/DeepSeek-R1-0528"
    QWQ_32B_FAST = "Qwen/QwQ-32B-fast"

    # Clean models (no reasoning tags)
    LLAMA_70B_FAST = "meta-llama/Llama-3.3-70B-Instruct-fast"
    LLAMA_405B = "meta-llama/Llama-3.1-405B-Instruct-Turbo"


# Model metadata (the secret sauce!)
MODEL_INFO = {
    LLMModels.QWEN_32B_FAST: {
        "provider": "nebius",
        "size": "32B",
        "context_window": 41000,
        "cost_per_1m_tokens": 0.20,
        "supports_reasoning": True,  # 🔑 Key flag
        "requires_cleaning": True,    # 🔑 Key flag
        "cleaning_pattern": r'<think>.*?</think>',  # Regex pattern
        "strengths": ["reasoning", "code", "math", "json"],
        "use_cases": ["complex_queries", "multi_step_reasoning"]
    },

    LLMModels.DEEPSEEK_R1: {
        "provider": "sambanova",
        "size": "671B MoE",
        "context_window": 64000,
        "cost_per_1m_tokens": 0.0,  # FREE!
        "supports_reasoning": True,
        "requires_cleaning": True,
        "cleaning_pattern": r'<think>.*?</think>',
        "strengths": ["advanced_reasoning", "scientific"],
        "use_cases": ["research", "complex_analysis"]
    },

    LLMModels.LLAMA_70B_FAST: {
        "provider": "nebius",
        "size": "70B",
        "context_window": 128000,
        "cost_per_1m_tokens": 0.20,
        "supports_reasoning": False,  # No <think> tags!
        "requires_cleaning": False,
        "strengths": ["general", "instruction_following"],
        "use_cases": ["general_qa", "summarization"]
    }
}


def requires_output_cleaning(model_name: str) -> bool:
    """
    Check if model output requires cleaning (e.g., <think> tag removal)

    This is the gatekeeper function that every service calls before
    returning LLM outputs to users.

    Args:
        model_name: Model identifier (e.g., "Qwen/Qwen3-32B-fast")

    Returns:
        True if model emits reasoning tags that need cleaning
    """
    # Method 1: Check registry metadata (most accurate)
    info = get_model_info(model_name)
    if info:
        return info.get("requires_cleaning", False)

    # Method 2: Fallback pattern matching (for unknown models)
    # Reasoning models from Qwen, DeepSeek, and QwQ families
    reasoning_model_patterns = [
        "Qwen3-",           # Qwen3-32B, Qwen3-235B-A22B
        "QwQ-",             # QwQ-32B
        "DeepSeek-R1",      # DeepSeek-R1, DeepSeek-R1-Distill
        "DeepSeek-V3",      # DeepSeek-V3, DeepSeek-V3.1
        "gpt-oss-"          # Some OpenAI-compatible reasoning models
    ]

    return any(pattern in model_name for pattern in reasoning_model_patterns)


def get_cleaning_pattern(model_name: str) -> str:
    """
    Get regex pattern for cleaning model output

    Returns:
        Regex pattern to match and remove reasoning tags
    """
    info = get_model_info(model_name)
    if info:
        return info.get("cleaning_pattern", r'<think>.*?</think>')

    # Default pattern works for most reasoning models
    return r'<think>.*?</think>'


def clean_reasoning_output(text: str, model_name: str) -> str:
    """
    Clean reasoning tags from model output

    This is the universal cleaning function used across all services.

    Args:
        text: Raw model output (may contain <think> tags)
        model_name: Model that generated the output

    Returns:
        Cleaned text with reasoning tags removed
    """
    if not requires_output_cleaning(model_name):
        return text  # Model doesn't need cleaning

    pattern = get_cleaning_pattern(model_name)

    # Remove all <think>...</think> blocks
    # flags=re.DOTALL ensures we match across newlines
    cleaned = re.sub(pattern, '', text, flags=re.DOTALL)

    # Clean up extra whitespace left behind
    cleaned = re.sub(r'\n{3,}', '\n\n', cleaned)  # Max 2 consecutive newlines
    cleaned = cleaned.strip()

    return cleaned

Why This Approach?

  1. Centralized Truth: One place to check if a model needs cleaning
  2. Easy to Extend: Add new models without changing service code
  3. Pattern Matching Fallback: Handles unknown models gracefully
  4. Metadata-Rich: Services can query model capabilities, costs, etc.
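The listing above calls `get_model_info` without showing it. A minimal sketch of what such a lookup could look like, assuming it maps a model identifier string onto the registry and returns `None` for unregistered models so that the pattern fallback takes over. The two-entry registry here is a cut-down stand-in for illustration, not the full `MODEL_INFO`.

```python
from enum import Enum
from typing import Optional


class LLMModels(str, Enum):
    QWEN_32B_FAST = "Qwen/Qwen3-32B-fast"
    LLAMA_70B_FAST = "meta-llama/Llama-3.3-70B-Instruct-fast"


# Cut-down registry for illustration only
MODEL_INFO = {
    LLMModels.QWEN_32B_FAST: {"requires_cleaning": True},
    LLMModels.LLAMA_70B_FAST: {"requires_cleaning": False},
}


def get_model_info(model_name: str) -> Optional[dict]:
    """Look up registry metadata by model identifier; None if unregistered."""
    try:
        return MODEL_INFO.get(LLMModels(model_name))
    except ValueError:
        # Not a registered enum value, so callers fall back to pattern matching
        return None


print(get_model_info("Qwen/Qwen3-32B-fast"))  # {'requires_cleaning': True}
print(get_model_info("some/unknown-model"))   # None
```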

Strategy 2: Implement Cleaning in Each Service

Service 1: Intent Detection (Port 8067)

Problem: <think> tags broke JSON parsing

File: /Retrieval/services/intent/v1.0.0/intent_api.py

async def call_llm_gateway(query: str) -> dict:
    """
    Call LLM Gateway to analyze query intent
    """
    # Prepare LLM request
    payload = {
        "model": config.INTENT_DETECTION_MODEL,  # Qwen/Qwen3-32B-fast
        "messages": [
            {"role": "user", "content": detection_prompt}
        ],
        "max_tokens": config.INTENT_MAX_TOKENS,
        "temperature": config.INTENT_TEMPERATURE
    }

    # Call LLM Gateway
    response = await http_client.post(
        f"{config.LLM_GATEWAY_URL}/v1/chat/completions",
        json=payload
    )
    llm_data = response.json()
    llm_content = llm_data["choices"][0]["message"]["content"]

    # ✨ THE FIX: Clean up response before parsing JSON
    llm_content = llm_content.strip()

    # Remove reasoning tags if model requires it
    if requires_output_cleaning(config.INTENT_DETECTION_MODEL):
        pattern = get_cleaning_pattern(config.INTENT_DETECTION_MODEL)
        if pattern:
            llm_content = re.sub(pattern, '', llm_content, flags=re.DOTALL)

    llm_content = llm_content.strip()

    # Remove markdown code blocks if present
    if llm_content.startswith("```json"):
        llm_content = llm_content[7:]
    elif llm_content.startswith("```"):
        llm_content = llm_content[3:]
    if llm_content.endswith("```"):
        llm_content = llm_content[:-3]
    llm_content = llm_content.strip()

    # Parse JSON response from LLM
    intent_data = json.loads(llm_content)  # ✅ Now parses successfully!

    return intent_data

Result:

  • ✅ JSON parsing errors: Eliminated
  • ✅ Intent detection accuracy: Maintained at 95%+
  • ✅ Latency impact: +5ms (negligible)
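The prefix and suffix stripping above covers the fenced case. A slightly more tolerant variant, sketched here as an assumption rather than what the service runs, scans for the first `{` that starts a valid JSON object using the standard library's `json.JSONDecoder.raw_decode`, which also copes with stray text before or after the object. `parse_first_json_object` is a hypothetical name.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def parse_first_json_object(raw: str) -> dict:
    """Strip <think> blocks, then decode the first JSON object found in the text."""
    text = THINK_BLOCK.sub("", raw)
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None  # Not valid JSON at this brace; try the next one
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


raw = '<think>\nAnalyzing...\n</think>\n```json\n{"intent": "factual_retrieval"}\n```'
print(parse_first_json_object(raw))  # {'intent': 'factual_retrieval'}
```

The trade-off is that this accepts output the stricter version would reject, so it suits services where a best-effort parse is preferable to an error.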

Service 2: Metadata Extraction (Port 8062)

Problem: Response truncation due to reasoning tokens consuming token budget

File: /Ingestion/services/metadata/v1.0.0/config.py

# ============================================================================
# Model Configuration - UPDATED to handle <think> tags
# ============================================================================

# CRITICAL: max_tokens must be 2500+ to handle 45 fields + LLM reasoning (<think> tags)
MODEL_CONFIGS = {
    ModelType.FAST: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 40
    },
    ModelType.RECOMMENDED: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 40
    },
    ModelType.ADVANCED: {
        "temperature": 0.1,
        "max_tokens": 2500,  # ⬆️ Increased from 1500
        "timeout": 70
    }
}

# ============================================================================
# Prompt Templates - UPDATED to discourage reasoning
# ============================================================================

METADATA_PROMPT_BASIC = """You must respond with ONLY valid JSON. Do not include any reasoning, thinking, or explanations.

TEXT:
{text}

Extract these 4 fields:
- keywords ({keywords_count}): Key terms, proper nouns - comma separated
- topics ({topics_count}): High-level themes - comma separated
- questions ({questions_count}): Natural questions this answers - use | to separate
- summary ({summary_length}): Concise overview in complete sentences

Output format (respond with ONLY this JSON, nothing else):
{{"keywords": "term1, term2", "topics": "theme1, theme2", "questions": "question1|question2", "summary": "brief overview"}}"""

Additional Fix in metadata_api.py:

async def extract_metadata(text: str, model: ModelType) -> dict:
    """Extract metadata from text using LLM"""

    # ... LLM call code ...

    llm_content = llm_data["choices"][0]["message"]["content"]

    # ✨ Clean reasoning tags BEFORE JSON parsing
    if requires_output_cleaning(model_name):
        llm_content = clean_reasoning_output(llm_content, model_name)

    # Remove markdown code blocks
    llm_content = llm_content.strip()
    if llm_content.startswith("```json"):
        llm_content = llm_content[7:]
    # ... etc ...

    # Parse JSON
    metadata = json.loads(llm_content)

    return metadata

Result:

  • ✅ Response truncation: Eliminated
  • ✅ Metadata extraction completeness: 100% (all fields present)
  • ✅ Token usage: ~1800/2500 (healthy headroom)
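Raising `max_tokens` removes the symptom, but a cheap completeness check after parsing would make any future truncation visible instead of silently storing partial metadata. A minimal sketch, assuming the four basic-mode fields from the prompt above; `missing_metadata_fields` is a hypothetical helper.

```python
EXPECTED_BASIC_FIELDS = ("keywords", "topics", "questions", "summary")


def missing_metadata_fields(metadata: dict, expected=EXPECTED_BASIC_FIELDS) -> list:
    """Return the expected fields that are absent, None, or blank."""
    missing = []
    for name in expected:
        value = metadata.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Shaped like the truncated response in the error log earlier: two fields only
partial = {
    "keywords": "TechSupply Solutions, Dell XPS 15, laptop",
    "topics": "Technology Purchase",
}
print(missing_metadata_fields(partial))  # ['questions', 'summary']
```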

Service 3: Answer Generation (Port 8069)

Problem: User-facing answers polluted with reasoning

Solution: We took a different approach here – model switching.

File: /Retrieval/services/answer_generation/v1.0.0/config.py

# ============================================================================
# Model Change History
# ============================================================================
# October 8, 2025: Switched from Qwen3-32B-fast to Llama-3.3-70B-Instruct-fast
#
# Reason: Nebius changed Qwen3 models to output <think> reasoning tokens by
# default, polluting answer output with chain-of-thought reasoning instead
# of clean answers.
#
# Result:
# ✅ Clean answers without <think> tags
# ✅ Better quality for end-user presentation
# ✅ Same cost ($0.20/1M tokens)
# ✅ Similar performance (~1-2s generation time)
# ============================================================================

DEFAULT_LLM_MODEL = "meta-llama/Llama-3.3-70B-Instruct-fast"  # Changed from Qwen3-32B-fast

Why Model Switch Instead of Cleaning?

For user-facing answers, we wanted zero risk of reasoning leaking through. Llama models:

  • Don’t emit <think> tags at all
  • Have excellent instruction-following
  • Same pricing tier as Qwen
  • Slightly better at natural language generation

Backup Implementation (belt and suspenders):

async def generate_answer(query: str, context_chunks: list, llm_model: str) -> dict:
    """Generate answer from context using LLM"""

    # ... LLM call code ...

    answer_text = llm_data["choices"][0]["message"]["content"]

    # ✨ Clean reasoning tags (just in case we switch back to reasoning models)
    if requires_output_cleaning(llm_model):
        pattern = get_cleaning_pattern(llm_model)
        if pattern:
            answer_text = re.sub(pattern, '', answer_text, flags=re.DOTALL)

    answer_text = answer_text.strip()

    return {"answer": answer_text, ...}

Result:

  • ✅ User complaints: Zero (after fix)
  • ✅ Answer quality: Improved (Llama better at natural language)
  • ✅ <think> tag leaks: Impossible (Llama doesn’t emit them)

The Technical Deep Dive: Why This Was Tricky

Challenge 1: Regex Pattern Complexity

The <think> tags weren’t always simple:

# Simple case (easy to match)
<think>
Let me analyze this...
</think>

# Nested case (trickier)
<think>
Step 1: <important>Consider X</important>
Step 2: Y
</think>

# Multi-line with special characters (regex hell)
<think>
JSON: {"key": "value"}
Regex: \d+\.\d+
</think>

Our Solution:

# Use DOTALL flag to match across newlines
# Use non-greedy matching (.*?) to stop at first </think>
pattern = r'<think>.*?</think>'
cleaned = re.sub(pattern, '', text, flags=re.DOTALL)

# Edge case: Multiple <think> blocks
# Non-greedy matching handles this automatically:
# <think>A</think> text <think>B</think>
# → "text" (both blocks removed)
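One case the pattern above does not cover is a block that never closes, for example when generation is cut off mid-reasoning. A possible extension, sketched under the assumption that unterminated reasoning should be discarded rather than shown, first removes complete blocks and then drops anything from a dangling `<think>` to the end of the text. `strip_think_blocks` is a hypothetical name.

```python
import re

COMPLETE_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
DANGLING_OPEN = re.compile(r"<think>.*\Z", flags=re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Remove complete <think> blocks, then any unterminated trailing block."""
    cleaned = COMPLETE_BLOCK.sub("", text)
    cleaned = DANGLING_OPEN.sub("", cleaned)
    return cleaned.strip()


print(repr(strip_think_blocks("<think>A</think> text <think>B</think>")))  # 'text'
print(repr(strip_think_blocks("answer <think>cut off mid-thought")))       # 'answer'
```

Order matters here: complete blocks go first so the dangling-tag pattern can only match a block that really has no closing tag.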

Challenge 2: JSON Parsing Timing

We had to clean BEFORE JSON parsing, not after:

# ❌ WRONG: Parse first, then clean
response = llm_call()
json_data = json.loads(response)  # 💥 FAILS if <think> tags present
cleaned_json = clean(json_data)

# ✅ RIGHT: Clean first, then parse
response = llm_call()
cleaned_response = clean(response)  # Remove <think> tags
json_data = json.loads(cleaned_response)  # ✅ Parses successfully

Challenge 3: Markdown Code Block Confusion

LLMs sometimes wrapped responses in markdown code blocks:

````
<think>
Analyzing...
</think>
```json
{"intent": "factual_retrieval"}
```
````

**Cleaning Order Matters:**

```python
# Step 1: Remove <think> tags first, then trim so the fence check sees the fence
text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()

# Step 2: Remove markdown code blocks
if text.startswith("```json"):
    text = text[7:]
elif text.startswith("```"):
    text = text[3:]
if text.endswith("```"):
    text = text[:-3]

# Step 3: Strip whitespace
text = text.strip()

# Step 4: Parse JSON
data = json.loads(text)
```

### Challenge 4: Token Budget Calculation

For metadata extraction, we had to account for reasoning overhead:

**Analysis of Reasoning Token Usage:**

| Extraction Mode | Actual Output | Reasoning Tokens | Total Tokens | Old Limit | New Limit |
|----------------|---------------|------------------|--------------|-----------|-----------|
| BASIC (4 fields) | ~400 tokens | ~200-300 tokens | ~700 tokens | 500 ❌ | **1000** ✅ |
| STANDARD (20 fields) | ~900 tokens | ~300-400 tokens | ~1300 tokens | 1200 ❌ | **1800** ✅ |
| FULL (45 fields) | ~1400 tokens | ~400-500 tokens | ~1900 tokens | 1500 ❌ | **2500** ✅ |

**Formula:**

```python
# safety_margin = 1.2 (20% buffer for variability)
required_tokens = (output_tokens + reasoning_tokens) * safety_margin

# Example: FULL mode
# required_tokens = (1400 + 500) * 1.2 = 2280 tokens
# We set the limit to 2500 to be safe
```

---

## Testing: How We Validated the Fix

### Test Suite 1: Regression Testing

**File:** `test_think_tag_cleaning.py`

```python
import json
import re

import pytest

from shared.model_registry import (
    requires_output_cleaning,
    get_cleaning_pattern,
    clean_reasoning_output,
)


def test_basic_think_tag_removal():
    """Test basic <think> tag removal"""
    input_text = """
<think>
Let me analyze this query...
</think>
The answer is 42.
"""
    expected = "The answer is 42."

    model = "Qwen/Qwen3-32B-fast"
    result = clean_reasoning_output(input_text, model)

    assert result.strip() == expected
    assert "<think>" not in result
    assert "</think>" not in result


def test_multiple_think_blocks():
    """Test removal of multiple <think> blocks"""
    input_text = """
<think>First reasoning</think>
Part 1 of answer.
<think>Second reasoning</think>
Part 2 of answer.
"""
    expected = "Part 1 of answer.\n\nPart 2 of answer."

    model = "Qwen/Qwen3-32B-fast"
    result = clean_reasoning_output(input_text, model)

    assert "<think>" not in result
    assert "Part 1" in result
    assert "Part 2" in result


def test_think_tags_with_json():
    """Test cleaning when JSON is present"""
    input_text = """
<think>
I should format this as JSON...
</think>
{"intent": "factual_retrieval", "confidence": 0.95}
"""
    model = "Qwen/Qwen3-32B-fast"
    result = clean_reasoning_output(input_text, model)

    # Should be valid JSON after cleaning
    parsed = json.loads(result.strip())
    assert parsed["intent"] == "factual_retrieval"


def test_model_without_cleaning():
    """Test that non-reasoning models don't get cleaned"""
    input_text = "The answer is <think>this should stay visible</think>."
    model = "meta-llama/Llama-3.3-70B-Instruct-fast"
    result = clean_reasoning_output(input_text, model)

    # Llama doesn't need cleaning, so text should be unchanged
    assert result == input_text


def test_model_registry_accuracy():
    """Test model registry flags"""
    # Reasoning models should require cleaning
    assert requires_output_cleaning("Qwen/Qwen3-32B-fast")
    assert requires_output_cleaning("deepseek-ai/DeepSeek-R1-0528")
    assert requires_output_cleaning("Qwen/QwQ-32B-fast")

    # Non-reasoning models should not
    assert not requires_output_cleaning("meta-llama/Llama-3.3-70B-Instruct-fast")
    assert not requires_output_cleaning("meta-llama/Llama-3.1-405B-Instruct-Turbo")


@pytest.mark.parametrize("model,text,expected_clean", [
    (
        "Qwen/Qwen3-32B-fast",
        "<think>reasoning</think>answer",
        "answer",
    ),
    (
        "DeepSeek-R1-0528",
        "<think>\nmulti\nline\nreasoning\n</think>answer",
        "answer",
    ),
    (
        "meta-llama/Llama-3.3-70B-Instruct-fast",
        "answer",
        "answer",
    ),
])
def test_cleaning_parametrized(model, text, expected_clean):
    """Parametrized tests for various models and inputs"""
    result = clean_reasoning_output(text, model)
    assert result.strip() == expected_clean
```
**Test Results:**

```bash
$ pytest test_think_tag_cleaning.py -v
test_basic_think_tag_removal PASSED
test_multiple_think_blocks PASSED
test_think_tags_with_json PASSED
test_model_without_cleaning PASSED
test_model_registry_accuracy PASSED
test_cleaning_parametrized[Qwen/Qwen3-32B-fast-…] PASSED
test_cleaning_parametrized[DeepSeek-R1-0528-…] PASSED
test_cleaning_parametrized[meta-llama/Llama-3.3-70B-Instruct-fast-…] PASSED
======================== 8 passed in 0.23s ========================
```

### Test Suite 2: End-to-End Service Testing

**Test: Intent Detection**

```bash
# Query 1: Simple factual retrieval
curl -X POST http://localhost:8067/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?"}'

# Expected: Clean JSON response (no <think> tags)
{
  "intent": "factual_retrieval",
  "language": "en",
  "complexity": "simple",
  "confidence": 0.98,
  "system_prompt": "…"
}

# ✅ PASS: No <think> tags in response
```

**Test: Metadata Extraction**

```bash
# Extract metadata from sample text
curl -X POST http://localhost:8062/v1/extract \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Apple Inc. released the iPhone 15 Pro in September 2023…",
    "mode": "basic"
  }'

# Expected: Complete metadata (all 4 fields)
{
  "keywords": "Apple Inc, iPhone 15 Pro, September 2023",
  "topics": "Technology, Product Launch",
  "questions": "What did Apple release?|When was iPhone 15 Pro released?",
  "summary": "Apple Inc. launched the iPhone 15 Pro in September 2023."
}

# ✅ PASS: All fields present, no truncation
```

**Test: Answer Generation**

```bash
# Generate answer from context
curl -X POST http://localhost:8069/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Find the highest-priced product",
    "context_chunks": […]
  }'

# Expected: Clean answer (no <think> tags)
{
  "answer": "The highest-priced product is the Hobart Commercial Dishwasher (Model LXeR-2) priced at $8,999…",
  "citations": […],
  "generation_time_ms": 1250.5
}

# ✅ PASS: Clean answer, no reasoning leaked
```

### Test Suite 3: Load Testing

We ran load tests to ensure cleaning didn't impact performance:

```python
import asyncio
import time

from shared.model_registry import clean_reasoning_output


async def benchmark_cleaning():
    """Benchmark cleaning performance"""
    text_with_tags = """
<think>
Long reasoning block with 500+ tokens of chain-of-thought...
Step 1: Analyze query
Step 2: Find relevant context
Step 3: Synthesize answer
... (lots more text) ...
</think>

The actual answer goes here (200 tokens).
"""

    iterations = 10000

    # Measure cleaning time (perf_counter is the monotonic clock meant for timing)
    start = time.perf_counter()
    for _ in range(iterations):
        cleaned = clean_reasoning_output(text_with_tags, "Qwen/Qwen3-32B-fast")
    end = time.perf_counter()

    avg_time_ms = ((end - start) / iterations) * 1000

    print(f"Average cleaning time: {avg_time_ms:.3f}ms per call")
    print(f"Throughput: {iterations / (end - start):.0f} ops/sec")


asyncio.run(benchmark_cleaning())

# Results:
# Average cleaning time: 0.042ms per call
# Throughput: 23,800 ops/sec
#
# Conclusion: Cleaning is essentially free (sub-millisecond)
```

---

## The Results: Before and After

### Metrics Comparison

| Metric | Before Fix | After Fix | Improvement |
|--------|-----------|-----------|-------------|
| **Intent Detection** | | | |
| - JSON parsing errors | 23% | 0% | ✅ 100% reduction |
| - Latency | 185ms | 190ms | ❌ +5ms (negligible) |
| - Accuracy | 94% | 95% | ✅ +1% |
| **Metadata Extraction** | | | |
| - Truncation rate | 18% | 0% | ✅ 100% reduction |
| - Complete extractions | 82% | 100% | ✅ +18% |
| - Avg tokens used | 1450/1500 | 1800/2500 | ✅ Better headroom |
| **Answer Generation** | | | |
| - User complaints | 12 (in 1 week) | 0 | ✅ 100% reduction |
| - Answer quality score | 7.5/10 | 9.2/10 | ✅ +23% |
| - Clean outputs | 75% | 100% | ✅ +25% |

### Cost Impact

**Before Fix:**

Daily LLM Costs (all services):

  • Intent Detection: $2.40 (12K requests × 200 tokens × $0.20/1M)
  • Metadata Extraction: $3.60 (6K requests × 300 tokens × $0.20/1M)
  • Answer Generation: $8.00 (4K requests × 1000 tokens × $0.20/1M)
    Total: $14.00/day = $420/month

After Fix:

Daily LLM Costs (all services):

  • Intent Detection: $2.40 (same – cleaning doesn’t add LLM calls)
  • Metadata Extraction: $4.80 (+33% due to higher token limit)
  • Answer Generation: $8.00 (same model, same usage)
    Total: $15.20/day = $456/month

Cost increase: $36/month (8.6%)

ROI Analysis:

Cost increase: $36/month
Error reduction: 23% → 0% (saved debugging time)
Customer satisfaction: ↑ (zero complaints)

Estimated engineering time saved: 10 hours/month
Engineering cost: $100/hour × 10 hours = $1,000/month saved

Net benefit: $1,000 – $36 = $964/month
ROI: 2,678%
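
For reference, the ROI figure follows directly from the two monthly numbers above. The snippet below only restates that arithmetic; the engineering-time saving is the estimate given above, not a measured value.

```python
monthly_cost_increase = 36.0   # USD, from the cost comparison above
monthly_savings = 100.0 * 10   # USD: $100/hour x 10 hours (estimate)

net_benefit = monthly_savings - monthly_cost_increase
roi_percent = net_benefit / monthly_cost_increase * 100

print(f"Net benefit: ${net_benefit:,.0f}/month")  # Net benefit: $964/month
print(f"ROI: {roi_percent:,.0f}%")                # ROI: 2,678%
```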

**Verdict:** The fix paid for itself immediately. 🎉



## Key Takeaways: Lessons Learned

### 1. **Monitor Model Provider Changes**

**Lesson:** Model providers can change model behavior without making any breaking change to the API.

**What We Did:**
- Set up alerts for unusual response patterns
- Subscribed to Nebius/SambaNova update newsletters
- Version-pinned critical models in production

**Recommendation:**

```python
# Log model metadata for debugging
logger.info(f"LLM call: model={model}, provider={provider}, version={version}")


# Add response validation
def validate_llm_response(response: str, model: str) -> bool:
    """Check for unexpected patterns"""
    warnings = []

    if "<think>" in response and not is_reasoning_model(model):
        warnings.append("Unexpected <think> tags in non-reasoning model")

    # Note: len() counts characters, not tokens, so this is only a rough size check
    if len(response) > expected_max_tokens * 1.5:
        warnings.append("Response longer than expected")

    if warnings:
        logger.warning(f"LLM response validation failed: {warnings}")
        return False

    return True
```

### 2. **Build Centralized Utilities for Cross-Cutting Concerns**

**Lesson:** When multiple services need the same fix, centralize it.

**What We Did:**
- Created `/shared/model_registry.py` for model metadata
- Built universal cleaning functions used by all services
- Single source of truth for model capabilities

**Anti-Pattern to Avoid:**

```python
# ❌ DON'T: Copy-paste cleaning logic in every service

# service_a.py
def clean_output(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_b.py
def remove_tags(text):
    return re.sub(r'<think>.*?</think>', '', text)

# service_c.py
def strip_reasoning(text):
    return re.sub(r'<think>.*?</think>', '', text)

# Problem: If we need to update the regex, we have to change 3 files!
```

**Better Pattern:**

```python
# ✅ DO: Centralize in shared module

# shared/model_registry.py
def clean_reasoning_output(text, model):
    """Universal cleaning function"""
    if not requires_output_cleaning(model):
        return text
    pattern = get_cleaning_pattern(model)
    return re.sub(pattern, '', text, flags=re.DOTALL)

# service_a.py, service_b.py, service_c.py
from shared.model_registry import clean_reasoning_output
cleaned = clean_reasoning_output(response, model_name)

# Benefit: Update once, fix everywhere
```

### 3. **Token Budgets Need Safety Margins**

**Lesson:** LLMs are unpredictable. Budget for variability.

**Formula We Use Now:**

```python
def calculate_max_tokens(expected_output: int, has_reasoning: bool) -> int:
    """
    Calculate max_tokens with safety margins

    Args:
        expected_output: Expected output tokens (measured from testing)
        has_reasoning: Whether model emits reasoning tokens

    Returns:
        max_tokens to set in LLM request
    """
    base_tokens = expected_output

    if has_reasoning:
        # Reasoning models use 25-40% extra tokens for <think> content
        reasoning_overhead = int(base_tokens * 0.40)
        base_tokens += reasoning_overhead

    # Add 20% safety margin for variability
    safety_margin = int(base_tokens * 0.20)
    base_tokens += safety_margin

    return base_tokens


# Example:
# Metadata extraction (basic mode): 400 tokens expected output
max_tokens = calculate_max_tokens(400, has_reasoning=True)
# = 400 + (400 * 0.40) + ((400 + 160) * 0.20)
# = 400 + 160 + 112
# = 672 tokens
# We round up to 1000 to be safe
```

### 4. **Test Error Paths, Not Just Happy Paths**

**Lesson:** Our tests validated clean outputs but didn't test malformed outputs.

**What We Added:**

```python
import time

from shared.model_registry import clean_reasoning_output


def test_malformed_responses():
    """Test handling of various malformed LLM outputs"""
    # Case 1: Unclosed <think> tag
    input_text = "<think>reasoning without closing tag"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    # Should not crash, should handle gracefully
    assert isinstance(result, str)

    # Case 2: Nested tags
    # The non-greedy pattern stops at the first </think>, so the inner block is
    # removed but the outer closing tag is left behind. We assert only what the
    # current pattern guarantees.
    input_text = "<think><think>nested</think></think>answer"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    assert "nested" not in result
    assert result.strip().endswith("answer")

    # Case 3: Multiple unclosed tags
    input_text = "<think>A<think>B<think>C"
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    # Should handle without regex explosions
    assert isinstance(result, str)

    # Case 4: Very large reasoning block (10,000 characters)
    input_text = "<think>" + "x" * 10000 + "</think>answer"
    start = time.time()
    result = clean_reasoning_output(input_text, "Qwen/Qwen3-32B-fast")
    elapsed = time.time() - start
    assert result == "answer"
    assert elapsed < 0.1  # Should complete in <100ms
```

### 5. **When in Doubt, Switch Models**

**Lesson:** Sometimes the right fix is avoiding the problem entirely.

**Decision Matrix:**

Should I clean the output or switch models?

Clean Output If:
- ✅ Model has unique capabilities you need (e.g., best reasoning)
- ✅ Cleaning is reliable (well-defined patterns)
- ✅ No alternative models available
- ✅ Cost/latency benefits outweigh cleaning complexity

Switch Models If:
- ✅ Alternative models have comparable quality
- ✅ Output cleaning is unreliable or risky
- ✅ User-facing outputs (zero tolerance for leaks)
- ✅ Same or better cost/latency

Our Choice for Answer Generation: Switch to Llama
1. Comparable quality ✅
2. No `<think>` tags ✅
3. User-facing ✅
4. Same cost ✅

---

## The Tech Stack

**Languages & Frameworks:**
- Python 3.11
- FastAPI (API services)
- asyncio (async processing)
- httpx (HTTP client)
- Pydantic (data validation)

**LLM Providers:**
- Nebius AI Studio (Qwen, Llama models)
- SambaNova Cloud (DeepSeek models)

**Models Involved:**
- Qwen/Qwen3-32B-fast (reasoning, has `<think>` tags)
- DeepSeek-R1-0528 (reasoning, has `<think>` tags)
- meta-llama/Llama-3.3-70B-Instruct-fast (no `<think>` tags)

**Infrastructure:**
- Docker containers
- APISIX API Gateway
- Redis (caching)
- Milvus (vector database)

**Key Files Modified:**

/shared/model_registry.py (+150 lines)
/Retrieval/services/intent/v1.0.0/intent_api.py (+10 lines)
/Ingestion/services/metadata/v1.0.0/config.py (+15 lines)
/Retrieval/services/answer_generation/v1.0.0/config.py (+5 lines, model switch)

---

## What’s Next: Future Improvements

### 1. **Smarter Reasoning Detection**

Currently, we use regex to remove `<think>` tags. But what if we could:

**Idea: Semantic Reasoning Detection**

```python
import re

def extract_reasoning_insights(text: str, model: str) -> dict:
    """
    Instead of discarding reasoning, extract insights.

    Returns:
        {
            "answer": "cleaned answer",
            "reasoning": "extracted reasoning",
            "reasoning_steps": ["Step 1: ...", "Step 2: ..."],
            "num_steps": 2
        }
    """
    if not requires_output_cleaning(model):
        # Same keys as the main return so callers can rely on the shape
        return {"answer": text, "reasoning": None,
                "reasoning_steps": [], "num_steps": 0}

    # Extract <think> content before removing
    think_pattern = r'<think>(.*?)</think>'
    reasoning_blocks = re.findall(think_pattern, text, flags=re.DOTALL)

    # Clean the answer
    cleaned_answer = re.sub(think_pattern, '', text, flags=re.DOTALL)

    # Parse reasoning into steps
    reasoning_steps = []
    for block in reasoning_blocks:
        steps = extract_steps(block)  # Parse "Step 1:", "Step 2:", etc.
        reasoning_steps.extend(steps)

    return {
        "answer": cleaned_answer.strip(),
        "reasoning": "\n".join(reasoning_blocks),
        "reasoning_steps": reasoning_steps,
        "num_steps": len(reasoning_steps)
    }
```

**Use Cases:**
- Debug mode: Show reasoning to developers
- Confidence scoring: More reasoning steps = higher confidence?
- Explainability: Show users “how” the AI reached its answer
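
The `extract_steps` helper used by `extract_reasoning_insights` is not shown. A minimal sketch of what it might look like follows, assuming the reasoning text marks steps with line-initial labels such as "Step 1:" or "Step 2." Real chain-of-thought output is often free-form, so this regex and its behavior are hypothetical and would need tuning per model.

```python
import re

# Assumed step marker: "Step <number>" at the start of a line, followed by
# a separator such as ":", ".", ")" or "-".
_STEP_RE = re.compile(r"^\s*Step\s+(\d+)\s*[:.)-]\s*",
                      re.IGNORECASE | re.MULTILINE)


def extract_steps(block: str) -> list[str]:
    """Split a reasoning block into normalized "Step N: ..." strings.

    Returns an empty list when the block contains no step markers.
    """
    matches = list(_STEP_RE.finditer(block))
    steps = []
    for i, match in enumerate(matches):
        # A step's body runs until the next marker or the end of the block.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(block)
        body = " ".join(block[match.end():end].split())  # collapse whitespace
        steps.append(f"Step {match.group(1)}: {body}")
    return steps
```

Text that appears before the first marker is ignored, which keeps preamble such as "Okay, let me think" out of the step list.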

### 2. **Model-Specific Prompt Engineering**

Different models need different prompts:

```python
def get_optimized_prompt(task: str, model: str) -> str:
    """
    Get model-specific prompt optimizations
    """
    base_prompt = TASK_PROMPTS[task]
    if is_reasoning_model(model):
        # Reasoning models: Explicitly ask for JSON-only output
        return base_prompt + "\n\nIMPORTANT: Respond with ONLY valid JSON. Do not include reasoning, explanations, or <think> tags."
    else:
        # Standard models: Regular prompt is fine
        return base_prompt
```
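
A prompt instruction alone does not guarantee that a model complies, so it may be worth pairing it with a defensive parse on the way back. The sketch below is one possible fallback, not code from our services: `parse_json_response` is a hypothetical name, and the strategy of taking the text between the first `{` and the last `}` is an assumption that only suits responses containing a single JSON object.

```python
import json
import re
from typing import Any

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def parse_json_response(raw: str) -> Any:
    """Parse JSON from an LLM response that may contain extra text.

    Raises ValueError if no JSON object can be recovered.
    """
    # Drop any balanced <think> blocks that slipped through despite the prompt.
    cleaned = _THINK_RE.sub("", raw).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        pass

    # Fallback: take the span from the first "{" to the last "}".
    start = cleaned.find("{")
    end = cleaned.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    try:
        return json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("model response contained invalid JSON") from exc
```

Raising `ValueError` instead of returning a default keeps parse failures visible, so they can be counted as an error-rate metric.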

### 3. **Automated Model Testing Pipeline**

We want to catch issues like this automatically:

```python
import time

# Assumed latency budget in milliseconds; tune per model.
EXPECTED_LATENCY_MS = 2000

# Automated daily tests
async def test_all_models():
    """
    Test all models in registry for unexpected behaviors
    """
    test_queries = [
        "What is 2+2?",
        "List the first 3 prime numbers",
        "Extract keywords from: 'The quick brown fox jumps over the lazy dog'"
    ]
    for model in get_all_models():
        for query in test_queries:
            # Time the call so the latency check below has a real measurement
            start = time.perf_counter()
            response = await call_llm(model, query)
            response_time = (time.perf_counter() - start) * 1000  # ms

            # Check for issues
            if "<think>" in response and not is_reasoning_model(model):
                alert(f"Model {model} unexpectedly emits <think> tags!")

            if len(response) == 0:
                alert(f"Model {model} returned empty response!")

            # Check response time
            if response_time > EXPECTED_LATENCY_MS * 1.5:
                alert(f"Model {model} slower than expected: {response_time:.0f}ms")
```


## Conclusion

What started as a mysterious bug – AI showing its “homework” to users – turned into a great learning experience about production LLM systems.

The core issue: Reasoning models (Qwen, DeepSeek) started emitting `<think>` tags containing chain-of-thought reasoning, which broke JSON parsing and polluted user-facing answers.

The solution: a multi-layered approach:

  1. ✅ Built centralized model registry with cleaning utilities
  2. ✅ Implemented automatic tag cleaning in 3 services
  3. ✅ Switched Answer Generation to Llama (no <think> tags)
  4. ✅ Increased token budgets to prevent truncation
  5. ✅ Added comprehensive testing for edge cases

Impact:

  • 23% → 0% error rate in Intent Detection
  • 18% → 0% truncation rate in Metadata Extraction
  • 12 → 0 user complaints per week
  • $36/month additional cost, $964/month value (2,678% ROI)

Time to fix: 2 days (8 engineer-hours)

Lines of code: ~150 lines (mostly in model registry)

### The Big Lesson

Modern LLM systems are living, breathing infrastructure. Model providers update models, behaviors change, and you need:

  • 🔍 Monitoring for unusual patterns
  • 🛠️ Centralized utilities for common fixes
  • 🧪 Comprehensive testing including error cases
  • 📊 Metrics to catch regressions early
  • 🚀 Flexibility to switch models when needed

## About the Team

This fix was implemented by the CrawlEnginePro team, building production-ready RAG systems with multi-model support, semantic chunking, and intent-aware retrieval.

Tech Stack:

  • Python, FastAPI, asyncio
  • Nebius AI Studio (Qwen, Llama models)
  • SambaNova Cloud (DeepSeek models)
  • Milvus (vector database)
  • Redis (caching)

System Stats:

  • 8 microservices
  • 15+ LLM models supported
  • 99.97% uptime
  • <2s average query latency

Questions? Reach out to us via our contact form.

Last Updated: October 2025
Case Study Version: 1.0
Reading Time: ~25 minutes