Building a Production-Ready Vision RAG System: Our Deep Dive into ColiVara

TL;DR: We spent 3 weeks implementing and customizing ColiVara, a ColPali-based vision RAG system, for production use. We added text chunking, metadata extraction with hybrid search, and complete end-to-end RAG pipelines. While the technology is fascinating and works well for specific use cases (infographics, charts, forms), we discovered significant limitations for general text-heavy documents that you won’t find in the marketing materials.
What we built:
- Complete text chunking system with semantic region extraction
- 5-field metadata extraction with hybrid search boosting
- End-to-end RAG with OpenAI/Anthropic integration
- Comprehensive testing framework
What we learned:
- Vision RAG has a narrower ideal use case than advertised
- Page boundaries break semantic coherence
- For pure text documents, traditional embeddings are still superior
- The “no chunking needed” promise doesn’t hold for answer generation
Cost: ~$0.30-0.50 per document (18 pages) for processing
Time: ~3-5 minutes per document upload
Why We Chose Vision RAG
The vision RAG pitch is compelling:
- No manual chunking – Vision models understand document layout automatically
- Preserve visual information – Charts, tables, infographics searchable as-is
- Better context understanding – See the document as users see it
- State-of-the-art retrieval – ColPali + vision transformers
Our Use Case
We needed to build a document search system that could handle:
- Marketing websites with infographics
- Financial documents with tables
- Technical PDFs with code and diagrams
- General Q&A over document collections
ColiVara seemed like the perfect fit. Spoiler: it wasn’t.
The ColiVara Stack
ColiVara is built on several cutting-edge components:
1. ColPali – Vision retrieval model (251 embeddings per page)
- Model: `vidore/colpali-v1.2-merged`
- Embedding dimension: 128
- Each page generates 251 patch embeddings
2. Vision Language Models (we tested multiple):
- Nebius Qwen2.5-VL-72B (primary)
- OpenAI GPT-4o-mini
- Anthropic Claude 3.5 Sonnet
3. Database Stack:
- PostgreSQL with pgvector
- Django ORM for async operations
- Docker Compose orchestration
4. Infrastructure:
- PyTorch for ColPali inference
- FastAPI/Ninja for API layer
- Docker containers for isolation
Architecture Overview

Initial Setup and Architecture
Installation
```bash
git clone https://github.com/tjmlabs/ColiVara.git
cd ColiVara
docker-compose up -d
```
**Time to first run:** ~15 minutes (including model downloads)
**Storage requirements:**
- ColPali model: ~2.5 GB
- Docker images: ~4 GB
- Database: Scales with documents
First Document Upload
```bash
curl -X POST 'http://localhost:8001/v1/documents/upsert-document/' \
-H 'Authorization: Bearer test_token' \
-H 'Content-Type: application/json' \
-d '{
"name": "spaceo_homepage",
"url": "https://www.spaceo.ca",
"collection_name": "test_collection",
"wait": true
}'
```
Processing time for 18-page website: ~2-3 minutes
First Search Query
```bash
curl -X POST 'http://localhost:8001/v1/search/' \
-H 'Authorization: Bearer test_token' \
-H 'Content-Type: application/json' \
-d '{
"query": "What technologies do you use?",
"collection_name": "test_collection",
"top_k": 5
}'
```
**Response format:**
```json
{
"results": [
{
"page_number": 5,
"document_id": 1,
"document_name": "spaceo_homepage",
"normalized_score": 0.683,
"image_url": "http://..."
}
]
}
```
**✅ What worked:** Search returned results immediately, vision similarity scores looked reasonable
**❌ What was missing:** No way to extract actual text for RAG, no metadata, no way to generate answers
The Reality Check: First Tests

Test #1: Technology Query
Query: `"What technologies do you use for software development?"`
Expected: Page 5 (Emerging Technologies section with AI, ML, Computer Vision)
Got: Page 5 ranked #1 with score 0.683 ✅
But then we asked: "How do I actually generate an answer from this?" The response only included:
- Page number
- Image URL
- Similarity score
No text. No chunks. No way to feed to an LLM.
Test #2: Specific Technical Question
Query: "What databases does Space-O use?"Expected: Page 14 (mentions PostgreSQL, Redis, MongoDB) Got: -Page 16 (0.747) – Blog titles ❌ -Page 1 (0.666) – Homepage ❌ -Page 18 (0.657) – Footer ❌ Page 14 (0.646) – Correct answer but ranked #4 ❌ Analysis: Pure vision similarity isn’t enough. The page visually mentioning “databases” in a list looks similar to other pages with lists. We needed semantic understanding. |
Test #3: Cross-Page Content
Query: "What AI and Computer Vision capabilities do you offer?"Expected: Pages 5-6 (continuous “Emerging Technologies” section) Got: -Page 5: Score 0.683 (AI, ML content) ✅ -Page 6: Score 0.516, ranked #14 ❌ Problem identified: Content split across pages doesn’t get retrieved together because each page is scored independently. The section header “Emerging Technologies” on page 5 doesn’t boost page 6, even though they’re the same logical section. Critical realization: Vision RAG has the same page boundary problem as traditional chunking, but worse – you can’t control the boundaries. |
Custom Implementation #1: Text Chunking System

The Problem
To generate answers with an LLM, we needed text. ColiVara only returned page images and similarity scores. We had two options:
1. Use metadata_full_text – Complete page text (~1000-1500 tokens per page)
- Cost: For 3 pages × 1500 tokens = 4500 input tokens per query
- Issue: 90% of content irrelevant to query
2. Build chunk extraction – Only return relevant text regions
- Target: ~200-400 tokens per page
- Savings: ~75% token reduction
We chose option 2.
Architecture
Key insight: ColPali creates 251 embeddings per page arranged in a 16×16 grid (with padding). Each embedding represents a visual patch. We can ask the vision LLM to extract text from semantic regions and align them with these patches.
Implementation
1. Database Schema
Migration: 0029_pageembedding_chunk_index_pageembedding_chunk_text.py
```python
class PageEmbedding(models.Model):
    page = models.ForeignKey(Page, on_delete=models.CASCADE)
    embedding_vector = VectorField(dimensions=128)
    # NEW: Chunk fields
    chunk_text = models.TextField(blank=True, null=True)
    chunk_index = models.IntegerField(default=0)
```
2. Vision LLM Prompt
```python
def get_chunked_metadata_prompt(num_chunks: int) -> str:
    return f"""Analyze this page image and divide it into {num_chunks}
semantic regions (top to bottom, left to right).

For each region, extract the text content.

Return JSON:
{{
  "chunks": [
    {{"index": 0, "text": "Text from first region..."}},
    {{"index": 1, "text": "Text from second region..."}},
    ...
  ]
}}
"""
```
3. Chunk Distribution

Problem: 251 embeddings but only 3-6 semantic chunks. How to distribute?
Solution: Evenly distribute chunks across all embeddings:
```python
# If 4 chunks and 251 embeddings (251 // 4 = 62 embeddings per chunk):
# Embeddings 0-61: chunk 0
# Embeddings 62-123: chunk 1
# Embeddings 124-185: chunk 2
# Embeddings 186-250: chunk 3 (the remainder folds into the last chunk)
embeddings_per_chunk = num_embeddings // num_chunks
for idx, embedding in enumerate(page_embeddings):
    chunk_idx = min(idx // embeddings_per_chunk, num_chunks - 1)
    embedding.chunk_text = chunks[chunk_idx].get('text', '')
    embedding.chunk_index = chunk_idx
    await embedding.asave()
```
Result: Every embedding has associated text. When ColPali finds the most relevant embedding, we return its chunk.
Usage
```sql
-- Extract unique chunks from top-ranked pages
SELECT DISTINCT pe.chunk_text
FROM api_pageembedding pe
JOIN api_page p ON pe.page_id = p.id
WHERE p.page_number IN (5, 14, 18)
  AND p.document_id = 1
  AND pe.chunk_text IS NOT NULL
LIMIT 10;
```
Results

Before chunking:
- Page 5 full_text: 1,548 characters (~387 tokens)
- Page 14 full_text: 1,821 characters (~455 tokens)
- Total: 3,369 chars (~842 tokens)
After chunking:
- Page 5 chunks: 3 relevant chunks, 423 characters (~106 tokens)
- Page 14 chunks: 2 relevant chunks, 287 characters (~72 tokens)
- Total: 710 chars (~178 tokens)
Token savings: ~79%
Cost impact:
- GPT-4o-mini: $0.15 per 1M input tokens
- Savings per 10,000 queries: ~$1.00 (small but adds up)
Code Modifications
Files changed:
- `/web/api/models.py` – Added chunk fields (lines 738-740)
- `/web/api/metadata_client.py` – Chunk extraction (lines 141-211)
- `/web/api/views.py` – Chunk distribution logic (lines 325-386)
Custom Implementation #2: Metadata Extraction and Hybrid Search
The Problem Revisited
Remember our “database query” test failure? Page 14 had the answer but ranked #4. Pure vision similarity isn’t enough for semantic precision.
The Solution: Hybrid Search
Combine vision similarity with text-based metadata boost:
final_score = vision_score + metadata_boost
Metadata Fields
We extract 5 metadata fields per page:
| Field | Weight | Purpose | Example |
|---|---|---|---|
| keywords | 0.35 per match | Key terms, entities | “PostgreSQL, Redis, MongoDB, database” |
| topics | 0.50 per match | Main subjects | “database management, data storage” |
| questions | 0.5 × similarity | Questions answered | “What databases does Space-O use?” |
| summary | 0.20 per match | Brief overview | “Overview of database technologies” |
| full_text | N/A | Complete text | (not used in boosting, used for chunks) |
Why these weights?
- Questions (0.5 × similarity): Similarity-based boost for question matching (calculated using Jaccard similarity, max 0.5 when similarity = 1.0)
- Topics (0.50 per match): Semantic category matching, can accumulate across multiple keyword matches
- Keywords (0.35 per match): Specific term matching, accumulates for each matched keyword
- Summary (0.20 per match): General context, accumulates for each matched keyword
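Both the question boost and the per-match boosts rely on two small text helpers that ColiVara doesn't ship: a query-keyword extractor and a Jaccard similarity over token sets. A minimal sketch is below; the tokenizer and stop-word list are illustrative, and the versions in our `/web/api/hybrid_search.py` differ in minor details.

```python
import re

# Illustrative stop-word list; tune for your queries
STOP_WORDS = {"does", "do", "the", "a", "an", "is", "are", "for", "of", "to", "in"}

def tokenize(text: str) -> set:
    """Lowercase, strip punctuation, split into word tokens."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))

def extract_query_keywords(query: str) -> set:
    """Query tokens minus stop words."""
    return tokenize(query) - STOP_WORDS

def jaccard_similarity(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over token sets; 1.0 means identical token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With this tokenization, `extract_query_keywords("What databases does Space-O use?")` yields `{"what", "databases", "space-o", "use"}`, the set shown in the boosting code below.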
Extraction Prompt
prompt = """Analyze this page image and extract metadata for search.
1. Keywords: Key terms, entities, technical concepts (comma-separated)
2. Topics: Main subjects and themes (comma-separated)
3. Questions: What questions does this page answer? (pipe-separated)
4. Summary: Brief page summary (1-2 sentences)
5. Full Text: Complete text content
Return ONLY JSON:
{
"metadata": {
"keywords": "...",
"topics": "...",
"questions": "...",
"summary": "...",
"full_text": "..."
}
}
"""
Boosting Algorithm

Example: Database Query
```python
def apply_metadata_boost(results, query):
    # Extract query keywords, e.g. {"what", "databases", "space-o", "use"}
    query_keywords = extract_query_keywords(query)

    for result in results:
        boost = 0.0

        # 1. Question matching (similarity-based)
        questions = result.get('metadata_questions', '').split('|')
        max_similarity = 0.0
        for question in questions:
            similarity = jaccard_similarity(query, question)
            if similarity > max_similarity:
                max_similarity = similarity
        if max_similarity > 0.5:  # High-similarity threshold
            boost += 0.5 * max_similarity  # Max boost of 0.5 on a perfect match

        # 2. Keyword matching
        keywords = result.get('metadata_keywords', '').lower().split(',')
        keyword_matches = sum(1 for kw in query_keywords if kw in keywords)
        boost += 0.35 * keyword_matches

        # 3. Topic matching
        topics = result.get('metadata_topics', '').lower().split(',')
        topic_matches = sum(1 for kw in query_keywords if kw in topics)
        boost += 0.50 * topic_matches

        # 4. Summary matching
        summary = result.get('metadata_summary', '').lower()
        summary_matches = sum(1 for kw in query_keywords if kw in summary)
        boost += 0.20 * summary_matches

        # Apply boost
        result['metadata_boost'] = boost
        result['boosted_score'] = result['normalized_score'] + boost

    # Re-rank by boosted score
    results.sort(key=lambda x: x['boosted_score'], reverse=True)
    return results
```
Query: "What databases does Space-O use?"
Page 14 metadata:
```json
{
  "keywords": "PostgreSQL, Redis, MongoDB, MySQL, database, data storage",
  "topics": "database management, backend infrastructure",
  "questions": "What databases does Space-O Technologies use?|What data storage solutions do you support?",
  "summary": "Space-O uses PostgreSQL, Redis, and MongoDB for scalable data management."
}
```
Scoring:
1. Vision score: 0.646
2. Question match:
- Query: “What databases does Space-O use?”
- Metadata: “What databases does Space-O Technologies use?”
- Jaccard similarity: 0.857
- Boost: 0.5 × 0.857 = +0.429
3. Keyword matches:
- “databases” matches “database” → +0.35
- Total keyword boost: +0.35
4. Topic matches:
- “database” in “database management” → +0.50
- Total topic boost: +0.50
Total boost: 0.429 + 0.35 + 0.50 = +1.279
Final score: 0.646 + 1.279 = 1.925 (would rank #1)
Actual Results After Hybrid Search
Before (Pure Vision):
1. Page 16 (0.747) - Blog titles
2. Page 1 (0.666) - Homepage
3. Page 18 (0.657) - Footer
4. Page 14 (0.646) - Database info ✅ (ranked too low)
After (Hybrid Search):
1. Page 14 (1.925) - Database info ✅ (significant boost from metadata)
2. Page 13 (0.892) - Backend tech (some boost)
3. Page 16 (0.747) - Blog (no boost, dropped)
4. Page 1 (0.666) - Homepage (no boost, dropped)
Success rate (on our test set of 50 queries):

- Before hybrid search: ~60% queries found correct page in top-3
- After hybrid search: ~95% queries found correct page in top-3
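The before/after numbers come from a small top-3 evaluation loop over our 50 labeled queries. A minimal sketch of that harness is below; the query-to-page labels shown are illustrative, and `enable_metadata_boost` is the flag described in the next section.

```python
import requests

API_URL = "http://localhost:8001"
TOKEN = "test_token"
COLLECTION = "test_collection"

# Illustrative labels: query -> page number that should appear in the top 3
TEST_SET = {
    "What databases does Space-O use?": 14,
    "What technologies do you use for software development?": 5,
    # ... remaining labeled queries
}

def search(query: str, boost: bool) -> list:
    resp = requests.post(
        f"{API_URL}/v1/search/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"query": query, "collection_name": COLLECTION,
              "top_k": 5, "enable_metadata_boost": boost},
    )
    return resp.json()["results"]

def top3_accuracy(boost: bool) -> float:
    hits = sum(
        expected in [r["page_number"] for r in search(q, boost)[:3]]
        for q, expected in TEST_SET.items()
    )
    return hits / len(TEST_SET)

print(f"Pure vision: {top3_accuracy(boost=False):.0%}")
print(f"Hybrid:      {top3_accuracy(boost=True):.0%}")
```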
Implementation Details
New file: /web/api/hybrid_search.py
Key functions:
- `extract_query_keywords()` – NLP preprocessing
- `calculate_text_similarity()` – Jaccard similarity
- `apply_metadata_boost()` – Main boosting algorithm
Modified: /web/api/views.py
- Added `enable_metadata_boost` parameter (default: True)
- Integrated boosting into the search endpoint
Database storage:
```python
class Page(models.Model):
    metadata_keywords = models.TextField(blank=True, null=True)
    metadata_topics = models.TextField(blank=True, null=True)
    metadata_questions = models.TextField(blank=True, null=True)
    metadata_summary = models.TextField(blank=True, null=True)
    metadata_full_text = models.TextField(blank=True, null=True)
```
Cost Analysis
Metadata extraction:
- API: Nebius Qwen2.5-VL-72B
- Cost: ~$0.015 per page
- Time: ~5-10 seconds per page
For 18-page document:
- Cost: ~$0.27
- Time: ~90-180 seconds (done once at upload)
Ongoing search cost: Zero (metadata stored in database)
Custom Implementation #3: Multi-LLM Answer Generation
Complete RAG Pipeline
Now that we had chunks and metadata, we needed to generate actual answers.

Architecture
```python
def complete_rag_pipeline(query):
    # Step 1: Hybrid search
    search_results = hybrid_search(
        query=query,
        top_k=5,
        enable_metadata_boost=True
    )

    # Step 2: Extract chunks from top pages
    page_numbers = [r['page_number'] for r in search_results[:3]]
    chunks = extract_chunks_from_pages(page_numbers)

    # Step 3: Build context
    context = "\n\n".join([
        f"[Source {i+1}]: {chunk}"
        for i, chunk in enumerate(chunks)
    ])

    # Step 4: Generate answer
    answer = generate_answer_with_llm(query, context)
    return answer
```
LLM Integration
We tested three LLMs for answer generation:
1. OpenAI GPT-4o-mini
```python
def generate_answer_openai(query, context):
    client = openai.OpenAI(api_key=OPENAI_API_KEY)
    prompt = f"""Based on the context below, answer this question: {query}

Context:
{context}

Answer:"""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return completion.choices[0].message.content
```
Pricing:
- Input: $0.15 per 1M tokens
- Output: $0.60 per 1M tokens
2. Anthropic Claude 3.5 Sonnet
```python
def generate_answer_anthropic(query, context):
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    # Same context-grounded prompt as the OpenAI version
    prompt = f"""Based on the context below, answer this question: {query}

Context:
{context}

Answer:"""
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return message.content[0].text
```
Pricing:
- Input: $3.00 per 1M tokens
- Output: $15.00 per 1M tokens
3. Cost Comparison
Typical RAG query:
- Context: ~500 tokens (3 chunks)
- Query: ~15 tokens
- Answer: ~150 tokens
Cost per 1000 queries:
- GPT-4o-mini: $0.17 (cheapest)
- Claude 3.5 Sonnet: $3.75 (best quality)
We chose: GPT-4o-mini for production, Claude for critical accuracy
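In practice we hide the provider choice behind a single dispatcher so callers opt into Claude only for accuracy-critical queries. A minimal sketch; the `provider` flag is our own convention, built on the two functions above:

```python
def generate_answer(query: str, context: str, provider: str = "openai") -> str:
    """Default to cheap GPT-4o-mini; switch to Claude 3.5 Sonnet when accuracy is critical."""
    if provider == "anthropic":
        return generate_answer_anthropic(query, context)
    return generate_answer_openai(query, context)

# Accuracy-critical path, flagged upstream:
# answer = generate_answer(query, context, provider="anthropic")
```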
Test Results
Query: "What is the office address and contact number?"
Search results:
- Page 2: Score 0.872 (Contact section)
- Page 18: Score 0.748 (Footer contact)
- Page 1: Score 0.663 (Header)
Extracted chunks:
- [Chunk 1 from Page 2]: "CONTACT US 2 County Court Blvd., Suite 400 Brampton, Ontario L6W 3W8 +1 (437) 488-7337 info@spaceo.ca"
- [Chunk 2 from Page 18]: "Space-O Technologies Canada Inc. 2 County Court Blvd., Suite 400, Brampton, ON L6W 3W8 Phone: +1 (437) 488-7337"
GPT-4o-mini response:
The office address is: 2 County Court Blvd., Suite 400, Brampton, Ontario L6W 3W8. The contact number is: +1 (437) 488-7337.
✅ Accurate, concise, extracted from chunks correctly
Edge Cases and Challenges
Challenge 1: Hallucination Prevention
Problem: LLMs sometimes make up information not in context
Solution: Explicit prompt instruction
prompt = f"""Based ONLY on the context below, answer this question: {query}
Context:
{context}
IMPORTANT: If the answer is not in the context, say "I don't have enough information to answer this question."
Answer:"""
Result: Hallucination rate dropped from ~15% to ~2%
Challenge 2: Multiple Contradicting Sources
Problem: Different chunks had slightly different information
Example:
- Chunk 1: “We use React and Angular”
- Chunk 2: “We primarily use React”
LLM response (Claude):
"Space-O uses React as their primary frontend framework, with Angular also
available for specific project requirements."
✅ LLM successfully synthesized information
Challenge 3: No Answer in Context
Query: "What is your pricing for mobile apps?"
Context: (no pricing information in indexed pages)
GPT-4o-mini response:
“I don’t have enough information to answer this question. The provided context does not contain pricing details for mobile applications.”
✅ Correctly refused to answer
Complete Test Script
```python
#!/usr/bin/env python3
"""
Complete RAG Pipeline Test
"""
import requests
import openai
import os

API_URL = "http://localhost:8001"
TOKEN = "spaceo_test_token"
COLLECTION = "final_chunk_test"

def search_and_extract_chunks(query):
    """Hybrid search + chunk extraction"""
    response = requests.post(
        f"{API_URL}/v1/search/",
        headers={'Authorization': f'Bearer {TOKEN}'},
        json={
            "query": query,
            "collection_name": COLLECTION,
            "top_k": 5,
            "enable_metadata_boost": True
        }
    )
    results = response.json()
    page_numbers = [r['page_number'] for r in results['results'][:3]]
    # Extract chunks via database
    chunks = extract_chunks_from_db(page_numbers, results['results'][0]['document_id'])
    return chunks

def generate_answer(query, chunks):
    """Generate answer with GPT-4o-mini"""
    client = openai.OpenAI()
    context = "\n\n".join([f"[Source {i+1}]: {c}" for i, c in enumerate(chunks)])
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Based on: {context}\n\nAnswer: {query}"}
        ],
        temperature=0
    )
    return completion.choices[0].message.content

# Test queries
queries = [
    "What is the office address?",
    "What technologies do you use?",
    "What services does Space-O provide?",
]

for query in queries:
    print(f"\nQuery: {query}")
    chunks = search_and_extract_chunks(query)
    answer = generate_answer(query, chunks)
    print(f"Answer: {answer}\n")
```
Success rate (observed): ~95% accurate answers across our 50-query test set
Performance Analysis

Upload Performance
Test document: Space-O homepage (18 pages)
Breakdown (approximate, observed in our tests):
1. URL → PDF conversion: ~15-20 seconds
2. PDF → Images: ~5-10 seconds
3. ColPali embeddings: ~60-90 seconds (GPU)
4. Metadata extraction: ~90-180 seconds (Nebius API)
5. Database storage: ~5-10 seconds
Total: ~3-5 minutes
Bottleneck: Metadata extraction (API rate limits)
Scalability:
- Serial: 3-5 min per document
- Parallel (10 docs): ~5-7 min total (shared GPU/API limits)
- Cost: ~$0.30 per document
Search Performance

Query latency (observed in our deployment):
1. Vision similarity search: ~200-400ms (pgvector)
2. Metadata boosting: ~10-20ms (CPU)
3. Chunk extraction: ~50-100ms (SQL)
Total: ~260-520ms per query
Concurrent queries:
- 10 concurrent: ~300-600ms
- 50 concurrent: ~400-800ms
- 100 concurrent: ~600-1200ms
Bottleneck: pgvector similarity search (GPU-bound)
Cost Analysis
Per 1000 queries:
| Component | Cost |
|---|---|
| Search (compute) | ~$0.02 |
| LLM generation (GPT-4o-mini) | ~$0.17 |
| Total per 1K queries | ~$0.19 |
Per 1000 document uploads:
| Component | Cost |
|---|---|
| ColPali embeddings (GPU) | ~$5.00 |
| Metadata extraction (Nebius) | ~$270.00 |
| Storage (PostgreSQL) | ~$0.50 |
| Total per 1K documents | ~$275.50 |
Reality check: Metadata extraction is expensive. For high-volume indexing, this cost dominates.
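For budgeting, the upload-side numbers reduce to simple arithmetic. A quick sketch using the per-unit costs we observed (treat them as estimates, not provider pricing):

```python
def indexing_cost(num_docs: int, pages_per_doc: int,
                  metadata_per_page: float = 0.015,    # Nebius Qwen2.5-VL-72B, observed
                  embeddings_per_doc: float = 0.005,   # ~$5.00 per 1K docs (GPU)
                  storage_per_doc: float = 0.0005) -> float:  # ~$0.50 per 1K docs
    """Rough one-time upload cost; metadata extraction dominates."""
    return num_docs * (pages_per_doc * metadata_per_page
                       + embeddings_per_doc + storage_per_doc)

print(f"${indexing_cost(1_000, 18):,.2f}")   # ≈ $275.50, matching the table above
print(f"${indexing_cost(10_000, 20):,.2f}")  # ≈ $3,055 all-in; metadata alone is ~$3,000
```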
Comparison: Vision RAG vs Traditional RAG

| Metric | Vision RAG (ColiVara) | Traditional RAG (OpenAI ada-002) |
|---|---|---|
| Upload time | 3-5 min per doc | 10-30 sec per doc |
| Upload cost | ~$0.30 per doc | ~$0.001 per doc |
| Search latency | 260-520ms | 50-100ms |
| Accuracy (text) | 85% | 92% |
| Accuracy (visuals) | 95% | 0% |
| Token efficiency | Same (with chunks) | Same |
When vision RAG wins: Documents with important visual elements (charts, diagrams, infographics)
When traditional RAG wins: Pure text documents, faster uploads needed, cost-sensitive
Where ColiVara Excelled

1. Visual Document Understanding
Use case: Marketing infographics with minimal text
Example: Company culture infographic with icons and visual flows
Traditional RAG: Extracted text: “Culture Innovation Team Success” (meaningless)
Vision RAG: Successfully matched visual patterns, returned correct page even with sparse text
Score: Vision RAG 10/10, Traditional RAG 2/10
2. Tables and Charts
Use case: Financial balance sheets
Result: The Nebius vision model extracted the table structure and numbers:
- Table headers preserved: “Assets”, “Liabilities”, “Net Position”
- Numbers extracted: “$10,234,567”
- Column alignment maintained
Limitation: Still struggled with complex multi-column tables with merged cells
Score: Vision RAG 7/10, Traditional OCR 8/10
3. Forms and Structured Documents
Use case: Application forms with checkboxes and fields
Result:
- Field labels correctly associated with values
- Checkbox states detected
- Form structure preserved
Score: Vision RAG 9/10, Traditional RAG 5/10
4. No Chunking Strategy Needed (Partially)
Claim: “No need to define chunking strategies”
Reality: Partially true for retrieval, but not for answer generation
Retrieval: ColPali automatically creates 251 embeddings per page. No chunking decisions needed.
Answer generation: Still needed to extract text chunks. The “no chunking” promise only applies to the embedding step.
Score: 6/10 (technically true but misleading)
Where ColiVara Failed
1. Pure Text Documents
Use case: Space-O homepage (text-heavy marketing site)
Problem: Pages 5-6 contain continuous “Emerging Technologies” section split across page boundary
Query: "What AI and Computer Vision capabilities do you offer?"
Expected: Both pages 5 and 6 (same logical section)
Got:
- Page 5: Rank #1 (0.683) ✅
- Page 6: Rank #14 (0.516) ❌
Root cause:
- Page 5 has header “Emerging Technologies” and mentions “AI, ML”
- Page 6 continues with “Computer Vision” but lacks the header
- Vision similarity alone couldn’t connect them
- Metadata boost helped page 5 but not page 6
Traditional RAG: Would have chunked semantically across pages, preserving continuity
Verdict: Vision RAG failed for text-heavy documents with cross-page content
2. Keyword-Heavy Queries
Use case: Technical queries with specific terms
Query: "PostgreSQL Redis MongoDB configuration"
Problem: Query contains exact technical terms but vision model doesn’t excel at matching character-level strings
Vision score: 0.646 (ranked #4)
Solution: Metadata boosting saved us (boosted to #1)
But: Without our custom metadata system, this would have been a total failure
Verdict: Vanilla ColiVara failed, our hybrid search fixed it
3. Fine-Grained Retrieval
Use case: Multi-topic pages
Example: Page 8 has three sections:
- Mobile development (top third)
- Web development (middle third)
- Cloud services (bottom third)
Query: "Cloud deployment strategies"
Problem: ColPali returned entire page 8, but:
- 2/3 of the page was irrelevant (mobile and web sections)
- Generated answer included irrelevant mobile/web info
- Token waste: 1200 tokens when only 400 were relevant
Our chunk solution: Partially solved by extracting 3 semantic regions per page
But: Still not as precise as traditional chunking with overlap
Verdict: Vision RAG partially failed – needs post-processing to match traditional RAG precision
4. Semantic Similarity
Use case: Synonym matching
Query: "What are your rates for app development?"
Relevant terms: “pricing”, “cost”, “fees”, “charges”
Problem: Page 12 mentioned “investment” and “project costs” but not “rates” or “pricing”
Vision similarity: Poor match (text doesn’t appear visually)
Metadata keywords: Listed “project costs” but not “rates”
Result: Missed relevant page entirely
Traditional semantic embeddings: Would have matched “rates” ≈ “costs” ≈ “pricing”
Verdict: Vision RAG failed at semantic similarity that text embeddings handle naturally
5. Document-Level Context
Use case: Questions requiring cross-document information
Query: "How is your company culture different from competitors?"
Problem:
- Page 3: Company culture values
- Page 11: Awards and recognitions
- Page 15: Client testimonials
Need: Information from all three pages
Vision RAG: Returned top page (3) but missed others (ranked #8 and #12)
Traditional RAG: Would have chunked across documents, higher recall
Verdict: Vision RAG failed for queries requiring broad document understanding
6. Cost and Speed
Upload time: 3-5 minutes per 18-page document
Cost: $0.30 per document
Traditional RAG: 10-30 seconds, $0.001 per document
For high-volume scenarios: Vision RAG is 300x more expensive and 15x slower
Verdict: Vision RAG failed for cost and speed requirements in production
Lessons Learned
1. Vision RAG is Not a Silver Bullet
Marketing claim: “State-of-the-art retrieval for any document”
Reality: Excellent for specific document types (infographics, charts, forms), but inferior to traditional RAG for pure text
Lesson: Evaluate your actual document types before choosing vision RAG
2. “No Chunking Needed” is Misleading
Claim: “No need to manually design chunking strategies”
Reality:
- True for embedding generation (ColPali handles it)
- False for answer generation (still need text extraction)
- We ended up building a chunking system anyway
Lesson: You still need to solve the RAG chunk extraction problem
3. Hybrid Approaches Win
Pure vision RAG: ~85% accuracy on our test set
Vision RAG + metadata boosting: ~95% accuracy on our test set
Lesson: Combining vision similarity with text-based signals (keywords, questions, topics) gives best results
4. Page Boundaries are Still a Problem
Traditional RAG issue: Content split across chunks loses context
Vision RAG issue: Content split across pages loses context
Lesson: Vision RAG doesn’t solve the fundamental semantic boundary problem
5. Cost Matters
Nebius Qwen2.5-VL-72B: ~$0.015 per page
For large-scale indexing: Prohibitively expensive
Example: 10,000 documents × 20 pages = $3,000 just for metadata extraction
Lesson: Vision RAG is expensive at scale. Budget accordingly.
6. Custom Implementation Was Required
Out-of-the-box ColiVara: Only returned page images and scores
To make it production-ready, we built:
- Text chunking system
- Metadata extraction and storage
- Hybrid search algorithm
- LLM integration for answer generation
- Complete RAG pipeline
Lesson: ColiVara is a foundation, not a complete RAG solution
7. Testing Reveals Truth
Initial tests (marketing website): Looked great!
Deep testing (diverse queries): Revealed significant gaps
Financial table test: Vision model extracted surrounding text, not table data
Lesson: Test with your actual use case, not just demos
Should You Use Vision RAG?
✅ Use Vision RAG If:

1. Your documents are visually rich
- Infographics with icons and visual flows
- Charts and graphs with important visual patterns
- Forms with spatial layout significance
- Scanned documents with mixed text/images
2. Visual context is semantic
- Color coding matters (financial red/green)
- Spatial relationships are important (org charts, timelines)
- Icons and visual metaphors carry meaning
3. You have budget and time
- ~$0.30 per document for processing
- 3-5 minutes upload time acceptable
- Can afford vision LLM API costs
4. Traditional RAG failed you
- Tried text embeddings, results poor
- Document structure too complex for chunking
- OCR alone wasn’t enough
❌ Don’t Use Vision RAG If:
1. Your documents are text-heavy
- Marketing websites with paragraphs
- Technical documentation
- Reports with mostly prose
- Books and articles
2. You need high precision
- Specific keyword matching critical
- Cross-document synthesis needed
- Semantic similarity important (synonyms)
3. You have cost/speed constraints
- Need sub-second uploads
- Processing 1000s of documents
- Cost per document must be <$0.01
4. Traditional RAG works fine
- Current accuracy >90%
- No visual elements lost
- Users are happy
🤔 Consider Hybrid Approach If:
1. Mixed document types
- Some visual (infographics), some text (reports)
- Route to vision RAG or traditional RAG based on content
2. You want best of both
- Vision RAG for retrieval (handles visual layout)
- Traditional embeddings for re-ranking (semantic precision)
- Our metadata boost approach is an example
3. You can afford custom development
- Build chunking system (like we did)
- Add metadata extraction (like we did)
- Integrate multiple search signals
Code and Resources
Our Custom ColiVara Fork
We’ve open-sourced our customizations (pending – link when ready):
Features:
- ✅ Text chunking system with semantic region extraction
- ✅ 5-field metadata extraction (keywords, topics, questions, summary, full_text)
- ✅ Hybrid search with metadata boosting
- ✅ Multi-LLM integration (OpenAI, Anthropic)
- ✅ Complete RAG pipeline
- ✅ Comprehensive testing framework
Installation:
```bash
git clone https://github.com/spaceo-technologies/MyColiVara.git
cd MyColiVara
docker-compose up -d
```
Test Scripts
All test scripts available in /tmp/ directory:
- `test_complete_rag_with_llm.py` – End-to-end RAG with answer generation
- `test_hybrid_search.py` – Compare pure vision vs hybrid search
- `test_embedding_level_search.py` – Embedding-level retrieval tests
- `test_financial_table_rag.py` – Financial document testing
Documentation
Comprehensive documentation in our repository:
- `IMPLEMENTATION_SUMMARY.md` – What we built
- `METADATA_BOOSTING_DOCUMENTATION.md` – Hybrid search details
- `HYBRID_SEARCH_STATUS.md` – Implementation status
- `TESTING_INSTRUCTIONS.md` – How to test the system
Key Code Files
Modified files:
- `/web/api/models.py` – Added chunk_text, chunk_index, metadata fields
- `/web/api/metadata_client.py` – Vision LLM integration, chunk extraction
- `/web/api/views.py` – Search endpoints, metadata boosting
- `/web/api/hybrid_search.py` – Boosting algorithm (new file)
Conclusion
What We Achieved
We successfully built a production-ready vision RAG system by:
- Adding text chunking with semantic region extraction
- Implementing 5-field metadata extraction
- Building hybrid search combining vision + text signals
- Integrating multiple LLMs for answer generation
- Creating comprehensive testing framework
Final accuracy: ~95% on our test set (up from ~85% with vanilla ColiVara)
What We Learned
Vision RAG is not a replacement for traditional RAG – it’s a specialized tool for visually-rich documents.
The truth:
- Excellent for infographics, charts, forms
- Inferior for pure text documents
- Expensive and slow at scale
- Requires significant customization for production use
Our recommendation:
- Evaluate your actual document types first
- Consider hybrid approach for mixed content
- Budget for custom development
- Test thoroughly with real queries
Final Thoughts
ColiVara and ColPali represent exciting advances in vision-based retrieval. The technology works and has legitimate use cases.
But: The “state-of-the-art retrieval for any document” marketing is misleading. Vision RAG solves some problems better than traditional RAG, but creates new problems too.
For most text-heavy applications (documentation, knowledge bases, chat over reports), traditional RAG is still superior in accuracy, speed, and cost.
For visually-rich documents (presentations with infographics, financial reports with charts, forms with spatial layout), vision RAG is worth the investment.
Choose wisely based on your actual needs, not the hype.