TL;DR: We spent 3 weeks implementing and customizing ColiVara, a ColPali-based vision RAG system, for production use. We added text chunking, metadata extraction with hybrid search, and complete end-to-end RAG pipelines. While the technology is fascinating and works well for specific use cases (infographics, charts, forms), we discovered significant limitations for general text-heavy documents that you won’t find in the marketing materials.
Cost: ~$0.30-0.50 per document (18 pages) for processing
Time: ~3-5 minutes per document upload
The vision RAG pitch is compelling:
We needed to build a document search system that could handle:
ColiVara seemed like the perfect fit. Spoiler: it wasn’t.
ColiVara is built on several cutting-edge components:
1. ColPali – Vision retrieval model (251 embeddings per page)
2. Vision Language Models (we tested multiple):
3. Database Stack:
4. Infrastructure:
### Installation
```bash
git clone https://github.com/tjmlabs/ColiVara.git
cd ColiVara
docker-compose up -d
```
**Time to first run:** ~15 minutes (including model downloads)
**Storage requirements:**
- ColPali model: ~2.5 GB
- Docker images: ~4 GB
- Database: Scales with documents
### First Document Upload
```bash
curl -X POST 'http://localhost:8001/v1/documents/upsert-document/' \
-H 'Authorization: Bearer test_token' \
-H 'Content-Type: application/json' \
-d '{
"name": "spaceo_homepage",
"url": "https://www.spaceo.ca",
"collection_name": "test_collection",
"wait": true
}'
```
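For scripted uploads, here is a minimal Python equivalent of the curl call above — a sketch assuming the same local endpoint and test token; the long timeout reflects the multi-minute processing we observed:

```python
import requests

response = requests.post(
    "http://localhost:8001/v1/documents/upsert-document/",
    headers={"Authorization": "Bearer test_token"},
    json={
        "name": "spaceo_homepage",
        "url": "https://www.spaceo.ca",
        "collection_name": "test_collection",
        "wait": True,  # block until the embeddings are ready
    },
    timeout=600,  # uploads took several minutes in our tests
)
print(response.status_code, response.json())
```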
Processing time for 18-page website: ~2-3 minutes
```bash
curl -X POST 'http://localhost:8001/v1/search/' \
-H 'Authorization: Bearer test_token' \
-H 'Content-Type: application/json' \
-d '{
"query": "What technologies do you use?",
"collection_name": "test_collection",
"top_k": 5
}'
```
**Response format:**
```json
{
"results": [
{
"page_number": 5,
"document_id": 1,
"document_name": "spaceo_homepage",
"normalized_score": 0.683,
"image_url": "http://..."
}
]
}
```
**✅ What worked:** Search returned results immediately, vision similarity scores looked reasonable
**❌ What was missing:** No way to extract actual text for RAG, no metadata, no way to generate answers
Query: "What technologies do you use for software development?"
Expected: Page 5 (Emerging Technologies section with AI, ML, Computer Vision)
Got: Page 5 ranked #1 with score 0.683 ✅

But then we asked: "How do I actually generate an answer from this?" The response only included:
- Page number
- Image URL
- Similarity score

No text. No chunks. No way to feed to an LLM.
Query: "What databases does Space-O use?" Expected: Page 14 (mentions PostgreSQL, Redis, MongoDB) Got: -Page 16 (0.747) – Blog titles ❌ -Page 1 (0.666) – Homepage ❌ -Page 18 (0.657) – Footer ❌ Page 14 (0.646) – Correct answer but ranked #4 ❌ Analysis: Pure vision similarity isn’t enough. The page visually mentioning “databases” in a list looks similar to other pages with lists. We needed semantic understanding. |
Query: "What AI and Computer Vision capabilities do you offer?" Expected: Pages 5-6 (continuous “Emerging Technologies” section) Got: -Page 5: Score 0.683 (AI, ML content) ✅ -Page 6: Score 0.516, ranked #14 ❌ Problem identified: Content split across pages doesn’t get retrieved together because each page is scored independently. The section header “Emerging Technologies” on page 5 doesn’t boost page 6, even though they’re the same logical section. Critical realization: Vision RAG has the same page boundary problem as traditional chunking, but worse – you can’t control the boundaries. |
To generate answers with an LLM, we needed text. ColiVara only returned page images and similarity scores. We had two options:
1. Use metadata_full_text – Complete page text (~1000-1500 tokens per page)
2. Build chunk extraction – Only return relevant text regions
We chose option 2.
Key insight: ColPali creates 251 embeddings per page arranged in a 16×16 grid (with padding). Each embedding represents a visual patch. We can ask the vision LLM to extract text from semantic regions and align them with these patches.
Migration: 0029_pageembedding_chunk_index_pageembedding_chunk_text.py
```python
class PageEmbedding(models.Model):
    page = models.ForeignKey(Page, on_delete=models.CASCADE)
    embedding_vector = VectorField(dimensions=128)

    # NEW: Chunk fields
    chunk_text = models.TextField(blank=True, null=True)
    chunk_index = models.IntegerField(default=0)
```
```python
def get_chunked_metadata_prompt(num_chunks: int) -> str:
    return f"""Analyze this page image and divide it into {num_chunks}
semantic regions (top to bottom, left to right).

For each region, extract the text content.

Return JSON:
{{
  "chunks": [
    {{"index": 0, "text": "Text from first region..."}},
    {{"index": 1, "text": "Text from second region..."}},
    ...
  ]
}}
"""
```
Problem: 251 embeddings but only 3-6 semantic chunks. How to distribute?
Solution: Evenly distribute chunks across all embeddings:
```python
# If 4 chunks and 251 embeddings, embeddings_per_chunk = 251 // 4 = 62:
# Embeddings 0-61    -> chunk 0
# Embeddings 62-123  -> chunk 1
# Embeddings 124-185 -> chunk 2
# Embeddings 186-250 -> chunk 3 (the remainder folds into the last chunk)
embeddings_per_chunk = num_embeddings // num_chunks

for idx, embedding in enumerate(page_embeddings):
    chunk_idx = min(idx // embeddings_per_chunk, num_chunks - 1)
    embedding.chunk_text = chunks[chunk_idx].get('text', '')
    embedding.chunk_index = chunk_idx
    await embedding.asave()
```
Result: Every embedding has associated text. When ColPali finds the most relevant embedding, we return its chunk.
```sql
-- Extract unique chunks from top-ranked pages
SELECT DISTINCT pe.chunk_text
FROM api_pageembedding pe
JOIN api_page p ON pe.page_id = p.id
WHERE p.page_number IN (5, 14, 18)
  AND p.document_id = 1
  AND pe.chunk_text IS NOT NULL
LIMIT 10;
```
Before chunking:
After chunking:
Token savings: ~79%
Cost impact:
Files changed:
- /web/api/models.py – Added chunk fields (Lines 738-740)
- /web/api/metadata_client.py – Chunk extraction (Lines 141-211)
- /web/api/views.py – Chunk distribution logic (Lines 325-386)

Remember our "database query" test failure? Page 14 had the answer but ranked #4. Pure vision similarity isn't enough for semantic precision.
Combine vision similarity with text-based metadata boost:
final_score = vision_score + metadata_boost
We extract 5 metadata fields per page:
| Field | Weight | Purpose | Example |
|---|---|---|---|
| keywords | 0.35 per match | Key terms, entities | "PostgreSQL, Redis, MongoDB, database" |
| topics | 0.50 per match | Main subjects | "database management, data storage" |
| questions | 0.5 × similarity | Questions answered | "What databases does Space-O use?" |
| summary | 0.20 per match | Brief overview | "Overview of database technologies" |
| full_text | N/A | Complete text | (not used in boosting, used for chunks) |
prompt = """Analyze this page image and extract metadata for search.
1. Keywords: Key terms, entities, technical concepts (comma-separated)
2. Topics: Main subjects and themes (comma-separated)
3. Questions: What questions does this page answer? (pipe-separated)
4. Summary: Brief page summary (1-2 sentences)
5. Full Text: Complete text content
Return ONLY JSON:
{
"metadata": {
"keywords": "...",
"topics": "...",
"questions": "...",
"summary": "...",
"full_text": "..."
}
}
"""
```python
def apply_metadata_boost(results, query):
    # Extract query keywords
    query_keywords = extract_query_keywords(query)
    # {"what", "databases", "space-o", "use"}

    for result in results:
        boost = 0.0

        # 1. Question Matching (similarity-based)
        questions = result.get('metadata_questions', '').split('|')
        max_similarity = 0.0
        for question in questions:
            similarity = jaccard_similarity(query, question)
            if similarity > max_similarity:
                max_similarity = similarity
        if max_similarity > 0.5:  # High similarity threshold
            boost += 0.5 * max_similarity  # Max boost of 0.5 on a perfect match

        # 2. Keyword Matching
        keywords = result.get('metadata_keywords', '').lower().split(',')
        keyword_matches = sum(1 for kw in query_keywords if kw in keywords)
        boost += 0.35 * keyword_matches

        # 3. Topic Matching
        topics = result.get('metadata_topics', '').lower().split(',')
        topic_matches = sum(1 for kw in query_keywords if kw in topics)
        boost += 0.50 * topic_matches

        # 4. Summary Matching
        summary = result.get('metadata_summary', '').lower()
        summary_matches = sum(1 for kw in query_keywords if kw in summary)
        boost += 0.20 * summary_matches

        # Apply boost
        result['metadata_boost'] = boost
        result['boosted_score'] = result['normalized_score'] + boost

    # Re-rank by boosted score
    results.sort(key=lambda x: x['boosted_score'], reverse=True)
    return results
```
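The two helpers used above, extract_query_keywords() and jaccard_similarity() (the latter surfaced as calculate_text_similarity() in hybrid_search.py), aren't part of ColiVara. Here is a minimal sketch of how they can be implemented — tokenization, a small stop-word list, and token-set Jaccard overlap; our production versions differ in detail:

```python
import re

# Small stop-word list; dropping "does" etc. gives
# "What databases does Space-O use?" -> {"what", "databases", "space-o", "use"}
STOP_WORDS = {"does", "do", "is", "are", "the", "a", "an", "of", "for", "to"}

def extract_query_keywords(query: str) -> set:
    """Lowercase, tokenize, and drop stop words to keep the query's content terms."""
    tokens = re.findall(r"[a-z0-9\-]+", query.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set_a = set(re.findall(r"[a-z0-9\-]+", text_a.lower()))
    set_b = set(re.findall(r"[a-z0-9\-]+", text_b.lower()))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
```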
Query: "What databases does Space-O use?"
Page 14 metadata:
```json
{
  "keywords": "PostgreSQL, Redis, MongoDB, MySQL, database, data storage",
  "topics": "database management, backend infrastructure",
  "questions": "What databases does Space-O Technologies use?|What data storage solutions do you support?",
  "summary": "Space-O uses PostgreSQL, Redis, and MongoDB for scalable data management."
}
```
Scoring:
1. Vision score: 0.646
2. Question match: 0.5 × similarity = +0.429
3. Keyword matches: +0.35
4. Topic matches: +0.50
Total boost: 0.429 + 0.35 + 0.50 = +1.279
Final score: 0.646 + 1.279 = 1.925 (would rank #1)
Before (Pure Vision):
1. Page 16 (0.747) - Blog titles
2. Page 1 (0.666) - Homepage
3. Page 18 (0.657) - Footer
4. Page 14 (0.646) - Database info ✅ (ranked too low)
After (Hybrid Search):
1. Page 14 (1.925) - Database info ✅ (significant boost from metadata)
2. Page 13 (0.892) - Backend tech (some boost)
3. Page 16 (0.747) - Blog (no boost, dropped)
4. Page 1 (0.666) - Homepage (no boost, dropped)
Success rate (on our test set of 50 queries):
New file: /web/api/hybrid_search.py
Key functions:
- extract_query_keywords() – NLP preprocessing
- calculate_text_similarity() – Jaccard similarity
- apply_metadata_boost() – Main boosting algorithm

Modified: /web/api/views.py
- enable_metadata_boost parameter (default: True)

Database storage:
```python
class Page(models.Model):
    metadata_keywords = models.TextField(blank=True, null=True)
    metadata_topics = models.TextField(blank=True, null=True)
    metadata_questions = models.TextField(blank=True, null=True)
    metadata_summary = models.TextField(blank=True, null=True)
    metadata_full_text = models.TextField(blank=True, null=True)
```
Metadata extraction:
For 18-page document:
Ongoing search cost: Zero (metadata stored in database)
Now that we had chunks and metadata, we needed to generate actual answers.
```python
def complete_rag_pipeline(query):
    # Step 1: Hybrid search
    search_results = hybrid_search(
        query=query,
        top_k=5,
        enable_metadata_boost=True
    )

    # Step 2: Extract chunks from top pages
    page_numbers = [r['page_number'] for r in search_results[:3]]
    chunks = extract_chunks_from_pages(page_numbers)

    # Step 3: Build context
    context = "\n\n".join([
        f"[Source {i+1}]: {chunk}"
        for i, chunk in enumerate(chunks)
    ])

    # Step 4: Generate answer
    answer = generate_answer_with_llm(query, context)
    return answer
```
We tested three LLMs for answer generation:
Pricing:
```python
def generate_answer_openai(query, context):
    client = openai.OpenAI(api_key=OPENAI_API_KEY)

    prompt = f"""Based on the context below, answer this question: {query}

Context:
{context}

Answer:"""

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return completion.choices[0].message.content
```
```python
def generate_answer_anthropic(query, context):
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

    # Same prompt construction as the OpenAI variant above
    prompt = f"""Based on the context below, answer this question: {query}

Context:
{context}

Answer:"""

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return message.content[0].text
```
Pricing:
Typical RAG query:
Cost per 1000 queries:
We chose: GPT-4o-mini for production, Claude for critical accuracy
Query: "What is the office address and contact number?"
Search results:
- Page 2: Score 0.872 (Contact section)
- Page 18: Score 0.748 (Footer contact)
- Page 1: Score 0.663 (Header)
Extracted chunks:
[Chunk 1 from Page 2]: "CONTACT US 2 County Court Blvd., Suite 400 Brampton, Ontario L6W 3W8 +1 (437) 488-7337 info@spaceo.ca"
[Chunk 2 from Page 18]: "Space-O Technologies Canada Inc. 2 County Court Blvd., Suite 400, Brampton, ON L6W 3W8 Phone: +1 (437) 488-7337"
GPT-4o-mini response:
The office address is: 2 County Court Blvd., Suite 400 Brampton, Ontario L6W 3W8
The contact number is: +1 (437) 488-7337
✅ Accurate, concise, extracted from chunks correctly
Problem: LLMs sometimes make up information not in context
Solution: Explicit prompt instruction
prompt = f"""Based ONLY on the context below, answer this question: {query}
Context:
{context}
IMPORTANT: If the answer is not in the context, say "I don't have enough information to answer this question."
Answer:"""
Result: Hallucination rate dropped from ~15% to ~2%
Problem: Different chunks had slightly different information
Example:
LLM response (Claude):
"Space-O uses React as their primary frontend framework, with Angular also
available for specific project requirements."
✅ LLM successfully synthesized information
Query: "What is your pricing for mobile apps?"
Context: (no pricing information in indexed pages)
GPT-4o-mini response:
“I don’t have enough information to answer this question. The provided context does not contain pricing details for mobile applications.”
✅ Correctly refused to answer
```python
#!/usr/bin/env python3
"""
Complete RAG Pipeline Test
"""
import requests
import openai
import os

API_URL = "http://localhost:8001"
TOKEN = "spaceo_test_token"
COLLECTION = "final_chunk_test"

def search_and_extract_chunks(query):
    """Hybrid search + chunk extraction"""
    response = requests.post(
        f"{API_URL}/v1/search/",
        headers={'Authorization': f'Bearer {TOKEN}'},
        json={
            "query": query,
            "collection_name": COLLECTION,
            "top_k": 5,
            "enable_metadata_boost": True
        }
    )
    results = response.json()
    page_numbers = [r['page_number'] for r in results['results'][:3]]

    # Extract chunks via database
    chunks = extract_chunks_from_db(page_numbers, results['results'][0]['document_id'])
    return chunks

def generate_answer(query, chunks):
    """Generate answer with GPT-4o-mini"""
    client = openai.OpenAI()
    context = "\n\n".join([f"[Source {i+1}]: {c}" for i, c in enumerate(chunks)])

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Based on: {context}\n\nAnswer: {query}"}
        ],
        temperature=0
    )
    return completion.choices[0].message.content

# Test queries
queries = [
    "What is the office address?",
    "What technologies do you use?",
    "What services does Space-O provide?",
]

for query in queries:
    print(f"\nQuery: {query}")
    chunks = search_and_extract_chunks(query)
    answer = generate_answer(query, chunks)
    print(f"Answer: {answer}\n")
```
Success rate: ~95% accurate answers across our 50-query evaluation set
Test document: Space-O homepage (18 pages)
Breakdown (approximate, observed in our tests):
1. URL → PDF conversion: ~15-20 seconds
2. PDF → Images: ~5-10 seconds
3. ColPali embeddings: ~60-90 seconds (GPU)
4. Metadata extraction: ~90-180 seconds (Nebius API)
5. Database storage: ~5-10 seconds
Total: ~3-5 minutes
Bottleneck: Metadata extraction (API rate limits)
Scalability:
Query latency (observed in our deployment):
1. Vision similarity search: ~200-400ms (pgvector)
2. Metadata boosting: ~10-20ms (CPU)
3. Chunk extraction: ~50-100ms (SQL)
Total: ~260-520ms per query
Concurrent queries:
Bottleneck: pgvector similarity search (GPU-bound)
Per 1000 queries:
| Component | Cost |
|---|---|
| Search (compute) | ~$0.02 |
| LLM generation (GPT-4o-mini) | ~$0.17 |
| Total per 1K queries | ~$0.19 |
Per 1000 document uploads:
| Component | Cost |
|---|---|
| ColPali embeddings (GPU) | ~$5.00 |
| Metadata extraction (Nebius) | ~$270.00 |
| Storage (PostgreSQL) | ~$0.50 |
| Total per 1K documents | ~$275.50 |
Reality check: Metadata extraction is expensive. For high-volume indexing, this cost dominates.
| Metric | Vision RAG (ColiVara) | Traditional RAG (OpenAI ada-002) |
|---|---|---|
| Upload time | 3-5 min per doc | 10-30 sec per doc |
| Upload cost | ~$0.30 per doc | ~$0.001 per doc |
| Search latency | 260-520ms | 50-100ms |
| Accuracy (text) | 85% | 92% |
| Accuracy (visuals) | 95% | 0% |
| Token efficiency | Same (with chunks) | Same |
When vision RAG wins: Documents with important visual elements (charts, diagrams, infographics)
When traditional RAG wins: Pure text documents, faster uploads needed, cost-sensitive
Use case: Marketing infographics with minimal text
Example: Company culture infographic with icons and visual flows
Traditional RAG: Extracted text: “Culture Innovation Team Success” (meaningless)
Vision RAG: Successfully matched visual patterns, returned correct page even with sparse text
Score: Vision RAG 10/10, Traditional RAG 2/10
Use case: Financial balance sheets
Challenge: Nebius vision model extracted table structure and numbers
Result:
Limitation: Still struggled with complex multi-column tables with merged cells
Score: Vision RAG 7/10, Traditional OCR 8/10
Use case: Application forms with checkboxes and fields
Result:
Score: Vision RAG 9/10, Traditional RAG 5/10
Claim: “No need to define chunking strategies”
Reality: Partially true for retrieval, but not for answer generation
Retrieval: ColPali automatically creates 251 embeddings per page. No chunking decisions needed.
Answer generation: Still needed to extract text chunks. The “no chunking” promise only applies to the embedding step.
Score: 6/10 (technically true but misleading)
Use case: Space-O homepage (text-heavy marketing site)
Problem: Pages 5-6 contain continuous “Emerging Technologies” section split across page boundary
Query: "What AI and Computer Vision capabilities do you offer?"
Expected: Both pages 5 and 6 (same logical section)
Got:
Root cause:
Traditional RAG: Would have chunked semantically across pages, preserving continuity
Verdict: Vision RAG failed for text-heavy documents with cross-page content
Use case: Technical queries with specific terms
Query: "PostgreSQL Redis MongoDB configuration"
Problem: Query contains exact technical terms but vision model doesn’t excel at matching character-level strings
Vision score: 0.646 (ranked #4)
Solution: Metadata boosting saved us (boosted to #1)
But: Without our custom metadata system, this would have been a total failure
Verdict: Vanilla ColiVara failed, our hybrid search fixed it
Use case: Multi-topic pages
Example: Page 8 has three sections:
Query: "Cloud deployment strategies"
Problem: ColPali returned entire page 8, but:
Our chunk solution: Partially solved by extracting 3 semantic regions per page
But: Still not as precise as traditional chunking with overlap
Verdict: Vision RAG partially failed – needs post-processing to match traditional RAG precision
Use case: Synonym matching
Query: "What are your rates for app development?"
Relevant terms: “pricing”, “cost”, “fees”, “charges”
Problem: Page 12 mentioned “investment” and “project costs” but not “rates” or “pricing”
Vision similarity: Poor match (text doesn’t appear visually)
Metadata keywords: Listed “project costs” but not “rates”
Result: Missed relevant page entirely
Traditional semantic embeddings: Would have matched “rates” ≈ “costs” ≈ “pricing”
Verdict: Vision RAG failed at semantic similarity that text embeddings handle naturally
Use case: Questions requiring cross-document information
Query: "How is your company culture different from competitors?"
Problem:
Need: Information from all three pages
Vision RAG: Returned top page (3) but missed others (ranked #8 and #12)
Traditional RAG: Would have chunked across documents, higher recall
Verdict: Vision RAG failed for queries requiring broad document understanding
Upload time: 3-5 minutes per 18-page document
Cost: $0.30 per document
Traditional RAG: 10-30 seconds, $0.001 per document
For high-volume scenarios: Vision RAG is 300x more expensive and 15x slower
Verdict: Vision RAG failed for cost and speed requirements in production
Marketing claim: “State-of-the-art retrieval for any document”
Reality: Excellent for specific document types (infographics, charts, forms), but inferior to traditional RAG for pure text
Lesson: Evaluate your actual document types before choosing vision RAG
Claim: “No need to manually design chunking strategies”
Reality:
Lesson: You still need to solve the RAG chunk extraction problem
Pure vision RAG: ~85% accuracy on our test set
Vision RAG + metadata boosting: ~95% accuracy on our test set
Lesson: Combining vision similarity with text-based signals (keywords, questions, topics) gives best results
Traditional RAG issue: Content split across chunks loses context
Vision RAG issue: Content split across pages loses context
Lesson: Vision RAG doesn’t solve the fundamental semantic boundary problem
Nebius Qwen2.5-VL-72B: ~$0.015 per page
For large-scale indexing: Prohibitively expensive
Example: 10,000 documents × 20 pages = $3,000 just for metadata extraction
Lesson: Vision RAG is expensive at scale. Budget accordingly.
Out-of-the-box ColiVara: Only returned page images and scores
To make it production-ready, we built:
Lesson: ColiVara is a foundation, not a complete RAG solution
Initial tests (marketing website): Looked great!
Deep testing (diverse queries): Revealed significant gaps
Financial table test: Vision model extracted surrounding text, not table data
Lesson: Test with your actual use case, not just demos
1. Your documents are visually rich
2. Visual context is semantic
3. You have budget and time
4. Traditional RAG failed you
1. Your documents are text-heavy
2. You need high precision
3. You have cost/speed constraints
4. Traditional RAG works fine
1. Mixed document types
2. You want best of both
3. You can afford custom development
We’ve open-sourced our customizations (pending – link when ready):
Features:
Installation:
```bash
git clone https://github.com/spaceo-technologies/MyColiVara.git
cd MyColiVara
docker-compose up -d
```
All test scripts available in the /tmp/ directory:
- test_complete_rag_with_llm.py – End-to-end RAG with answer generation
- test_hybrid_search.py – Compare pure vision vs hybrid search
- test_embedding_level_search.py – Embedding-level retrieval tests
- test_financial_table_rag.py – Financial document testing

Comprehensive documentation in our repository:
- IMPLEMENTATION_SUMMARY.md – What we built
- METADATA_BOOSTING_DOCUMENTATION.md – Hybrid search details
- HYBRID_SEARCH_STATUS.md – Implementation status
- TESTING_INSTRUCTIONS.md – How to test the system

Modified files:
- /web/api/models.py – Added chunk_text, chunk_index, metadata fields
- /web/api/metadata_client.py – Vision LLM integration, chunk extraction
- /web/api/views.py – Search endpoints, metadata boosting
- /web/api/hybrid_search.py – Boosting algorithm (new file)

We successfully built a production-ready vision RAG system by:
Final accuracy: ~95% on our test set (up from ~85% with vanilla ColiVara)
Vision RAG is not a replacement for traditional RAG – it’s a specialized tool for visually-rich documents.
The truth:
Our recommendation:
ColiVara and ColPali represent exciting advances in vision-based retrieval. The technology works and has legitimate use cases.
But: The “state-of-the-art retrieval for any document” marketing is misleading. Vision RAG solves some problems better than traditional RAG, but creates new problems too.
For most text-heavy applications (documentation, knowledge bases, chat over reports), traditional RAG is still superior in accuracy, speed, and cost.
For visually-rich documents (presentations with infographics, financial reports with charts, forms with spatial layout), vision RAG is worth the investment.
Choose wisely based on your actual needs, not the hype.