Performance

Optimizing vibe-hnindex for maximum indexing speed and search responsiveness.

Indexing Performance

Parallel Workers

Since v0.8.0, index_codebase uses worker threads for parallel processing. The default INDEX_WORKERS=auto uses all CPU cores minus one.

Single-threaded

INDEX_WORKERS=1

Baseline. One file at a time. Good for low-resource machines.

Multi-threaded (auto)

INDEX_WORKERS=auto

~3-4× faster than single-threaded indexing on multi-core machines. Default setting.
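As a rough illustration of the "all CPU cores minus one" rule, the sketch below shows how an `INDEX_WORKERS` value could be turned into a worker count. `resolve_workers` is a hypothetical helper, not part of vibe-hnindex, and the floor of one worker on single-core machines is an assumption.

```python
import os
from typing import Optional


def resolve_workers(setting: str, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: map an INDEX_WORKERS value to a worker count."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if setting.strip().lower() == "auto":
        # "All CPU cores minus one", never below one worker (assumed floor).
        return max(1, cores - 1)
    # Explicit numeric values are used as given, with the same assumed floor.
    return max(1, int(setting))


print(resolve_workers("auto", cpu_count=8))  # 7
print(resolve_workers("1", cpu_count=8))     # 1
```

On an 8-core machine this leaves one core free for the main process and the rest of the system.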

Batch Size

INDEX_PARALLEL_BATCH controls files per worker batch (default: 8). Higher values increase throughput but use more memory:

INDEX_PARALLEL_BATCH=16  # Faster, more memory
INDEX_PARALLEL_BATCH=4   # Slower, less memory
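To make the trade-off concrete, here is a minimal sketch of how a file list might be split into per-worker batches. The `batches` function and the file names are illustrative assumptions, not the project's actual scheduler.

```python
from typing import Iterator, List


def batches(files: List[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size files."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


# Hypothetical file list, used only to show the batch shapes.
files = [f"src/file_{i}.ts" for i in range(20)]
for size in (4, 8, 16):
    print(size, [len(batch) for batch in batches(files, size)])
```

With 20 files, a batch size of 4 produces five small batches while 16 produces two large ones: fewer hand-offs to workers, but more file content held in memory at once.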

Search Performance

Streaming Search (v0.9.0+)

Streaming search runs the keyword and semantic searches in parallel, giving a ~1.5-2× speedup in hybrid mode:

SEARCH_STREAM_ENABLED=true

Streaming search emits progress notifications in four phases:

1. Parallel Search: keyword and semantic run simultaneously
2. RRF Fusion: combined scoring
3. Post-processing: deduplication and path quality
4. Results: final ranked output
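The fusion phase can be sketched as follows, assuming "RRF" refers to standard Reciprocal Rank Fusion. The constant `k = 60` is the value commonly used in the literature; the value vibe-hnindex actually uses is not stated here, and `rrf_fuse` and the file names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)), ranks start at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["auth.ts", "login.ts", "session.ts"]
semantic_hits = ["session.ts", "auth.ts", "token.ts"]
print(rrf_fuse([keyword_hits, semantic_hits]))
# ['auth.ts', 'session.ts', 'login.ts', 'token.ts']
```

Items that rank well in both lists rise to the top, which is why hybrid mode can surface results that neither search would rank first on its own.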

Search Cache

Results are cached with LRU eviction (default 100 entries, 5 min TTL):

SEARCH_CACHE_SIZE=200        # More cache entries
SEARCH_CACHE_TTL_MS=600000   # Longer TTL (10 min)

Set SEARCH_CACHE_TTL_MS=0 to disable caching for benchmarking.
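For readers unfamiliar with how LRU eviction and a TTL interact, the following is a minimal sketch of such a cache. `SearchCache` is a hypothetical class; details such as whether a read refreshes the TTL are assumptions, and the real implementation may differ.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class SearchCache:
    """Hypothetical LRU cache with a per-entry TTL."""

    def __init__(self, max_entries: int = 100, ttl_ms: int = 300_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.ttl_s = ttl_ms / 1000.0
        self.clock = clock
        self._items: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl_s:
            del self._items[key]      # expired: drop it and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        if self.ttl_s <= 0 or self.max_entries <= 0:
            return                    # a TTL of 0 disables caching
        self._items[key] = (self.clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the least recently used


now = [0.0]                           # fake clock so the demo is deterministic
cache = SearchCache(max_entries=2, ttl_ms=1000, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")                        # touch "a", so "b" is least recently used
cache.put("c", 3)                     # over capacity: evicts "b"
print(cache.get("b"))                 # None
now[0] = 2.0                          # advance past the 1 s TTL
print(cache.get("a"))                 # None
```

A larger `SEARCH_CACHE_SIZE` delays the eviction step, while a longer `SEARCH_CACHE_TTL_MS` delays the expiry check.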

Mode Selection Strategy

Choose the right search mode for optimal performance:

Scenario	Best Mode	Why
Exact symbol/function name	keyword	Fastest — no embedding needed
Natural language question	semantic	Embedding overhead but better relevance
General search	hybrid	Best results; moderate overhead
Code patterns	regex	Bypasses FTS/embeddings entirely
Find definitions	symbol	Quick SQLite lookup

Timeout Tuning

Adjust timeouts for slow machines or remote services:

OLLAMA_TIMEOUT_MS=60000   # 60s for slow Ollama
QDRANT_TIMEOUT_MS=30000   # 30s for remote Qdrant
SEARCH_TIMEOUT_MS=120000  # 2min overall timeout

Hardware Recommendations

Component	Minimum	Recommended
CPU	2 cores	4+ cores (for parallel indexing)
RAM	4 GB	8+ GB (embedding models)
Storage	SSD with 2 GB free	NVMe with 10+ GB free

Benchmarking

Use the benchmark_search tool to measure performance:

benchmark_search(project_name: "my-app")

See the Benchmark docs for help interpreting the results.

Single-Pass Indexing (v0.9.1+)

Since v0.9.1, indexing performs chunking, embedding, and dependency/symbol parsing in a single pass instead of two. Combined with SHA-1 hashing for change detection, this makes indexing ~30-40% faster on large codebases.
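Hash-based change detection generally works by comparing each file's current digest with the one recorded at the last index run. The sketch below illustrates the idea with SHA-1; the function names and the in-memory dictionaries are assumptions made for illustration, not the project's storage format.

```python
import hashlib
from typing import Dict, List


def sha1_of(content: bytes) -> str:
    """Hex SHA-1 digest of a file's bytes."""
    return hashlib.sha1(content).hexdigest()


def changed_files(current: Dict[str, bytes], known_hashes: Dict[str, str]) -> List[str]:
    """Paths that are new or whose content hash differs from the stored one."""
    return sorted(
        path for path, content in current.items()
        if known_hashes.get(path) != sha1_of(content)
    )


# Hashes recorded at the previous run (hypothetical data).
previous = {"a.py": sha1_of(b"print('a')"), "b.py": sha1_of(b"print('b')")}
# Current file contents: b.py was edited and c.py is new.
current = {"a.py": b"print('a')", "b.py": b"print('B')", "c.py": b"print('c')"}
print(changed_files(current, previous))  # ['b.py', 'c.py']
```

Only the files returned here would need to be re-chunked and re-embedded; unchanged files are skipped.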

Embedding Model Performance

The embedding model significantly impacts indexing speed. Larger models produce better vectors but take longer per chunk:

Model	RAM	GPU?	Latency/Chunk	Index 1k Files*
`all-minilm`	~100 MB	No	~5 ms	~30 sec
`nomic-embed-text`	~400 MB	Optional	~15 ms	~1.5 min
`bge-m3:567m`	~1.5 GB	Recommended	~25 ms	~2.5 min
`qwen3-embedding:4b`	~3 GB (Q4)	Required	~40 ms	~4 min

* Estimated indexing time for 1,000 source files with 60-line chunks (single worker). Actuals vary with file size, hardware, and parallel workers.
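The last column of the table follows from the per-chunk latency. The figures are consistent with roughly six chunks per file; that number is inferred from the table rather than stated, so treat it as an assumption. A small sketch of the arithmetic:

```python
# Per-chunk latencies from the table above, in milliseconds.
LATENCY_MS = {
    "all-minilm": 5,
    "nomic-embed-text": 15,
    "bge-m3:567m": 25,
    "qwen3-embedding:4b": 40,
}


def estimate_seconds(files: int, chunks_per_file: int, latency_ms: float) -> float:
    """Single-worker estimate: total chunks times latency per chunk."""
    return files * chunks_per_file * latency_ms / 1000.0


# chunks_per_file=6 is an inferred assumption, not a documented value.
for model, latency in LATENCY_MS.items():
    print(f"{model}: {estimate_seconds(1000, 6, latency):.0f} s")
```

This gives 30 s, 90 s, 150 s, and 240 s, matching the table. To estimate your own codebase, substitute your file count and average chunks per file.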

VRAM Impact

Models that fit entirely in GPU VRAM generate embeddings 5-30× faster than models that spill to system RAM. Always check the quantization level, since a lower-precision variant may be what lets a model fit:

Quantization	Memory Reduction	Quality Impact
F16 (default)	Baseline	None
Q8_0	~50%	Negligible
Q5_K_M	~65%	Minimal
Q4_K_M	~75%	Moderate

# Pull a quantized model (available tags vary by model; check the model's page in the Ollama library)
ollama pull nomic-embed-text:Q8_0
ollama pull qwen3-embedding:4b-Q4_K_M

Matryoshka Dimension Reduction

nomic-embed-text and snowflake-arctic-embed2 support Matryoshka Representation Learning — you can reduce dimensions (e.g., 768 → 512) while keeping ~90% quality, saving memory and Qdrant storage:

EMBEDDING_DIMENSIONS=512   # Smaller vectors, less memory and Qdrant storage
EMBEDDING_DIMENSIONS=256   # Even smaller, still usable quality
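With Matryoshka-trained models, a shorter embedding is usually obtained by keeping the leading dimensions and re-normalizing. The sketch below shows that step and the resulting raw vector storage, assuming 4-byte float32 values and ignoring index overhead. How vibe-hnindex applies `EMBEDDING_DIMENSIONS` internally is not stated here, so both helpers are illustrative.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload size, assuming float32 values (an assumption)."""
    return num_vectors * dims * bytes_per_value


print(truncate_embedding([3.0, 4.0, 12.0], 2))
for dims in (768, 512, 256):
    print(dims, storage_bytes(100_000, dims))
```

For 100,000 chunks, going from 768 to 512 dimensions cuts the raw vector payload by a third, and 256 cuts it by two thirds.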
