Performance
Optimizing vibe-hnindex for maximum indexing speed and search responsiveness.
Indexing Performance
Parallel Workers
Since v0.8.0, index_codebase uses worker threads for parallel processing. The default INDEX_WORKERS=auto uses all CPU cores minus one.
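As a rough illustration of the "all CPU cores minus one" rule, the sketch below resolves a worker count from a setting string. The function name `resolve_worker_count`, the clamp to at least one worker, and the acceptance of an explicit number are assumptions made for this example, not vibe-hnindex's actual code.

```python
import os
from typing import Optional


def resolve_worker_count(setting: str = "auto", cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: turn an INDEX_WORKERS-style value into a worker count."""
    # os.cpu_count() can return None, so fall back to a single core.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if setting.strip().lower() == "auto":
        # "auto": all cores minus one; never below one (assumed clamp).
        return max(1, cores - 1)
    # Assumption: an explicit integer is also accepted.
    return max(1, int(setting))


print(resolve_worker_count("auto", cpu_count=8))
```

Leaving one core free keeps the machine responsive while indexing runs.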
Batch Size
INDEX_PARALLEL_BATCH controls files per worker batch (default: 8). Higher values increase throughput but use more memory:
INDEX_PARALLEL_BATCH=16 # Faster, more memory
INDEX_PARALLEL_BATCH=4 # Slower, less memory
Search Performance
Streaming Search (v0.9.0+)
Streaming search runs keyword + semantic in parallel, providing ~1.5-2× speedup for hybrid mode:
SEARCH_STREAM_ENABLED=true
Streaming provides 4-phase progress notifications:
- Parallel Search — keyword and semantic run simultaneously
- RRF Fusion — combined scoring
- Post-processing — deduplication and path quality
- Results — final ranked output
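The RRF Fusion phase can be pictured with a minimal Reciprocal Rank Fusion sketch. Here `k = 60` is the constant commonly used for RRF, and equal weighting of the two rankings is assumed. Neither is confirmed as what vibe-hnindex uses, and the file names are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids using Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a.ts", "b.ts", "c.ts"]   # hypothetical keyword ranking
semantic_hits = ["b.ts", "d.ts", "a.ts"]  # hypothetical semantic ranking
print(rrf_fuse([keyword_hits, semantic_hits]))
```

Because RRF works on ranks rather than raw scores, keyword and semantic results can be combined without normalizing their score scales.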
Search Cache
Results are cached with LRU eviction (default 100 entries, 5 min TTL):
SEARCH_CACHE_SIZE=200 # More cache entries
SEARCH_CACHE_TTL_MS=600000 # Longer TTL (10 min)
Set SEARCH_CACHE_TTL_MS=0 to disable caching for benchmarking.
Mode Selection Strategy
Choose the right search mode for optimal performance:
| Scenario | Best Mode | Why |
|---|---|---|
| Exact symbol/function name | keyword | Fastest — no embedding needed |
| Natural language question | semantic | Embedding overhead but better relevance |
| General search | hybrid | Best results; moderate overhead |
| Code patterns | regex | Bypasses FTS/embeddings entirely |
| Find definitions | symbol | Quick SQLite lookup |
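The table above could be approximated on the client side by a small heuristic. Everything in this sketch (the function name `pick_mode`, the `want_definition` flag, and the matching rules) is hypothetical and only illustrates the guidance. It is not part of vibe-hnindex.

```python
import re

QUESTION_WORDS = ("how", "what", "why", "where", "when", "which")
REGEX_META = re.compile(r"[\\^$*+?{}\[\]|()]")
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def pick_mode(query: str, want_definition: bool = False) -> str:
    """Hypothetical heuristic mapping a query to one of the documented modes."""
    q = query.strip()
    if want_definition:
        return "symbol"      # find definitions
    words = q.lower().split()
    # Check questions first, because "?" is also a regex metacharacter.
    if q.endswith("?") or (words and words[0] in QUESTION_WORDS):
        return "semantic"    # natural language question
    if REGEX_META.search(q):
        return "regex"       # code pattern
    if IDENTIFIER.fullmatch(q):
        return "keyword"     # exact symbol or function name
    return "hybrid"          # general search


print(pick_mode("parseConfig"), pick_mode("how does auth work?"))
```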
Timeout Tuning
Adjust timeouts for slow machines or remote services:
OLLAMA_TIMEOUT_MS=60000 # 60s for slow Ollama
QDRANT_TIMEOUT_MS=30000 # 30s for remote Qdrant
SEARCH_TIMEOUT_MS=120000 # 2min overall timeout
Hardware Recommendations
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 2 cores | 4+ cores (for parallel indexing) |
| RAM | 4 GB | 8+ GB (embedding models) |
| Storage | SSD with 2 GB free | NVMe with 10+ GB free |
Benchmarking
Use the benchmark_search tool to measure performance:
benchmark_search(project_name: "my-app")
See Benchmark docs for interpreting results.
Single-Pass Indexing (v0.9.1+)
Since v0.9.1, indexing uses a single pass for chunking + embedding + dependency/symbol parsing instead of two passes. Combined with SHA-1 for change detection, this makes indexing ~30-40% faster on large codebases.
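The change-detection idea can be sketched as follows: hash each file's content with SHA-1 and re-index only files whose hash differs from the stored one. The function names and the in-memory dictionary of stored hashes are assumptions for illustration. How vibe-hnindex actually stores hashes is not described here.

```python
import hashlib
from typing import Dict, List


def sha1_hex(content: bytes) -> str:
    """Return the SHA-1 digest of the file content as a hex string."""
    return hashlib.sha1(content).hexdigest()


def files_to_reindex(current: Dict[str, bytes], stored_hashes: Dict[str, str]) -> List[str]:
    """Return paths that are new or whose content hash has changed."""
    changed = []
    for path, content in current.items():
        if stored_hashes.get(path) != sha1_hex(content):
            changed.append(path)
    return sorted(changed)


stored = {"a.py": sha1_hex(b"print('a')")}           # hypothetical earlier run
current = {"a.py": b"print('a')", "b.py": b"print('b')"}
print(files_to_reindex(current, stored))
```

Unchanged files are skipped entirely, so repeat runs spend time only on new or modified files.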
Embedding Model Performance
The embedding model significantly impacts indexing speed. Larger models produce better vectors but take longer per chunk:
| Model | RAM | GPU? | Latency/Chunk | Index 1k Files* |
|---|---|---|---|---|
| all-minilm | ~100 MB | No | ~5 ms | ~30 sec |
| nomic-embed-text | ~400 MB | Optional | ~15 ms | ~1.5 min |
| bge-m3:567m | ~1.5 GB | Recommended | ~25 ms | ~2.5 min |
| qwen3-embedding:4b | ~3 GB (Q4) | Required | ~40 ms | ~4 min |
* Estimated indexing time for 1,000 source files with 60-line chunks (single worker). Actuals vary with file size, hardware, and parallel workers.
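The estimates in the table are consistent with a simple product of chunk count and per-chunk latency. The figure of about 6 chunks per file is inferred from the table (30 s ÷ 5 ms ÷ 1,000 files), not stated by the docs, and the sketch ignores I/O, parsing, and Qdrant writes.

```python
def estimate_index_seconds(files: int, chunks_per_file: int, latency_ms: float) -> float:
    """Rough single-worker estimate bounded by embedding latency only."""
    total_chunks = files * chunks_per_file
    return total_chunks * latency_ms / 1000.0


# Per-chunk latencies taken from the table above.
for model, latency_ms in [("all-minilm", 5), ("nomic-embed-text", 15),
                          ("bge-m3:567m", 25), ("qwen3-embedding:4b", 40)]:
    seconds = estimate_index_seconds(1000, 6, latency_ms)
    print(f"{model}: ~{seconds / 60:.1f} min")
```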
VRAM Impact
Models that fit entirely in GPU VRAM generate embeddings 5-30× faster than models that spill to system RAM. Quantization shrinks a model's memory footprint, so check the quantization level when a model does not fit:
| Quantization | Memory Reduction | Quality Impact |
|---|---|---|
| F16 (default) | Baseline | None |
| Q8_0 | ~50% | Negligible |
| Q5_K_M | ~65% | Minimal |
| Q4_K_M | ~75% | Moderate |
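As a back-of-the-envelope check, the reductions in the table can be applied to a model's F16 size to see whether it is likely to fit in VRAM. The 8 GB example size and the 0.5 GB runtime overhead are illustrative assumptions. Actual usage depends on the runtime and context length.

```python
# Approximate memory reduction relative to F16, taken from the table above.
MEMORY_REDUCTION = {"F16": 0.0, "Q8_0": 0.50, "Q5_K_M": 0.65, "Q4_K_M": 0.75}


def quantized_size_gb(f16_size_gb: float, quantization: str) -> float:
    """Estimate model size after quantization from its F16 size."""
    return f16_size_gb * (1.0 - MEMORY_REDUCTION[quantization])


def fits_in_vram(f16_size_gb: float, quantization: str, vram_gb: float,
                 overhead_gb: float = 0.5) -> bool:
    """Assumed rule of thumb: model size plus a fixed overhead must fit in VRAM."""
    return quantized_size_gb(f16_size_gb, quantization) + overhead_gb <= vram_gb


# Hypothetical model that needs 8 GB at F16, on a 4 GB GPU.
for quant in MEMORY_REDUCTION:
    print(quant, fits_in_vram(8.0, quant, vram_gb=4.0))
```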
# Pull a quantized model
ollama pull nomic-embed-text:Q8_0
ollama pull qwen3-embedding:4b-Q4_K_M
Matryoshka Dimension Reduction
nomic-embed-text and snowflake-arctic-embed2 support Matryoshka Representation Learning — you can reduce dimensions (e.g., 768 → 512) while keeping ~90% quality, saving memory and Qdrant storage:
EMBEDDING_DIMENSIONS=512 # Smaller vectors, less Qdrant storage
EMBEDDING_DIMENSIONS=256 # Even smaller, still usable quality
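Conceptually, Matryoshka reduction keeps the leading dimensions of a vector and re-normalizes the result. The sketch below shows that idea along with the per-vector storage saving, assuming 4-byte float32 components. The helper names are hypothetical, and the exact truncation procedure is model-specific.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` components and L2-normalize the result."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # avoid dividing by zero on an all-zero vector
    return [x / norm for x in head]


def vector_bytes(dims: int, bytes_per_component: int = 4) -> int:
    """Raw storage per vector, assuming float32 components."""
    return dims * bytes_per_component


print(truncate_embedding([3.0, 4.0, 12.0], 2))
print(vector_bytes(768), vector_bytes(512), vector_bytes(256))
```

Re-normalizing keeps cosine and dot-product scores comparable after truncation.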