Performance

Optimizing vibe-hnindex for maximum indexing speed and search responsiveness.

Indexing Performance

Parallel Workers

Since v0.8.0, index_codebase uses worker threads for parallel processing. The default INDEX_WORKERS=auto uses all CPU cores minus one.

Single-threaded

INDEX_WORKERS=1

Baseline. One file at a time. Good for low-resource machines.

Multi-threaded (auto)

INDEX_WORKERS=auto

~3-4× faster on multi-core machines. Default setting.
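As a rough illustration of how a value such as INDEX_WORKERS=auto could resolve to a thread count, here is a minimal Python sketch. The helper name resolve_workers is hypothetical and not part of vibe-hnindex; only the "all CPU cores minus one" rule comes from the text above, and the clamp to at least one worker is an assumption.

```python
import os
from typing import Optional


def resolve_workers(setting: str, cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: map an INDEX_WORKERS value to a worker count."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if setting.strip().lower() == "auto":
        # "auto" means all cores minus one, but never fewer than one worker
        # (the lower bound is an assumption for single-core machines).
        return max(1, cores - 1)
    # Explicit numeric values are used as given, clamped to at least one.
    return max(1, int(setting))


print(resolve_workers("auto", cpu_count=8))  # 8 cores leave one free
print(resolve_workers("1", cpu_count=8))     # single-threaded baseline
```

Leaving one core free keeps the main thread and the rest of the machine responsive while indexing runs.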

Batch Size

INDEX_PARALLEL_BATCH controls files per worker batch (default: 8). Higher values increase throughput but use more memory:

INDEX_PARALLEL_BATCH=16  # Faster, more memory
INDEX_PARALLEL_BATCH=4   # Slower, less memory
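To make the trade-off concrete, the sketch below splits a file list into batches of a given size. It is an illustration only: the function name batches and the file names are hypothetical, and the way vibe-hnindex actually hands batches to workers is not described here.

```python
from typing import Iterator, List, Sequence


def batches(files: Sequence[str], batch_size: int = 8) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(files), batch_size):
        yield list(files[start:start + batch_size])


files = [f"src/file{i}.ts" for i in range(20)]  # hypothetical paths
for size in (4, 8, 16):
    sizes = [len(batch) for batch in batches(files, size)]
    print(size, sizes)
```

A larger batch means fewer hand-offs between the main thread and the workers, but each worker holds more files in memory at once.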

Search Performance

Streaming Search (v0.9.0+)

Streaming search runs the keyword and semantic searches in parallel, giving a ~1.5-2× speedup in hybrid mode:

SEARCH_STREAM_ENABLED=true

Streaming provides 4-phase progress notifications:

  1. Parallel Search — keyword and semantic run simultaneously
  2. RRF Fusion — combined scoring
  3. Post-processing — deduplication and path quality
  4. Results — final ranked output
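The RRF Fusion phase refers to Reciprocal Rank Fusion, which scores each result by summing 1 / (k + rank) over every ranked list it appears in. The sketch below shows the general technique, not vibe-hnindex's code: the function name rrf_fuse is hypothetical, and k = 60 is the constant commonly used in the literature, since the value used by the tool is not stated here.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the items it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a.ts", "b.ts", "c.ts"]    # hypothetical keyword ranking
semantic_hits = ["b.ts", "d.ts", "a.ts"]   # hypothetical semantic ranking
print(rrf_fuse([keyword_hits, semantic_hits]))
```

Because only ranks are used, the keyword and semantic scores do not need to be on the same scale before they are combined.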

Search Cache

Results are cached with LRU eviction (default 100 entries, 5 min TTL):

SEARCH_CACHE_SIZE=200        # More cache entries
SEARCH_CACHE_TTL_MS=600000   # Longer TTL (10 min)

Set SEARCH_CACHE_TTL_MS=0 to disable caching, for example while benchmarking.
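For readers unfamiliar with the combination, here is a minimal sketch of an LRU cache with a TTL. It is not the vibe-hnindex implementation: the class name TtlLruCache and its methods are hypothetical, and treating a TTL of 0 as "never store" is an assumption based on the sentence above. The defaults mirror the documented 100 entries and 5 minutes.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TtlLruCache:
    """Hypothetical LRU cache whose entries also expire after a TTL."""

    def __init__(self, max_entries: int = 100, ttl_ms: int = 300_000,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.ttl_s = ttl_ms / 1000.0
        self._clock = clock
        self._items = OrderedDict()  # key -> (stored_at, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at >= self.ttl_s:
            del self._items[key]      # expired: drop and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        if self.ttl_s <= 0 or self.max_entries <= 0:
            return  # assumption: a TTL of 0 disables caching entirely
        self._items[key] = (self._clock(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict least recently used


cache = TtlLruCache(max_entries=200, ttl_ms=600_000)
cache.put("hybrid:parse config", ["src/config.ts"])  # hypothetical key
print(cache.get("hybrid:parse config"))
```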

Mode Selection Strategy

Choose the right search mode for optimal performance:

| Scenario | Best Mode | Why |
| --- | --- | --- |
| Exact symbol/function name | keyword | Fastest; no embedding needed |
| Natural language question | semantic | Embedding overhead but better relevance |
| General search | hybrid | Best results; moderate overhead |
| Code patterns | regex | Bypasses FTS/embeddings entirely |
| Find definitions | symbol | Quick SQLite lookup |

Timeout Tuning

Adjust timeouts for slow machines or remote services:

OLLAMA_TIMEOUT_MS=60000   # 60s for slow Ollama
QDRANT_TIMEOUT_MS=30000   # 30s for remote Qdrant
SEARCH_TIMEOUT_MS=120000  # 2min overall timeout

Hardware Recommendations

| Component | Minimum | Recommended |
| --- | --- | --- |
| CPU | 2 cores | 4+ cores (for parallel indexing) |
| RAM | 4 GB | 8+ GB (embedding models) |
| Storage | SSD with 2 GB free | NVMe with 10+ GB free |

Benchmarking

Use the benchmark_search tool to measure performance:

benchmark_search(project_name: "my-app")

See Benchmark docs for interpreting results.

Single-Pass Indexing (v0.9.1+)

Since v0.9.1, indexing performs chunking, embedding, and dependency/symbol parsing in a single pass instead of two. Combined with SHA-1 hashing for change detection, this makes indexing ~30-40% faster on large codebases.
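Hash-based change detection can be sketched as follows: hash each file's content and re-index only the files whose hash differs from the one recorded on the previous run. This is an illustration under assumptions, not the tool's code. The function names and the plain dictionary standing in for stored hashes are hypothetical, and how vibe-hnindex persists its hashes is not described here.

```python
import hashlib
from typing import Dict, List


def sha1_of(content: bytes) -> str:
    """Hex SHA-1 digest of a file's content."""
    return hashlib.sha1(content).hexdigest()


def changed_files(current: Dict[str, bytes],
                  stored_hashes: Dict[str, str]) -> List[str]:
    """Return paths that are new or whose content hash has changed."""
    return sorted(
        path for path, content in current.items()
        if stored_hashes.get(path) != sha1_of(content)
    )


# Hypothetical state from a previous indexing run.
stored = {"a.ts": sha1_of(b"export const a = 1"),
          "b.ts": sha1_of(b"export const b = 2")}
# Current contents: a.ts unchanged, b.ts edited, c.ts newly added.
current = {"a.ts": b"export const a = 1",
           "b.ts": b"export const b = 3",
           "c.ts": b"export const c = 4"}
print(changed_files(current, stored))
```

Here SHA-1 serves only as a fast content fingerprint, not as a security measure.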

Embedding Model Performance

The embedding model significantly impacts indexing speed. Larger models produce better vectors but take longer per chunk:

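| Model | RAM | GPU? | Latency/Chunk | Index 1k Files* |
| --- | --- | --- | --- | --- |
| all-minilm | ~100 MB | No | ~5 ms | ~30 sec |
| nomic-embed-text | ~400 MB | Optional | ~15 ms | ~1.5 min |
| bge-m3:567m | ~1.5 GB | Recommended | ~25 ms | ~2.5 min |
| qwen3-embedding:4b | ~3 GB (Q4) | Required | ~40 ms | ~4 min |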

* Estimated indexing time for 1,000 source files with 60-line chunks (single worker). Actuals vary with file size, hardware, and parallel workers.
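The estimates in the table follow from simple arithmetic: total chunks multiplied by per-chunk latency. The figures are consistent with roughly 6 chunks per file (30 sec at 5 ms per chunk is 6,000 chunks across 1,000 files); that number is inferred from the table and is not stated by the docs. A small sketch, with a hypothetical function name, for plugging in your own numbers:

```python
def estimate_index_seconds(files: int, chunks_per_file: float,
                           latency_ms: float) -> float:
    """Single-worker estimate: total chunks times per-chunk latency."""
    total_chunks = files * chunks_per_file
    return total_chunks * latency_ms / 1000.0


# Per-chunk latencies taken from the table above; 6 chunks per file is an
# inferred average, not a documented value.
for model, latency_ms in [("all-minilm", 5), ("nomic-embed-text", 15),
                          ("bge-m3:567m", 25), ("qwen3-embedding:4b", 40)]:
    seconds = estimate_index_seconds(1000, 6, latency_ms)
    print(f"{model}: {seconds:.0f} s")
```

This covers embedding time only; file reading, parsing, and writes to the vector store come on top of it.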

VRAM Impact

Models that fit entirely in GPU VRAM generate embeddings 5-30× faster than models that spill to system RAM. Always check quantization level:

| Quantization | Memory Reduction | Quality Impact |
| --- | --- | --- |
| F16 (default) | Baseline | None |
| Q8_0 | ~50% | Negligible |
| Q5_K_M | ~65% | Minimal |
| Q4_K_M | ~75% | Moderate |

# Pull a quantized model
ollama pull nomic-embed-text:Q8_0
ollama pull qwen3-embedding:4b-Q4_K_M
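To check whether a quantized model is likely to fit in VRAM, the approximate reductions from the table can be applied to the F16 size. The sketch below is a back-of-the-envelope aid only: the function names and the 6,000 MB and 4,096 MB figures are hypothetical, and real memory use also depends on context length and runtime overhead.

```python
# Approximate memory reductions relative to F16, from the table above.
MEMORY_REDUCTION = {"F16": 0.0, "Q8_0": 0.50, "Q5_K_M": 0.65, "Q4_K_M": 0.75}


def estimated_memory_mb(f16_mb: float, quantization: str) -> float:
    """Rough model size after quantization, given its F16 size in MB."""
    return f16_mb * (1.0 - MEMORY_REDUCTION[quantization])


def fits_in_vram(f16_mb: float, quantization: str, vram_mb: float) -> bool:
    """True if the rough estimate is within the available VRAM."""
    return estimated_memory_mb(f16_mb, quantization) <= vram_mb


# Hypothetical example: a model that needs 6,000 MB at F16 on a 4,096 MB GPU.
for quant in MEMORY_REDUCTION:
    size = estimated_memory_mb(6000, quant)
    print(f"{quant}: ~{size:.0f} MB, fits: {fits_in_vram(6000, quant, 4096)}")
```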

Matryoshka Dimension Reduction

nomic-embed-text and snowflake-arctic-embed2 support Matryoshka Representation Learning — you can reduce dimensions (e.g., 768 → 512) while keeping ~90% quality, saving memory and Qdrant storage:

EMBEDDING_DIMENSIONS=512   # Smaller vectors, less Qdrant storage
EMBEDDING_DIMENSIONS=256   # Even smaller, still usable quality
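Conceptually, Matryoshka reduction keeps the first N components of the full embedding and re-normalizes the result. The sketch below illustrates that idea and the resulting storage saving. It rests on assumptions: the function names are hypothetical, whether vibe-hnindex truncates vectors itself or requests the smaller size from the embedding server is not stated here, and the storage figure assumes 4-byte float32 values and ignores Qdrant's index overhead.

```python
import math
from typing import List


def truncate_embedding(vector: List[float], dimensions: int) -> List[float]:
    """Keep the first `dimensions` components and re-normalize to unit length."""
    if not 0 < dimensions <= len(vector):
        raise ValueError("dimensions must be between 1 and len(vector)")
    head = vector[:dimensions]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def storage_bytes(num_vectors: int, dimensions: int,
                  bytes_per_value: int = 4) -> int:
    """Raw vector payload size, assuming float32 values (an assumption)."""
    return num_vectors * dimensions * bytes_per_value


print(truncate_embedding([3.0, 4.0, 12.0], 2))  # toy 3-dimensional vector
for dims in (768, 512, 256):
    print(dims, storage_bytes(100_000, dims))   # hypothetical 100k chunks
```

Re-normalizing matters when similarity is computed with a dot product, because a truncated vector is no longer unit length.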