
Scaling Vector Databases Without Burning Cash (and Your Weekend)

Introduction

Your marketing squad just added five new languages and grew page‑level embeddings from 10 million to 50 million, all before your first coffee. Latency SLOs? Unchanged. Extra budget? Of course not. If you size the cluster wrong today, you will be explaining the overage on every ops call for the next 12 months.

“Every vector you store is a tiny monthly subscription you’ve sold back to your cloud provider.”

This long‑form guide distills six months of Slack wars, real invoices, and post‑mortem tears into a practical playbook. We cover:

  • The 4‑step sizing formula (with code snippets).

  • Real €/M‑vector price points on Hetzner vs AWS.

  • Multi‑tenant design, read/write ratios, GPU off‑load, and quantisation.

  • Where vector DBs steal budget compared to SQL, Loki, OpenSearch, and ClickHouse.

By the end, you’ll know exactly why your data science team’s “just 768‑dim embeddings” translates to either three CCX53 nodes or two r7g.8xlarge—and how to debate that in the next exec meeting.

What Is Vector Database Scaling?

Vector database scaling is the art of expanding storage, RAM, and compute so approximate‑nearest‑neighbour (ANN) queries stay within p95 latency targets as vector count, dimension, or QPS climb. It usually means sharding indices, tiering storage (RAM→NVMe→S3), and balancing replicas for HA.

Why It Matters – 2025 Context for Fast‑Growing Tech Companies

Cloud Modernisation, Not Moon‑Shots

  • Investors love AI‑driven UX, but hate infra burn.

  • RAG, personalised search, and semantic analytics balloon vector count faster than user count.

  • Early cost slipups are now venture‑capital‑visible; billing lines expose overspend down to the hour.

“Vector search is the first infra line‑item the Board reads after ‘GPU spend’.”

Real Pain‑Points We Hear Weekly

| Pain | Symptom | Hidden Cost |
| --- | --- | --- |
| Feature team adds language or modality | Embedding volume × 3 | 1‑click deploy triggers €4 k/mo RAM growth |
| Multi‑tenant SaaS, one marquee customer doubles traffic | Spike in cache misses | Other tenants suffer recall drop ➜ churn risk |
| Pivot to serverless demo | Pay‑per‑request RU/WU spikes | CFO asks why POC costs more than prod |

Make no mistake: vector DBs are moving out of “nice‑to‑have” into the critical path of real‑time UX. You’ll scale them whether you’re prepared or not.

The 4‑Step Sizing Framework

We expand the sizing formula into an end‑to‑end workflow with code snippets and sanity checkpoints.

Step 1 – Estimate Raw Footprint

Raw footprint = vectors × dimensions × 4 bytes (float32). Result: 50 M × 768‑d float32 ≈ 143 GiB.
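A minimal sketch of that arithmetic in plain Python (no dependencies; the 50 M count and 768 dimensions are the example figures from this section):

```python
# Step 1 sketch: raw footprint of the example corpus (50 M x 768-d float32).
BYTES_PER_FLOAT32 = 4

def raw_footprint_gib(num_vectors: int, dims: int, bytes_per_dim: int = BYTES_PER_FLOAT32) -> float:
    """Raw vector payload in GiB, before any index overhead."""
    return num_vectors * dims * bytes_per_dim / 2**30

print(f"{raw_footprint_gib(50_000_000, 768):.0f} GiB")  # ~143 GiB, matching the figure above
```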

Step 2 – Apply Engine Factor (F)

Each engine keeps ancillary graphs, caches, and metadata. We measured memory at 50 M vectors for default HNSW configs.

| Engine | Overhead F | Memory @ 50 M | Why |
| --- | --- | --- | --- |
| Milvus (HNSW) | × 7‑8 | 1.0‑1.1 TB | Graph & neighbour lists in RAM |
| Weaviate | × 2 | 286 GB | Vector‑cache + inverted index |
| Qdrant | × 1.5 | 215 GB | Payload encoded leanly |
| Vespa | × 1.2‑3 | 170‑430 GB | Compression selectable (bf16, int8, PQ) |

Rule‑of‑thumb: Don’t trust vendor docs—dump and load 1 million vectors first, then multiply.
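Continuing the sketch for Step 2, using the mid‑points of the table above as factors (treat them as placeholders until your own 1‑million‑vector load test says otherwise):

```python
# Step 2 sketch: apply a measured engine overhead factor to the raw footprint.
RAW_GIB = 143              # 50 M x 768-d float32 from Step 1

ENGINE_FACTOR = {          # mid-points of the table above -- re-measure for your config
    "milvus_hnsw": 7.5,
    "weaviate": 2.0,
    "qdrant": 1.5,
    "vespa_bf16": 1.2,
}

for engine, factor in ENGINE_FACTOR.items():
    print(f"{engine:>12}: ~{RAW_GIB * factor:,.0f} GiB in memory")
```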

Step 3 – Map to Node Shapes

Using Weaviate’s ×2 factor: 286 GB ÷ 128 GB ≈ 2.2, so 3 × CCX53 on Hetzner gives a 384 GB cluster with roughly 30 % head‑room. Want AWS? Two r7g.8xlarge offer 512 GB, but at 3‑5× the cost.
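The node‑shape step is just a ceiling division with head‑room. A sketch, where the 128 GB and 256 GB node sizes are the CCX53 and r7g.8xlarge figures quoted above and the 30 % head‑room target is an assumption you should tune:

```python
import math

# Step 3 sketch: map the in-memory footprint to a node count with head-room.
def nodes_needed(footprint_gb: float, node_ram_gb: float, headroom: float = 0.30) -> int:
    """Smallest node count whose combined RAM covers the footprint plus head-room."""
    return math.ceil(footprint_gb * (1 + headroom) / node_ram_gb)

print(nodes_needed(286, 128))   # 3 -> 3 x CCX53 (384 GB total) on Hetzner
print(nodes_needed(286, 256))   # 2 -> 2 x r7g.8xlarge (512 GB total) on AWS
```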

Step 4 – Add Replicas & Tiering

  • Reads dominate at scale → 2× replicas give HA and double QPS.

  • Hot shards stay in NVMe, cold shards flush to S3 (a quick sizing sketch follows).
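A sketch that ties the two bullets above into RAM and read‑capacity numbers (the per‑replica QPS figure is a hypothetical placeholder; substitute your own benchmark):

```python
# Step 4 sketch: replicas double both hot-tier RAM and read capacity.
def hot_tier_ram_gb(hot_footprint_gb: float, replication_factor: int = 2) -> float:
    """RAM for the hot tier only; warm shards sit on NVMe, cold shards on S3."""
    return hot_footprint_gb * replication_factor

def read_capacity_qps(per_replica_qps: float, replication_factor: int = 2) -> float:
    """Aggregate QPS, assuming reads spread evenly across replicas."""
    return per_replica_qps * replication_factor

print(hot_tier_ram_gb(286))        # 572 GB of RAM for 2x replicas of the hot tier
print(read_capacity_qps(1_500))    # hypothetical 1,500 QPS per replica -> 3,000 QPS
```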

Quick Sanity Sheet

| Range | Milvus | Weaviate | Qdrant | Vespa |
| --- | --- | --- | --- | --- |
| Dev <5 M | 1 × 8 vCPU / 32 GB | 1 × 8 vCPU / 32 GB | 1 × 4 vCPU / 16 GB | 1 content + 2 API (≈64 GB) |
| Small ≈50 M | 3 × 32 vCPU / 128 GB | 3 × 128 GB | 3 × 64 GB | 6 × 64 GB |
| Mid ≈0.5 B | 25 × 64 vCPU / 256 GB | 12 × 256 GB | 12 × 128 GB | 24 × 72 vCPU |

The Economics – € per Million Vectors in the Real World

Hetzner vs AWS Cost Table


| Provider | Nodes | €/month | €/M vector | Notes |
| --- | --- | --- | --- | --- |
| Hetzner CCX53 | 3 × 128 GB | €675 | €13.5 | Flat‑rate, EU DC |
| AWS r7g.8xlarge | 2 × 256 GB | €2 275 | €45 | Spot saves ~70 % but adds interruption risk |
| AWS r7a.8xlarge | 2 × 256 GB | €3 900 | €78 | eu‑central‑1 on‑demand |

Compression & Quantisation

Switch to HNSW + PQ:

  • Memory shrink: 24× (float32→int8 sub‑vectors).

  • Recall impact: ≤1 % on MS MARCO 50 M.

  • Cost drop: Weaviate €/M vector ≈ €0.6.

“Quantise cold shards—turn an r7a budget into a t4g bill.”
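A back‑of‑the‑envelope sketch for those numbers (the 24× shrink and the €675/month cluster price are the figures quoted in this article; the rest is plain arithmetic):

```python
# Compression sketch: what the 24x PQ shrink does to RAM and EUR per M vectors.
RAW_GIB = 143            # 50 M x 768-d float32 from Step 1
PQ_SHRINK = 24           # HNSW + PQ figure quoted above (float32 -> int8 sub-vectors)

compressed_gib = RAW_GIB / PQ_SHRINK
print(f"Compressed vectors: ~{compressed_gib:.0f} GiB")        # ~6 GiB instead of ~143 GiB

monthly_eur, vectors_m = 675, 50                               # 3 x CCX53 from the cost table
print(f"Uncompressed: ~EUR {monthly_eur / vectors_m:.1f} per M vectors")   # ~13.5
# After PQ the same corpus fits on a far smaller (or shared) node, which is how
# the ~EUR 0.6 per M vector figure above becomes plausible.
```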

How Vector Stores Differ from SQL, NoSQL, Loki, and ClickHouse

| Engine class | Writes (ingest) | Reads (query) | R:W split |
| --- | --- | --- | --- |
| Vector DB (ANN) | CPU‑heavy index build, 1‑2× RAM | Graph walks, RAM‑latency bound, GPU optional | 1 : 3–5 |
| SQL / NoSQL | Small random I/O | Short key lookups, mostly cache hits | 1 : 1 |
| Loki / TSDB | Append to object store | Massive decompression | 1 : 8 |
| ClickHouse | Chunk‑merge CPU | Vectorised scans | 1 : 4 |

“We budget Loki for traffic spikes; we budget vector search for new features—very different fiscal rhythms.”

Multi‑Tenancy – Keeping Noisy Neighbours in Check

| Engine | Isolation Primitive | Strength | Caveat |
| --- | --- | --- | --- |
| Milvus | Database → Collection | Strong RBAC | 64‑DB cap |
| Qdrant | is_tenant payload | Lightest RAM | Cluster‑global limits |
| Weaviate | Tenant shards | Data invisible cross‑tenant | Off by default |
| Vespa | Tenant → App → Instance | Billing & quota | Pin zones for hard isolation |
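For the Qdrant row, isolation is a payload field plus a mandatory filter on every query. A minimal sketch, assuming a recent qdrant-client, a cluster at localhost:6333, and a hypothetical collection "pages" with a tenant_id keyword payload:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Mark tenant_id as a tenant key so Qdrant co-locates each tenant's vectors
# (is_tenant is available in recent Qdrant releases; adjust for your version).
client.create_payload_index(
    collection_name="pages",
    field_name="tenant_id",
    field_schema=models.KeywordIndexParams(type="keyword", is_tenant=True),
)

# Every query must carry the tenant filter -- that filter *is* the isolation boundary.
hits = client.search(
    collection_name="pages",
    query_vector=[0.1] * 768,   # stand-in for a real query embedding
    query_filter=models.Filter(
        must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value="acme"))]
    ),
    limit=10,
)
```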

Common Pitfalls & Anti‑Patterns
  • Over‑sharding = lost recall. Keep ≤64 shards per query or adopt routing‑aware hashing.

  • Implicit index rebuilds. Switching the distance metric doubles memory until the swap completes.

  • Serverless shock. Pinecone’s RU/WU pricing is great for POCs; a sustained 100 QPS can out‑price self‑hosting within weeks.

  • Ignoring write spikes. Online fine‑tuning can add 30 % ingest throughput overnight; plan capacity for it.

Best Practices & Success Tips
  • Think tiers, not instances. RAM for hot shards, NVMe for warm, S3 for cold.

  • Quantise early. PQ accuracy loss is negligible at billion‑vector scale.

  • Treat embeddings like logs. Retention policy + auto‑archive to cheap storage.

  • Automate with IaC. Use Terraform modules for shard counts so data scientists can request capacity without kubectl.

  • Observe recall, not just latency. Dropping from 99 → 94 % recall can slip past alerts yet ruin conversions.
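To make the last bullet actionable, here is a sketch of a recall@k probe run against a small labelled sample; the function and the placeholder IDs are illustrations, not any engine’s built‑in API:

```python
# Recall@k probe sketch: compare ANN results against exact (brute-force) results
# for a sample of queries, and alert when recall drifts below your floor.
def recall_at_k(ann_ids: list[list[int]], exact_ids: list[list[int]], k: int = 10) -> float:
    hits = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(ann_ids, exact_ids))
    return hits / (k * len(exact_ids))

# In practice ann_ids come from your vector DB and exact_ids from a brute-force
# scan of the same sample; these are placeholders.
ann_ids = [[1, 2, 3, 5, 8], [4, 6, 7, 9, 10]]
exact_ids = [[1, 2, 3, 4, 8], [4, 6, 7, 9, 11]]

recall = recall_at_k(ann_ids, exact_ids, k=5)
print(f"recall@5 = {recall:.2f}")   # alert if this dips below, say, 0.94
```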

Future‑Proofing – GPUs, SIMD, and Serverless Hybrids
  • GPU nodes shine above 50 k QPS per shard; below that, AVX‑512 CPUs are cheaper.

  • SIMD index builds in FAISS 1.8 cut ingest time by 40 %.

  • Serverless warm pools: keep 10 % of vectors in Pinecone for demos, the bulk in Qdrant BYOC.

  • Regulatory headwinds: EU AI Act will require audit trails → pick engines with WAL + S3 snapshots.

Conclusion – Put It in the Budget Before Marketing Ships the Next Feature

Vector search is moving from prototype to production faster than most infra. Armed with the 4‑step framework and real €/M vector costs, you can defend budgets, architect smart tiers, and sleep the night before launch day.

Frequently Asked Questions

Q1. How many documents are hidden behind 50 M vectors?

A1. 10‑17 M medium‑length docs when chunked at 512 tokens.
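A quick sketch of that arithmetic (the tokens-per-document range standing in for “medium-length” is an assumption):

```python
# FAQ Q1 sketch: vectors -> documents when chunking at 512 tokens.
def docs_behind_vectors(num_vectors: int, tokens_per_doc: int, chunk_tokens: int = 512) -> float:
    chunks_per_doc = max(1, round(tokens_per_doc / chunk_tokens))
    return num_vectors / chunks_per_doc

print(f"{docs_behind_vectors(50_000_000, 1_500) / 1e6:.0f} M docs")  # ~17 M
print(f"{docs_behind_vectors(50_000_000, 2_500) / 1e6:.0f} M docs")  # ~10 M
```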

Q2. Cheapest path to billions of vectors?

A2. PQ or int8 compression + SSD tier; Weaviate with PQ drops to <€1/M.

Q3. Does GPU always pay off?

A3. Only when each shard sustains >50 k QPS; otherwise CPU SIMD wins.

Q4. How to avoid re‑index pain?

A4. Abstract metric choice in config; schedule double‑RAM windows at low‑traffic hours.

Q5. Is serverless ever cheaper?

A5. At ≤5 M vectors & bursty workloads—otherwise self‑host with quantisation.