Rinha de Backend 2026: Building a Fraud Detection API in 350MB

3 million vectors, 14 dimensions, 1 CPU, and a p99 that went from 13,200ms to 2.47ms.

What is Rinha de Backend?

Rinha de Backend is a Brazilian backend competition where participants submit an API that gets load-tested and scored. The 2026 edition was about fraud detection in financial transactions using exact k-NN vector search.

You ship a Docker Compose setup. The organizers run k6 against it. Your score is a combination of latency (p99) and detection quality (FP/FN). There are limits: 1 CPU unit total, 350MB of memory total, spread across as many containers as you want, with at least one load balancer and two API instances.

I submitted a Go API behind nginx, doing exact k-NN search over 3 million labeled reference vectors. This is the story of how it went from a p99 of 13.2 seconds to 2.47 milliseconds.

The Challenge

The organizers provide a file with 3 million labeled vectors (references.json.gz, ~284MB uncompressed). Each vector has 14 dimensions, representing features like transaction amount, installments, distance from home, MCC risk score, and more. Each vector is labeled "fraud" or "legit".

When a request comes in, you vectorize the payload into the same 14-dimensional space, find the 5 nearest neighbors by squared Euclidean distance (exact search; the square root is skipped because it does not change the ranking), and classify: if 3 or more of the 5 are fraud, deny it.
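The classification rule can be sketched in a few lines of Go. This is a minimal illustration, not the submitted code: the `ref` type, the `isFraud` and `at` names, and the use of float32 are assumptions made for the example.

```go
package main

import "fmt"

const dims = 14

// ref is one labeled reference vector (a hypothetical type for this sketch).
type ref struct {
	vec   [dims]float32
	fraud bool
}

// sqDist returns the squared Euclidean distance between a and b.
func sqDist(a, b [dims]float32) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// isFraud scans every reference, keeps the 5 closest in a small sorted
// buffer, and reports whether at least 3 of them are labeled fraud.
func isFraud(q [dims]float32, refs []ref) bool {
	const k = 5
	type hit struct {
		dist  float32
		fraud bool
	}
	best := make([]hit, 0, k)
	for _, r := range refs {
		h := hit{sqDist(q, r.vec), r.fraud}
		if len(best) == k {
			if h.dist >= best[k-1].dist {
				continue // not closer than the current 5th neighbor
			}
			best[k-1] = h
		} else {
			best = append(best, h)
		}
		// Insertion step: move the new hit forward until the buffer is sorted.
		for i := len(best) - 1; i > 0 && best[i].dist < best[i-1].dist; i-- {
			best[i], best[i-1] = best[i-1], best[i]
		}
	}
	frauds := 0
	for _, h := range best {
		if h.fraud {
			frauds++
		}
	}
	return frauds >= 3
}

// at builds a vector that is x in dimension 0 and zero elsewhere (demo helper).
func at(x float32) [dims]float32 {
	var v [dims]float32
	v[0] = x
	return v
}

func main() {
	refs := []ref{
		{at(0.1), true}, {at(0.2), true}, {at(0.3), true},
		{at(0.4), false}, {at(0.5), false}, {at(9), false},
	}
	fmt.Println("deny:", isFraud(at(0), refs))
}
```

Scanning every reference like this is the brute-force baseline; the rest of the post is about avoiding most of that scan while keeping the same answer.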

The scoring formula is the sum of two independent components:

final_score = score_p99 + score_det

Latency score: 1000 × log₁₀(1000ms / p99). Ceiling at +3000 (p99 ≤ 1ms), floor at −3000 (p99 > 2000ms).
Detection score: Logarithmic function of weighted errors. FP weights 1, FN weights 3, HTTP errors weight 5. If failure rate > 15%, fixed at −3000.

Maximum possible score: +6000. Minimum: −6000.
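To get a feel for how p99 maps to points, the latency component as described above can be written out directly. This is a sketch of the stated formula only; the function name `latencyScore` is made up here, and the detection component is left out because the post does not give its exact form.

```go
package main

import (
	"fmt"
	"math"
)

// latencyScore follows the formula quoted above: 1000 * log10(1000ms / p99),
// capped at +3000 for p99 <= 1ms and fixed at -3000 once p99 exceeds 2000ms.
func latencyScore(p99ms float64) float64 {
	if p99ms > 2000 {
		return -3000
	}
	if p99ms <= 1 {
		return 3000
	}
	return 1000 * math.Log10(1000/p99ms)
}

func main() {
	// p99 values that appear in this post.
	for _, p := range []float64{13200, 17.25, 3.84, 2.47, 1.09} {
		fmt.Printf("p99=%9.2fms -> %+.0f\n", p, latencyScore(p))
	}
}
```

Because the score is logarithmic, halving p99 is worth about 301 points wherever you are on the curve, which is why the last millisecond matters as much as the first second.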

Starting Point: Brute Force

My first implementation was naive brute force. Every request computed squared Euclidean distance against all 3 million vectors. On my dev machine:

Search benchmark: 15.8 ms/op (brute-force, 3M vectors)

Under load test with 100 VUs, p99 hit 13.2 seconds. The score was around −1300. Timeouts everywhere.

Brute force over 3 million 14-dimensional vectors per request, on 1 CPU inside a 350MB budget, was never going to work. The challenge docs say it plainly: "using brute force will likely have very poor performance (O(N × 14)) for this challenge."

I needed an index.

IVF Clustering: The First Leap

I implemented IVF (Inverted File Index) with k-means clustering. Instead of scanning all 3M vectors, pick the nearest cluster centroid and scan only its members.

With K=1024 clusters, average cluster size was ~2,930 vectors. Score: +3,919, p99 much improved.
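The single-probe idea looks roughly like this. It is a simplified sketch: the `ivf` struct and its field names are hypothetical, the k-means training step is omitted, and it returns one nearest neighbor instead of five to keep the example short.

```go
package main

import "fmt"

const dims = 14

type vec [dims]float32

// ivf is a hypothetical minimal inverted-file index.
type ivf struct {
	centroids []vec   // one centroid per cluster (from k-means, not shown)
	members   [][]int // members[c] lists the data IDs assigned to cluster c
	data      []vec
}

func sqDist(a, b vec) float32 {
	var sum float32
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// nearestCentroid returns the index of the centroid closest to q.
func (ix *ivf) nearestCentroid(q vec) int {
	best, bestD := 0, sqDist(q, ix.centroids[0])
	for c := 1; c < len(ix.centroids); c++ {
		if d := sqDist(q, ix.centroids[c]); d < bestD {
			best, bestD = c, d
		}
	}
	return best
}

// nearest scans only the members of the closest cluster and returns the ID
// of the nearest one, or -1 if that cluster is empty.
func (ix *ivf) nearest(q vec) int {
	id, bestD := -1, float32(0)
	for _, m := range ix.members[ix.nearestCentroid(q)] {
		if d := sqDist(q, ix.data[m]); id < 0 || d < bestD {
			id, bestD = m, d
		}
	}
	return id
}

func main() {
	ix := &ivf{
		centroids: []vec{{0}, {10}},
		members:   [][]int{{0, 1}, {2, 3}},
		data:      []vec{{0}, {1}, {9}, {11}},
	}
	fmt.Println("nearest ID:", ix.nearest(vec{8}))
	fmt.Printf("avg cluster size: K=1024 -> %.0f, K=4096 -> %.0f\n",
		3e6/1024.0, 3e6/4096.0)
}
```

Probing a single cluster is not exact on its own: a true neighbor can sit just across a cluster boundary. That gap is what the two-tier design later in the post addresses.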

Then I changed one number:

K := 1024  // → 4096

One line. The artifact rebuild took 32 minutes. K=4096 reduced the average cluster size to ~732 vectors. The score jumped to +4,517, a +598 point gain from a single parameter change. Local search went from ~269μs/op to ~295μs/op (about 10% slower per query due to centroid overhead), but the net effect on p99 under load was strongly positive because smaller clusters meant fewer cache misses.

The Architecture Evolves

Over the next several iterations, I stacked optimizations:

SoA memory layout. Reorganized the 3M vectors from Array-of-Structs (14 floats per vector, scattered) to Structure-of-Arrays (14 contiguous arrays of int8). This turned random memory access into sequential scans. The artifact dropped from ~284MB JSON to ~45MB binary via int8 quantization.
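A small sketch of the layout change, under stated assumptions: the `soa` type and its methods are illustrative names, the input is taken to be already quantized to int8, and the real artifact format is not shown in this post.

```go
package main

import "fmt"

const dims = 14

// soa keeps dimension d of every vector contiguously in cols[d].
type soa struct {
	n    int
	cols [dims][]int8
}

// newSoA transposes row-major vectors (array-of-structs) into columns.
func newSoA(rows [][dims]int8) *soa {
	s := &soa{n: len(rows)}
	for d := 0; d < dims; d++ {
		s.cols[d] = make([]int8, len(rows))
		for i := range rows {
			s.cols[d][i] = rows[i][d]
		}
	}
	return s
}

// sqDistAll writes the squared distance from q to every vector into out
// (len(out) must equal s.n), walking one contiguous column at a time.
func (s *soa) sqDistAll(q [dims]int8, out []int32) {
	for i := range out {
		out[i] = 0
	}
	for d := 0; d < dims; d++ {
		qd := int32(q[d])
		for i, v := range s.cols[d] {
			diff := int32(v) - qd
			out[i] += diff * diff
		}
	}
}

func main() {
	s := newSoA([][dims]int8{{1, 2}, {-3}, {0, 0, 4}})
	out := make([]int32, s.n)
	s.sqDistAll([dims]int8{}, out)
	fmt.Println(out)
}
```

The inner loop touches one contiguous int8 slice per dimension, which is the access pattern that makes the later SIMD kernel straightforward.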

fasthttp migration. Switched from net/http + chi to valyala/fasthttp. Score jumped to +5,183 (p99=3.84ms). Zero-allocation request handling mattered at 900+ req/s.

AVX2 dual-accumulator fused kernel. The hot path was ScanClusterSoA: for each of ~183 blocks (8 vectors each), it called a Go function per dimension, 14 assembly calls per block, with scalar accumulation in between. That was ~2,562 function calls per query.

I replaced it with one assembly function:

// Y0 = even dims accumulator (0, 2, 4, 6, 8, 10, 12)
// Y4 = odd dims accumulator  (1, 3, 5, 7, 9, 11, 13)
VFMADD231PS Y2, Y2, Y0      // fused multiply-add: Y0 += Y2 * Y2

Dual accumulator splits the dependency chain in half (7 hops × 5 cycles = 35 cycles per chain instead of 70 serial). A checkpoint at dim 8 allows early-exiting entire blocks when all partial sums already exceed the worst known distance. Hardware prefetch brings the next block into L1d while FMAs are in flight.

The critical detail: survivors from the float32 fast path get exact int64 re-ranking before TopK insertion. Float32 accumulation has ~0.5% relative error; the re-ranking preserves exactness (FP=1/FN=0).

Score: +5,509 (p99=2.52ms).
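The kernel's structure can be shown in scalar Go, one vector at a time instead of 8 per block. This is only an illustration of the three ideas (two accumulators, a checkpoint after dimension 8, exact integer re-ranking); the function names and the int16 quantized type are assumptions, and the real implementation is AVX2 assembly.

```go
package main

import "fmt"

const dims = 14

// approxSqDist accumulates squared differences in two independent sums
// (even and odd dimensions) so the additions do not form one serial chain.
// After the first 8 dimensions it checks the partial sum against worst, the
// current 5th-best distance, and gives up early if it is already larger.
func approxSqDist(q, v *[dims]float32, worst float32) (float32, bool) {
	var even, odd float32
	for d := 0; d < 8; d += 2 {
		e, o := q[d]-v[d], q[d+1]-v[d+1]
		even += e * e
		odd += o * o
	}
	if even+odd > worst {
		return 0, false // checkpoint: this vector cannot enter the top 5
	}
	for d := 8; d < dims; d += 2 {
		e, o := q[d]-v[d], q[d+1]-v[d+1]
		even += e * e
		odd += o * o
	}
	return even + odd, true
}

// exactSqDist recomputes the distance on quantized integers in int64, so
// survivors of the float32 pass are ranked without rounding error.
func exactSqDist(q, v *[dims]int16) int64 {
	var sum int64
	for d := 0; d < dims; d++ {
		diff := int64(q[d]) - int64(v[d])
		sum += diff * diff
	}
	return sum
}

func main() {
	var q, v [dims]float32
	v[0], v[13] = 10, 2
	_, ok := approxSqDist(&q, &v, 50)
	fmt.Println("survives with worst=50:", ok)
	d, ok := approxSqDist(&q, &v, 200)
	fmt.Println("survives with worst=200:", ok, "dist:", d)
}
```

Only vectors that pass the float32 check pay for the int64 recomputation, so the exact path stays off the common case.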

The Debugging Wall: One Line, 4 Wasted Submissions

This is the part that hurts to recount.

I made 5 changes at once, breaking my own single-variable isolation rule: HAProxy TCP mode, GC=200, GOAMD64=v3, PGO, and warmup. The score dropped from +5,504 to +4,673, and p99 went from 2.55ms to 17.25ms.

I reverted to nginx, but forgot to restore nginx CPU from 0.10 back to 0.20. The other 4 variables kept changing. Every submission blamed a different culprit:

| Tag | p99 | Hypothesis |
|-----|:---:|------------|
| v11 | 17.25ms | "HAProxy TCP mode is the culprit" |
| v12 | 88.48ms | "GC=200 is the culprit" |
| v13 | 88.23ms | "GOAMD64=v3 is the culprit" |
| v14 | 90.79ms | "PGO cross-arch is the culprit" |
| v15 | 91.23ms | All build hypotheses exhausted |

The real difference? Nginx CPU at 0.10 (half the original). 35x slower for lack of 0.10 CPU. The fix was one line in docker-compose.yml:

cpus: "0.10"  # → "0.20"

One line, 4 engine submissions wasted, 3 hours of chasing wrong hypotheses. The lesson: never change more than one variable between tests. Write down every change.

Reverse-Engineering the Number 1

When I did this analysis I was still at +5,183 (ivf-v8, p99=3.84ms). The leader had +5,964 with p99=1.09ms and zero detection errors. I needed to understand what I was missing.

I analyzed the top 2 solutions:

| Competitor | Score | p99 | FP/FN |
|------------|:-----:|:---:|:-----:|
| muanlartins | +5,964 | 1.09ms | 0/0 |
| Joyce | +5,853 | 1.40ms | 0/0 |
| Me (ivf-v8) | +5,183 | 3.84ms | 2/1 |

The breakthrough was muanlartins' two-tier exact IVF:

Fast tier (95.84% of queries):
  RankCentroids(q, K=4096) → pick top-2 centroids
  ScanCluster × 2 (~1,464 vectors total)
  If result confident → DONE

Escalation (4.16% of queries):
  PickNext 32 remaining centroids
  ScanCluster × 32 with AABB-LB + triangle-inequality pruning
  → Exact KNN-5 result

This is exact KNN-5 at approximate-search speed. He pre-computed calibration thresholds offline against the test data. 95.84% of queries complete after scanning just 1,464 vectors. Only when the result is ambiguous does it escalate to scan more centroids.

Another crucial discovery: QuantScale=10000 (not 32767). The dataset has 4 decimal places; k/10000 maps exactly to int16(k) with zero quantization loss. This eliminated the last false negative I was carrying.
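The difference between the two scales is easy to check. A minimal sketch, assuming feature values have at most 4 decimal places and stay within the int16 range at scale 10000 (roughly ±3.2767); `roundTrip` is a name made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// roundTrip quantizes x to an int16 with the given scale and converts it back.
func roundTrip(x, scale float64) (int16, float64) {
	k := int16(math.Round(x * scale))
	return k, float64(k) / scale
}

func main() {
	for _, scale := range []float64{10000, 32767} {
		k, back := roundTrip(0.1234, scale)
		fmt.Printf("scale=%5.0f k=%5d back=%.6f exact=%v\n",
			scale, k, back, back == 0.1234)
	}
}
```

With scale 10000 every 4-decimal value lands on an integer, so distances computed on the quantized data rank neighbors exactly as the original values would. With scale 32767 most values fall between two integers and get rounded, which can flip the order of two near-equal neighbors.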

The mmap Discovery: Simpler is Better

My final optimization almost didn't happen. I tried switching the artifact from //go:embed to unix.Mmap to reduce memory pressure. The first 4 test submissions exited immediately with OOM; the fifth, v22e, produced this:

{
  "p99": "2.47ms",
  "scoring": {
    "breakdown": {"FP": 1, "FN": 0, "TP": 24037, "TN": 30020, "http_errors": 0},
    "failure_rate": "0%",
    "final_score": 5516.20
  }
}

The earlier attempts, v22 through v22d, had all failed with exit code 1. The root cause:

Binary data segment (//go:embed artifact.bin): 86.6 MB  ← charged to cgroup
Mmap (/artifact.bin, PROT_READ MAP_PRIVATE):  86.6 MB  ← file-backed
Total: ~174 MB
Cgroup memory limit: 130 MB
Result: OOM → exit 1

I had both //go:embed AND mmap, two copies of the same 86.6MB. The fix was to remove the embed:

After (mmap-only):
Binary size: 5.8 MB (87MB → 5.8MB, 15x smaller)
Total RSS: ~45 MB
Headroom: 85 MB ✓

The v22e submission scored +5,516 (p99=2.47ms). The GC improvement alone shaved 0.03ms off the tail because live heap dropped from ~96MB (embed) to ~10MB (mmap-only).
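For reference, a read-only private mapping in Go is only a few lines. This sketch uses the standard library's `syscall.Mmap` (Linux and other Unix systems) to stay dependency-free; `unix.Mmap` from golang.org/x/sys/unix takes the same arguments. The `mapFile` and `demo` names and the temp-file stand-in for the artifact are assumptions for the example.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapFile maps the whole file read-only. The pages are file-backed, so the
// file's bytes are not copied onto the Go heap.
func mapFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping stays valid after the descriptor is closed
	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	if st.Size() == 0 {
		return nil, fmt.Errorf("%s is empty", path)
	}
	return syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
		syscall.PROT_READ, syscall.MAP_PRIVATE)
}

// demo writes a small stand-in artifact to a temp file, maps it, and returns
// the first mapped byte and the mapped length.
func demo() (byte, int, error) {
	f, err := os.CreateTemp("", "artifact-*.bin")
	if err != nil {
		return 0, 0, err
	}
	defer os.Remove(f.Name())
	_, werr := f.Write([]byte{42, 1, 2, 3})
	if cerr := f.Close(); werr == nil {
		werr = cerr
	}
	if werr != nil {
		return 0, 0, werr
	}
	data, err := mapFile(f.Name())
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Munmap(data)
	return data[0], len(data), nil
}

func main() {
	first, n, err := demo()
	if err != nil {
		fmt.Println("mmap failed:", err)
		return
	}
	fmt.Println("mapped", n, "bytes, first byte:", first)
}
```

The important part for the OOM above is what is absent: there is no //go:embed of the same file, so the data exists once, as a mapping, and not a second time inside the binary.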

Final Results

| Metric | Value |
|--------|:-----:|
| Final score | +5,516 |
| p99 | 2.47ms |
| False positives | 1 (of 54,057 requests) |
| False negatives | 0 |
| HTTP errors | 0 |
| Resources used | 276MB of 350MB, 1.00 CPU |
| Overall rank | 46th of 230 |

After a post-submission review on May 23, the re-test showed +5,251.25 (p99=2.66ms, 0.02% failures), maintaining position 46/230.

What I Learned

Algorithm beats micro-optimization. IVF clustering gave +598 points from one line. The AVX2 kernel gave +3. The nginx fix gave +823 by restoring a single number.
Change one variable at a time. I wasted 4 submissions and 3 hours because I made 5 changes at once and didn't track the nginx CPU limit.
Reverse-engineer the best. Analyzing muanlartins' source code uncovered two-tier exact IVF, QuantScale=10000, and calibration thresholds, none of which I would have invented on my own.
Know your resource limits. The embed+mmap double-count was obvious in hindsight. I should have measured the memory actually charged to the container's cgroup before shipping.
The best optimization is sometimes deleting code. Removing //go:embed artifact.bin (1 import, 1 var declaration) reduced the binary 15x and fixed the OOM. Zero new code.

A Note on the Tools

I used Opencode with DeepSeek V4 for 99% of the code. One commit was made with Cursor + GPT 5.5 (fixing a nested git repo). The V4 model handled everything from IVF clustering and AVX2 assembly to Docker tuning and mmap. Premium models are not always necessary: context and structure matter more than raw capability.

This was also my first Go project. Coming from Python and C (42 School), I found that Go's single binary output, built-in concurrency, and rich standard library made it ideal for this problem. The hot path ended up with zero allocations per request, using unsafe.Pointer for C-level memory control where needed, while the GC handled everything else.

I stopped because the project was complete (all 5 phases, all 44 requirements, rank 46/230) and I had other projects to work on. The marginal return on further optimization was diminishing. It was genuinely fun, and I'd recommend any backend engineer try a competition like this.