7 Things I Learned Building a Fraud Detection API for Rinha de Backend 2026

First time with Go, AI-assisted, and 22 iterations later.

This project was my first contact with Go, my first backend competition, and the most intense optimization cycle I've ever done. Here are the 7 things that mattered most.

1. The Competition Docs Are a Masterclass

The Rinha de Backend repository has some of the clearest competition docs I've seen. Every file has a purpose:

ARCHITECTURE.md explains the topology, resource limits, and containerization rules with diagrams and examples.
EVALUATION.md breaks down the scoring formula step by step with a worked table showing exactly how FP, FN, and latency translate to points. The "why these weights" section explains the reasoning behind each parameter, which is rare in competition docs.
SUBMISSION.md gives the exact branch structure, file manifest, and info.json format needed. No guesswork.
DATASET.md describes the 3 reference files, the -1 sentinel convention, and confirms that pre-processing is allowed.
DETECTION_RULES.md specifies the exact 14-dimensional vector formula with normalization constants.

This level of documentation meant I never had to reverse-engineer the evaluation. I could focus on engineering. If you're running a competition, this is the template to follow.

2. 99% of the Code Was Written by a Non-Premium Model

One commit, f10c097, was made with Cursor using GPT 5.5; it fixed a nested git repository issue. The other 50+ commits across 22 iterations were executed by Opencode GO using DeepSeek V4 Pro and V4 Flash.

This matters because it challenges the assumption that you need premium models (GPT 5.5, Claude Opus 4.7) for every coding task. The V4 series handled:

IVF k-means clustering implementation
AVX2 assembly kernels (dual-accumulator, VFMADD231PS, PREFETCHT0)
Go unsafe pointer casting and memory layout optimization
Docker multi-stage builds and nginx tuning
mmap syscall integration and cgroup-aware memory management
k6 test execution and results analysis

The premium model was used once, for a git workflow problem. Everything else, including the most technically demanding parts, was executed by a model that costs a fraction per token.

Capability is not the same as cost. For structured engineering projects with clear context, mid-range models are often sufficient, though the specifics of each task still matter. Premium models justify their cost when a task involves ambiguity, an unprecedented problem, or architectural discovery. For execution, in my opinion, mid-range models are capable of getting the job done.

3. What I Would Do Differently

Single-variable isolation from day one. The most expensive mistake of the project was changing 5 variables at once in the Phase 3 system optimizations (v11). I spent 4 engine submissions and 3 hours chasing wrong hypotheses because nginx CPU was cut from 0.20 to 0.10 and I never noticed. The fix was one line in docker-compose.yml. The rule is simple: one change per test. Every time I violated it, I paid.

Measure virtual memory in the container before shipping. The embed+mmap double-count caused 4 consecutive OOM exits. Having both //go:embed artifact.bin (86.6MB in the ELF data segment) and unix.Mmap (86.6MB file-backed) created ~183MB of virtual address space against a 130MB cgroup limit. A single docker stats or cat /sys/fs/cgroup/memory/memory.usage_in_bytes before the first engine submission would have revealed the problem immediately.
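The check described above can also be done from inside the Go process at startup. The following is a minimal sketch, not the project's code: the helper names `parseCgroupBytes` and `readFirst` are hypothetical, and it assumes the standard cgroup mount at /sys/fs/cgroup, trying the cgroup v2 file names first and then the v1 path quoted above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCgroupBytes parses the content of a cgroup memory file.
// cgroup v2 writes "max" when no limit is set; ok is false in that case
// and for any content that is not a non-negative integer.
func parseCgroupBytes(content string) (uint64, bool) {
	s := strings.TrimSpace(content)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

// readFirst returns the parsed value of the first path that can be read
// and parsed, so the same call works on cgroup v2 and v1 hosts.
func readFirst(paths ...string) (uint64, bool) {
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		if n, ok := parseCgroupBytes(string(b)); ok {
			return n, true
		}
	}
	return 0, false
}

func main() {
	usage, okUsage := readFirst(
		"/sys/fs/cgroup/memory.current",               // cgroup v2
		"/sys/fs/cgroup/memory/memory.usage_in_bytes", // cgroup v1
	)
	limit, okLimit := readFirst(
		"/sys/fs/cgroup/memory.max",                   // cgroup v2
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
	)
	if !okUsage || !okLimit {
		fmt.Println("cgroup memory usage or limit not available")
		return
	}
	const mib = 1024 * 1024
	fmt.Printf("memory: %.1f MiB used of %.1f MiB limit\n",
		float64(usage)/mib, float64(limit)/mib)
}
```

Logging this once after the index is loaded would make a double-counted artifact visible in the container logs before a submission, assuming the process survives long enough to print it.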

Profile before optimizing. I spent 3 sessions on AVX2 assembly (v6, v7b, v7c, v7d) before understanding that the real bottleneck was vector scan count, not arithmetic width. The two-tier exact IVF (v10) reduced vectors scanned from 5,856 to 1,464 per query and gained more points than all SIMD work combined. muanlartins proved that float32 AVX2 on Haswell can hit p99=1.09ms, so the bottleneck was never the instruction set; it was scanning too many vectors.
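To illustrate why scan count dominates, here is a toy sketch of an IVF probe in plain Go. It is not the competition code: the `ivfIndex` type, the `search` method, and the 2-D demo data are all hypothetical, and the escalation criterion of the two-tier design is not described in this post, so the sketch only shows how `nprobe` controls the number of vectors scanned.

```go
package main

import (
	"fmt"
	"sort"
)

// sqDist returns the squared Euclidean distance between a and b.
func sqDist(a, b []float32) float32 {
	var s float32
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// ivfIndex is a toy inverted-file index: every vector id is assigned
// to exactly one cluster list, and each cluster has one centroid.
type ivfIndex struct {
	centroids [][]float32
	lists     [][]int // lists[c] holds the vector ids of cluster c
	vectors   [][]float32
}

// search scans only the nprobe clusters whose centroids are closest to q.
// It returns the nearest vector id found and how many vectors were scanned.
func (ix *ivfIndex) search(q []float32, nprobe int) (best, scanned int) {
	order := make([]int, len(ix.centroids))
	for i := range order {
		order[i] = i
	}
	sort.Slice(order, func(i, j int) bool {
		return sqDist(q, ix.centroids[order[i]]) < sqDist(q, ix.centroids[order[j]])
	})
	if nprobe > len(order) {
		nprobe = len(order)
	}
	best = -1
	var bestD float32
	for _, c := range order[:nprobe] {
		for _, id := range ix.lists[c] {
			d := sqDist(q, ix.vectors[id])
			scanned++
			if best == -1 || d < bestD {
				best, bestD = id, d
			}
		}
	}
	return best, scanned
}

// demoIndex builds four 2-D clusters with two vectors each.
func demoIndex() *ivfIndex {
	return &ivfIndex{
		centroids: [][]float32{{0, 0}, {10, 0}, {0, 10}, {10, 10}},
		lists:     [][]int{{0, 1}, {2, 3}, {4, 5}, {6, 7}},
		vectors: [][]float32{
			{0, 1}, {1, 0}, {10, 1}, {9, 0},
			{0, 9}, {1, 10}, {10, 9}, {9, 10},
		},
	}
}

func main() {
	ix := demoIndex()
	q := []float32{0.9, 0.2}
	for _, nprobe := range []int{1, 4} {
		id, scanned := ix.search(q, nprobe)
		fmt.Printf("nprobe=%d nearest=%d scanned=%d\n", nprobe, id, scanned)
	}
}
```

In this toy data both settings find the same neighbor, but probing one cluster scans a quarter of the vectors. A faster distance kernel speeds up each scanned vector; a smaller probe set removes vectors from the loop entirely.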

Check infrastructure first when debugging regressions. Before hypothesizing about compiler flags, GC behavior, or PGO inlining, run git diff <good-commit>..HEAD -- docker-compose.yml. Resource limits are invisible to go test, go vet, and local benchmarks. In my case, a 0.10 CPU difference caused a 35x p99 regression and generated 4 false root cause analyses across 600+ lines of session documentation.

Most used commands across the project:

Git:

git commit -m "feat/perf/fix: <description>"        # every session
git push origin master                                # sync main repo
git checkout --orphan source-clean                    # create clean source branch
git push submission-target source-clean:main --force  # push source
git push <url> HEAD:submission --force                # push submission (temp-dir method)
git diff <baseline>..HEAD -- docker-compose.yml       # find infra regressions
git checkout <commit> -- <file>                       # partial revert
git log --oneline -3                                  # quick context

Docker:

docker compose down -v && docker compose up -d                 # restart stack
docker build --no-cache -t <tag> .                             # build (no cache for artifact)
docker run --rm --network=host -v $(pwd)/k6:/k6 grafana/k6 ... # smoke test
docker push ghcr.io/matheus896/rinha-backend-2026:<tag>        # push to registry

Make:

make test           # go test ./...
make bench          # go test -bench=. -count=10 ./...
make k6-local       # k6 smoke test
make docker-build   # docker build

4. Why I Stopped

The project is complete. All 5 phases finished, all 44 requirements implemented. The final score was +5,516 (p99=2.47ms, FP=1/FN=0, 0 HTTP errors), with a post-review re-test of +5,251.25 (p99=2.66ms, 0.02% failures), ranking 46th of 230 when I last checked on May 31. The final result may be different once the competition is over.

I stopped because I have a job and other projects at 42sp, not because there was nothing left to do; there always is. The p99 ceiling of +3000 at 1ms was still out of reach, and the #1 solution scored +5,964. But the marginal return on another optimization cycle was diminishing, and the time investment was real.

It was also genuinely fun. The combination of hard constraints (1 CPU, 350MB), a clear scoring function, and visible progress across 22 iterations made it addictive. I'd recommend any backend engineer try a competition like this.

5. Tools and Resources That Made It Possible

Skills were loaded in the sessions. The most used:

| Skill | Where It Applied |
|-------|------------------|
| golang-benchmark | Before/after comparisons with benchstat in every iteration |
| golang-performance | Haswell cache hierarchy, SIMD patterns, GC tuning methodology |
| golang-safety | unsafe.Pointer wrappers, keeping mmap'd memory alive |
| golang-testing | TDD table-driven tests, recall validation |
| golang-concurrency | Early sync.Pool and goroutine patterns |
| golang-code-style | Idiomatic Go conventions across the codebase |
| golang-error-handling | Containment pattern for panic recovery |
| golang-observability | pprof profiles, GC pause analysis |
| nginx-configuration | keepalive pools, Unix socket upstream, buffer tuning |
| docker-compose-orchestration | Resource limits, tmpfs volumes, health checks |
| multi-stage-dockerfile | Initial Dockerfile optimization, .dockerignore |
| k6 | Smoke, load, and official test execution |
| find-docs | npx ctx7 for Go GC tuning, mmap, cgroup memory |
| verification-before-completion | Anti-regression checklist before every submission |

Reusable prompt files were generated for each major iteration (.planning/prompts/next-llm-*.md). These captured the full project state (current score, last changes, active rules, anti-regression constraints) so each session started with complete context instead of rebuilding it. This pattern alone saved hours of repeated explanation.

Reference competitors' code was stored in docs/ for analysis. Reviewing how muanlartins, Joyce, and jairo solved the same problem was the highest-leverage research activity.

6. First Time With Go: Impressions vs Python and C

This was my first Go project. My background is Python (data, APIs) and C (42 School: pointers, memory management, data structures).

What surprised me positively:

Single binary output. go build produces a single binary, fully static when cgo is disabled. No virtualenv, no node_modules, no JVM. For Docker deployment this is transformative: the runtime image is 15MB.
Concurrency without complexity. Goroutines and channels are simpler than Python's asyncio (callback chains) and C's pthreads (manual thread management). GOMAXPROCS=1 was actually optimal for this workload, with no concurrency overhead at all.
Rich standard library. net/http, encoding/json, testing, unsafe, runtime/pprof: everything needed for this project was built in. Zero third-party dependencies except valyala/fasthttp (which replaced net/http later) and golang.org/x/sys (for mmap).
Cross-compilation is trivial. GOOS=linux GOARCH=amd64 GOAMD64=v3 go build, done. No build matrix, no platform-specific toolchains.
Zero-allocation patterns. Go's unsafe.Pointer gives C-level memory control when needed, but the GC handles the common case. The hot path had 0 allocs per request after optimization.
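A zero-allocation claim like the one in the last item can be checked with testing.AllocsPerRun from the standard library. This is a minimal sketch under stated assumptions: `sqDist14` is a hypothetical stand-in for a hot-path function over the 14-dimensional vectors, not the project's kernel.

```go
package main

import (
	"fmt"
	"testing"
)

// sqDist14 computes the squared Euclidean distance between two
// 14-dimensional vectors. Taking array pointers avoids slice headers
// and keeps the function free of heap allocations.
func sqDist14(a, b *[14]float32) float32 {
	var s float32
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

func main() {
	var a, b [14]float32
	for i := range a {
		a[i] = float32(i)
		b[i] = float32(i) + 1
	}

	var sink float32 // keeps the result live so the call is not optimized away
	allocs := testing.AllocsPerRun(1000, func() {
		sink += sqDist14(&a, &b)
	})
	fmt.Println("allocations per call:", allocs, "sink:", sink)
}
```

The same measurement is available in benchmarks through b.ReportAllocs(), which fits the make bench workflow listed earlier.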

What I missed from C (42): SIMD intrinsics in the compiler (_mm256_fmadd_ps instead of writing assembly), and the absolute performance ceiling of manual memory management. But Go's GC is good enough: the mmap optimization moved the 86MB index off-heap, and GC pauses became negligible at p99=2.47ms.

Go splits the difference between Python's productivity and C's performance. I would use it again for any CPU-bound API service with tight resource constraints.

7. Inspirations From Other Competitors

muanlartins (#1 overall, +5,964). The biggest influence. His two-tier exact IVF was the structural breakthrough: a fast path with NPROBE=2 covers 95.84% of queries at 1,464 vectors scanned, with escalation to full exact search for only 4.16%. His QuantScale=10000 discovery eliminated the last quantization error. And his dual-accumulator AVX2 kernel proved that float32 SIMD on Haswell can hit p99=1.09ms. I borrowed the architecture, not the code: I wrote my own assembly kernel with int64 re-ranking to handle the quantization difference.
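One way to read the quantization and int64 re-ranking idea is sketched below. Only the scale value 10000 comes from the text above; the int16 storage type, the assumption that features lie in [-1, 1], and the function names `quantize` and `sqDistInt64` are illustrative guesses, not a description of either implementation.

```go
package main

import (
	"fmt"
	"math"
)

// quantScale is the scale named in the post. With inputs assumed to lie
// in [-1, 1], scaled values stay within [-10000, 10000] and fit in int16.
const quantScale = 10000

// quantize maps a float32 feature to a fixed-point int16.
// Inputs outside roughly ±3.27 would overflow int16; the sketch assumes
// normalized features and does not guard against that.
func quantize(v float32) int16 {
	return int16(math.Round(float64(v) * quantScale))
}

// sqDistInt64 computes an exact squared distance over quantized vectors.
// int64 is needed for the sum: with 14 dimensions and a maximum
// per-dimension difference of 20000, the total can reach
// 14 * 20000^2 = 5.6e9, which exceeds the int32 range.
func sqDistInt64(a, b []int16) int64 {
	var s int64
	for i := range a {
		d := int64(a[i]) - int64(b[i])
		s += d * d
	}
	return s
}

func main() {
	a := []int16{quantize(0.5), quantize(-1)}
	b := []int16{quantize(0.25), quantize(1)}
	fmt.Println(a, b, sqDistInt64(a, b))
}
```

Integer distances give a total order with no floating-point rounding, which is one plausible reason to re-rank candidates this way after an approximate float32 pass.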

Joyce (#3 overall, +5,853). Her fasthttp configuration and handler pattern showed what a zero-allocation HTTP layer looks like. I studied the DisableHeaderNamesNormalizing, NoDefaultDate, and NoDefaultServerHeader flags to understand the configuration pattern, then implemented my own handler logic. Her scanBlocksAVX2 assembly with staged early exit at dims 4, 6, 8, and 14 influenced my dual-accumulator checkpoint design.

jairo (Rust, +5,932). Took a different approach: subset 100K vectors + brute-force + AVX2. Instead of indexing all 3M vectors like IVF, he picked 100K and ran exact search with AVX2 acceleration. This validated that aggressive subsetting plus SIMD is a viable alternative to IVF clustering. I didn't copy the approach, but it confirmed that the vector scan count (not the index structure) is the primary lever.

Study competitors to understand what is possible, then implement it in your own way.