
The Physics of Latency: Why We Don't Use LLMs to Secure LLMs

Everyone wants AI security, but nobody wants to add 500ms to their request. Here is why we bet on classical ML and Rust for our detection engine.

There is a popular architecture for AI security that goes like this:

  1. User sends prompt.
  2. Middleware sends the prompt to GPT-4, asking "Is this safe?".
  3. GPT-4 says "Yes".
  4. Middleware sends prompt to your actual model.
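In code, the pattern above is a simple sequential gate. This toy simulation (function names and latencies are illustrative, not real APIs) makes the cost visible:

```python
import time

def ask_security_llm(prompt: str) -> bool:
    time.sleep(0.5)  # ~500 ms round trip to the judge model (simulated)
    return True      # "Yes, it's safe"

def ask_actual_model(prompt: str) -> str:
    time.sleep(0.5)  # ~500 ms time to first token (simulated)
    return "response"

def handle_request(prompt: str) -> str:
    # The security check sits on the hot path, so its latency
    # adds directly to the user-visible total.
    if not ask_security_llm(prompt):
        raise PermissionError("blocked")
    return ask_actual_model(prompt)

start = time.perf_counter()
handle_request("hello")
elapsed = time.perf_counter() - start  # both calls run in sequence
```

The two calls cannot overlap: you can't start generating before the verdict comes back.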

This architecture is dead on arrival.

The Math Doesn't Work

  • Your Model: 500ms (Time to First Token).
  • Security Model: 500ms.
  • Total Latency: 1s+.

You just doubled your latency. For a voice agent or a real-time copilot, that is unacceptable.

The Hybrid Architecture

We set a budget: 10ms. To hit that, we had to get off the LLM train.

1. The "Dumb" Models (0.5ms)

We run specialized classifiers (XGBoost/Linear) on simple features:

  • Prompt length.
  • Character distribution.
  • Known malicious n-grams.

These catch the "script kiddie" attacks instantly.
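A minimal sketch of such a pre-filter, using the features above (the n-gram list, weights, and bias here are illustrative stand-ins, not our production values):

```python
import math

# Toy list of known malicious n-grams; a real list would be much larger.
SUSPICIOUS_NGRAMS = ("ignore previous", "disregard your instructions", "system prompt")

def extract_features(prompt: str) -> list:
    text = prompt.lower()
    non_alnum = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return [
        min(len(text) / 2048, 1.0),                          # normalized prompt length
        non_alnum / max(len(text), 1),                       # symbol/punctuation ratio
        float(any(ng in text for ng in SUSPICIOUS_NGRAMS)),  # known-bad n-gram hit
    ]

def score(prompt: str, weights=(0.2, 0.5, 3.0), bias=-1.0) -> float:
    # A plain linear model; an XGBoost model slots in the same way,
    # just with more features and learned splits instead of weights.
    z = bias + sum(w * f for w, f in zip(weights, extract_features(prompt)))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid -> pseudo-probability of attack
```

Feature extraction is a handful of string scans, which is why this tier fits comfortably in a sub-millisecond budget.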

2. The Transformers (8ms)

We fine-tuned DeBERTa-v3-small (a 40MB model) on our attack dataset. It runs on CPU. It fits in L3 cache. It understands semantics ("Ignore instructions" vs. "Translate instructions") but is 100x faster than GPT-4.

3. The Verifier (Async)

If a prompt lands in the gray zone (confidence 0.6-0.8), we don't block it on the hot path. We let it through, but we asynchronously send it to a larger model for deeper analysis. If it turns out to be an attack, we ban the user after the fact.
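A minimal asyncio sketch of this pattern (the thresholds match the numbers above; `deep_verify` is a hypothetical stand-in for the real large-model call):

```python
import asyncio

GRAY_LOW, GRAY_HIGH = 0.6, 0.8  # gray-zone confidence band

async def deep_verify(prompt: str, flagged: list) -> None:
    # Stand-in for the slow, large-model analysis, off the hot path.
    await asyncio.sleep(0.01)  # simulate the expensive call
    if "ignore previous" in prompt.lower():
        flagged.append(prompt)  # attack confirmed: ban the user after the fact

async def handle(prompt: str, confidence: float, flagged: list) -> bool:
    """Return True if the prompt is allowed through immediately."""
    if confidence >= GRAY_HIGH:
        return False  # confident detection: block inline
    if confidence >= GRAY_LOW:
        # Suspicious but not certain: serve the request now,
        # verify in the background.
        asyncio.create_task(deep_verify(prompt, flagged))
    return True
```

The key property is that `handle` returns before `deep_verify` finishes, so the expensive analysis never adds to request latency.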

Why Rust?

We wrote the core proxy in Python first. We hit the GIL wall at 500 concurrent requests. We rewrote the hot path in Rust.

  • Memory Safety: Zero segfaults.
  • Concurrency: Tokio handles 10k connections/sec.
  • Python Interop: We bind the Rust core to Python via PyO3 so we can still use the ML ecosystem.

Conclusion

You don't fight fire with fire. You fight fire with water. You don't secure LLMs with more LLMs. You secure them with engineering.