
Regex is Not Enough: How We Built PII Detection That Doesn't Suck
If you search for "how to detect credit card numbers in Python," you will find this regex:
\b(?:\d[ -]*?){13,16}\b
If you deploy this regex to production, you will also block anything that happens to contain a long enough run of digits:
- Git commit hashes.
- UUIDs.
- Product SKUs.
- Timestamps.
And you will annoy every single user who triggers it.
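To see the problem concretely, here is a minimal sketch that runs the regex above against two harmless strings. The sample inputs (a 13-digit millisecond timestamp and a digits-only SKU) and the helper name `naive_hits` are made up for illustration.

```python
import re

# The regex from above, unchanged.
NAIVE_CARD_RE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def naive_hits(text: str) -> list[str]:
    """Return every substring the naive regex would flag as a card number."""
    return [m.group() for m in NAIVE_CARD_RE.finditer(text)]


# Hypothetical inputs: neither one is a credit card.
samples = [
    "created_at=1700000000000",   # epoch timestamp in milliseconds
    "SKU 4000-1234-5678-90",      # product code that happens to be all digits
]

for sample in samples:
    print(sample, "->", naive_hits(sample))
```

Both strings get flagged, because the pattern only counts digits and separators and knows nothing about what the number is.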
The "False Positive" Trap
PII (Personally Identifiable Information) detection is an exercise in probability.
- "123-45-6789" looks like a Social Security Number.
- It is also a valid Part Number for a dishwasher.
If you block the dishwasher part number, your support bot becomes useless.
Our Hybrid Approach
We realized that regex alone could not tell these cases apart. We needed context.
Layer 1: The Strict Regex (Fast)
We run highly specific regexes first.
Instead of \d{9}, we look for \d{3}-\d{2}-\d{4}.
We validate checksums (Luhn algorithm for credit cards). If the math doesn't check out, it's not a card.
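A minimal sketch of what this layer might look like, not our production code. The names `SSN_RE`, `CARD_CANDIDATE_RE`, `luhn_valid`, and `find_cards` are illustrative, and the 13 to 19 digit length check is an assumption about which card lengths you want to accept.

```python
import re

# Strict SSN shape: exactly 3-2-4 digits separated by hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Card candidates: 13 to 19 digits, each optionally followed by one space or hyphen.
CARD_CANDIDATE_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:  # assumed length range
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Return card-shaped matches that also pass the Luhn check."""
    return [
        m.group()
        for m in CARD_CANDIDATE_RE.finditer(text)
        if luhn_valid(m.group())
    ]
```

With this in place, a millisecond timestamp such as 1700000000000 still matches the candidate pattern but fails the checksum, so it is dropped. The well-known test number 4111 1111 1111 1111 passes.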
Layer 2: Contextual Keywords (Fast)
A number is just a number. A number near a word is a signal.
- "SSN: 123-45-6789" -> High Confidence
- "Part: 123-45-6789" -> Ignore
We built a lightweight scanner that looks for "anchor words" within a 5-token window of the match.
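The scanner can be sketched roughly like this. The keyword sets, the label names, and the rule that a negative anchor wins over a positive one are all assumptions for the example. "Token" here simply means an alphabetic word, which is cruder than a real tokenizer.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
WORD_RE = re.compile(r"[a-z]+")

# Hypothetical anchor words; a real list would be longer and tuned on data.
POSITIVE_ANCHORS = {"ssn", "social", "security"}
NEGATIVE_ANCHORS = {"part", "sku", "model", "order"}
WINDOW = 5  # tokens to inspect on each side of the match


def classify_matches(text: str) -> list[tuple[str, str]]:
    """Label each SSN-shaped match using the words around it."""
    results = []
    for m in SSN_RE.finditer(text):
        before = WORD_RE.findall(text[: m.start()].lower())[-WINDOW:]
        after = WORD_RE.findall(text[m.end():].lower())[:WINDOW]
        context = set(before + after)
        if context & NEGATIVE_ANCHORS:
            label = "ignore"
        elif context & POSITIVE_ANCHORS:
            label = "high"
        else:
            label = "inconclusive"  # hand off to the next layer
        results.append((m.group(), label))
    return results


print(classify_matches("SSN: 123-45-6789"))
print(classify_matches("Part: 123-45-6789"))
```

Anything labeled "inconclusive" is what falls through to the slower layer.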
Layer 3: Named Entity Recognition (Slow but Smart)
For the edge cases (names, addresses), regex fails completely. You can't write a regex for "John Smith."
We deployed a distilled BERT model fine-tuned for PII recognition. It runs only when Layers 1 and 2 are inconclusive. It understands that "I live at 123 Main St" is an address, but "I have 123 main issues" is not.
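The gating logic is the important part, and it can be sketched without the model itself. In this example `detect_pii`, `demo_fast_layers`, and `demo_slow_ner` are hypothetical names. The "NER" function is a stand-in that only recognizes one hard-coded address; in a real system it would wrap the fine-tuned model.

```python
import re
from typing import Callable, List, Tuple

Finding = Tuple[str, str]  # (matched text, label)

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect_pii(
    text: str,
    fast_layers: Callable[[str], Tuple[List[Finding], bool]],
    slow_ner: Callable[[str], List[Finding]],
) -> List[Finding]:
    """Run the cheap layers first; call the model only if they are inconclusive."""
    findings, conclusive = fast_layers(text)
    if conclusive:
        return findings
    return findings + slow_ner(text)


def demo_fast_layers(text: str) -> Tuple[List[Finding], bool]:
    # Stand-in for Layers 1 and 2: conclusive only when an SSN shape is found.
    hits = [(m.group(), "SSN") for m in SSN_RE.finditer(text)]
    return hits, bool(hits)


ner_calls: List[str] = []  # records every time the slow path runs


def demo_slow_ner(text: str) -> List[Finding]:
    # Stand-in for the model: a real NER model would generalize, this does not.
    ner_calls.append(text)
    return [("123 Main St", "ADDRESS")] if "123 Main St" in text else []
```

Passing the layers in as callables keeps the expensive model behind a single branch, which makes it easy to measure how often the slow path actually runs.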
Redaction vs. Blocking
We also changed what we do when we find PII. Blocking is hostile. It tells the user "You did something wrong."
Redaction is helpful. We replace the PII with a token:
"My number is [PHONE_REDACTED]. Can you call me?"
The LLM receives the redacted prompt. It can still understand the intent ("User wants a call") without seeing the data.
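As a rough sketch, redaction is a substitution pass that runs before the prompt is sent. The phone pattern below covers only simple US-style numbers, and the `[SSN_REDACTED]` token and the `redact` helper are assumptions added for the example.

```python
import re

# (replacement token, pattern) pairs, applied in order.
REDACTION_RULES = [
    ("[SSN_REDACTED]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE_REDACTED]", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
]


def redact(text: str) -> str:
    """Replace each detected value with its placeholder token."""
    for token, pattern in REDACTION_RULES:
        text = pattern.sub(token, text)
    return text


print(redact("My number is 555-123-4567. Can you call me?"))
```

The typed token keeps the category of the removed value, which is what lets the model still answer sensibly.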
Conclusion
Data protection isn't about finding patterns. It's about understanding meaning. If you aren't looking at context, you aren't detecting PII—you're just grepping for digits.