Regex is Not Enough: How We Built PII Detection That Doesn't Suck

PII detection is easy if you don't care about false positives. If you do, it's a nightmare. Here is how we combined Regex, Context, and ML to catch sensitive data without blocking legitimate users.

If you search for "how to detect credit card numbers in Python," you will find this regex: \b(?:\d[ -]*?){13,16}\b

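This pattern accepts any run of 13 to 16 digits, with or without spaces and hyphens in between. If you deploy it to production, you will also block digit-heavy strings such as: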

  1. Git commit hashes.
  2. UUIDs.
  3. Product SKUs.
  4. Timestamps.

And you will annoy every single user who triggers it.
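
To make the failure mode concrete, here is a small sketch that runs the regex above against a few strings. The sample values and the helper name `naive_hits` are made up for illustration; they are not taken from any production traffic.

```python
import re

# The regex quoted above, unchanged.
NAIVE_CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

# Illustrative inputs: one card-shaped string and two harmless ones.
samples = {
    "card": "4111 1111 1111 1111",            # well-known test card number
    "epoch_ms": "created_at=1700000000000",   # 13-digit millisecond timestamp
    "sku": "SKU 4006381333931",               # 13-digit barcode-style SKU
}


def naive_hits(text: str) -> list[str]:
    """Return every substring the naive regex would flag."""
    return [m.group() for m in NAIVE_CARD.finditer(text)]


for label, text in samples.items():
    print(label, naive_hits(text))
```

All three samples produce a hit, so the pattern alone cannot tell a card from a timestamp or a SKU.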

The "False Positive" Trap

PII (Personally Identifiable Information) detection is an exercise in probability.

  • "123-45-6789" looks like a Social Security Number.
  • It is also a valid Part Number for a dishwasher.

If you block the dishwasher part number, your support bot becomes useless.

Our Hybrid Approach

We realized that regex alone could not solve this. We needed Context.

Layer 1: The Strict Regex (Fast)

We run highly specific regexes first. Instead of \d{9}, we look for \d{3}-\d{2}-\d{4}. We validate checksums (Luhn algorithm for credit cards). If the math doesn't check out, it's not a card.
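
A minimal sketch of the checksum step, assuming candidates are first pulled out with a loose digit pattern and then filtered with the Luhn algorithm. The names `CARD_CANDIDATE`, `luhn_valid`, and `find_card_numbers` are hypothetical, and the 13 to 19 digit length range is an assumption rather than the exact rule we use in production.

```python
import re

# Loose candidate pattern (an assumption): 13 to 19 digits,
# each optionally followed by a single space or hyphen.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # every second digit, counting from the right
            d *= 2
            if d > 9:        # same as summing the two digits of the product
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return candidates that have a plausible length and pass the Luhn check."""
    hits = []
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            hits.append(m.group())
    return hits
```

With this filter, the test number 4111 1111 1111 1111 is kept, while a 13-digit timestamp such as 1700000000000 is dropped because its checksum fails. Luhn still passes roughly one random digit string in ten, which is why the later layers exist.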

Layer 2: Contextual Keywords (Fast)

A number is just a number. A number near a word is a signal.

  • "SSN: 123-45-6789" -> High Confidence
  • "Part: 123-45-6789" -> Ignore

We built a lightweight scanner that looks for "anchor words" within a 5-token window of the match.
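
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not our scanner: the anchor word lists, the whitespace tokenizer, and the labels "high", "ignore", and "inconclusive" are all hypothetical.

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical anchor lists; a real deployment would tune these.
POSITIVE_ANCHORS = {"ssn", "social", "security"}
NEGATIVE_ANCHORS = {"part", "sku", "model", "order"}
WINDOW = 5  # tokens to inspect on each side of the match


def score_matches(text: str) -> list[tuple[str, str]]:
    """Label each SSN-shaped match using anchor words near it."""
    # Whitespace tokenization, keeping character offsets for each token.
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]
    results = []
    for match in SSN_PATTERN.finditer(text):
        # Index of the token that contains the start of the match.
        idx = next(i for i, (s, e, _) in enumerate(tokens) if s <= match.start() < e)
        window = tokens[max(0, idx - WINDOW): idx + WINDOW + 1]
        # Normalize: strip punctuation and lowercase before comparing.
        words = {re.sub(r"\W", "", tok).lower() for _, _, tok in window}
        if words & NEGATIVE_ANCHORS:
            label = "ignore"
        elif words & POSITIVE_ANCHORS:
            label = "high"
        else:
            label = "inconclusive"
        results.append((match.group(), label))
    return results
```

In this sketch a negative anchor wins over a positive one, which is a design choice that favors fewer false positives; matches with no anchor nearby are left as "inconclusive" for a later layer to decide.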

Layer 3: Named Entity Recognition (Slow but Smart)

For the edge cases (names, addresses), regex fails completely. You can't write a regex for "John Smith."

We deployed a distilled BERT model fine-tuned for PII recognition. It runs only when Layer 1 and 2 are inconclusive. It understands that "I live at 123 Main St" is an address, but "I have 123 main issues" is not.
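
The gating logic, where the expensive model runs only when the fast layers cannot decide, might look something like this. The model itself is not shown: `slow_ner` stands in for it as a plain callable, and the function names and verdict strings are assumptions made for the example.

```python
from typing import Callable, List, Tuple


def detect_pii(
    text: str,
    fast_verdict: Callable[[str], str],
    slow_ner: Callable[[str], List[str]],
) -> Tuple[str, bool]:
    """Return (verdict, used_slow_model).

    fast_verdict is assumed to return "pii", "clean", or "inconclusive".
    slow_ner is assumed to return a list of detected PII entities.
    """
    verdict = fast_verdict(text)
    if verdict != "inconclusive":
        # Layers 1 and 2 were decisive, so the model is never called.
        return verdict, False
    entities = slow_ner(text)
    return ("pii" if entities else "clean"), True


# Stand-in callables so the sketch runs without a real model.
def demo_fast(text: str) -> str:
    return "pii" if "SSN:" in text else "inconclusive"


def demo_ner(text: str) -> List[str]:
    return ["ADDRESS"] if "I live at" in text else []


print(detect_pii("SSN: 123-45-6789", demo_fast, demo_ner))
print(detect_pii("I live at 123 Main St", demo_fast, demo_ner))
print(detect_pii("I have 123 main issues", demo_fast, demo_ner))
```

Keeping the model behind a callable also makes the fast path easy to test on its own.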

Redaction vs. Blocking

We also changed what we do when we find PII. Blocking is hostile. It tells the user "You did something wrong."

Redaction is helpful. We replace the PII with a token:

"My number is [PHONE_REDACTED]. Can you call me?"

The LLM receives the redacted prompt. It can still understand the intent ("User wants a call") without seeing the data.
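
A minimal redaction sketch, assuming detection has already been reduced to a set of labeled patterns. The patterns below are deliberately simple US-style examples and the `redact` helper is hypothetical; in practice the spans would come from the layered detector rather than from bare regexes.

```python
import re

# Simplified illustrative patterns, checked in order.
REDACTION_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ .-]?\d{3}[ .-]?\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each detected span with a typed placeholder token."""
    for label, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[{label}_REDACTED]", text)
    return text


print(redact("My number is 555-867-5309. Can you call me?"))
# → My number is [PHONE_REDACTED]. Can you call me?
```

Typed placeholders keep the sentence readable for the model, which still knows that a phone number was present even though it never sees the digits.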

Conclusion

Data protection isn't about finding patterns. It's about understanding meaning. If you aren't looking at context, you aren't detecting PII—you're just grepping for digits.