
How We Detect 14 Types of PII Without Machine Learning

PII detection is easy if you don't care about false positives. If you do, it's a nightmare. Here's how we built a high-precision PII detector using layered regex, Luhn validation, preset-based sensitivity, and synthetic data replacement.

If you search for "how to detect credit card numbers in Python," you'll find this regex:

r'\b(?:\d[ -]*?){13,16}\b'

If you deploy this regex to production, you will block:

  1. Git commit hashes
  2. UUIDs
  3. Product SKUs
  4. Timestamps
  5. Phone numbers that happen to be 13+ digits
  6. Any sequence of numbers in a math question

You will annoy every user who triggers any of them. And they will complain, or worse, they'll just leave.
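To make that concrete, here's a short, hypothetical repro showing the naive pattern firing on strings that contain no card number at all (the sample strings are invented):

```python
import re

# The naive pattern from search results: 13-16 digits with optional
# spaces or dashes in between.
NAIVE_CARD = re.compile(r'\b(?:\d[ -]*?){13,16}\b')

# None of these contain a credit card number, but all contain a
# 13-16 digit run, so all of them get flagged.
samples = [
    "build 20250131120000 finished",   # timestamp, 14 digits
    "order id 8471293811025501",       # numeric order id, 16 digits
    "ref 1234 5678 9012 345",          # spaced digit run, 15 digits
]

for s in samples:
    if NAIVE_CARD.search(s):
        print("blocked:", s)
```

Every sample is "blocked," and none of them is a card.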

PII detection is a precision problem. It's easy to catch everything that might be PII. It's brutally hard to catch everything that is PII without also catching things that aren't.

Here's how we built a PII detector that handles 14 data types with precision high enough for production use—without any ML model.

Why We Chose Regex Over ML for PII

This might seem surprising coming from a team that uses a 5-model ML ensemble for injection detection. Why not use ML for PII?

Three reasons:

1. PII has structure. A credit card number has a specific format (13-19 digits, Luhn-valid). A Social Security Number is XXX-XX-XXXX. An email has an @ and a domain. These are structural patterns that regex handles perfectly. ML is overkill for structure.

2. Regex is deterministic. When our regex catches an SSN, we can explain exactly why: "Matched pattern \d{3}-\d{2}-\d{4} at position 47." When ML flags something as PII, the explanation is "the model's internal weights produced a score of 0.73." For compliance-sensitive applications (HIPAA, PCI-DSS), the deterministic explanation is far more useful.

3. Regex is fast. Running 14 regex patterns against a prompt takes microseconds. Running an NER model takes 50-100ms. When PII detection is on every request (not just suspicious ones), microseconds matter.

The tradeoff: regex can't detect unstructured PII like names and addresses. We accept this tradeoff because the cost of false positives on names ("John" is both a common name and a common word) is higher than the benefit of catching them.

The 14 PII Types

Here's every type our detector handles, with the validation logic that keeps precision high.

Tier 1: Always Detected (All Presets)

These types are detected even on the most permissive preset because the risk of exposure is always high.

1. Social Security Number (SSN)

Pattern: r'\b\d{3}-\d{2}-\d{4}\b'
Label: [SSN_REDACTED]

The strict format (XXX-XX-XXXX) eliminates most false positives. We don't match bare 9-digit numbers because those are too ambiguous.

2. Credit Card Number

Pattern: r'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'
        r'[-\s]?\d{4,6}[-\s]?\d{4,5}[-\s]?\d{0,4}\b'
Label: [CARD_REDACTED]
Validation: Luhn algorithm

The regex matches Visa (4xxx), Mastercard (51-55xx), Amex (34/37xx), and Discover (6011/65xx) formats. But the real precision comes from the Luhn checksum—a mathematical validation that eliminates random number sequences. If the Luhn check fails, we don't flag it, even if the format matches.

We validate card lengths of 13, 14, 15, 16, and 19 digits to cover all major card networks.
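The Luhn check itself is only a few lines. This sketch (not our production code) shows the format-then-checksum layering described above:

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in re.sub(r'[\s-]', '', number)]
    # Only card lengths used by the major networks are considered.
    if len(digits) not in (13, 14, 15, 16, 19):
        return False
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# The classic Visa test number passes; flipping one digit fails.
print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

Only candidates that match the format regex *and* pass this checksum get flagged, which is what eliminates random digit runs.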

3. Passport Number

Pattern: r'\b[A-Z]{1,2}\d{6,9}\b'
Label: [PASSPORT_REDACTED]

4. Driver's License

Pattern: r'\b[A-Z]\d{4,8}[-\s]?\d{0,5}\b'
Label: [DL_REDACTED]

5. IBAN (International Bank Account Number)

Pattern: r'\b[A-Z]{2}\d{2}\s?\d{4}\s?\d{4}\s?\d{4}(?:\s?\d{4}){0,4}\b'
Label: [IBAN_REDACTED]

Tier 2: Detected on Moderate and Strict Presets

These types have slightly higher false positive risk but are important for most applications.

6. Email Address

Pattern: r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
Label: [EMAIL_REDACTED]

7. US Phone Number

Pattern: r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'
Label: [PHONE_REDACTED]

8. International Phone Number

Pattern: r'\+\d{1,3}[-.\s]?\d{1,4}[-.\s]?\d{2,4}[-.\s]?\d{2,4}(?:[-.\s]?\d{2,4})?\b'
Label: [PHONE_REDACTED]

Note that this pattern starts directly at the leading +. A \b anchor before the + would never match after a space or at the start of a string, because + is not a word character.

9. IPv4 Address

Pattern: r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
Label: [IP_REDACTED]

10. IPv6 Address

Pattern: r'\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b'
Label: [IP_REDACTED]

11. Date of Birth

Pattern: r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2})\b'
Label: [DOB_REDACTED]

12. Medicare ID

Pattern: r'\b\d{10,11}[A-Za-z]?\b'
Label: [MEDICARE_REDACTED]

13. NHS Number (UK)

Pattern: r'\b\d{3}\s?\d{3}\s?\d{4}\b'
Label: [NHS_REDACTED]

Tier 3: Only Detected on Strict Preset

14. US ZIP Code

Pattern: r'\b\d{5}(?:-\d{4})?\b'
Label: [ZIP_REDACTED]

ZIP codes have high false positive risk (any 5-digit number matches), so they're only detected on the strict preset, where maximum data protection matters more than avoiding user friction.

Preset-Based Sensitivity

Not every application needs the same PII protection level. A healthcare chatbot needs to catch everything. A creative writing tool probably doesn't need to redact ZIP codes.

We use three presets that control which PII types are active:

Preset     | Types Detected                                               | Use Case
Strict     | All 14 types                                                 | Healthcare, finance, government
Moderate   | 13 types (all except ZIP)                                    | General applications, support bots
Permissive | 5 types (SSN, credit card, passport, driver's license, IBAN) | Creative writing, code assistants

The preset is configured per-project in the dashboard. You can also override individual types via the project's preset_overrides configuration.
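A minimal sketch of how the tiers and presets can compose. The type names and the active_types helper are illustrative, not our actual config schema:

```python
# Illustrative type names and helper; the real configuration lives in the
# dashboard and the project's preset_overrides setting.
TIER_1 = {"ssn", "credit_card", "passport", "drivers_license", "iban"}
TIER_2 = {"email", "us_phone", "intl_phone", "ipv4", "ipv6",
          "date_of_birth", "medicare_id", "nhs_number"}
TIER_3 = {"us_zip"}

PRESETS = {
    "strict": TIER_1 | TIER_2 | TIER_3,  # all 14 types
    "moderate": TIER_1 | TIER_2,         # 13 types (no ZIP)
    "permissive": set(TIER_1),           # 5 high-risk types only
}

def active_types(preset, overrides=None):
    """Resolve active PII types for a project: preset first, then
    per-type overrides (True enables a type, False disables it)."""
    types = set(PRESETS[preset])
    for pii_type, enabled in (overrides or {}).items():
        if enabled:
            types.add(pii_type)
        else:
            types.discard(pii_type)
    return types

# A creative-writing project that still wants email redaction:
print(sorted(active_types("permissive", {"email": True})))
```

The override mechanism means a preset is a starting point, not a straitjacket: individual types can be switched on or off per project.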

Redaction vs. Blocking

Here's a design decision that makes a huge difference in user experience: we redact instead of blocking.

When we detect PII in a prompt, we don't reject the entire request. We replace the PII with descriptive tokens and forward the sanitized version to the LLM.

Original prompt:

My SSN is 123-45-6789 and my email is john@example.com.
Can you help me fill out my tax return?

Redacted prompt (sent to LLM):

My SSN is [SSN_REDACTED] and my email is [EMAIL_REDACTED].
Can you help me fill out my tax return?

The LLM receives the redacted version. It can still understand the intent ("user needs help with tax return") without seeing the sensitive data. The user gets their answer. The PII never reaches the LLM provider.

This is fundamentally different from blocking:

  • Blocking says: "You did something wrong. Try again." The user is frustrated and confused.
  • Redaction says: "We protected your data and processed your request." The user gets what they wanted.
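Redaction itself is an ordered substitution. Here's a minimal sketch with two of the patterns above, not the full 14-type detector:

```python
import re

# A minimal two-pattern sketch of the redaction pipeline; the full
# detector applies every pattern that the project's preset activates.
PATTERNS = [
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), "[SSN_REDACTED]"),
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
     "[EMAIL_REDACTED]"),
]

def redact(prompt):
    """Substitute labels for PII; the request itself still goes through."""
    for pattern, label in PATTERNS:
        prompt = pattern.sub(label, prompt)
    return prompt

prompt = "My SSN is 123-45-6789 and my email is john@example.com."
print(redact(prompt))
# -> My SSN is [SSN_REDACTED] and my email is [EMAIL_REDACTED].
```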

Synthetic Data Replacement

For applications where the redaction tokens ([SSN_REDACTED]) would confuse the LLM, we offer synthetic data replacement.

Instead of replacing PII with tokens, we generate realistic-looking fake data that preserves the format:

PII Type    | Original            | Synthetic Replacement
Email       | john@example.com    | user_847@placeholder.com
Phone       | (555) 123-4567      | (555) 000-0001
SSN         | 123-45-6789         | 000-00-0000
Credit Card | 4111 1111 1111 1111 | 4000 0000 0000 0002

The synthetic data preserves:

  • Format: The LLM sees correctly formatted data, so it can still reason about structure.
  • Consistency: The same PII replaced multiple times in the same prompt gets the same synthetic value, so references remain coherent.
  • Invalidity: Synthetic values are intentionally invalid (SSNs starting with 000, Luhn-invalid card numbers) so they can't be confused with real data.

This is particularly useful for RAG applications where the LLM needs to reference specific data points in its response. The response will contain the synthetic data, which the application can optionally reverse-map to the real values if needed.
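A sketch of the consistency property for emails. The numbering scheme here is illustrative; any scheme works as long as each original value maps to exactly one synthetic value:

```python
import re

EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def synthesize_emails(prompt):
    """Replace each distinct email with a format-preserving fake address.
    Repeated occurrences reuse the same synthetic value, so references
    within the prompt stay coherent."""
    mapping = {}

    def replace(match):
        original = match.group(0)
        if original not in mapping:
            # Illustrative numbering; the mapping can also be kept around
            # to reverse-map synthetic values in the LLM's response.
            mapping[original] = f"user_{len(mapping) + 1}@placeholder.com"
        return mapping[original]

    return EMAIL_RE.sub(replace, prompt)

text = "Email john@example.com, then cc john@example.com and mary@example.com."
print(synthesize_emails(text))
# -> Email user_1@placeholder.com, then cc user_1@placeholder.com
#    and user_2@placeholder.com.
```

Both occurrences of john@example.com map to the same synthetic address, so "cc the first person" style references still make sense to the LLM.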

API Key Detection: A Separate Detector

PII isn't the only sensitive data that leaks into prompts. We also detect API keys and credentials with a separate detector that handles 10 patterns, including:

Pattern             | Example
OpenAI API keys     | sk-proj-..., sk-...
AWS Access Keys     | AKIA...
Google OAuth tokens | ya29....
Google API keys     | AIza...
GitHub PATs         | ghp_...
GitHub OAuth        | gho_...
Generic API keys    | api_key = "..."
Bearer tokens       | Authorization: Bearer ...
All detected credentials are replaced with [API_KEY_REDACTED]. This detector runs on both inputs and outputs, catching credentials that users accidentally paste into prompts and credentials that the LLM might hallucinate from its training data.
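A few of these patterns sketched in code. The prefix lengths are based on publicly documented key formats (AWS access key IDs, classic GitHub PATs); treat the exact character classes as assumptions:

```python
import re

# Illustrative subset of the credential patterns; real providers publish
# fuller prefix specs, so these character classes are approximations.
CREDENTIAL_PATTERNS = [
    re.compile(r'\bsk-[A-Za-z0-9_-]{20,}\b'),   # OpenAI-style secret key
    re.compile(r'\bAKIA[0-9A-Z]{16}\b'),        # AWS access key ID (20 chars)
    re.compile(r'\bghp_[A-Za-z0-9]{36}\b'),     # classic GitHub PAT
    re.compile(r'\bBearer\s+[A-Za-z0-9._~+/-]+=*', re.IGNORECASE),  # bearer token
]

def redact_credentials(text):
    """Replace any matched credential with a single redaction label."""
    for pattern in CREDENTIAL_PATTERNS:
        text = pattern.sub("[API_KEY_REDACTED]", text)
    return text

print(redact_credentials("curl -H 'Authorization: Bearer abc.def.ghi'"))
# -> curl -H 'Authorization: [API_KEY_REDACTED]'
```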

Output Scanning: The Other Side

PII detection runs in both directions:

Input scanning catches PII that users are sending to the LLM. This prevents the data from reaching the LLM provider's servers.

Output scanning catches PII that the LLM generates in its response. This can happen when:

  • The LLM hallucinates realistic-looking PII
  • The LLM echoes back PII from its training data
  • A RAG system injects PII from retrieved documents into the response

For streaming responses, output scanning runs in real-time as chunks arrive. If PII is detected in a stream chunk, we can redact it inline without cutting the entire stream.
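Chunk boundaries are the hard part: an SSN can arrive as "123-4" in one chunk and "5-6789" in the next. Here's a simplified single-pattern sketch of a hold-back approach; a production streaming scanner would track partial-match state per pattern and word-boundary context across chunks, where this version just withholds a short tail:

```python
import re

SSN_RE = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
HOLD = 10  # one less than the 11-char SSN format, so a partially
           # arrived SSN stays buffered until it can complete

def stream_redact(chunks):
    """Yield redacted output chunks, holding back a short tail so PII
    split across chunk boundaries is still caught."""
    buffer = ""
    for chunk in chunks:
        # Redact everything visible so far, then emit all but the tail.
        buffer = SSN_RE.sub("[SSN_REDACTED]", buffer + chunk)
        safe, buffer = buffer[:-HOLD], buffer[-HOLD:]
        yield safe
    yield buffer  # the tail is too short to contain a complete SSN

chunks = ["My SSN is 123-4", "5-6789, thanks."]
print("".join(stream_redact(chunks)))
# -> My SSN is [SSN_REDACTED], thanks.
```

The redaction label itself may be split across two emitted chunks, but the concatenated stream the client sees is identical to the non-streaming redaction.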

Conclusion

PII detection isn't about finding patterns. It's about finding the right patterns with enough precision that you don't destroy the user experience.

Our approach is deliberately simple: regex patterns, structural validation (Luhn), preset-based sensitivity, and redaction instead of blocking. No ML model, no NER, no embeddings. Just well-crafted patterns and careful engineering.

The simplicity is the point. Every detection rule is readable, auditable, and deterministic. When we redact a credit card number, you can trace the decision to a specific regex pattern and a passing Luhn check. Try getting that level of explainability from a neural network.

If you're not looking at context, you're not detecting PII—you're just grepping for digits. And if you're blocking instead of redacting, you're not protecting users—you're punishing them for having personal information.