PII · Privacy · Engineering

Beyond Redaction: Why We Replace PII With Synthetic Data

Redacting PII with [SSN_REDACTED] breaks the LLM's ability to reason about data. Replacing it with realistic-looking fake data preserves the reasoning while eliminating the privacy risk. Here's how synthetic data replacement works and when to use it.


When PromptGuard detects a Social Security Number in a prompt, the default behavior is redaction:

Input:  "My SSN is 123-45-6789. Can you help with my tax return?"
Output: "My SSN is [SSN_REDACTED]. Can you help with my tax return?"

This works well for most use cases. The LLM sees the redacted prompt, understands the user's intent ("help with tax return"), and generates a useful response. The SSN never reaches the LLM provider.

But for some applications, redaction tokens break things.

When Redaction Fails

Problem 1: Format-Dependent Reasoning

Some applications need the LLM to reason about the format of data, not just its existence.

Input:  "Is this SSN valid: 123-45-6789?"
Redacted: "Is this SSN valid: [SSN_REDACTED]?"
LLM response: "I can see you've provided an SSN, but I cannot
               verify it as the number has been redacted."

The user wanted format validation. The redaction destroyed the information the LLM needed to answer. The user experience suffers.

Problem 2: Reference Consistency

In multi-turn conversations or document analysis, the same PII might appear multiple times:

Turn 1: "My email is john@example.com. Send the report there."
Turn 2: "Actually, send it to john@example.com instead of the
         other address."

Redacted Turn 1: "My email is [EMAIL_REDACTED]. Send the report there."
Redacted Turn 2: "Actually, send it to [EMAIL_REDACTED] instead of
                  the other address."

The LLM can't distinguish "the first email" from "the second email" because both are [EMAIL_REDACTED]. And if the conversation involved two different addresses, the LLM would lose the ability to track which is which.

Problem 3: Downstream Processing

If your pipeline processes the LLM's response programmatically—extracting structured data, populating forms, or triggering workflows—redaction tokens can break parsers:

# LLM response contains redaction tokens
response = "The customer's phone number is [PHONE_REDACTED]"

# Downstream parser expects a phone number format
phone = extract_phone(response)  # Returns None — can't parse token

The Synthetic Data Solution

Instead of replacing PII with tokens, we replace it with realistic-looking fake data that preserves the format but is guaranteed to be non-sensitive:

Input:  "My SSN is 123-45-6789 and email is john@example.com"
Synthetic: "My SSN is 000-00-0001 and email is user_847@placeholder.com"

The LLM sees properly formatted data. It can reason about the structure ("that looks like a valid SSN format"). It can track references across turns ("user_847@placeholder.com" is consistent within the session). Downstream parsers can extract the synthetic values without breaking.

But the data is fake. It can't be traced back to a real person. The privacy risk is zero.

How Synthetic Replacement Works

PromptGuard's SyntheticDataGenerator creates replacement values with three properties:

1. Format Preservation

Synthetic values match the format of the original PII type:

PII Type        Original               Synthetic
Email           john.doe@company.com   user_847@placeholder.com
US Phone        (555) 123-4567         (555) 000-0001
SSN             123-45-6789            000-00-0001
Credit Card     4111 1111 1111 1111    4000 0000 0000 0001
Date of Birth   03/15/1985             01/01/2000
IPv4            192.168.1.100          10.0.0.1

The LLM sees data that looks like real data. Its reasoning works correctly because the format is valid.
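Format preservation can be sketched with a few per-type generators. Everything below is illustrative — the function names and counter scheme are assumptions, not PromptGuard's actual SyntheticDataGenerator API:

```python
import itertools

# Illustrative format-preserving generators (hypothetical names).
# A shared counter keeps generated values unique.
_counter = itertools.count(1)

def synthetic_email() -> str:
    # Keeps the local@domain shape; @placeholder.com marks it as fake
    return f"user_{next(_counter)}@placeholder.com"

def synthetic_ssn() -> str:
    # Valid NNN-NN-NNNN shape, but the 000 area number is invalid per SSA
    return f"000-00-{next(_counter):04d}"

def synthetic_us_phone() -> str:
    # Valid visual format; the 000 exchange is invalid under NANP
    return f"(555) 000-{next(_counter):04d}"

print(synthetic_email())    # e.g. user_1@placeholder.com
print(synthetic_ssn())      # e.g. 000-00-0002
```

Each generator emits a value that passes a format check for its PII type while remaining trivially identifiable as synthetic.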

2. Intentional Invalidity

Synthetic values are intentionally chosen to be obviously fake to any human or system that checks:

  • SSNs: Start with 000 (invalid per SSA rules — no real SSN starts with 000)
  • Credit cards: Fail the Luhn checksum (can't be charged)
  • Emails: Use @placeholder.com domain (clearly synthetic)
  • Phone numbers: Use a 000 exchange, as in (555) 000-0001 (invalid per NANP, which forbids exchange codes starting with 0)

This prevents accidental confusion between synthetic and real data. If a synthetic SSN somehow ends up in a database or report, it's immediately identifiable as fake.
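These properties can be verified with ordinary validators. The sketch below uses the standard Luhn algorithm and the SSA area-number rules; the card number 4000 0000 0000 0001 is simply one example of a Luhn-failing value:

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum used by real card networks."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def ssn_area_plausible(ssn: str) -> bool:
    """SSA rules: the area number can't be 000, 666, or 900-999."""
    area = int(ssn.split("-")[0])
    return area not in (0, 666) and not 900 <= area <= 999

print(ssn_area_plausible("000-00-0001"))      # False: flagged as synthetic
print(luhn_valid("4000 0000 0000 0001"))      # False: checksum fails
print(luhn_valid("4111 1111 1111 1111"))      # True: a well-known valid format
```

Any downstream system that runs these standard checks will reject the synthetic values before they can be mistaken for real data.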

3. Session Consistency

The same PII value replaced multiple times within the same request gets the same synthetic replacement:

Input:  "My email is john@example.com. Please confirm john@example.com is correct."
Synthetic: "My email is user_847@placeholder.com. Please confirm user_847@placeholder.com is correct."

Both instances of john@example.com map to the same user_847@placeholder.com. This preserves reference consistency—the LLM understands these are the same entity.
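Session consistency falls out of keeping a per-request mapping from original to synthetic values. A minimal sketch, covering emails only (the class name and regex are illustrative assumptions):

```python
import itertools
import re

class SessionReplacer:
    """Per-request replacer: the same original value always gets the
    same synthetic value. Hypothetical sketch, emails only."""

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}   # original -> synthetic
        self._ids = itertools.count(1)

    def replace_emails(self, text: str) -> str:
        def _sub(match: re.Match) -> str:
            original = match.group(0)
            if original not in self._mapping:
                self._mapping[original] = f"user_{next(self._ids)}@placeholder.com"
            return self._mapping[original]
        return re.sub(r"[\w.+-]+@[\w-]+\.\w+", _sub, text)

replacer = SessionReplacer()
print(replacer.replace_emails(
    "My email is john@example.com. Please confirm john@example.com is correct."
))
# Both occurrences become user_1@placeholder.com
```

Because the mapping is a plain dict keyed by the original value, repeated occurrences hit the same entry, and a second, different address would get user_2@placeholder.com.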

When to Use Synthetic vs. Redaction

Scenario                                       Recommended Approach   Why
Support bot (general)                          Redaction              LLM doesn't need to reason about PII format
Data validation ("is this SSN valid?")         Synthetic              LLM needs format information
Multi-turn conversations with PII references   Synthetic              Reference consistency matters
RAG with document analysis                     Synthetic              Preserves document structure
Healthcare (PHI de-identification)             Synthetic              Clinical reasoning needs format-valid data
Compliance logging                             Redaction              Simpler; no synthetic values to mistake for real data
High-throughput, low-latency applications      Redaction              Slightly less processing overhead

The Re-identification Question

A common concern: "If you replace PII with synthetic data, can the response be 're-identified' by mapping synthetic values back to real ones?"

The answer is: only if you choose to, and only on your infrastructure.

The mapping between real and synthetic values exists only in the request context—it's never logged, never stored (especially with zero retention mode), and never sent to the LLM provider. If your application needs to reverse-map synthetic values back to real ones in the response (e.g., to send an actual email), it can do so in your own backend after the LLM response comes back:

# 1. Scan and replace PII with synthetic data
scan_result = pg.security.scan(content=user_input, content_type="user_input")

# 2. Send synthetic version to LLM
response = llm.invoke(scan_result.synthetic or user_input)

# 3. Optionally reverse-map synthetic values in response
# (This happens in YOUR code, on YOUR infrastructure)
if needs_real_values:
    final_response = reverse_map(response, scan_result.mapping)
else:
    final_response = response  # Keep synthetic values

The LLM provider never sees the real PII. The reverse mapping is an application-level decision, not a security-layer decision.
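The reverse_map call above is application code, not part of the security layer. A minimal sketch, assuming the mapping is a plain dict of original to synthetic values (the actual structure of scan_result.mapping may differ):

```python
def reverse_map(text: str, mapping: dict[str, str]) -> str:
    # mapping is assumed to be {original: synthetic}; replace longer
    # synthetic values first so overlapping substitutions can't collide
    for original, synthetic in sorted(
        mapping.items(), key=lambda kv: len(kv[1]), reverse=True
    ):
        text = text.replace(synthetic, original)
    return text

mapping = {"john@example.com": "user_847@placeholder.com"}
print(reverse_map("Report sent to user_847@placeholder.com.", mapping))
# → Report sent to john@example.com.
```

Since this runs after the LLM response returns, the real values never leave your backend even when reverse-mapping is enabled.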

Synthetic Data for RAG Applications

Synthetic replacement is particularly powerful for RAG applications where documents contain PII that needs to be processed but not exposed to the LLM.

Before synthetic replacement:

Retrieved document: "Patient John Smith (DOB: 03/15/1985, MRN: 12345)
was diagnosed with Type 2 diabetes on 04/20/2023."

LLM receives: Full document with real PII → compliance violation

With synthetic replacement:

Retrieved document: "Patient John Smith (DOB: 03/15/1985, MRN: 12345)
was diagnosed with Type 2 diabetes on 04/20/2023."

Synthetic version: "Patient Alex Johnson (DOB: 01/01/2000, MRN: 00001)
was diagnosed with Type 2 diabetes on 04/20/2023."

LLM receives: Synthetic version → clinically useful, privacy-safe

The LLM can reason about the clinical content ("Type 2 diabetes, diagnosed in 2023") with properly formatted patient references. The real identity is never exposed.

Combining Synthetic Data With Zero Retention

For maximum privacy:

  1. PII Detection: Catch all 39+ PII entity types in the input
  2. Synthetic Replacement: Replace with format-valid fake data
  3. LLM Processing: Send synthetic version to the LLM provider
  4. Output Scanning: Scan the response for any leaked PII
  5. Zero Retention: Don't store the prompt content in security event logs

With this configuration, PII is never sent to the LLM provider, never stored in PromptGuard's logs, and never persisted in any form outside the application's immediate request-response cycle.
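The five steps above can be sketched as a single request handler. All names here (detect_pii, make_synthetic, call_llm) are hypothetical stand-ins, and the LLM call is stubbed so the sketch runs on its own:

```python
import re

def detect_pii(text: str) -> list[tuple[str, str]]:
    # Toy detector: emails only; a real detector covers 39+ entity types
    return [(m, "email") for m in re.findall(r"[\w.+-]+@[\w-]+\.\w+", text)]

def make_synthetic(pii_type: str, n: int) -> str:
    return f"user_{n}@placeholder.com" if pii_type == "email" else f"SYN_{n}"

def call_llm(prompt: str) -> str:
    return f"Echo: {prompt}"            # stand-in for the real LLM client

def protect_and_invoke(user_input: str) -> str:
    # 1. PII detection
    entities = detect_pii(user_input)
    # 2. Synthetic replacement; the mapping lives only in this request
    mapping = {v: make_synthetic(t, i + 1) for i, (v, t) in enumerate(entities)}
    sanitized = user_input
    for original, synthetic in mapping.items():
        sanitized = sanitized.replace(original, synthetic)
    # 3. Only the synthetic version leaves your infrastructure
    response = call_llm(sanitized)
    # 4. An output scan for leaked PII would run here (omitted in this sketch)
    # 5. Zero retention: the mapping is discarded with this stack frame
    return response

print(protect_and_invoke("My email is john@example.com."))
# → Echo: My email is user_1@placeholder.com.
```

Nothing in this handler writes the mapping anywhere: once the function returns, the link between real and synthetic values is gone.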

Conclusion

Redaction is a blunt instrument. It works for most cases, but it breaks applications that need to reason about data format, track references across turns, or process structured outputs.

Synthetic data replacement is the scalpel. It removes the privacy risk while preserving the information structure that the LLM needs to function correctly.

The key insight: privacy protection and LLM functionality are not in tension. You don't have to choose between "expose real PII" and "break the user experience." Synthetic data gives you both—privacy protection that's invisible to the user and the model.

Real data for real people. Fake data for robots. That's the principle.