Why Piixie uses LLMs for PII detection

PII is more than obvious strings

Traditional PII detection works well when the sensitive value has a predictable shape. Email addresses contain an at sign. Phone numbers often follow a country-specific pattern. Credit card numbers can be checked with formatting and checksum rules. Those detectors are useful, but they only cover the easy cases.
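
To make the "predictable shape" point concrete, here is a minimal sketch of a pattern-based detector using only the Python standard library. It is an illustration, not Piixie's implementation: the names EMAIL_RE, CARD_RE, luhn_valid, and find_pattern_pii are hypothetical, and the email pattern is deliberately loose rather than standards-compliant.

```python
import re

# Deliberately loose email pattern for illustration; not RFC 5322 compliant.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pattern_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, value) pairs for emails and checksum-valid card numbers."""
    hits = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    hits += [
        ("CARD", m.group())
        for m in CARD_RE.finditer(text)
        if luhn_valid(m.group())
    ]
    return hits
```

A detector like this finds values whose shape gives them away. It has nothing to say about a first name in a "spouse" column or an address split across table cells.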

Real documents contain private information in less mechanical forms: a patient's relationship to a doctor, a delivery address split across table columns, a dependent mentioned only by first name, a claim number described in a paragraph, or a signature block whose meaning depends on the surrounding agreement. The sensitive fact may not be a single token. It may be the relationship between several ordinary-looking tokens.

The hard cases are relational

Many leaks happen because the same entity appears in different ways across a document. A resume may mention "Alex" in a cover note, a full name in the header, a personal website in the footer, and a university project in a bullet. A rules engine sees strings. An LLM can reason that those strings are connected to one person and should be anonymized consistently.

This matters even more in tables and forms. A column named "spouse" or "emergency contact" may contain values that look like ordinary names. A row may include a date, a location, and an internal code that only become identifying when read together. A note saying "same address as previous tenant" can turn an otherwise harmless address reference into PII. These relationships live in the structure of the data rather than in any single value. Pattern-based rules handle them poorly, and an LLM is far better placed to infer them across a whole document.

Rules miss language, layout, and intent

Regex and dictionaries do not understand why a value is present. They cannot easily distinguish a public company name from a private employer mentioned in a medical intake form, or a product code from a customer account number, without the surrounding context. They also struggle with partial mentions, aliases, titles, nicknames, pronouns, captions, footnotes, handwritten-style form labels, and values that wrap across lines.

An LLM can take into account both the text immediately around a value and the broader purpose of the document. It can decide that "Dr. Ruiz" is a healthcare provider in one document, that "Ruiz family trust" is a private entity in another, and that "Ruiz Street" is an address component rather than a person. That contextual distinction is what makes PII detection safer.

Vision support removes the OCR dependency

OCR is not required as a separate preprocessing step in Piixie. The local model supports vision, so Piixie can analyze rendered document pages, scans, image-backed PDFs, screenshots, and visual regions directly when text extraction alone is not enough.

This avoids a common failure mode in document anonymization: OCR may misread characters, lose table structure, flatten multi-column layouts, or detach a value from its label. A vision-capable model can inspect the page as a document, not just as a stream of extracted words. That helps it keep labels, captions, proximity, and layout relationships in view while deciding what needs to be anonymized.

LLMs improve replacement quality too

Detection and replacement are connected. If Piixie knows that three mentions refer to the same person, it can use one replacement identity. If it knows that an email belongs to that person, synthetic mode can generate a coherent email local part. If it knows that multiple address lines are one address, it can keep the fake address internally consistent.
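
The idea of one replacement identity per person can be sketched as follows. This is a minimal illustration under stated assumptions, not Piixie's actual data model: the Mention type, its entity_id field, and build_replacements are hypothetical, and a fixed list of names stands in for whatever synthetic generator is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    entity_id: str  # hypothetical: equal ids mean the same real-world person
    kind: str       # "NAME" or "EMAIL"
    text: str       # the original string found in the document


# Stand-in for a synthetic data generator; each entry is "First Last".
FAKE_NAMES = ["Jordan Avery", "Casey Lin", "Morgan Patel"]


def build_replacements(mentions: list[Mention]) -> dict[str, str]:
    """Map each original string to a replacement that is consistent per entity."""
    identities: dict[str, str] = {}
    replacements: dict[str, str] = {}
    for m in mentions:
        # Allocate one fake identity the first time an entity is seen.
        if m.entity_id not in identities:
            identities[m.entity_id] = FAKE_NAMES[len(identities) % len(FAKE_NAMES)]
        full_name = identities[m.entity_id]
        first, last = full_name.split()
        if m.kind == "EMAIL":
            # Derive the local part from the same identity so the email stays coherent.
            replacements[m.text] = f"{first}.{last}@example.com".lower()
        elif " " in m.text:
            replacements[m.text] = full_name  # full-name mention
        else:
            replacements[m.text] = first      # first-name-only mention
    return replacements
```

With mentions "Alex", "Alex Romero", and "alex.romero@mail.test" all sharing one entity_id, the sketch maps them to "Jordan", "Jordan Avery", and "jordan.avery@example.com". The grouping itself is the part that requires contextual detection. The mapping step is simple once the grouping exists.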

That is why Piixie combines LLM-based detection with constrained outputs. The model finds and classifies the sensitive entities, but Piixie controls the final operation: redaction, replacement tokens, or local Faker-backed synthetic data. The result is context-aware detection without handing raw documents to a cloud anonymization service.
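
One way to picture this division of labor: the model proposes entities, and ordinary code decides what is accepted and how the text is rewritten. The sketch below assumes a hypothetical JSON output format (a list of objects with label, start, end, and text) and a hypothetical label set. Piixie's real schema and operations may differ.

```python
import json

ALLOWED_LABELS = {"PERSON", "EMAIL", "ADDRESS"}  # hypothetical label set


def parse_entities(raw_json: str, text: str) -> list[tuple[int, int, str]]:
    """Keep only proposals with an allowed label and a span that matches the text."""
    entities = []
    for item in json.loads(raw_json):
        if not isinstance(item, dict):
            continue
        label, start, end = item.get("label"), item.get("start"), item.get("end")
        if label not in ALLOWED_LABELS:
            continue
        if not (isinstance(start, int) and isinstance(end, int)):
            continue
        if not 0 <= start < end <= len(text):
            continue
        # Reject spans whose quoted text does not match the document.
        if text[start:end] != item.get("text"):
            continue
        entities.append((start, end, label))
    return sorted(entities)


def apply_operation(text: str, entities: list[tuple[int, int, str]], mode: str = "token") -> str:
    """Rewrite the text with replacement tokens ("token") or masking ("redact")."""
    out = []
    cursor = 0
    for start, end, label in entities:
        if start < cursor:
            continue  # skip spans that overlap one already handled
        out.append(text[cursor:start])
        out.append("X" * (end - start) if mode == "redact" else f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)
```

In this sketch the model's output never edits the document directly. A proposal with an unknown label or a span that does not match the source text is dropped before any operation runs.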

Local-first is the practical tradeoff