<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Vault on ICE-ICE-BEAR-BLOG</title><link>https://ice-ice-bear.github.io/tags/vault/</link><description>Recent content in Vault on ICE-ICE-BEAR-BLOG</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 07 May 2026 00:00:00 +0900</lastBuildDate><atom:link href="https://ice-ice-bear.github.io/tags/vault/index.xml" rel="self" type="application/rss+xml"/><item><title>OPENAI Privacy Filter Reversible — Not Anonymization, Recoverable Pseudonymization</title><link>https://ice-ice-bear.github.io/posts/2026-05-07-openai-privacy-filter-reversible-tokenization/</link><pubDate>Thu, 07 May 2026 00:00:00 +0900</pubDate><guid>https://ice-ice-bear.github.io/posts/2026-05-07-openai-privacy-filter-reversible-tokenization/</guid><description>&lt;img src="https://ice-ice-bear.github.io/" alt="Featured image of post OPENAI Privacy Filter Reversible — Not Anonymization, Recoverable Pseudonymization" /&gt;&lt;h2 id="overview"&gt;Overview
&lt;/h2&gt;&lt;p&gt;OpenAI Privacy Filter (OPF) detects PII spans in text and replaces them with typed placeholders like &lt;code&gt;&amp;lt;PRIVATE_PERSON&amp;gt;&lt;/code&gt;. The default behavior is &lt;strong&gt;irreversible redaction&lt;/strong&gt; — if the same person appears five times, all five get collapsed into the same generic placeholder, and every relationship between mentions is destroyed.&lt;/p&gt;
&lt;p&gt;&lt;a class="link" href="https://github.com/deformatic/OPENAI-Privacy-Filter-Reversible-Tokenization" target="_blank" rel="noopener"
 &gt;deformatic/OPENAI-Privacy-Filter-Reversible-Tokenization&lt;/a&gt; adds an opt-in &lt;strong&gt;reversible tokenization vault&lt;/strong&gt; layer on top. Masking is preserved, but every mention of the same entity gets the same indexed token (for example &lt;code&gt;&amp;lt;PRIVATE_PERSON_1&amp;gt;&lt;/code&gt;), and the original values are stored in a separate vault that only authorized callers can read. &lt;strong&gt;The repository was one day old at the time of sharing.&lt;/strong&gt; It is licensed under Apache 2.0, written in Python, and had 20 stars.&lt;/p&gt;
&lt;pre class="mermaid" style="visibility:hidden"&gt;flowchart TD
 Client["Client"] --&gt; API["PII Tokenization API"]
 API --&gt; Detector["OPF Detector &amp;lt;br/&amp;gt; (span + label + offset)"]
 Detector --&gt; Resolver["Token Resolver &amp;lt;br/&amp;gt; label + canonical_text"]
 Resolver --&gt; Writer["Vault Writer &amp;lt;br/&amp;gt; token to original (encrypted at rest)"]
 Writer --&gt; Token["tokenized_text"]
 Token --&gt; Down["downstream LLM / pipeline"]

 Auth["Authorized restore request"] --&gt; Restore["Restore API"]
 Restore --&gt; Reader["Vault Reader"]
 Reader --&gt; Out["restored_text"]&lt;/pre&gt;&lt;h2 id="default-opf-vs-reversible-layer"&gt;Default OPF vs Reversible Layer
&lt;/h2&gt;&lt;p&gt;Default OPF:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Alice emailed Bob.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt;PRIVATE_PERSON&amp;gt; emailed &amp;lt;PRIVATE_PERSON&amp;gt;.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Whether the two placeholders refer to the same person or to two different people can no longer be recovered.&lt;/p&gt;
&lt;p&gt;Reversible layer:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Alice emailed Bob. Alice&amp;#39;s phone is 555-1111.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&amp;lt;PRIVATE_PERSON_1&amp;gt; emailed &amp;lt;PRIVATE_PERSON_2&amp;gt;. &amp;lt;PRIVATE_PERSON_1&amp;gt;&amp;#39;s phone is &amp;lt;PRIVATE_PHONE_1&amp;gt;.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Plus a separate vault:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;schema_version&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;opf.reversible.v1&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;vault_id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;7c1d...&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;entries&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;&amp;lt;PRIVATE_PERSON_1&amp;gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;label&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;private_person&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Alice&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;canonical_text&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;Alice&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;index&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id="the-key-distinction"&gt;The Key Distinction
&lt;/h2&gt;
 &lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;&amp;ldquo;This is &lt;strong&gt;not anonymization&lt;/strong&gt;. It is &lt;strong&gt;recoverable pseudonymization&lt;/strong&gt;. The tokenized text is useful only if the vault is protected like source PII.&amp;rdquo;&lt;/em&gt; — README&lt;/p&gt;

 &lt;/blockquote&gt;
&lt;p&gt;&lt;a class="link" href="https://gdpr-info.eu/art-4-gdpr/" target="_blank" rel="noopener"
 &gt;Pseudonymization and anonymization are explicitly distinguished in the GDPR&lt;/a&gt;. Anonymized data is no longer personal data and falls outside GDPR; &lt;strong&gt;pseudonymized data is still personal data&lt;/strong&gt; (&lt;a class="link" href="https://gdpr-info.eu/recitals/no-26/" target="_blank" rel="noopener"
 &gt;GDPR Recital 26&lt;/a&gt;). So while keeping the vault separate gives compliance leverage at service boundaries, the vault itself must be protected at the same security tier as the source PII.&lt;/p&gt;
&lt;h2 id="the-problem-it-solves"&gt;The Problem It Solves
&lt;/h2&gt;&lt;p&gt;Plain redaction strips sensitive values but &lt;strong&gt;also destroys the relationships that downstream consumers still need&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A reviewer needs to see that the same person appears multiple times.&lt;/li&gt;
&lt;li&gt;A downstream LLM task needs consistent placeholders for names, emails, phone numbers, account numbers, and secrets.&lt;/li&gt;
&lt;li&gt;A data pipeline needs to restore originals after enrichment, approval, or internal processing.&lt;/li&gt;
&lt;li&gt;A service boundary may allow tokenized text out of an enclave while requiring the vault to stay inside.&lt;/li&gt;
&lt;/ol&gt;
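&lt;p&gt;As a rough illustration of item 3, restoring originals can be sketched as a lookup against a vault shaped like the &lt;code&gt;opf.reversible.v1&lt;/code&gt; example shown earlier. The &lt;code&gt;restore&lt;/code&gt; and &lt;code&gt;make_token&lt;/code&gt; functions, the token regex, and the sample vault contents are assumptions made for this sketch, not the repository&amp;rsquo;s actual API.&lt;/p&gt;

```python
import re

# Angle brackets are built with chr() so this snippet contains no raw markup
# characters and can be embedded in HTML or XML without escaping.
LT, GT = chr(60), chr(62)

# Assumed token shape: an upper-case label, an underscore, and a numeric index.
TOKEN_RE = re.compile(LT + r"[A-Z][A-Z_]*_\d+" + GT)


def make_token(label, index):
    """Build a token such as PRIVATE_PERSON_1 wrapped in angle brackets."""
    return f"{LT}{label.upper()}_{index}{GT}"


def restore(tokenized_text, vault):
    """Replace every token that has a vault entry with its original text.

    Tokens without an entry are left untouched (a choice made for this sketch).
    """
    entries = vault["entries"]

    def lookup(match):
        entry = entries.get(match.group(0))
        return entry["text"] if entry else match.group(0)

    return TOKEN_RE.sub(lookup, tokenized_text)


# Illustrative vault that follows the field names of the schema example.
vault = {
    "schema_version": "opf.reversible.v1",
    "entries": {
        make_token("private_person", 1): {
            "label": "private_person", "text": "Alice",
            "canonical_text": "Alice", "index": 1,
        },
        make_token("private_person", 2): {
            "label": "private_person", "text": "Bob",
            "canonical_text": "Bob", "index": 2,
        },
    },
}

tokenized = f"{make_token('private_person', 1)} emailed {make_token('private_person', 2)}."
print(restore(tokenized, vault))
```

&lt;p&gt;In a real deployment this lookup would sit behind the authorization check that guards the vault, so only the side holding the vault can run it.&lt;/p&gt;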
&lt;h2 id="design-principles"&gt;Design Principles
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Backward compatible&lt;/strong&gt; — existing &lt;code&gt;redact()&lt;/code&gt; behavior unchanged&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explicit opt-in&lt;/strong&gt; — reversible only via &lt;code&gt;OPF.tokenize()&lt;/code&gt; or &lt;code&gt;opf --recoverable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Model agnostic&lt;/strong&gt; — no changes to checkpoint, decoder, Viterbi, training, or eval paths&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stable per value&lt;/strong&gt; — same &lt;code&gt;label + canonical_text&lt;/code&gt; → same token within a vault&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Batch friendly&lt;/strong&gt; — one vault can be reused across many inputs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Auditable&lt;/strong&gt; — token mappings serialized in a clear schema (&lt;code&gt;opf.reversible.v1&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Security aware&lt;/strong&gt; — README and schema both state plaintext vaults are development-grade only&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="token-assignment-rules"&gt;Token Assignment Rules
&lt;/h2&gt;&lt;p&gt;Within a single vault:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Same label + same canonical text → same token&lt;/li&gt;
&lt;li&gt;Same label + different canonical text → next index&lt;/li&gt;
&lt;li&gt;Different label + same text → different token family&lt;/li&gt;
&lt;li&gt;Source text collision → skip to next available index&lt;/li&gt;
&lt;li&gt;Overlapping spans → raise &lt;code&gt;ValueError&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
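&lt;p&gt;The rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository&amp;rsquo;s code: the &lt;code&gt;TokenVault&lt;/code&gt; class, the &lt;code&gt;tokenize&lt;/code&gt; helper, the whitespace-collapsing canonicalization, the &lt;code&gt;private_phone&lt;/code&gt; label, and the regex stand-in for the detector are all hypothetical. &amp;ldquo;Source text collision&amp;rdquo; is read here as &amp;ldquo;the candidate token already appears literally in the input&amp;rdquo;.&lt;/p&gt;

```python
import re

# Angle brackets are built with chr() so this snippet contains no raw markup
# characters and can be embedded in HTML or XML without escaping.
LT, GT = chr(60), chr(62)


class TokenVault:
    """Hypothetical in-memory vault used only for this sketch."""

    def __init__(self):
        self.entries = {}    # token -> stored fields
        self._by_value = {}  # (label, canonical_text) -> token
        self._next = {}      # label -> next index to try

    def token_for(self, label, text, source_text=""):
        canonical = " ".join(text.split())  # assumed canonicalization
        key = (label, canonical)
        if key in self._by_value:           # same label + same text: reuse
            return self._by_value[key]
        index = self._next.get(label, 1)    # each label has its own counter
        token = f"{LT}{label.upper()}_{index}{GT}"
        while token in source_text:         # collision with the input: skip
            index += 1
            token = f"{LT}{label.upper()}_{index}{GT}"
        self._next[label] = index + 1
        self._by_value[key] = token
        self.entries[token] = {"label": label, "text": text,
                               "canonical_text": canonical, "index": index}
        return token


def tokenize(text, spans, vault):
    """Replace (start, end, label) spans with tokens drawn from the vault."""
    ordered = sorted(spans)
    for (_, prev_end, _), (start, _, _) in zip(ordered, ordered[1:]):
        if prev_end > start:
            raise ValueError("overlapping spans")
    parts, cursor = [], 0
    for start, end, label in ordered:
        parts.append(text[cursor:start])
        parts.append(vault.token_for(label, text[start:end], source_text=text))
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Alice emailed Bob. Alice's phone is 555-1111."
# Regexes stand in for the detector here; OPF itself is a model, not a regex.
spans = [(m.start(), m.end(), "private_person")
         for m in re.finditer(r"Alice|Bob", text)]
spans += [(m.start(), m.end(), "private_phone")
          for m in re.finditer(r"\d{3}-\d{4}", text)]

vault = TokenVault()
print(tokenize(text, spans, vault))
```

&lt;p&gt;Under these assumptions the printed text is &lt;code&gt;&amp;lt;PRIVATE_PERSON_1&amp;gt; emailed &amp;lt;PRIVATE_PERSON_2&amp;gt;. &amp;lt;PRIVATE_PERSON_1&amp;gt;&amp;#39;s phone is &amp;lt;PRIVATE_PHONE_1&amp;gt;.&lt;/code&gt;, matching the reversible-layer example. Passing the same vault object to later calls keeps tokens stable across inputs.&lt;/p&gt;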
&lt;h2 id="why-it-matters"&gt;Why It Matters
&lt;/h2&gt;&lt;ul&gt;
&lt;li&gt;Among masking, anonymization, and pseudonymization, pseudonymization is often the most practical choice for pipelines that must later restore values, yet open-source reference implementations have been scarce. This is one answer.&lt;/li&gt;
&lt;li&gt;A separated vault supports a &lt;strong&gt;compliance argument&lt;/strong&gt;: &amp;ldquo;tokenized text sent to an LLM provider is not a PII transmission&amp;rdquo;, provided the vault is protected and stays out of the provider&amp;rsquo;s reach. For whoever holds the vault, the tokenized text remains pseudonymized personal data.&lt;/li&gt;
&lt;li&gt;It maps cleanly onto the requirements LLM pipelines increasingly face as they move into enterprise environments.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="references"&gt;References
&lt;/h2&gt;&lt;h3 id="repo"&gt;Repo
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://github.com/deformatic/OPENAI-Privacy-Filter-Reversible-Tokenization" target="_blank" rel="noopener"
 &gt;deformatic/OPENAI-Privacy-Filter-Reversible-Tokenization&lt;/a&gt; — Apache 2.0, Python, 20 stars at time of writing&lt;/li&gt;
&lt;li&gt;Upstream &lt;a class="link" href="https://github.com/openai/openai-privacy-filter" target="_blank" rel="noopener"
 &gt;OpenAI Privacy Filter (OPF)&lt;/a&gt; — span detection + masking&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="privacy-concepts"&gt;Privacy concepts
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://gdpr-info.eu/art-4-gdpr/" target="_blank" rel="noopener"
 &gt;GDPR Article 4 — Definitions&lt;/a&gt; (pseudonymization / anonymization)&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://gdpr-info.eu/recitals/no-26/" target="_blank" rel="noopener"
 &gt;GDPR Recital 26 — Not applicable to anonymous data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://www.apache.org/licenses/LICENSE-2.0" target="_blank" rel="noopener"
 &gt;Apache License 2.0&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="related-infra"&gt;Related infra
&lt;/h3&gt;&lt;ul&gt;
&lt;li&gt;&lt;a class="link" href="https://platform.openai.com/docs/guides/your-data" target="_blank" rel="noopener"
 &gt;OpenAI Platform — Privacy &amp;amp; Data Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a class="link" href="https://openai.github.io/openai-agents-js/guides/guardrails/" target="_blank" rel="noopener"
 &gt;OpenAI Agents SDK — Guardrails&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="insights"&gt;Insights
&lt;/h2&gt;&lt;p&gt;Pseudonymization is where many LLM pipelines stall once they enter compliance territory. Full redaction hurts downstream quality, while sending raw text crosses the boundary. This layer targets that gap: tokenized text can leave the enclave while the vault stays inside. The design is small and clean: the model path is untouched, the feature is opt-in only, and the existing &lt;code&gt;redact()&lt;/code&gt; behavior is preserved. But as the README stresses, &lt;strong&gt;the vault must be protected at the same security tier as the source PII&lt;/strong&gt;, and a plaintext vault is development-grade only, never production. Twenty stars in a day may indicate that this pattern was already running as in-house tooling at several teams and that what was missing was a public reference implementation. Because the upstream OPF model path is left untouched, this reads as a clean, PR-able extension rather than a fork. There&amp;rsquo;s a real chance of an upstream merge, which would be a fitting outcome for a feature that arguably belongs inside OPF itself.&lt;/p&gt;</description></item></channel></rss>