Inspiration
One of our teammates, Chris, started an LLM-based mental health startup last year that sold to HR leaders across the globe. Leaders at everything from Fortune 500 companies to startups really liked the solution, and even the Navy sent its HR team to the startup's doorstep. However, despite signed contracts to trial the product in isolation at all these companies, the startup couldn't reach enterprise-wide deployment. The blocker was the larger (and evolving) worry about data privacy: every single company feared that its data, or its clients' data, could be leaked, used, or abused by the LLMs the startup was built on. And as we found out, they weren't alone...
Anonymizing data WHILE preserving enough context for it to stay useful to LLMs turned out to be a huge problem (and opportunity!) in itself, and that's where this project came from.
What It Does
PII Data Masking is a privacy-preserving solution that uses self-hosted language models (on a local server or in a private cloud) to identify and replace sensitive personal information within datasets. By assessing the context and importance of the redacted data, the system generates non-identifiable alternatives that maintain the semantic integrity and analytical value of the original text. This approach protects data privacy while keeping the dataset useful for analysis and insights.
How We Built It
The main goal of our project is to anonymize sensitive data in such a way that the redacted information still retains its usefulness for analysis, without compromising individual privacy.
1. Locally Hosted Language Models
- We use language models hosted on a local server to scan and process datasets. This improves security because the data never needs to be sent to external servers.
- The initial task of these models is to identify PII accurately within the dataset.
2. Context Preservation
- After identifying the PII, the system assesses the importance of the redacted information within its context. The language model determines why specific information is crucial and how it contributes to the overall meaning of the text.
- This step is critical to ensure that replacing the PII does not alter the fundamental insights or value derived from the data.
3. Synonym and Filler Data Generation
- A second language model, such as Gemini or another model capable of generating synonymous expressions, creates non-identifiable alternatives that maintain the semantic integrity of the original text.
4. Data Integration
- The newly generated, anonymized terms or phrases replace the original PII in the dataset.
- This step requires careful integration to maintain the logical flow and readability of the data, ensuring that the dataset remains useful for analysis.
5. Analysis of Anonymized Data
- The final dataset, now devoid of PII but still contextually intact, can be safely analyzed using various tools without risking privacy breaches.
- This allows researchers and analysts to work with the data more freely, without concerns over privacy violations.
In-Depth Contextualization Explanation
Using Minstrel AI:
Contextual Analysis
- Initially, unredacted data is processed by Minstrel AI. This model's primary role is to analyze the text deeply and identify which parts of the data contain PII that also carries significant contextual or cultural importance. This includes recognizing cases where the specific type of PII (like a name or place) is crucial to the narrative or to the integrity of the information.
Identification of Contextual Significance
- Minstrel AI determines how certain pieces of information, such as names or addresses, contribute to the overall context. For instance, if a name like "Jamal" is identified, Minstrel AI recognizes and flags the name not just as PII but as a key identifier potentially associated with African-American identity, which might be relevant in the data's cultural or social context.
Feedback for Replacement
- Minstrel AI then outputs recommendations or metadata about these elements, suggesting that while the name is PII and should be masked, any replacement should consider maintaining its cultural or contextual significance.
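To make the metadata idea above concrete, here is a minimal sketch of what such a record could look like. The `PiiSpan` class, its field names, and the `to_replacement_hint` helper are all hypothetical and are not taken from the actual DataMask code; the sketch only assumes that the local model reports character offsets, a PII type, and free-form context tags.

```python
from dataclasses import dataclass, field
import json


@dataclass
class PiiSpan:
    """Hypothetical record for one piece of PII found by the local model."""
    start: int          # character offset where the PII begins
    end: int            # character offset just past the PII
    text: str           # the raw PII value (must stay on the local machine)
    pii_type: str       # e.g. "first_name", "address"
    context_tags: list = field(default_factory=list)  # what a replacement should preserve


def to_replacement_hint(span):
    """Keep only the non-identifying fields; raw text and offsets are dropped."""
    return {"pii_type": span.pii_type, "context_tags": list(span.context_tags)}


record = "Jamal called the clinic on Monday."
span = PiiSpan(
    start=0,
    end=5,
    text="Jamal",
    pii_type="first_name",
    context_tags=["African-American given name"],
)
hint = to_replacement_hint(span)
print(json.dumps(hint))
```

Separating the full span from the hint means only the hint would ever need to leave the local environment.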
Using Gemini:
Synonym Generation
- Based on the contextual insights provided by Minstrel AI, Gemini then steps in to handle the actual synonym generation. This model is tasked with creating alternatives for the PII that preserve the identified cultural or contextual significance.
Contextually Appropriate Replacements
- Continuing with the example of the name "Jamal," Gemini would generate alternatives that also reflect African-American identity, such as "Tyrone," instead of choosing a culturally neutral or mismatched name like "Nathan." This ensures that the anonymized data remains culturally and contextually relevant and accurate.
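One way the replacement request might be assembled is sketched below, under the assumption that Gemini only ever receives the PII type and context tags rather than the original value. The function name `build_replacement_prompt`, the prompt wording, and the `count` parameter are illustrative choices, not the project's actual prompt, and the sketch deliberately stops short of calling any model API.

```python
def build_replacement_prompt(pii_type, context_tags, count=3):
    """Build a prompt describing the redacted value without ever including it."""
    tags = ", ".join(context_tags) if context_tags else "none"
    return (
        f"Suggest {count} fictional replacement values for a redacted {pii_type}.\n"
        f"Preserve these contextual attributes: {tags}.\n"
        "Return a JSON list of strings and nothing else."
    )


prompt = build_replacement_prompt("first_name", ["African-American given name"])
print(prompt)
```

Because the function takes no argument for the original value, the name "Jamal" cannot leak into the prompt by accident.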
Integration Workflow:
Data Input and Processing
- Data is input into Minstrel AI, which identifies and evaluates the contextual importance of PII within the dataset.
Contextual Insight Transfer
- Information about what needs to be preserved in the context during the anonymization process is transferred from Minstrel AI to Gemini.
Anonymization with Context Preservation
- Gemini uses this information to generate and substitute synonyms that maintain the original data's contextual integrity.
Reintegration and Review
- The new, anonymized data is reintegrated into the dataset, ensuring that it still carries the same narrative value and can be used meaningfully in analysis without risking privacy.
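The workflow above can be sketched end to end as follows. Both model calls are replaced with hard-coded stand-ins (`detect_pii` and `generate_replacement` are hypothetical names), so this shows only the data flow and the substitution step, not real detection quality. Spans are replaced from right to left so that earlier character offsets stay valid as the text changes length.

```python
import re


def detect_pii(text):
    """Stand-in for the local model: returns (start, end, pii_type, tags) tuples."""
    known = {"Jamal": ("first_name", ["African-American given name"])}
    spans = []
    for name, (pii_type, tags) in known.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.append((match.start(), match.end(), pii_type, tags))
    return spans


def generate_replacement(pii_type, tags):
    """Stand-in for the generation model; it sees only the type and tags."""
    table = {("first_name", "African-American given name"): "Tyrone"}
    return table.get((pii_type, tags[0] if tags else ""), "[REDACTED]")


def anonymize(text):
    """Detect PII, then substitute replacements from right to left."""
    spans = sorted(detect_pii(text), key=lambda s: s[0], reverse=True)
    for start, end, pii_type, tags in spans:
        text = text[:start] + generate_replacement(pii_type, tags) + text[end:]
    return text


print(anonymize("Jamal said Jamal's manager ignored him."))
# prints: Tyrone said Tyrone's manager ignored him.
```

Falling back to a fixed `[REDACTED]` token when no context-aware replacement is available is one conservative design choice: it loses context but never leaks the original value.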
Challenges We Ran Into
- Determining who our target customer would be (B2B or B2C)
- Determining how to identify important contextual elements without having the private data touch Gemini
- Getting the LLM to return structured data in response to a natural-language prompt
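For the structured-output challenge, one common workaround is to ask the model for JSON and then defensively extract the first valid JSON value from whatever text comes back. The sketch below uses only Python's standard `json` module; `parse_json_reply` is a hypothetical helper, and this is not necessarily how we solved it in the project.

```python
import json


def parse_json_reply(reply):
    """Return the first JSON array or object embedded in a model reply."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(reply):
        if char in "[{":
            try:
                # raw_decode parses one JSON value starting at index and
                # ignores any trailing text after it.
                value, _ = decoder.raw_decode(reply, index)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON here; try the next bracket
    raise ValueError("no JSON value found in model reply")


reply = 'Sure! Here are some options:\n["Tyrone", "Darnell"]\nHope that helps.'
print(parse_json_reply(reply))
# prints: ['Tyrone', 'Darnell']
```

Raising an error when nothing parses lets the caller retry the prompt or fall back to a generic placeholder instead of silently inserting malformed text.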
Accomplishments That We're Proud Of
We integrated a lot of new tools with each other in order to implement a rather complex model framework to solve a major issue that affects just about any company building on top of LLMs. We built a product that has real market value and provides a full application solution to a growing but unmet need.
What We Learned
We learned how to integrate many of the new and developing AI tools together to make a production application.
What's Next for DataMask
- Add customization features designed for different niches beyond therapy, such as law, medicine, and research.
- Deploy everything to the cloud so that we can bring it into production.
- Fully transition from B2C to B2B.