
Best AI for Business: A Guide to Ensure Data Privacy

Updated: Jan 22, 2024

Best AI for Business. Consult with Core Maitri for your AI integration needs.
AI tools can cause major data leaks if proper care is not taken.

The Need:

To use AI to analyze your sensitive business data


The Concern:

Making sure your data is not leaked to the public!


What is the risk?

Large Language Models (LLMs) are trained on enormous amounts of data. That data is then stored in the model as parameter weights, which are basically numbers. The GPT-3 model, for example, is essentially a giant math equation with about 175 billion numbers (parameters).

Once the model has been trained and its "numbers" updated, there is no practical way to "remove" that data from the model.

For example, below is an exhibit from The New York Times' legal battle against OpenAI.

An exhibit from the Times complaint showing plagiarism from an OpenAI product.

The point to note is that this text was not "stored" anywhere in OpenAI's GPT model as-is; it was encoded in the model's numeric parameter values. The text above was "generated" by the LLM when prompted in a certain way.


Removing this data from the model means retraining the whole model again, without the questionable data.


What's the Risk for you and your business?

Scenario 1

Well, imagine your developer copy-pastes proprietary code from your software into ChatGPT and asks for programming help. That code can now become part of the training data that OpenAI periodically trains on.

Once the model has trained on your data, the data is encoded and stored as numbers. When prompted in a certain way, the model "could" regurgitate the entire source code verbatim, exposing your proprietary IP.


Scenario 2

A manager on your team is working on a sales quotation and uploads a document containing your bidding prices for a valuable contract to ChatGPT for summarization or to generate some document content.


Once the model is trained on that content, it may become available to unwitting or malicious users through a prompt like "Generate a quotation for ABC Co's proposal for the XYZ Project bid", where ABC Co is your company.

Mitigation Steps:

Do NOT use the free version of ChatGPT; instead, use the paid Team or Enterprise version.

Remember, the free version is only free because OpenAI gets user data to train on.


Use the OpenAI API within your ERP or CRM, or create a custom app with AI integration

OpenAI also offers an API. That means you don't need to use ChatGPT, which is just a chat-bot interface; instead, you can build your own copilot inside your ERP or CRM applications for internal use.
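
For illustration, here is a minimal sketch of such an API call using OpenAI's official Python library (openai v1.x). The model name, system prompt, and CRM note are placeholders only; a real ERP/CRM copilot would build the prompt from your internal records and enforce its own access controls.

```python
# Minimal sketch of an internal "copilot" call via the OpenAI API.
# Assumptions: openai >= 1.0 is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4",  # example model name; pick the model your plan provides
    messages=[
        {"role": "system", "content": "You are an internal assistant for our sales team."},
        {"role": "user", "content": "Summarize the following CRM note: ..."},
    ],
)

print(response.choices[0].message.content)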


See OpenAI's enterprise privacy policy below:

https://openai.com/enterprise-privacy

Talk to your dev team or software vendors to explore how to integrate the OpenAI API into your applications for a more seamless and secure experience.






Anonymize your data

Before using the data, make sure to anonymize it. That includes redacting personally identifying information such as real names, emails, and SSNs, as well as sensitive business data.


When constructing prompts for the API or ChatGPT, consider removing sensitive information, manually or automatically (using software tools), in a way that does not materially impact the result.


Here are a few techniques (a short code sketch combining some of them follows the list):

  1. Data Masking:

    1. Example: Replace actual names in customer service transcripts with generic placeholders like "Customer1," "Customer2," etc.

    2. Application: Useful in scenarios where the identity of the individuals is not relevant to the interaction, like customer support or feedback analysis.

  2. Pseudonymization:

    1. Example: Substitute identifiable fields such as names or email addresses with unique identifiers or pseudonyms. For instance, "John Doe" becomes "JD12345".

    2. Application: Effective in datasets where the data needs to be re-identified at a later stage, like longitudinal studies or customer behavior analysis.

  3. Redaction:

    1. Example: Remove or black out sensitive information in documents. For example, in legal contracts, sensitive clauses or client details can be redacted.

    2. Application: Useful for legal, healthcare, or financial documents where only certain parts of the text are needed for analysis.

  4. Differential Privacy:

    1. Example: Add noise to data or use aggregated data instead of individual records. For instance, use average sales figures of a region instead of individual store sales.

    2. Application: Ideal for statistical analysis where individual data points are less relevant than the overall trends.

  5. Data Swapping:

    1. Example: Swap values between records in a dataset. For instance, swap the locations or ages between different records in a customer database.

    2. Application: Useful in maintaining the integrity of the dataset for analysis while ensuring individual data points cannot be traced back to an individual.

  6. Generalization:

    1. Example: Replace precise values with broader categories. For example, instead of exact ages, categorize customers into age groups like 20-30, 31-40, etc.

    2. Application: Effective in marketing or demographic analysis where trends across broader groups are more relevant than individual data.

  7. Tokenization:

    1. Example: Replace sensitive data with non-sensitive equivalents, known as tokens. Credit card numbers can be replaced with unique token IDs.

    2. Application: Commonly used in payment processing or financial transactions.

  8. Encryption with Controlled Access:

    1. Example: Encrypt data but provide decryption keys only to authorized systems, like ChatGPT's API, under specific conditions.

    2. Application: Ensures data privacy during transmission and when at rest, useful in scenarios where data might need to be re-identified under controlled circumstances.
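
To make a few of these techniques concrete, here is a minimal Python sketch that applies masking/pseudonymization, redaction, and generalization to a prompt before it is sent to an external model. The function names, regex patterns, and placeholder labels are illustrative assumptions, not a complete PII scrubber; adapt them to your own data.

```python
import re

# In-memory pseudonym map so the same name always maps to the same placeholder.
# Keep this mapping if you need to re-identify results later (pseudonymization);
# discard it if you never want the link back (masking).
_pseudonyms: dict[str, str] = {}

def pseudonymize_name(name: str) -> str:
    """Replace a real name with a stable placeholder like 'Customer1'."""
    if name not in _pseudonyms:
        _pseudonyms[name] = f"Customer{len(_pseudonyms) + 1}"
    return _pseudonyms[name]

def generalize_age(age: int) -> str:
    """Generalization: replace an exact age with a broad bucket like '31-40'."""
    low = (age - 1) // 10 * 10 + 1
    return f"{low}-{low + 9}"

def scrub_prompt(text: str, known_names: list[str]) -> str:
    """Redact emails and US SSNs, then pseudonymize known customer names."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL REDACTED]", text)  # redaction
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN REDACTED]", text)      # redaction
    for name in known_names:                                             # masking
        text = text.replace(name, pseudonymize_name(name))
    return text

prompt = "Summarize: John Doe (john.doe@acme.com, SSN 123-45-6789) asked about pricing."
print(scrub_prompt(prompt, known_names=["John Doe"]))
# -> Summarize: Customer1 ([EMAIL REDACTED], SSN [SSN REDACTED]) asked about pricing.
print(generalize_age(37))  # -> '31-40' (use instead of the exact age in prompts)
```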


On-Premise and Open-Source LLM Solutions:

The final option is to run your own LLM. There are many open-source LLMs available for commercial use (Llama 2 and Mistral, for example).
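
As a rough illustration, here is a minimal sketch of running an open-weight model locally with the Hugging Face transformers library. The model name is just an example of a permissively licensed model, and the hardware note in the comments is approximate.

```python
# Minimal sketch of running an open-weight LLM on your own hardware with the
# Hugging Face `transformers` library (the `accelerate` package is needed for
# device_map="auto"). A 7B model typically needs roughly 16 GB of GPU memory
# in half precision, less with quantization.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example open-weight model
    device_map="auto",                           # use a GPU if one is available
)

result = generator(
    "List three risks of pasting proprietary source code into a public chatbot.",
    max_new_tokens=200,
)
print(result[0]["generated_text"])
```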


But this comes with its own list of pros and cons.

➕ Pros

  1. Control Over Data and Privacy:

    1. Advantage: Complete control over the data being processed, ensuring that sensitive information doesn't leave the organizational boundary.

    2. Relevance: Especially important for businesses handling sensitive or confidential information.

  2. Customization and Specialization:

    1. Advantage: Ability to tailor the LLM to specific business needs and optimize it for particular types of tasks or industries.

    2. Relevance: Beneficial for niche industries or specialized applications.

  3. Improved Performance and Latency:

    1. Advantage: Potentially faster response times as the model is hosted on local servers.

    2. Relevance: Important for time-sensitive applications.

  4. Compliance and Regulatory Control:

    1. Advantage: Easier to comply with regional data protection laws and industry-specific regulations.

    2. Relevance: Critical for businesses in highly regulated sectors like finance, healthcare, and legal.

  5. Data Security:

    1. Advantage: Enhanced security measures can be implemented as per organizational standards.

    2. Relevance: Vital for maintaining data integrity and confidentiality.

  6. No Dependency on External Providers:

    1. Advantage: Reduced risk of service disruptions from external cloud service providers.

    2. Relevance: Important for organizations requiring high availability and reliability.


➖ Cons

  1. High Initial and Ongoing Costs:

    1. Challenge: Significant investment in hardware, software, and maintenance.

    2. Relevance: A major consideration for smaller organizations or those with limited IT budgets.

  2. Technical Expertise Required:

    1. Challenge: Need for skilled personnel to deploy, manage, and update the LLM.

    2. Relevance: Can be a barrier for organizations without in-house AI expertise.

  3. Scalability Issues:

    1. Challenge: Scaling up requires additional hardware and can be more complex compared to cloud solutions.

    2. Relevance: Important for businesses anticipating growth or fluctuating demands.

  4. Resource Intensive:

    1. Challenge: Requires significant computational resources, leading to high energy consumption.

    2. Relevance: A consideration for organizations concerned about operational costs and environmental impact.

  5. Maintenance and Upgrades:

    1. Challenge: Regular maintenance and updates are needed to ensure performance and security.

    2. Relevance: Requires ongoing commitment of resources.

  6. Risk of Obsolescence:

    1. Challenge: Rapid advancements in AI technology might render the in-house model outdated.

    2. Relevance: A concern for organizations looking to maintain cutting-edge capabilities.






