Is my data safe?

Pillar 2 left you with the most reliable way to hold hallucination down: ground the model in your own verified data. It’s good advice, and it quietly smuggles in a harder question. To ground a model in your data, you have to give the model your data — and the moment you do, a different risk opens up. Is that data safe to hand over in the first place?

When people ask whether a generative AI tool is “safe,” they usually mean the model — is it secure, will it behave. But the part of safety actually in your control is more concrete than the model’s inner life: where does your data go when you use the thing, who can see it on the way through, and is it allowed to be there at all? That is not a question about the model. It is a question about deployment — and it has one answer-shaped lever: cloud, local, or a deliberate mix of the two.

This is the Inputs bucket of the CIDA lens from Pillar 1, and it is the one pillar where the right architecture, chosen up front, closes the risk almost entirely — and the wrong one leaves it open no matter how careful everyone is afterward.

Laid over that lens, this pillar lights up a single bucket: Inputs. What you feed the system, and where you let it travel, is the whole game here.

Context

Where and how the system is used — which sets your privacy and residency bar.

Inputs

What data goes in, and where it's allowed to go. Data safety is almost entirely this bucket, and it is where this pillar lives.

Decision procedure

The model, prompt, and guardrails that produce the output.

Action

What happens to the output, and who checks it.

What you’re actually handing over

Send a prompt to a hosted model and your words leave your building. What happens to them next is the whole question. Four risks ride along.

It leaves your perimeter

Your text travels to someone else's servers to be processed. Whatever protections you had inside your walls stop at the edge of them.

It may be logged, retained, or trained on

On the consumer chat tiers, services log prompts, keep them for a window, and may use them to train the model. The enterprise and API tiers contractually don't train on your prompts, and zero retention is available on request, but only if that's the tier you're actually on.

Memorization & regurgitation

A model that is trained on your prompts can memorize distinctive snippets from them and later surface those snippets to someone else. Your secret becomes a pattern in a system you don't control.

Regulatory & jurisdictional exposure

If the data is personal or regulated — GDPR, HIPAA — sending it across a border or to a third party can be the violation itself, before anything ever leaks.

The deployment spectrum

The principle is simple. Every model runs in one of three places — the public cloud, inside your own four walls, or a deliberate split between them — and the only real decision is which data goes where. Get that mapping right up front and the safety question is effectively answered before a single prompt ever leaves the building.

Every way of running a model sits somewhere on a line from total convenience to total control. The trick is to pick your spot on purpose, per kind of data, rather than landing on “whatever was easiest to sign up for.”

More convenient ← → more control

Easiest

Public cloud API

Sign up, send a prompt. Cheapest to start, and your data lives by the provider's terms. Right for work that isn't sensitive.

Middle

Private / VPC

The model runs inside your own cloud account or a dedicated tenant. Data stays in your boundary; you set retention and access.

Most control

On-prem / local

Open-weight models on hardware you own (via something like vLLM). Nothing leaves the building — complete data sovereignty, at the cost of running the iron.
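
One way to make "nothing leaves the building" checkable is to refuse any inference endpoint whose host is not on an explicit internal list. The sketch below assumes a locally hosted server that exposes an OpenAI-compatible HTTP API (vLLM provides one). The host names, the port, and the `ALLOWED_HOSTS` set are hypothetical placeholders, not values from any real deployment.

```python
from urllib.parse import urlparse

# Hypothetical internal hosts where an open-weight model is served.
ALLOWED_HOSTS = {"localhost", "127.0.0.1", "llm.internal.example"}


def assert_local_endpoint(base_url: str) -> str:
    """Return base_url unchanged if its host is on the allowlist, else raise."""
    host = urlparse(base_url).hostname
    if host not in ALLOWED_HOSTS:
        raise ValueError(f"refusing to send data to non-local host: {host!r}")
    return base_url


def chat_completions_url(base_url: str) -> str:
    """Build the chat endpoint URL for an OpenAI-compatible server."""
    return assert_local_endpoint(base_url).rstrip("/") + "/v1/chat/completions"
```

A call such as `chat_completions_url("http://localhost:8000")` passes, while any host outside the list raises before a request is ever built. A check like this is only one layer; network-level egress rules would normally sit underneath it.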

You don’t have to pick one

The choice is not binary, and treating it as binary is how teams either over-pay for sovereignty they don’t need or leak data they couldn’t afford to lose. The practical answer is segregation: classify your data by sensitivity, keep the crown jewels on local or private infrastructure, and let everything mundane use the cheap, easy cloud.

A patient record runs on your own hardware; a marketing brainstorm goes to the public API. Same company, two deployments, each matched to what that piece of data can actually tolerate. The art is in drawing the line in the right place — high enough that you’re protected, low enough that you’re not paying to lock up a press release.
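
As a rough illustration of segregation, the routing can be written down as a small policy table. The sensitivity labels are the four this piece uses when classifying data; the exact mapping, in particular where "internal" lands, is an assumption for the sake of the example rather than a recommendation, and the names `POLICY` and `route` are hypothetical.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    REGULATED = "regulated"


class Deployment(Enum):
    PUBLIC_API = "public cloud API"
    PRIVATE_VPC = "private / VPC"
    ON_PREM = "on-prem / local"


# Illustrative policy only: where each class of data is allowed to run.
POLICY = {
    Sensitivity.PUBLIC: Deployment.PUBLIC_API,
    Sensitivity.INTERNAL: Deployment.PRIVATE_VPC,
    Sensitivity.CONFIDENTIAL: Deployment.PRIVATE_VPC,
    Sensitivity.REGULATED: Deployment.ON_PREM,
}


def route(sensitivity):
    """Pick a deployment; unknown or unlabeled data fails closed to on-prem."""
    return POLICY.get(sensitivity, Deployment.ON_PREM)
```

Under this policy a patient record labeled `Sensitivity.REGULATED` routes to `Deployment.ON_PREM` and a marketing brainstorm labeled `Sensitivity.PUBLIC` routes to `Deployment.PUBLIC_API`. Failing closed for unlabeled data is a design choice: it costs some convenience but keeps a missing label from becoming a leak.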

It takes one well-meaning engineer to make all of this concrete.

In April 2023, Samsung engineers pasted proprietary semiconductor source code into ChatGPT to debug it, and fed internal meeting transcripts in for summaries. Three leaks in twenty days. The data now sat on external servers Samsung had no way to reach and delete. Within weeks the company banned generative-AI tools across the business and started building its own.

That episode — reported by TechCrunch in May 2023 — wasn’t a breach. No model misbehaved and nobody was hacked. Sensitive data simply walked out the door through a text box, because the deployment was “paste it into a website on the internet.” The fix was never a better prompt or a more careful employee. It was a different place for the data to go.

What the providers actually offer

Here is the part the Samsung story hides: a provider’s enterprise and API tiers are a different world from its free chat box. On the paid tiers, the major providers contractually don’t train on your data, offer zero-retention options, and will sign the agreements that regulated work requires. The protection is real — it just lives on the right account.

Provider	Trains on your data?	Default retention	HIPAA BAA	EU residency / GDPR
AWS Bedrock	No (inference); not shared with providers	Zero by default; ~30 days on some models for abuse checks	Eligible under the AWS BAA	DPA + SCCs; EU region (model-dependent)
Google Vertex AI	No on enterprise/API	Partly configurable; Search-grounding keeps 30 days	Yes, under Google’s BAA (GA models)	DPA + SCCs/DPF; EU via regional endpoint
OpenAI	Consumer: yes (opt-out) · API/Enterprise: no	API ~30 days; ZDR on approval	API (ZDR) + ChatGPT Enterprise	DPA + SCCs; EU for API + Enterprise
Anthropic	Consumer: opt-out · API/Enterprise: no	API ~30 days; ZDR on approval	API + Claude Enterprise	DPA + SCCs; EU only via Bedrock/Vertex/Foundry

Accurate as of June 11, 2026, and subject to change — confirm before relying on any figure. Notes: consumer tiers train by default unless you opt out (OpenAI) or decline (Anthropic); ZDR (zero data retention) and HIPAA BAAs are sales-negotiated, not self-serve; "EU residency" means storage and inference in an EU region, with exceptions — Google's Grounding-with-Search exits the EU, and Anthropic's first-party API is US/global only (EU residency comes via the cloud partners).

Two things that table won’t let you forget. First, almost every protection is a property of the API or enterprise account, not the consumer chat — the no-training guarantee, zero retention, the HIPAA agreement. The free web tier has none of them by default, and that is precisely the door Samsung walked through. Second, the API alone gets you the big one — no training on your data — but zero retention and a signed BAA are extra steps you actually take, not things you get for showing up.

A common mix-up — a BAA is not a GDPR step

Precision matters here, because the BAA and the DPA get conflated constantly. A BAA (the Business Associate Agreement in the HIPAA BAA column) is a HIPAA instrument. It governs US health data, and signing one does nothing for GDPR. If the data you're handing over is the personal data of someone in the EU, a BAA does not make you compliant. The instrument you need there is a DPA, the Data Processing Addendum, which every major provider now folds into its terms automatically. The moment that data crosses to a US-based model you also lean on the SCCs, the Standard Contractual Clauses the same DPA carries by default. The cleaner alternative is to never make the crossing at all: keep the processing inside an EU region, and the transfer question disappears.

These don't substitute for one another. A US clinic piloting an AI scribe signs a BAA; a European retailer signs a DPA. A company that is both a US healthcare operation and a handler of EU customer data signs both — they stack, one on top of the other, because they answer to two different laws.

And neither one is, on its own, "the safe move." A BAA and a DPA are paperwork laid over a decision you have already made — where the data is processed. That decision is still the load-bearing one, the whole argument of this piece. The agreements make a given deployment contractually safe; choosing EU-region or on-prem processing makes it physically safe. For the data you genuinely cannot afford to lose, you want both at once — the signed DPA, and data that never leaves the region to begin with.
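
The stacking rule can be sketched as a small function. It encodes only the distinctions drawn above (BAA for US health data, DPA for EU personal data, SCCs when that data is processed outside the EU) and is a deliberate simplification; the function name and its three flags are hypothetical.

```python
def required_safeguards(us_health_data: bool, eu_personal_data: bool,
                        processed_in_eu: bool) -> list[str]:
    """List the contractual instruments this piece associates with each case."""
    needed = []
    if us_health_data:
        needed.append("BAA")       # HIPAA instrument; does nothing for GDPR
    if eu_personal_data:
        needed.append("DPA")       # GDPR instrument
        if not processed_in_eu:
            needed.append("SCCs")  # relied on when data crosses to a US-based model
    return needed
```

For a US clinic, `required_safeguards(True, False, False)` yields just the BAA. For a company that handles both US health data and EU customer data processed in the US, it yields all three, which is the stacking described above.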

The decision, made simple

You don’t need a committee for this. You need to sort your data, then match it to a place — and be able to prove where it went. The whole framework fits in three questions.

1 · Classify

How sensitive is this data — public, internal, confidential, or regulated? Most teams have never actually sorted it, which is why the question feels hard.

2 · Match

What is the lowest-control deployment this class can safely live in? Regulated and confidential data earns local or private; public data is fine in the cloud.

3 · Verify

Can you show where the data went and who could see it? An answer you can't audit isn't an answer — it's a hope.
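
For the Verify step, one possible shape for an audit entry is a JSON line that records who sent what class of data to which destination, with a digest of the prompt instead of its text. This is a minimal sketch under assumed field names (`data_class`, `destination`, `prompt_sha256`), not a prescribed log schema.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(prompt: str, data_class: str, destination: str, user: str) -> str:
    """Return one JSON line describing where a prompt went, without storing its text."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "data_class": data_class,
        "destination": destination,
        # A digest lets you match a prompt to its log entry without copying the data.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "prompt_chars": len(prompt),
    }
    return json.dumps(record, sort_keys=True)
```

Storing a digest keeps the audit trail from becoming a second copy of the sensitive data. Very short or predictable prompts can still be guessed from a plain hash, so a keyed hash may be a better fit where that matters.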

This is architecture we design. Sorting a data estate by sensitivity, choosing where each class runs, standing up private or on-prem inference — open-weight models on your own hardware via tools like vLLM — and wiring the governance and audit trail that proves data stayed where it was meant to: that is the core of a TwiceData data-safety engagement. Done up front, it turns is my data safe? from a standing worry into a documented answer.

Questions a skeptic asks

The honest pushback, answered straight.

Isn't a big provider's security better than anything I could build?

Their security is excellent — at protecting their systems. That is a different question from whether your data should be on their systems at all. The strongest lock in the world doesn't help if the valuables are sitting in someone else's house, under their retention policy and their jurisdiction.

They say they don't train on my data — isn't that enough?

On the API and enterprise tiers it's real and largely solid: the major providers don't train on that data by default, and some keep it inside your own cloud tenant entirely. But "no training" covers one risk while leaving others in their hands — retention windows, a breach, a subpoena, a policy that changes next year — and on the consumer tier the default is the reverse. The protection is real; just make sure you're standing on the tier that has it.

If we just use the API, are we protected — or do we need the enterprise tier?

The API alone gets you the big one: the major providers don't train on API data by default. But zero retention and a HIPAA BAA are extra steps you take — request ZDR, execute the BAA — not defaults you get for showing up. The enterprise tier isn't required for that protection; it's what stops your people from pasting into the consumer chat in the first place, which is the failure that actually happens. The API only protects the data that goes through the API.

Isn't running it locally just expensive and slow?

It carries real upfront cost and operational weight — that's Pillar 5's subject. But open-weight models on modern hardware are far more capable than most people assume, and for the data that genuinely can't leave, "expensive" beats "breach." You also don't have to run everything locally. Segregate.

We're not in a regulated industry — does any of this apply?

Trade secrets, customer lists, unreleased plans, and source code aren't regulated, and they're still catastrophic to leak — ask Samsung. Regulation raises the stakes; it doesn't create them.

Can't I just anonymize the data before sending it?

Sometimes, and it's worth doing. But de-identification is harder than it looks: free text leaks identifiers you didn't plan for, and "anonymized" data is often re-identifiable when combined with other sources. Treat it as one layer, not the answer.
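
A short sketch shows why pattern-based redaction is only one layer. The two regular expressions below are simplified assumptions covering emails and US-style phone numbers; the sample sentence and the name in it are invented for illustration.

```python
import re

# Simple patterns for two structured identifiers; real data needs far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Mask emails and US-style phone numbers; everything else passes through."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_PHONE.sub("[PHONE]", text)


note = "Call Dana Ruiz at 555-867-5309 or dana@example.com; she runs the only bakery on Elm St."
print(redact(note))
```

The phone number and email are masked, but the person's name and the detail that she runs the only bakery on a named street pass straight through. That is the free-text leakage described above: the structured identifiers are gone and the record is still re-identifiable.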

Isn't this an IT and security problem, not a generative AI one?

It's both, and generative AI changes its shape. A chat box is a frictionless export channel that lives in the browser, used by people who don't register pasting text as "sending data offsite." The old perimeter assumed data left through known doors. This one put a new door on every desk.

What we will not claim

The big providers are too secure to leak my data.

Their security protects their infrastructure, not your decision to put data on it. The exposures that matter here usually aren't breaches — they're your own data, sent somewhere it shouldn't have gone, the system working exactly as designed.

If it's encrypted in transit, my data is safe.

Encryption in transit protects the data on the wire. It says nothing about what happens once it arrives — how it's logged, how long it's kept, who can query it, or which jurisdiction now holds it. Safe-in-transit is not safe-at-rest, and neither is safe-from-policy.

Where this goes next

Get the data architecture right and you’ve closed the risk you control. The next pillar is about the risk you invite: what happens when people stop checking the model’s work and simply do what it says. Pillar 4, Use AI as a tool, not a master, is about blind acceptance, the most expensive habit in applied AI.

Whether you already have data flowing to generative AI services and need to know what’s actually leaving, or you’re designing a system and want the data boundaries right from day one, classifying your data and matching it to the right deployment is concrete, answerable work — and a good place for a free 60-minute call to start. For how this fits the data and AI work we deliver, see our approach to AI engagements.
