Why the LLM never sees your row values
DuckViz uses AI to generate dashboards, reports, and decks — but the AI never touches your customers' data. Here's exactly what gets sent to OpenAI on every call, what doesn't, and why we drew the line where we did.
The first question I get from any security-aware engineer evaluating DuckViz is some version of: "You're using AI to build dashboards. So my customers' data is going to OpenAI?"
The answer is no, and the rest of this post is the receipt. I'll walk through every place the AI is involved, show you exactly what payload leaves the device, and acknowledge the two narrow surfaces where some data does get sampled — and how those compromises are bounded.
This isn't a marketing claim. It's the load-bearing wall the rest of the architecture rests on, and you can verify every word of it from your browser's network tab.
The contract — three categories
Across every AI call DuckViz makes, the LLM sees one of three categories of information, and never anything outside them:
- Schema — column names and types, e.g. created_at: TIMESTAMP, customer_id: VARCHAR, amount: DECIMAL(10,2).
- Pre-aggregated query results — the rows produced by the SQL the AI itself wrote, after they've been counted, grouped, summed, or topped. A chart titled "Revenue by month" sends 12 numbers, not 200,000 transactions.
- Tightly bounded samples — only in two specific situations, both documented below, both surfacing in the UI before they happen.
What the LLM never sees: full rows, joined tables, your customer identifiers, your PII columns by value, or anything DuckDB returns from a query the user hasn't run.
What the AI sees, by surface
DuckViz has four AI-powered surfaces. Here's what each one actually sends.
Widget flow (dashboard generation)
When the user asks the AI to build a dashboard, the request body to /api/widget-flow/fast-recommend looks roughly like this:
```json
{
  "tableName": "t_orders",
  "schema": [
    { "name": "created_at", "type": "TIMESTAMP" },
    { "name": "amount", "type": "DECIMAL(10,2)" },
    { "name": "country", "type": "VARCHAR" },
    { "name": "status", "type": "VARCHAR" }
  ],
  "rowCount": 247392,
  "userPrompt": "Show me revenue trends and top countries"
}
```

That's it. No rows. The AI looks at the schema, picks chart types from a fixed registry, writes SQL targeting t_orders, and streams the SQL back. The browser then runs that SQL against the local DuckDB-WASM instance. Results never leave the device.
If the AI's SQL fails (a column doesn't exist, a type mismatch), DuckViz auto-retries by sending the error message and the failed SQL to the LLM — never the rows DuckDB would have returned.
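For concreteness, here's a sketch of how a client might assemble that schema-only payload from the output of a local DuckDB DESCRIBE query. The helper name and field mapping are illustrative, not DuckViz's actual code — the point is that nothing except column metadata ever gets copied into the request:

```typescript
interface ColumnSchema { name: string; type: string }

interface WidgetFlowRequest {
  tableName: string;
  schema: ColumnSchema[];
  rowCount: number;
  userPrompt: string;
}

// Hypothetical helper: given DESCRIBE output and a row count from the
// local DuckDB-WASM instance, build the request body. Only column
// metadata is read — no row values can end up in the payload.
function buildWidgetFlowRequest(
  tableName: string,
  describeRows: { column_name: string; column_type: string }[],
  rowCount: number,
  userPrompt: string
): WidgetFlowRequest {
  return {
    tableName,
    schema: describeRows.map(r => ({ name: r.column_name, type: r.column_type })),
    rowCount,
    userPrompt,
  };
}
```

Because the function's inputs are restricted to metadata, auditing the payload reduces to auditing this one mapping.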
Report and deck generation
The report editor and deck presenter call /api/generate-report-sections and /api/generate-deck-slides respectively. Each request includes the dashboard's widgets and the chart-data results — which are themselves the output of aggregation queries.
A "Top 10 customers by revenue" widget sends ten rows: customer name and revenue total. That's a presentation-grade payload, equivalent to what the user would paste into a slide anyway. A "Total transactions" KPI sends one number.
What's not in those payloads: the underlying transactions, the joined customer table, the addresses, the email columns. The AI gets the chart, not the data behind the chart.
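One way to make "presentation-grade" a property the code enforces rather than a convention: cap the row count a widget may attach before the request is built. The function and the MAX_ROWS_PER_WIDGET constant below are hypothetical names, a sketch of the idea rather than DuckViz internals:

```typescript
// A "Top N" chart or a KPI fits comfortably under this; a raw table dump does not.
const MAX_ROWS_PER_WIDGET = 50;

interface WidgetData { title: string; rows: unknown[] }

// Refuse to build a report payload from anything that looks like
// unaggregated data. The error fires locally, before any network call.
function toReportPayload(widgets: WidgetData[]): { widgets: WidgetData[] } {
  for (const w of widgets) {
    if (w.rows.length > MAX_ROWS_PER_WIDGET) {
      throw new Error(
        `widget "${w.title}" carries ${w.rows.length} rows — aggregate before sending`
      );
    }
  }
  return { widgets };
}
```

A guard like this turns the contract into a failing test instead of a documentation promise.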
Date column samples — the first compromise
Here's the first place actual values cross the wire — narrowly. When a query involves a DATE or TIMESTAMP column, the client samples five values per column and sends them in the request body. Without them, the LLM can't pick the right strptime specifier — %Y-%m-%d versus %d/%m/%Y versus a Unix epoch.
```json
{
  "dateSamples": {
    "created_at": ["2026-04-12 09:31:22", "2026-04-12 11:08:55", ...],
    "shipped_at": ["2026-04-15", "2026-04-16", ...]
  }
}
```

Five timestamps per date column. That's the leak surface, and it's deliberately small. We chose it over the alternative — telling users to manually configure date format strings — because the friction would have killed the product.
If your date columns themselves contain sensitive values (rare; dates are usually just dates), this is the place to know about. The samples are visible in the network tab, the same as any other request body.
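The sampling step can be sketched like this — the shapes are assumed, and the local DuckDB query is abstracted behind a callback so the boundary is visible: only columns whose declared type is DATE or TIMESTAMP are ever read, and only five values each.

```typescript
const SAMPLE_COUNT = 5;

// Hypothetical sketch of the date-sample payload described above.
// getValues stands in for a query against the local DuckDB instance.
function sampleDateColumns(
  columns: { name: string; type: string }[],
  getValues: (col: string, limit: number) => string[]
): Record<string, string[]> {
  const dateSamples: Record<string, string[]> = {};
  for (const c of columns) {
    // Type check on the declared column type — VARCHAR, DECIMAL, etc.
    // can never be sampled, whatever their contents.
    if (c.type === "DATE" || c.type.startsWith("TIMESTAMP")) {
      dateSamples[c.name] = getValues(c.name, SAMPLE_COUNT);
    }
  }
  return dateSamples;
}
```

The leak surface is exactly the set of keys this function can emit, which is why it stays auditable.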
Log format detection — the second compromise
The other place values can cross the wire is when the user uploads a log file. But before the LLM is involved at all, DuckViz runs a local matcher against @duckviz/parser-catalog — a client-side registry of built-in formats (nginx, Apache, syslog, Sysmon XML, JSON Lines, AWS VPC flow logs, Cloudtrail, and more) plus any custom formats the user has defined in their settings.
```
log file → detectLocal({ userFormats })   // entirely in the browser
  ├── matches a built-in or custom format? → use it. No network call. Done.
  └── no match? → POST /api/detect-log-format with 5–10 sample lines
```
The local match runs regex patterns against the first few lines and the file name. If it hits, parsing starts immediately — no data leaves the device for known formats, which covers most of what people throw at it.
Only when nothing in the catalog matches does the LLM get called. The first 5–10 lines of the log are sent to /api/detect-log-format, and the AI returns a parser config that the WASM log parser then uses locally to chew through the rest of the file. This unrecognized-format path is the only place raw, unaggregated values leave the device, and it's gated by two conditions: the user explicitly picked the file, and the local matcher has already declined.
If you find yourself hitting this path often, add the format to your account's custom-format list once; every future upload of that shape is then detected locally. Most users add one or two and stop seeing the network call.
If you're embedding DuckViz in a product where users upload logs, this is the integration surface to be loud about in your privacy policy — sized appropriately to the unknown-format fraction of uploads, not all uploads. For products that only use the datasets prop (passing pre-fetched API data to the components), this path never fires at all.
Why the line is there
Every category I draw — schema is fine, results-of-AI-written-SQL are fine, raw rows are not — comes from a simple test: would the product still work if we removed it?
- Without schema, the AI can't write correct SQL. The product wouldn't work.
- Without pre-aggregated results, the report generator can't write narratives that reference the actual chart numbers ("revenue grew 23% in Q3"). The product wouldn't work.
- Without raw row values, the AI can still pick chart types, write SQL, generate report copy, and produce decks. The product works fine.
So we don't send raw rows.
This is the same logic that decides what gets sent to a remote engineer reviewing a customer's data problem. You'd send them the schema and the query plan, not a CSV dump.
What this buys you on the compliance side
The privacy posture isn't an aesthetic choice. It's load-bearing for what you can do with DuckViz inside a regulated product:
- GDPR: Personal data lives in row values, not column names. A schema entry like email: VARCHAR isn't personal data; the email addresses in that column are. We don't send them.
- SOC2 / data residency: The customer's row values never traverse a network boundary they didn't already authorize for the source data. If your APIs already comply, embedding DuckViz doesn't broaden the trust boundary.
- HIPAA-style PHI: The same reasoning applies. PHI is in the row values. The AI sees the table shape, not the patient records.
I'm not a lawyer. None of this replaces your own compliance review. But it does mean the review you have to run is mostly the one you've already run for your own product, plus the two narrow surfaces above.
You can verify this yourself
Open the network tab while you use DuckViz. Filter for requests to /api/widget-flow/, /api/generate-report-sections, and /api/generate-deck-slides. Look at every request body. If you find an actual row value — outside the date sample columns and the log-detection path — that's a bug. Open an issue.
The contract is the contract because the network tab makes it auditable. Anything I write here that doesn't match what you see in there, you should call out.
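If you'd rather automate the check than eyeball request bodies, a canary-style audit works: plant a known row value (a distinctive email, say) in your dataset and assert it never appears in an AI-bound payload. The helper below is hypothetical; the endpoint paths are the ones named in this post:

```typescript
// The AI-facing endpoints this post describes.
const AI_ENDPOINTS = [
  "/api/widget-flow/",
  "/api/generate-report-sections",
  "/api/generate-deck-slides",
];

// Return every planted row value found in a request body bound for an
// AI endpoint. Non-AI requests are out of scope and return no hits.
function auditBody(url: string, body: string, knownRowValues: string[]): string[] {
  if (!AI_ENDPOINTS.some(p => url.includes(p))) return [];
  return knownRowValues.filter(v => body.includes(v));
}
```

Wired into a fetch wrapper or a Playwright network hook, this turns the post's claim into a regression test: any non-empty result outside the two documented sample surfaces is the bug to report.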
Where this leaves you
If you're embedding DuckViz, the integration story shrinks: your security team only needs to review what's in your existing data pipeline. If you're using the hosted app, the same applies — the file you upload is parsed and queried in your browser, not on our server.
The wall is exactly as load-bearing as I claim, because the product was designed around it being load-bearing. Schema goes out, rows stay home. Everything else is a consequence of that.
— Vikas