Querying logs in the browser — Rust, WASM, and DuckDB
DuckViz parses and queries log files entirely in the browser — no backend, no upload. Here's the stack that makes it work, the SQL it lets you write, and the non-obvious bits that took the longest to get right.
Vikas Awaghade · 6 min read
It's 4 PM. A customer reports something looks wrong. You have a 200MB nginx access log on your laptop and forty-five minutes before standup. Five years ago the answer was: ship the file somewhere, build a pipeline, find a Kibana dashboard, give up, grep.
Today, the answer is: open a browser tab.
DuckViz runs the entire log-analysis stack inside the page — a Rust parser compiled to WebAssembly, DuckDB-WASM as the query engine, all communicating through a Web Worker so the UI stays responsive. The file you drop never touches a network. This post is the engineering tour of how the stack fits together and what it actually lets you do.
The pipeline
file (drag-and-drop)
│
▼
detectLocal({ userFormats }) ← @duckviz/parser-catalog
│ regex + filename hints, runs in <50ms
▼
@duckviz/parser (Rust → WASM, in a Web Worker)
│ structured JSON rows
▼
DuckDB-WASM
│ ingestJsonRows(records, tableName) via Arrow IPC
▼
SQL workspace (browser)
Three packages, one page, zero backends. Let me walk through each layer.
1. Format detection — local first, LLM only as fallback
Before any parsing happens, the file goes through detectLocal({ userFormats }). This runs against @duckviz/parser-catalog — a client-side registry of built-in formats (nginx, Apache, syslog, Sysmon XML, JSON Lines, AWS VPC flow, CloudTrail, and a couple dozen others) plus any custom formats the user has saved.
The matcher uses two signals: filename hints (access.log, error.log, *.evtx) and regex matches against the first few lines. If any catalog entry hits, the corresponding parser config is selected and the rest of the pipeline starts immediately. The LLM is only called when no catalog entry matches, and only sees a handful of sample lines — covered in detail in the privacy post.
For most uploads, format detection is a few regex evaluations and never hits the network.
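Under stated assumptions, the local matcher amounts to something like the sketch below. The entry shape and function signature are illustrative; the real @duckviz/parser-catalog API almost certainly differs.

```typescript
// Hypothetical shape of a catalog entry; the real package's types may differ.
interface CatalogFormat {
  id: string;
  filenameHint?: RegExp; // e.g. /access\.log$/
  lineProbe: RegExp;     // must match one of the first few lines
}

// Return the first format whose filename hint or line probe matches.
function detectLocal(
  filename: string,
  firstLines: string[],
  formats: CatalogFormat[],
): CatalogFormat | null {
  for (const f of formats) {
    const nameHit = f.filenameHint?.test(filename) ?? false;
    const lineHit = firstLines.some((l) => f.lineProbe.test(l));
    if (nameHit || lineHit) return f;
  }
  return null; // no catalog match: caller falls back to the LLM path
}
```

If nothing in the catalog matches, the caller falls through to the LLM path described above.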
2. Parsing — Rust compiled to WebAssembly
The parser itself is a Rust crate in the monorepo, compiled to WebAssembly via wasm-pack. It runs inside a Web Worker so a 200MB file doesn't freeze the main thread.
What it actually does:
Streaming line reader with backpressure into the Worker.
Format-specific extractors: regex-based for delimited formats (CLF, syslog), XML attribute walker for Sysmon and Windows event logs, JSON parser for JSON Lines.
Field shaping: every record comes out as Record<string, unknown> with primitive types — strings, numbers, booleans, dates as ISO strings.
The Rust code is doing what a Logstash filter chain does, except it ships in a .wasm blob, runs at native-ish speed, and never sees a network.
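To make the extractor stage concrete, here is a toy regex-based CLF parser in TypeScript that emits the flat primitive records described above. The field names are chosen to line up with the queries later in the post; the real Rust extractor is streaming and far more thorough.

```typescript
// Parse one Common Log Format line into a flat record of primitives.
const CLF = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)/;

function parseClfLine(line: string): Record<string, unknown> | null {
  const m = line.match(CLF);
  if (!m) return null;
  return {
    remote_addr: m[1],
    ts: m[2],          // left as a string; timestamp parsing happens later
    method: m[3],
    request_uri: m[4],
    status: Number(m[5]),                      // number, not string
    bytes: m[6] === "-" ? null : Number(m[6]), // "-" means no body
  };
}
```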
Two pieces took disproportionately long to get right:
XML with single-quote attributes (Sysmon)
Most XML parsers assume double-quote attributes (<Event id="123">). Sysmon and a lot of Windows event logs use single-quote attributes (<Event id='123'>). The Rust extractor has its own xml_attr() walker that handles both, plus an EventData scanner for the deeply-nested <Data Name='X'> payloads that Sysmon produces. Falling back to a generic XML library would have meant dragging in two megabytes of dependencies.
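The same idea in miniature: a TypeScript sketch of an attribute grabber that accepts either quote style. The real walker is Rust and does much more (the EventData scan, for one), so treat this as an illustration of the quoting problem, not the implementation.

```typescript
// Pull one attribute value out of a tag, accepting 'single' or "double" quotes.
// Sketch only: no entity decoding, no handling of quotes inside values.
function xmlAttr(tag: string, name: string): string | null {
  const re = new RegExp(`\\b${name}\\s*=\\s*(['"])(.*?)\\1`);
  const m = tag.match(re);
  return m ? m[2] : null;
}
```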
Sub-second timestamps
Strptime is unforgiving about separators. A timestamp like 2026-04-29 14:33:21,127 (comma before milliseconds) won't match %Y-%m-%d %H:%M:%S.%f because of the comma. We normalize before parsing:
strptime normalization
-- regexp_replace turns ",127" or ":127" at the end into ".127"
SELECT strptime(
  regexp_replace("ts", '[,:]([0-9]+)$', '.\1'),
  '%Y-%m-%d %H:%M:%S.%f'
) AS ts
FROM t_app_log;
That regexp_replace runs inside DuckDB-WASM at query time, so it's effectively free — the rewrite is just SQL, not a parser change.
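The same rewrite also works before ingestion, if you ever want it in the Worker instead of in SQL. This one-liner is equivalent in effect but is not taken from the actual parser:

```typescript
// ",127" or ":127" as a trailing fraction becomes ".127", so %f can match.
function normalizeSubseconds(ts: string): string {
  return ts.replace(/[,:](\d+)$/, ".$1");
}
```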
3. Ingestion — Arrow IPC into DuckDB-WASM
Once the parser emits structured rows, they get fed into DuckDB-WASM via the engine's Arrow IPC path. The engine is configured with three flags worth knowing about:
app/providers.tsx
<DuckvizDBProvider
  persistence       // tables survive page reload via IndexedDB
  arrowIngest       // zero-copy batch insert through Apache Arrow
  batchSize={5000}  // chunk size — sweet spot for memory + throughput
>
  {children}
</DuckvizDBProvider>
arrowIngest matters most. Without it, ingestion goes row-by-row through prepared statements; with it, batches go in as Arrow record batches and the engine bulk-loads. On the same hardware, the difference is roughly an order of magnitude on big files.
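The batching side of this is deliberately boring: conceptually, the Worker slices the row stream into fixed-size chunks before handing each chunk to the Arrow path. This is illustrative, not the actual ingest code:

```typescript
// Slice rows into batches of `size` (5000 in the config above).
// Each yielded batch would become one Arrow record batch.
function* batches<T>(rows: T[], size: number): Generator<T[]> {
  for (let i = 0; i < rows.length; i += size) {
    yield rows.slice(i, i + size);
  }
}
```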
Each file gets its own table named t_<sanitized_filename> — app.log becomes t_app_log, auth-2026-04.log becomes t_auth_2026_04. Per-file tables matter for two reasons:
No naming collisions when multiple log files are open in the same session.
Targeted queries — you can write SELECT * FROM t_access_log JOIN t_error_log USING (request_id) and DuckDB just does the right thing.
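The naming rule is easy to approximate: lowercase the filename and collapse every non-alphanumeric run to an underscore. This is a guess at the exact rule (it reproduces t_app_log, but yields t_auth_2026_04_log rather than t_auth_2026_04 for the second example), so treat it as a sketch:

```typescript
// Guessed sanitizer: lowercase, non-alphanumeric runs -> "_", prefix "t_".
function tableNameFor(filename: string): string {
  const sanitized = filename
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, ""); // trim stray leading/trailing underscores
  return `t_${sanitized}`;
}
```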
Tables are exported as Parquet to IndexedDB on onAfterPersist, so a refresh doesn't re-ingest the same file. The browser becomes a real database that survives a page reload.
4. Querying — what you actually get
Once ingested, the file is a queryable table. Everything below ran on a 50MB nginx access log in a regular browser tab:
5xx errors by IP
SELECT remote_addr, COUNT(*) AS errors
FROM t_access_log
WHERE status >= 500
GROUP BY remote_addr
ORDER BY errors DESC
LIMIT 20;
Traffic by 5-minute bucket
SELECT
  date_trunc('minute', ts)
    - INTERVAL (EXTRACT(MINUTE FROM ts)::INT % 5) MINUTE AS bucket,
  COUNT(*) AS requests
FROM t_access_log
GROUP BY bucket
ORDER BY bucket;
Joining auth + access logs by request id
SELECT a.user_id, a.ip, x.path, x.status
FROM t_auth_log a
JOIN t_access_log x USING (request_id)
WHERE a.action = 'login_success'
  AND x.status >= 400;
Percentiles — slow request latency per route
SELECT
  request_uri,
  COUNT(*) AS hits,
  approx_quantile(request_time, 0.50) AS p50_ms,
  approx_quantile(request_time, 0.95) AS p95_ms,
  approx_quantile(request_time, 0.99) AS p99_ms
FROM t_access_log
GROUP BY request_uri
HAVING COUNT(*) > 100
ORDER BY p99_ms DESC
LIMIT 20;
These are queries you'd run in production against a real database. They run identically in the browser, against your file, with no server in the picture.
What this stack displaces
For the "I have a file and a question and forty-five minutes" workflow, the conventional stack (a log shipper, an indexing cluster, a dashboard layer) is overkill. Each of those is the right answer to a different problem: long retention, multi-tenant ingestion, alerting, dashboards across hundreds of services. But for the "answer one question now" job, you don't need any of them. You need a parser, a query engine, and a UI that can hold the file. DuckViz packages those three into a single page.
What it costs
The honest performance picture, on an M2 MacBook in Chrome:
200MB log: parse + ingest ≈ 11 seconds. Queries depend on shape; aggregates over the full table run in 200–600ms.
500MB+: starts to push the browser's memory ceiling on 8GB machines. The memory monitor opens a non-closable modal at 100% of jsHeapSizeLimit and prompts to drop a table.
Tables persist across reloads, so the cost is paid once per file. Subsequent sessions reopen instantly.
Try it
Drop a log file at app.duckviz.com/upload. Open your browser's DevTools Network tab and watch what happens — for any format the catalog recognizes, the only network traffic is a single LLM call to refine date-format detection (covered in the privacy post). The file itself stays where it started.
The whole point of this stack is that you can verify it. Ship a complaint if anything I described doesn't match what you see in the network tab.
— Vikas