Purpose of this document
This document explains how the Digital Fluency platform is built and why each architectural choice was made. The intended reader is a collaborator, contractor, or technical reviewer who wants to understand what we are committing to before writing code.
The architecture is constrained by three things, in this order:
- The pedagogy (pedagogy.md) — the five design moves and the productive-struggle / metacognitive-overload findings dictate what the system must observe, what it must respond to, and how fast.
- The current state of AI agent technology (mid-2026) — what frontier vision-based "computer-use" agents can and cannot do, and where they will plausibly be in 12 months.
- Realistic build cost — what's available open-source, what must be built, what should be deferred to v2.
This is a living document. Field research with digital-skills providers (fieldwork.md) will reshape several sections, particularly the AI co-pilot's intervention specifics and the curriculum-content assumptions.
1. Why simulated, not vision-agent overlay
The single biggest architectural decision is whether to build a simulated environment (we own the canvas, we own the state) or a vision-agent overlay (a Claude / Operator / Gemini computer-use agent watches the user navigate real websites and apps).
We commit to the simulated environment. The case rests on four findings.
1.1 Vision-based computer-use agents cannot meet the latency budget for real-time coaching
Current vision agents take 2–7 seconds per step (screenshot → API call → inference → action). For a real-time tutoring use case — "user makes a mistake, coach intervenes within 1–2 seconds" — that is 4–8x too slow. By the time the agent has analyzed the screen, the user has already moved on or compounded the error.
An instrumented sandbox responds in <10ms because we already know the state — every input field, every focus event, every keystroke pause is local. There is no inference step.
For the Bastani-style engagement-mediated learning that our pedagogy depends on, low-latency feedback is not a nice-to-have. The mechanism is "watch the user attempt, intervene when productive struggle becomes distress." A 5-second latency means the intervention lands after the moment is over.
1.2 The benchmark numbers do not transfer to production
The research synthesis (saved at research/summaries/instrumented-vs-vision-telemetry.md and the computer-use-reliability findings in chat) shows a sharp gap:
- OSWorld (full-OS benchmark): 79.6% top score (Apr 2026) — looks promising.
- Online-Mind2Web (real production websites): 30% for frontier models, 61% for OpenAI Operator.
- Production failure rates: 40% of multi-agent pilots fail within 6 months (Partnership on AI MAST taxonomy).
The benchmark-vs-production gap is not narrowing because production sites adversarially evolve against scrapers and agents (CAPTCHAs, bot detection, dynamic DOM, geo-fencing). Benchmarks don't have those defenses; the real web does.
A learning product whose coaching layer breaks on a meaningful fraction of real sites is not a learning product.
1.3 Vision agents miss the cognitive signals our pedagogy depends on
Even when vision agents correctly identify on-screen state, they cannot see:
- Keystroke timing — pauses between keys signal cognitive load.
- Dwell time on a button before clicking — signals uncertainty.
- Hover-without-click — signals "considered, rejected."
- Partial input — fields half-filled then abandoned.
- Undo / redo events — signal correction of mental model, not just mistake.
- Focus state — which window the user has selected, not just what's on the screen.
- Paste origin — did this text come from another tab, the user's brain, or an AI suggestion?
These are the signals Bastani's RL-driven tutor used to build a richer "knowledge state" than the binary correct/incorrect signals BKT can offer. They are invisible to vision. They are trivial to capture in an instrumented sandbox.
1.4 Build-on-top libraries close most of the gap on the simulated side
The library-landscape research (saved at research/summaries/browser-desktop-simulation-ecosystem.md) found that ~50–60% of the simulated-environment build can be assembled from MIT-licensed open-source projects:
- daedalOS (window manager / desktop shell)
- TipTap (rich-text editor with deep instrumentation hooks)
- ZenFS (formerly BrowserFS — virtual filesystem)
- React Hook Form (form library with full event surface)
The remaining 40–50% is custom — the email client mock, the AI tutor integration layer, and the pedagogical telemetry — but that work is unavoidable in either architecture.
Conclusion
Vision-agent overlay would lock us into a slow, unreliable, telemetry-poor coaching layer running against a defended, adversarial web. A simulated sandbox lets us own the latency budget, the signal richness, and the curriculum surface. The 12-month forecast for vision agents (per the agent-reliability research) is incremental improvement on benchmarks but flat on production-site reliability — the gap is structural.
This decision is revisable. If, by 2027, frontier computer-use agents reach <500ms latency on real sites with >90% reliability, we should re-evaluate. We do not believe that is plausible on this horizon.
2. Architecture overview
What lives in the browser
- The simulated desktop shell, modeled on daedalOS (MIT, actively maintained).
- The five core apps: a contained browser viewport (sandboxed iframe with our overlay), a mock email client (custom), a document editor (TipTap), a form interface (React Hook Form-based), and a file browser (over ZenFS).
- The telemetry layer — a thin event-capture system that wraps every interactive element and emits structured events.
- The AI co-pilot side panel — a thin client that streams events to the backend and renders responses.
What lives on the backend
- The task engine — a state machine per task, with far-transfer task selection and contrasting-case insertion logic.
- The tutor orchestrator — the layer between raw telemetry and LLM calls. Decides whether to intervene, what prompt to build, which model to call.
- The knowledge-state store — per-user pattern-mastery records, near/far transfer scores, engagement metrics.
- The LLM API client (Anthropic) with prompt caching.
What we deliberately don't build
- No browser extension. We are the canvas. The browser app inside our sim is sandboxed; we don't navigate the real web.
- No mobile app for v1. The pedagogy depends on multi-window, hierarchical, persistent state — the smartphone form factor actively undermines what we're trying to teach.
- No offline mode for v1. Simplifies the architecture. Library-distribution constraints (fieldwork.md) may force this in v2.
3. Instrumentation / telemetry layer
The telemetry layer is the architectural choice that makes the rest of the system possible. It is also the largest piece of custom build effort.
What we capture
For every interactive element (input, button, link, draggable, focusable), we capture:
- Click events — target, modifier keys, dwell time before click (cursor parked over element before clicking).
- Keystroke events — key, modifier, inter-key timing, position in input.
- Focus events — which element gained/lost focus, sequence over time.
- Paste events — content, source application (we control all source apps), destination element.
- Undo / redo events — what was undone, time since the action being undone.
- Hover events — duration, target.
- Scroll events — delta, direction, dwell at scroll position.
- Window events — open, close, focus, resize, move.
- Time-on-task — total elapsed time, time-since-last-action, idle stretches.
- Attempt count — how many distinct attempts at the current task subgoal.
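To make the capture surface concrete, here is a minimal sketch of an event shape and a click wrapper, assuming a browser-side TypeScript telemetry module. The type and function names (`TelemetryEvent`, `instrumentClick`) and the field layout are illustrative, not a committed schema.

```typescript
// Illustrative event shape; field names are assumptions, not a committed schema.
type TelemetryEvent = {
  type: "click" | "keystroke" | "focus" | "paste" | "undo" | "hover" | "scroll" | "window" | "idle";
  target: string;                      // stable element id within the simulated app
  timestamp: number;                   // ms, relative to task start
  payload: Record<string, unknown>;    // signal-specific detail
};

// Wrap a simulated-app element so clicks carry dwell-before-click timing
// and hover-without-click gets reported as its own signal.
function instrumentClick(el: HTMLElement, emit: (e: TelemetryEvent) => void): void {
  let hoverStart: number | null = null;

  el.addEventListener("mouseenter", () => {
    hoverStart = performance.now();
  });

  el.addEventListener("mouseleave", () => {
    // Hover-without-click: considered, then rejected.
    if (hoverStart !== null) {
      emit({
        type: "hover",
        target: el.id,
        timestamp: performance.now(),
        payload: { durationMs: performance.now() - hoverStart, clicked: false },
      });
    }
    hoverStart = null;
  });

  el.addEventListener("click", (ev) => {
    emit({
      type: "click",
      target: el.id,
      timestamp: performance.now(),
      payload: {
        dwellBeforeClickMs: hoverStart !== null ? performance.now() - hoverStart : 0,
        modifiers: { shift: ev.shiftKey, ctrl: ev.ctrlKey, alt: ev.altKey },
      },
    });
    hoverStart = null;
  });
}
```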
Why each signal matters
Each signal maps to a pedagogical observation:
| Signal | What it tells us |
|---|---|
| Inter-key pauses > 2s | Cognitive load — composing a thought, not typing it |
| Dwell-before-click > 3s | Uncertainty — considering an option, not committing |
| Hover-then-no-click | Considered and rejected; useful for contrasting-case design |
| Partial input + window switch | Hit a roadblock that needs information from elsewhere |
| Repeated undo on same edit | Mental model not yet stable; correct → revert → correct loop |
| Paste origin = AI panel | User is offloading to AI rather than thinking |
| Time-since-last-action > 30s | Stuck. Threshold for considering intervention. |
| Repeated wrong-target click | Possibly a UI confusion, possibly conceptual confusion — the orchestrator must distinguish |
These signals are the ones Bastani et al. used to build their RL knowledge-state estimator — and the ones invisible to vision agents.
Storage and privacy
Events stream to the backend in batches (~5s windows) and persist as a per-user event log. PII is minimal — no real names, no real email content, no real-world account information; the simulated environment never connects to the real web. Event logs are retained for the duration of an active learning relationship plus a research-purpose retention window to be defined in the privacy policy.
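A minimal sketch of that batching behaviour, assuming a hypothetical `/telemetry/batch` endpoint and the ~5-second window; it reuses the `TelemetryEvent` shape from the sketch above.

```typescript
// Minimal event batcher: buffer locally, flush on a fixed window, re-queue on failure.
// Endpoint path and window length are assumptions for illustration.
class EventBatcher {
  private buffer: TelemetryEvent[] = [];

  constructor(private endpoint: string = "/telemetry/batch", private windowMs = 5000) {
    setInterval(() => void this.flush(), this.windowMs);
  }

  push(event: TelemetryEvent): void {
    this.buffer.push(event);
  }

  private async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await fetch(this.endpoint, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ events: batch }),
      });
    } catch {
      // On failure, re-queue so events are not silently dropped.
      this.buffer = batch.concat(this.buffer);
    }
  }
}
```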
What we don't capture
- No microphone, no camera, no screen recording. The instrumentation is event-based, not perceptual. (One of the indirect benefits of the simulated-sandbox choice: privacy is much cleaner than vision-based alternatives would be.)
- No keystroke logging in password fields or any real-credential field. The simulated environment doesn't have those, but the principle applies if v2 ever introduces them.
4. AI co-pilot integration
The co-pilot is where the pedagogy meets the LLM API. Its design is constrained by three findings:
- Bastani 2026: Engagement-mediated gains came from a chatbot prompted to refuse direct answers until students demonstrated substantial effort. Our co-pilot must do the same.
- McCarthy 2018 "metacognitive overload": Reflective prompts that work for confident learners hurt low-confidence learners when mis-timed. Our population is low-confidence by definition.
- Schwartz & Bransford 1998: Telling without readiness produces memorization, not transfer. Intervention timing is everything.
Intervention rules (v1)
The orchestrator decides whether to intervene based on signals from the telemetry layer. The default is silence — the co-pilot does not speak unless one of these triggers fires:
- Stuck trigger: time-since-last-productive-action > 45s and no forward progress on the current subgoal.
- Repeated-error trigger: ≥ 3 attempts at the same wrong subgoal action.
- Frustration trigger (provisional, calibration TBD from field research): sustained rapid wrong-clicks, or window-switching pattern that suggests panic search.
- Task-complete trigger: user successfully completes a task subgoal — fires the pattern-naming response.
- Task-end trigger: user finishes the full task — fires the metacognitive debrief.
When a trigger fires, the orchestrator decides what to do — see prompt-construction below.
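For illustration, the rules above could reduce to a single check over a summarised event window. The thresholds mirror the provisional values listed; all names and the frustration heuristic are assumptions pending field-research calibration.

```typescript
// Sketch of the per-window trigger check the orchestrator might run.
type TriggerId = "stuck" | "repeated_error" | "frustration" | "subgoal_complete" | "task_end";

interface WindowSummary {
  msSinceLastProductiveAction: number;
  attemptsAtCurrentSubgoal: number;
  wrongClicksLast10s: number;
  subgoalJustCompleted: boolean;
  taskJustCompleted: boolean;
}

function evaluateTriggers(w: WindowSummary): TriggerId | null {
  if (w.taskJustCompleted) return "task_end";
  if (w.subgoalJustCompleted) return "subgoal_complete";
  if (w.attemptsAtCurrentSubgoal >= 3) return "repeated_error";
  if (w.wrongClicksLast10s >= 5) return "frustration";       // calibration TBD from field research
  if (w.msSinceLastProductiveAction > 45_000) return "stuck";
  return null;                                               // default: silence
}
```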
When the user asks for help (clicks the help button), the co-pilot acknowledges but does not immediately provide an answer. It offers a graduated escalation: re-state the goal → ask what the user has tried → ask what they think the next step might be → offer a hint that names a relevant pattern → demonstrate (rare, last resort).
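A small sketch of that escalation ladder as data, assuming the orchestrator tracks the last step offered for the current help request; the step names are illustrative.

```typescript
// Graduated escalation for explicit help requests; the order follows the prose above.
const escalationLadder = [
  "restate_goal",
  "ask_what_tried",
  "ask_next_step_guess",
  "hint_with_pattern_name",
  "demonstrate",              // rare, last resort
] as const;

type EscalationStep = (typeof escalationLadder)[number];

function nextEscalation(current: EscalationStep | null): EscalationStep {
  if (current === null) return escalationLadder[0];
  const i = escalationLadder.indexOf(current);
  return escalationLadder[Math.min(i + 1, escalationLadder.length - 1)];
}
```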
Prompt construction
Every co-pilot response is built from three components, sent to the model with prompt caching enabled (the system prompt is the cacheable portion):
- System prompt (cached): the co-pilot's identity, voice, the refuse-until-effort rule, the pattern vocabulary, the list of forbidden moves (no "simply", no "just", no "obviously"), and the response-format schema.
- User context (mostly cached, refreshed per session): the user's known mastered patterns, current curriculum level, any relevant struggle history, and the current task description.
- Current event window (uncached): the last ~30 seconds of telemetry events plus the trigger that fired.
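A sketch of how those three components might map onto an Anthropic Messages API call with prompt caching via the TypeScript SDK. The placeholder strings and the model id are this document's assumptions; the exact call shape should be checked against the current SDK documentation.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Placeholders for the three prompt components described above (illustrative values).
const SYSTEM_PROMPT = "...";       // cached: co-pilot identity, rules, pattern vocabulary
const userContextBlock = "...";    // mostly cached: mastered patterns, level, task description
const eventWindowJson = "...";     // uncached: last ~30 s of telemetry plus the fired trigger

const client = new Anthropic();    // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-6",      // hypothetical model id used elsewhere in this doc
  max_tokens: 1024,
  system: [
    // Cache breakpoints: the stable system prompt and the slower-moving user context.
    { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    { type: "text", text: userContextBlock, cache_control: { type: "ephemeral" } },
  ],
  messages: [{ role: "user", content: eventWindowJson }],
});
```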
The model is asked to produce a response in a structured schema:
- intervention_type: one of {hint, pattern_name, debrief_question, encourage, demonstrate, silent}
- text: the user-facing copy
- pattern_referenced: which pattern from our taxonomy this invokes (for analytics)
- confidence: model self-rated confidence in the intervention's appropriateness
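The same schema expressed as a TypeScript type the side-panel client could validate against; illustrative only.

```typescript
// Structured co-pilot response schema (illustrative field types).
type CopilotResponse = {
  intervention_type: "hint" | "pattern_name" | "debrief_question" | "encourage" | "demonstrate" | "silent";
  text: string;                        // user-facing copy; empty when silent
  pattern_referenced: string | null;   // pattern id from our taxonomy, for analytics
  confidence: number;                  // model self-rated appropriateness, 0–1
};
```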
Model choice
- Claude Sonnet (Opus 4.7 generation, claude-sonnet-4-6) for the orchestrator's main calls. Quality matters most for pattern-naming and debrief construction — these are the high-leverage interventions. Cost is justified.
- Claude Haiku (claude-haiku-4-5-20251001) for high-frequency low-stakes calls — inline encouragement, simple acknowledgment, formatting-only operations. ~10x cheaper, fast enough for sub-second turnarounds.
- No model-switching at v1. Build with Sonnet only, then introduce Haiku for specific call types once we have telemetry on the volume of each intervention type.
Forbidden moves (the "what the co-pilot must never do" list)
Drawn from pedagogy.md and from what we expect the field research (fieldwork.md) to reinforce:
- Never give the answer when the trigger was "stuck" rather than "repeated-error" — that's telling without readiness.
- Never ask a metacognitive debrief question while the user is still mid-task.
- Never use condescending language ("simply", "just", "easy", "obviously").
- Never praise effort without specifying what was good.
- Never name a pattern the user hasn't actually used (no false-flattering pattern attribution).
- Never run more than one debrief question per task. McCarthy 2018's overload finding constrains frequency, not just timing.
5. State estimation for adaptive sequencing
Bastani et al.'s system uses a particle-filtering knowledge-state estimator with model-predictive control to select the next problem. That's a research-grade approach. For our v1, a simpler estimator is sufficient.
v1: LLM-as-judge over interaction history
- After each task, the orchestrator sends the task transcript and the user's pattern-naming attempt (from the debrief) to the model with a prompt asking: "Did the user demonstrate mastery of pattern X?" with a 0–1 confidence score.
- The judgment becomes a per-pattern, per-user mastery score that decays over time.
- Task selection: pick the next task whose required patterns include at least one the user has mastered (for confidence) and one not yet mastered (for productive struggle), in surface forms different from anything seen before (for cross-domain transfer).
- The contrasting-case insertion logic fires periodically: when two consecutive tasks looked similar but required different patterns, the orchestrator inserts a discrimination prompt.
This is a heuristic system, not an optimal one. Its main virtue is that it's debuggable — every selection decision can be traced to a specific judgment.
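A heuristic sketch of the mastery store and task-selection rule described above. The decay rate, mastery threshold, and field names are assumptions to be calibrated against pilot data, not decisions.

```typescript
// Per-pattern mastery record written by the LLM-as-judge call.
interface MasteryRecord {
  score: number;        // 0–1 judgment from the model
  updatedAt: number;    // epoch ms of the judgment
}

const DECAY_PER_DAY = 0.02;   // assumed linear decay; to be calibrated

function currentMastery(r: MasteryRecord, now: number): number {
  const days = (now - r.updatedAt) / 86_400_000;
  return Math.max(0, r.score - DECAY_PER_DAY * days);
}

interface Task {
  id: string;
  requiredPatterns: string[];
  surfaceForm: string;          // e.g. "email", "web form", "file browser"
}

// Pick a task with at least one mastered and one unmastered pattern,
// in a surface form the user has not seen before.
function selectNextTask(
  tasks: Task[],
  mastery: Map<string, MasteryRecord>,
  seenSurfaceForms: Set<string>,
  now: number,
): Task | undefined {
  const mastered = (p: string): boolean => {
    const r = mastery.get(p);
    return r !== undefined && currentMastery(r, now) >= 0.7;   // assumed threshold
  };
  return tasks.find(
    (t) =>
      !seenSurfaceForms.has(t.surfaceForm) &&
      t.requiredPatterns.some(mastered) &&
      t.requiredPatterns.some((p) => !mastered(p)),
  );
}
```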
v2: Bayesian Knowledge Tracing or Deep Knowledge Tracing
Once we have telemetry on enough users (~500–1,000 active), we will have enough data to fit a proper knowledge-tracing model. The classic options:
- BKT (Corbett & Anderson 1995): simple, well-understood, binary mastery state per pattern. Works with small data.
- DKT (Piech et al. 2015): RNN-based; better fit but requires more data and is harder to debug.
The choice between these is empirical and should be made when we have data — premature commitment is wasted effort.
Why not Bastani-style RL at v1
Particle-filtering + MPC is real engineering work and requires:
- A well-specified action space and reward function (we don't have these yet).
- Enough data to fit the state-transition model (we don't have this yet).
- Tolerance for opaque decision-making (which makes debugging the pedagogy harder).
We get there in v2 if and only if the heuristic system shows clear evidence of being a bottleneck. Most likely it will not be the bottleneck — content quality and intervention design will be.
6. Cost model
Honest answer: uncertain, but tractable. The dominant cost per active user is LLM inference; everything else (hosting, storage, telemetry) is small.
Token-cost estimate (v1, per active user-hour)
Assumptions:
- Average task: ~10 minutes of active engagement.
- Average ~3 co-pilot interventions per task (1 hint, 1 pattern-name, 1 debrief).
- Average intervention prompt: ~3,000 input tokens (with prompt caching, ~600 effective input tokens after the first call), ~300 output tokens.
- Per-call cost (Sonnet 4.6, with 90% cache hit rate): roughly $0.003 input + $0.005 output ≈ $0.008 per intervention.
- Per task: ~$0.024.
- Per user-hour (6 tasks): ~$0.15.
If we move high-frequency low-stakes calls to Haiku (10x cheaper for those), per-hour cost drops to **$0.05–0.10**.
For a 5-month learning relationship at 2 hours/week: 40 hours × $0.10 ≈ $4 per user. This is roughly an order of magnitude lower than human-tutored alternatives and a factor of 5–10 cheaper than the OpenAI Operator coaching scenario the agent-reliability research priced out ($1.25–10 per session).
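The arithmetic behind those figures, spelled out; every number is one of the estimates above, not a measurement.

```typescript
// Back-of-envelope cost model from the assumptions stated above.
const costPerIntervention = 0.003 + 0.005;           // Sonnet, ~90% cache hit: input + output ≈ $0.008
const costPerTask = costPerIntervention * 3;         // 3 interventions per task ≈ $0.024
const costPerHourSonnetOnly = costPerTask * 6;       // 6 tasks per hour ≈ $0.14
const costPerLearnerMixed = 0.10 * 40;               // 40 hours at the Haiku-mixed ~$0.10/h rate ≈ $4

console.log({ costPerIntervention, costPerTask, costPerHourSonnetOnly, costPerLearnerMixed });
```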
Where uncertainty lives
- Cache hit rate — the system prompt is highly cacheable, but if we redesign frequently in early development, we burn cache. Stable post-v1.
- Intervention frequency calibration — if field research reveals our co-pilot needs to intervene more often than estimated, costs scale linearly.
- Pricing changes — Anthropic and others have steadily reduced prices over time. The 12-month forecast is for cost to drop, not rise.
Where caching matters most
The system prompt (co-pilot identity, voice, rules, pattern taxonomy) is the largest cacheable block. Stabilizing it is high-priority engineering work — every change to the system prompt invalidates the cache for active users.
7. Build phases
v0: proof of concept (4–6 weeks, single engineer)
Goal: validate that we can instrument a sandbox, route telemetry to an LLM-driven co-pilot, and produce a coherent pattern-naming intervention. Not a learning product.
- Single app: the document editor (TipTap) plus a form (React Hook Form).
- One curriculum task: "extract information from a source document, fill in a form, save the result."
- Telemetry layer for those two surfaces only.
- Co-pilot with the trigger system from §4 and Sonnet calls.
- No assessment, no curriculum sequencing.
Success criterion: a developer (not a target user) can run the task, get plausible co-pilot interventions at appropriate moments, and the telemetry log captures the events that justified those interventions.
v1: MVP (4–6 months, small team)
Goal: a deployable learning product for a constrained pilot cohort.
- daedalOS-derived shell with all five apps.
- Curriculum Levels 1–2 from the spec, ~10–15 tasks total.
- Full telemetry layer per §3.
- Co-pilot per §4 with Sonnet + Haiku model split.
- LLM-as-judge knowledge-state estimator per §5.
- Near-transfer + far-transfer assessment instruments.
- Per-user analytics dashboard for the team.
- 1k–10k user capacity (depending on intervention frequency).
Success criterion: defined by the partner-pilot deal terms (TBD per fieldwork.md Phase 3 outcomes).
v2: adaptive (deferred — 6+ months post-v1)
Goal: replace heuristic state estimation with a learned model; expand curriculum.
- BKT or DKT model fit on v1 data.
- Curriculum Levels 3–5 from the spec.
- Possible: cohort/peer-learning architecture if field research surfaces this as load-bearing.
- Possible: offline mode for library distribution.
- Possible: a vision-agent integration as a complement to the simulated environment, for the specific case of "transfer test on a real website" — not for primary learning.
8. Open technical questions
These are the questions we don't yet have a confident answer to. They are deferred, not ignored.
8.1 Sandbox-to-real transfer measurement
The pedagogy commits to far-transfer assessment in novel contexts. The question is: do those novel contexts have to include real applications (real Gmail, real Google Docs), or can they be sufficiently varied within the simulated environment to constitute genuine transfer?
The aviation/medical-simulator literature suggests functional fidelity > physical fidelity — a flight simulator that captures the right decision points produces better real-world transfer than a high-fidelity replica that doesn't. Plausibly the same holds for our domain. But it is empirical and we have not tested it. The v2 question is whether to add a real-app transfer instrument (perhaps via vision agent at that stage), or whether varied surface forms within the sim are sufficient.
8.2 Cold-start state estimation
A new user has no telemetry history. The LLM-as-judge estimator can only judge after at least one task. How do we pick the first task? Likely: a brief diagnostic micro-task (~3 minutes) with predetermined difficulty calibration that produces a coarse initial mastery vector. The diagnostic itself is a content design problem, not a technical one.
8.3 Multilingual scope
Our v1 target is English-language users. The Urban Institute brief flags non-English speakers as a population providers struggle to serve. Adding multilingual support to the simulated apps is moderate engineering effort; adding it to the co-pilot is essentially free at the LLM layer (Claude's multilingual performance is strong). The hard part is the curriculum — task design that works across languages and reading levels. Defer to v2 unless field research surfaces it as a v1 distribution requirement.
8.4 The "third-level digital divide" structural problem
Per research/summaries/adult-ct-and-digital-skills-transfer.md: training transfer is diluted when learners can't practice at home (no internet, no device). Our product cannot solve this from inside the browser. The v2 question is whether to partner with hardware-and-internet distribution programs (KEYSPOT, hotspot loans through libraries), or to optimize for the case where the user does have access. v1 assumes access; v2 may need to address it.
8.5 Engagement architecture
Pure individual? Cohort-based? Hybrid? The Bastani RCT was effectively individual-with-instructor-context (the Python course had teachers). Whether sustained adult engagement requires a peer or instructor layer is the field-research question (fieldwork.md Q3). Architecture differs significantly between the options. Defer the choice until field research reports.
How this connects back
- pedagogy.md dictates what the co-pilot must do, what signals must be visible, and what the assessment must distinguish. Every architectural commitment in this doc is downstream of a pedagogical commitment in that one.
- product-spec.md specifies the curriculum content and the user-facing surfaces. This doc specifies how we instrument those surfaces and how the AI layer reads them.
- fieldwork.md is the source of truth for the open questions in §8 and for the calibration of the co-pilot intervention rules in §4.
- pitch-and-overview.md summarizes the headline claims that this doc justifies in detail (the simulated-vs-vision argument, the cost model, the build feasibility).
This doc will need substantial revision after Phase 2 of the field-research program. The intervention timing rules in §4, in particular, are educated guesses — they should be tightened against observed instructor behavior before any code commits to them.