Objective
Enable low- to mid-skill digital users to:
- Complete real-world digital tasks
- Operate across applications
- Build transferable schemas (not procedural memorization) for digital problem-solving
- Use AI as a directable tool, not a passive answer source
Target: move users from PIAAC Level 1 (single-step tasks in a generic interface) to Level 2+ (multi-step problem-solving with inferential reasoning), and demonstrate far transfer: success on novel-context tasks that share structure but not surface form with training tasks.
The product surfaces, AI co-pilot behavior, and assessment metrics in this spec are direct consequences of the pedagogical commitments in pedagogy.md. The architectural choices are detailed in technical-approach.md.
Target user
Can:
- Use a smartphone confidently (browser, messaging, simple apps)
- Send a basic email
- Navigate a single web interface
Cannot reliably:
- Complete tasks that span multiple applications
- Move information between apps (e.g., copy from a web page into a form)
- Manage persistent state (files, downloads, draft documents)
- Recover from errors without abandoning the task
- Direct an AI to help, then evaluate whether its output is correct
Mental-model gaps (the deeper constraint)
Per the Urban Institute's 2019 finding[^1] and the conceptual-change literature[^2], the gating issue for adult digital fluency is not UI familiarity but mental models. Smartphone-fluent users lack:
- A model of the file system (files exist persistently in named locations, not encapsulated inside apps)
- A model of multi-window state (multiple things can be open and operated on in parallel)
- A model of hierarchical organization (folders within folders, threads within mailboxes)
- A model of versioning and history (current state was reached by a series of edits that can be undone, branched, or compared)
The curriculum must construct these. Procedural training layered over these gaps produces sandbox-procedural users who fail in the wild. See pedagogy.md §3.
Core product concept
A task-based learning system inside an instrumented simulated desktop, with an embedded AI co-pilot that observes the user's work and intervenes pedagogically.
The simulated environment is a deliberate architectural choice driven by latency, telemetry richness, and reliability constraints — see technical-approach.md §1.
System overview
1. Simulated desktop (in-browser)
A simplified OS-like environment with:
- Browser (sandboxed iframe with overlay; we control all "websites" the user can navigate to)
- Email client (custom mock — no real send/receive; messages are scripted by the task engine)
- Document editor (TipTap-based)
- Form interface (React Hook Form-based)
- File browser (over a virtual filesystem, ZenFS)
Constraints:
- Minimal UI — controlled complexity, only the affordances the curriculum needs
- Realistic enough that schemas transfer (functional fidelity, per the simulator-fidelity literature)
- Fully instrumented at the event level (see technical-approach.md §3)
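The "only the affordances the curriculum needs" constraint implies per-task gating of what is interactable. A minimal sketch of what that gating could look like; app and action identifiers here are illustrative placeholders, not the actual registry in technical-approach.md.

```typescript
// Illustrative sketch: per-task affordance gating for the simulated desktop.
// App and action identifiers are placeholders, not the real registry.
type AppId = 'browser' | 'email' | 'editor' | 'form' | 'files';

interface AffordanceGrant {
  app: AppId;
  actions: string[]; // e.g. 'open', 'attach', 'saveDraft', whatever the task needs
}

interface TaskSurfaceConfig {
  taskId: string;
  enabledApps: AppId[];           // apps visible on the simulated desktop for this task
  affordances: AffordanceGrant[]; // controls that are interactable within those apps
}

// The shell consults this before rendering a control as interactable.
function isInteractable(config: TaskSurfaceConfig, app: AppId, action: string): boolean {
  return config.affordances.some((g) => g.app === app && g.actions.includes(action));
}
```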
2. Task engine
Each task:
- Multi-step
- Specifies a primary pattern and 0–2 secondary patterns being practiced
- Specifies an underlying schema being constructed (which mental model the task targets)
- Cross-application when appropriate to the level
- Drawn from real-world scenarios
Examples (these are placeholder content — final task content is shaped by fieldwork.md Phase 1):
- Apply for a service: download a benefit form, fill it out, attach proof, submit.
- Research + summarize + send: read three short articles, extract three claims, draft an email summarizing them.
- Verify + complete: receive an email asking for confirmation, find the original document referenced, verify a fact, reply.
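A hedged sketch of the task record implied by the list above. Field names and the level/surface enumerations are assumptions for illustration; the pattern identifiers stand in for the controlled vocabulary defined in curriculum.md.

```typescript
// Illustrative task record; not the task engine's actual schema.
type PatternId = string; // drawn from the controlled vocabulary (~12–20 patterns) in curriculum.md
type MentalModel = 'file-system' | 'multi-window-state' | 'versioning-history' | 'ai-as-directable-system';
type Surface = 'browser' | 'email' | 'editor' | 'form' | 'files';

interface TaskSpec {
  id: string;
  level: 1 | 2 | 3 | 4 | 5;
  primaryPattern: PatternId;
  secondaryPatterns: PatternId[]; // 0–2 entries
  targetSchema: MentalModel;      // which mental model the task constructs
  surfaces: Surface[];            // cross-application when appropriate to the level
  scenario: string;               // the real-world framing shown to the user
  heldOutForAssessment: boolean;  // reserved for far-transfer assessment (TDP-5), never used in training
}
```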
3. AI co-pilot
Persistent side panel. Capabilities:
- Context-aware hints (graduated escalation, not direct answers)
- Pattern naming after task subgoals
- Metacognitive debrief at task end
- Demonstration on explicit user request, after attempt
- Error explanation when the user gets stuck after multiple attempts
Rules (see pedagogy.md §4 and technical-approach.md §4):
- Default behavior is silence. The co-pilot speaks only when a defined trigger fires.
- Attempt before assist (productive struggle).
- Pattern naming uses the curriculum's controlled vocabulary; the same pattern is named the same way every time it appears.
- Forbidden moves: condescending language ("simply", "just"), non-specific praise, false-flattering pattern attribution, more than one debrief question per task.
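One way the silence-by-default, trigger-driven behavior could be encoded. Trigger names, the attempt threshold, and the lexical guard below are illustrative assumptions, not the orchestrator described in technical-approach.md §4.

```typescript
// Illustrative orchestrator gate: the co-pilot emits nothing unless a defined trigger fires.
type Trigger =
  | { kind: 'stuck'; failedAttempts: number }     // error-explanation path
  | { kind: 'subgoal-complete'; pattern: string } // pattern naming (TDP-2)
  | { kind: 'task-end' }                          // metacognitive debrief (TDP-3)
  | { kind: 'demo-requested'; userHasAttempted: boolean };

function shouldIntervene(trigger: Trigger | null): boolean {
  if (!trigger) return false; // default behavior is silence
  switch (trigger.kind) {
    case 'stuck':
      return trigger.failedAttempts >= 2; // attempt before assist: productive struggle first
    case 'demo-requested':
      return trigger.userHasAttempted;    // demonstration only after an attempt
    case 'subgoal-complete':
    case 'task-end':
      return true;
  }
}

// Cheap lexical guard for two of the forbidden moves ("simply", "just"); the full list is
// broader and would be enforced upstream in prompting and review, not only by string matching.
const FORBIDDEN = /\b(simply|just)\b/i;
const violatesForbiddenMoves = (draft: string) => FORBIDDEN.test(draft);
```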
4. Assessment engine
Tracks two distinct categories of metric:
Near-transfer (procedural fluency):
- Completion rate on training-like tasks
- Time-to-completion improvement
- Error frequency
- AI usage patterns (productive vs. answer-seeking, scored by the LLM-as-judge chat-quality evaluator)
Far-transfer (schema application), the headline metric:
- % of users who successfully complete a task they've never seen before, in a context they've never trained in, using a pattern from earlier in the curriculum
- Pattern-naming recall (do users name the patterns the co-pilot named for them, weeks later, without prompting?)
- Self-efficacy on novel tasks
Outputs:
- Per-pattern mastery score (decays over time)
- Per-mental-model mastery score
- Progression readiness signal (next-task selection driver)
The near/far distinction is non-negotiable. A system that only measures near-transfer will produce sandbox-procedural users — see pedagogy.md §6 (falsification).
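The per-pattern mastery score above decays over time. A minimal sketch of what that update could look like; the half-life and learning rate are placeholder assumptions, not the Assessment Engine's actual model.

```typescript
// Illustrative decaying mastery score. Constants (half-life, learning rate) are placeholders.
interface MasteryState {
  score: number;     // 0..1
  updatedAt: number; // ms epoch of last observation
}

const HALF_LIFE_DAYS = 21; // assumed: without practice, mastery halves in roughly three weeks
const LEARNING_RATE = 0.3; // assumed: how far a single observation moves the score

function decayed(state: MasteryState, now: number): number {
  const days = (now - state.updatedAt) / 86_400_000;
  return state.score * Math.pow(0.5, days / HALF_LIFE_DAYS);
}

// Fold in one task outcome (1 = pattern applied successfully, 0 = not) after applying decay.
function update(state: MasteryState, outcome: 0 | 1, now: number): MasteryState {
  const current = decayed(state, now);
  return { score: current + LEARNING_RATE * (outcome - current), updatedAt: now };
}
```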
Transfer Design Principles
The five evidence-supported moves from pedagogy.md, translated into operational product constraints:
TDP-1: Cross-domain task families
Each skill is practiced in at least three different surface forms before being marked "taught." Not three email tasks — one email task, one form task, one document task that share the same underlying pattern. The variation across surfaces produces low-road transfer; the consistency of pattern builds the schema.
Implementation: Curriculum-engine constraint. The task selection algorithm cannot mark a pattern "mastered" on a single surface form, regardless of completion success.
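The constraint can be expressed as a simple gate in the curriculum engine; the data shape in this sketch is assumed.

```typescript
// Illustrative TDP-1 gate: a pattern cannot be marked "taught" until it has been
// completed successfully on at least three distinct surface forms.
interface CompletionRecord {
  pattern: string;
  surface: 'browser' | 'email' | 'editor' | 'form' | 'files';
  success: boolean;
}

function patternTaught(history: CompletionRecord[], pattern: string): boolean {
  const surfaces = new Set(
    history.filter((r) => r.pattern === pattern && r.success).map((r) => r.surface),
  );
  return surfaces.size >= 3;
}
```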
TDP-2: Explicit pattern naming
On task subgoal completion, the co-pilot names the pattern the user just used, in the curriculum's controlled vocabulary, with one example of where the pattern recurs.
Implementation: A defined trigger in the co-pilot orchestrator (task-complete trigger). The pattern vocabulary is a small, stable set (~12–20 patterns) defined as part of curriculum design, not invented per-response.
TDP-3: Metacognitive debrief
At task end, the co-pilot asks one question: "What pattern did you use? Where else might it apply?" The user articulates the schema in their own words, the co-pilot confirms or refines.
Implementation: A defined trigger in the co-pilot orchestrator (task-end trigger). Frequency limit: one debrief per task. Skipped if the user shows signs of metacognitive overload (per McCarthy 2018 — see pedagogy.md §4).
TDP-4: Contrasting cases
Periodically (~every 4–6 tasks), the system inserts a near-neighbor task that looks similar to the previous one but requires a different pattern. The co-pilot prompts: "Is this the same pattern as last time, or different? How can you tell?"
Implementation: Task-selection logic. The contrasting-case trigger fires when the user has just completed a task with primary pattern X and a near-neighbor task with primary pattern Y exists.
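A sketch of that selection condition; the field names and the cadence check are illustrative assumptions.

```typescript
// Illustrative contrasting-case selection (TDP-4). Names and the 4–6-task cadence are assumptions.
interface TaskSummary {
  id: string;
  primaryPattern: string;
  nearNeighborOf?: string; // id of the task this one superficially resembles
}

function pickContrastingCase(
  justCompleted: TaskSummary,
  pool: TaskSummary[],
  tasksSinceLastContrast: number,
): TaskSummary | null {
  if (tasksSinceLastContrast < 4) return null; // roughly every 4–6 tasks
  return (
    pool.find(
      (t) => t.nearNeighborOf === justCompleted.id && t.primaryPattern !== justCompleted.primaryPattern,
    ) ?? null
  );
}
```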
TDP-5: Far-transfer assessment
Assessment cycles include tasks the user has never seen, in surface forms they've never trained in, that require previously-taught patterns.
Implementation: The Assessment Engine reserves a held-out pool of "novel surface form" tasks per pattern. These are never used as training tasks for any user.
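A sketch of how candidates could be drawn from that held-out pool: novel surface form for this user, previously taught pattern, never used in training. Field names are assumptions.

```typescript
// Illustrative far-transfer selection (TDP-5).
interface AssessmentTask {
  id: string;
  primaryPattern: string;
  surface: string;
  heldOutForAssessment: boolean;
}

function farTransferCandidates(
  pool: AssessmentTask[],
  taughtPatterns: Set<string>,
  trainedSurfaces: Set<string>,
): AssessmentTask[] {
  return pool.filter(
    (t) =>
      t.heldOutForAssessment &&
      taughtPatterns.has(t.primaryPattern) &&
      !trainedSurfaces.has(t.surface),
  );
}
```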
Curriculum structure
Five levels. Each is defined by the kind of mental-model construction it requires, not by procedural difficulty alone. Constraint per TDP-1: each skill must appear in ≥3 distinct surface forms before being considered taught.
| Level | Goal | Mission alignment |
|---|---|---|
| L1 — Operational basics | Construct the desktop mental model: file persistence, multi-window state, basic hierarchy. | Bridge level. Probably not where our distinctive contribution lives — public libraries already deliver this for free, with in-person support, at scale. v1 likely treats this as a fast-track diagnostic or partners with library Level-1 onboarding. |
| L2 — Transactional tasks | Sequencing and verification. Multi-step processes within one or two apps; recovery from errors. | Bridge level. Some library systems handle this; many don't. Our value-add starts here. |
| L3 — Workflow execution | Operate across applications. Compose Level 1–2 patterns into longer workflows that span apps and sustain state. | Where our core mission begins. The library system does not scaffold this depth. The Urban Institute (2019) names this as the gap providers can't fill. |
| L4 — AI-augmented workflows | Direct AI as a tool inside a workflow. Evaluate AI output. Refine. | Novel. No established curriculum exists at this level for adult digital-fluency populations. |
| L5 — Adaptation layer | Encounter an unfamiliar tool, recognize patterns, figure it out. The closest the platform comes to certifying "I am now digitally fluent." | Most aligned with the pitch's thesis. Tools change; AI agents proliferate; the only durable skill is the ability to meet new ones. |
Full curriculum content — task families, patterns, mental-model tracks, the controlled pattern vocabulary the AI co-pilot uses — lives in curriculum.md. That document also flags the strategic question of where v1 should target on this scale (see Implications below).
Cross-cutting: mental-model construction tracks
Independent of the procedural-skill progression, the curriculum tracks four mental-model constructions:
- File system (hierarchical, persistent, named locations) — primarily Level 1
- Multi-window state (parallel things, focus, switching) — Levels 1–3
- Versioning and history (undo, drafts, edits-over-time) — Levels 2–4
- AI as directable system (delegation, evaluation, refinement) — Levels 4–5
The Assessment Engine tracks each as a separate mastery dimension.
UX principles
- One task at a time. No competing surfaces, no notifications.
- Minimal interface. Only the affordances the current task requires are interactable.
- Immediate feedback. Telemetry-driven; the interface reacts in under 100 ms even if the AI co-pilot's response takes longer (see the sketch after this list).
- AI always available, never intrusive. The co-pilot side panel is persistent but speaks only on defined triggers (per TDP).
- Errors are tasks, not failures. A wrong attempt becomes a contrasting-case opportunity, not a marked failure.
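A sketch of how local feedback stays inside the 100 ms budget while co-pilot evaluation runs off the interaction path. The function names (recordEvent, renderLocalFeedback, requestCopilot, showCopilotPanelMessage) are hypothetical, not the event pipeline in technical-approach.md §3.

```typescript
// Illustrative decoupling of local feedback from co-pilot latency.
interface UiEvent { type: string; target: string; at: number }

function onUserAction(event: UiEvent) {
  recordEvent(event);         // telemetry write: synchronous and cheap
  renderLocalFeedback(event); // deterministic UI reaction, well under 100 ms

  // Co-pilot evaluation happens asynchronously; the interface never waits on it.
  void requestCopilot(event).then((msg) => {
    if (msg) showCopilotPanelMessage(msg);
  });
}

// Hypothetical helpers, declared here only to keep the sketch self-contained.
declare function recordEvent(e: UiEvent): void;
declare function renderLocalFeedback(e: UiEvent): void;
declare function requestCopilot(e: UiEvent): Promise<string | null>;
declare function showCopilotPanelMessage(msg: string): void;
```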
MVP scope
Open question: where does v1 target on the curriculum?
The previous draft of this spec scoped v1 to "Curriculum Levels 1–2 (~10–15 tasks total)." Drafting curriculum.md surfaced a problem with that framing: Level 1–2 work is largely already served by the public library system (free, in-person, with thirty years of operational experience). A v1 that targets the same competencies competes with established infrastructure and undercuts the pitch's "transferable schemas in the age of AI" thesis, because the novel content lives at Levels 3+.
Two alternatives we are considering for v1 — to be decided jointly with Phase 2 of fieldwork.md:
- Levels 2–3, Level 1 as a fast-track diagnostic. Users with smartphone fluency (most of the target population) skip Level 1 in 10–20 minutes; users without it get routed to library/Northstar acquisition before re-entering. Targets the real gap libraries don't fill.
- Levels 3–4 only, library/ABE partnership for Levels 1–2. Sharper. Concedes the basics to established infrastructure and concentrates engineering on the novel-contribution levels. Distribution requires a partner — fieldwork.md is set up to identify one.
v1 features (regardless of curriculum scope):
- Simulated desktop with five core apps
- Full telemetry layer per technical-approach.md §3
- AI co-pilot with the trigger system, model split (Sonnet + Haiku), and forbidden-moves list
- LLM-as-judge knowledge-state estimator
- Near-transfer + far-transfer assessment instruments
- Per-user analytics dashboard for the team
- ~12–20 tasks total (curriculum scope determines which levels they cover)
Not in v1:
- Whichever curriculum levels aren't selected as the v1 target
- Adaptive sequencing via BKT/DKT (heuristic in v1)
- Cohort or peer-learning architecture (deferred until informed by field research)
- Offline mode
- Mobile
- Multilingual support
Detailed build phases in technical-approach.md §7.
Key metrics
Primary (headline):
- Far-transfer rate: % of users completing a novel-context task using a previously-taught pattern.
Secondary:
- Pattern-naming recall (delayed)
- Procedural completion rate
- Time-to-completion improvement
- Sustained engagement (per Bastani: time-on-task and persistence)
- Self-efficacy change
Diagnostic (for our team, not the user):
- Per-intervention type frequency (are we triggering pattern-naming and debriefs at the right rates?)
- Co-pilot chat-quality score (LLM-as-judge on whether interventions were productive)
- Mental-model mastery scores per dimension
Risks and mitigations
Pedagogical risks
Sandbox-procedural failure — users become competent in our environment but cannot transfer to real apps.
→ Mitigated by TDP-1 (≥3 surface forms per skill), TDP-5 (far-transfer assessment), and the falsification metrics in pedagogy.md §6.
AI dependency — users learn to ask the co-pilot rather than to think. → Mitigated by attempt-before-assist (productive struggle), graduated escalation in help, and tracking AI-usage patterns as a chat-quality metric, not just volume.
Metacognitive overload — debrief prompts demoralize low-confidence users.
→ Mitigated by McCarthy-2018-aware design (one debrief per task max, scaffolded modeling first, skip if affective signals indicate distress). See pedagogy.md §5.
Engagement risks
Novelty effect decay — Bastani's mediator wanes after the first sessions. → Mitigated by the engagement-architecture decision (deferred; see open questions). Field research will likely surface a cohort/community-layer requirement.
Distress vs. productive struggle confusion — the co-pilot intervenes too late or too early.
→ Mitigated by field research with experienced instructors (fieldwork.md Q2). The v1 trigger thresholds are educated guesses; calibration comes from observation.
Distribution and access risks
The PIAAC Level-1 population is the least likely to find a D2C learning product.
→ Distribution channel is unspecified. See pitch-and-overview.md "Distribution channel" TODO.
Third-level digital divide — users without home internet or a device cannot practice between sessions; OECD data shows skills decline without sustained practice. → Out of scope for v1 product; partially addressed by partner-distribution choice (libraries provide access).
Competitive risks
Khan / Google / Microsoft can ship a competitor in a quarter if this works. → Mitigated by (a) the population specificity (incumbents have not targeted PIAAC Level-1 adults), (b) the pedagogy specificity (transfer-targeted design is a non-trivial intellectual commitment), (c) the open-source platform layer strategy (long-term defensibility comes from being the standard, not from the moat).
Technical notes
Architecture, telemetry layer, AI integration, model choices, state estimation, cost model, and build phases: technical-approach.md.
Headline points:
- Browser-only (no app install). React-based.
- Backend handles task engine, tutor orchestration, knowledge state.
- Anthropic Claude (Sonnet for orchestration, Haiku for high-frequency).
- Per-active-user-hour cost: ~$0.05–0.15 with prompt caching (rough arithmetic sketch after this list).
- v1 buildable in 4–6 months with a small team; ~50–60% assembled from MIT-licensed open-source (daedalOS, TipTap, ZenFS, React Hook Form).
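A back-of-envelope sanity check on the per-active-user-hour figure. Every constant below (call volumes per hour and per-call costs) is an illustrative placeholder, not Anthropic pricing and not the cost model in technical-approach.md.

```typescript
// Back-of-envelope cost sketch. All constants are assumed placeholders.
const ORCHESTRATOR_CALLS_PER_HOUR = 12;   // assumed: debriefs, pattern naming, escalated hints (Sonnet)
const HIGH_FREQ_CALLS_PER_HOUR = 60;      // assumed: lightweight classification on telemetry (Haiku)
const ORCHESTRATOR_COST_PER_CALL = 0.008; // assumed USD per call, with prompt caching on shared context
const HIGH_FREQ_COST_PER_CALL = 0.0005;   // assumed USD per call

const costPerActiveUserHour =
  ORCHESTRATOR_CALLS_PER_HOUR * ORCHESTRATOR_COST_PER_CALL +
  HIGH_FREQ_CALLS_PER_HOUR * HIGH_FREQ_COST_PER_CALL;

console.log(costPerActiveUserHour.toFixed(3)); // ≈ 0.126, inside the ~$0.05–0.15 range above
```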
Distribution and pilot plan
TODO (Matt): Distribution channel commitment + first-cohort plan.
Field-research program (fieldwork.md) is designed to produce 2–3 named candidate partner organizations as a deliverable. Defer commitment until Phase 2 of that program completes. Likely shape (subject to revision):
- Partner with one library system or ABE provider for the first 100–500-user pilot.
- Co-design content for a cohort with the partner organization (their staff knows their learners; co-design has compounding pedagogical value).
- Partner organization runs in-person component (per Urban Institute: "humanizing" matters); product runs the AI-coached interactive component.
- Pilot success criterion: defined jointly with partner against far-transfer rate and engagement-retention thresholds.
Long-term expansion
- Adaptive difficulty engine (BKT/DKT) replacing v1 heuristic
- Levels 3–5 of the curriculum
- Larger task library
- Benchmarking system (the dataset becomes the largest empirical record of adult digital-fluency transfer)
- Enterprise / employer deployment
- Certification layer (third-party-recognized credential for completion)
- Open-source platform release (daedalOS-derived shell + telemetry layer + co-pilot integration; other adult-education orgs adopt and extend)
End state
A system that:
- Teaches transferable digital competence (not procedural memorization)
- Measures it with both near and far transfer instruments
- Improves continuously as the dataset grows
- Becomes the standard layer between low-fluency adults and AI-augmented digital systems
- Generates the empirical evidence the field does not have
Footnotes
[^1]: Hecker & Loprest, Foundational Digital Skills for Career Progress, Urban Institute, 2019. Notes: research/grey-literature/urban-institute-2019-foundational-digital-skills.md.
[^2]: Synthesis of conceptual-change literature (Chi 2008 framework, NN/G mental-model studies). See research/summaries/adult-ct-and-digital-skills-transfer.md and pedagogy.md §3.