Designing for Transfer in Adult Digital Fluency

Purpose of this document

This document explains why the Digital Fluency platform is designed the way it is. The product surfaces, AI co-pilot behavior, curriculum sequencing, and assessment metrics described in the spec are not arbitrary choices — each is a response to a specific finding in the cognitive-science and intelligent-tutoring literature on what produces transferable skill, as opposed to procedural memorization.

Our central claim: arming adult users with computational thinking capacity is a problem of transfer. Decades of research show transfer rarely happens by accident, and that specific instructional moves materially raise the probability of it happening. This document names those moves, the evidence behind them, the gaps in that evidence, and the specific design constraints they impose on our product.

If a reviewer wants to know whether the pedagogy is serious, this is the document to read.


The transfer problem

Wing's 2006 framing of computational thinking (CT) as "a fundamental skill for everyone, not just for computer scientists" is the conceptual ambition behind our project.1 But Wing's manifesto is silent on the question that has dominated the subsequent twenty years of CT-education research: does it transfer?

The historical record is sobering. Pea & Kurland's 1984 evaluation of LOGO — a programming language designed expressly to cultivate generalizable thinking — found no evidence that children who learned LOGO transferred those skills to planning tasks in other domains.2 Forty years later, the field's own assessment is that "computational thinking transfer" remains a notorious problem. The K-12 meta-analyses (Ye 2022, k = 55 studies; CT-STEM 2024, k = 37 studies, n = 7,832 students) do find significant transfer effects on average — but these studies are overwhelmingly K-12, and the reviewers explicitly note that adults "received relatively little attention."3

The pattern across all "learn X to think better generally" claims, going back to Thorndike's 1901 demolition of the "mental discipline" theory of Latin instruction, is the same: skills do not spontaneously generalize from the contexts in which they were acquired. The Urban Institute's 2019 brief makes this concrete for our specific population: "fluid use of a smartphone does not always translate to broader digital skills. Some young people who were experts with smartphones were not able to easily transfer their knowledge into a work setting where they needed to use computers and office software and tasks."4

For our project, this means: a curriculum that teaches procedures inside a sandbox will produce sandbox-competent users. Procedural fluency in our environment is necessary but not sufficient. The design must explicitly target transfer, or the project's central claim collapses.


What transfers and what doesn't

Salomon & Perkins (1989) offer the conceptual map we use throughout the rest of this document.2 They argue that transfer is not a unitary phenomenon — it occurs by two distinct mechanisms with different conditions:

  1. Low-road transfer: a skill practiced to near-automaticity across many varied contexts fires spontaneously when a new situation resembles the old ones.

  2. High-road transfer: the learner deliberately abstracts a principle from one context and mindfully applies it in another, reaching forward from the learning context or backward from the new one.

These mechanisms call for different instructional moves. Low-road transfer comes from varied practice. High-road transfer comes from making abstractions explicit. A curriculum that does only one will get only one kind of transfer.

The cognitive-science literature converges on a sharper claim: what transfers is abstract schema — the structure of a problem and the type of reasoning it calls for — rather than production rules — the specific keystrokes and clicks. Schwartz & Bransford (1998) make the point with a simple example: a novice and an expert both "understand" the sentence "the dressmaker used the scissors to cut the cloth," but the expert's representation is differentiated (dressmaker's shears differ from barber's shears differ from nail scissors in specific structural ways tied to function). Direct instruction about scissors design lands differently on those two readers.5 For our population, the parallel is exact: a user who has built only an "icons-on-a-phone-screen" mental model is a novice for any task that requires understanding files, persistence, or hierarchy — even when the surface UI of those tasks looks familiar.

This is the deepest design constraint we face. Mental models, not UI patterns, are the transfer-gating variable. UI conventions are largely standardized (Jakob's Law), and surface-level click sequences transfer reasonably well. What does not transfer is the underlying conceptual structure. Smartphone-native users who have never built a mental model of a hierarchical file system cannot productively use one even when the icons are familiar. Our curriculum must explicitly construct those mental models — file as referenced object with persistent location, document as edited state with version history, application as long-running process with multi-window state — or our users will become sandbox-procedural and remain real-world-blocked.
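
As a concrete illustration, the curriculum engine could carry these target models as an explicit concept inventory. A minimal sketch in Python follows; the structure and field names are assumptions for illustration, and only the concepts themselves come from the constraint above.

    # Hypothetical concept inventory for the mental models named above.
    # The dict structure and field names are illustrative, not spec.
    TARGET_MENTAL_MODELS = {
        "file": {
            "core_idea": "a referenced object with a persistent location",
            "misconception": "content exists only inside the app that displays it",
        },
        "document": {
            "core_idea": "an edited state with a version history",
            "misconception": "documents are ephemeral, with no place they are stored",
        },
        "application": {
            "core_idea": "a long-running process that can hold multi-window state",
            "misconception": "each screen is an independent, stateless surface",
        },
    }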


The five design moves

Five instructional moves with strong empirical support, each tied to a specific product feature:

1. Cross-domain task families (low-road transfer mechanism)

Each skill must be practiced in at least three different surface forms before it is considered "taught." Three email tasks won't do; the variation must cross domains: one email task, one form task, one document task that share an underlying pattern (e.g., "extract structured information from an unstructured source"). The variation across surface forms is what produces low-road transfer; the consistency of underlying pattern is what builds the schema.

Empirical anchor: Salomon & Perkins describe how, with varied practice, "the cognitive element ... adapts in subtle ways to each of these contexts, yielding an incrementally broadening ability that gradually becomes more and more detached from its original context."2 HPL II Chapter 5 confirms variability of practice as one of five evidence-supported strategies for durable, flexible knowledge.6

Product implementation: Curriculum constraint at the spec level. The task engine cannot mark a skill "complete" on a single surface form.
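
A minimal sketch of how the task engine could enforce this constraint, assuming a per-skill record of attempts; the names (TaskAttempt, is_skill_complete, MIN_DOMAINS) are illustrative, not spec.

    # Sketch of the "no completion on a single surface form" rule.
    from dataclasses import dataclass

    MIN_DOMAINS = 3  # e.g. one email task, one form task, one document task

    @dataclass(frozen=True)
    class TaskAttempt:
        skill_id: str   # underlying pattern, e.g. "extract-structured-info"
        domain: str     # surface form, e.g. "email", "form", "document"
        passed: bool

    def is_skill_complete(skill_id: str, attempts: list[TaskAttempt]) -> bool:
        """A skill counts as taught only after passing attempts in MIN_DOMAINS
        distinct surface domains that share the same underlying pattern."""
        passed_domains = {a.domain for a in attempts
                          if a.skill_id == skill_id and a.passed}
        return len(passed_domains) >= MIN_DOMAINS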

2. Explicit pattern naming (high-road, forward-reaching)

When the user completes a task, the AI co-pilot names the underlying pattern in plain language: "You just used decomposition — you broke a big task into smaller steps. You'll see this same pattern in Level 3 when you research a topic and write a report."

The name is the cognitive handle. It is the abstraction the learner can later retrieve when facing a new context.

Empirical anchor: Salomon & Perkins on mindful abstraction: "the abstraction must be understood, and the understanding requires mindfulness; automatic processes just do not yield novel abstractions that are well understood."2 The Gick & Holyoak (1983) finding they cite is striking: subjects who wrote good summaries comparing two analogous stories transferred to a new problem at 91%; subjects who wrote poor summaries transferred at 30%. The quality of the abstraction the learner produces in their own words is the discriminating variable.

Product implementation: AI co-pilot post-task behavior. Instead of "well done," the co-pilot names what the user did using a vocabulary that recurs across the curriculum.
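
A sketch of what a shared pattern vocabulary could look like in code, so the same plain-language name recurs verbatim across levels; the entries and function are invented examples, not the spec's.

    # Illustrative shared pattern vocabulary for the co-pilot's post-task naming.
    # Entries and "recurs_in" values are examples only.
    PATTERN_VOCAB = {
        "decomposition": {
            "plain_name": "decomposition",
            "description": "you broke a big task into smaller steps",
            "recurs_in": "Level 3, when you research a topic and write a report",
        },
    }

    def post_task_message(pattern_id: str) -> str:
        """Name the pattern the user just used, instead of a generic 'well done'."""
        p = PATTERN_VOCAB[pattern_id]
        return (f"You just used {p['plain_name']}: {p['description']}. "
                f"You'll see this same pattern again in {p['recurs_in']}.")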

3. Metacognitive debrief (high-road, backward-reaching)

After each task, a 30-second debrief: "What pattern did you use? Where else might that same pattern apply?" The user articulates the schema in their own words.

This is the move that prepares users for the real test: encountering a novel digital task in the wild, mentally reaching back to a pattern from training, and applying it. Without rehearsing this retrieval, the schema stays passive.

Empirical anchor: VanLehn's 2011 meta-analysis of 50+ Intelligent Tutoring System studies: self-explanation scaffolding produced effect sizes of d ≈ 0.33–0.55 on transfer tasks. Aleven & Koedinger (2002) reported the same range in the Cognitive Tutor specifically.7 HPL II flags self-explanation and elaborative interrogation as evidence-supported, with the caveat that they depend on prior knowledge — low-knowledge learners may need scaffolded modeling first.6

Important caveat from the recent literature: Zengilowski et al. (L@S 2025) ran a preregistered RCT (n = 1,005) on metacognitive reflection prompts in math learning and found a null effect.7 McCarthy et al. (2018) found that metacognitive prompts in a reading tutor hurt performance during practice — "metacognitive overload."7 The mechanism is real, but timing, frequency, and design matter enormously. Our co-pilot must use these prompts sparingly, after success rather than during struggle, and scaffolded for our low-confidence population.

Product implementation: End-of-task debrief, modeled by the co-pilot for the first several occurrences, then elicited from the user.
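
A hedged sketch of the gating rule implied above: debrief only after success, keep prompts sparse, and model the reflection before asking for it. Thresholds and field names are assumptions, not spec values.

    # Sketch of the debrief gating rule: sparing, post-success, modeled first.
    from dataclasses import dataclass

    MODELED_DEBRIEFS = 3           # co-pilot models the debrief this many times first
    MIN_TASKS_BETWEEN_PROMPTS = 2  # "sparingly": avoid back-to-back prompts

    @dataclass
    class UserState:
        debriefs_seen: int = 0
        tasks_since_last_prompt: int = 99
        last_task_succeeded: bool = False

    def debrief_action(state: UserState) -> str:
        if not state.last_task_succeeded:
            return "skip"    # never prompt during or right after struggle
        if state.tasks_since_last_prompt < MIN_TASKS_BETWEEN_PROMPTS:
            return "skip"    # keep prompts sparse to avoid metacognitive overload
        if state.debriefs_seen < MODELED_DEBRIEFS:
            return "model"   # co-pilot articulates the pattern itself
        return "prompt"      # ask: what pattern did you use, and where else might it apply?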

4. Contrasting cases (schema differentiation)

Periodically, the user is presented with two tasks that look similar but require different patterns. The system asks: "Is this the same pattern as the last one, or different? How can you tell?"

This forces the discrimination that prevents schemas from staying surface-level. Without contrasting cases, learners build schemas keyed on superficial features (the task is "about email" or "about a form") rather than on structural features (the task is "extract → transform → submit").

Empirical anchor: Schwartz & Bransford 1998. In their three classroom studies, undergraduates who analyzed contrasting cases before hearing a lecture made significantly better predictions on a novel hypothetical experiment one week later than students who summarized text, read about the cases, or analyzed cases without hearing a lecture.5 The crossover within-subject design controls for individual differences. Critically, their true/false verification test did not discriminate between conditions — only the far-transfer prediction task did. Learners can pass standard recognition assessments without having learned anything that transfers. This is a direct empirical anchor for our assessment design (see move #5).

Product implementation: Task engine logic that periodically inserts a near-neighbor task that requires a different pattern, paired with a discrimination prompt.
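
One way the task engine's contrasting-case logic could be sketched: pick a task in the same surface domain that requires a different underlying pattern, and pair it with the discrimination prompt. Names are illustrative, not the spec's.

    # Sketch of contrasting-case selection: same surface, different pattern.
    from dataclasses import dataclass
    import random

    @dataclass(frozen=True)
    class Task:
        task_id: str
        domain: str    # surface form, e.g. "email"
        pattern: str   # underlying schema, e.g. "extract-transform-submit"

    def pick_contrasting_case(just_done: Task, pool: list[Task]) -> Task | None:
        """Return a near-neighbor task: same surface domain, different pattern."""
        candidates = [t for t in pool
                      if t.domain == just_done.domain and t.pattern != just_done.pattern]
        return random.choice(candidates) if candidates else None

    DISCRIMINATION_PROMPT = ("Is this the same pattern as the last task, "
                             "or a different one? How can you tell?")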

5. Far-transfer assessment (the only honest test)

The Assessment Engine measures success not on tasks that look like training tasks, but on tasks with novel surface forms that require previously-taught patterns. The headline metric is far-transfer rate: the percentage of users who complete a task they have never seen before, in a context they have never trained in, using a pattern from earlier in the curriculum.

Near-transfer metrics — completion rate on training-like tasks, time-to-completion, error rate — are tracked, but they are secondary. They diagnose procedural fluency, not transfer.

Empirical anchor: Schwartz & Bransford demonstrate that recognition-style assessments mask the differences that prediction-style assessments reveal.5 The CT-transfer meta-analyses (Ye 2022; CT-STEM 2024) report distinguishable near and far transfer effects.3 If we only measure near transfer, we will accept a system that produces sandbox-procedural users.

Product implementation: Assessment Engine surfaces a near/far transfer distinction. The product cannot ship without an instrument for assessing far transfer; the spec must specify what novel-context tasks look like.
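
A minimal sketch of the near/far metric split, assuming each assessment result is tagged with whether its surface form and context were seen in training; the names are placeholders, not the Assessment Engine's API.

    # Sketch of the near/far transfer distinction in the assessment metrics.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class AssessmentResult:
        user_id: str
        completed: bool
        novel_surface: bool   # surface form never seen in training
        novel_context: bool   # context/domain never trained in

    def far_transfer_rate(results: list[AssessmentResult]) -> float:
        """Share of users who complete a task with a novel surface form in a
        novel context; this is the headline metric."""
        far = [r for r in results if r.novel_surface and r.novel_context]
        users = {r.user_id for r in far}
        passed = {r.user_id for r in far if r.completed}
        return len(passed) / len(users) if users else 0.0

    def near_transfer_rate(results: list[AssessmentResult]) -> float:
        """Completion rate on training-like tasks; tracked, but secondary."""
        near = [r for r in results if not (r.novel_surface and r.novel_context)]
        return (sum(r.completed for r in near) / len(near)) if near else 0.0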


Productive struggle and engagement

Two findings from the recent literature shape the AI co-pilot's intervention behavior beyond the five design moves above.

Productive struggle. Schwartz & Bransford's central thesis is that direct instruction ("telling") is most effective when it follows a period of preparation that surfaces what the learner does not yet know. "When telling occurs without readiness, the primary recourse for students is to treat the new information as ends to be memorized rather than as tools to help them perceive and think."5 The implication is that the co-pilot should not be a help-text dispenser. It should require attempt before assist — not as a punitive measure, but because telling without readiness produces memorization, not transfer.

The Bastani et al. (2026) 770-student RCT provides the empirical corroboration: their tutoring chatbot was prompted to refuse direct answers until students demonstrated substantial effort, and engagement (not problem volume) mediated the entire 0.15 SD learning gain.8

Engagement as the mediator, not the goal. Bastani's mediation analysis showed gains were driven by sustained time-on-task and persistence, not by completing more or harder problems. Our co-pilot's intervention rules must therefore optimize for engagement quality (the chat-quality score Bastani used as an LLM-judged metric), not for raw task volume or speed.

For our adult-novice population the calibration of struggle is delicate. McCarthy et al.'s (2018) "metacognitive overload" finding cautions that good intentions can backfire: prompts that work for confident learners can demoralize low-confidence ones.7 The product must monitor affective signals (typing pauses, repeated wrong attempts, dwell time on help text) and back off when struggle is tipping into distress rather than staying productive.
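
A sketch of an intervention rule consistent with the three paragraphs above: require an attempt before assisting, hold back during productive struggle, and step in when the affective signals suggest distress. Signal names and thresholds are assumptions for illustration only.

    # Sketch of the co-pilot intervention rule: attempt before assist, with back-off.
    from dataclasses import dataclass

    @dataclass
    class TaskSignals:
        attempts: int               # actions the user has tried on this task
        wrong_attempts_in_row: int
        seconds_idle: float         # typing/interaction pause
        help_text_dwell_s: float    # time spent re-reading help text

    def copilot_action(sig: TaskSignals) -> str:
        if sig.attempts == 0:
            return "encourage_attempt"     # no direct answers before genuine effort
        if (sig.wrong_attempts_in_row >= 4
                or sig.seconds_idle > 120
                or sig.help_text_dwell_s > 90):
            return "step_in_with_support"  # struggle is tipping into distress
        return "hold_back"                 # productive struggle; let the user keep working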


Falsification: how we'd know it isn't working

Three concrete failure signals. If any of them shows up in deployment data, the pedagogy described above is not working in our population, and we should redesign rather than scale.

  1. Far-transfer rate tracks completion rate too closely, or collapses far below it. If users who complete training tasks at 80% also complete novel-context tasks at 75–80%, we are accidentally training on the test — our "novel" tasks are not novel enough. If far-transfer rate is much lower than completion rate (e.g., 80% completion but 25% far transfer), we are training procedural fluency, not transfer. Both failure modes are diagnosable from the metric distinction baked into the Assessment Engine (see the sketch after this list).

  2. Pattern-naming recall fails. If, weeks after a curriculum module, users cannot name the patterns the co-pilot named for them — not even with a "what did you do in Level 2?" prompt — then the explicit-pattern-naming design move did not stick, and either the prompts are too generic or the calibration is wrong.

  3. Smartphone-mental-model carryover persists. If users who reach the document editor module still treat documents as ephemeral (no expectation of versioning, no understanding of where the file is stored), then the explicit mental-model instruction did not produce the categorical shift Chi (2008) describes.3 This is the Urban-Institute failure mode reproduced inside our product, and it would mean the curriculum needs to make mental models more central earlier.
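
For failure signal 1, a small diagnostic over the near/far rates sketched under move #5; the gap thresholds are placeholders, not committed values.

    # Illustrative check for failure signal 1, using the near/far rates from the
    # earlier sketch. The 0.05 and 0.40 thresholds are placeholders only.
    def diagnose_transfer_gap(near_rate: float, far_rate: float) -> str:
        gap = near_rate - far_rate
        if gap < 0.05:
            return "suspicious: far tasks may not be novel enough (training on the test)"
        if gap > 0.40:
            return "failing: procedural fluency without transfer; redesign before scaling"
        return "within expected range"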

A pedagogy doc without a falsification section is just a list of citations. The credibility move is to commit, in advance, to what would count as evidence the design is wrong.


Open questions

The honest answer about the evidence base for our specific population: it is thin.

This is not a weakness of the proposal. It is the contribution opportunity. A platform that deploys a structured intervention to thousands of low-fluency adults, instruments far-transfer measurement at scale, and reports honestly on what works will produce evidence the field currently does not have. The pedagogy described above is grounded in the strongest available adjacent evidence — Schwartz & Bransford, Salomon & Perkins, the ITS meta-analyses, Bastani et al. — but the specific question of whether these moves work for adults learning digital fluency is empirical and we are positioned to answer it.


How this document feeds the rest

Footnotes

  1. Wing, J. M. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35. PDF · our notes.

  2. Salomon, G., & Perkins, D. N. (1989). Rocky roads to transfer: Rethinking mechanisms of a neglected phenomenon. Educational Psychologist, 24(2), 113–142. PDF · our notes. Pea & Kurland's 1984 LOGO finding cited at p. 114.

  3. Synthesis: adult CT and digital-skills transfer. Includes Ye 2022 meta-analysis, CT-STEM 2024 meta-analysis, RAND 2024 pilot, OECD PIAAC, Chi 2008 conceptual change, NN/G mental-models work.

  4. Hecker, I., & Loprest, P. (2019). Foundational Digital Skills for Career Progress. Urban Institute. PDF · our notes. Smartphone-transfer-failure quote on p. 12.

  5. Schwartz, D. L., & Bransford, J. D. (1998). A time for telling. Cognition and Instruction, 16(4), 475–522. PDF · our notes. Telling-without-readiness quote on p. 477.

  6. National Academies of Sciences, Engineering, and Medicine. (2018). How People Learn II: Learners, Contexts, and Cultures. Chapter 5: Knowledge and Reasoning. our notes.

  7. Synthesis: LLM-tutor metacognitive-prompting literature. Includes Aleven & Koedinger 2002, VanLehn 2011 meta-analysis, McCarthy et al. 2018 (metacognitive overload), Zengilowski et al. L@S 2025 (preregistered null), LearnLM UK RCT 2025, Kestin et al. Harvard physics RCT 2025.

  8. Chung, A. T-H., Zhang, B., Kung, L-C., Bastani, H., & Bastani, O. (2026). Effective Personalized AI Tutors via LLM-Guided Reinforcement Learning. SSRN 6423358. PDF · original source.