A real-time, voice-driven training suite for people who stutter — built to give the clinician measurement, not just engagement. Every game listens to the client's voice and tells you what their speech-motor system is actually doing in the moment.
This walkthrough shows each game in the suite with a live, interactive demonstration. The same demos run inside the actual product. Drag sliders, click phonemes, watch animations — every visualization is the real mechanic the game uses with a client.
Use the arrow keys, swipe left or right, or click the prev/next buttons to move between frames. Scroll inside each frame to read it fully.
Every syllable in FluentPlay is scored along three dimensions: Safety (how protected this moment is), Pressure (how much load it carries), and Difficulty (how hard the execution actually was). Each dimension is a weighted average of factors. Below is a simplified set — Speech Console runs the full 21-factor version.
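The weighted-average scheme can be sketched in a few lines. The factor names, values, and weights below are invented for illustration; the production engine runs its own 21-factor set.

```typescript
// Sketch of the weighted average behind one PAD dimension.
// Factors and weights are illustrative, not the product's real set.
type Factor = { value: number; weight: number }; // value normalized to 0..1

function scoreDimension(factors: Factor[]): number {
  const totalWeight = factors.reduce((s, f) => s + f.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = factors.reduce((s, f) => s + f.value * f.weight, 0);
  return weighted / totalWeight; // 0..1, higher = more of this dimension
}

// Example: a simplified Pressure dimension for one syllable.
const pressure = scoreDimension([
  { value: 0.8, weight: 3 }, // vocal force (RMS) — hypothetical factor
  { value: 0.4, weight: 2 }, // onset abruptness — hypothetical factor
  { value: 0.2, weight: 1 }, // phrase position — hypothetical factor
]);
```

Each dimension is just such a normalized weighted sum, so adding or removing factors changes the mix without changing the 0..1 scale.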
Every game shares the same browser-side audio engine. Your microphone goes through two FFT analysers — one for spectral features, one for waveform peak detection — and into a frame-level state machine called the Disfluency Feature Stream (DFS).
For every audio frame (about 60 per second), the DFS classifies the sound as silent, building, or voiced, and derives the engine's per-syllable features from that stream.
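A minimal sketch of that frame-level state machine, assuming simple RMS thresholds (the threshold values here are placeholders, not the engine's calibrated ones):

```typescript
// Per-frame classification into the three DFS states, plus a fold of the
// frame stream (~60 fps) into simple cumulative features.
type DfsState = "silent" | "building" | "voiced";

const SILENCE_RMS = 0.02; // below this: silent (hypothetical threshold)
const VOICED_RMS = 0.08;  // above this: fully voiced (hypothetical threshold)

function classifyFrame(rms: number): DfsState {
  if (rms < SILENCE_RMS) return "silent";
  if (rms < VOICED_RMS) return "building";
  return "voiced";
}

function summarize(frames: number[]) {
  let voicedFrames = 0;
  let onsets = 0;
  let prev: DfsState = "silent";
  for (const rms of frames) {
    const state = classifyFrame(rms);
    if (state === "voiced") voicedFrames++;
    if (state === "voiced" && prev !== "voiced") onsets++; // fresh voice onset
    prev = state;
  }
  return { voicedFrames, onsets, voicedMs: (voicedFrames / 60) * 1000 };
}
```

Counting onsets as silent-or-building → voiced transitions is what lets the engine notice a re-attempted onset, as in the block example later in this walkthrough.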
Each feature feeds the PAD scorer. No audio is recorded. Nothing leaves the browser except the Azure Speech call (used by Speech Console and Articulation Trainer for phoneme-level accuracy).
Top row: live waveform with peak markers. Middle row: DFS state classification per frame (silent / building / voiced). Bottom row: cumulative feature counts as the syllable plays.
Speech Console runs the full PAD engine on a phrase, syllable by syllable. The speaker enters a phrase, picks a preset (Balanced, High Sensitivity, or Timing focus), and speaks. Each syllable scores in real time and produces a colored chip — green for safe-and-low-pressure, amber for moderate, red for high difficulty.
For every syllable Speech Console renders two rows of phoneme chips: anticipated (what the engine expects from the dictionary) and detected (what Azure Speech Services actually heard, scored 0–100). Below the chips are the live P, A, D bars for the current syllable, then the rolling Ground meta-parameter — it drops fast on stumbles, recovers slowly on success.
When the playhead hits "bil", the detected row shows two /b/ chips in red — that's a block: the engine caught the speaker re-attempting onset before the vowel landed. The PAD bars spike on Pressure and Difficulty at the same moment, and Ground drops. Speech Console makes that visible without the SLP having to flag it manually.
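The "drops fast, recovers slowly" behavior of Ground can be sketched as an asymmetric update rule. The two rate constants below are assumptions chosen to show the shape, not the engine's tuned values.

```typescript
// Ground meta-parameter update: a stumble cuts it sharply,
// a clean syllable nudges it back toward 1.
const DROP_RATE = 0.4;     // fraction of Ground lost per stumble (assumed)
const RECOVER_RATE = 0.05; // fraction of the remaining gap regained per success (assumed)

function updateGround(ground: number, stumbled: boolean): number {
  if (stumbled) return ground * (1 - DROP_RATE);  // drops fast
  return ground + (1 - ground) * RECOVER_RATE;    // recovers slowly toward 1
}
```

With these rates one stumble undoes many clean syllables, which is exactly the asymmetry the rolling meter is meant to make visible.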
The Articulation Trainer is a real-time anatomical view of speech production. Pick any sound — the diagram shows exactly which articulators are involved, where the air flows, and how the lips and tongue have to position themselves to produce it. Same engine that runs inside Speech Console for live phoneme feedback.
Most people who stutter can produce any phoneme correctly under high effort. The clinical goal is to do it with less. The trainer shows the mechanics — where lips press, where tongue contacts, where air flows — so the speaker has a model of "easy" to aim for, not just a target word.
Sound Bridge gives you two sounds and asks you to connect them without stopping your voice. You hold the first sound, slide smoothly into the second, then hold that one too — building a bridge of continuous voice between them.
If your voice drops below the voicing threshold during the transition, the bridge breaks and you start over. The score combines hold steadiness with transition continuity.
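The break rule and the score's two ingredients can be sketched like this, with an assumed voicing threshold and a simple deviation-based steadiness measure (the real scoring weights are not specified here):

```typescript
// Bridge-break rule: the transition fails if any frame's voicing level
// dips below threshold. Threshold value is hypothetical.
const VOICING_THRESHOLD = 0.05;

function bridgeHolds(transitionFrames: number[]): boolean {
  return transitionFrames.every((rms) => rms >= VOICING_THRESHOLD);
}

// Hold steadiness: 1 minus mean absolute deviation, normalized by the mean.
function steadiness(holdFrames: number[]): number {
  const mean = holdFrames.reduce((s, v) => s + v, 0) / holdFrames.length;
  const dev =
    holdFrames.reduce((s, v) => s + Math.abs(v - mean), 0) / holdFrames.length;
  return Math.max(0, 1 - dev / mean);
}
```

A perfectly even hold scores 1; a wobbly one scores lower; a single sub-threshold frame in the transition breaks the bridge outright.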
The demo on the right auto-loops through clean → shaky → broken. Click any scenario below to jump to it.
Coarticulation — the smooth chaining of phonemes — is exactly where many people who stutter block. Sound Bridge isolates that transition as the training target. The smoother your slide, the more your speech-motor system is relying on a single integrated motor plan rather than two sequential ones.
The DIVA model of feedforward speech control predicts that fluent transitions depend on the second sound's motor plan being prepared while the first is still being produced. Sound Bridge's continuity meter is a real-time proxy for that feedforward signal.
Rainbow Syllables v12 is the SLP-facing trial build of the FluentPlay engine. Same scoring pipeline as Speech Console — but the v12 UI exposes every intermediate signal so a clinician can see exactly what the system is measuring at each moment.
The mockup on the right is the actual v12 game screen, animated with sample data and read top to bottom.
The anticipated row is the target — what the speaker is supposed to produce. The detected row is the reality — what Azure SR pulled from the audio. The gap between them is the actionable signal: where did the production drift from the target, and on which phoneme. SLPs can review this row-by-row after a session to identify exactly which phonemes need work.
Rainbow Syllables v12 and Speech Console run the same scoring pipeline. Speech Console is the clean production framing — a single PAD breakdown with the technical details abstracted away. Rainbow Syllables v12 is the SLP trial build — every signal exposed for diagnostic review. The 20-session trial limit and the diagnostic monitor are what make it the SLP version.
Summit takes one word and asks you to repeat it many times. Each successful production climbs a stylized mountain trail. Each stumble drops the climber back toward basecamp. Reach the summit to win.
This is voluntary stuttering in the Van Riper tradition, gamified. Repeating a word — especially a feared word — is the canonical exposure exercise for stuttering. The mountain frame turns that exposure into a goal-directed task: every repetition is altitude gained.
The descent mechanic explicitly punishes giving up. There's no skip button. There's no easy mode. The only way to the summit is to keep speaking through the moments you would normally avoid.
Summit doesn't reward smooth speech. It rewards sustained engagement with a hard word. Habituation reduces the amygdala threat response, which in turn stabilizes downstream motor gating — the mechanism by which exposure therapy is thought to work.
Cadence is rhythm training for speech-motor timing. A horizontal lane runs across the screen with a fixed hit line on the left side. Bars spawn on the right and travel left at the chosen tempo. As each bar reaches the hit line, the speaker says the target phoneme. Voice onset detection at the hit line scores the timing as PERFECT, GOOD, or a miss.
Eight beats per measure. Bars spawn at intervals of 60/BPM seconds. The hit window is asymmetric — more forgiving on the early side (~180px before the line) than the late side (~100px after). A PERFECT hit is within roughly the first third of the early window; everything else inside the window scores GOOD. A miss means the bar passed the late edge with no voice onset, OR the speaker fired voice with no bar in window (a "too early" error). Combo accumulates with consecutive hits and resets on any miss. Hit 75% or more of the eight beats and the BPM auto-advances by 10.
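That window logic can be sketched directly from the numbers above. One interpretation is assumed: "the first third of the early window" is read as the 60px nearest the hit line.

```typescript
// Cadence's asymmetric hit window. Pixel widths follow the description;
// the PERFECT zone placement is an interpretation.
const EARLY_WINDOW = 180; // px before the hit line (forgiving side)
const LATE_WINDOW = 100;  // px after the hit line
const PERFECT_ZONE = EARLY_WINDOW / 3; // 60px nearest the line (assumed)

type Judgment = "PERFECT" | "GOOD" | "MISS";

// offset = bar position minus hit-line position, in px.
// Positive: bar still approaching (early). Negative: past the line (late).
function judgeHit(offset: number): Judgment {
  if (offset > EARLY_WINDOW || offset < -LATE_WINDOW) return "MISS";
  if (offset >= 0 && offset <= PERFECT_ZONE) return "PERFECT";
  return "GOOD";
}

// Measure logic: 8 beats, 75%+ hit rate auto-advances tempo by 10 BPM.
function nextBpm(bpm: number, hits: number, beats = 8): number {
  return hits / beats >= 0.75 ? bpm + 10 : bpm;
}
```

A voice onset with no bar inside the window lands in the `offset > EARLY_WINDOW` branch — the "too early" miss.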
Speaking in time with an external rhythm has a robust short-term fluency-enhancing effect — the metronomic fluency effect. Cadence gamifies the metronome. The deeper target is cerebellar timing coordination: cerebellar circuits govern speech rhythm, and the stuttering literature implicates those circuits in disfluency. Cadence is direct rhythm training for them.
Tempo-based repetition spaces productions evenly across time, which is a stronger motor learning condition than block practice (the "say it 10 times in a row" pattern). Cadence delivers distributed practice without making the speaker think about it.
Rhythm Pad is the diadochokinetic drill from the legacy suite. Five segment slots in a row. The speaker says the target phoneme into each slot in sequence — voice in the correct volume zone fills the slot; missing the window grays it out.
Each slot has a fixed time window (about 1.2 seconds at the medium speed preset). A ghost fill sweeps across the slot showing time elapsed. When the speaker's voice lands inside the volume zone, the slot snaps to its target color and the ghost fill jumps to full — that's a hit. Then a release gate requires the voice to drop out of the zone before the next slot can be hit, which prevents one continuous tone from filling everything. If the window expires with no hit, the slot grays out and the run advances anyway. Five hits in a row clears the measure.
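The hit-then-release cycle can be sketched as a tiny per-frame state machine. Zone bounds are assumptions, and the window-expiry advance is omitted for brevity.

```typescript
// Release gate: after a hit, the voice must leave the volume zone
// before the next slot can register. Zone bounds are illustrative.
type PadState = { slot: number; armed: boolean; hits: boolean[] };

const ZONE_LOW = 0.05;
const ZONE_HIGH = 0.5;

function inZone(rms: number): boolean {
  return rms >= ZONE_LOW && rms <= ZONE_HIGH;
}

// Process one audio frame; returns the updated state.
function step(state: PadState, rms: number): PadState {
  if (!state.armed) {
    // Gate: wait for the voice to drop out of the zone before re-arming.
    if (!inZone(rms)) return { ...state, armed: true };
    return state;
  }
  if (inZone(rms) && state.slot < state.hits.length) {
    const hits = [...state.hits];
    hits[state.slot] = true; // hit: fill the slot, disarm until release
    return { slot: state.slot + 1, armed: false, hits };
  }
  return state;
}
```

Running a continuous tone through `step` fills exactly one slot and then stalls — the gate only re-arms once the voice leaves the zone.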
Three consecutive hits trigger the streak reward. The streak indicator at the bottom shows how many in a row the speaker has banked toward the goal. Any miss resets the streak counter to zero.
Diadochokinetic drilling is the canonical exercise for measuring and training speech-motor coordination. The clinical version uses /pa-ta-ka/ at increasing rates to assess motor sequencing. Rhythm Pad takes that exercise, gamifies it, and produces per-segment scoring data the SLP can review.
Without the release gate, a speaker could just hold one continuous tone and the system would register every slot as a hit. The release gate forces discrete productions, which is the actual motor pattern being trained. It's a small mechanic with a large effect on what the game is measuring.
Each bubble carries something to say — a sound, a syllable, or a word. The speaker plugs in whatever they want to practice, and the bubbles drift across the screen carrying that target. As each bubble reaches the zone, the speaker says it. Speech recognition (or volume, depending on mode) decides whether the bubble pops.
The bubble content is the whole point. A child practicing /b/ words sees bubbles labeled ball, baby, bus, big. An adult drilling consonant clusters sees bubbles labeled str, spl, scr. A speaker working on feared words sees bubbles labeled with the exact words they avoid. The game adapts to the practice, not the other way around.
Bubble size is a secondary feature. A small bubble wants a soft voice, a large bubble wants a loud voice. This trains intensity calibration — letting a speaker discover they can produce the same target at multiple effort levels. But the size is the side-channel; the label inside the bubble is the target.
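The size-to-loudness mapping can be sketched as a linear interpolation with a tolerance band. The radius range, RMS range, and tolerance below are all assumptions for illustration.

```typescript
// Smaller bubble -> softer target voice. Ranges are hypothetical.
function targetRms(bubbleRadius: number, minR = 20, maxR = 80): number {
  const t = Math.min(1, Math.max(0, (bubbleRadius - minR) / (maxR - minR)));
  return 0.05 + t * 0.35; // soft 0.05 .. loud 0.40
}

// In volume mode, the bubble pops when measured loudness lands
// close enough to the size-derived target.
function popsBubble(measuredRms: number, bubbleRadius: number, tolerance = 0.08): boolean {
  return Math.abs(measuredRms - targetRms(bubbleRadius)) <= tolerance;
}
```

The tolerance band is what makes this a calibration exercise rather than a loudness contest: too loud fails a small bubble just as surely as too soft fails a large one.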
Like every game in the suite, Bubble Hunt can run on volume detection alone or on full speech recognition. With SR enabled, the bubble pops only when the speaker actually produces the target sound — not just any sound at the right volume.
The intensity-modulation training in Bubble Hunt evolved into the Volume Target selector in Rainbow Syllables, and from there into the RMS-based Vocal Force factor in Speech Console's Pressure dimension. The "user supplies the target" model carried forward into Speech Console's challenge-sound flagging.
The games are the visible product. The workflow on the right is the actual product. FluentPlay is built around a six-stage loop you run with a client over weeks or months — the games slot into the loop, the analytics close it.
Every stage produces data the next stage consumes. Assessment surfaces challenge sounds. Challenge sounds drive targeted articulation work. Articulation work feeds drill speed and intensity. Drill output gets reviewed against the baseline. Review drives the configuration of the next session. The loop runs until the client's PAD profile on a given phrase is stable in the green band — that's the operational definition of fluency for that phrase.
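That exit criterion — a stable PAD profile in the green band — can be sketched as a check over a rolling window of sessions. The band bounds and window size are assumptions; the product's green band is defined by its own thresholds.

```typescript
// Loop exit criterion sketch: every recent PAD score for the phrase
// must sit inside the green band. Bounds and window are assumed.
type Pad = { pressure: number; difficulty: number; safety: number };

const GREEN = { maxPressure: 0.3, maxDifficulty: 0.3, minSafety: 0.7 };

function inGreenBand(p: Pad): boolean {
  return (
    p.pressure <= GREEN.maxPressure &&
    p.difficulty <= GREEN.maxDifficulty &&
    p.safety >= GREEN.minSafety
  );
}

// Stable = a full window of consecutive green sessions.
function isStable(history: Pad[], window = 3): boolean {
  if (history.length < window) return false;
  return history.slice(-window).every(inGreenBand);
}
```

Requiring a full window, rather than a single green session, is what makes the criterion operational: one good day doesn't close the loop.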
FluentPlay isn't a game suite that happens to log data. It's a clinical instrument with a game-shaped interaction layer. Your day-to-day workflow is what FluentPlay is engineered around — the games are the means by which that workflow becomes tractable for both children and adults who would otherwise refuse traditional drill therapy. Pediatric clients show up because it looks like a game. Adult clients stay because the data is real.
Per-syllable PAD scores. Per-phoneme accuracy from Azure SR. Challenge-sound flag history. Session arc plots. Ground trajectory. RMS, voiced duration, and onset traces. Everything is available as structured data you can export for progress notes, insurance documentation, parent reports, or your own longitudinal tracking across the client's history.
Three current games on one shared engine. Six legacy games at fluentplay.itch.io. One PAD framework underneath all of them. Built for clinicians who need real measurement, not just engagement.
FluentPlay is currently in clinical pilot with a small group of practices. If you'd like to join the pilot or run a demo session with one of your clients, send me an email and I'll set you up directly. No app store, no install — you'll get a link, your client speaks into a browser, and the data flows back to you.
Reach out: [email protected] | fluentplaytech.com
Will Carbone | FluentPlay Technologies LLC | Somerville, MA | April 2026 | Provisional Patent Pending