What is FluentPlay?

A real-time, voice-driven training suite for people who stutter — built to give the clinician measurement, not just engagement. Every game listens to the client's voice and tells you what their speech-motor system is actually doing in the moment.

This walkthrough shows each game in the suite with a live, interactive demonstration. The same demos run inside the actual product. Drag sliders, click phonemes, watch animations — every visualization is the real mechanic the game uses with a client.

Use the arrow keys, swipe left or right, or click the prev/next buttons to move between frames. Scroll inside each frame to read it in full.

Will Carbone  |  FluentPlay Technologies LLC  |  April 2026  |  Provisional Patent Pending
👉 Drag any slider — the syllable on the right rescores in real time.

The PAD framework

Every syllable in FluentPlay is scored along three dimensions: Safety (how protected this moment is), Pressure (how much load it carries), and Difficulty (how hard the execution actually was). Each dimension is a weighted average of factors. Below is a simplified set — Speech Console runs the full 21-factor version.

Safety factors
Stability (weight 0.70): stable RMS = confident initiation; wobble = anticipatory tension.
Sustained voice (weight 0.65): how long the voice holds without breaking.
Late position (weight 0.40): the first syllable is the hardest; later syllables ride momentum.

Pressure factors
Vocal force (weight 0.55): higher RMS = more effort; pushing hard costs.
Pace demand (weight 0.50): faster speed narrows the execution window.

Difficulty factors
Onset attempts (weight 0.20): multiple voice-onset attempts = false starts, restarts.
Block / prolongation (weight 0.15): effortful silence or stretched sound.
Challenge phoneme (weight 0.30): speaker-flagged hard sound; adds motor planning load.
Try a preset
Sample syllable "hel" live
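To make the framework concrete, here is a minimal sketch of scoring one syllable against the simplified factor set above. The factor values, the 0-to-1 normalization, and the plain weighted-average combination rule are illustrative assumptions; the shipped 21-factor engine may combine factors differently.

```python
# Sketch: scoring one syllable's Safety, Pressure, and Difficulty as
# weighted averages of the simplified factors above. Factor values are
# assumed to be normalized to 0..1; names and weights mirror the list,
# but the combination rule is an illustrative assumption, not the
# shipped 21-factor engine.

WEIGHTS = {
    "safety": {"stability": 0.70, "sustained_voice": 0.65, "late_position": 0.40},
    "pressure": {"vocal_force": 0.55, "pace_demand": 0.50},
    "difficulty": {"onset_attempts": 0.20, "block_prolongation": 0.15,
                   "challenge_phoneme": 0.30},
}

def score_dimension(dim: str, factors: dict) -> float:
    """Weighted average of the factors belonging to one PAD dimension."""
    w = WEIGHTS[dim]
    return sum(w[name] * factors[name] for name in w) / sum(w.values())

# Hypothetical readings for the syllable "hel": steady voice, one extra
# onset attempt, and a speaker-flagged challenge phoneme.
factors = {
    "stability": 0.8, "sustained_voice": 0.9, "late_position": 0.0,
    "vocal_force": 0.4, "pace_demand": 0.5,
    "onset_attempts": 0.5, "block_prolongation": 0.0, "challenge_phoneme": 1.0,
}
safety = score_dimension("safety", factors)
pressure = score_dimension("pressure", factors)
difficulty = score_dimension("difficulty", factors)
```

Because each dimension divides by the sum of its own weights, every score stays in the 0..1 band regardless of how many factors a dimension uses.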

The audio pipeline

Every game shares the same browser-side audio engine. Your microphone goes through two FFT analysers — one for spectral features, one for waveform peak detection — and into a frame-level state machine called the Disfluency Feature Stream (DFS).

What the DFS tracks

For every audio frame (about 60 per second), the DFS classifies the sound as silent, building, or voiced. From that stream it derives:

RMS: intensity
Onsets: count
Voiced: duration
Blocks: flag

Each feature feeds the PAD scorer. No audio is recorded. Nothing leaves the browser except the Azure Speech call (used by Speech Console and Articulation Trainer for phoneme-level accuracy).
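The frame classification described above can be sketched as a small state machine. The RMS thresholds, the 250 ms block window, and the idea of driving it from raw RMS alone are illustrative assumptions; the real DFS works from two FFT analysers, not a single level meter.

```python
# Sketch: a frame-level state machine in the spirit of the Disfluency
# Feature Stream. Thresholds and the block window are illustrative
# assumptions.

SILENT, BUILDING, VOICED = "silent", "building", "voiced"
FRAME_MS = 1000 / 60          # ~60 frames per second

class DFS:
    def __init__(self, low: float = 0.02, high: float = 0.08):
        self.low, self.high = low, high
        self.state = SILENT
        self.onsets = 0           # count of voice-onset events
        self.voiced_ms = 0.0      # cumulative voiced duration
        self.building_frames = 0  # frames stuck below the voicing threshold
        self.block_flag = False   # effortful sound that never reaches voicing

    def feed(self, rms: float) -> str:
        if rms < self.low:
            self.state, self.building_frames = SILENT, 0
        elif rms < self.high:
            self.state = BUILDING
            self.building_frames += 1
            if self.building_frames * FRAME_MS > 250:   # assumed block window
                self.block_flag = True
        else:
            if self.state != VOICED:
                self.onsets += 1                        # a new voice onset
            self.state = VOICED
            self.voiced_ms += FRAME_MS
            self.building_frames = 0
        return self.state

dfs = DFS()
# Two separate voicings with silence between them -> two onsets counted.
for rms in [0.0, 0.05, 0.2, 0.2, 0.0, 0.0, 0.2, 0.2]:
    dfs.feed(rms)
```

Each onset increment, voiced millisecond, and block flag then feeds directly into the PAD factors listed earlier.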

What the panel shows

Top row: live waveform with peak markers. Middle row: DFS state classification per frame (silent / building / voiced). Bottom row: cumulative feature counts as the syllable plays.


Speech Console — the flagship

Speech Console runs the full PAD engine on a phrase, syllable by syllable. The speaker enters a phrase, picks a preset (Balanced, High Sensitivity, or Timing focus), and speaks. Each syllable scores in real time and produces a colored chip — green for safe-and-low-pressure, amber for moderate, red for high difficulty.

The phrase

hel lo my name is bil ly

What the right side shows

For every syllable Speech Console renders two rows of phoneme chips: anticipated (what the engine expects from the dictionary) and detected (what Azure Speech Services actually heard, scored 0–100). Below the chips are the live PAD bars for the current syllable, then the rolling Ground meta-parameter — it drops fast on stumbles and recovers slowly on success.
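The asymmetric behavior of Ground, a sharp drop on a stumble and a slow climb on success, can be sketched as a clamped update rule. The two rates here are illustrative assumptions; only the asymmetry is the point.

```python
# Sketch: the rolling Ground meta-parameter. The drop and recovery rates
# are illustrative assumptions; what matters is that the penalty for a
# stumble is much larger than the reward for one clean syllable.

def update_ground(ground: float, syllable_ok: bool,
                  drop: float = 0.30, recover: float = 0.05) -> float:
    if syllable_ok:
        return min(1.0, ground + recover)   # slow climb back
    return max(0.0, ground - drop)          # sharp penalty, clamped at 0

g = 1.0
g = update_ground(g, False)   # stumble on "bil"
g = update_ground(g, True)    # one clean syllable afterward
```

After a single stumble and a single recovery, Ground sits well below where it started, which is exactly the "drops fast, recovers slowly" shape the panel displays.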

Watch for

When the playhead hits "bil", the detected row shows two /b/ chips in red — that's a block: the engine caught the speaker re-attempting onset before the vowel landed. The PAD bars spike on Pressure and Difficulty at the same moment, and Ground drops. Speech Console makes that visible without the SLP having to flag it manually.


Articulation Trainer

The Articulation Trainer is a real-time anatomical view of speech production. Pick any sound — the diagram shows exactly which articulators are involved, where the air flows, and how the lips and tongue have to position themselves to produce it. Same engine that runs inside Speech Console for live phoneme feedback.

lips
tongue
airflow

Pick a sound

Or animate a word
Why this matters

Most people who stutter can produce any phoneme correctly under high effort. The clinical goal is to produce it with less effort. The trainer shows the mechanics — where the lips press, where the tongue contacts, where air flows — so the speaker has a model of "easy" to aim for, not just a target word.

Working diagram — your selection
how the mouth produces the sound you picked
Anatomy reference — tap to learn
click any colored part to see what it does for speech
Tap a body part
Each colored region above plays a specific role in speech production. Tap any of them to see how they shape sound.

Sound Bridge

Sound Bridge gives you two sounds and asks you to connect them without stopping your voice. You hold the first sound, slide smoothly into the second, then hold that one too — building a bridge of continuous voice between them.

The three phases

Phase 1
HOLD
Phase 2
SLIDE
Phase 3
HOLD

If your voice drops below the voicing threshold during the transition, the bridge breaks and you start over. The score combines hold steadiness with transition continuity.
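A minimal sketch of that scoring rule, under assumptions: a fixed voicing floor, steadiness measured as spread relative to the mean, continuity as the min/max ratio across the slide, and an equal 50/50 blend. None of these constants come from the product; only the structure (break on a voicing dropout, otherwise blend hold and transition quality) is from the description above.

```python
# Sketch: scoring one Sound Bridge attempt from per-frame voicing levels.
# A frame below the voicing floor during the slide breaks the bridge;
# otherwise the score blends hold steadiness with slide continuity.
# The floor, the steadiness formula, and the 50/50 blend are assumptions.

def bridge_score(hold1, slide, hold2, voicing_floor=0.05):
    # Bridge breaks if voice drops out anywhere during the transition.
    if any(v < voicing_floor for v in slide):
        return None                            # broken: start over

    def steadiness(seg):
        mean = sum(seg) / len(seg)
        spread = max(seg) - min(seg)
        return max(0.0, 1.0 - spread / mean)   # 1.0 = perfectly steady

    hold = (steadiness(hold1) + steadiness(hold2)) / 2
    continuity = min(slide) / max(slide)       # how level the slide stayed
    return 0.5 * hold + 0.5 * continuity

clean = bridge_score([0.2, 0.2, 0.2], [0.2, 0.18, 0.2], [0.2, 0.2, 0.2])
broken = bridge_score([0.2, 0.2, 0.2], [0.2, 0.0, 0.2], [0.2, 0.2, 0.2])
```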

Watch all three scenarios

The demo on the right auto-loops through clean → shaky → broken. Click any scenario below to jump to it.

Why this trains fluency

Coarticulation — the smooth chaining of phonemes — is exactly where many people who stutter block. Sound Bridge isolates that transition as the training target. The smoother your slide, the more your speech-motor system is relying on a single integrated motor plan rather than two sequential ones.

DIVA model connection

The DIVA model of feedforward speech control predicts that fluent transitions depend on the second sound's motor plan being prepared while the first is still being produced. Sound Bridge's continuity meter is a real-time proxy for that feedforward signal.


Rainbow Syllables — v12 SLP trial

Rainbow Syllables v12 is the SLP-facing trial build of the FluentPlay engine. Same scoring pipeline as Speech Console — but the v12 UI exposes every intermediate signal so a clinician can see exactly what the system is measuring at each moment.

What you see in the live game

The mockup on the right is the actual v12 game screen, animated with sample data. Top to bottom:

Score / Streak / Unit — session-wide stats and which syllable you're on
Phrase pills — every syllable in the phrase, color-coded by status
Anatomy panel — mini mouth diagram for the active syllable's leading phoneme
Vocalization blob — fills as the speaker holds voice
Anticipated phonemes — what the system expects to hear, populated the moment the syllable becomes active
Detected phonemes — what Azure SR actually heard, populated when the syllable completes — colored by match accuracy
PAD readout — Safety / Pressure / Difficulty / Ground for the just-completed syllable
Monitor footer — live RMS, voiced duration, onset count, fill %, DFS state — the raw signal feeding everything above

Why both phoneme rows

The anticipated row is the target — what the speaker is supposed to produce. The detected row is the reality — what Azure SR pulled from the audio. The gap between them is the actionable signal: where did the production drift from the target, and on which phoneme. SLPs can review this row-by-row after a session to identify exactly which phonemes need work.

Same engine, different framing

Rainbow Syllables v12 and Speech Console run the same scoring pipeline. Speech Console is the clean production framing — a single PAD breakdown with the technical details abstracted away. Rainbow Syllables v12 is the SLP trial build — every signal exposed for diagnostic review. The 20-session trial limit and the diagnostic monitor are what make it the SLP version.


Summit — legacy

Summit takes one word and asks you to repeat it many times. Each successful production climbs a stylized mountain trail. Each stumble drops the climber back toward basecamp. Reach the summit to win.

Settings

Reps
2 / 4 / 6 / 8
Descent
Gentle → Freefall

The scientific basis

This is voluntary stuttering in the Van Riper tradition, gamified. Repeating a word — especially a feared word — is the canonical exposure exercise for stuttering. The mountain frame turns that exposure into a goal-directed task: every repetition is altitude gained.

The descent mechanic explicitly punishes giving up. There's no skip button. There's no easy mode. The only way to the summit is to keep speaking through the moments you would normally avoid.

Habituation, not perfection

Summit doesn't reward smooth speech. It rewards sustained engagement with a hard word. Habituation reduces the amygdala threat response, which in turn stabilizes downstream motor gating — the actual mechanism by which exposure therapy works.

Watch a different climb

Cadence — legacy

Cadence is rhythm training for speech-motor timing. A horizontal lane runs across the screen with a fixed hit line on the left side. Bars spawn on the right and travel left at the chosen tempo. As each bar reaches the hit line, the speaker says the target phoneme. Voice onset detection at the hit line scores the timing as PERFECT, GOOD, or a miss.

How a measure runs

Eight beats per measure. Bars spawn at intervals of 60/BPM seconds. The hit window is asymmetric — more forgiving on the early side (~180px before the line) than the late side (~100px after). A PERFECT hit is within roughly the first third of the early window; everything else inside the window scores GOOD. A miss means the bar passed the late edge with no voice onset, OR the speaker fired voice with no bar in window (a "too early" error). Combo accumulates with consecutive hits and resets on any miss. Hit 75% or more of the eight beats and the BPM auto-advances by 10.
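The hit-window rules above can be sketched in a few lines. The exact pixel cutoffs come from the description; treating "first third of the early window" as the third nearest the hit line is my assumption.

```python
# Sketch: classifying a voice onset against Cadence's asymmetric hit
# window. Distances are in pixels from the hit line (positive = bar still
# approaching, negative = bar already past). Window sizes follow the
# description above; the PERFECT band placement is an assumption.

EARLY_PX, LATE_PX = 180, 100

def classify_hit(bar_distance_px: float) -> str:
    """Score a voice onset by where the nearest bar sits relative to the line."""
    if -LATE_PX <= bar_distance_px <= EARLY_PX:
        if 0 <= bar_distance_px <= EARLY_PX / 3:
            return "PERFECT"            # first third of the early window
        return "GOOD"
    return "MISS"                       # too early, or the bar already passed

def advance_bpm(bpm: int, hits: int, beats: int = 8) -> int:
    """Auto-advance tempo by 10 BPM when >= 75% of the beats were hit."""
    return bpm + 10 if hits / beats >= 0.75 else bpm
```

At eight beats per measure, 6 of 8 hits (exactly 75%) is enough to trigger the tempo bump.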

perfect / good
missed / too early
pending

Tempo

Slow tempos teach control; faster tempos train automaticity.

The scientific basis

Speaking in time with an external rhythm has a robust short-term fluency-enhancing effect — the metronomic fluency effect. Cadence gamifies the metronome. The deeper target is cerebellar timing coordination: cerebellar circuits govern speech rhythm, and the literature on cerebellar involvement in stuttering implicates those circuits in disfluency. Cadence is direct rhythm training for them.

Distributed practice

Tempo-based repetition spaces productions evenly across time, which is a stronger motor learning condition than block practice (the "say it 10 times in a row" pattern). Cadence delivers distributed practice without making the speaker think about it.


Rhythm Pad — legacy

Rhythm Pad is the diadochokinetic drill from the legacy suite. Five segment slots in a row. The speaker says the target phoneme into each slot in sequence — voice in the correct volume zone fills the slot, missing the window grays it out.

How a measure runs

Each slot has a fixed time window (about 1.2 seconds at the medium speed preset). A ghost fill sweeps across the slot showing time elapsed. When the speaker's voice lands inside the volume zone, the slot snaps to its target color and the ghost fill jumps to full — that's a hit. Then a release gate requires the voice to drop out of the zone before the next slot can be hit, which prevents one continuous tone from filling everything. If the window expires with no hit, the slot grays out and the run advances anyway. Five hits in a row clears the measure.
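The release gate is easiest to see in code. This sketch deliberately ignores the per-slot time window and the ghost fill to isolate the gate itself; the volume-zone bounds are illustrative assumptions.

```python
# Sketch: the Rhythm Pad release gate. A slot can only be hit after the
# voice has dropped OUT of the volume zone since the previous hit, so one
# held tone cannot fill every slot. Zone bounds are assumptions, and the
# per-slot timing window is omitted for clarity.

def run_measure(frames, zone=(0.1, 0.5), slots=5):
    lo, hi = zone
    hits, gate_open = 0, True        # gate starts open for the first slot
    for rms in frames:
        in_zone = lo <= rms <= hi
        if in_zone and gate_open and hits < slots:
            hits += 1                # slot snaps to its target color
            gate_open = False        # require a release before the next hit
        elif not in_zone:
            gate_open = True         # voice left the zone: gate re-opens
    return hits

held_tone = run_measure([0.3] * 50)        # one continuous tone
discrete = run_measure([0.3, 0.0] * 5)     # five separate productions
```

A held tone scores exactly one hit no matter how long it lasts; five discrete productions clear the measure.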

Streak

Three consecutive hits trigger the streak reward. The streak indicator at the bottom shows how many in a row the speaker has banked toward the goal. Any miss resets the streak counter to zero.

hit (voice in zone)
pending
missed (window expired)

The scientific basis

Diadochokinetic drilling is the canonical exercise for measuring and training speech-motor coordination. The clinical version uses /pa-ta-ka/ at increasing rates to assess motor sequencing. Rhythm Pad takes that exercise, gamifies it, and produces per-segment scoring data the SLP can review.

Why the release gate matters

Without the release gate, a speaker could just hold one continuous tone and the system would register every slot as a hit. The release gate forces discrete productions, which is the actual motor pattern being trained. It's a small mechanic with a large effect on what the game is measuring.


Bubble Hunt — legacy

Each bubble carries something to say — a sound, a syllable, or a word. The speaker plugs in whatever they want to practice, and the bubbles drift across the screen carrying that target. As each bubble reaches the zone, the speaker says it. Speech recognition (or volume, depending on mode) decides whether the bubble pops.

What the speaker plugs in

The bubble content is the whole point. A child practicing /b/ words sees bubbles labeled ball, baby, bus, big. An adult drilling consonant clusters sees bubbles labeled str, spl, scr. A speaker working on feared words sees bubbles labeled with the exact words they avoid. The game adapts to the practice, not the other way around.

Where volume comes in

Bubble size is a secondary feature. A small bubble wants a soft voice, a large bubble wants a loud voice. This trains intensity calibration — letting a speaker discover they can produce the same target at multiple effort levels. But the size is the side-channel; the label inside the bubble is the target.
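One way to sketch that intensity calibration: map bubble size to a target RMS window. The linear mapping, the RMS range, and the tolerance are all illustrative assumptions; the only property taken from the description is "small bubble wants a soft voice, large bubble wants a loud voice."

```python
# Sketch: mapping bubble size to an RMS target window for intensity
# calibration. The linear mapping, the 0.05..0.5 RMS range, and the
# +/-25% tolerance are illustrative assumptions.

def rms_target(bubble_size: float, tolerance: float = 0.25):
    """bubble_size in 0..1 -> (low, high) RMS window, soft to loud."""
    center = 0.05 + 0.45 * bubble_size      # assumed soft..loud RMS range
    return (center * (1 - tolerance), center * (1 + tolerance))

def pops(bubble_size: float, rms: float) -> bool:
    """Volume-only mode: the bubble pops when the voice lands in its window."""
    lo, hi = rms_target(bubble_size)
    return lo <= rms <= hi
```

Because the windows for the smallest and largest bubbles do not overlap, the speaker has to genuinely modulate effort, not just phonate.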

SR-ready

Like every game in the suite, Bubble Hunt can run on volume detection alone or on full speech recognition. With SR enabled, the bubble pops only when the speaker actually produces the target sound — not just any sound at the right volume.

Lineage to the current suite

The intensity-modulation training in Bubble Hunt evolved into the Volume Target selector in Rainbow Syllables, and from there into the RMS-based Vocal Force factor in Speech Console's Pressure dimension. The "user supplies the target" model carried forward into Speech Console's challenge-sound flagging.


Clinical workflow — how it fits into your sessions

The games are the visible product. The workflow on the right is the actual product. FluentPlay is built around a six-stage loop you run with a client over weeks or months — the games slot into the loop, the analytics close it.

The loop, end to end

Every stage produces data the next stage consumes. Assessment surfaces challenge sounds. Challenge sounds drive targeted articulation work. Articulation work feeds drill speed and intensity. Drill output gets reviewed against the baseline. Review drives the configuration of the next session. The loop runs until the client's PAD profile on a given phrase is stable in the green band — that's the operational definition of fluency for that phrase.

Why this matters for your practice

FluentPlay isn't a game suite that happens to log data. It's a clinical instrument with a game-shaped interaction layer. Your day-to-day workflow is what FluentPlay is engineered around — the games are the means by which that workflow becomes tractable for both children and adults who would otherwise refuse traditional drill therapy. Pediatric clients show up because it looks like a game. Adult clients stay because the data is real.

What you can pull from a session

Per-syllable PAD scores. Per-phoneme accuracy from Azure SR. Challenge-sound flag history. Session arc plots. Ground trajectory. RMS, voiced duration, and onset traces. Everything is available as structured data you can export for progress notes, insurance documentation, parent reports, or your own longitudinal tracking across the client's history.

1 ASSESS → 2 FLAG → 3 TRAIN → 4 DRILL → 5 REVIEW → 6 ITERATE
STAGE 1
Assess
Establish a baseline PAD profile across a target phrase
WHAT THE SLP DOES
Picks a phrase the client struggles with and runs Speech Console while the client speaks it. No coaching, no intervention — just baseline capture.
WHAT FLUENTPLAY PRODUCES
Per-syllable Safety, Pressure, Difficulty scores across the phrase. The Ground trajectory. A first read on which syllables sit in the red band.
→ Speech Console

Bringing FluentPlay into your practice

Three current games on one shared engine. Six legacy games at fluentplay.itch.io. One PAD framework underneath all of them. Built for clinicians who need real measurement, not just engagement.

Current product — runs in any browser
SPEECH CONSOLE
Flagship PAD scoring engine. Per-syllable Safety / Pressure / Difficulty breakdown across 21 weighted factors. Three presets, custom mode, challenge-sound flagging, real-time scoring during phrase production.
SOUND BRIDGE
Coarticulation training. Hold first sound, slide to second, hold again — without breaking voice. Continuous-phonation drill grounded in the DIVA model of feedforward speech control.
ARTICULATION TRAINER
Per-phoneme effort visualization. Ten articulator diagrams light up green / amber / red based on the actual mechanical effort each sound cost. Trains low-effort production directly.
Legacy suite (Unity WebGL, fluentplay.itch.io)
RAINBOW SYLLABLES
Syllable-blob fill game. Conceptual ancestor of Speech Console.
SUMMIT
Voluntary stuttering as a mountain climb. Repetition-based exposure.
CADENCE
Guitar-Hero-style rhythm training. Metronomic fluency effect.
BUBBLE HUNT
Intensity calibration. Match voice volume to bubble size.
RHYTHM PAD
Diadochokinetic phoneme repetition drill.
BRIDGE
Unity ancestor of the current Sound Bridge.
Getting started with a client

FluentPlay is currently in clinical pilot with a small group of practices. If you'd like to join the pilot or run a demo session with one of your clients, send me an email and I'll set you up directly. No app store, no install — you'll get a link, your client speaks into a browser, and the data flows back to you.

Reach out: [email protected]  |  fluentplaytech.com

Will Carbone  |  FluentPlay Technologies LLC  |  Somerville, MA  |  April 2026  |  Provisional Patent Pending