Author: Min Htet Myet · Last revised: 2026-05-19 · Status: Internal research note. Scope is restricted to techniques actually implemented in this repository; speculative extensions are explicitly labelled.
Simulated Self is a single-page web application that ingests a single RGB webcam stream plus a microphone and produces a co-present, real-time digital twin rendered as a 3D humanoid. The system is built around two research themes:
- Human Modeling and Augmentation — recovering body/hand pose and facial-expression state from a 2D image stream, retargeting that state to a skeletal humanoid, and exposing controls that let a user explore a mirrored representation of themselves.
- Extension of Human Sensory/Motor Functions — pairing the visual pipeline with a streaming speech ↔ language-model ↔ speech loop so the avatar becomes a conversational partner whose presence is contingent on the user's body, voice, and affect.
The system is intentionally browser-native and zero-backend (apart from the LLM endpoint), trading model capacity for latency, privacy, and deployability. This document describes the methodology, the engineering trade-offs, the limitations we currently inherit, and a comparison of the techniques that we evaluated and shipped.
| Modality | Sensor | Sampling |
|---|---|---|
| RGB video | getUserMedia({ video }) | Camera-default, typically 640×480 @ 30 fps |
| Microphone audio | getUserMedia({ audio }) | 48 kHz, mono, with browser AEC/NS/AGC |
| Spoken prompts | webkitSpeechRecognition | Continuous, interim results |
| Channel | Implementation |
|---|---|
| 3D humanoid pose | THREE.Group driven by MediaPipe landmarks (SkeletonRenderer, HumanoidRobot) |
| Facial affect mirror | FaceManager blendshape-style updates from useImprovedEmotionDetection |
| Streamed text response | Groq llama-3.1-8b-instant, chunked tokens to a glass HUD |
| Spoken response | SpeechSynthesisUtterance queued once the stream terminates |
| UI telemetry | VoiceChatPanel (state), MicLevelMeter (RMS), status pills (camera/pose/hands) |
┌──── MediaPipe Pose (33 landmarks) ────┐
RGB video ──┤ ├──► SkeletonRenderer ──► HumanoidRobot (Three.js)
├──── MediaPipe Hands (21 × 2) │
└──── face-api.js (7-class expression) ──┘
Mic audio ─► Web Speech API ─► transcript ─► Groq stream ─► tokens ─► HUD
└─► SpeechSynthesis ─► speaker
Mic audio ─► AnalyserNode (RMS) ─► MicLevelMeter (UI)
Given a monocular RGB stream, produce a 3D representation of the user that (a) is temporally stable enough to drive a humanoid avatar at interactive frame rates, (b) covers body, hands, and face simultaneously, and (c) runs fully inside a commodity browser without backend offload.
We use MediaPipe Pose (33 landmarks) and MediaPipe Hands (21 per
hand) loaded lazily from cdn.jsdelivr.net at runtime
(src/hooks/useMediaPipePoseDetection.ts). Each landmark is {x, y, z, visibility?} in normalised image space; z is a relative depth estimate
from MediaPipe's regression head and is not metric.
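To make the landmark format concrete, the following is a minimal sketch of mapping one normalised landmark into a centred, y-up scene space. The helper `toScenePoint`, the `aspect` and `depthScale` parameters, and the sign conventions are illustrative assumptions, not code taken from the repository.

```typescript
// Landmark shape as described above: normalised image-space coordinates.
interface Landmark {
  x: number;           // 0..1, left to right across the image
  y: number;           // 0..1, top to bottom (image convention)
  z: number;           // relative depth estimate, NOT metric
  visibility?: number; // optional per-landmark score
}

interface ScenePoint {
  x: number;
  y: number;
  z: number;
}

// Hypothetical helper: centre the landmark on the origin, flip y so that "up"
// is positive, and scale x by the aspect ratio so limbs are not squashed.
// The z sign flip and depthScale are assumptions chosen for illustration.
function toScenePoint(lm: Landmark, aspect = 640 / 480, depthScale = 1): ScenePoint {
  return {
    x: (lm.x - 0.5) * aspect,
    y: 0.5 - lm.y,
    z: -lm.z * depthScale,
  };
}

// Example: the image centre maps to the scene origin.
const centre = toScenePoint({ x: 0.5, y: 0.5, z: 0 });
console.log(centre.x, centre.y);
```

Because z is only relative, any such mapping yields plausible limb orientation rather than metric depth.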
Connection topology is encoded explicitly in
src/components/SkeletonRenderer.tsx (POSE_CONNECTIONS,
HAND_CONNECTIONS) so we never depend on MediaPipe's drawing utility for
the 3D scene.
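An explicit topology can be sketched as a list of index pairs that is flattened into a position buffer for line segments. The subset below uses the MediaPipe Pose indices for shoulders (11, 12), elbows (13, 14), wrists (15, 16) and hips (23, 24). The name `POSE_CONNECTIONS_SUBSET`, the helper `buildSegmentPositions` and the 0.5 visibility cut-off are assumptions for illustration; the repository's actual tables may differ.

```typescript
type Connection = readonly [number, number];

interface Landmark {
  x: number;
  y: number;
  z: number;
  visibility?: number;
}

// Illustrative subset of the body topology (upper body plus torso box).
const POSE_CONNECTIONS_SUBSET: readonly Connection[] = [
  [11, 12], [11, 13], [13, 15], [12, 14], [14, 16],
  [11, 23], [12, 24], [23, 24],
];

// Flattens every sufficiently visible connection into
// [x0, y0, z0, x1, y1, z1, ...], the layout a line-segment buffer expects.
function buildSegmentPositions(
  landmarks: readonly Landmark[],
  connections: readonly Connection[] = POSE_CONNECTIONS_SUBSET,
  minVisibility = 0.5, // hypothetical threshold
): Float32Array {
  const out: number[] = [];
  for (let i = 0; i < connections.length; i++) {
    const p = landmarks[connections[i][0]];
    const q = landmarks[connections[i][1]];
    if (!p || !q) continue; // index outside the landmark array
    const pVisible = (p.visibility ?? 1) >= minVisibility;
    const qVisible = (q.visibility ?? 1) >= minVisibility;
    if (!pVisible || !qVisible) continue; // skip occluded joints
    out.push(p.x, p.y, p.z, q.x, q.y, q.z);
  }
  return new Float32Array(out);
}
```

Keeping the pairs as plain data means the same table can drive both the raw-landmark view and any debugging overlay.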
SkeletonRenderer and HumanoidRobot share a THREE.Group. The humanoid
exposes seven sub-managers, each responsible for a slice of state:
| Sub-manager | Responsibility |
|---|---|
| ModelLoader | Loads the GLB shell |
| LimbManager | Arm/leg bone rotations from pose landmarks |
| ShoulderManager | Clavicle/shoulder corrections to stabilise upper-body chain |
| FingerManager | Per-finger rotations from hand landmarks |
| FaceManager | Expression-driven facial updates |
| MaterialManager | Material/shader state |
| VisibilityManager | Toggles between humanoid mode and raw landmark mode |
This decomposition keeps each transformation small and locally testable; it also lets us swap individual managers (e.g. a different finger retargeting algorithm) without touching the rest of the pipeline.
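As an illustration of how small one manager's transformation can be, the sketch below derives an upper-arm rotation from two landmarks by rotating an assumed rest direction onto the shoulder-to-elbow direction. The rest direction (-Y), the function names, and the use of a shortest-arc quaternion are assumptions; the note does not specify how LimbManager computes its rotations.

```typescript
type Vec3 = [number, number, number];

interface Quat {
  x: number;
  y: number;
  z: number;
  w: number;
}

function normalize(v: Vec3): Vec3 {
  const len = Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
  return len > 0 ? [v[0] / len, v[1] / len, v[2] / len] : [0, 0, 0];
}

// Shortest-arc quaternion rotating unit vector `from` onto unit vector `to`.
function quatFromUnitVectors(from: Vec3, to: Vec3): Quat {
  let w = 1 + (from[0] * to[0] + from[1] * to[1] + from[2] * to[2]);
  let x: number;
  let y: number;
  let z: number;
  if (w < 1e-6) {
    // Vectors are opposite: rotate 180 degrees about a perpendicular axis.
    w = 0;
    if (Math.abs(from[0]) > Math.abs(from[2])) {
      x = -from[1]; y = from[0]; z = 0;
    } else {
      x = 0; y = -from[2]; z = from[1];
    }
  } else {
    // Rotation axis is the cross product from × to.
    x = from[1] * to[2] - from[2] * to[1];
    y = from[2] * to[0] - from[0] * to[2];
    z = from[0] * to[1] - from[1] * to[0];
  }
  const n = Math.sqrt(x * x + y * y + z * z + w * w);
  return { x: x / n, y: y / n, z: z / n, w: w / n };
}

// Hypothetical rest pose: the upper arm hangs straight down (-Y, y-up space).
const REST_DIRECTION: Vec3 = [0, -1, 0];

// Inputs are shoulder and elbow positions already mapped into scene space.
function upperArmRotation(shoulder: Vec3, elbow: Vec3): Quat {
  const dir = normalize([
    elbow[0] - shoulder[0],
    elbow[1] - shoulder[1],
    elbow[2] - shoulder[2],
  ]);
  return quatFromUnitVectors(REST_DIRECTION, dir);
}
```

A function of this shape can be unit-tested without a renderer, which is the property the decomposition is meant to preserve.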
useImprovedEmotionDetection uses @vladmandic/face-api's
tinyFaceDetector + faceExpressionNet. We deliberately chose this
combination because:
- TinyFaceDetector at inputSize=416, scoreThreshold=0.5 runs comfortably on the CPU/Wasm backend used in the browser.
- The expression network returns the canonical seven-class probability vector (happy, sad, angry, surprised, neutral, fearful, disgusted).
Predictions are sampled on a 400 ms interval (not per-frame) and pushed
through a temporal smoother that keeps the last 6 predictions or
predictions newer than 2.5 s — whichever is smaller. The dominant emotion
is the one maximising count × mean_confidence, and the state only
publishes when the dominant class changes or confidence exceeds 0.6. This
is a deliberate trade-off: we accept higher latency on rare expressions
(e.g. fearful) in exchange for stability against single-frame noise that
would otherwise cause the avatar's face to flicker.
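The smoothing rule above can be sketched as a small class. This is a reading of the description, not the hook's actual code: the class name `EmotionSmoother`, the `Prediction` shape, and the choice to compare the dominant class's mean confidence against 0.6 are assumptions.

```typescript
type Emotion =
  | "happy" | "sad" | "angry" | "surprised"
  | "neutral" | "fearful" | "disgusted";

interface Prediction {
  emotion: Emotion;
  confidence: number;
  timestampMs: number;
}

const MAX_WINDOW = 6;           // last 6 predictions
const MAX_AGE_MS = 2500;        // or predictions newer than 2.5 s
const PUBLISH_CONFIDENCE = 0.6; // publish threshold

class EmotionSmoother {
  private window: Prediction[] = [];
  private published: Emotion | null = null;

  // Returns the emotion to publish, or null when nothing should be published.
  push(p: Prediction): Emotion | null {
    this.window.push(p);
    // Applying both limits keeps whichever set is smaller.
    this.window = this.window
      .filter((q) => p.timestampMs - q.timestampMs <= MAX_AGE_MS)
      .slice(-MAX_WINDOW);

    const stats: Partial<Record<Emotion, { count: number; sum: number }>> = {};
    for (const q of this.window) {
      const s = stats[q.emotion] ?? { count: 0, sum: 0 };
      stats[q.emotion] = { count: s.count + 1, sum: s.sum + q.confidence };
    }

    let best: Emotion = p.emotion;
    let bestScore = -Infinity;
    let bestMean = 0;
    for (const key of Object.keys(stats) as Emotion[]) {
      const s = stats[key]!;
      const mean = s.sum / s.count;
      const score = s.count * mean; // count × mean_confidence
      if (score > bestScore) {
        best = key;
        bestScore = score;
        bestMean = mean;
      }
    }

    // Assumption: "confidence" means the dominant class's mean confidence.
    if (best !== this.published || bestMean > PUBLISH_CONFIDENCE) {
      this.published = best;
      return best;
    }
    return null;
  }
}
```

Note that count × mean confidence equals the summed confidence of a class within the window, so a single high-confidence frame cannot outvote several moderate ones. This is the stability property the trade-off relies on.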
| Decision | What we shipped | What we explicitly rejected | Reason |
|---|---|---|---|
| Pose detector | MediaPipe Pose + Hands | MediaPipe Holistic single graph; BlazePose GHUM full mesh; OpenPose server | Holistic raises latency budget per frame; OpenPose needs a server; GHUM not exposed in JS build |
| Face affect | @vladmandic/face-api TinyFaceDetector + FaceExpressionNet | Larger SSD-MobileNet face detector; AU regression (e.g. OpenFace) | SSD adds ~3× latency in-browser; OpenFace has no first-class web build |
| Emotion temporal model | Windowed "count × mean confidence" vote (last 6 / 2.5 s) | Exponential moving average; HMM/Viterbi over class trajectories | EMA over-weights stale frames at low sampling rate; HMM was out of scope |
| Avatar driving | Direct landmark → bone rotation in LimbManager / FingerManager | Full inverse kinematics solver (CCD/FABRIK) | The MediaPipe skeleton is already pose-complete; IK was unnecessary and adds drift |
| Hand handedness | User instruction overlay: "flip both palms forward at start" | Automatic re-labelling per frame | MediaPipe's left/right flip under occlusion is unsolved here; the instruction is cheaper |
Note that we also keep two parallel "advanced" / "improved" emotion hooks
(useAdvancedEmotionDetection, useImprovedEmotionDetection) so we can
A/B switch detector configurations without losing the older path. Only
Improved drives the avatar's face in the current UI.
- Monocular depth is approximate. MediaPipe's z is a relative regression; the humanoid is consequently driven primarily by 2D projection and inferred limb orientation, not metric reconstruction.
- No multi-person support. All three pipelines assume a single subject.
- Occlusion of either hand below the desk or behind the torso causes finger sub-manager updates to freeze on the last seen pose.
- Affect ≠ emotion. face-api.js predicts visual expression categories, not internal affective state; "neutral" dominates in normal conversation by construction.
- Lighting sensitivity. Low-light or backlit conditions degrade TinyFaceDetector and the pose graph simultaneously; we surface this only indirectly through landmark visibility scores.
- CDN dependency. MediaPipe scripts are loaded at runtime from jsdelivr; an offline user cannot use the pose pipeline today.
Augment the user's voice with a low-latency, interruptible conversational partner that also reflects their physical and affective state, while running inside the browser-security envelope and respecting a fixed token budget.
The microphone is captured twice, with explicit separation of concerns:
| Consumer | Hook / module | Purpose |
|---|---|---|
| Speech recognition | useVoiceInteraction → Web Speech API | Continuous transcript with interim results |
| Loudness telemetry | useMicLevel → AnalyserNode (RMS) | Drives the new MicLevelMeter and silence detection |
This separation lets us show "I can't hear you" feedback even when the SpeechRecognition engine has not yet emitted a result.
useMicLevel opens its own getUserMedia stream with
echoCancellation/noiseSuppression/autoGainControl enabled, builds an
AnalyserNode with fftSize = 1024 and
smoothingTimeConstant = 0.6, and computes RMS of the time-domain
buffer per requestAnimationFrame. We chose RMS over peak amplitude
because RMS correlates better with perceived loudness and is more
forgiving of transient clicks.
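The RMS-versus-peak argument can be checked with a few lines. In the browser the buffer would be filled by `AnalyserNode.getFloatTimeDomainData`; here a synthetic buffer stands in for it, and the helper names `computeRms` and `computePeak` are illustrative.

```typescript
// Root-mean-square of a time-domain buffer (samples nominally in [-1, 1]).
function computeRms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Largest absolute sample value in the buffer.
function computePeak(samples: Float32Array): number {
  let peak = 0;
  for (let i = 0; i < samples.length; i++) {
    const magnitude = Math.abs(samples[i]);
    if (magnitude > peak) peak = magnitude;
  }
  return peak;
}

// A 1024-sample window of silence containing one full-scale click.
const click = new Float32Array(1024);
click[100] = 1;

console.log(computePeak(click)); // 1
console.log(computeRms(click));  // 0.03125, i.e. sqrt(1 / 1024)
```

A peak meter would jump to full scale on the click, while the RMS value stays near the noise floor. This is the behaviour wanted for a "can you hear me" indicator.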
A sliding peak (peakRef) decays at ~8 % per 60 ms so the UI shows a
hairline marker without flickering. The hook publishes:
- level ∈ [0, 1]: gamma-corrected RMS for UI rendering,
- peak ∈ [0, 1]: decaying max,
- isSilent: true after 1.5 s below SILENCE_THRESHOLD = 0.015,
- error and permissionState: surfaced as actionable copy in the meter.
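The published values can be modelled as a pure per-frame update, sketched below. Several details are assumptions because the note does not state them: the gamma exponent (0.5 here), the decision to compare raw RMS rather than the gamma-corrected level against the threshold, and the names `MeterState` and `stepMeter`.

```typescript
const SILENCE_THRESHOLD = 0.015;
const SILENCE_HOLD_MS = 1500;      // isSilent after 1.5 s below threshold
const PEAK_DECAY_PER_60_MS = 0.92; // roughly 8 % decay per 60 ms
const GAMMA = 0.5;                 // hypothetical exponent for UI scaling

interface MeterState {
  level: number;               // gamma-corrected RMS in [0, 1]
  peak: number;                // decaying maximum in [0, 1]
  isSilent: boolean;
  quietSinceMs: number | null; // when the signal last dropped below threshold
}

function initialMeterState(): MeterState {
  return { level: 0, peak: 0, isSilent: false, quietSinceMs: null };
}

// Advances the meter by one frame. `dtMs` is the time since the last frame.
function stepMeter(prev: MeterState, rms: number, nowMs: number, dtMs: number): MeterState {
  const level = Math.min(1, Math.pow(Math.max(0, rms), GAMMA));
  // Scale the decay by elapsed time so it is independent of frame rate.
  const decayed = prev.peak * Math.pow(PEAK_DECAY_PER_60_MS, dtMs / 60);
  const peak = Math.max(level, decayed);
  const quietSinceMs = rms < SILENCE_THRESHOLD ? (prev.quietSinceMs ?? nowMs) : null;
  const isSilent = quietSinceMs !== null && nowMs - quietSinceMs >= SILENCE_HOLD_MS;
  return { level, peak, isSilent, quietSinceMs };
}
```

Expressing the decay as a power of elapsed time keeps the marker's fall-off consistent if requestAnimationFrame is throttled, for example in a background tab.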
src/services/geminiService.ts (named for legacy reasons) is a Groq SDK
wrapper exposing an AsyncGenerator<string> over Groq's
chat.completions.create({ stream: true }). The contract is intentionally
narrow:
- Model. llama-3.1-8b-instant, chosen for first-token latency, not ceiling capability.
- System prompt. Bounded to "≤50 words, plain conversational, TTS-safe" so TTS does not stall on bullet lists or code.
- Budget. max_completion_tokens = 256 to bound TTS time and cost.
- Hard timeout. 15 s via an internal AbortController.
- Retry. Up to 2 retries with exponential backoff (300 × 2ⁿ, capped at 2 s) for 429 and 5xx only; never on user cancel.
- Errors. Mapped to a GroqServiceError taxonomy (NOT_CONFIGURED | TIMEOUT | NETWORK | UPSTREAM | UNKNOWN) so the UI can decide whether to expose a "Try again" button.
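The retry policy in the contract above can be sketched as three small pure functions. The base delay is read as 300 ms with n counted from 0; both are assumptions, as are the function names. How HTTP statuses map onto GroqServiceError codes is not described in this note and is left out.

```typescript
const MAX_RETRIES = 2;
const BASE_DELAY_MS = 300; // assumption: the "300" in 300 × 2ⁿ is milliseconds
const MAX_DELAY_MS = 2000; // "capped at 2 s"

// Delay before retry number `attempt` (0-based): 300 × 2^attempt, capped.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

// Only rate limiting (429) and server errors (5xx) are retried.
function isRetryableStatus(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// `attempt` is the number of retries already made for this request.
function shouldRetry(status: number, attempt: number, cancelledByUser: boolean): boolean {
  if (cancelledByUser) return false; // never retry after a user cancel
  return attempt < MAX_RETRIES && isRetryableStatus(status);
}

console.log(backoffDelayMs(0), backoffDelayMs(1)); // 300 600
```

Under this reading the two permitted retries wait 300 ms and 600 ms, so the 2 s cap acts as a guard in case the retry count is raised later.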
useVoiceInteraction is the single source of truth for the voice loop:
idle ──startListening──► listening ──final transcript──► processing
│
▼
error ◄──── streaming ◄────── (first token)
│
▼
speaking ──onend──► idle
cancel() aborts both the in-flight stream and speechSynthesis.
retry() re-runs the last prompt. Streaming tokens are accumulated in
partialResponse for the HUD and only handed to SpeechSynthesisUtterance
after the stream terminates — speaking incremental chunks produces a
choppy prosody profile and was rejected.
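One way to read the diagram is as a pure transition function. The state names follow the diagram; the event names other than startListening, the treatment of cancel as a global reset to idle, and the recovery paths out of error are assumptions made for this sketch.

```typescript
type VoiceState =
  | "idle" | "listening" | "processing"
  | "streaming" | "speaking" | "error";

// Hypothetical event names, apart from "startListening".
type VoiceEvent =
  | "startListening" | "finalTranscript" | "firstToken"
  | "streamEnd" | "speechEnd" | "failure" | "retry" | "cancel";

function nextVoiceState(state: VoiceState, event: VoiceEvent): VoiceState {
  // Assumption: cancel() always returns to idle, whatever is in flight.
  if (event === "cancel") return "idle";
  switch (state) {
    case "idle":
      return event === "startListening" ? "listening" : state;
    case "listening":
      return event === "finalTranscript" ? "processing" : state;
    case "processing":
      if (event === "firstToken") return "streaming";
      return event === "failure" ? "error" : state;
    case "streaming":
      // Speech only starts once the stream has terminated.
      if (event === "streamEnd") return "speaking";
      return event === "failure" ? "error" : state;
    case "speaking":
      return event === "speechEnd" ? "idle" : state;
    case "error":
      // retry() re-runs the last prompt; listening again is also allowed.
      if (event === "retry") return "processing";
      return event === "startListening" ? "listening" : state;
  }
}

// Happy path: one full turn returns to idle.
const turn: VoiceEvent[] = [
  "startListening", "finalTranscript", "firstToken", "streamEnd", "speechEnd",
];
let current: VoiceState = "idle";
for (const event of turn) current = nextVoiceState(current, event);
console.log(current); // "idle"
```

Events that are not valid in the current state are ignored rather than thrown, which keeps late callbacks (for example a token arriving after cancel) from corrupting the loop.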
| Decision | What we shipped | What we explicitly rejected | Reason |
|---|---|---|---|
| LLM provider | Groq llama-3.1-8b-instant (streaming) | Google Gemini (the original integration; quota-bound for our key) | Groq's hosted Llama gives sub-second first-token in practice and is provider-neutral |
| Streaming UX | Accumulate tokens in HUD, speak on completion | Speak each chunk as it arrives | Per-chunk TTS produces choppy prosody and double-trips on speechSynthesis.cancel() |
| Speech in | Web Speech API (webkitSpeechRecognition) | Whisper.cpp WASM in-browser | Whisper WASM models are 100–300 MB and would dominate the per-session download budget |
| Speech out | SpeechSynthesisUtterance | Cloud TTS (ElevenLabs, Polly) | Adds a paid dependency and a second round-trip; not required for the prototype |
| Mic level | AnalyserNode RMS + sliding peak | Peak-only meter; FFT-bar visualiser | RMS tracks perceived loudness; bar visualiser implied louder mic state than RMS did |
| Error handling | Typed GroqServiceError + targeted retry policy | Generic try/catch with toast | We need to distinguish retryable (429/5xx/timeout) from terminal (other 4xx/cancel) |
| Mic permission UX | Inline tips in the meter referencing the address-bar lock | Generic "permission denied" alert | Users on Chromium need site-settings instructions, not a modal |
- Browser support skew. Web Speech API quality (and existence) varies significantly across browsers. Chromium delegates to a Google service and requires network connectivity; Safari's implementation is on-device but less accurate; Firefox has no production support.
- API key exposure model. VITE_GROQ_API_KEY is inlined into the client bundle by Vite at build time. This is acceptable for a personal demo but unsafe for public deployment; the documented production path is a Lovable Cloud edge function (see docs/SECURITY.md).
- No barge-in. While the avatar is speaking, the recogniser is paused to avoid feedback. A true barge-in would require AEC tuned for the TTS output and an explicit cross-channel canceller, which is out of scope.
- TTS prosody is constrained. SpeechSynthesisUtterance exposes only rate/pitch/volume; SSML support across browsers is partial.
- One-shot context. The Groq call sends only {system, user}; we do not yet maintain conversation history. This is deliberate for the demo: long histories raise both latency and token cost.
Each pipeline (pose, hands, face affect, voice in, voice out, LLM) owns its own React hook, its own loading state, and its own error state. A failure in any one does not block the others — e.g. the AI key being missing leaves the pose pipeline intact and only disables the voice HUD's "Generate" path.
| Pipeline | Budget per cycle | Mechanism |
|---|---|---|
| Three.js render | 16.6 ms (60 fps) | Mutates THREE.Group directly; React not involved |
| Pose / hands | 33 ms (30 fps) | MediaPipe onResults callback |
| Facial expression | 400 ms | setInterval-driven, not per-frame |
| AI first token | ≤ 600 ms target | Groq streaming + no client buffering |
| Mic meter | 16.6 ms (rAF) | RMS over a 1024-sample window |
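All vision and audio-level analysis runs on-device in the browser; the application code itself uploads no frames or audio buffers. The network traffic is:
- MediaPipe + face-api model files (CDN).
- The textual prompt + system message sent to Groq.
- The textual completion returned from Groq.
- Microphone audio handled by the browser's own speech recogniser where recognition is not on-device (Chromium's Web Speech API delegates to a Google service).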
This is documented and reinforced in docs/SECURITY.md.
| Artefact | Pinned by |
|---|---|
| Library versions | package.json + bun.lockb |
| Model versions | MediaPipe & face-api CDN URLs (no SHA pin — known risk) |
| LLM behaviour | Model id llama-3.1-8b-instant, temperature=0.8 |
| Sampling cadence | Hard-coded intervals (400 ms emotion, rAF render/meter) |
- Single-subject, single-camera assumption. None of the perception pipelines is multi-user.
- No metric calibration. Without a stereo rig or IMU, depth and absolute joint angles are approximate.
- Affect detection is visual only. No prosody-based affect from the microphone is fused in; the LLM is text-only.
- The LLM is non-deterministic and non-grounded. With temperature=0.8 and no retrieval layer, identical prompts produce different replies; the system makes no factual guarantees.
- Browser drift. Web Speech API, AudioWorklet, and permissions.query({ name: 'microphone' }) semantics change between browser releases; regressions show up as silently degraded UX rather than hard failures.
- CDN trust. MediaPipe scripts and face-api weights are loaded from third-party CDNs; in principle a compromised CDN could ship hostile code into the page.
Listed here for completeness; none of these items are present in the repository and they must not be cited as current capabilities.
- LLM call routed through a Lovable Cloud edge function so the API key never reaches the browser bundle.
- Multi-turn conversation memory with per-session truncation.
- Audio-prosody affect fusion (e.g. pitch / energy → arousal).
- Barge-in (talk-over) with full-duplex AEC.
- Whisper-WASM transcription as a Firefox fallback.
- IK-corrected limb retargeting for tighter joint constraints.
| Concept | File |
|---|---|
| Pose / hand landmark capture | src/hooks/useMediaPipePoseDetection.ts |
| Skeleton + humanoid retargeting | src/components/SkeletonRenderer.tsx, src/components/HumanoidRobot.tsx, src/components/robot/* |
| Facial expression | src/hooks/useImprovedEmotionDetection.ts |
| Voice state machine | src/hooks/useVoiceInteraction.ts |
| LLM client | src/services/geminiService.ts |
| Voice HUD | src/components/VoiceChatPanel.tsx |
| Mic level meter | src/hooks/useMicLevel.ts, src/components/MicLevelMeter.tsx |
| Permission onboarding | src/components/PermissionHandler.tsx |