Research Report — Simulated Self

Author: Min Htet Myet · Last revised: 2026-05-19 · Status: Internal research note. Scope is restricted to techniques actually implemented in this repository; speculative extensions are explicitly labelled.

0. Abstract

Simulated Self is a single-page web application that ingests one RGB webcam stream plus microphone audio and produces a co-present, real-time digital twin rendered as a 3D humanoid. The system is built around two research themes:

Human Modeling and Augmentation — recovering body/hand pose and facial-expression state from a 2D image stream, retargeting that state to a skeletal humanoid, and exposing controls that let a user explore a mirrored representation of themselves.
Extension of Human Sensory/Motor Functions — pairing the visual pipeline with a streaming speech ↔ language-model ↔ speech loop so the avatar becomes a conversational partner whose presence is contingent on the user's body, voice, and affect.

The system is intentionally browser-native and zero-backend (apart from the LLM endpoint), trading model capacity for latency, privacy, and deployability. This document describes the methodology, the engineering trade-offs, the limitations we currently inherit, and a comparison of the techniques that we evaluated and shipped.

1. System under study

1.1 Inputs

Modality	Sensor	Sampling
RGB video	`getUserMedia({ video })`	Camera-default, typically 640×480 @ 30 fps
Microphone audio	`getUserMedia({ audio })`	48 kHz, mono, with browser AEC/NS/AGC
Spoken prompts	`webkitSpeechRecognition`	Continuous, interim results

1.2 Outputs

Channel	Implementation
3D humanoid pose	`THREE.Group` driven by MediaPipe landmarks (`SkeletonRenderer`, `HumanoidRobot`)
Facial affect mirror	`FaceManager` blendshape-style updates from `useImprovedEmotionDetection`
Streamed text response	Groq `llama-3.1-8b-instant`, chunked tokens to a glass HUD
Spoken response	`SpeechSynthesisUtterance` queued once the stream terminates
UI telemetry	`VoiceChatPanel` (state), `MicLevelMeter` (RMS), status pills (camera/pose/hands)

1.3 Pipeline at a glance

            ┌──── MediaPipe Pose  (33 landmarks) ────┐
RGB video ──┤                                       ├──► SkeletonRenderer ──► HumanoidRobot (Three.js)
            ├──── MediaPipe Hands (21 × 2)           │
            └──── face-api.js (7-class expression) ──┘

Mic audio ─► Web Speech API ─► transcript ─► Groq stream ─► tokens ─► HUD
                                                              └─► SpeechSynthesis ─► speaker

Mic audio ─► AnalyserNode (RMS) ─► MicLevelMeter (UI)

2. Theme 1 — Human Modeling and Augmentation

2.1 Problem statement

Given a monocular RGB stream, produce a 3D representation of the user that (a) is temporally stable enough to drive a humanoid avatar at interactive frame rates, (b) covers body, hands, and face simultaneously, and (c) runs fully inside a commodity browser without backend offload.

2.2 Methodology

2.2.1 Body and hand pose

We use MediaPipe Pose (33 landmarks) and MediaPipe Hands (21 per hand) loaded lazily from cdn.jsdelivr.net at runtime (src/hooks/useMediaPipePoseDetection.ts). Each landmark is {x, y, z, visibility?} in normalised image space; z is a relative depth estimate from MediaPipe's regression head and is not metric.

Connection topology is encoded explicitly in src/components/SkeletonRenderer.tsx (POSE_CONNECTIONS, HAND_CONNECTIONS) so we never depend on MediaPipe's drawing utility for the 3D scene.
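
As a rough illustration of the landmark shape and the explicit connection table described above, the sketch below turns a landmark array into drawable segments. The `Landmark` fields follow the text; `UPPER_BODY_CONNECTIONS`, `toSegments`, and the 0.5 visibility cut-off are hypothetical names and values chosen for this example. The index pairs are only an upper-body subset of the MediaPipe Pose topology, not the repository's `POSE_CONNECTIONS`.

```typescript
// Landmark shape as described in the text: normalised image space, relative z.
interface Landmark {
  x: number;
  y: number;
  z: number;
  visibility?: number;
}

// Upper-body subset of the MediaPipe Pose topology (index pairs).
// 11/12 = shoulders, 13/14 = elbows, 15/16 = wrists, 23/24 = hips.
const UPPER_BODY_CONNECTIONS: ReadonlyArray<readonly [number, number]> = [
  [11, 12], [11, 13], [13, 15], [12, 14], [14, 16],
  [11, 23], [12, 24], [23, 24],
];

// Hypothetical cut-off: skip segments whose endpoints are poorly tracked.
const MIN_VISIBILITY = 0.5;

function toSegments(
  landmarks: ReadonlyArray<Landmark>,
  connections: ReadonlyArray<readonly [number, number]> = UPPER_BODY_CONNECTIONS,
): Array<[Landmark, Landmark]> {
  const segments: Array<[Landmark, Landmark]> = [];
  for (const [a, b] of connections) {
    const from = landmarks[a];
    const to = landmarks[b];
    if (!from || !to) continue; // landmark array shorter than expected
    const visible =
      (from.visibility ?? 1) >= MIN_VISIBILITY &&
      (to.visibility ?? 1) >= MIN_VISIBILITY;
    if (visible) segments.push([from, to]);
  }
  return segments;
}
```

Keeping the topology as plain index pairs means the 3D scene needs only the landmark array itself, which is the property the paragraph above relies on.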

2.2.2 Retargeting to a humanoid

SkeletonRenderer and HumanoidRobot share a THREE.Group. The humanoid exposes seven sub-managers, each responsible for a slice of state:

Sub-manager	Responsibility
`ModelLoader`	Loads the GLB shell
`LimbManager`	Arm/leg bone rotations from pose landmarks
`ShoulderManager`	Clavicle/shoulder corrections to stabilise upper-body chain
`FingerManager`	Per-finger rotations from hand landmarks
`FaceManager`	Expression-driven facial updates
`MaterialManager`	Material/shader state
`VisibilityManager`	Toggles between humanoid mode and raw landmark mode

This decomposition keeps each transformation small and locally testable; it also lets us swap individual managers (e.g. a different finger retargeting algorithm) without touching the rest of the pipeline.
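
To make "landmark to bone rotation" concrete, here is a minimal sketch of one common approach: take the direction between two landmarks and compute the shortest-arc quaternion that rotates a bone's rest direction onto it. This is not the repository's `LimbManager`. The rest direction, the y-flip (image y grows downwards), the untouched z axis, and all function names are assumptions made for this example. Plain tuples are used instead of Three.js types so the block stays self-contained.

```typescript
type Vec3 = [number, number, number];
type Quat = { x: number; y: number; z: number; w: number };
interface Point { x: number; y: number; z: number; }

const dot = (a: Vec3, b: Vec3): number => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
const cross = (a: Vec3, b: Vec3): Vec3 => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];
const normalize = (v: Vec3): Vec3 => {
  const len = Math.hypot(v[0], v[1], v[2]);
  return len === 0 ? [0, 0, 0] : [v[0] / len, v[1] / len, v[2] / len];
};

// Assumed rest pose: the bone hangs straight down in scene space.
const REST_DIRECTION: Vec3 = [0, -1, 0];

// Unit direction parent -> child. y is negated because image y grows
// downwards (assumption: scene y points up); z is passed through as-is.
function boneDirection(parent: Point, child: Point): Vec3 {
  return normalize([child.x - parent.x, -(child.y - parent.y), child.z - parent.z]);
}

// Shortest-arc quaternion rotating unit vector `from` onto unit vector `to`.
function quatFromUnitVectors(from: Vec3, to: Vec3): Quat {
  const d = dot(from, to);
  if (d < -1 + 1e-8) {
    // Opposite vectors: rotate 180 degrees about any axis perpendicular to `from`.
    const helper: Vec3 = Math.abs(from[0]) < 0.9 ? [1, 0, 0] : [0, 1, 0];
    const axis = normalize(cross(from, helper));
    return { x: axis[0], y: axis[1], z: axis[2], w: 0 };
  }
  const c = cross(from, to);
  const len = Math.hypot(c[0], c[1], c[2], 1 + d);
  return { x: c[0] / len, y: c[1] / len, z: c[2] / len, w: (1 + d) / len };
}

// Rotate a vector by a unit quaternion (used here only to check the result).
function rotate(q: Quat, v: Vec3): Vec3 {
  const u: Vec3 = [q.x, q.y, q.z];
  const uv = cross(u, v);
  const t: Vec3 = [2 * uv[0], 2 * uv[1], 2 * uv[2]];
  const ut = cross(u, t);
  return [
    v[0] + q.w * t[0] + ut[0],
    v[1] + q.w * t[1] + ut[1],
    v[2] + q.w * t[2] + ut[2],
  ];
}

// Example: a horizontal upper arm (shoulder and elbow at the same image height).
const shoulder: Point = { x: 0.4, y: 0.3, z: 0 };
const elbow: Point = { x: 0.6, y: 0.3, z: 0 };
const upperArmRotation = quatFromUnitVectors(REST_DIRECTION, boneDirection(shoulder, elbow));
```

Because each bone's rotation is derived directly from two tracked points, no iterative solver is involved. That matches the trade-off stated in the comparison table, but a two-point direction leaves the twist around the bone axis undetermined.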

2.2.3 Facial affect

useImprovedEmotionDetection uses @vladmandic/face-api's tinyFaceDetector + faceExpressionNet. We deliberately chose this combination because:

TinyFaceDetector at inputSize=416, scoreThreshold=0.5 runs comfortably on the CPU/Wasm backend used in the browser.
The expression network returns the canonical seven-class probability vector (happy, sad, angry, surprised, neutral, fearful, disgusted).

Predictions are sampled on a 400 ms interval (not per-frame) and pushed through a temporal smoother whose window holds at most the last 6 predictions and only predictions newer than 2.5 s, so whichever limit yields the smaller set applies. The dominant emotion is the class maximising count × mean_confidence over that window, and the state is published only when the dominant class changes or confidence exceeds 0.6. This is a deliberate trade-off: we accept higher latency on rare expressions (e.g. fearful) in exchange for stability against single-frame noise that would otherwise cause the avatar's face to flicker.
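
The windowing and scoring rule above can be sketched as follows. This is an illustrative reconstruction from the description, not the code in `useImprovedEmotionDetection`: the class name `EmotionSmoother`, the `Prediction` shape, and the reading of "confidence" as the mean confidence of the dominant class are assumptions. Note that count × mean_confidence is simply the sum of confidences for that class within the window.

```typescript
type Emotion =
  | "happy" | "sad" | "angry" | "surprised" | "neutral" | "fearful" | "disgusted";

interface Prediction {
  emotion: Emotion;
  confidence: number; // probability of the top class, in [0, 1]
  timestamp: number;  // milliseconds
}

const MAX_WINDOW = 6;           // last 6 predictions
const MAX_AGE_MS = 2500;        // predictions newer than 2.5 s
const PUBLISH_CONFIDENCE = 0.6; // publish threshold from the text

class EmotionSmoother {
  private window: Prediction[] = [];
  private published: Emotion | null = null;

  // Returns the emotion to publish, or null when nothing should be published.
  push(p: Prediction): Emotion | null {
    this.window.push(p);
    // Apply both limits; the surviving set is the smaller of the two.
    this.window = this.window
      .filter((q) => p.timestamp - q.timestamp <= MAX_AGE_MS)
      .slice(-MAX_WINDOW);

    // Accumulate count and confidence sum per class.
    const stats = new Map<Emotion, { count: number; sum: number }>();
    for (const q of this.window) {
      const s = stats.get(q.emotion) ?? { count: 0, sum: 0 };
      s.count += 1;
      s.sum += q.confidence;
      stats.set(q.emotion, s);
    }

    // Pick the class maximising count x mean confidence.
    let dominant: Emotion = p.emotion;
    let bestScore = -Infinity;
    let dominantMean = 0;
    stats.forEach((s, emotion) => {
      const mean = s.sum / s.count;
      const score = s.count * mean;
      if (score > bestScore) {
        bestScore = score;
        dominant = emotion;
        dominantMean = mean;
      }
    });

    if (dominant !== this.published || dominantMean > PUBLISH_CONFIDENCE) {
      this.published = dominant;
      return dominant;
    }
    return null;
  }
}
```

With a 400 ms cadence, six samples span about 2 s, so the count limit usually binds before the 2.5 s age limit. The age limit matters mainly after a gap in detections, for example when the face briefly leaves the frame.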

2.3 Comparison of techniques considered

Decision	What we shipped	What we explicitly rejected	Reason
Pose detector	MediaPipe Pose + Hands	MediaPipe Holistic single graph; BlazePose GHUM full mesh; OpenPose server	Holistic raises per-frame latency; OpenPose needs a server; GHUM is not exposed in the JS build
Face affect	`@vladmandic/face-api` TinyFaceDetector + FaceExpressionNet	Larger SSD-MobileNet face detector; AU regression (e.g. OpenFace)	SSD adds ~3× latency in-browser; OpenFace has no first-class web build
Emotion temporal model	Windowed "count × mean confidence" score (last 6 / 2.5 s)	Exponential moving average; HMM/Viterbi over class trajectories	EMA over-weights stale frames at low sampling rate; HMM was out of scope
Avatar driving	Direct landmark → bone rotation in `LimbManager` / `FingerManager`	Full inverse kinematics solver (CCD/FABRIK)	The MediaPipe skeleton is already pose-complete; IK was unnecessary and adds drift
Hand handedness	User instruction overlay: "flip both palms forward at start"	Automatic re-labelling per frame	MediaPipe's left/right flip under occlusion is unsolved here; the instruction is cheaper

Note that we also keep two parallel "advanced" / "improved" emotion hooks (useAdvancedEmotionDetection, useImprovedEmotionDetection) so we can A/B switch detector configurations without losing the older path. Only useImprovedEmotionDetection drives the avatar's face in the current UI.

2.4 Limitations (Theme 1)

Monocular depth is approximate. MediaPipe's z is a relative regression; the humanoid is consequently driven primarily by 2D projection and inferred limb orientation, not metric reconstruction.
No multi-person support. All three pipelines assume a single subject.
Occlusion of either hand below the desk or behind the torso causes finger sub-manager updates to freeze on the last seen pose.
Affect ≠ emotion. face-api.js predicts visual expression categories, not internal affective state; "neutral" dominates in normal conversation by construction.
Lighting sensitivity. Low-light or backlit conditions degrade TinyFaceDetector and the pose graph simultaneously; we surface this only indirectly through landmark visibility scores.
CDN dependency. MediaPipe scripts are loaded at runtime from jsdelivr; an offline user cannot use the pose pipeline today.

3. Theme 2 — Extension of Human Sensory/Motor Functions

3.1 Problem statement

Augment the user's voice with a low-latency, interruptible conversational partner that also reflects their physical and affective state, while running inside the browser-security envelope and respecting a fixed token budget.

3.2 Methodology

3.2.1 Voice capture

The microphone is captured twice, with explicit separation of concerns:

Consumer	Hook / module	Purpose
Speech recognition	`useVoiceInteraction` → Web Speech API	Continuous transcript with interim results
Loudness telemetry	`useMicLevel` → `AnalyserNode` (RMS)	Drives the new `MicLevelMeter` and silence detection

This separation lets us show "I can't hear you" feedback even when the SpeechRecognition engine has not yet emitted a result.

3.2.2 Real-time mic level meter (new)

useMicLevel opens its own getUserMedia stream with echoCancellation/noiseSuppression/autoGainControl enabled, builds an AnalyserNode with fftSize = 1024 and smoothingTimeConstant = 0.6, and computes RMS of the time-domain buffer per requestAnimationFrame. We chose RMS over peak amplitude because RMS correlates better with perceived loudness and is more forgiving of transient clicks.

A sliding peak (peakRef) decays at ~8 % per 60 ms so the UI shows a hairline marker without flickering. The hook publishes:

level ∈ [0, 1] — gamma-corrected RMS for UI rendering,
peak ∈ [0, 1] — decaying max,
isSilent — true after 1.5 s below SILENCE_THRESHOLD = 0.015,
error and permissionState — surfaced as actionable copy in the meter.
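
The per-frame arithmetic behind those published values can be sketched with pure functions, leaving out the Web Audio wiring. The constants taken from the text are the silence threshold, the 1.5 s hold, and the roughly 8 % per 60 ms peak decay. The gamma exponent of 0.5, the choice to test silence against the raw (not gamma-corrected) RMS, and the names `updateMeter` and `MeterState` are assumptions for illustration only.

```typescript
const SILENCE_THRESHOLD = 0.015;   // from the text
const SILENCE_HOLD_MS = 1500;      // isSilent becomes true after 1.5 s
const PEAK_DECAY_PER_60_MS = 0.92; // about 8 % decay every 60 ms
const GAMMA = 0.5;                 // hypothetical exponent for the UI curve

// Root mean square of a time-domain buffer with samples in [-1, 1].
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

interface MeterState {
  peak: number;              // decaying maximum of `level`
  quietSince: number | null; // time the signal first dropped below threshold
}

interface MeterOutput {
  level: number;
  peak: number;
  isSilent: boolean;
}

// One meter update; `dtMs` is the time elapsed since the previous update.
function updateMeter(
  state: MeterState,
  samples: Float32Array,
  nowMs: number,
  dtMs: number,
): MeterOutput {
  const raw = rms(samples);
  const level = Math.min(1, Math.pow(raw, GAMMA));

  // Time-based decay so the marker falls at the same speed at any frame rate.
  const decayed = state.peak * Math.pow(PEAK_DECAY_PER_60_MS, dtMs / 60);
  state.peak = Math.max(level, decayed);

  if (raw < SILENCE_THRESHOLD) {
    if (state.quietSince === null) state.quietSince = nowMs;
  } else {
    state.quietSince = null;
  }
  const isSilent =
    state.quietSince !== null && nowMs - state.quietSince >= SILENCE_HOLD_MS;

  return { level, peak: state.peak, isSilent };
}
```

In a browser, `samples` would typically come from `AnalyserNode.getFloatTimeDomainData` inside a `requestAnimationFrame` callback, with `dtMs` derived from the callback timestamps.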

3.2.3 Conversational core

src/services/geminiService.ts (named for legacy reasons) is a Groq SDK wrapper exposing an AsyncGenerator<string> over Groq's chat.completions.create({ stream: true }). The contract is intentionally narrow:

Model. llama-3.1-8b-instant — chosen for first-token latency, not ceiling capability.
System prompt. Bounded to "≤50 words, plain conversational, TTS-safe" so TTS does not stall on bullet lists or code.
Budget. max_completion_tokens = 256 to bound TTS time and cost.
Hard timeout. 15 s via an internal AbortController.
Retry. Up to 2 retries with exponential backoff (300 × 2ⁿ ms, capped at 2 s) for 429 and 5xx only; never on user cancel.
Errors. Mapped to a GroqServiceError taxonomy (NOT_CONFIGURED | TIMEOUT | NETWORK | UPSTREAM | UNKNOWN) so the UI can decide whether to expose a "Try again" button.
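
The retry rule in the contract above can be illustrated with a small wrapper. This is a sketch under stated assumptions rather than the code in `geminiService.ts`: it assumes a failed call rejects with an object carrying a numeric `status` field, and `withRetry`, `backoffDelayMs`, `isRetryableStatus`, and the `isCancelled` callback are hypothetical names.

```typescript
const BASE_DELAY_MS = 300;
const MAX_DELAY_MS = 2000;
const MAX_RETRIES = 2;

// 300 x 2^n milliseconds, capped at 2 s; `attempt` is zero-based.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt), MAX_DELAY_MS);
}

// Only rate limiting (429) and server errors (5xx) are worth retrying.
function isRetryableStatus(status: number | undefined): boolean {
  return status === 429 || (status !== undefined && status >= 500 && status <= 599);
}

// Assumed error shape: an optional HTTP status on the thrown value.
interface StatusError {
  status?: number;
}

async function withRetry<T>(
  call: () => Promise<T>,
  isCancelled: () => boolean = () => false,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = (err as StatusError | undefined)?.status;
      const giveUp =
        isCancelled() ||               // never retry after a user cancel
        attempt >= MAX_RETRIES ||      // retry budget exhausted
        !isRetryableStatus(status);    // terminal error class
      if (giveUp) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With at most two retries the delays are 300 ms and 600 ms, so the 2 s cap only takes effect if the retry budget is raised to four or more.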

3.2.4 State machine

useVoiceInteraction is the single source of truth for the voice loop:

idle ──startListening──► listening ──final transcript──► processing
                                                            │
                                                            ▼
                            error ◄──── streaming ◄────── (first token)
                                            │
                                            ▼
                                         speaking ──onend──► idle

cancel() aborts both the in-flight stream and speechSynthesis. retry() re-runs the last prompt. Streaming tokens are accumulated in partialResponse for the HUD and only handed to SpeechSynthesisUtterance after the stream terminates — speaking incremental chunks produces a choppy prosody profile and was rejected.
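
One way to read the diagram is as a pure reducer over voice-loop events, sketched below. The phase names come from the diagram; the event names, the `VoiceState` shape, and several transitions are assumptions: here `cancel` returns to idle, `retry` goes from error back to processing, and an error may also be raised before the first token arrives. The actual hook may differ on all three points.

```typescript
type Phase = "idle" | "listening" | "processing" | "streaming" | "speaking" | "error";

interface VoiceState {
  phase: Phase;
  partialResponse: string;   // tokens accumulated for the HUD
  lastPrompt: string | null; // kept so retry can re-run it
}

type VoiceEvent =
  | { type: "startListening" }
  | { type: "finalTranscript"; text: string }
  | { type: "token"; text: string }
  | { type: "streamEnd" }
  | { type: "streamError" }
  | { type: "speechEnd" }
  | { type: "cancel" }
  | { type: "retry" };

const initialState: VoiceState = { phase: "idle", partialResponse: "", lastPrompt: null };

function reduce(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (event.type) {
    case "startListening":
      return state.phase === "idle" ? { ...state, phase: "listening" } : state;
    case "finalTranscript":
      return state.phase === "listening"
        ? { phase: "processing", partialResponse: "", lastPrompt: event.text }
        : state;
    case "token":
      // The first token moves processing -> streaming; later tokens accumulate.
      return state.phase === "processing" || state.phase === "streaming"
        ? { ...state, phase: "streaming", partialResponse: state.partialResponse + event.text }
        : state;
    case "streamEnd":
      // Only now is the full text handed to speech synthesis.
      return state.phase === "streaming" ? { ...state, phase: "speaking" } : state;
    case "streamError":
      return state.phase === "processing" || state.phase === "streaming"
        ? { ...state, phase: "error" }
        : state;
    case "speechEnd":
      return state.phase === "speaking" ? { ...state, phase: "idle" } : state;
    case "cancel":
      return { ...state, phase: "idle", partialResponse: "" };
    case "retry":
      return state.phase === "error" && state.lastPrompt !== null
        ? { ...state, phase: "processing", partialResponse: "" }
        : state;
    default:
      return state;
  }
}
```

Events that do not apply in the current phase are ignored, so a late token arriving after `cancel` cannot pull the loop back into streaming.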

3.3 Comparison of techniques considered

Decision	What we shipped	What we explicitly rejected	Reason
LLM provider	Groq `llama-3.1-8b-instant` (streaming)	Google Gemini (the original integration; quota-bound for our key)	Groq's hosted Llama gives sub-second first-token in practice and is provider-neutral
Streaming UX	Accumulate tokens in HUD, speak on completion	Speak each chunk as it arrives	Per-chunk TTS produces choppy prosody and double-trips on `speechSynthesis.cancel()`
Speech in	Web Speech API (`webkitSpeechRecognition`)	Whisper.cpp WASM in-browser	Whisper WASM models are 100–300 MB and would dominate the per-session download budget
Speech out	`SpeechSynthesisUtterance`	Cloud TTS (ElevenLabs, Polly)	Adds a paid dependency and a second round-trip; not required for the prototype
Mic level	`AnalyserNode` RMS + sliding peak	Peak-only meter; FFT-bar visualiser	RMS tracks perceived loudness; bar visualiser implied louder mic state than RMS did
Error handling	Typed `GroqServiceError` + targeted retry policy	Generic `try/catch` with toast	We need to distinguish retryable (`429`/`5xx`/timeout) from terminal (other `4xx`, cancel)
Mic permission UX	Inline tips in the meter referencing the address-bar lock	Generic "permission denied" alert	Users on Chromium need site-settings instructions, not a modal

3.4 Limitations (Theme 2)

Browser support skew. Web Speech API quality (and existence) varies significantly across browsers. Chromium delegates to a Google service and requires network connectivity; Safari's implementation is on-device but less accurate; Firefox has no production support.
API key exposure model. VITE_GROQ_API_KEY is inlined into the client bundle by Vite at build time. This is acceptable for a personal demo but unsafe for public deployment; the documented production path is a Lovable Cloud edge function (see docs/SECURITY.md).
No barge-in. While the avatar is speaking, the recogniser is paused to avoid feedback. A true barge-in would require AEC tuned for the TTS output and an explicit cross-channel canceller, which is out of scope.
TTS prosody is constrained. SpeechSynthesisUtterance exposes only rate/pitch/volume; SSML support across browsers is partial.
One-shot context. The Groq call sends only {system, user}; we do not yet maintain conversation history. This is deliberate for the demo — long histories raise both latency and token cost.

4. Cross-cutting methodology

4.1 Failure isolation

Each pipeline (pose, hands, face affect, voice in, voice out, LLM) owns its own React hook, its own loading state, and its own error state. A failure in any one does not block the others — e.g. the AI key being missing leaves the pose pipeline intact and only disables the voice HUD's "Generate" path.

4.2 Performance budgets

Pipeline	Budget per cycle	Mechanism
Three.js render	16.6 ms (60 fps)	Mutates `THREE.Group` directly; React not involved
Pose / hands	33 ms (30 fps)	MediaPipe `onResults` callback
Facial expression	400 ms	`setInterval`-driven, not per-frame
AI first token	≤ 600 ms target	Groq streaming + no client buffering
Mic meter	16.6 ms (rAF)	RMS over a 1024-sample window
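
The figures in this table follow from simple arithmetic, worked through below as a sanity check. The 48 kHz rate is taken from the inputs table; the actual `AudioContext` sample rate on a given device may differ, so treat that value as an assumption. The helper names are hypothetical.

```typescript
// Time available per cycle at a given rate.
function frameBudgetMs(fps: number): number {
  return 1000 / fps;
}

// Duration of audio covered by one analysis window.
function windowDurationMs(samples: number, sampleRateHz: number): number {
  return (samples / sampleRateHz) * 1000;
}

const renderBudget = frameBudgetMs(60);              // about 16.67 ms
const poseBudget = frameBudgetMs(30);                // about 33.33 ms
const meterWindow = windowDurationMs(1024, 48000);   // about 21.33 ms
const poseFramesPerEmotionSample = 400 / poseBudget; // about 12 pose frames
```

Under these assumptions a 1024-sample window covers about 21.3 ms of audio, which is longer than one 16.6 ms animation frame, so consecutive RMS readings are computed over overlapping audio. The 400 ms emotion cadence corresponds to roughly one expression sample per 12 pose frames.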

4.3 Privacy posture

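All vision and audio-level analysis runs on-device in the browser; the application itself sends no frames or audio buffers over the network. Speech recognition is the one exception: as noted in 3.4, Chromium's Web Speech API delegates to a Google service, so microphone audio can leave the machine through the browser on that path. Apart from that browser-managed path, the only network egress is: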

MediaPipe + face-api model files (CDN).
The textual prompt + system message sent to Groq.
The textual completion returned from Groq.

This is documented and reinforced in docs/SECURITY.md.

4.4 Reproducibility

Artefact	Pinned by
Library versions	`package.json` + `bun.lockb`
Model versions	MediaPipe & face-api CDN URLs (no SHA pin — known risk)
LLM behaviour	Model id `llama-3.1-8b-instant`, `temperature=0.8`
Sampling cadence	Hard-coded intervals (400 ms emotion, rAF render/meter)

5. Overall limitations & threats to validity

Single-subject, single-camera assumption. None of the perception pipelines is multi-user.
No metric calibration. Without a stereo rig or IMU, depth and absolute joint angles are approximate.
Affect detection is visual only. No prosody-based affect from the microphone is fused in; the LLM is text-only.
The LLM is non-deterministic and non-grounded. With temperature=0.8 and no retrieval layer, identical prompts produce different replies; the system makes no factual guarantees.
Browser drift. Web Speech API, AudioWorklet, and permissions.query({ name: 'microphone' }) semantics change between browser releases; regressions show up as silently degraded UX rather than hard failures.
CDN trust. MediaPipe scripts and face-api weights are loaded from third-party CDNs; in principle a compromised CDN could ship hostile code into the page.

6. Future work (explicitly not implemented today)

Listed here for completeness; none of these items are present in the repository and they must not be cited as current capabilities.

LLM call routed through a Lovable Cloud edge function so the API key never reaches the browser bundle.
Multi-turn conversation memory with per-session truncation.
Audio-prosody affect fusion (e.g. pitch / energy → arousal).
Barge-in (talk-over) with full-duplex AEC.
Whisper-WASM transcription as a Firefox fallback.
IK-corrected limb retargeting for tighter joint constraints.

7. References to source

Concept	File
Pose / hand landmark capture	`src/hooks/useMediaPipePoseDetection.ts`
Skeleton + humanoid retargeting	`src/components/SkeletonRenderer.tsx`, `src/components/HumanoidRobot.tsx`, `src/components/robot/*`
Facial expression	`src/hooks/useImprovedEmotionDetection.ts`
Voice state machine	`src/hooks/useVoiceInteraction.ts`
LLM client	`src/services/geminiService.ts`
Voice HUD	`src/components/VoiceChatPanel.tsx`
Mic level meter	`src/hooks/useMicLevel.ts`, `src/components/MicLevelMeter.tsx`
Permission onboarding	`src/components/PermissionHandler.tsx`
