Comprehensive AI-powered voice analysis system for pediatric speech and developmental screening
HappyVoiceLearn is an advanced voice analysis system designed to assist in early screening for autism spectrum disorder (ASD), ADHD, and speech disorders in children. The system combines state-of-the-art speaker diarization, acoustic feature extraction, and prosody analysis to provide comprehensive voice assessments.
- Speaker Diarization: Automatically separates the human child's voice from the AI agent's in multi-speaker audio
- 88 Acoustic Features: OpenSMILE eGeMAPSv02 feature extraction (voice quality, pitch, formants, MFCCs)
- Prosody Analysis: Comprehensive analysis of pitch contours, intonation, rhythm, stress, and phrasing (50+ metrics)
- Clinical Screening: Evidence-based screening for ASD, ADHD, and speech disorders
- Cloud-Ready: Optimized for Google Cloud Run with auto-scaling
- Production API: RESTful API with JSON responses for easy integration
- Harmonics-to-Noise Ratio (HNR): Voice breathiness and clarity
- Jitter & Shimmer: Voice stability and consistency
- Pitch (F0): Mean, range, and variability
- Formants: F1, F2, F3 frequencies and bandwidths
- MFCCs: Mel-frequency cepstral coefficients for voice timbre
- Speech Rate: Voiced segments per second
- Loudness: Mean, variability, and dynamic range
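Jitter and shimmer both measure cycle-to-cycle instability of the glottal signal. As a rough illustration of the "local jitter" idea (this is the textbook definition, not OpenSMILE's windowed `jitterLocal_sma3nz` implementation):

```python
def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    normalized by the mean period (the classic 'local jitter' definition)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# A perfectly periodic voice has zero jitter:
print(local_jitter([0.01, 0.01, 0.01, 0.01, 0.01]))  # 0.0
```

Shimmer is the same construction applied to cycle amplitudes (in dB) instead of periods.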
- Pitch Contour: F0 dynamics, excursions, velocity, acceleration
- Intonation Patterns: Rising, falling, rise-fall, flat classifications
- Rhythm: nPVI, PVI, syllable timing, tempo estimation
- Stress Patterns: Stress rate, intervals, strength, regularity
- Phrasing: Phrase lengths, pause patterns, pause-to-speech ratio
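The nPVI rhythm metric above has a standard closed form over successive interval (syllable/vowel) durations: 100 times the mean of `|d_k − d_(k+1)| / ((d_k + d_(k+1))/2)` over adjacent pairs. A minimal sketch:

```python
def npvi(durations):
    """Normalized Pairwise Variability Index: higher values mean more
    contrast between adjacent syllable durations (stress-timed rhythm)."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)

print(npvi([0.12, 0.12, 0.12]))  # 0.0 -- perfectly even timing
```

The raw PVI is the same computation without the pairwise-mean normalization in the denominator.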
- ASD Screening (8-point scale): Flat intonation, narrow pitch range, atypical prosody
- ADHD Screening (5-point scale): Speech rate variability, irregular rhythm, loudness inconsistency
- Speech Disorder Screening (6-point scale): Voice quality issues, articulation concerns
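Each screen is an additive tally of flagged indicators mapped to a risk level. As a rough illustration of the idea only — the thresholds below are hypothetical placeholders, and the deployed speech-disorder screen uses six criteria, not three:

```python
def screen_speech_disorder(features):
    """Toy three-rule indicator tally. All thresholds are illustrative,
    not the system's actual screening criteria."""
    indicators = []
    if features.get("jitter", 0.0) > 0.02:
        indicators.append(f"Elevated jitter: {features['jitter']}")
    if features.get("hnr_db", 99.0) < 7.0:
        indicators.append(f"Low HNR: {features['hnr_db']} dB")
    if features.get("shimmer_db", 0.0) > 1.0:
        indicators.append(f"High shimmer: {features['shimmer_db']} dB")
    score = len(indicators)
    risk = "high" if score >= 3 else "medium" if score == 2 else "low"
    return {"score": score, "risk_level": risk, "indicators": indicators}

# Values from the example /analyze response trip all three rules:
result = screen_speech_disorder({"jitter": 0.0226, "hnr_db": 5.8, "shimmer_db": 1.237})
print(result["risk_level"])  # high
```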
```
Audio Input (WAV/MP3)
(may contain multiple speakers)
          │
          ▼
Speaker Diarization (pyannote.audio 3.1 pipeline)
  • Identifies speakers
  • Separates the human child from the AI agent
  • Extracts human-only audio segments
          │
          ▼
Feature Extraction (parallel)
  OpenSMILE eGeMAPSv02          Prosody Analysis (Parselmouth)
  • 88 acoustic features        • Pitch contours
  • Voice quality               • Intonation
  • Spectral analysis           • Rhythm (nPVI)
  • MFCCs                       • Stress patterns
                                • Phrasing
          │
          ▼
Clinical Screening
  • ASD indicators (flat intonation, pitch range, etc.)
  • ADHD indicators (variability, rhythm, etc.)
  • Speech disorder indicators (voice quality, etc.)
  • Risk level calculation (low/medium/high)
  • Follow-up recommendations
          │
          ▼
JSON Response
  • All features and metrics
  • Clinical interpretations
  • Risk assessments
  • Recommendations
```
- Python 3.9 or higher
- HuggingFace account and token (for pyannote.audio)
- Google Cloud account (for deployment)
1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/happyvoicelearn.git
   cd happyvoicelearn
   ```

2. Set up a Python environment:

   ```shell
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```shell
   cd happyvoice-gcp
   pip install -r requirements.txt
   ```

4. Set your HuggingFace token:

   ```shell
   export HF_TOKEN=your_huggingface_token_here
   ```

5. Run the API locally:

   ```shell
   cd src
   python main.py
   ```

6. Test the API:

   ```shell
   # Health check
   curl http://localhost:8080/health

   # Analyze audio
   curl -X POST http://localhost:8080/analyze \
     -F "audio=@path/to/your/audio.wav" \
     -H "Content-Type: multipart/form-data"
   ```
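The same analysis call can be scripted with only the Python standard library. The sketch below builds the JSON/base64 request body documented for the `/analyze` endpoint (the field names come from that section; the service URL, file path, and the `build_analyze_request` helper name are placeholders):

```python
import base64
import json
import urllib.request

def build_analyze_request(base_url, wav_bytes, child_age=8, gender="male"):
    """Build a POST /analyze request with the documented JSON body."""
    payload = {
        "audio_base64": "data:audio/wav;base64,"
        + base64.b64encode(wav_bytes).decode("ascii"),
        "child_age": child_age,
        "gender": gender,
        "skip_diarization": False,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To send it against a running service:
# req = build_analyze_request("http://localhost:8080",
#                             open("audio.wav", "rb").read())
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```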
```shell
# Set variables
export PROJECT_ID=your-gcp-project-id
export REGION=us-central1
export HF_TOKEN=your_huggingface_token

# Authenticate
gcloud auth login
gcloud config set project $PROJECT_ID

# Enable required APIs
gcloud services enable run.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable artifactregistry.googleapis.com

# Deploy to Cloud Run
cd happyvoice-gcp
gcloud run deploy happyvoicelearn \
  --source . \
  --region $REGION \
  --platform managed \
  --allow-unauthenticated \
  --memory 4Gi \
  --cpu 2 \
  --timeout 600 \
  --set-env-vars "HF_TOKEN=$HF_TOKEN"

# Get service URL
gcloud run services describe happyvoicelearn \
  --region $REGION \
  --format 'value(status.url)'
```

- Cold Start: 30-60 seconds (first request after idle)
- Processing Time: 60-80 seconds per audio file
- Resources: 2 vCPU, 4 GB RAM recommended
- Estimated Cost: ~$0.10 per 1,000 requests
- Auto-scaling: 0 to 10+ instances based on demand
`GET /health` — health check endpoint.
Response:

```json
{
  "status": "healthy",
  "service": "happyvoicelearn-complete",
  "version": "3.0.0",
  "components": {
    "speaker_diarization": "pyannote.audio 3.1",
    "opensmile": "eGeMAPSv02 (88 features)",
    "prosody": "Full prosody analysis"
  },
  "capabilities": [
    "speaker_separation",
    "human_vs_ai_identification",
    "voice_quality_analysis",
    "prosody_analysis",
    "clinical_screening (ASD/ADHD/Speech)"
  ]
}
```

`POST /analyze` — complete voice analysis pipeline.

Request (Multipart Form):

```shell
curl -X POST https://your-service-url/analyze \
  -F "audio=@child_voice.wav"
```

Request (JSON with Base64):

```shell
curl -X POST https://your-service-url/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "audio_base64": "data:audio/wav;base64,UklGRi...",
    "child_age": 8,
    "gender": "male",
    "skip_diarization": false
  }'
```

Response:
```jsonc
{
  "success": true,
  "timestamp": "2025-01-06T00:00:00.000000",
  "processing_info": {
    "speaker_separation_attempted": true,
    "speaker_separation_success": true
  },
  "opensmile": {
    "feature_count": 88,
    "features": {
      "F0semitoneFrom27.5Hz_sma3nz_amean": 30.67,
      "HNRdBACF_sma3nz_amean": 5.84,
      "jitterLocal_sma3nz_amean": 0.0226,
      "shimmerLocaldB_sma3nz_amean": 1.237
      // ... 84 more features
    },
    "interpretation": {
      "pitch_variability": {
        "value": 0.204,
        "status": "normal",
        "concern": false
      },
      "hnr": {
        "value": 5.84,
        "status": "low",
        "concern": true,
        "interpretation": "Breathy/noisy voice quality"
      }
      // ... more interpretations
    }
  },
  "prosody": {
    "pitch_contour": {
      "mean_f0_hz": 157.51,
      "f0_range_hz": 215.8,
      "f0_range_semitones": 22.66,
      "pitch_changes_per_second": 17.25
      // ... more metrics
    },
    "intonation": {
      "pattern_counts": {
        "rise_fall": 11,
        "flat": 9,
        "fall_rise": 5
      },
      "pattern_proportions": {
        "rise_fall": 0.393,
        "flat": 0.321,
        "fall_rise": 0.179
      },
      "intonation_variety": 1.94
    },
    "rhythm": {
      "syllable_count": 76,
      "syllables_per_second": 3.93,
      "npvi": 54.42,
      "pvi": 43.28,
      "tempo_bpm": 125.0
    }
    // ... stress and phrasing metrics
  },
  "screening": {
    "disclaimer": "SCREENING ONLY - Not diagnostic. Always consult qualified professionals.",
    "asd": {
      "score": 2,
      "max_score": 8,
      "risk_level": "medium",
      "indicators": [
        "Elevated shimmer: 1.237 dB",
        "Elevated flat intonation: 32.1%"
      ],
      "recommendation": "Continue monitoring"
    },
    "adhd": {
      "score": 1,
      "max_score": 5,
      "risk_level": "low",
      "indicators": [
        "Elevated speech rate variability: CV=0.26"
      ],
      "recommendation": "Continue monitoring"
    },
    "speech_disorder": {
      "score": 4,
      "max_score": 6,
      "risk_level": "high",
      "indicators": [
        "Elevated jitter: 0.0226",
        "Low HNR: 5.8 dB",
        "High shimmer: 1.237 dB"
      ],
      "recommendation": "Consider speech-language evaluation"
    },
    "summary": {
      "requires_followup": true,
      "primary_concern": "speech_disorder",
      "confidence": "medium"
    }
  }
}
```

`POST /separate-speakers` — speaker separation only (no analysis).
Request:
```shell
curl -X POST https://your-service-url/separate-speakers \
  -F "audio=@multi_speaker_audio.wav"
```

Response:

```json
{
  "success": true,
  "audio_base64": "UklGRi4uLg==",
  "format": "wav"
}
```

- DEPLOYMENT_SUMMARY.md - Complete GCP deployment guide and service information
- COMBINED_ANALYSIS_TEST_RESULTS.md - Detailed test results with clinical interpretations
- PROJECT_COMPLETE.md - Full project overview and capabilities
- HUMAN_VS_AI_IDENTIFICATION_GUIDE.md - Speaker identification methods
- happyvoice-gcp/DEPLOYMENT.md - Detailed deployment instructions
```
GCP_HappyVoice/
├── README.md                          # This file
├── .gitignore                         # Git exclusions
│
├── happyvoice-gcp/                    # Main deployment package
│   ├── src/
│   │   ├── main.py                    # Flask API (all endpoints)
│   │   └── prosody_analyzer.py        # Prosody analysis module
│   ├── requirements.txt               # Python dependencies
│   ├── Dockerfile                     # Container configuration
│   ├── .dockerignore                  # Build optimization
│   ├── DEPLOYMENT.md                  # Detailed deployment guide
│   └── README.md                      # Quick start guide
│
├── OpenSmile_Prosody_Custom/          # Development/testing version
│   ├── main_combined.py               # Combined analysis (local)
│   ├── prosody_analyzer.py            # Prosody module
│   ├── requirements_combined.txt      # Dependencies
│   └── FEATURE_COMPARISON.md          # Feature documentation
│
├── Documentation/
│   ├── DEPLOYMENT_SUMMARY.md          # GCP deployment summary
│   ├── COMBINED_ANALYSIS_TEST_RESULTS.md    # Test results & analysis
│   ├── PROJECT_COMPLETE.md            # Complete project overview
│   ├── HUMAN_VS_AI_IDENTIFICATION_GUIDE.md  # Speaker ID methods
│   ├── SPEAKER_SEPARATION_OPTIONS.md  # Diarization options
│   └── OPENSMILE_COMPLETE_GUIDE.md    # OpenSMILE documentation
│
├── Scripts/
│   ├── run_speaker_diarization.py     # Speaker diarization script
│   ├── identify_human_vs_ai.py        # Human/AI identification
│   ├── test_combined_analysis.py      # Full analysis test
│   └── test_local_opensmile.py        # OpenSMILE test
│
└── opensmile-gcp/                     # Legacy OpenSMILE-only version
    └── README.md
```
```shell
python test_combined_analysis.py
```

This will:

- Load test audio
- Perform speaker diarization (if multi-speaker)
- Extract OpenSMILE features
- Analyze prosody
- Generate clinical screening scores
- Save results to `combined_analysis_results.json`
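The saved results can then be inspected programmatically. A sketch (the `summarize_screening` helper is illustrative; only the summary keys shown in the `/analyze` response above are assumed):

```python
import json

def summarize_screening(results):
    """Pull the one-line takeaway out of a saved analysis result."""
    s = results["screening"]["summary"]
    return "primary concern: {} (follow-up: {})".format(
        s["primary_concern"], "yes" if s["requires_followup"] else "no"
    )

# With a saved results file:
# with open("combined_analysis_results.json") as f:
#     print(summarize_screening(json.load(f)))

sample = {"screening": {"summary": {"requires_followup": True,
                                    "primary_concern": "speech_disorder"}}}
print(summarize_screening(sample))  # primary concern: speech_disorder (follow-up: yes)
```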
```shell
python run_speaker_diarization.py
```

Separates multi-speaker audio and saves human-only segments.
THIS SYSTEM IS FOR SCREENING PURPOSES ONLY - NOT DIAGNOSTIC
- This tool provides preliminary screening indicators only
- Results should NEVER replace professional clinical assessment
- Always consult qualified healthcare professionals for diagnosis
- False positives and false negatives are possible
- Voice quality can be affected by many factors (illness, fatigue, environment, recording quality)
- Do NOT upload identifiable patient information without proper consent
- Ensure HIPAA/GDPR compliance if processing protected health information
- Audio files may contain sensitive information - handle appropriately
- Use authentication and encryption for production deployments
- This system is designed for research and development purposes
- Validate results with clinical professionals before any clinical application
- Screening criteria are based on research literature but not clinically validated
- Python 3.9+
- Flask 2.0+
- pyannote.audio 3.1
- opensmile 2.4+
- parselmouth 0.4+
- librosa 0.10+
- numpy, pandas, scipy
See happyvoice-gcp/requirements.txt for complete list.
- For Local Testing: 8GB+ RAM recommended
- For GCP Deployment: 4GB RAM, 2 vCPU minimum
- Audio Format: WAV preferred (MP3 also supported)
- Sample Rate: 16kHz+ recommended
- Mono/Stereo: Both supported (mono preferred)
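To smoke-test the pipeline with a correctly formatted input, the standard library can generate a 16 kHz mono 16-bit WAV (the preferred format above); the file name and test tone here are arbitrary:

```python
import math
import struct
import wave

# Write a one-second 440 Hz tone as 16 kHz, mono, 16-bit PCM.
RATE = 16000
with wave.open("probe_16k_mono.wav", "wb") as wf:
    wf.setnchannels(1)     # mono
    wf.setsampwidth(2)     # 16-bit samples
    wf.setframerate(RATE)  # 16 kHz
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * t / RATE)))
        for t in range(RATE)
    )
    wf.writeframes(frames)
```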
This is a research project. Contributions, suggestions, and feedback are welcome!
- Additional clinical screening criteria
- More speaker identification methods
- Support for other languages
- Real-time analysis capabilities
- Mobile/edge deployment optimization
This project is for research and educational purposes.
Please cite if you use this work in your research.
- pyannote.audio - Speaker diarization
- OpenSMILE - Acoustic feature extraction
- Parselmouth - Prosody analysis (Praat wrapper)
- librosa - Audio processing
- Google Cloud Run - Serverless deployment platform
- Clinical screening criteria based on published literature
- See individual documentation files for detailed references
For issues, questions, or suggestions:
- Open an issue on GitHub
- Review documentation in the `/Documentation` folder
- Check test results in `COMBINED_ANALYSIS_TEST_RESULTS.md`
- Multi-language support
- Real-time analysis API
- Longitudinal tracking capabilities
- Integration with EHR systems
- Mobile SDK
- Edge deployment (TensorFlow Lite)
- Automated report generation
HappyVoiceLearn - Advancing early detection through AI-powered voice analysis
Last Updated: January 2025