Comprehensive AI-powered voice analysis system for pediatric speech and developmental screening
HappyVoiceLearn is an advanced voice analysis system designed to assist in early screening for autism spectrum disorder (ASD), ADHD, and speech disorders in children. The system combines state-of-the-art speaker diarization, acoustic feature extraction, and prosody analysis to provide comprehensive voice assessments.
- Speaker Diarization: Automatically separates the human child's voice from the AI agent's in multi-speaker audio
- 88 Acoustic Features: OpenSMILE eGeMAPSv02 feature extraction (voice quality, pitch, formants, MFCCs)
- Prosody Analysis: Comprehensive analysis of pitch contours, intonation, rhythm, stress, and phrasing (50+ metrics)
- Clinical Screening: Evidence-based screening for ASD, ADHD, and speech disorders
- Cloud-Ready: Optimized for Google Cloud Run with auto-scaling
- Production API: RESTful API with JSON responses for easy integration
- Harmonics-to-Noise Ratio (HNR): Voice breathiness and clarity
- Jitter & Shimmer: Voice stability and consistency
- Pitch (F0): Mean, range, and variability
- Formants: F1, F2, F3 frequencies and bandwidths
- MFCCs: Mel-frequency cepstral coefficients for voice timbre
- Speech Rate: Voiced segments per second
- Loudness: Mean, variability, and dynamic range
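Jitter and shimmer both measure cycle-to-cycle instability of the glottal signal. As a rough illustration of the "local jitter" idea (this is the textbook definition, not OpenSMILE's windowed `jitterLocal_sma3nz` implementation):

```python
def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    normalized by the mean period (the classic 'local jitter' definition)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

# A perfectly periodic voice has zero jitter:
print(local_jitter([0.01, 0.01, 0.01, 0.01, 0.01]))  # 0.0
```

Shimmer is the same construction applied to cycle amplitudes (in dB) instead of periods.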
- Pitch Contour: F0 dynamics, excursions, velocity, acceleration
- Intonation Patterns: Rising, falling, rise-fall, flat classifications
- Rhythm: nPVI, PVI, syllable timing, tempo estimation
- Stress Patterns: Stress rate, intervals, strength, regularity
- Phrasing: Phrase lengths, pause patterns, pause-to-speech ratio
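The nPVI rhythm metric above has a standard closed form over successive interval (syllable/vowel) durations: 100 times the mean of `|d_k − d_(k+1)| / ((d_k + d_(k+1))/2)` over adjacent pairs. A minimal sketch:

```python
def npvi(durations):
    """Normalized Pairwise Variability Index: higher values mean more
    contrast between adjacent syllable durations (stress-timed rhythm)."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)

print(npvi([0.12, 0.12, 0.12]))  # 0.0 -- perfectly even timing
```

The raw PVI is the same computation without the pairwise-mean normalization in the denominator.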
- ASD Screening (8-point scale): Flat intonation, narrow pitch range, atypical prosody
- ADHD Screening (5-point scale): Speech rate variability, irregular rhythm, loudness inconsistency
- Speech Disorder Screening (6-point scale): Voice quality issues, articulation concerns
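Each screen is an additive tally of flagged indicators mapped to a risk level. As a rough illustration of the idea only — the thresholds below are hypothetical placeholders, and the deployed speech-disorder screen uses six criteria, not three:

```python
def screen_speech_disorder(features):
    """Toy three-rule indicator tally. All thresholds are illustrative,
    not the system's actual screening criteria."""
    indicators = []
    if features.get("jitter", 0.0) > 0.02:
        indicators.append(f"Elevated jitter: {features['jitter']}")
    if features.get("hnr_db", 99.0) < 7.0:
        indicators.append(f"Low HNR: {features['hnr_db']} dB")
    if features.get("shimmer_db", 0.0) > 1.0:
        indicators.append(f"High shimmer: {features['shimmer_db']} dB")
    score = len(indicators)
    risk = "high" if score >= 3 else "medium" if score == 2 else "low"
    return {"score": score, "risk_level": risk, "indicators": indicators}

# Values from the example /analyze response trip all three rules:
result = screen_speech_disorder({"jitter": 0.0226, "hnr_db": 5.8, "shimmer_db": 1.237})
print(result["risk_level"])  # high
```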
```
Audio Input (WAV/MP3)
(may contain multiple speakers)
          │
          ▼
Speaker Diarization (pyannote.audio 3.1 pipeline)
  • Identifies speakers
  • Separates the human child from the AI agent
  • Extracts human-only audio segments
          │
          ▼
Feature Extraction (parallel)
  OpenSMILE eGeMAPSv02          Prosody Analysis (Parselmouth)
  • 88 acoustic features        • Pitch contours
  • Voice quality               • Intonation
  • Spectral analysis           • Rhythm (nPVI)
  • MFCCs                       • Stress patterns
                                • Phrasing
          │
          ▼
Clinical Screening
  • ASD indicators (flat intonation, pitch range, etc.)
  • ADHD indicators (variability, rhythm, etc.)
  • Speech disorder indicators (voice quality, etc.)
  • Risk level calculation (low/medium/high)
  • Follow-up recommendations
          │
          ▼
JSON Response
  • All features and metrics
  • Clinical interpretations
  • Risk assessments
  • Recommendations
```
- Python 3.9 or higher
- HuggingFace account and token (for pyannote.audio)
- Google Cloud account (for deployment)
1. Clone the repository:

   ```shell
   git clone https://github.com/yourusername/happyvoicelearn.git
   cd happyvoicelearn
   ```

2. Set up a Python environment:

   ```shell
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```shell
   cd happyvoice-gcp
   pip install -r requirements.txt
   ```

4. Set your HuggingFace token:

   ```shell
   export HF_TOKEN=your_huggingface_token_here
   ```

5. Run the API locally:

   ```shell
   cd src
   python main.py
   ```

6. Test the API:

   ```shell
   # Health check
   curl http://localhost:8080/health

   # Analyze audio
   curl -X POST http://localhost:8080/analyze \
     -F "audio=@path/to/your/audio.wav" \
     -H "Content-Type: multipart/form-data"
   ```
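The same analysis call can be scripted with only the Python standard library. The sketch below builds the JSON/base64 request body documented for the `/analyze` endpoint (the field names come from that section; the service URL, file path, and the `build_analyze_request` helper name are placeholders):

```python
import base64
import json
import urllib.request

def build_analyze_request(base_url, wav_bytes, child_age=8, gender="male"):
    """Build a POST /analyze request with the documented JSON body."""
    payload = {
        "audio_base64": "data:audio/wav;base64,"
        + base64.b64encode(wav_bytes).decode("ascii"),
        "child_age": child_age,
        "gender": gender,
        "skip_diarization": False,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# To send it against a running service:
# req = build_analyze_request("http://localhost:8080",
#                             open("audio.wav", "rb").read())
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```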
```shell
# Set variables
export PROJECT_ID=your-gcp-project-id
export REGION=us-central1
export HF_TOKEN=your_huggingface_token

# Authenticate
gcloud auth login
gcloud config set project $PROJECT_ID

# Enable required APIs
gcloud services enable run.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable artifactregistry.googleapis.com

# Deploy to Cloud Run
cd happyvoice-gcp
gcloud run deploy happyvoicelearn \
  --source . \
  --region $REGION \
  --platform managed \
  --allow-unauthenticated \
  --memory 4Gi \
  --cpu 2 \
  --timeout 600 \
  --set-env-vars "HF_TOKEN=$HF_TOKEN"

# Get service URL
gcloud run services describe happyvoicelearn \
  --region $REGION \
  --format 'value(status.url)'
```

- Cold Start: 30-60 seconds (first request after idle)
- Processing Time: 60-80 seconds per audio file
- Resources: 2 vCPU, 4 GB RAM recommended
- Estimated Cost: ~$0.10 per 1,000 requests
- Auto-scaling: 0 to 10+ instances based on demand
`GET /health` — health check endpoint.
Response:

```json
{
  "status": "healthy",
  "service": "happyvoicelearn-complete",
  "version": "3.0.0",
  "components": {
    "speaker_diarization": "pyannote.audio 3.1",
    "opensmile": "eGeMAPSv02 (88 features)",
    "prosody": "Full prosody analysis"
  },
  "capabilities": [
    "speaker_separation",
    "human_vs_ai_identification",
    "voice_quality_analysis",
    "prosody_analysis",
    "clinical_screening (ASD/ADHD/Speech)"
  ]
}
```

`POST /analyze` — complete voice analysis pipeline.

Request (Multipart Form):

```shell
curl -X POST https://your-service-url/analyze \
  -F "audio=@child_voice.wav"
```

Request (JSON with Base64):

```shell
curl -X POST https://your-service-url/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "audio_base64": "data:audio/wav;base64,UklGRi...",
    "child_age": 8,
    "gender": "male",
    "skip_diarization": false
  }'
```

Response:
```jsonc
{
  "success": true,
  "timestamp": "2025-01-06T00:00:00.000000",
  "processing_info": {
    "speaker_separation_attempted": true,
    "speaker_separation_success": true
  },
  "opensmile": {
    "feature_count": 88,
    "features": {
      "F0semitoneFrom27.5Hz_sma3nz_amean": 30.67,
      "HNRdBACF_sma3nz_amean": 5.84,
      "jitterLocal_sma3nz_amean": 0.0226,
      "shimmerLocaldB_sma3nz_amean": 1.237
      // ... 84 more features
    },
    "interpretation": {
      "pitch_variability": {
        "value": 0.204,
        "status": "normal",
        "concern": false
      },
      "hnr": {
        "value": 5.84,
        "status": "low",
        "concern": true,
        "interpretation": "Breathy/noisy voice quality"
      }
      // ... more interpretations
    }
  },
  "prosody": {
    "pitch_contour": {
      "mean_f0_hz": 157.51,
      "f0_range_hz": 215.8,
      "f0_range_semitones": 22.66,
      "pitch_changes_per_second": 17.25
      // ... more metrics
    },
    "intonation": {
      "pattern_counts": {
        "rise_fall": 11,
        "flat": 9,
        "fall_rise": 5
      },
      "pattern_proportions": {
        "rise_fall": 0.393,
        "flat": 0.321,
        "fall_rise": 0.179
      },
      "intonation_variety": 1.94
    },
    "rhythm": {
      "syllable_count": 76,
      "syllables_per_second": 3.93,
      "npvi": 54.42,
      "pvi": 43.28,
      "tempo_bpm": 125.0
    }
    // ... stress and phrasing metrics
  },
  "screening": {
    "disclaimer": "SCREENING ONLY - Not diagnostic. Always consult qualified professionals.",
    "asd": {
      "score": 2,
      "max_score": 8,
      "risk_level": "medium",
      "indicators": [
        "Elevated shimmer: 1.237 dB",
        "Elevated flat intonation: 32.1%"
      ],
      "recommendation": "Continue monitoring"
    },
    "adhd": {
      "score": 1,
      "max_score": 5,
      "risk_level": "low",
      "indicators": [
        "Elevated speech rate variability: CV=0.26"
      ],
      "recommendation": "Continue monitoring"
    },
    "speech_disorder": {
      "score": 4,
      "max_score": 6,
      "risk_level": "high",
      "indicators": [
        "Elevated jitter: 0.0226",
        "Low HNR: 5.8 dB",
        "High shimmer: 1.237 dB"
      ],
      "recommendation": "Consider speech-language evaluation"
    },
    "summary": {
      "requires_followup": true,
      "primary_concern": "speech_disorder",
      "confidence": "medium"
    }
  }
}
```

`POST /separate-speakers` — speaker separation only (no analysis).
Request:
```shell
curl -X POST https://your-service-url/separate-speakers \
  -F "audio=@multi_speaker_audio.wav"
```

Response:

```json
{
  "success": true,
  "audio_base64": "UklGRi4uLg==",
  "format": "wav"
}
```

- DEPLOYMENT_SUMMARY.md - Complete GCP deployment guide and service information
- COMBINED_ANALYSIS_TEST_RESULTS.md - Detailed test results with clinical interpretations
- PROJECT_COMPLETE.md - Full project overview and capabilities
- HUMAN_VS_AI_IDENTIFICATION_GUIDE.md - Speaker identification methods
- happyvoice-gcp/DEPLOYMENT.md - Detailed deployment instructions
```
GCP_HappyVoice/
├── README.md                          # This file
├── .gitignore                         # Git exclusions
│
├── happyvoice-gcp/                    # Main deployment package
│   ├── src/
│   │   ├── main.py                    # Flask API (all endpoints)
│   │   └── prosody_analyzer.py        # Prosody analysis module
│   ├── requirements.txt               # Python dependencies
│   ├── Dockerfile                     # Container configuration
│   ├── .dockerignore                  # Build optimization
│   ├── DEPLOYMENT.md                  # Detailed deployment guide
│   └── README.md                      # Quick start guide
│
├── OpenSmile_Prosody_Custom/          # Development/testing version
│   ├── main_combined.py               # Combined analysis (local)
│   ├── prosody_analyzer.py            # Prosody module
│   ├── requirements_combined.txt      # Dependencies
│   └── FEATURE_COMPARISON.md          # Feature documentation
│
├── Documentation/
│   ├── DEPLOYMENT_SUMMARY.md          # GCP deployment summary
│   ├── COMBINED_ANALYSIS_TEST_RESULTS.md    # Test results & analysis
│   ├── PROJECT_COMPLETE.md            # Complete project overview
│   ├── HUMAN_VS_AI_IDENTIFICATION_GUIDE.md  # Speaker ID methods
│   ├── SPEAKER_SEPARATION_OPTIONS.md  # Diarization options
│   └── OPENSMILE_COMPLETE_GUIDE.md    # OpenSMILE documentation
│
├── Scripts/
│   ├── run_speaker_diarization.py     # Speaker diarization script
│   ├── identify_human_vs_ai.py        # Human/AI identification
│   ├── test_combined_analysis.py      # Full analysis test
│   └── test_local_opensmile.py        # OpenSMILE test
│
└── opensmile-gcp/                     # Legacy OpenSMILE-only version
    └── README.md
```
```shell
python test_combined_analysis.py
```

This will:

- Load test audio
- Perform speaker diarization (if multi-speaker)
- Extract OpenSMILE features
- Analyze prosody
- Generate clinical screening scores
- Save results to `combined_analysis_results.json`
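The saved results can then be inspected programmatically. A sketch (the `summarize_screening` helper is illustrative; only the summary keys shown in the `/analyze` response above are assumed):

```python
import json

def summarize_screening(results):
    """Pull the one-line takeaway out of a saved analysis result."""
    s = results["screening"]["summary"]
    return "primary concern: {} (follow-up: {})".format(
        s["primary_concern"], "yes" if s["requires_followup"] else "no"
    )

# With a saved results file:
# with open("combined_analysis_results.json") as f:
#     print(summarize_screening(json.load(f)))

sample = {"screening": {"summary": {"requires_followup": True,
                                    "primary_concern": "speech_disorder"}}}
print(summarize_screening(sample))  # primary concern: speech_disorder (follow-up: yes)
```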
```shell
python run_speaker_diarization.py
```

Separates multi-speaker audio and saves human-only segments.
THIS SYSTEM IS FOR SCREENING PURPOSES ONLY - NOT DIAGNOSTIC
- This tool provides preliminary screening indicators only
- Results should NEVER replace professional clinical assessment
- Always consult qualified healthcare professionals for diagnosis
- False positives and false negatives are possible
- Voice quality can be affected by many factors (illness, fatigue, environment, recording quality)
- Do NOT upload identifiable patient information without proper consent
- Ensure HIPAA/GDPR compliance if processing protected health information
- Audio files may contain sensitive information - handle appropriately
- Use authentication and encryption for production deployments
- This system is designed for research and development purposes
- Validate results with clinical professionals before any clinical application
- Screening criteria are based on research literature but not clinically validated
- Python 3.9+
- Flask 2.0+
- pyannote.audio 3.1
- opensmile 2.4+
- parselmouth 0.4+
- librosa 0.10+
- numpy, pandas, scipy
See happyvoice-gcp/requirements.txt for complete list.
- For Local Testing: 8GB+ RAM recommended
- For GCP Deployment: 4GB RAM, 2 vCPU minimum
- Audio Format: WAV preferred (MP3 also supported)
- Sample Rate: 16kHz+ recommended
- Mono/Stereo: Both supported (mono preferred)
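To smoke-test the pipeline with a correctly formatted input, the standard library can generate a 16 kHz mono 16-bit WAV (the preferred format above); the file name and test tone here are arbitrary:

```python
import math
import struct
import wave

# Write a one-second 440 Hz tone as 16 kHz, mono, 16-bit PCM.
RATE = 16000
with wave.open("probe_16k_mono.wav", "wb") as wf:
    wf.setnchannels(1)     # mono
    wf.setsampwidth(2)     # 16-bit samples
    wf.setframerate(RATE)  # 16 kHz
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * t / RATE)))
        for t in range(RATE)
    )
    wf.writeframes(frames)
```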
This is a research project. Contributions, suggestions, and feedback are welcome!
- Additional clinical screening criteria
- More speaker identification methods
- Support for other languages
- Real-time analysis capabilities
- Mobile/edge deployment optimization
This project is for research and educational purposes.
Please cite if you use this work in your research.
- pyannote.audio - Speaker diarization
- OpenSMILE - Acoustic feature extraction
- Parselmouth - Prosody analysis (Praat wrapper)
- librosa - Audio processing
- Google Cloud Run - Serverless deployment platform
- Clinical screening criteria based on published literature
- See individual documentation files for detailed references
For issues, questions, or suggestions:
- Open an issue on GitHub
- Review documentation in the `/Documentation` folder
- Check test results in `COMBINED_ANALYSIS_TEST_RESULTS.md`
- Multi-language support
- Real-time analysis API
- Longitudinal tracking capabilities
- Integration with EHR systems
- Mobile SDK
- Edge deployment (TensorFlow Lite)
- Automated report generation
HappyVoiceLearn - Advancing early detection through AI-powered voice analysis
Last Updated: January 2025