HappyVoiceLearn

Comprehensive AI-powered voice analysis system for pediatric speech and developmental screening


HappyVoiceLearn is an advanced voice analysis system designed to assist in early screening for autism spectrum disorder (ASD), ADHD, and speech disorders in children. The system combines state-of-the-art speaker diarization, acoustic feature extraction, and prosody analysis to provide comprehensive voice assessments.


🎯 Key Features

  • 🎤 Speaker Diarization: Automatically separates the human child's voice from the AI agent in multi-speaker audio
  • 🔊 88 Acoustic Features: OpenSMILE eGeMAPSv02 feature extraction (voice quality, pitch, formants, MFCCs)
  • 🎵 Prosody Analysis: Comprehensive analysis of pitch contours, intonation, rhythm, stress, and phrasing (50+ metrics)
  • 🩺 Clinical Screening: Evidence-based screening for ASD, ADHD, and speech disorders
  • ☁️ Cloud-Ready: Optimized for Google Cloud Run with auto-scaling
  • 🚀 Production API: RESTful API with JSON responses for easy integration

📊 What It Analyzes

Voice Quality (OpenSMILE)

  • Harmonics-to-Noise Ratio (HNR): Voice breathiness and clarity
  • Jitter & Shimmer: Voice stability and consistency
  • Pitch (F0): Mean, range, and variability
  • Formants: F1, F2, F3 frequencies and bandwidths
  • MFCCs: Mel-frequency cepstral coefficients for voice timbre
  • Speech Rate: Voiced segments per second
  • Loudness: Mean, variability, and dynamic range

Prosody Features (Parselmouth + librosa)

  • Pitch Contour: F0 dynamics, excursions, velocity, acceleration
  • Intonation Patterns: Rising, falling, rise-fall, flat classifications
  • Rhythm: nPVI, PVI, syllable timing, tempo estimation
  • Stress Patterns: Stress rate, intervals, strength, regularity
  • Phrasing: Phrase lengths, pause patterns, pause-to-speech ratio
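
The nPVI rhythm metric listed above (normalized Pairwise Variability Index) can be illustrated with a short, self-contained sketch. This is not the project's implementation, just the standard nPVI formula applied to a hypothetical list of syllable durations:

```python
def npvi(durations):
    """Normalized Pairwise Variability Index over successive durations.

    nPVI = 100 * mean(|d_k - d_{k+1}| / ((d_k + d_{k+1}) / 2))
    Higher values indicate more variable, stress-timed-like rhythm.
    """
    if len(durations) < 2:
        return 0.0
    terms = [abs(a - b) / ((a + b) / 2)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)

# Hypothetical syllable durations in seconds
print(round(npvi([0.20, 0.10, 0.30, 0.15]), 1))  # 77.8
```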

Clinical Screening

  • ASD Screening (8-point scale): Flat intonation, narrow pitch range, atypical prosody
  • ADHD Screening (5-point scale): Speech rate variability, irregular rhythm, loudness inconsistency
  • Speech Disorder Screening (6-point scale): Voice quality issues, articulation concerns
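
The README does not spell out how point scores map to risk levels; the sketch below is a hypothetical mapping (thresholds are assumptions, chosen only to be consistent with the sample /analyze response shown later, where 2/8 is medium, 1/5 is low, and 4/6 is high):

```python
def risk_level(score, max_score):
    """Map a screening score to a coarse risk band.

    Thresholds are illustrative, not the system's actual cutoffs:
    ratio < 0.25 -> low, ratio < 0.6 -> medium, otherwise high.
    """
    ratio = score / max_score
    if ratio < 0.25:
        return "low"
    if ratio < 0.6:
        return "medium"
    return "high"

print(risk_level(2, 8))  # medium (ASD sample score)
print(risk_level(1, 5))  # low    (ADHD sample score)
print(risk_level(4, 6))  # high   (speech disorder sample score)
```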

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      Audio Input (WAV/MP3)                      │
│                 (May contain multiple speakers)                 │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Speaker Diarization                        │
│                 (pyannote.audio 3.1 pipeline)                   │
│                                                                 │
│  • Identifies speakers                                          │
│  • Separates human child from AI agent                          │
│  • Extracts human-only audio segments                           │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Feature Extraction (Parallel)                  │
│                                                                 │
│  ┌──────────────────────┐         ┌──────────────────────┐      │
│  │   OpenSMILE          │         │  Prosody Analysis    │      │
│  │   eGeMAPSv02         │         │  (Parselmouth)       │      │
│  │                      │         │                      │      │
│  │  • 88 acoustic       │         │  • Pitch contours    │      │
│  │    features          │         │  • Intonation        │      │
│  │  • Voice quality     │         │  • Rhythm (nPVI)     │      │
│  │  • Spectral analysis │         │  • Stress patterns   │      │
│  │  • MFCCs             │         │  • Phrasing          │      │
│  └──────────────────────┘         └──────────────────────┘      │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Clinical Screening                         │
│                                                                 │
│  • ASD indicators (flat intonation, pitch range, etc.)          │
│  • ADHD indicators (variability, rhythm, etc.)                  │
│  • Speech disorder indicators (voice quality, etc.)             │
│  • Risk level calculation (low/medium/high)                     │
│  • Follow-up recommendations                                    │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                        JSON Response                            │
│                                                                 │
│  • All features and metrics                                     │
│  • Clinical interpretations                                     │
│  • Risk assessments                                             │
│  • Recommendations                                              │
└─────────────────────────────────────────────────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.9 or higher
  • HuggingFace account and token (for pyannote.audio)
  • Google Cloud account (for deployment)

Local Testing

  1. Clone the repository:

    git clone https://github.com/yourusername/happyvoicelearn.git
    cd happyvoicelearn
  2. Set up Python environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    cd happyvoice-gcp
    pip install -r requirements.txt
  4. Set HuggingFace token:

    export HF_TOKEN=your_huggingface_token_here
  5. Run the API locally:

    cd src
    python main.py
  6. Test the API:

    # Health check
    curl http://localhost:8080/health
    
    # Analyze audio
    curl -X POST http://localhost:8080/analyze \
      -F "audio=@path/to/your/audio.wav"

☁️ Google Cloud Deployment

Deploy to Cloud Run

# Set variables
export PROJECT_ID=your-gcp-project-id
export REGION=us-central1
export HF_TOKEN=your_huggingface_token

# Authenticate
gcloud auth login
gcloud config set project $PROJECT_ID

# Enable required APIs
gcloud services enable run.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable artifactregistry.googleapis.com

# Deploy to Cloud Run
cd happyvoice-gcp
gcloud run deploy happyvoicelearn \
  --source . \
  --region $REGION \
  --platform managed \
  --allow-unauthenticated \
  --memory 4Gi \
  --cpu 2 \
  --timeout 600 \
  --set-env-vars "HF_TOKEN=$HF_TOKEN"

# Get service URL
gcloud run services describe happyvoicelearn \
  --region $REGION \
  --format 'value(status.url)'

Performance & Cost

  • Cold Start: 30-60 seconds (first request after idle)
  • Processing Time: 60-80 seconds per audio file
  • Resources: 2 vCPU, 4GB RAM recommended
  • Estimated Cost: ~$0.10 per 1000 requests
  • Auto-scaling: 0 to 10+ instances based on demand

📡 API Reference

Endpoints

GET /health

Health check endpoint.

Response:

{
  "status": "healthy",
  "service": "happyvoicelearn-complete",
  "version": "3.0.0",
  "components": {
    "speaker_diarization": "pyannote.audio 3.1",
    "opensmile": "eGeMAPSv02 (88 features)",
    "prosody": "Full prosody analysis"
  },
  "capabilities": [
    "speaker_separation",
    "human_vs_ai_identification",
    "voice_quality_analysis",
    "prosody_analysis",
    "clinical_screening (ASD/ADHD/Speech)"
  ]
}

POST /analyze

Complete voice analysis pipeline.

Request (Multipart Form):

curl -X POST https://your-service-url/analyze \
  -F "audio=@child_voice.wav"

Request (JSON with Base64):

curl -X POST https://your-service-url/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "audio_base64": "data:audio/wav;base64,UklGRi...",
    "child_age": 8,
    "gender": "male",
    "skip_diarization": false
  }'
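
For clients that cannot send multipart form data, the JSON body above can be assembled with the standard library alone. A minimal sketch (the field names follow the example request above; `build_analyze_payload` is an illustrative helper, not part of the API):

```python
import base64
import json

def build_analyze_payload(wav_bytes, child_age=8, gender="male"):
    """Encode raw WAV bytes into the JSON body expected by POST /analyze."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "audio_base64": "data:audio/wav;base64," + encoded,
        "child_age": child_age,
        "gender": gender,
        "skip_diarization": False,
    }

# Dummy bytes for illustration; read a real file in practice
payload = build_analyze_payload(b"RIFF....WAVEfmt ")
body = json.dumps(payload)  # POST with Content-Type: application/json
```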

Response:

{
  "success": true,
  "timestamp": "2025-01-06T00:00:00.000000",
  "processing_info": {
    "speaker_separation_attempted": true,
    "speaker_separation_success": true
  },
  "opensmile": {
    "feature_count": 88,
    "features": {
      "F0semitoneFrom27.5Hz_sma3nz_amean": 30.67,
      "HNRdBACF_sma3nz_amean": 5.84,
      "jitterLocal_sma3nz_amean": 0.0226,
      "shimmerLocaldB_sma3nz_amean": 1.237,
      // ... 84 more features
    },
    "interpretation": {
      "pitch_variability": {
        "value": 0.204,
        "status": "normal",
        "concern": false
      },
      "hnr": {
        "value": 5.84,
        "status": "low",
        "concern": true,
        "interpretation": "Breathy/noisy voice quality"
      }
      // ... more interpretations
    }
  },
  "prosody": {
    "pitch_contour": {
      "mean_f0_hz": 157.51,
      "f0_range_hz": 215.8,
      "f0_range_semitones": 22.66,
      "pitch_changes_per_second": 17.25
      // ... more metrics
    },
    "intonation": {
      "pattern_counts": {
        "rise_fall": 11,
        "flat": 9,
        "fall_rise": 5
      },
      "pattern_proportions": {
        "rise_fall": 0.393,
        "flat": 0.321,
        "fall_rise": 0.179
      },
      "intonation_variety": 1.94
    },
    "rhythm": {
      "syllable_count": 76,
      "syllables_per_second": 3.93,
      "npvi": 54.42,
      "pvi": 43.28,
      "tempo_bpm": 125.0
    }
    // ... stress and phrasing metrics
  },
  "screening": {
    "disclaimer": "SCREENING ONLY - Not diagnostic. Always consult qualified professionals.",
    "asd": {
      "score": 2,
      "max_score": 8,
      "risk_level": "medium",
      "indicators": [
        "Elevated shimmer: 1.237 dB",
        "Elevated flat intonation: 32.1%"
      ],
      "recommendation": "Continue monitoring"
    },
    "adhd": {
      "score": 1,
      "max_score": 5,
      "risk_level": "low",
      "indicators": [
        "Elevated speech rate variability: CV=0.26"
      ],
      "recommendation": "Continue monitoring"
    },
    "speech_disorder": {
      "score": 4,
      "max_score": 6,
      "risk_level": "high",
      "indicators": [
        "Elevated jitter: 0.0226",
        "Low HNR: 5.8 dB",
        "High shimmer: 1.237 dB"
      ],
      "recommendation": "Consider speech-language evaluation"
    },
    "summary": {
      "requires_followup": true,
      "primary_concern": "speech_disorder",
      "confidence": "medium"
    }
  }
}
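
A client will typically want to pull out the screens that need attention. A small sketch of that, using the field names from the sample response above (`flagged_screens` is an illustrative helper, not part of the API):

```python
def flagged_screens(response):
    """Return (name, risk_level, recommendation) for non-low-risk screens.

    Field names follow the sample /analyze response in this README.
    """
    out = []
    screening = response.get("screening", {})
    for name in ("asd", "adhd", "speech_disorder"):
        screen = screening.get(name, {})
        if screen.get("risk_level") in ("medium", "high"):
            out.append((name, screen["risk_level"],
                        screen.get("recommendation")))
    return out

# Applied to the sample response above, this keeps the asd (medium) and
# speech_disorder (high) entries and skips adhd (low).
```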

POST /separate-speakers

Speaker separation only (no analysis).

Request:

curl -X POST https://your-service-url/separate-speakers \
  -F "audio=@multi_speaker_audio.wav"

Response:

{
  "success": true,
  "audio_base64": "UklGRi4uLg==",
  "format": "wav"
}
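
The returned audio can be decoded back to a playable file with the standard library (`save_separated_audio` is an illustrative helper, not part of the API):

```python
import base64

def save_separated_audio(response, path):
    """Decode the base64 WAV returned by POST /separate-speakers to a file."""
    wav_bytes = base64.b64decode(response["audio_base64"])
    with open(path, "wb") as f:
        f.write(wav_bytes)
    return wav_bytes

# The truncated sample value above decodes to bytes starting with the
# RIFF magic number of a WAV file:
print(base64.b64decode("UklGRi4uLg==")[:4])  # b'RIFF'
```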

πŸ“ Project Structure

GCP_HappyVoice/
├── README.md                                    # This file
├── .gitignore                                   # Git exclusions
│
├── happyvoice-gcp/                              # Main deployment package
│   ├── src/
│   │   ├── main.py                              # Flask API (all endpoints)
│   │   └── prosody_analyzer.py                  # Prosody analysis module
│   ├── requirements.txt                         # Python dependencies
│   ├── Dockerfile                               # Container configuration
│   ├── .dockerignore                            # Build optimization
│   ├── DEPLOYMENT.md                            # Detailed deployment guide
│   └── README.md                                # Quick start guide
│
├── OpenSmile_Prosody_Custom/                    # Development/testing version
│   ├── main_combined.py                         # Combined analysis (local)
│   ├── prosody_analyzer.py                      # Prosody module
│   ├── requirements_combined.txt                # Dependencies
│   └── FEATURE_COMPARISON.md                    # Feature documentation
│
├── Documentation/
│   ├── DEPLOYMENT_SUMMARY.md                    # GCP deployment summary
│   ├── COMBINED_ANALYSIS_TEST_RESULTS.md        # Test results & analysis
│   ├── PROJECT_COMPLETE.md                      # Complete project overview
│   ├── HUMAN_VS_AI_IDENTIFICATION_GUIDE.md      # Speaker ID methods
│   ├── SPEAKER_SEPARATION_OPTIONS.md            # Diarization options
│   └── OPENSMILE_COMPLETE_GUIDE.md              # OpenSMILE documentation
│
├── Scripts/
│   ├── run_speaker_diarization.py               # Speaker diarization script
│   ├── identify_human_vs_ai.py                  # Human/AI identification
│   ├── test_combined_analysis.py                # Full analysis test
│   └── test_local_opensmile.py                  # OpenSMILE test
│
└── opensmile-gcp/                               # Legacy OpenSMILE-only version
    └── README.md

🧪 Testing

Run Combined Analysis Test

python test_combined_analysis.py

This will:

  1. Load test audio
  2. Perform speaker diarization (if multi-speaker)
  3. Extract OpenSMILE features
  4. Analyze prosody
  5. Generate clinical screening scores
  6. Save results to combined_analysis_results.json

Speaker Diarization Test

python run_speaker_diarization.py

Separates multi-speaker audio and saves human-only segments.


⚠️ Important Disclaimers

Clinical Use

THIS SYSTEM IS FOR SCREENING PURPOSES ONLY - NOT DIAGNOSTIC

  • This tool provides preliminary screening indicators only
  • Results should NEVER replace professional clinical assessment
  • Always consult qualified healthcare professionals for diagnosis
  • False positives and false negatives are possible
  • Voice quality can be affected by many factors (illness, fatigue, environment, recording quality)

Privacy & Security

  • Do NOT upload identifiable patient information without proper consent
  • Ensure HIPAA/GDPR compliance if processing protected health information
  • Audio files may contain sensitive information - handle appropriately
  • Use authentication and encryption for production deployments

Research Use

  • This system is designed for research and development purposes
  • Validate results with clinical professionals before any clinical application
  • Screening criteria are based on research literature but not clinically validated

🔧 Technical Requirements

Python Dependencies

  • Python 3.9+
  • Flask 2.0+
  • pyannote.audio 3.1
  • opensmile 2.4+
  • parselmouth 0.4+
  • librosa 0.10+
  • numpy, pandas, scipy

See happyvoice-gcp/requirements.txt for the complete list.

System Requirements

  • For Local Testing: 8GB+ RAM recommended
  • For GCP Deployment: 4GB RAM, 2 vCPU minimum
  • Audio Format: WAV preferred (MP3 also supported)
  • Sample Rate: 16kHz+ recommended
  • Mono/Stereo: Both supported (mono preferred)
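
The sample-rate and channel recommendations above can be checked locally before uploading, using only the standard library's `wave` module. A minimal sketch (the function name is illustrative; `wave` reads only WAV, so MP3 files would need a separate decoder):

```python
import wave

def check_wav(path, min_rate=16000):
    """Return (sample_rate, channels, ok) for a WAV file.

    ok is True when the file meets the recommended 16 kHz+ sample rate.
    """
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        channels = wf.getnchannels()
    return rate, channels, rate >= min_rate
```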

🤝 Contributing

This is a research project. Contributions, suggestions, and feedback are welcome!

Areas for Improvement

  • Additional clinical screening criteria
  • More speaker identification methods
  • Support for other languages
  • Real-time analysis capabilities
  • Mobile/edge deployment optimization

📄 License

This project is for research and educational purposes.

Please cite if you use this work in your research.


πŸ™ Acknowledgments

Research References

  • Clinical screening criteria based on published literature
  • See individual documentation files for detailed references

📞 Support

For issues, questions, or suggestions:

  • Open an issue on GitHub
  • Review documentation in the /Documentation folder
  • Check test results in COMBINED_ANALYSIS_TEST_RESULTS.md

πŸ—ΊοΈ Roadmap

  • Multi-language support
  • Real-time analysis API
  • Longitudinal tracking capabilities
  • Integration with EHR systems
  • Mobile SDK
  • Edge deployment (TensorFlow Lite)
  • Automated report generation

HappyVoiceLearn - Advancing early detection through AI-powered voice analysis

Last Updated: January 2025
