[Feature] Implement Intelligent Diagnosis Tools for Dubbo Admin AI Agent #1341

@stringl1l1l1l

Description

This issue proposes the implementation of a comprehensive set of intelligent diagnosis tools for the Dubbo Admin AI Agent. The goal is to enhance the AI agent's ability to diagnose and resolve issues in Dubbo microservices across three deployment modes (Universal, Half, K8s) by leveraging multi-dimensional observability data (metrics, logs, traces).

Current State

The current Dubbo Admin AI Agent in the ai/ directory only uses mock tools. While Dubbo Admin has rich observability capabilities and service management APIs, these are not exposed to the AI agent for intelligent diagnosis.

Existing Capabilities Identified

Observability Infrastructure:

  • Prometheus integration with metric collection from Dubbo instances
  • Grafana dashboard integration for visualization
  • Distributed tracing support with dashboard links
  • Comprehensive logging infrastructure based on Zap
  • Real-time metrics and monitoring capabilities

Service Management APIs:

  • Complete CRUD operations for services, instances, and applications
  • Traffic rule management (condition routes, tag routes)
  • Configuration management (timeout, retry, load balancing)
  • Multi-deployment mode support (Universal, Half, K8s)
  • K8s resource models and management capabilities

Missing Integration

These existing capabilities are not yet wired into the AI agent's tools. The agent needs structured APIs to:

  • Query and analyze observability data
  • Perform intelligent diagnosis using LLM reasoning
  • Execute safe recovery operations
  • Provide comprehensive root cause analysis

Proposed Solution

Phase 1: Foundation Tools (High Priority)

1. Metrics Query Tools

  • query_service_metrics - Query basic metrics (QPS, RT, success rate) with filtering
  • get_application_overview - Get application health status and summary
  • analyze_metrics_anomaly - Detect metric anomalies based on historical data
  • compare_instance_performance - Compare performance across multiple instances
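
The issue does not say how analyze_metrics_anomaly should detect anomalies. One simple option, shown here purely as a sketch, is a z-score check of recent samples against a historical baseline. The `Sample` type, the function name, and the threshold parameter are assumptions for illustration.

```go
package main

import "math"

// Sample is one metric observation; the field names are illustrative.
type Sample struct {
	TimestampUnix int64
	Value         float64
}

// DetectAnomalies returns the recent samples whose value lies more than
// threshold sample standard deviations away from the baseline mean.
// With a perfectly flat baseline, any deviation at all is reported.
func DetectAnomalies(baseline, recent []Sample, threshold float64) []Sample {
	if len(baseline) < 2 {
		return nil // not enough history to estimate spread
	}
	var sum float64
	for _, s := range baseline {
		sum += s.Value
	}
	mean := sum / float64(len(baseline))

	var sq float64
	for _, s := range baseline {
		d := s.Value - mean
		sq += d * d
	}
	std := math.Sqrt(sq / float64(len(baseline)-1))

	var anomalies []Sample
	for _, s := range recent {
		dev := math.Abs(s.Value - mean)
		if (std == 0 && dev > 0) || (std > 0 && dev/std > threshold) {
			anomalies = append(anomalies, s)
		}
	}
	return anomalies
}
```

A z-score assumes roughly stationary metrics; for strongly seasonal traffic, a baseline taken from the same time window on previous days would likely work better.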

2. Log Analysis Tools

  • search_service_logs - Search logs by service, instance, keywords, time range
  • analyze_error_logs - Analyze error patterns and frequency
  • correlate_logs_with_metrics - Correlate logs with metric anomalies
  • trace_error_propagation - Track error propagation in call chains
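
To illustrate what "error patterns and frequency" might look like for analyze_error_logs, the sketch below collapses variable numeric parts of error messages into a placeholder and counts how often each resulting pattern occurs. The `LogEntry` shape, the lowercase `"error"` level string, and the digit-masking heuristic are assumptions, not the project's actual log schema.

```go
package main

import (
	"regexp"
	"sort"
)

// LogEntry is a hypothetical, flattened log record.
type LogEntry struct {
	Level   string
	Service string
	Message string
}

// ErrorPattern is a normalized message and the number of times it occurred.
type ErrorPattern struct {
	Pattern string
	Count   int
}

// variableParts matches runs of digits (ports, durations, IP octets, IDs).
var variableParts = regexp.MustCompile(`\d+`)

// AnalyzeErrorPatterns groups error-level entries by normalized message and
// returns the patterns ordered by descending count, then alphabetically.
func AnalyzeErrorPatterns(entries []LogEntry) []ErrorPattern {
	counts := make(map[string]int)
	for _, e := range entries {
		if e.Level != "error" {
			continue
		}
		counts[variableParts.ReplaceAllString(e.Message, "<n>")]++
	}
	patterns := make([]ErrorPattern, 0, len(counts))
	for p, c := range counts {
		patterns = append(patterns, ErrorPattern{Pattern: p, Count: c})
	}
	sort.Slice(patterns, func(i, j int) bool {
		if patterns[i].Count != patterns[j].Count {
			return patterns[i].Count > patterns[j].Count
		}
		return patterns[i].Pattern < patterns[j].Pattern
	})
	return patterns
}
```

Returning a compact list of patterns rather than raw log lines also keeps the payload handed to the LLM small.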

3. Basic Service Management Tools

  • list_applications - List applications with filtering and pagination
  • get_service_details - Get comprehensive service information
  • list_service_instances - List service instances with health status
  • get_instance_status - Get detailed instance status and health checks
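
The "filtering and pagination" behaviour of list_applications could be sketched as follows. The `Application` and `Page` types and the offset/size parameters are hypothetical; a real tool would presumably delegate to the existing application search API rather than filter in memory.

```go
package main

import "strings"

// Application is a hypothetical, trimmed-down view of an application record.
type Application struct {
	Name       string
	DeployMode string // e.g. "Universal", "Half", "K8s"
}

// Page is one page of results plus the total number of matches.
type Page struct {
	Items []Application
	Total int
}

// ListApplications keeps applications whose name contains keyword
// (case-insensitive; an empty keyword matches everything) and returns
// the matches in [offset, offset+size). A non-positive size returns
// all matches from offset onward.
func ListApplications(apps []Application, keyword string, offset, size int) Page {
	kw := strings.ToLower(keyword)
	matched := make([]Application, 0, len(apps))
	for _, a := range apps {
		if strings.Contains(strings.ToLower(a.Name), kw) {
			matched = append(matched, a)
		}
	}
	total := len(matched)
	if offset < 0 {
		offset = 0
	}
	if offset > total {
		offset = total
	}
	end := offset + size
	if size <= 0 || end > total {
		end = total
	}
	return Page{Items: matched[offset:end], Total: total}
}
```

Returning the total alongside the page lets the agent decide whether it needs to request further pages.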

Phase 2: Advanced Analysis Tools (Medium Priority)

4. Distributed Tracing Tools

  • query_service_traces - Query distributed traces with filtering
  • analyze_trace_performance - Analyze trace performance and bottlenecks
  • detect_trace_anomalies - Detect trace anomalies (failures, timeouts)
  • map_service_dependencies - Build service dependency topology
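
One way map_service_dependencies might derive a topology is from parent/child span relationships: whenever a child span belongs to a different service than its parent, the parent's service depends on the child's. The `Span` type below is a simplified assumption and does not correspond to any specific tracing backend's data model.

```go
package main

import "sort"

// Span is a simplified, hypothetical trace span.
type Span struct {
	SpanID   string
	ParentID string // empty for root spans
	Service  string
}

// BuildDependencyMap returns, for each calling service, the sorted list of
// distinct services it calls, derived from cross-service parent/child spans.
func BuildDependencyMap(spans []Span) map[string][]string {
	serviceOf := make(map[string]string, len(spans))
	for _, s := range spans {
		serviceOf[s.SpanID] = s.Service
	}

	seen := make(map[[2]string]bool)
	deps := make(map[string][]string)
	for _, s := range spans {
		parent, ok := serviceOf[s.ParentID]
		if !ok || parent == s.Service {
			continue // root span, missing parent, or in-process child span
		}
		edge := [2]string{parent, s.Service}
		if seen[edge] {
			continue
		}
		seen[edge] = true
		deps[parent] = append(deps[parent], s.Service)
	}
	for caller := range deps {
		sort.Strings(deps[caller])
	}
	return deps
}
```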

5. Traffic Management Tools

  • list_traffic_rules - List all traffic control rules
  • get_traffic_rule_details - Get detailed rule configuration and impact
  • analyze_traffic_distribution - Analyze traffic patterns and anomalies
  • simulate_traffic_impact - Simulate the impact of traffic rule changes

6. Configuration Management Tools

  • get_service_config - Get service configuration details
  • list_config_changes - List configuration change history
  • analyze_config_consistency - Analyze configuration consistency
  • validate_config_changes - Validate the safety of configuration changes
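
The issue leaves "configuration consistency" undefined. A minimal interpretation, sketched below, is to compare the effective key/value configuration of every instance of a service and report the keys whose values differ. The `InstanceConfig` type and the `<unset>` marker are assumptions made for this example.

```go
package main

import "sort"

// InstanceConfig is a hypothetical snapshot of one instance's effective config.
type InstanceConfig struct {
	Instance string
	Values   map[string]string
}

// unset stands in for a key that an instance does not define at all.
const unset = "<unset>"

// FindInconsistentKeys returns every key whose value is not identical across
// all instances, mapped to the sorted list of distinct values observed.
func FindInconsistentKeys(configs []InstanceConfig) map[string][]string {
	keys := make(map[string]bool)
	for _, c := range configs {
		for k := range c.Values {
			keys[k] = true
		}
	}

	result := make(map[string][]string)
	for k := range keys {
		distinct := make(map[string]bool)
		for _, c := range configs {
			v, ok := c.Values[k]
			if !ok {
				v = unset
			}
			distinct[v] = true
		}
		if len(distinct) < 2 {
			continue // all instances agree on this key
		}
		vals := make([]string, 0, len(distinct))
		for v := range distinct {
			vals = append(vals, v)
		}
		sort.Strings(vals)
		result[k] = vals
	}
	return result
}
```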

Phase 3: Intelligent Diagnosis Tools (Low Priority)

7. Cross-Mode Management Tools

  • get_deployment_mode - Get deployment mode information
  • list_k8s_resources - List K8s resources in K8s mode
  • analyze_cross_mode_consistency - Analyze cross-mode consistency
  • migrate_service_mode - Assist with deployment mode migration

8. Intelligent Diagnosis Tools

  • diagnose_service_issues - Comprehensive issue diagnosis using multi-dimensional data (metrics, logs, traces)
  • predict_service_anomalies - Predict potential anomalies based on history
  • generate_recovery_plan - Generate automated recovery plans
  • execute_safe_recovery - Execute safe recovery operations
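
What makes a recovery operation "safe" is not specified in this proposal. The sketch below shows one possible guard: the whole plan is validated against an allowlist before anything runs, and a dry-run mode reports what would happen without applying it. The `RecoveryAction` and `Executor` types and the action kinds used in the usage note are hypothetical.

```go
package main

import "fmt"

// RecoveryAction is one hypothetical step of a generated recovery plan.
type RecoveryAction struct {
	Kind   string // illustrative, e.g. "adjust_timeout"
	Target string
}

// Executor applies recovery plans under an allowlist and optional dry-run.
type Executor struct {
	Allowed map[string]bool            // action kinds the agent may execute
	DryRun  bool                       // when true, nothing is applied
	Apply   func(RecoveryAction) error // performs the real change
}

// Execute validates the entire plan first, then either describes each step
// (dry run) or applies the steps in order, stopping at the first failure.
func (e Executor) Execute(plan []RecoveryAction) ([]string, error) {
	for _, a := range plan {
		if !e.Allowed[a.Kind] {
			return nil, fmt.Errorf("action %q on %q is not allowlisted", a.Kind, a.Target)
		}
	}
	if !e.DryRun && e.Apply == nil {
		return nil, fmt.Errorf("no Apply function configured")
	}

	report := make([]string, 0, len(plan))
	for _, a := range plan {
		if e.DryRun {
			report = append(report, fmt.Sprintf("would run %s on %s", a.Kind, a.Target))
			continue
		}
		if err := e.Apply(a); err != nil {
			return report, fmt.Errorf("%s on %s failed: %w", a.Kind, a.Target, err)
		}
		report = append(report, fmt.Sprintf("ran %s on %s", a.Kind, a.Target))
	}
	return report, nil
}
```

Validating the full plan up front means a plan containing a single disallowed step is rejected before any step runs, so the cluster is never left half-changed for that reason.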

Required API Enhancements

New API Endpoints to Implement

Metrics APIs:

POST /api/v1/metrics/batch-query        # Batch metric queries
POST /api/v1/metrics/anomaly-detection  # Anomaly detection
POST /api/v1/metrics/comparison         # Metric comparison analysis

Log APIs:

POST /api/v1/logs/search                # Log search
POST /api/v1/logs/error-analysis        # Error log analysis
POST /api/v1/logs/correlation           # Log-metric correlation

Trace APIs:

POST /api/v1/traces/query                 # Trace query
POST /api/v1/traces/performance-analysis  # Performance analysis
POST /api/v1/traces/dependency-map        # Dependency map generation

Diagnosis APIs:

POST /api/v1/diagnosis/comprehensive    # Comprehensive diagnosis
POST /api/v1/prediction/anomalies       # Anomaly prediction
POST /api/v1/recovery/plan-generation   # Recovery plan generation

Existing API Enhancements Required

  1. Enhance /api/v1/application/detail - Add health status assessment
  2. Enhance /api/v1/service/detail - Add more related information
  3. Enhance /api/v1/instance/detail - Add health checks and resource usage
  4. Enhance /api/v1/promQL/query - Support batch queries and time ranges
  5. Enhance traffic rule APIs - Add impact analysis and statistics
