ARCHITECTURE_FULL

Nick edited this page Mar 10, 2026 · 1 revision

PATAS Architecture - Complete Documentation

Version: 2.0.0
Based on: Real PATAS Core v2 code


Overview

PATAS Core is a generic engine for:

  • Analyzing large message corpora (spam / not_spam)
  • Automatically discovering spam patterns
  • Generating machine-readable blocking rules
  • Evaluating rules on real traffic
  • Promoting good rules and deactivating bad ones

Key Principle: PATAS Core is generic and reusable. It works with abstract domain models and can be wrapped by different profiles (Telegram on-prem, public API, etc.) without modification.


Data Models

Message

Normalized message storage from external logs or CSV imports.

Fields:

  • id - Internal ID (Integer, primary key)
  • external_id - External message ID (String, unique, for idempotency)
  • timestamp - Message timestamp (DateTime, timezone-aware, indexed)
  • text - Message text content (Text, required)
  • meta - JSON metadata (JSON, optional): channel, language, country, sender, source, etc.
  • is_spam - Optional spam label (Boolean, optional, indexed)
  • tas_action - TAS action (String, optional, indexed): 'blocked' / 'allowed'
  • user_complaint - User-reported spam (Boolean, default: False, indexed)
  • unbanned - Whether message/user was unbanned (Boolean, default: False)
  • created_at - Creation timestamp (DateTime, auto)

Indexes:

  • ix_messages_timestamp_spam (timestamp, is_spam)
  • ix_messages_tas_action (tas_action)
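
As a rough illustration of the Message layout above, the sketch below creates an equivalent table in an in-memory SQLite database. The DDL and the column types are assumptions made for illustration (SQLite approximations of the listed types); the real schema lives in the project's ORM models, which this page does not show.

```python
import sqlite3

# Hypothetical DDL mirroring the Message fields listed above.
# Column types are SQLite approximations, not the project's actual schema.
MESSAGES_DDL = """
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    external_id TEXT UNIQUE,            -- used for idempotent ingestion
    timestamp TEXT,                     -- ISO 8601 string with UTC offset
    text TEXT NOT NULL,
    meta TEXT,                          -- JSON serialized as text
    is_spam INTEGER,                    -- NULL means "unlabeled"
    tas_action TEXT,                    -- 'blocked' / 'allowed'
    user_complaint INTEGER NOT NULL DEFAULT 0,
    unbanned INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX ix_messages_timestamp_spam ON messages (timestamp, is_spam);
CREATE INDEX ix_messages_tas_action ON messages (tas_action);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the messages table and its two indexes."""
    conn.executescript(MESSAGES_DDL)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# Collect the explicitly named indexes (SQLite's automatic index for the
# UNIQUE constraint has a different name prefix and is not included).
index_names = {
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'ix%'"
    )
}
```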

Pattern

Discovered spam patterns.

Fields:

  • id - Pattern ID (Integer, primary key)
  • type - Pattern type (PatternType enum, required, indexed):
    • URL - URL patterns
    • PHONE - Phone number patterns
    • TEXT - Text patterns
    • META - Metadata patterns
    • SIGNATURE - Message signature patterns
    • KEYWORD - Keyword patterns
  • description - Human-readable description (Text, required)
  • examples - Representative message texts (JSON array, optional)
  • created_at - Creation timestamp (DateTime, auto)
  • updated_at - Update timestamp (DateTime, auto)

Relations:

  • rules - One-to-many with Rule

Rule

SQL blocking rules with lifecycle management.

Fields:

  • id - Rule ID (Integer, primary key)
  • pattern_id - Associated pattern (Integer, FK, optional, indexed)
  • sql_expression - Safe SELECT query (Text, required)
  • status - Lifecycle state (RuleStatus enum, required, indexed):
    • CANDIDATE - New rule created by pattern mining
    • SHADOW - Rule in shadow evaluation
    • ACTIVE - Active rule, ready for export
    • DEPRECATED - Deprecated rule
  • origin - Origin (String, required, default: 'llm'): 'llm', 'pattern_mining', 'manual'
  • created_at - Creation timestamp (DateTime, auto)
  • updated_at - Update timestamp (DateTime, auto)

Indexes:

  • ix_rules_status_updated (status, updated_at)

Relations:

  • pattern - Many-to-one with Pattern
  • evaluations - One-to-many with RuleEvaluation

RuleEvaluation

Rule evaluation metrics.

Fields:

  • id - Evaluation ID (Integer, primary key)
  • rule_id - Associated rule (Integer, FK, required, indexed)
  • time_period_start - Evaluation window start (DateTime, required)
  • time_period_end - Evaluation window end (DateTime, required)
  • hits_total - Total messages matched (Integer, required)
  • spam_hits - Spam messages matched (Integer, required)
  • ham_hits - Non-spam messages matched (Integer, required)
  • precision - spam_hits / hits_total (Float, optional)
  • recall - spam_hits / total spam messages in the evaluation window; requires the total spam count (Float, optional)
  • coverage - hits_total / total_messages (Float, optional)
  • created_at - Creation timestamp (DateTime, auto)

Indexes:

  • ix_rule_evaluations_rule_created (rule_id, created_at)

Relations:

  • rule - Many-to-one with Rule
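
The metric definitions above can be written out as a small helper. This is a minimal sketch, not the project's code: the function name `compute_metrics` is hypothetical, and it assumes `hits_total = spam_hits + ham_hits` (that is, unlabeled messages are ignored). Metrics whose denominator is missing or zero are returned as `None`, matching the "optional" columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationMetrics:
    hits_total: int
    spam_hits: int
    ham_hits: int
    precision: Optional[float]
    recall: Optional[float]
    coverage: Optional[float]


def compute_metrics(
    spam_hits: int,
    ham_hits: int,
    total_messages: int,
    total_spam: Optional[int] = None,
) -> EvaluationMetrics:
    """Derive precision, recall and coverage from raw hit counts."""
    # Assumption: every matched message is labeled spam or ham.
    hits_total = spam_hits + ham_hits
    precision = spam_hits / hits_total if hits_total else None
    coverage = hits_total / total_messages if total_messages else None
    # Recall needs the number of spam messages in the evaluation window.
    recall = spam_hits / total_spam if total_spam else None
    return EvaluationMetrics(
        hits_total, spam_hits, ham_hits, precision, recall, coverage
    )


# 50 matches out of 1000 messages, 45 of them spam, 90 spam messages overall.
metrics = compute_metrics(spam_hits=45, ham_hits=5, total_messages=1000, total_spam=90)
```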

Core Services

1. TASLogIngester (app/v2_ingestion.py)

Ingests external logs into normalized Message storage.

Methods:

  • ingest_from_tas_api() - Pull from external API (HTTP client with retry)
  • ingest_from_tas_storage() - Read from files/DB (JSON/CSV)
  • ingest_batch() - Idempotent batch ingestion
  • ingest_from_csv() - CSV import

Features:

  • Idempotency via external_id
  • Support for multiple sources (API, storage, CSV)
  • Large file processing via streaming
  • Retry logic for HTTP requests
  • Error Handling: httpx.*, IOError, OSError
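
Idempotency via external_id can be sketched with a unique constraint plus an insert that skips duplicates. The example below uses SQLite's `INSERT OR IGNORE` and a reduced, hypothetical `messages` table; how the real `ingest_batch()` detects duplicates is not shown on this page, so treat this as one possible approach under those assumptions.

```python
import sqlite3
from typing import Mapping, Sequence


def ingest_batch(conn: sqlite3.Connection, records: Sequence[Mapping]) -> int:
    """Insert records, skipping any external_id that is already stored.

    Returns the number of rows actually inserted.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO messages (external_id, text, is_spam) "
        "VALUES (:external_id, :text, :is_spam)",
        records,
    )
    conn.commit()
    # Ignored duplicates do not count towards total_changes.
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
# Reduced table for the demo; the UNIQUE constraint is what makes re-runs safe.
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, external_id TEXT UNIQUE, "
    "text TEXT NOT NULL, is_spam INTEGER)"
)

batch = [
    {"external_id": "m-1", "text": "hello", "is_spam": 0},
    {"external_id": "m-2", "text": "win a prize", "is_spam": 1},
]
first = ingest_batch(conn, batch)
# Re-sending the same batch plus one new record only inserts the new one.
second = ingest_batch(
    conn, batch + [{"external_id": "m-3", "text": "free crypto", "is_spam": 1}]
)
```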

2. PatternMiningPipeline (app/v2_pattern_mining.py)

Mines patterns from Message batches.

Methods:

  • mine_patterns() - Main entry point
    • Parameters: days, min_spam_count, use_llm, use_semantic, enable_llm_validation
    • Returns: {patterns_created, rules_created, messages_processed, spam_count, ham_count}
  • _extract_and_aggregate() - Feature extraction and aggregation
  • _generate_patterns_and_rules() - Creates Pattern and Rule objects
  • _llm_pattern_discovery() - LLM-based discovery of semantic patterns
  • _process_llm_rule() - Processes LLM-suggested rules

Features:

  • Chunked processing for large datasets (chunk_size)
  • Aggregates signals before LLM calls
  • Minimizes LLM usage through compact summaries
  • Creates Pattern records and candidate Rule objects
  • Support for semantic mining via embedding engine
  • LLM validation via v2_sql_llm_validator

Pattern Types:

  • URL patterns
  • Keyword patterns
  • Signature patterns
  • Semantic clusters (if enabled)
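
To make the "aggregate signals first" idea concrete, here is a minimal sketch of URL signal aggregation: domains are extracted from each message, counted separately for spam and ham, and only domains seen in at least `min_spam_count` spam messages are kept. The helper names, the regular expression and the output shape are assumptions for illustration, not the pipeline's actual internals.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlparse

# Deliberately simple URL matcher; real-world extraction is usually stricter.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_domains(text: str) -> List[str]:
    """Return the lowercased host of every http(s) URL found in text."""
    domains = []
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            domains.append(host.lower())
    return domains


def aggregate_url_signals(
    messages: Iterable[Tuple[str, bool]], min_spam_count: int = 2
) -> Dict[str, Dict[str, int]]:
    """Count per-domain spam/ham messages and keep frequent spam domains."""
    spam_counts: Counter = Counter()
    ham_counts: Counter = Counter()
    for text, is_spam in messages:
        # set() so a domain repeated inside one message counts once.
        for domain in set(extract_domains(text)):
            (spam_counts if is_spam else ham_counts)[domain] += 1
    return {
        domain: {"spam": count, "ham": ham_counts[domain]}
        for domain, count in spam_counts.items()
        if count >= min_spam_count
    }


sample = [
    ("Win now http://spam.example/a", True),
    ("Cheap deals http://spam.example/b !!!", True),
    ("See https://docs.example/page", False),
    ("http://once.example", True),
]
signals = aggregate_url_signals(sample, min_spam_count=2)
```

A compact summary like `signals` is the kind of input that can be handed to an LLM instead of raw message text.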

3. RuleLifecycleService (app/v2_rule_lifecycle.py)

Manages the lifecycle state machine for rules.

States: candidate → shadow → active → deprecated

Methods:

  • create_candidate_rule() - Creates a new rule in candidate status
  • move_to_shadow() - Transitions candidate → shadow (for evaluation)
  • promote_to_active() - Promotes shadow → active (after successful evaluation)
  • deprecate_rule() - Deprecates a rule (from any status)
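
The lifecycle can be modeled as an explicit transition table. The sketch below is an illustration only: the enum values, the `ALLOWED_TRANSITIONS` table and the helper names are hypothetical, and it assumes that DEPRECATED is terminal while every other status may be deprecated.

```python
from enum import Enum


class RuleStatus(Enum):
    CANDIDATE = "candidate"
    SHADOW = "shadow"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


# Assumed transition table: forward moves one step at a time,
# deprecation allowed from any non-deprecated status.
ALLOWED_TRANSITIONS = {
    RuleStatus.CANDIDATE: {RuleStatus.SHADOW, RuleStatus.DEPRECATED},
    RuleStatus.SHADOW: {RuleStatus.ACTIVE, RuleStatus.DEPRECATED},
    RuleStatus.ACTIVE: {RuleStatus.DEPRECATED},
    RuleStatus.DEPRECATED: set(),
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in the transition table."""


def can_transition(current: RuleStatus, target: RuleStatus) -> bool:
    return target in ALLOWED_TRANSITIONS[current]


def transition(current: RuleStatus, target: RuleStatus) -> RuleStatus:
    """Return the new status or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target


status = RuleStatus.CANDIDATE
status = transition(status, RuleStatus.SHADOW)
status = transition(status, RuleStatus.ACTIVE)
```

Keeping the table in one place makes it easy to reject shortcuts such as candidate → active, which would skip shadow evaluation.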

4. ShadowEvaluationService (app/v2_shadow_evaluation.py)

Evaluates rules in shadow mode on real data.

Methods:

  • evaluate_rule(rule_id, days) - Evaluates a single rule
  • evaluate_all_shadow_rules(days) - Evaluates all shadow rules

Process:

  1. Executes SQL from sql_expression on real messages
  2. Calculates metrics: hits_total, spam_hits, ham_hits
  3. Computes: precision, coverage
  4. Creates RuleEvaluation records

Features:

  • SQL safety validation via v2_sql_safety
  • Error Handling: SQLSafetyError, SQLAlchemyError
  • Minimum sample size for evaluation
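
The evaluation steps above can be sketched against SQLite. The example assumes that a rule's sql_expression is a single, already validated SELECT returning message ids, and that the table is called `messages`; both are assumptions for illustration, as are the function name and the `min_sample_size` parameter.

```python
import sqlite3
from typing import Optional


def evaluate_sql_rule(
    conn: sqlite3.Connection, sql_expression: str, min_sample_size: int = 1
) -> Optional[dict]:
    """Run a rule's SELECT as a subquery and count spam/ham hits.

    Assumes sql_expression already passed safety validation and returns
    message ids in its only column.
    """
    inner = sql_expression.strip().rstrip(";")
    hits_total, spam_hits, ham_hits = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(is_spam = 1), 0), "
        "COALESCE(SUM(is_spam = 0), 0) "
        f"FROM messages WHERE id IN ({inner})"
    ).fetchone()
    if hits_total < min_sample_size:
        return None  # too few matches to judge the rule
    return {
        "hits_total": hits_total,
        "spam_hits": spam_hits,
        "ham_hits": ham_hits,
        "precision": spam_hits / hits_total if hits_total else None,
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, text TEXT, is_spam INTEGER)"
)
conn.executemany(
    "INSERT INTO messages (text, is_spam) VALUES (?, ?)",
    [
        ("Buy pills http://x.example", 1),
        ("buy pills now", 1),
        ("where to buy pills?", 0),
        ("hello", 0),
    ],
)
rule_sql = "SELECT id FROM messages WHERE text LIKE '%buy pills%'"
result = evaluate_sql_rule(conn, rule_sql)
too_small = evaluate_sql_rule(conn, rule_sql, min_sample_size=5)
```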

5. PromotionService (app/v2_promotion.py)

Automatic promotion and rollback of rules.

Methods:

  • promote_shadow_rules() - Promotes shadow rules to active
  • monitor_active_rules() - Monitors active rules and deprecates degrading ones
  • export_active_rules(backend_type) - Exports active rules to SQL/ROL formats

AggressivenessProfile:

  • conservative() - min_precision=0.95, max_coverage=0.05, max_ham_hits=5
  • balanced() - min_precision=0.90, max_coverage=0.10, max_ham_hits=10
  • aggressive() - min_precision=0.85, max_coverage=0.20, max_ham_hits=20

Promotion process:

  1. Gets shadow rules
  2. Checks metrics from RuleEvaluation
  3. Compares with AggressivenessProfile thresholds
  4. Promotes if metrics match

6. RuleBackend (app/v2_rule_backend.py)

Export rules to various formats.

Interface:

  • RuleBackend - Abstract interface
  • SqlRuleBackend - SQL export
  • RolRuleBackend - ROL format export
  • create_rule_backend(backend_type) - Factory function

7. LLM Engine (app/v2_llm_engine.py)

LLM integration for pattern discovery.

Interface:

  • PatternMiningEngine - Abstract interface
  • OpenAIPatternMiningEngine - OpenAI implementation
  • create_mining_engine() - Factory function

Usage:

  • Only for offline pattern discovery
  • Not used for real-time classification
  • Optional (can be disabled)

8. Embedding Engine (app/v2_embedding_engine.py)

Embedding engine for semantic mining.

Interface:

  • EmbeddingEngine - Abstract interface
  • OpenAIEmbeddingEngine - OpenAI implementation
  • create_embedding_engine() - Factory function

Usage:

  • Semantic clustering of similar messages
  • Finds semantically similar patterns
  • Used in PatternMiningPipeline when use_semantic=true
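
For intuition about semantic clustering, the sketch below groups precomputed embedding vectors with a greedy cosine-similarity threshold. This is a stand-in technique chosen for brevity, not necessarily the algorithm the pipeline uses; the function names and the 0.9 threshold are hypothetical, and in practice the vectors would come from an embedding engine rather than being written by hand.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(
    vectors: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[List[int]]:
    """Assign each vector to the first cluster whose representative is close.

    The first member of a cluster acts as its representative. Returns
    clusters as lists of indices into `vectors`.
    """
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([i])
    return clusters


# Toy 2-D "embeddings": two near the x axis, two near the y axis.
toy_vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
clusters = greedy_cluster(toy_vectors, threshold=0.9)
```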

9. SQL Safety (app/v2_sql_safety.py)

Validation and sanitization of SQL rules.

Methods:

  • validate_sql_rule() - Validation of SQL rules
  • sanitize_sql_for_evaluation() - Sanitization for execution

Features:

  • Whitelist of tables/columns
  • Protection against SQL injection
  • Check for "match everything" rules
  • Only SELECT queries allowed
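
The checks listed above can be illustrated with a deliberately simple, regex-based validator. This is a sketch under stated assumptions, not the contents of v2_sql_safety: the function names, the table whitelist and the keyword list are hypothetical, `SQLSafetyError` is a local stand-in, and a production validator would more likely use a real SQL parser.

```python
import re

ALLOWED_TABLES = {"messages"}  # hypothetical whitelist
FORBIDDEN_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


class SQLSafetyError(ValueError):
    """Local stand-in for the error type named in this document."""


def check_rule_sql(sql: str) -> str:
    """Return the cleaned statement or raise SQLSafetyError."""
    stmt = sql.strip().rstrip(";").strip()
    # Blank out string literals so keywords inside LIKE patterns are ignored.
    bare = STRING_LITERAL.sub("''", stmt)
    if ";" in bare:
        raise SQLSafetyError("multiple statements are not allowed")
    if not re.match(r"select\b", bare, re.IGNORECASE):
        raise SQLSafetyError("only SELECT queries are allowed")
    if FORBIDDEN_KEYWORDS.search(bare):
        raise SQLSafetyError("forbidden keyword")
    tables = re.findall(r"\b(?:from|join)\s+([A-Za-z_]\w*)", bare, re.IGNORECASE)
    if not tables or any(t.lower() not in ALLOWED_TABLES for t in tables):
        raise SQLSafetyError("table is not whitelisted")
    where = re.search(r"\bwhere\b(.*)", bare, re.IGNORECASE | re.DOTALL)
    if where is None or re.fullmatch(
        r"\s*(1\s*=\s*1|true)\s*", where.group(1), re.IGNORECASE
    ):
        raise SQLSafetyError("rule would match every message")
    return stmt


def is_safe(sql: str) -> bool:
    """Boolean convenience wrapper around check_rule_sql."""
    try:
        check_rule_sql(sql)
    except SQLSafetyError:
        return False
    return True
```

Pattern matching like this only catches the obvious cases; it complements, rather than replaces, running rules with a read-only database role.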

10. SQL LLM Validator (app/v2_sql_llm_validator.py)

LLM validation of SQL rules.

Methods:

  • validate_rule_with_llm() - LLM validation of SQL rules

Features:

  • Check for false positives
  • Risk assessment
  • Optional (if LLM is available)

Data Flow

Typical Workflow

1. Ingestion:
   TASLogIngester.ingest_batch() 
   → MessageRepository.create()
   → Messages in the DB

2. Pattern Mining:
   PatternMiningPipeline.mine_patterns()
   → Pattern extraction (URL, keyword, signature, semantic)
   → LLM pattern discovery (optional)
   → PatternRepository.create()
   → RuleRepository.create() (status=CANDIDATE)

3. Shadow Evaluation:
   ShadowEvaluationService.evaluate_rule()
   → SQL execution on real messages
   → Metric calculation
   → RuleEvaluationRepository.create()

4. Promotion:
   PromotionService.promote_shadow_rules()
   → Checking metrics against AggressivenessProfile
   → RuleLifecycleService.transition() (SHADOW → ACTIVE)
   → Export via RuleBackend

Extension Points

1. Rule Backend

Implement RuleBackend interface for custom export formats:

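from typing import List

from app.v2_rule_backend import RuleBackend

class CustomRuleBackend(RuleBackend):
    async def export_rules(self, rules: List["Rule"]) -> str:
        # "Rule" is the model described under Data Models; it is quoted here
        # because its import path is not shown on this page.
        # Your implementation
        raise NotImplementedError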
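
As a concrete but hypothetical example of such a backend, the sketch below exports active rules as JSON. To stay self-contained it defines minimal stand-ins for `Rule` and `RuleBackend` instead of importing the real ones, and the JSON layout is an assumption chosen for illustration.

```python
import asyncio
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Stand-in with only the fields this example needs."""
    id: int
    sql_expression: str
    status: str


class RuleBackend(ABC):
    """Stand-in for the abstract interface in app.v2_rule_backend."""

    @abstractmethod
    async def export_rules(self, rules: List[Rule]) -> str:
        ...


class JsonRuleBackend(RuleBackend):
    async def export_rules(self, rules: List[Rule]) -> str:
        # Only active rules are exported; other statuses are skipped.
        active = [r for r in rules if r.status == "active"]
        payload = [{"id": r.id, "sql": r.sql_expression} for r in active]
        return json.dumps(payload, indent=2)


rules = [
    Rule(1, "SELECT id FROM messages WHERE text LIKE '%free crypto%'", "active"),
    Rule(2, "SELECT id FROM messages WHERE text LIKE '%prize%'", "shadow"),
]
exported = asyncio.run(JsonRuleBackend().export_rules(rules))
```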

2. LLM Engine

Implement PatternMiningEngine for custom LLM providers:

from typing import Dict, List

from app.v2_llm_engine import PatternMiningEngine

class CustomLLMEngine(PatternMiningEngine):
    async def discover_patterns(self, signals: Dict, examples: List[str]) -> Dict:
        # Your implementation
        raise NotImplementedError

3. Embedding Engine

Implement EmbeddingEngine for custom embedding providers:

from typing import List

from app.v2_embedding_engine import EmbeddingEngine

class CustomEmbeddingEngine(EmbeddingEngine):
    async def embed_texts(self, texts: List[str]) -> List[List[float]]:
        # Your implementation
        raise NotImplementedError

Security

SQL Safety

  • Whitelist of tables/columns
  • Only SELECT queries
  • Protection against SQL injection
  • Validation before execution

Privacy

  • Privacy modes: STANDARD / STRICT
  • PII redaction
  • Logs do not store full texts in STRICT mode
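
The effect of the privacy modes on logging can be sketched as below. The regular expressions, the placeholder tokens and the `log_preview` helper are all assumptions for illustration; the page does not describe how PII redaction is actually implemented, and real redaction usually needs broader patterns than these two.

```python
import re

# Deliberately simple patterns; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like sequences with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def log_preview(text: str, strict: bool, max_len: int = 40) -> str:
    """Return what would be written to logs for a message.

    In strict mode no message text is logged at all, only its length;
    otherwise a redacted, truncated preview is returned.
    """
    if strict:
        return f"<redacted len={len(text)}>"
    return redact_pii(text)[:max_len]


message = "Call +1 (555) 123-4567 or mail bob@example.com"
standard_line = log_preview(message, strict=False)
strict_line = log_preview(message, strict=True)
```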

Data Access

  • Idempotency via external_id
  • No hardcoded secrets
  • All settings via environment variables

Additional Resources