Implement AI-based automatic identification of sensitive columns in databases #4495

Merged
fenyf merged 10 commits into oceanbase:feat/summer-ospp from fenyf:ai-sensitive on Sep 29, 2025

fenyf (Collaborator) commented Sep 17, 2025

What type of PR is this?

type-feature
module-datasecurity

What this PR does / why we need it:

This PR introduces an intelligent, AI-based feature for automatically identifying sensitive columns in databases, addressing the limitations of the current manual, rule-based system. It aims to significantly improve both the accuracy and efficiency of data sensitivity discovery.

Implementation Overview

  • Scanner Refactoring: The core sensitive column scanner has been refactored using the Strategy Pattern to cleanly support multiple scanning modes.
  • Multiple Scan Modes:
    • Rule-Only: The classic behavior, using only pre-defined regex/keyword rules.
    • AI-Enhanced: A hybrid mode that combines rule-based matching with AI inference for superior accuracy.
    • Passive Scanning: The system now automatically triggers scans in the background when a user views a table's structure or executes a SELECT query, providing proactive data security insights.
  • AI Model Integration: Integrates a pre-trained language model to analyze column semantics and identify 13 default sensitive data types (e.g., Name, Phone, ID Card).
  • Configuration Management: A new AI configuration module has been added. The feature can be enabled and configured via environment variables (ODC_APP_EXTRA_ARGS), including the API key, endpoint, and model name. The UI will reflect the service's availability based on these settings.
  • Multi-Level Caching: A client-side caching mechanism has been implemented to enhance performance and reduce redundant API calls. It includes a 5-minute in-memory cache and a 24-hour localStorage cache.
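
To make the two cache tiers concrete, here is a minimal sketch of a read-through lookup with a 5-minute in-memory tier in front of a 24-hour persistent tier. It is not the code in this PR: `TwoTierCache`, `KeyValueStore`, `MemoryStore`, and the injectable `now` clock are hypothetical names introduced for illustration, and promoting a storage hit back into memory is an assumed design choice. In a browser, `window.localStorage` satisfies the `KeyValueStore` shape.

```typescript
// Minimal subset of the Web Storage API; window.localStorage fits this shape.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// In-memory stand-in (hypothetical) so the sketch also runs outside a browser.
class MemoryStore implements KeyValueStore {
  private readonly items = new Map<string, string>();
  getItem(key: string): string | null {
    const value = this.items.get(key);
    return value === undefined ? null : value;
  }
  setItem(key: string, value: string): void {
    this.items.set(key, value);
  }
  removeItem(key: string): void {
    this.items.delete(key);
  }
}

const MEMORY_TTL_MS = 5 * 60 * 1000; // 5-minute in-memory tier
const STORAGE_TTL_MS = 24 * 60 * 60 * 1000; // 24-hour persistent tier

interface StoredEntry<T> { value: T; storedAt: number; }
interface MemoryEntry<T> { value: T; expiresAt: number; }

class TwoTierCache<T> {
  private readonly memory = new Map<string, MemoryEntry<T>>();
  private readonly storage: KeyValueStore;
  private readonly now: () => number;

  constructor(storage: KeyValueStore, now: () => number = () => Date.now()) {
    this.storage = storage;
    this.now = now;
  }

  // Write through to both tiers.
  set(key: string, value: T): void {
    const t = this.now();
    this.memory.set(key, { value, expiresAt: t + MEMORY_TTL_MS });
    const entry: StoredEntry<T> = { value, storedAt: t };
    this.storage.setItem(key, JSON.stringify(entry));
  }

  // Memory first, then persistent storage; undefined means the caller must rescan.
  get(key: string): T | undefined {
    const t = this.now();
    const hot = this.memory.get(key);
    if (hot !== undefined && t < hot.expiresAt) return hot.value;
    this.memory.delete(key);

    const raw = this.storage.getItem(key);
    if (raw === null) return undefined;
    let entry: StoredEntry<T>;
    try {
      entry = JSON.parse(raw) as StoredEntry<T>;
    } catch {
      this.storage.removeItem(key); // unparsable payload: treat as a miss
      return undefined;
    }
    const storageExpiresAt = entry.storedAt + STORAGE_TTL_MS;
    if (t >= storageExpiresAt) {
      this.storage.removeItem(key);
      return undefined;
    }
    // Promote to memory (assumed behavior), never outliving the 24-hour limit.
    const expiresAt = Math.min(t + MEMORY_TTL_MS, storageExpiresAt);
    this.memory.set(key, { value: entry.value, expiresAt });
    return entry.value;
  }
}
```

A caller would try `cache.get(tableKey)` before invoking the scan API and call `cache.set(tableKey, result)` after a successful scan. A new `TwoTierCache` built over the same store after a page refresh still finds the persisted entry.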

Which issue(s) this PR fixes:

Fixes #4489

Special notes for your reviewer:

Testing Suggestions

Please pay special attention to the following key areas as detailed in the test plan:

  • 1. AI Configuration & Availability:

    • Verify that the "AI-Enhanced" mode in the UI is enabled only when the correct AI service parameters are provided in ODC_APP_EXTRA_ARGS.
    • Confirm it is disabled/grayed out with incorrect or missing parameters.
  • 2. Passive Scanning Triggers:

    • On View Table: When viewing a table's structure, confirm the UI shows a "Scanning..." indicator, which is then replaced by the scan result ("X sensitive columns found" or "No sensitive columns found").
    • On SQL Execute: When running a SELECT * FROM ... query, verify the same "Scanning..." indicator appears above the results, followed by the outcome.
  • 3. Caching Behavior:

    • Use browser developer tools to monitor network activity.
    • First Visit (Cache Miss): Confirm an API call to the AI scan service is made.
    • Within 5 Mins (In-Memory Cache): Navigate away and back to the table view. The result should load instantly with no new API call.
    • After 5 Mins / Page Refresh (LocalStorage Cache): Refresh the page or close and reopen the tab. The result should still load instantly with no new API call.
  • 4. Scan Mode Logic:

    • Test the "Rule-Only" mode to ensure it identifies data using the existing rules and ignores columns that match no rule.
    • Test the "AI-Enhanced" mode to confirm it correctly identifies a mix of sensitive types (e.g., Name, Phone, Email) that may not be covered by simple rules.
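
For reviewers checking the scan mode logic, the sketch below shows one way the two modes could sit behind a common strategy interface. It is an illustration under assumptions, not the refactored scanner itself: `ScanStrategy`, `RuleOnlyStrategy`, `AiEnhancedStrategy`, `createStrategy`, and the `AiClassifier` callback are hypothetical names. It also assumes that rule matches take priority and that only unmatched columns are sent to the AI service.

```typescript
interface Column { table: string; name: string; }
interface Finding { column: Column; sensitiveType: string; source: "rule" | "ai"; }
// Patterns should not use the global flag, since RegExp.test is stateful with it.
interface Rule { sensitiveType: string; pattern: RegExp; }

// Hypothetical AI client: resolves to a map from "table.column" to a sensitive type.
type AiClassifier = (columns: Column[]) => Promise<Map<string, string>>;

interface ScanStrategy {
  scan(columns: Column[]): Promise<Finding[]>;
}

const keyOf = (c: Column): string => `${c.table}.${c.name}`;

// Rule-Only: the first rule whose pattern matches the column name wins.
class RuleOnlyStrategy implements ScanStrategy {
  private readonly rules: Rule[];
  constructor(rules: Rule[]) {
    this.rules = rules;
  }

  async scan(columns: Column[]): Promise<Finding[]> {
    const findings: Finding[] = [];
    for (const column of columns) {
      const rule = this.rules.find((r) => r.pattern.test(column.name));
      if (rule !== undefined) {
        findings.push({ column, sensitiveType: rule.sensitiveType, source: "rule" });
      }
    }
    return findings;
  }
}

// AI-Enhanced: rules run first (assumed priority); the AI sees only the rest.
class AiEnhancedStrategy implements ScanStrategy {
  private readonly ruleStrategy: RuleOnlyStrategy;
  private readonly classify: AiClassifier;
  constructor(rules: Rule[], classify: AiClassifier) {
    this.ruleStrategy = new RuleOnlyStrategy(rules);
    this.classify = classify;
  }

  async scan(columns: Column[]): Promise<Finding[]> {
    const ruleFindings = await this.ruleStrategy.scan(columns);
    const matched = new Set(ruleFindings.map((f) => keyOf(f.column)));
    const remaining = columns.filter((c) => !matched.has(keyOf(c)));
    if (remaining.length === 0) return ruleFindings;

    const labels = await this.classify(remaining);
    const aiFindings: Finding[] = [];
    for (const column of remaining) {
      const label = labels.get(keyOf(column));
      if (label !== undefined) {
        aiFindings.push({ column, sensitiveType: label, source: "ai" });
      }
    }
    return [...ruleFindings, ...aiFindings];
  }
}

type ScanMode = "RULE_ONLY" | "AI_ENHANCED";

// The caller picks a mode; the scanner only depends on ScanStrategy.
function createStrategy(mode: ScanMode, rules: Rule[], classify: AiClassifier): ScanStrategy {
  switch (mode) {
    case "RULE_ONLY":
      return new RuleOnlyStrategy(rules);
    case "AI_ENHANCED":
      return new AiEnhancedStrategy(rules, classify);
  }
}
```

Under these assumptions, a column such as `full_nm` that matches no rule is skipped in Rule-Only mode but can still be labeled in AI-Enhanced mode, which is the difference the two checks above are meant to expose.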

feat(ai_recognition): Implement rule priority configuration using the strategy pattern

feat(ai_recognition): Implement the process of sending sensitive columns in batches to the AI

feat(ai_recognition): Implement the process of sending sensitive columns in batches to the AI

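feat(ai_recognition): Implement the process of sending sensitive columns in batches to the AI (completely restructured version)

fix(ai_recognition): Add the AI-related fields missing from the query API

fix(ai_recognition): Remove redundant logic from the AI recognizer

feature(ai_recognition): Add multithreading and asynchronous execution to scan tasks

feature(ai_recognition): Remove the confidence threshold feature

feature(ai_recognition): Refine assigning a default algorithm to AI recognition results

feature(ai_recognition): Support interrupting a scan

feature(ai_recognition): Optimize the prompt assembly logic

feature(ai_recognition): Optimize the AI invocation code

feature(ai_recognition): Remove redundant recognition modes

feature(ai_recognition): Optimize the prompt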
…or passive scanning and refactor the code in the sole AI mode
feature(ai_recognition): Integration tests

fix(ai_recognition): Fix AI status query bug

CLAassistant commented Sep 17, 2025

CLA assistant check
All committers have signed the CLA.

fix(ai_recognition): Code formatting and integration tests

fenyf changed the base branch from main to feat/summer-ospp September 29, 2025 02:47
fenyf merged commit 09d12f3 into oceanbase:feat/summer-ospp Sep 29, 2025
4 checks passed