A comprehensive active learning benchmark framework specifically designed for materials science regression tasks. This project implements 20+ active learning strategies across multiple materials science datasets, providing a standardized evaluation platform for active learning research in materials discovery.
- RandomSearch: Random sampling baseline
- GaussianProcessBased: GP-based uncertainty sampling
- GSBAG: Gaussian Process BAG method
- TreeBasedRegressor_Diversity: Tree-based diversity sampling
- TreeBasedRegressor_Representativity: Tree-based representativeness sampling
- TreeBasedRegressor_*_self: Adaptive tree methods
- QueryByCommittee: Committee disagreement query
- RD_QBC_ALR: Dimensionality reduction committee query
- LearningLoss: Learning loss prediction
- EGAL: Expected Gradient Length
- mcdropout: Monte Carlo Dropout
- BMDAL: Bayesian Batch Mode Active Learning
- GSi/GSx/GSy: Geometric sampling variants
- QDD: Query Density Diversity
- Basic_RD_ALR: Basic dimensionality reduction active learning
- RD_GS_ALR: Dimensionality reduction geometric sampling
- RD_EMCM_ALR: Dimensionality reduction expected model change maximization
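To illustrate the committee-disagreement idea behind QueryByCommittee, the sketch below ranks unlabeled samples by the variance of the committee's predictions and returns the most contested ones. It is a generic, standard-library-only example and not the repository's implementation; the function name `select_by_disagreement` and the input layout are assumptions made for illustration.

```python
from statistics import pvariance


def select_by_disagreement(committee_predictions, n_query=1):
    """Return indices of the n_query samples with the highest prediction variance.

    committee_predictions: one list of predictions per committee member,
    all of equal length (one entry per unlabeled sample). Hypothetical layout.
    """
    n_samples = len(committee_predictions[0])
    # Disagreement for a sample = variance of the members' predictions for it.
    variances = [
        pvariance([member[i] for member in committee_predictions])
        for i in range(n_samples)
    ]
    ranked = sorted(range(n_samples), key=lambda i: variances[i], reverse=True)
    return ranked[:n_query]


# Three committee members predicting four unlabeled samples.
preds = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 5.0, 4.2],
    [0.9, 2.0, 1.0, 3.8],
]
print(select_by_disagreement(preds, n_query=2))  # [2, 3]
```

Sample 2 is picked first because the members disagree most on it (predictions 3.0, 5.0 and 1.0), while sample 1, on which all members agree, is ranked last.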
Currently, the project includes the uci-concrete dataset:
| Dataset | Description | Target Property |
|---|---|---|
| Concrete Materials | | |
| uci-concrete | UCI concrete strength data | Compressive strength |
# Clone the repository
git clone https://github.com/bjhtud/Benchmark-AL-Mat.git
cd Benchmark-AL-Mat
# Install dependencies
pip install -r requirements.txt

# Navigate to source code directory
cd src
# Run a single experiment (using random search strategy and uci-concrete dataset)
python main.py --random-state 42 --strategy RandomSearch --dataset uci-concrete
# Use different initialization method
python main.py --random-state 42 --strategy GSBAG --dataset uci-concrete --initial-method kmeans
# Custom query parameters
python main.py --random-state 42 --strategy RD_EMCM_ALR --dataset uci-concrete --n-pro-query 20

- Create configuration file `config.json`:

{
  "random_state": 42,
  "initial_method": "random",
  "strategy": "RD_EMCM_ALR",
  "dataset": "uci-concrete",
  "n_pro_query": 15,
  "threshold": 0.9
}

- Run experiment:
python main.py --config-file config.json

| Parameter | Type | Description | Default |
|---|---|---|---|
| `--random-state` | `int` | Random seed (required) | - |
| `--strategy` | `str` | Strategy name | - |
| `--dataset` | `str` | Dataset name | - |
| `--initial-method` | `str` | Initial sampling method | 'random' |
| `--n-pro-query` | `int` | Number of samples per query | 10 |
| `--threshold` | `float` | Early stopping threshold | 0.85 |
| `--config-file` | `str` | JSON configuration file path | None |
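The following is a minimal sketch of how the documented flags and a JSON configuration file could be combined into one settings dictionary. It mirrors the parameter names and defaults from the table, but it is not taken from `main.py`: the helper names `build_parser` and `resolve_settings`, and the choice that values from the configuration file override command-line values, are assumptions.

```python
import argparse
import json


def build_parser():
    """Parser mirroring the documented flags and defaults (a sketch)."""
    parser = argparse.ArgumentParser(description="Active learning benchmark (sketch)")
    parser.add_argument("--random-state", type=int)
    parser.add_argument("--strategy", type=str)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--initial-method", type=str, default="random")
    parser.add_argument("--n-pro-query", type=int, default=10)
    parser.add_argument("--threshold", type=float, default=0.85)
    parser.add_argument("--config-file", type=str, default=None)
    return parser


def resolve_settings(argv, config_text=None):
    """Merge CLI arguments with an optional JSON config.

    Assumption: keys found in the config override the CLI values.
    config_text lets callers pass JSON directly instead of a file path.
    """
    settings = vars(build_parser().parse_args(argv))
    config_path = settings.pop("config_file")
    if config_text is None and config_path is not None:
        with open(config_path, encoding="utf-8") as fh:
            config_text = fh.read()
    if config_text is not None:
        settings.update(json.loads(config_text))
    # The table marks only the random seed as required.
    if settings.get("random_state") is None:
        raise ValueError("random_state is required")
    return settings


cli_only = resolve_settings(
    ["--random-state", "42", "--strategy", "RandomSearch", "--dataset", "uci-concrete"]
)
print(cli_only["n_pro_query"], cli_only["threshold"])  # 10 0.85

from_config = resolve_settings(
    [], config_text='{"random_state": 42, "strategy": "RD_EMCM_ALR", "n_pro_query": 15}'
)
print(from_config["n_pro_query"])  # 15
```

Any key omitted from the configuration file keeps its command-line value or documented default.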
- random: Random initialization
- greedy_search: Greedy search initialization
- kmeans: K-means clustering initialization
- ncc: Nearest centroid classifier initialization
Benchmark-AL-Mat/
├── src/                          # Source code
│   ├── main.py                   # Main experiment script
│   ├── strategies/               # Active learning strategies
│   │   ├── __init__.py
│   │   ├── randomsearch.py       # Random search baseline
│   │   ├── gaussianprocess.py    # Gaussian process methods
│   │   ├── qbc.py                # Query by committee
│   │   ├── LL4AL.py              # Learning loss
│   │   ├── egal.py               # Expected gradient length
│   │   ├── bmdal.py              # Bayesian methods
│   │   └── ...                   # Other strategies
│   ├── utils/                    # Utility modules
│   │   ├── active_learner.py     # Active learning framework
│   │   ├── dataset_process.py    # Data processing
│   │   ├── initialize.py         # Initialization methods
│   │   └── utils_initialize/     # Initialization utilities
│   └── bmdal_reg/                # BMDAL implementation
├── dataset/                      # Datasets
│   ├── meta.csv                  # Dataset metadata
│   └── uci-concrete/             # UCI concrete data
├── requirements.txt              # Dependencies
├── config_example.json           # Configuration example
└── README.md                     # This file
Experiment results are saved in the following structure:
result/
└── {n_pro_query}/                  # Grouped by number of samples per query
    └── {random_state}/             # Grouped by random seed
        └── {initial_method}/       # Grouped by initialization method
            ├── {strategy}_{dataset}_{timestamp}.json
            └── time_record/
                └── {strategy}_{dataset}_{timestamp}.json
- Main results: R² score sequences, query indices, model performance changes
- Time records: Per-step query timing, overall runtime analysis
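The result layout above can be expressed as a small path helper. This is a sketch under assumptions: the function name `result_paths` is hypothetical, the `time_record/` directory is taken to sit next to the main result file inside the `{initial_method}` folder, and the timestamp format is a guess, since the source does not specify it.

```python
from datetime import datetime
from pathlib import Path


def result_paths(n_pro_query, random_state, initial_method,
                 strategy, dataset, root="result", timestamp=None):
    """Return (main_result_path, time_record_path) for one run."""
    if timestamp is None:
        # Assumed format; the actual format used by the project is not documented here.
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(root) / str(n_pro_query) / str(random_state) / initial_method
    filename = f"{strategy}_{dataset}_{timestamp}.json"
    return run_dir / filename, run_dir / "time_record" / filename


main_path, timing_path = result_paths(
    10, 42, "random", "RandomSearch", "uci-concrete", timestamp="20250101_120000"
)
print(main_path.as_posix())
# result/10/42/random/RandomSearch_uci-concrete_20250101_120000.json
print(timing_path.as_posix())
# result/10/42/random/time_record/RandomSearch_uci-concrete_20250101_120000.json
```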
The meta.csv file defines configuration information for datasets and should contain the following columns:
- dataname: Dataset name to reference in command line or config file
- path: Relative path to the dataset CSV file
- target_columns: Target columns to be excluded from features (semicolon-separated if multiple)
- target_to_fit: Target column(s) to be predicted (semicolon-separated if multiple)
Example (for uci-concrete dataset):
dataname,path,target_columns,target_to_fit
uci-concrete,../dataset/uci-concrete/concrete_data.csv,concrete_compressive_strength,concrete_compressive_strength

To add a new dataset, add a new row to meta.csv and ensure the corresponding data files are placed in the dataset directory.
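As a rough sketch of how the four meta.csv columns might be consumed, the example below parses the file with the standard `csv` module and splits the semicolon-separated fields. The helpers `load_meta` and `feature_columns` are hypothetical and are not the project's `dataset_process.py`; the feature names in the usage lines are made up for illustration.

```python
import csv
import io


def load_meta(text):
    """Parse meta.csv content into {dataname: settings} (a sketch)."""
    datasets = {}
    for row in csv.DictReader(io.StringIO(text)):
        datasets[row["dataname"]] = {
            "path": row["path"],
            # Multiple columns are separated by semicolons.
            "target_columns": [c for c in row["target_columns"].split(";") if c],
            "target_to_fit": [c for c in row["target_to_fit"].split(";") if c],
        }
    return datasets


def feature_columns(all_columns, entry):
    """Every column not listed in target_columns is treated as a feature."""
    excluded = set(entry["target_columns"])
    return [c for c in all_columns if c not in excluded]


meta_text = (
    "dataname,path,target_columns,target_to_fit\n"
    "uci-concrete,../dataset/uci-concrete/concrete_data.csv,"
    "concrete_compressive_strength,concrete_compressive_strength\n"
)
meta = load_meta(meta_text)
entry = meta["uci-concrete"]
print(entry["target_to_fit"])  # ['concrete_compressive_strength']

# Illustrative column names only; the real CSV header may differ.
columns = ["cement", "water", "age", "concrete_compressive_strength"]
print(feature_columns(columns, entry))  # ['cement', 'water', 'age']
```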
- Create new strategy file in `src/strategies/`
- Implement strategy class with `query()` method
- Import in `src/strategies/__init__.py`
- Add to strategy registry in `main.py`
Example strategy template:
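import numpy as np

class NewStrategy:
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):
        # Train the underlying model on the labeled data
        pass

    def query(self, X_unlabeled, n_act=1):
        # Select the most informative samples.
        # Placeholder: random selection; replace with the strategy's own scoring.
        rng = np.random.default_rng(self.random_state)
        selected_indices = rng.choice(len(X_unlabeled), size=n_act, replace=False)
        return selected_indices

- Place data files in appropriate `dataset/` directory
- Add dataset information to `dataset/meta.csv`
- Ensure data format compatibility with existing datasets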
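To show how a strategy exposing the `fit()`/`query()` interface from the template might be driven, here is a minimal pool-based loop written with the standard library only. It is an illustration, not the project's `active_learner.py`: `RandomTemplateStrategy`, `run_pool_loop`, and the `n_initial` parameter are hypothetical, and the initial set is drawn at random.

```python
import random


class RandomTemplateStrategy:
    """Hypothetical strategy following the fit()/query() template."""

    def __init__(self, random_state=None):
        self.rng = random.Random(random_state)

    def fit(self, X, y):
        # A real strategy would train its model here.
        pass

    def query(self, X_unlabeled, n_act=1):
        # Return positions within X_unlabeled, never more than are available.
        n_act = min(n_act, len(X_unlabeled))
        return self.rng.sample(range(len(X_unlabeled)), n_act)


def run_pool_loop(X, y, strategy, n_initial=5, n_pro_query=10, seed=0):
    """Label the pool batch by batch until no unlabeled samples remain."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    labeled, unlabeled = order[:n_initial], order[n_initial:]
    batches = [list(labeled)]  # first entry is the initial labeled set
    while unlabeled:
        strategy.fit([X[i] for i in labeled], [y[i] for i in labeled])
        picks = strategy.query([X[i] for i in unlabeled], n_act=n_pro_query)
        chosen = [unlabeled[p] for p in picks]  # map positions to dataset indices
        labeled.extend(chosen)
        chosen_set = set(chosen)
        unlabeled = [i for i in unlabeled if i not in chosen_set]
        batches.append(chosen)
    return labeled, batches


X = [[float(i)] for i in range(33)]
y = [2.0 * row[0] for row in X]
labeled, batches = run_pool_loop(
    X, y, RandomTemplateStrategy(random_state=42), n_initial=3, n_pro_query=10
)
print([len(b) for b in batches])  # [3, 10, 10, 10]
```

Note that `query()` returns positions relative to the unlabeled pool it was given, so the loop maps them back to dataset indices before updating the labeled set.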
- Baseline comparison: First run RandomSearch as baseline
- Strategy evaluation: Systematically test different strategy performances
- Parameter tuning: Adjust n_pro_query and threshold
- Multiple runs: Use different random seeds to ensure result reliability
- Small datasets: n_pro_query=5-10
- Large datasets: n_pro_query=20-50
- GPU acceleration: Deep learning strategies automatically use GPU (if available)
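The advice on multiple runs and on tuning `threshold` can be combined in a small summary step: average the R² curves from several seeds and find the first query step at which the mean reaches the threshold. The sketch below works on plain Python lists, since the exact schema of the result JSON files is not described here. The function `summarize_runs` is hypothetical, the curve values are made-up illustrative numbers rather than benchmark results, and comparing the threshold against R² is an assumption about how early stopping is meant.

```python
from statistics import mean, stdev


def summarize_runs(r2_by_seed, threshold=0.85):
    """Return (mean_curve, std_curve, first_step_reaching_threshold).

    r2_by_seed: {seed: [R² after each query step]}. Curves are truncated
    to the shortest run so every step averages over all seeds.
    """
    curves = list(r2_by_seed.values())
    n_steps = min(len(c) for c in curves)
    mean_curve = [mean([c[s] for c in curves]) for s in range(n_steps)]
    std_curve = [
        stdev([c[s] for c in curves]) if len(curves) > 1 else 0.0
        for s in range(n_steps)
    ]
    # First step whose mean R² meets the threshold, or None if never reached.
    first_hit = next((s for s, v in enumerate(mean_curve) if v >= threshold), None)
    return mean_curve, std_curve, first_hit


# Made-up curves for three seeds, for illustration only.
runs = {
    0: [0.5, 0.8, 0.9],
    1: [0.6, 0.8, 0.9],
    2: [0.4, 0.8, 0.9],
}
mean_curve, std_curve, first_hit = summarize_runs(runs, threshold=0.85)
print(first_hit)  # 2
```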
- numpy>=1.21.0: Numerical computation
- pandas>=1.3.0: Data processing
- scikit-learn>=1.0.0: Machine learning algorithms
- torch>=1.9.0: Deep learning framework
- scipy>=1.7.0: Gaussian process kernels
- matplotlib>=3.4.0: Result visualization
- seaborn>=0.11.0: Statistical plots
- xgboost: XGBoost regressor support
If you use this framework in your research, please cite:
@misc{benchmark-al-mat-2024,
title={A Comprehensive Benchmark of Active Learning Strategies with AutoML for Small-Sample Regression in Materials Science},
author={Jinghou Bi},
email={jinghou.bi@tu-dresden.de},
year={2025},
url={https://github.com/bjhtud/Benchmark-AL-Mat}
}

This project is licensed under the MIT License - see the LICENSE file for details.
Please report issues on the GitHub Issues page.
- Author: Jinghou Bi
- Email: jinghou.bi@tu-dresden.de
- Institution: TU Dresden
Thanks to the materials science community for providing datasets, and to open source libraries like scikit-learn and PyTorch for their support.