# Clairvoyance
Revision of the Clairvoyance AutoML method from Espinoza, Dupont, et al. (2021). The updated version adds regression support, support for all linear/tree-based models, feature selection through modified Feature-Engine classes, and Bayesian optimization using Optuna. Clairvoyance also has built-in (optional) functionality to natively address the compositionality of data such as next-generation sequencing count tables from genomics/transcriptomics.
Clairvoyance is currently under active development, and the API is subject to change.
import clairvoyance as cy
# Stable:
# via PyPI
pip install clairvoyance_feature_selection
# Developmental:
pip install git+https://github.com/jolespin/clairvoyance
Espinoza JL, Dupont CL, O'Rourke A, Beyhan S, Morales P, Spoering A, et al. (2021) Predicting antimicrobial mechanism-of-action from transcriptomes: A generalizable explainable artificial intelligence approach. PLoS Comput Biol 17(3): e1008857. https://doi.org/10.1371/journal.pcbi.1008857
Clairvoyance is currently under active development and is undergoing a complete reimplementation from the ground up relative to the original publication. New features include:
- Bayesian optimization using Optuna
- Supports any linear or tree-based Scikit-Learn-compatible estimator
- Supports any Scikit-Learn-compatible performance metric
- Supports regression (in addition to classification as in the original implementation)
- Properly implements transformations for compositional data (e.g., CLR and closure) based on the query features for each iteration (see the sketch after this list)
- Option to remove zero-weighted features during model refitting
- [Pending] Visualizations for AutoML
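Closure and CLR are standard compositional-data transformations; the snippet below is a minimal, library-agnostic sketch of what they do. It is illustrative only: the `closure` and `clr` functions here are assumptions for the example and are not Clairvoyance's internal API, which applies the transformations to the query features at each iteration.

```python
import numpy as np
import pandas as pd

def closure(X: pd.DataFrame) -> pd.DataFrame:
    # Scale each sample (row) so its features sum to 1 (relative abundances)
    return X.div(X.sum(axis=1), axis=0)

def clr(X: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    # Centered log-ratio: log of each value relative to the row's geometric mean
    X = X + pseudocount  # pseudocount avoids log(0) in sparse count tables
    log_X = np.log(X)
    return log_X.sub(log_X.mean(axis=1), axis=0)

# Toy counts table (e.g., a sequencing counts matrix with samples as rows)
counts = pd.DataFrame([[10, 0, 5], [3, 7, 2]], columns=["feature_A", "feature_B", "feature_C"])
print(closure(counts))
print(clr(counts))
```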
Here's a simple usage example for the iris dataset with 996 noise features added (1,000 features total).
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from clairvoyance.bayesian import BayesianClairvoyanceClassification
# Load iris dataset
X, y = load_iris(return_X_y=True, as_frame=True)
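# Simplify column names (e.g., "sepal length (cm)" -> "sepal_length")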
X.columns = X.columns.map(lambda j: j.split(" (cm")[0].replace(" ","_"))
# Relabel targets
target_names = load_iris().target_names
y = y.map(lambda i: target_names[i])
# Add 996 noise features (total = 1000 features) in the same range of values as the original features
number_of_noise_features = 996
vmin = X.values.ravel().min()
vmax = X.values.ravel().max()
X_noise = pd.DataFrame(
    data=np.random.RandomState(0).randint(low=int(vmin*10), high=int(vmax*10), size=(150, number_of_noise_features))/10,
    columns=map(lambda j: "noise_{}".format(j+1), range(number_of_noise_features)),
)
X_iris_with_noise = pd.concat([X, X_noise], axis=1)
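# Hold out 30% of the samples for testing (stratified by class label)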
X_training, X_testing, y_training, y_testing = train_test_split(X_iris_with_noise, y, stratify=y, random_state=0, test_size=0.3)
# Specify model algorithm and parameter grid
estimator = LogisticRegression(max_iter=1000, solver="liblinear")
# Parameter space format: ["float"|"int", low, high] or ["categorical", [choices]]
param_space = {
    "C": ["float", 0.0, 1.0],
    "penalty": ["categorical", ["l1", "l2"]],
}
# Fit the AutoML model
model = BayesianClairvoyanceClassification(estimator, param_space, n_iter=4, n_trials=50, feature_selection_method="addition", n_jobs=-1, verbose=0, feature_selection_performance_threshold=0.025)
df_results = model.fit_transform(X_training, y_training, cv=3, optimize_with_training_and_testing=True, X_testing=X_testing, y_testing=y_testing)
[I 2024-07-05 12:14:33,611] A new study created in memory with name: n_iter=1
[I 2024-07-05 12:14:33,680] Trial 0 finished with values: [0.7238095238095238, 0.7333333333333333] and parameters: {'C': 0.417022004702574, 'penalty': 'l1'}.
[I 2024-07-05 12:14:33,866] Trial 1 finished with values: [0.7238095238095239, 0.7333333333333333] and parameters: {'C': 0.30233257263183977, 'penalty': 'l1'}.
[I 2024-07-05 12:14:34,060] Trial 2 finished with values: [0.39999999999999997, 0
...
Recursive feature addition: 100%|██████████| 5/5 [00:00<00:00, 170.02it/s]
Synopsis[n_iter=2] Input Features: 6, Selected Features: 1
Initial Training Score: 0.9047619047619048, Feature Selected Training Score: 0.8761904761904762
Initial Testing Score: 0.7777777777777778, Feature Selected Testing Score: 0.9333333333333333

We were able to filter out all of the noise features and keep only the most informative features, but linear models might not be the best choice for this classification task.
| study_name | best_hyperparameters | best_estimator | best_trial | number_of_initial_features | initial_training_score | initial_testing_score | number_of_selected_features | feature_selected_training_score | feature_selected_testing_score | selected_features |
|---|---|---|---|---|---|---|---|---|---|---|
| n_iter=1 | {'C': 0.0745664572902166, 'penalty': 'l1'} | LogisticRegression(C=0.0745664572902166, max_iter=1000, penalty='l1', | FrozenTrial(number=28, state=TrialState.COMPLETE, values=[0.7904761904761904, 0.7333333333333333], datetime_start=datetime.datetime(2024, 7, 6, 15, 53, 9, 422777), datetime_complete=datetime.datetime(2024, 7, 6, 15, 53, 9, 491422), params={'C': 0.0745664572902166, 'penalty': 'l1'}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'C': FloatDistribution(high=1.0, log=False, low=0.0, step=None), 'penalty': CategoricalDistribution(choices=('l1', 'l2'))}, trial_id=28, value=None) | 1000 | 0.790476 | 0.733333 | 6 | 0.904762 | 0.733333 | ['petal_length', 'noise_25', 'noise_833', 'noise_48', 'noise_653', 'noise_793'] |
| n_iter=2 | {'C': 0.9875411040455084, 'penalty': 'l1'} | LogisticRegression(C=0.9875411040455084, max_iter=1000, penalty='l1', | FrozenTrial(number=11, state=TrialState.COMPLETE, values=[0.9047619047619048, 0.7777777777777778], datetime_start=datetime.datetime(2024, 7, 6, 15, 53, 33, 987822), datetime_complete=datetime.datetime(2024, 7, 6, 15, 53, 34, 12108), params={'C': 0.9875411040455084, 'penalty': 'l1'}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'C': FloatDistribution(high=1.0, log=False, low=0.0, step=None), 'penalty': CategoricalDistribution(choices=('l1', 'l2'))}, trial_id=11, value=None) | 6 | 0.904762 | 0.777778 | 1 | 0.87619 | 0.933333 | ['petal_length'] |
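As a minimal sketch, assuming the column names shown in the results table above, the selected features from the best-performing iteration can be pulled out of `df_results` and used to subset the data for downstream work:

```python
# A minimal sketch: pick the iteration with the best feature-selected testing score
# (column names are those shown in the results table above)
best_row = df_results.sort_values("feature_selected_testing_score", ascending=False).iloc[0]
selected_features = list(best_row["selected_features"])
print(selected_features)  # e.g., ['petal_length']

# Subset the training/testing data to the selected features for downstream refitting
X_training_selected = X_training[selected_features]
X_testing_selected = X_testing[selected_features]
```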
# Specify DecisionTree model algorithm and parameter grid
from sklearn.tree import DecisionTreeClassifier
estimator = DecisionTreeClassifier(random_state=0)
param_space = {
    "min_samples_leaf": ["int", 1, 50],
    "min_samples_split": ["float", 0.0, 0.5],
    "max_features": ["categorical", ["sqrt", "log2", None]],
}
model = BayesianClairvoyanceClassification(estimator, param_space, n_iter=4, n_trials=10, feature_selection_method="addition", n_jobs=-1, verbose=0, feature_selection_performance_threshold=0.0)
df_results = model.fit_transform(X_training, y_training, cv=3, optimize_with_training_and_testing=True, X_testing=X_testing, y_testing=y_testing)
df_results
[I 2024-07-06 15:48:59,235] A new study created in memory with name: n_iter=1
[I 2024-07-06 15:48:59,313] Trial 0 finished with values: [0.3523809523809524, 0.37777777777777777] and parameters: {'min_samples_leaf': 21, 'min_samples_split': 0.36016224672107905, 'max_features': 'log2'}.
[I 2024-07-06 15:49:00,204] Trial 1 finished with values: [0.9142857142857143, 0.9555555555555556] and parameters: {'min_samples_leaf': 5, 'min_samples_split': 0.09313010568883545, 'max_features': None}.
[I 2024-07-06 15:49:00,774] Trial 2 finished with values: [0.3523809523809524, 0.37777777777777777] and parameters: {'min_samples_leaf': 21, 'min_samples_split': 0.34260975019837975, 'max_features': 'log2'}.
...
/Users/jolespin/miniconda3/envs/soothsayer_env/lib/python3.9/site-packages/clairvoyance/feature_selection.py:632: UserWarning: remove_zero_weighted_features=True and removed 995/1000 features
warnings.warn("remove_zero_weighted_features=True and removed {}/{} features".format((n_features_initial - n_features_after_zero_removal), n_features_initial))
Recursive feature addition: 100%|██████████| 4/4 [00:00<00:00, 164.94it/s]
Synopsis[n_iter=1] Input Features: 1000, Selected Features: 1
Initial Training Score: 0.9142857142857143, Feature Selected Training Score: 0.9619047619047619
Initial Testing Score: 0.9555555555555556, Feature Selected Testing Score: 0.9555555555555556
/Users/jolespin/miniconda3/envs/soothsayer_env/lib/python3.9/site-packages/clairvoyance/bayesian.py:594: UserWarning: Stopping because < 2 features remain ['petal_width']
warnings.warn(f"Stopping because < 2 features remain {query_features}")We were able to get much higher perfomance on both the training and testing sets while identifying the most informative feature(s).
| study_name | best_hyperparameters | best_estimator | best_trial | number_of_initial_features | initial_training_score | initial_testing_score | number_of_selected_features | feature_selected_training_score | feature_selected_testing_score | selected_features |
|---|---|---|---|---|---|---|---|---|---|---|
| n_iter=1 | {'min_samples_leaf': 5, 'min_samples_split': 0.09313010568883545, 'max_features': None} | DecisionTreeClassifier(min_samples_leaf=5, | FrozenTrial(number=1, state=TrialState.COMPLETE, values=[0.9142857142857143, 0.9555555555555556], datetime_start=datetime.datetime(2024, 7, 6, 15, 49, 0, 127973), datetime_complete=datetime.datetime(2024, 7, 6, 15, 49, 0, 204635), params={'min_samples_leaf': 5, 'min_samples_split': 0.09313010568883545, 'max_features': None}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'min_samples_leaf': IntDistribution(high=50, log=False, low=1, step=1), 'min_samples_split': FloatDistribution(high=0.5, log=False, low=0.0, step=None), 'max_features': CategoricalDistribution(choices=('sqrt', 'log2', None))}, trial_id=1, value=None) | 1000 | 0.914286 | 0.955556 | 1 | 0.961905 | 0.955556 | ['petal_width'] |
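As a hedged follow-up sketch, the selected feature and best hyperparameters reported in the table above can be refit with plain Scikit-Learn, independent of Clairvoyance:

```python
# A minimal sketch: refit a plain Scikit-Learn estimator on the selected feature
# using the best hyperparameters reported above (values copied from the table)
from sklearn.tree import DecisionTreeClassifier

selected_features = ["petal_width"]
refit_estimator = DecisionTreeClassifier(
    min_samples_leaf=5,
    min_samples_split=0.09313010568883545,
    max_features=None,
    random_state=0,
)
refit_estimator.fit(X_training[selected_features], y_training)
print(refit_estimator.score(X_testing[selected_features], y_testing))  # accuracy on the held-out set
```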
Alright, let's switch it up and model a regression task instead. We are going to use the controversial Boston housing dataset just because it's easy. We are going to use a mean squared error scorer built with Scikit-Learn's make_scorer for the Bayesian hyperparameter optimization.
# Load modules
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from clairvoyance.bayesian import BayesianClairvoyanceRegression
from sklearn.metrics import make_scorer
# Load Boston data
# from sklearn.datasets import load_boston; boston = load_boston() # Deprecated
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
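# Each record spans two rows: even rows hold the first 11 features; odd rows hold the remaining 2 features and the target (MEDV)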
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
X = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
y = pd.Series(target)
# Add some noise features to total 1000 features
number_of_noise_features = 1000 - X.shape[1]
X_noise = pd.DataFrame(np.random.RandomState(0).normal(size=(X.shape[0], number_of_noise_features)), columns=map(lambda j: f"noise_{j}", range(number_of_noise_features)))
X_boston_with_noise = pd.concat([X, X_noise], axis=1)
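# Center each feature to zero mean and scale to unit variance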
X_normalized = X_boston_with_noise - X_boston_with_noise.mean(axis=0).values
X_normalized = X_normalized/X_normalized.std(axis=0).values
# Let's fit the model but leave a held out testing set
X_training, X_testing, y_training, y_testing = train_test_split(X_normalized, y, random_state=0, test_size=0.1)
# Define the parameter space
estimator = DecisionTreeRegressor(random_state=0)
param_space = {
    "min_samples_leaf": ["int", 1, 50],
    "min_samples_split": ["float", 0.0, 0.5],
    "max_features": ["categorical", ["sqrt", "log2", None]],
}
scorer = make_scorer(mean_squared_error, greater_is_better=False)  # Optional: a custom scorer does not need to be provided
# Fit the AutoML model
model = BayesianClairvoyanceRegression(estimator, param_space, scorer=scorer, n_iter=4, n_trials=10, feature_selection_method="addition", n_jobs=-1, verbose=1, feature_selection_performance_threshold=0.0)
df_results = model.fit_transform(X_training, y_training, cv=5, optimize_with_training_and_testing="auto", X_testing=X_testing, y_testing=y_testing)
[I 2024-07-06 01:30:03,567] A new study created in memory with name: n_iter=1
[I 2024-07-06 01:30:03,781] Trial 0 finished with values: [-8.199129905056083, -10.15240690512492] and parameters: {'min_samples_leaf': 21, 'min_samples_split': 0.36016224672107905, 'max_features': 'log2'}.
[I 2024-07-06 01:30:04,653] Trial 1 finished with values: [-4.971853722495094, -6.666700255530846] and parameters: {'min_samples_leaf': 5, 'min_samples_split': 0.09313010568883545, 'max_features': None}.
[I 2024-07-06 01:30:05,188] Trial 2 finished with values: [-8.230463026740736, -10.167328393077224] and parameters: {'min_samples_leaf': 21, 'min_samples_split': 0.34260975019837975, 'max_features': 'log2'}.
...
Recursive feature addition: 100%|██████████| 2/2 [00:00<00:00, 116.99it/s]
Synopsis[n_iter=4] Input Features: 3, Selected Features: 3
Initial Training Score: -4.972940969198907, Feature Selected Training Score: -4.972940969198907
Initial Testing Score: -6.313587662660524, Feature Selected Testing Score: -6.313587662660524

We successfully removed all of the noise features and determined that RM, LSTAT, and CRIM are the most informative features. Interpreting this dataset is controversial, so I won't go there, but these results agree with what other researchers have found as well.
| study_name | best_hyperparameters | best_estimator | best_trial | number_of_initial_features | initial_training_score | initial_testing_score | number_of_selected_features | feature_selected_training_score | feature_selected_testing_score | selected_features |
|---|---|---|---|---|---|---|---|---|---|---|
| n_iter=1 | {'min_samples_leaf': 5, 'min_samples_split': 0.09313010568883545, 'max_features': None} | DecisionTreeRegressor(min_samples_leaf=5, min_samples_split=0.09313010568883545, random_state=0) | FrozenTrial(number=1, state=TrialState.COMPLETE, values=[-4.971853722495094, -6.666700255530846], datetime_start=datetime.datetime(2024, 7, 6, 1, 30, 4, 256210), datetime_complete=datetime.datetime(2024, 7, 6, 1, 30, 4, 653385), params={'min_samples_leaf': 5, 'min_samples_split': 0.09313010568883545, 'max_features': None}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'min_samples_leaf': IntDistribution(high=50, log=False, low=1, step=1), 'min_samples_split': FloatDistribution(high=0.5, log=False, low=0.0, step=None), 'max_features': CategoricalDistribution(choices=('sqrt', 'log2', None))}, trial_id=1, value=None) | 1000 | -4.971853722495094 | -6.666700255530846 | 12 | -4.167626439610535 | -6.497959383451274 | ['RM', 'LSTAT', 'CRIM', 'DIS', 'TAX', 'noise_657', 'noise_965', 'noise_711', 'noise_213', 'noise_930', 'noise_253', 'noise_484'] |
| n_iter=2 | {'min_samples_leaf': 30, 'min_samples_split': 0.11300600030211794, 'max_features': None} | DecisionTreeRegressor(min_samples_leaf=30, min_samples_split=0.11300600030211794, random_state=0) | FrozenTrial(number=5, state=TrialState.COMPLETE, values=[-4.971072001107094, -6.2892657979392474], datetime_start=datetime.datetime(2024, 7, 6, 1, 30, 12, 603770), datetime_complete=datetime.datetime(2024, 7, 6, 1, 30, 12, 619502), params={'min_samples_leaf': 30, 'min_samples_split': 0.11300600030211794, 'max_features': None}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'min_samples_leaf': IntDistribution(high=50, log=False, low=1, step=1), 'min_samples_split': FloatDistribution(high=0.5, log=False, low=0.0, step=None), 'max_features': CategoricalDistribution(choices=('sqrt', 'log2', None))}, trial_id=5, value=None) | 12 | -4.971072001107094 | -6.2892657979392474 | 4 | -4.944562598653571 | -6.3774459339786524 | ['RM', 'LSTAT', 'CRIM', 'noise_213'] |
| n_iter=3 | {'min_samples_leaf': 45, 'min_samples_split': 0.06279265523191813, 'max_features': None} | DecisionTreeRegressor(min_samples_leaf=45, min_samples_split=0.06279265523191813, random_state=0) | FrozenTrial(number=1, state=TrialState.COMPLETE, values=[-5.236077512452411, -6.670753984555223], datetime_start=datetime.datetime(2024, 7, 6, 1, 30, 14, 831786), datetime_complete=datetime.datetime(2024, 7, 6, 1, 30, 14, 848240), params={'min_samples_leaf': 45, 'min_samples_split': 0.06279265523191813, 'max_features': None}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'min_samples_leaf': IntDistribution(high=50, log=False, low=1, step=1), 'min_samples_split': FloatDistribution(high=0.5, log=False, low=0.0, step=None), 'max_features': CategoricalDistribution(choices=('sqrt', 'log2', None))}, trial_id=1, value=None) | 4 | -5.236077512452411 | -6.670753984555223 | 3 | -5.236077512452413 | -6.670753984555223 | ['RM', 'LSTAT', 'CRIM'] |
| n_iter=4 | {'min_samples_leaf': 30, 'min_samples_split': 0.004493048833777491, 'max_features': None} | DecisionTreeRegressor(min_samples_leaf=30, min_samples_split=0.004493048833777491, random_state=0) | FrozenTrial(number=3, state=TrialState.COMPLETE, values=[-4.972940969198907, -6.313587662660524], datetime_start=datetime.datetime(2024, 7, 6, 1, 30, 19, 160978), datetime_complete=datetime.datetime(2024, 7, 6, 1, 30, 19, 177029), params={'min_samples_leaf': 30, 'min_samples_split': 0.004493048833777491, 'max_features': None}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'min_samples_leaf': IntDistribution(high=50, log=False, low=1, step=1), 'min_samples_split': FloatDistribution(high=0.5, log=False, low=0.0, step=None), 'max_features': CategoricalDistribution(choices=('sqrt', 'log2', None))}, trial_id=3, value=None) | 3 | -4.972940969198907 | -6.313587662660524 | 3 | -4.972940969198907 | -6.313587662660524 | ['RM', 'LSTAT', 'CRIM'] |
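As a final hedged sketch, the selected features and the best hyperparameters reported for n_iter=4 above can be refit and evaluated on the held-out testing set with plain Scikit-Learn:

```python
# A minimal sketch: refit a plain DecisionTreeRegressor on the selected features
# using the best hyperparameters reported above (values copied from the table)
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

selected_features = ["RM", "LSTAT", "CRIM"]
refit_estimator = DecisionTreeRegressor(
    min_samples_leaf=30,
    min_samples_split=0.004493048833777491,
    max_features=None,
    random_state=0,
)
refit_estimator.fit(X_training[selected_features], y_training)
y_hat = refit_estimator.predict(X_testing[selected_features])
print("Held-out MSE:", mean_squared_error(y_testing, y_hat))
```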