Official implementation of LILO: Bayesian Optimization with Natural Language Feedback (ICML 2026).
LILO performs Bayesian optimization guided by free-form natural-language feedback from a (possibly LLM-simulated) decision maker. The NL feedback is converted into an optimizable signal via LLM-automated pairwise comparisons followed by GP modelling, which feeds a standard BO acquisition loop.
[openreview]
[arXiv]
[Colab notebook]
The main entry point:
- bo_loop.py: main BO loop on synthetic environments (DTLZ2, Vehicle Safety, CarCab, Thermo, NAS-Bench-201). Hydra-configured.
Supporting modules:
- environments.py: synthetic environments and their utility functions.
- gp_models.py: SimpleGPProxyModel, CategoricalGPProxyModel.
- utility_approximator.py: LLM-driven scalar / pairwise utility approximation.
- human_feedback_simulator.py: simulated NL human responses via an LLM.
- prompts.py: prompt templates used by the LLM-side machinery.
- llm_utils.py: abstract LLMClient interface that you must implement (see below).
- utils.py: small helpers (sigmoid, JSON extraction).
- config/: Hydra configs (one per method).
The entry point and supporting module for the summarization task:
- summary_bo_loop.py: BO over LLM-summarization hyperparameters on CNN/DailyMail.
- summay_utils.py: ArticleSummarizer, SummaryFeedbackGenerator, LLMPairwiseJudge for the summarization task.
Below we describe the steps to set up the environment.
First, clone the repository and enter the directory:
git clone https://github.com/facebookresearch/lilo
cd lilo
Then, set up a conda environment as follows:
conda create --name lilo python=3.12
conda activate lilo
Finally, install the required dependencies:
pip install -r requirements.txt
# or, for an editable install:
pip install -e .

The optimizer relies on an LLM to simulate human feedback, label pairwise preferences, generate prior knowledge, etc. The repo ships with lilo.llm_utils.PlaceholderLLMClient, which raises NotImplementedError on use. Subclass LLMClient for your provider and wire it into bo_loop._make_llm_client (and summary_bo_loop._make_llm_client if you run the summarization experiments).
The only method you must implement is:
async def get_batch_llm_responses(
    self, prompts, num_responses=1, kwargs=None,
    max_retries=8, timeout_per_call=None,
) -> list[list[str]]:
    ...

For each input prompt, return a list of num_responses string completions.
# Example: an OpenAI-backed client.
import asyncio

from openai import AsyncOpenAI

from lilo.llm_utils import LLMClient


class OpenAIClient(LLMClient):
    def __init__(self, model="gpt-4o-mini", **kw):
        super().__init__(model=model)
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def get_batch_llm_responses(self, prompts, num_responses=1, kwargs=None, **_):
        async def one(p):
            # One chat completion per prompt, requesting num_responses choices.
            r = await self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": p}],
                n=num_responses,
                **(kwargs or {}),
            )
            return [c.message.content for c in r.choices]

        # Run all prompts concurrently; results keep the input order.
        return await asyncio.gather(*(one(p) for p in prompts))

Then in bo_loop.py:
def _make_llm_client(model: str) -> LLMClient:
    return OpenAIClient(model=model)

llm_utils.py contains additional skeletons for Anthropic and local vLLM/HuggingFace.
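Before wiring in a paid provider, it can help to check the contract (one list of `num_responses` strings per prompt, in input order) with an offline stand-in. The sketch below is only an illustration: `EchoLLMClient` is a hypothetical name that does not exist in the repo, and it deliberately does not import `lilo`, so in practice you would subclass `LLMClient` instead.

```python
import asyncio


class EchoLLMClient:
    """Hypothetical offline stand-in (not part of the repo).

    It mirrors the signature of get_batch_llm_responses described above,
    but returns canned strings instead of calling an LLM.
    """

    def __init__(self, model="echo"):
        self.model = model

    async def get_batch_llm_responses(
        self, prompts, num_responses=1, kwargs=None,
        max_retries=8, timeout_per_call=None,
    ):
        async def one(prompt):
            await asyncio.sleep(0)  # yield control, as a real network call would
            return [f"[{self.model}#{i}] {prompt}" for i in range(num_responses)]

        # gather preserves the order of the input prompts.
        return list(await asyncio.gather(*(one(p) for p in prompts)))


responses = asyncio.run(
    EchoLLMClient().get_batch_llm_responses(["a", "b"], num_responses=2)
)
```

A stand-in like this only exercises the plumbing; the optimizer's behaviour with it is meaningless, since the returned text carries no preference information.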
Place the input datasets under data/ (or pass alternative paths via Hydra). See data/README.md for the exact format and download instructions:
- NAS-Bench-201: data/nasbench201_dataset_<id>.jsonl, derived from https://github.com/D-X-Y/NAS-Bench-201.
- CNN/DailyMail (summarization task only): data/cnn_daily_mail_{train,test}.parquet, from HuggingFace cnn_dailymail v3.0.0.
The other synthetic environments (dtlz2, vehicle_safety, carcab, thermo) need no external data — they use closed-form objectives or pymoo / botorch test functions.
The BO loop is configured via Hydra. The top-level entry is config/bo_loop.yaml; methods are selected via the Hydra method group.
| method= | Description |
|---|---|
| lilo | Pairwise NL utility approximation (the paper's main method) |
| lilo_scalar | Scalar NL utility approximation |
| true_utility_bo | Oracle baseline using the true utility |
| preferential_bo | Preferential BO baseline (PairwiseGP + qEUBO) |
| llm_direct | LLM directly proposes the next candidate |
| llm_2step | LLM-based 2-step acquisition |
| environment.name | Utility functions used in the paper |
|---|---|
| dtlz2 | piecewise_linear, l1, beta_products |
| vehicle_safety | piecewise_linear, beta_products, vehicle_safety_llm |
| carcab | piecewise_linear, carcab_llm |
| thermo | thermo_A, thermo_B |
| nas201 | nas201_research, nas201_edge |
Other useful overrides:
- seed=<int>: random seed (default 0).
- N_iter=<int>: number of BO trials (default 8).
- bs_feedback=<int>: number of feedback datapoints, i.e., question-answer pairs in LILO, direct utility evaluations in the true utility baseline, and pairwise comparisons in the preferential BO baseline (default 2).
- bs_exp=<int>: batch size of new candidates per trial (default null = problem dimension).
- acquisition_method=log_nei|log_ei|thompson|ucb_<beta>|llm_2step|llm_direct.
- pair_labeling_acquisition_method=sequential_q_eubo|random: for the LILO pairwise feedback acquisition ablation.
- use_prior_knowledge=True/False, prior_knowledge_type=point|area|domain: for the prior-knowledge ablation. The prior is consumed at initialization (prior_knowledge_inc_method=llm_init, the only supported value).
- save_outputs=True save_dir=runs/<name>: save per-trial results.
After implementing your LLMClient (and, for nas201 runs, dropping nasbench201_dataset_0.jsonl into data/):
python -m lilo.bo_loop \
method=lilo seed=0 \
environment.name=dtlz2 environment.utility_func=piecewise_linear \
N_iter=8 \
save_outputs=True save_dir=runs/main_benchmark

This writes runs/main_benchmark/run_<timestamp>/{config.json, results.jsonl}.
Each paper figure corresponds to a sweep over (method, environment, utility_func, seed, ...). Use Hydra's multirun mode (-m) to launch the sweep; aggregate the resulting results.jsonl files yourself (the paper plots are means ± confidence intervals across seeds — no aggregation script ships with the repo).
python -m lilo.bo_loop -m \
method=lilo,true_utility_bo,preferential_bo,llm_direct,llm_2step \
environment.name=dtlz2 environment.utility_func=piecewise_linear \
seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
save_outputs=True save_dir=runs/main_benchmark

Repeat with environment.name and environment.utility_func set for each row of the environment table above.
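Repeating the sweep for every row of the environment table can be scripted. The sketch below only builds the command strings; `sweep_commands` is a hypothetical helper, not something shipped with the repo, and the environment/utility pairs are copied from the table above.

```python
import shlex

# Environment -> utility functions, copied from the environment table.
ENV_UTILITIES = {
    "dtlz2": ["piecewise_linear", "l1", "beta_products"],
    "vehicle_safety": ["piecewise_linear", "beta_products", "vehicle_safety_llm"],
    "carcab": ["piecewise_linear", "carcab_llm"],
    "thermo": ["thermo_A", "thermo_B"],
    "nas201": ["nas201_research", "nas201_edge"],
}
METHODS = ["lilo", "true_utility_bo", "preferential_bo", "llm_direct", "llm_2step"]
SEEDS = range(16)


def sweep_commands(save_dir="runs/main_benchmark"):
    """Return one Hydra multirun command per (environment, utility) pair."""
    commands = []
    for env, utilities in ENV_UTILITIES.items():
        for utility in utilities:
            argv = [
                "python", "-m", "lilo.bo_loop", "-m",
                "method=" + ",".join(METHODS),
                f"environment.name={env}",
                f"environment.utility_func={utility}",
                "seed=" + ",".join(str(s) for s in SEEDS),
                "save_outputs=True",
                f"save_dir={save_dir}",
            ]
            commands.append(shlex.join(argv))
    return commands


commands = sweep_commands()
```

Each string can be pasted into a shell or handed to a job scheduler. Nothing is executed here, so it is safe to inspect the list before launching anything.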
For synthetic environments, the available prior knowledge types are point and area:
python -m lilo.bo_loop -m \
method=lilo \
use_prior_knowledge=True \
prior_knowledge_type=point,area \
environment.name=dtlz2 environment.utility_func=piecewise_linear \
seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
save_outputs=True save_dir=runs/prior_knowledge

For real-world environments, use the domain prior knowledge type:
python -m lilo.bo_loop -m \
method=lilo \
use_prior_knowledge=True \
prior_knowledge_type=domain \
environment.name=thermo environment.utility_func=thermo_A,thermo_B \
seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
save_outputs=True save_dir=runs/prior_knowledge

For the pairwise vs. scalar comparison, run the scalar variant:
python -m lilo.bo_loop -m \
method=lilo_scalar \
environment.name=dtlz2 environment.utility_func=piecewise_linear \
seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
save_outputs=True save_dir=runs/pairwise_vs_scalar

For the pairwise feedback acquisition ablation:
python -m lilo.bo_loop -m \
method=lilo \
pair_labeling_acquisition_method=sequential_q_eubo,random \
environment.name=dtlz2 environment.utility_func=piecewise_linear \
seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
save_outputs=True save_dir=runs/bpf_ablation

For the acquisition function ablation:
python -m lilo.bo_loop -m \
method=lilo \
acquisition_method=log_nei,ucb_0.5,thompson \
environment.name=dtlz2 environment.utility_func=piecewise_linear \
seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
save_outputs=True save_dir=runs/acqf_ablation

summary_bo_loop.py is not Hydra-wrapped; configure it via the OmegaConf.create(...) block at the bottom of the file.
Key fields:
- optimizer: "fixed" or "sampled".
- seed, N_trials, N_configs, N_articles, bs_feedback, acqf, persona, optimize_preamble, local_dir (where the parquet files live), llm_model.
Sweep seed and optimizer to reproduce the two summarization plots.
Each run writes:
<save_dir>/run_<timestamp>/
├── config.json # the resolved Hydra config
└── results.jsonl # one JSON object per BO trial
Per-trial fields include:
- trial_index
- best_value_predicted: best predicted utility seen so far
- best_value_true: best ground truth utility seen so far
- value_of_best_predicted: ground truth utility value of the best predicted candidate so far
- pairwise_accuracy: LLM pairwise-judge accuracy on the trial's labeled pool
- label_exp_df, context_df, prior_df: serialized pandas DataFrames
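To make the pairwise_accuracy field concrete, here is one plausible way such a number could be computed: the fraction of labeled pairs where the judge's choice agrees with the ordering of the true utilities. This is a sketch under assumptions, not the repo's implementation. The label encoding (1 means the first item is preferred) and the tie handling (a tie counts as the second item being preferred) are choices made for the example.

```python
def pairwise_accuracy(pairs, labels, true_utility):
    """Fraction of judge labels that match the true utility ordering.

    pairs        -- list of (i, j) index pairs into true_utility
    labels       -- 1 if the judge preferred item i, 0 if it preferred item j
    true_utility -- sequence of ground-truth utility values
    """
    if not pairs:
        raise ValueError("need at least one labeled pair")
    correct = 0
    for (i, j), label in zip(pairs, labels):
        # Assumed convention: ties are treated as "j preferred".
        truth = 1 if true_utility[i] > true_utility[j] else 0
        correct += int(label == truth)
    return correct / len(pairs)


# Toy example: the judge gets the third pair wrong.
utilities = [0.1, 0.5, 0.3]
accuracy = pairwise_accuracy([(0, 1), (1, 2), (0, 2)], [0, 1, 1], utilities)
```

In the toy example two of the three labels agree with the true ordering, so the result is 2/3.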
To compare methods across seeds, group by (method, environment.name, environment.utility_func, seed), take best_value_true per trial_index, and compute means / CIs.
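Since no aggregation script ships with the repo, the following standard-library sketch shows one way to do it. Several things here are assumptions: `load_trials` and `aggregate` are hypothetical helpers, the group label for each run is supplied by you (for example, read from that run's config.json), and the interval is a normal-approximation half-width, which may differ from how the paper's confidence intervals were computed.

```python
import json
import math
from collections import defaultdict
from pathlib import Path
from statistics import mean, stdev


def load_trials(run_dir):
    """Read one run's results.jsonl into a list of per-trial dicts."""
    with open(Path(run_dir) / "results.jsonl") as f:
        return [json.loads(line) for line in f if line.strip()]


def aggregate(runs, key="best_value_true", z=1.96):
    """Average a per-trial field across seeds.

    runs -- iterable of (group_label, trials), one entry per seed
    Returns {(group_label, trial_index): (mean, half_width)}.
    """
    values = defaultdict(list)
    for label, trials in runs:
        for trial in trials:
            values[(label, trial["trial_index"])].append(trial[key])
    summary = {}
    for group, vals in values.items():
        # Normal-approximation half-width; zero when only one seed is present.
        half = z * stdev(vals) / math.sqrt(len(vals)) if len(vals) > 1 else 0.0
        summary[group] = (mean(vals), half)
    return summary


# Toy demonstration with two seeds and made-up numbers (not paper results).
demo_runs = [
    ("lilo", [{"trial_index": 0, "best_value_true": 0.2},
              {"trial_index": 1, "best_value_true": 0.6}]),
    ("lilo", [{"trial_index": 0, "best_value_true": 0.4},
              {"trial_index": 1, "best_value_true": 0.8}]),
]
summary = aggregate(demo_runs)
```

For real runs, build `runs` as `[(label, load_trials(run_dir)), ...]` over the `run_<timestamp>` directories, with a label that encodes method, environment, and utility function.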
The code is licensed under the MIT license.
Katarzyna Kobalczyk: knk25@cam.ac.uk
Jerry Lin: zylin@meta.com
If you find this repository useful, please consider giving a star ⭐ and please cite as:
@inproceedings{
kobalczyk2026lilo,
title={{LILO}: Bayesian Optimization with Natural Language Feedback},
author={Katarzyna Kobalczyk and Zhiyuan Jerry Lin and Benjamin Letham and Zhuokai Zhao and Maximilian Balandat and Eytan Bakshy},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=zVrbE9ZtEU}
}
