LILO: Bayesian Optimization with Natural Language Feedback

Official implementation of LILO: Bayesian Optimization with Natural Language Feedback (ICML 2026).

LILO performs Bayesian optimization guided by free-form natural-language (NL) feedback from a (possibly LLM-simulated) decision maker. The NL feedback is converted into an optimizable signal via LLM-automated pairwise comparisons followed by GP modelling, which feeds a standard BO acquisition loop.

[openreview] [arXiv] [Colab notebook]

What this repo contains

The main entry point:

bo_loop.py — main BO loop on synthetic environments (DTLZ2, Vehicle Safety, CarCab, Thermo, NAS-Bench-201). Hydra-configured.

Supporting modules:

environments.py — synthetic environments and their utility functions.
gp_models.py — SimpleGPProxyModel, CategoricalGPProxyModel.
utility_approximator.py — LLM-driven scalar / pairwise utility approximation.
human_feedback_simulator.py — simulated NL human responses via an LLM.
prompts.py — prompt templates used by the LLM-side machinery.
llm_utils.py — abstract LLMClient interface that you must implement (see below).
utils.py — small helpers (sigmoid, JSON extraction).
config/ — Hydra configs (one per method).

The entry point and supporting module for the summarization task:

summary_bo_loop.py — BO over LLM-summarization hyperparameters on CNN/DailyMail.
summay_utils.py — ArticleSummarizer, SummaryFeedbackGenerator, LLMPairwiseJudge for the summarization task.

💿 Installation

The steps below set up the environment.

1️⃣ Environment

First, clone the repository and enter the directory:

git clone https://github.com/facebookresearch/lilo
cd lilo

Then, set up a conda environment as follows:

conda create --name lilo python=3.12
conda activate lilo

Finally, install the required dependencies:

pip install -r requirements.txt
# or, for an editable install:
pip install -e .

2️⃣ LLM client setup (required)

The optimizer relies on an LLM to simulate human feedback, label pairwise preferences, generate prior knowledge, etc. The repo ships with lilo.llm_utils.PlaceholderLLMClient, which raises NotImplementedError on use. Subclass LLMClient against your provider and wire it into bo_loop._make_llm_client (and summary_bo_loop._make_llm_client if you run the summarization experiments).

The only method you must implement is:

async def get_batch_llm_responses(
    self, prompts, num_responses=1, kwargs=None,
    max_retries=8, timeout_per_call=None,
) -> list[list[str]]:
    ...

For each input prompt, return a list of num_responses string completions.
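To make the return shape concrete, here is a minimal offline stub that can be handy for smoke tests without network access. It is only a sketch: `EchoLLMClient` is a hypothetical name, it deliberately does not subclass `LLMClient` so that it runs without the repo installed, and it returns canned strings rather than real completions.

```python
import asyncio


class EchoLLMClient:
    """Hypothetical offline stand-in for an LLM client; not part of the repo."""

    def __init__(self, model="echo"):
        self.model = model

    async def get_batch_llm_responses(
        self, prompts, num_responses=1, kwargs=None,
        max_retries=8, timeout_per_call=None,
    ):
        # Outer list: one entry per prompt, in input order.
        # Inner list: exactly num_responses string completions for that prompt.
        return [
            [f"[{self.model} #{i}] {p}" for i in range(num_responses)]
            for p in prompts
        ]


def demo():
    client = EchoLLMClient()
    return asyncio.run(
        client.get_batch_llm_responses(["a", "b", "c"], num_responses=2)
    )
```

Calling `demo()` yields three inner lists of two strings each, matching the `list[list[str]]` annotation above.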

Example: OpenAI

import asyncio
from openai import AsyncOpenAI
from lilo.llm_utils import LLMClient


class OpenAIClient(LLMClient):
    def __init__(self, model="gpt-4o-mini", **kw):
        super().__init__(model=model)
        self.client = AsyncOpenAI()

    async def get_batch_llm_responses(self, prompts, num_responses=1, kwargs=None, **_):
        async def one(p):
            r = await self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": p}],
                n=num_responses,
                **(kwargs or {}),
            )
            return [c.message.content for c in r.choices]
        return await asyncio.gather(*(one(p) for p in prompts))

Then in bo_loop.py:

def _make_llm_client(model: str) -> LLMClient:
    return OpenAIClient(model=model)

llm_utils.py contains additional skeletons for Anthropic and local vLLM/HuggingFace.
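The OpenAI example accepts `max_retries` and `timeout_per_call` through `**_` and ignores them. One way a client might honor them is sketched below, using only `asyncio`. The helper name `call_with_retries` and the `base_delay` parameter are assumptions made for illustration; the repo may handle retries differently.

```python
import asyncio


async def call_with_retries(make_call, max_retries=8, timeout_per_call=None,
                            base_delay=0.5):
    """Await make_call() up to max_retries times with exponential backoff.

    make_call is a zero-argument function that returns a fresh coroutine.
    base_delay (seconds) is an assumed parameter, not taken from the repo.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            # With timeout=None, wait_for simply waits for completion.
            return await asyncio.wait_for(make_call(), timeout=timeout_per_call)
        except Exception as err:  # also covers timeouts raised by wait_for
            last_error = err
            if attempt < max_retries - 1:
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all {max_retries} attempts failed") from last_error


def demo():
    attempts = {"n": 0}

    async def flaky():
        # Fails twice with a transient error, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient")
        return ["ok"]

    result = asyncio.run(
        call_with_retries(flaky, max_retries=5, base_delay=0.0)
    )
    return result, attempts["n"]
```

In the OpenAI example, this would amount to naming `max_retries` and `timeout_per_call` in the method signature and gathering `call_with_retries(lambda: one(p), max_retries, timeout_per_call)` for each prompt instead of `one(p)`.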

3️⃣ Data setup (only required for NAS-Bench-201 and the summarization task)

Place the input datasets under data/ (or pass alternative paths via Hydra). See data/README.md for the exact format and download instructions:

NAS-Bench-201: data/nasbench201_dataset_<id>.jsonl, derived from https://github.com/D-X-Y/NAS-Bench-201.
CNN/DailyMail (summarization task only): data/cnn_daily_mail_{train,test}.parquet, from HuggingFace cnn_dailymail v3.0.0.

The other synthetic environments (dtlz2, vehicle_safety, carcab, thermo) need no external data — they use closed-form objectives or pymoo / botorch test functions.

⚙️ Configuration

The BO loop is configured via Hydra. The top-level entry is config/bo_loop.yaml; methods are selected via the Hydra method group.

`method=`	Description
`lilo`	Pairwise NL utility approximation (the paper's main method)
`lilo_scalar`	Scalar NL utility approximation
`true_utility_bo`	Oracle baseline using the true utility
`preferential_bo`	Preferential BO baseline (`PairwiseGP` + `qEUBO`)
`llm_direct`	LLM directly proposes the next candidate
`llm_2step`	LLM-based 2-step acquisition

`environment.name`	Utility functions used in the paper
`dtlz2`	`piecewise_linear`, `l1`, `beta_products`
`vehicle_safety`	`piecewise_linear`, `beta_products`, `vehicle_safety_llm`
`carcab`	`piecewise_linear`, `carcab_llm`
`thermo`	`thermo_A`, `thermo_B`
`nas201`	`nas201_research`, `nas201_edge`

Other useful overrides:

seed=<int> — random seed (default 0).
N_iter=<int> — number of BO trials (default 8).
bs_feedback=<int> — number of feedback datapoints (i.e., number of question-answer pairs in LILO, direct utility evaluations in the true utility baseline, pairwise comparisons in the preferential BO baseline) (default 2).
bs_exp=<int> — batch size of new candidates per trial (default null = problem dimension).
acquisition_method=log_nei|log_ei|thompson|ucb_<beta>|llm_2step|llm_direct.
pair_labeling_acquisition_method=sequential_q_eubo|random — for the LILO pairwise feedback acquisition ablation.
use_prior_knowledge=True/False, prior_knowledge_type=point|area|domain — for the prior-knowledge ablation. The prior is consumed at initialization (prior_knowledge_inc_method=llm_init, the only supported value).
save_outputs=True save_dir=runs/<name> — save per-trial results.

🎮 Single-run example

After implementing your LLMClient (the example below uses dtlz2, which needs no external data; for nas201, first drop nasbench201_dataset_0.jsonl into data/):

python -m lilo.bo_loop \
  method=lilo seed=0 \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  N_iter=8 \
  save_outputs=True save_dir=runs/main_benchmark

This writes runs/main_benchmark/run_<timestamp>/{config.json, results.jsonl}.

📄 Reproducing the paper experiments

Each paper figure corresponds to a sweep over (method, environment, utility_func, seed, ...). Use Hydra's multirun mode (-m) to launch the sweep; aggregate the resulting results.jsonl files yourself (the paper plots are means ± confidence intervals across seeds — no aggregation script ships with the repo).

Main benchmark

python -m lilo.bo_loop -m \
  method=lilo,true_utility_bo,preferential_bo,llm_direct,llm_2step \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/main_benchmark

Repeat with environment.name/utility_func for each row of the env table above.
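Rather than editing the command by hand for every row, the sweep commands can be generated. The sketch below copies the environment/utility pairs from the table in the Configuration section and the method and seed lists from the command above; the function name `main_benchmark_commands` is hypothetical, and the script only prints commands without running them.

```python
# Environment -> utility functions, copied from the environment table.
ENV_UTILITIES = {
    "dtlz2": ["piecewise_linear", "l1", "beta_products"],
    "vehicle_safety": ["piecewise_linear", "beta_products", "vehicle_safety_llm"],
    "carcab": ["piecewise_linear", "carcab_llm"],
    "thermo": ["thermo_A", "thermo_B"],
    "nas201": ["nas201_research", "nas201_edge"],
}
METHODS = ["lilo", "true_utility_bo", "preferential_bo", "llm_direct", "llm_2step"]
SEEDS = range(16)


def main_benchmark_commands(save_dir="runs/main_benchmark"):
    """Return one Hydra multirun command per (environment, utility) pair."""
    cmds = []
    for env, utilities in ENV_UTILITIES.items():
        for utility in utilities:
            cmds.append(" ".join([
                "python -m lilo.bo_loop -m",
                "method=" + ",".join(METHODS),
                f"environment.name={env}",
                f"environment.utility_func={utility}",
                "seed=" + ",".join(str(s) for s in SEEDS),
                "save_outputs=True",
                f"save_dir={save_dir}",
            ]))
    return cmds


if __name__ == "__main__":
    print("\n".join(main_benchmark_commands()))
```

Note that the nas201 commands still require the NAS-Bench-201 data described in the data setup step.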

LILO with prior knowledge

For synthetic environments, the available prior knowledge types are point and area:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=point,area \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

For real-world environments, use the domain knowledge type:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=domain \
  environment.name=thermo environment.utility_func=thermo_A,thermo_B \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

Pairwise vs scalar

python -m lilo.bo_loop -m \
  method=lilo_scalar \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/pairwise_vs_scalar

Pair acquisition for LLM labeling via qEUBO vs. random

python -m lilo.bo_loop -m \
  method=lilo \
  pair_labeling_acquisition_method=sequential_q_eubo,random \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/bpf_ablation

Acquisition-method ablation

python -m lilo.bo_loop -m \
  method=lilo \
  acquisition_method=log_nei,ucb_0.5,thompson \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/acqf_ablation

Summarization

summary_bo_loop.py is not Hydra-wrapped; configure it via the OmegaConf.create(...) block at the bottom of the file.

Key fields:

optimizer: "fixed" or "sampled".
seed, N_trials, N_configs, N_articles, bs_feedback, acqf, persona, optimize_preamble, local_dir (where the parquet files live), llm_model.

Sweep seed and optimizer to reproduce the two summarization plots.

💾 Output format

Each run writes:

<save_dir>/run_<timestamp>/
├── config.json      # the resolved Hydra config
└── results.jsonl    # one JSON object per BO trial

Per-trial fields include:

trial_index
best_value_predicted — best predicted utility seen so far
best_value_true - best ground truth utility seen so far
value_of_best_predicted - ground truth utility of the candidate with the best predicted value so far
pairwise_accuracy — LLM pairwise-judge accuracy on the trial's labeled pool
label_exp_df, context_df, prior_df — serialized pandas DataFrames

To compare methods across seeds, group by (method, environment.name, environment.utility_func, seed), take best_value_true per trial_index, and compute means / CIs.
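Since no aggregation script ships with the repo, the following standard-library sketch shows one way to do this. It pools seeds within each (method, environment.name, environment.utility_func) group and reports the mean and a 95% normal-approximation interval per trial_index. The config key paths (`config["method"]`, `config["environment"]["name"]`, `config["environment"]["utility_func"]`) are assumptions that mirror the command-line overrides; check them against an actual config.json before relying on the output.

```python
import json
import math
from collections import defaultdict
from pathlib import Path
from statistics import mean, stdev


def load_run(run_dir):
    """Return (config, trials) for one run_<timestamp> directory."""
    run_dir = Path(run_dir)
    config = json.loads((run_dir / "config.json").read_text())
    with open(run_dir / "results.jsonl") as f:
        trials = [json.loads(line) for line in f if line.strip()]
    return config, trials


def aggregate(save_dir, metric="best_value_true"):
    """Mean and 95% CI half-width of `metric` per trial, across runs."""
    # (method, env name, utility) -> trial_index -> list of values (one per run)
    curves = defaultdict(lambda: defaultdict(list))
    for run_dir in sorted(Path(save_dir).glob("run_*")):
        config, trials = load_run(run_dir)
        # ASSUMPTION: key paths mirror the CLI overrides; verify in config.json.
        key = (
            str(config["method"]),
            config["environment"]["name"],
            config["environment"]["utility_func"],
        )
        for trial in trials:
            curves[key][trial["trial_index"]].append(trial[metric])

    summary = {}
    for key, by_trial in curves.items():
        rows = []
        for t in sorted(by_trial):
            vals = by_trial[t]
            # Normal-approximation interval; zero width with a single run.
            half = 1.96 * stdev(vals) / math.sqrt(len(vals)) if len(vals) > 1 else 0.0
            rows.append((t, mean(vals), half))
        summary[key] = rows
    return summary
```

For example, `aggregate("runs/main_benchmark")` returns a dict mapping each group to a list of `(trial_index, mean, ci_half_width)` tuples that can be passed to any plotting library. With few seeds, a t-based interval may be more appropriate than the 1.96 multiplier used here.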

⚖️ License

The code is licensed under an MIT license.

📧 Contact

Katarzyna Kobalczyk: knk25@cam.ac.uk

Jerry Lin: zylin@meta.com

✍️ Citation

If you find this repository useful, please consider giving a star ⭐ and please cite as:

@inproceedings{
kobalczyk2026lilo,
title={{LILO}: Bayesian Optimization with Natural Language Feedback},
author={Katarzyna Kobalczyk and Zhiyuan Jerry Lin and Benjamin Letham and Zhuokai Zhao and Maximilian Balandat and Eytan Bakshy},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=zVrbE9ZtEU}
}
