LILO: Bayesian Optimization with Natural Language Feedback

Official implementation of LILO: Bayesian Optimization with Natural Language Feedback (ICML 2026).

LILO performs Bayesian optimization guided by free-form natural-language feedback from a (possibly LLM-simulated) decision maker. The NL feedback is converted into an optimizable signal via LLM-automated pairwise comparisons followed by GP modelling, which feeds a standard BO acquisition loop.
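To make the "pairwise comparisons to optimizable signal" step concrete, here is a minimal sketch that is not the repository's implementation. It swaps the GP model for a plain Bradley-Terry fit (one latent utility per candidate, trained by gradient ascent), and the function name, hyperparameters, and preference labels are all hypothetical.

```python
import math


def fit_latent_utilities(n_items, comparisons, lr=0.1, steps=500, l2=0.01):
    """Fit one latent utility per item from (winner, loser) index pairs.

    Stand-in for the GP step: maximizes the Bradley-Terry log-likelihood
    sum(log sigmoid(u[w] - u[l])) with a small L2 penalty, by gradient ascent.
    """
    u = [0.0] * n_items
    for _ in range(steps):
        grad = [-l2 * ui for ui in u]  # gradient of the L2 penalty
        for w, l in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))
            grad[w] += 1.0 - p_win  # d/du_w of log sigmoid(u_w - u_l)
            grad[l] -= 1.0 - p_win
        u = [ui + lr * g for ui, g in zip(u, grad)]
    return u


# Hypothetical labels: item 0 preferred to 1, 1 to 2, and 0 to 2.
utilities = fit_latent_utilities(3, [(0, 1), (1, 2), (0, 2)])
best = max(range(3), key=lambda i: utilities[i])
```

The fitted utilities only rank the labeled candidates. A GP over the input space, as used in the repo, additionally generalizes to unseen candidates, which is what the acquisition function needs.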

[openreview] [arXiv] [Colab notebook]

What this repo contains

The main entry point:

  • bo_loop.py — main BO loop on synthetic environments (DTLZ2, Vehicle Safety, CarCab, Thermo, NAS-Bench-201). Hydra-configured.

Supporting modules:

  • environments.py — synthetic environments and their utility functions.
  • gp_models.py - SimpleGPProxyModel, CategoricalGPProxyModel.
  • utility_approximator.py — LLM-driven scalar / pairwise utility approximation.
  • human_feedback_simulator.py — simulated NL human responses via an LLM.
  • prompts.py — prompt templates used by the LLM-side machinery.
  • llm_utils.py — abstract LLMClient interface that you must implement (see below).
  • utils.py — small helpers (sigmoid, JSON extraction).
  • config/ — Hydra configs (one per method).

The entry point and supporting module for the summarization task:

  • summary_bo_loop.py — BO over LLM-summarization hyperparameters on CNN/DailyMail.
  • summay_utils.py - ArticleSummarizer, SummaryFeedbackGenerator, LLMPairwiseJudge for the summarization task.

💿 Installation

Below we describe the steps to set up the environment.

1️⃣ Environment

First, clone the repository and enter the directory:

git clone https://github.com/facebookresearch/lilo
cd lilo

Then, set up a conda environment as follows:

conda create --name lilo python=3.12
conda activate lilo

Finally, install the required dependencies:

pip install -r requirements.txt
# or, for an editable install:
pip install -e .

2️⃣ LLM client setup (required)

The optimizer relies on an LLM to simulate human feedback, label pairwise preferences, generate prior knowledge, etc. The repo ships with lilo.llm_utils.PlaceholderLLMClient, which raises NotImplementedError on use. Subclass LLMClient against your provider and wire it into bo_loop._make_llm_client (and summary_bo_loop._make_llm_client if you run the summarization experiments).

The only method you must implement is:

async def get_batch_llm_responses(
    self, prompts, num_responses=1, kwargs=None,
    max_retries=8, timeout_per_call=None,
) -> list[list[str]]:
    ...

For each input prompt, return a list of num_responses string completions.

Example: OpenAI

import asyncio
from openai import AsyncOpenAI
from lilo.llm_utils import LLMClient

class OpenAIClient(LLMClient):
    def __init__(self, model="gpt-4o-mini", **kw):
        super().__init__(model=model)
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

    async def get_batch_llm_responses(self, prompts, num_responses=1, kwargs=None, **_):
        # Request num_responses completions for a single prompt.
        async def one(p):
            r = await self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": p}],
                n=num_responses,
                **(kwargs or {}),
            )
            return [c.message.content for c in r.choices]
        # Run all prompts concurrently; gather preserves the prompt order.
        return await asyncio.gather(*(one(p) for p in prompts))

Then in bo_loop.py:

def _make_llm_client(model: str) -> LLMClient:
    return OpenAIClient(model=model)

llm_utils.py contains additional skeletons for Anthropic and local vLLM/HuggingFace.
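Before wiring in a real provider, it can help to check the expected return shape with an offline stub. The sketch below is an assumption-laden illustration: EchoLLMClient and check_shape are hypothetical names, and the class is standalone so that it runs without the package installed. In the repo it would subclass lilo.llm_utils.LLMClient.

```python
import asyncio


class EchoLLMClient:
    """Offline stub with the same method signature as the README's LLMClient.

    Assumption: in the real repo this would subclass lilo.llm_utils.LLMClient;
    it is standalone here so the sketch runs without the package installed.
    """

    async def get_batch_llm_responses(
        self, prompts, num_responses=1, kwargs=None,
        max_retries=8, timeout_per_call=None,
    ):
        # One inner list per prompt, each holding num_responses strings.
        return [
            [f"stub reply {i} to: {p}" for i in range(num_responses)]
            for p in prompts
        ]


def check_shape(responses, n_prompts, num_responses):
    """Return True if responses is a list[list[str]] of the expected size."""
    return (
        len(responses) == n_prompts
        and all(len(r) == num_responses for r in responses)
        and all(isinstance(s, str) for r in responses for s in r)
    )


replies = asyncio.run(
    EchoLLMClient().get_batch_llm_responses(["a", "b"], num_responses=3)
)
```

Running the same check_shape helper against a real client is a cheap way to catch a flattened or mis-ordered response list before starting a long BO run.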

3️⃣ Data setup (only required for NAS-Bench-201 and the summarization task)

Place the input datasets under data/ (or pass alternative paths via Hydra). See data/README.md for the exact format and download instructions:

  • NAS-Bench-201: data/nasbench201_dataset_<id>.jsonl, derived from https://github.com/D-X-Y/NAS-Bench-201.
  • CNN/DailyMail (summarization task only): data/cnn_daily_mail_{train,test}.parquet, from HuggingFace cnn_dailymail v3.0.0.

The other synthetic environments (dtlz2, vehicle_safety, carcab, thermo) need no external data — they use closed-form objectives or pymoo / botorch test functions.

⚙️ Configuration

The BO loop is configured via Hydra. The top-level entry is config/bo_loop.yaml; methods are selected via the Hydra method group.

| method= | Description |
| --- | --- |
| lilo | Pairwise NL utility approximation (the paper's main method) |
| lilo_scalar | Scalar NL utility approximation |
| true_utility_bo | Oracle baseline using the true utility |
| preferential_bo | Preferential BO baseline (PairwiseGP + qEUBO) |
| llm_direct | LLM directly proposes the next candidate |
| llm_2step | LLM-based 2-step acquisition |

| environment.name | Utility functions used in the paper |
| --- | --- |
| dtlz2 | piecewise_linear, l1, beta_products |
| vehicle_safety | piecewise_linear, beta_products, vehicle_safety_llm |
| carcab | piecewise_linear, carcab_llm |
| thermo | thermo_A, thermo_B |
| nas201 | nas201_research, nas201_edge |

Other useful overrides:

  • seed=<int> — random seed (default 0).
  • N_iter=<int> — number of BO trials (default 8).
  • bs_feedback=<int> — number of feedback datapoints, i.e., question-answer pairs in LILO, direct utility evaluations in the true utility baseline, or pairwise comparisons in the preference BO baselines (default 2).
  • bs_exp=<int> — batch size of new candidates per trial (default null = problem dimension).
  • acquisition_method=log_nei|log_ei|thompson|ucb_<beta>|llm_2step|llm_direct.
  • pair_labeling_acquisition_method=sequential_q_eubo|random — for the LILO pairwise feedback acquisition ablation.
  • use_prior_knowledge=True/False, prior_knowledge_type=point|area|domain — for the prior-knowledge ablation. The prior is consumed at initialization (prior_knowledge_inc_method=llm_init, the only supported value).
  • save_outputs=True save_dir=runs/<name> — save per-trial results.

🎮 Single-run example

After implementing your LLMClient (the dtlz2 run below needs no external data; for nas201 runs, first place nasbench201_dataset_0.jsonl in data/):

python -m lilo.bo_loop \
  method=lilo seed=0 \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  N_iter=8 \
  save_outputs=True save_dir=runs/main_benchmark

This writes runs/main_benchmark/run_<timestamp>/{config.json, results.jsonl}.

📄 Reproducing the paper experiments

Each paper figure corresponds to a sweep over (method, environment, utility_func, seed, ...). Use Hydra's multirun mode (-m) to launch the sweep; aggregate the resulting results.jsonl files yourself (the paper plots are means ± confidence intervals across seeds — no aggregation script ships with the repo).

Main benchmark

python -m lilo.bo_loop -m \
  method=lilo,true_utility_bo,preferential_bo,llm_direct,llm_2step \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/main_benchmark

Repeat with environment.name/utility_func for each row of the env table above.
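As a convenience, the per-environment commands can be generated rather than typed by hand. This is a sketch under stated assumptions: every method is run on every row of the environment table, and the utility functions of one environment are swept in a single command (the same comma-separated style the thermo prior-knowledge command uses). The helper name multirun_command is hypothetical.

```python
import shlex

# Environment -> utility functions, copied from the table in the Configuration section.
ENV_UTILITIES = {
    "dtlz2": ["piecewise_linear", "l1", "beta_products"],
    "vehicle_safety": ["piecewise_linear", "beta_products", "vehicle_safety_llm"],
    "carcab": ["piecewise_linear", "carcab_llm"],
    "thermo": ["thermo_A", "thermo_B"],
    "nas201": ["nas201_research", "nas201_edge"],
}
METHODS = ["lilo", "true_utility_bo", "preferential_bo", "llm_direct", "llm_2step"]
SEEDS = list(range(16))


def multirun_command(env, utilities, save_dir="runs/main_benchmark"):
    """Build one Hydra multirun command covering all utilities of one environment."""
    args = [
        "python", "-m", "lilo.bo_loop", "-m",
        "method=" + ",".join(METHODS),
        f"environment.name={env}",
        "environment.utility_func=" + ",".join(utilities),
        "seed=" + ",".join(str(s) for s in SEEDS),
        "save_outputs=True",
        f"save_dir={save_dir}",
    ]
    return shlex.join(args)


commands = [multirun_command(env, utils) for env, utils in ENV_UTILITIES.items()]
# Each comma-separated override multiplies the number of jobs in the sweep.
total_jobs = sum(len(METHODS) * len(u) * len(SEEDS) for u in ENV_UTILITIES.values())

for cmd in commands:
    print(cmd)
```

Under these assumptions the full main benchmark comes to 960 runs, each of which makes LLM calls, so it is worth estimating cost before launching.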

LILO with prior knowledge

For synthetic environments, the available prior knowledge types are point and area:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=point,area \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

For real-world environments, use the domain prior knowledge type:

python -m lilo.bo_loop -m \
  method=lilo \
  use_prior_knowledge=True \
  prior_knowledge_type=domain \
  environment.name=thermo environment.utility_func=thermo_A,thermo_B \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/prior_knowledge

Pairwise vs scalar

python -m lilo.bo_loop -m \
  method=lilo_scalar \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/pairwise_vs_scalar

Pair acquisition for LLM labeling via qEUBO vs. random

python -m lilo.bo_loop -m \
  method=lilo \
  pair_labeling_acquisition_method=sequential_q_eubo,random \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/bpf_ablation

Acquisition-method ablation

python -m lilo.bo_loop -m \
  method=lilo \
  acquisition_method=log_nei,ucb_0.5,thompson \
  environment.name=dtlz2 environment.utility_func=piecewise_linear \
  seed=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 \
  save_outputs=True save_dir=runs/acqf_ablation

Summarization

summary_bo_loop.py is not Hydra-wrapped; configure it via the OmegaConf.create(...) block at the bottom of the file.

Key fields:

  • optimizer: "fixed" or "sampled".
  • seed, N_trials, N_configs, N_articles, bs_feedback, acqf, persona, optimize_preamble, local_dir (where the parquet files live), llm_model.

Sweep seed and optimizer to reproduce the two summarization plots.

💾 Output format

Each run writes:

<save_dir>/run_<timestamp>/
├── config.json      # the resolved Hydra config
└── results.jsonl    # one JSON object per BO trial

Per-trial fields include:

  • trial_index
  • best_value_predicted — best predicted utility seen so far
  • best_value_true - best ground truth utility seen so far
  • value_of_best_predicted - ground truth utility value of the best predicted so far
  • pairwise_accuracy — LLM pairwise-judge accuracy on the trial's labeled pool
  • label_exp_df, context_df, prior_df — serialized pandas DataFrames
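A run directory with this layout can be read with the standard library alone. The following is a minimal sketch: load_run is a hypothetical helper, and the demo writes a throwaway directory with made-up values purely to show the round trip.

```python
import json
import tempfile
from pathlib import Path


def load_run(run_dir):
    """Return (config, trials) for one run directory with the layout above."""
    run_dir = Path(run_dir)
    config = json.loads((run_dir / "config.json").read_text())
    with open(run_dir / "results.jsonl") as f:
        # One JSON object per line; skip blank lines defensively.
        trials = [json.loads(line) for line in f if line.strip()]
    return config, trials


# Demo on a throwaway directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp) / "run_demo"
    run.mkdir()
    (run / "config.json").write_text(json.dumps({"seed": 0}))
    (run / "results.jsonl").write_text(
        '{"trial_index": 0, "best_value_true": 0.1}\n'
        '{"trial_index": 1, "best_value_true": 0.4}\n'
    )
    config, trials = load_run(run)
```

The serialized DataFrame fields (label_exp_df, context_df, prior_df) are left as raw JSON values here, since their exact serialization format is not described in this README.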

To compare methods across seeds, group runs by (method, environment.name, environment.utility_func), then at each trial_index compute the mean and confidence interval of best_value_true across seeds.
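Since no aggregation script ships with the repo, here is one possible sketch of that computation using only the standard library. The row layout (a "group" tuple plus seed, trial_index, best_value_true) and the normal-approximation interval are assumptions; the README does not say which confidence interval the paper plots use.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def aggregate(rows, z=1.96):
    """rows: dicts with group, seed, trial_index, best_value_true.

    Returns {(group, trial_index): (mean, ci_half_width)} across seeds,
    using a normal-approximation interval (an assumption, see lead-in).
    """
    buckets = defaultdict(list)
    for r in rows:
        buckets[(r["group"], r["trial_index"])].append(r["best_value_true"])
    out = {}
    for key, vals in buckets.items():
        half = z * stdev(vals) / math.sqrt(len(vals)) if len(vals) > 1 else 0.0
        out[key] = (mean(vals), half)
    return out


# Made-up values: two seeds, two trials each.
GROUP = ("lilo", "dtlz2", "piecewise_linear")
rows = [
    {"group": GROUP, "seed": s, "trial_index": t, "best_value_true": v}
    for s, series in enumerate([[0.0, 1.0], [2.0, 3.0]])
    for t, v in enumerate(series)
]
summary = aggregate(rows)
```

With only a handful of seeds, a t-based or bootstrap interval may be more appropriate than the z = 1.96 used here.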

⚖️ License

The code is licensed under an MIT license.

📧 Contact

Katarzyna Kobalczyk: knk25@cam.ac.uk

Jerry Lin: zylin@meta.com

✍️ Citation

If you find this repository useful, please consider giving a star ⭐ and please cite as:

@inproceedings{
kobalczyk2026lilo,
title={{LILO}: Bayesian Optimization with Natural Language Feedback},
author={Katarzyna Kobalczyk and Zhiyuan Jerry Lin and Benjamin Letham and Zhuokai Zhao and Maximilian Balandat and Eytan Bakshy},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=zVrbE9ZtEU}
}
