This repo contains the code and data associated with an early pilot experiment on chatbot personalization from the project Generative Social Choice (paper, general audience report). This experiment was conducted in November 2023 as part of OpenAI's Democratic Inputs to AI program.
We've since conducted follow-up experiments that improve on this implementation. For the code and data associated with the latest draft of Generative Social Choice, click here. For the code and data associated with the follow-up paper Generative Social Choice: The Next Generation by Niclas Böhmer, Sara Fish, and Ariel Procaccia, click here.
Authors of Generative Social Choice: Sara Fish, Paul Gölz, David Parkes, Ariel Procaccia, Gili Rusak, Itai Shapira, and Manuel Wüthrich.
- In the folder where this `README.md` file is located, call `pip install -e .`
- Install dependencies: `pipenv install`
- Create a file `OPENAI_API_KEY` in `utils/`, and write in it your (personal) API key.
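For example, the API key file can be created from the shell (the key value below is a placeholder; substitute your own):

```shell
# Create the key file in utils/ (placeholder value; use your real key)
mkdir -p utils
printf '%s' "sk-your-key-here" > utils/OPENAI_API_KEY
```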
`data/` has all cleaned and anonymized data associated with the experiments in the paper:

- `chatbot_personalization_data.csv`: our cleaned and anonymized survey data, collected on Prolific. Also available at the dedicated repo.
- `validate_disc_query_logs.csv`: logs from our discriminative query validation experiment (Figure 1; replicate with `paper_replication/validate_discriminative_query.py`)
- `gen_query_eval/`: logs from our generative query evaluation experiment (Figure 2; replicate with `paper_replication/gen_query_eval.py`)
- `user_summaries_generation.csv` and `user_summaries_generation_raw_output.csv`: the user summaries (and logs) used in our slate generation (replicate with `paper_replication/generate_summaries.py`)
- `ratings_and_matching.csv`: assignments of validation users to statements (Figures 4-5; replicate with `paper_replication/compute_matching.py`)
`paper_replication/` has scripts for replicating the experiments in the paper:

- `validate_discriminative_query.py`: validating discriminative queries (Figure 1)
- `gen_query_eval.py`: evaluating generative queries (Figure 2)
- `generate_summaries.py`: generating user summaries
- `generate_slate.py`: generating the slate
- `compute_matching.py`: computing the assignment of users to statements in the slate (Figures 4-5)
`plots/` has code for generating each of the plots in the paper, and the plots themselves.

`queries/` has the implementation of the queries:

- `query_chatbot_personalization.py` contains all of the chatbot-personalization-specific implementation
- `query_interface.py` describes the interface for agents and generators. Anything that implements this interface should automatically work with our slate generation code.
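As a rough illustration of what plugging into such an interface could look like, here is a toy sketch. The class and method names below are hypothetical, invented for this example; see `queries/query_interface.py` for the actual definitions.

```python
# Hypothetical sketch -- the real interface in queries/query_interface.py may differ.
from abc import ABC, abstractmethod


class Agent(ABC):
    """An agent that can rate candidate statements (discriminative query)."""

    @abstractmethod
    def get_approval(self, statement: str) -> float:
        ...


class Generator(ABC):
    """A generator that proposes a statement for a group of agents (generative query)."""

    @abstractmethod
    def generate(self, agents: list) -> str:
        ...


# A trivial non-LLM implementation, just to show the plug-in shape:
class KeywordAgent(Agent):
    def __init__(self, keyword: str):
        self.keyword = keyword

    def get_approval(self, statement: str) -> float:
        # Approve iff the statement mentions the keyword.
        return 1.0 if self.keyword in statement.lower() else 0.0
```

In this sketch, any class implementing the abstract methods could be handed to the slate generation code in place of the LLM-backed implementations.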
`slates/` has our implementation of the slate generation algorithm in `generate_slate_ensemble_greedy.py`.

`test/` has unit tests.

`utils/` has miscellaneous tools:

- `gpt_wrapper.py` contains code for making LLM calls
- `helper_functions.py` has `get_base_dir_path()` and `get_time_string()`
- `dataframe_completion.py` contains code for df-completion style LLM calls, used for our summary generation and generative query
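For instance, a minimal sketch of how code might read the key file created during setup (the helper name is hypothetical; the actual `gpt_wrapper.py` may load the key differently):

```python
from pathlib import Path


def load_api_key(base_dir: Path) -> str:
    """Read the OpenAI API key from utils/OPENAI_API_KEY (see setup steps above)."""
    return (base_dir / "utils" / "OPENAI_API_KEY").read_text().strip()
```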
Each figure in the paper can be generated using a dedicated notebook:
- Figure 1: `plots/fig1_disc_query_eval.ipynb`
- Figure 2: `plots/fig2_slate_composition.ipynb`
- Figure 3: N/A
- Figure 4: `plots/fig4_assigned_utilities_pie_chart.ipynb`
- Figure 5: `plots/fig5_assigned_utilities_histogram.ipynb`
To run unit tests with gpt-4o-mini, run the following command.

`python -m unittest -k fast -v`

To run unit tests using the exact LLMs used in the paper (for replication purposes), run the following command. This requires access to gpt-4-base and gpt-4-32k-0613.

`python -m unittest -k replication -v`

To run all unit tests, run the following command. This requires access to gpt-4-base and gpt-4-32k-0613.

`python -m unittest -v`
The quickest and cheapest way to rerun our experiments is to use a more modern LLM such as gpt-4o. The commands below run the exact experiments from our paper, except that gpt-4o is used in place of gpt-4-base and gpt-4-32k-0613.
Generate summaries of all users:

`python paper_replication/generate_summaries.py --model gpt-4o`

Generate a summary for a single user (for testing):

`python paper_replication/generate_summaries.py --model gpt-4o --num_agents 1`
To run the full experiment empirically validating the discriminative query (600 LLM calls):

`python paper_replication/validate_discriminative_query.py --model gpt-4o`

To empirically validate a single discriminative query (for testing):

`python paper_replication/validate_discriminative_query.py --model gpt-4o --num_samples 1`
To run the full experiment empirically evaluating the generative query:

`python paper_replication/gen_query_eval.py --model gpt-4o`

To evaluate a single ensemble round (for testing):

`python paper_replication/gen_query_eval.py --model gpt-4o --num_rounds 1`
To generate a slate for all users:

`python paper_replication/generate_slate.py --model gpt-4o`

To generate a slate for only 10 users (for testing):

`python paper_replication/generate_slate.py --model gpt-4o --num_agents 10`
To "exactly" (subject to inherent LLM stochasticity) reproduce our experiments, run the commands below. These require access to gpt-4-base and gpt-4-32k-0613, and will write logs to `data/chatbot_personalization/demo_data/`. To test on smaller sample sizes, use the `--num_agents` and `--num_samples` arguments (usage demonstrated above).

`python paper_replication/generate_summaries.py --model default`

`python paper_replication/validate_discriminative_query.py --model default`

`python paper_replication/gen_query_eval.py --model default`

`python paper_replication/generate_slate.py --model default`
This step uses Gurobi, but makes no LLM calls.

`python paper_replication/compute_matching.py`
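For intuition, the underlying assignment problem can be illustrated on a toy instance. The ratings below are made up for this example, and the brute-force search is purely illustrative; the actual `compute_matching.py` solves the problem with Gurobi on the real data.

```python
# Illustrative only: assign each user to one statement on the slate, with at
# most `capacity` users per statement, maximizing the total rating.
from itertools import product

ratings = {  # hypothetical ratings[user] = list of ratings, one per statement
    "u1": [3, 1],
    "u2": [2, 2],
    "u3": [0, 3],
}
capacity = 2  # at most 2 users assigned to each statement


def best_assignment(ratings, num_statements, capacity):
    users = list(ratings)
    best, best_val = None, float("-inf")
    # Enumerate every user -> statement mapping (fine for tiny instances).
    for assign in product(range(num_statements), repeat=len(users)):
        if any(assign.count(s) > capacity for s in range(num_statements)):
            continue  # violates the per-statement capacity
        val = sum(ratings[u][s] for u, s in zip(users, assign))
        if val > best_val:
            best, best_val = dict(zip(users, assign)), val
    return best, best_val


assignment, total = best_assignment(ratings, num_statements=2, capacity=capacity)
```

Here the optimum pairs each user with a statement they rate highly while respecting the capacity constraint; an ILP solver does the same at scale.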