This script selects a diverse group of individuals from a larger dataset, balancing representation across dimensions such as gender, age group, education, residence, disability, and interest. It can also bias the selection to over-represent or under-represent certain groups based on predefined criteria.
The primary aim is to build a group whose composition is balanced and diverse. This is particularly useful where equitable representation is critical, such as in surveys, research studies, or team formation.
To run this script, you will need Python installed on your machine, along with the following libraries:
- Pandas
- NumPy
You can install these libraries using pip:
$ pip install pandas numpy
$ python generate_csv.py
To execute the script, follow these steps:
- Prepare your dataset according to the specified data structure and save it as a CSV file.
- Run the script with a Python interpreter. The first argument is the path to the CSV file; the second is the target group size.
$ python main.py example_people_data.csv 60
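As an illustration of the two positional arguments, here is a minimal sketch of how such a command line could be parsed with `argparse`. The function name `parse_args` and the argument names `csv_path` and `group_size` are assumptions made for this example; main.py may read its arguments differently (for instance, directly from `sys.argv`).

```python
import argparse


def parse_args(argv=None):
    """Parse the CSV path and target group size (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Select a diverse group from a CSV dataset."
    )
    # First positional argument: path to the input CSV file.
    parser.add_argument("csv_path", help="path to the input CSV file")
    # Second positional argument: how many people to select.
    parser.add_argument("group_size", type=int, help="target group size")
    return parser.parse_args(argv)


# Equivalent to: python main.py example_people_data.csv 60
args = parse_args(["example_people_data.csv", "60"])
print(args.csv_path, args.group_size)
```

Declaring the group size with `type=int` makes a non-numeric value fail early with a usage message instead of partway through the selection.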
The script supports biasing options that over-represent or under-represent specific groups within the dataset, giving you finer control over the composition of the selected group.
Biasing is applied through predefined criteria within the script. These criteria can be adjusted by modifying the bias_weights calculation, which assigns different weights to individuals based on attributes such as 'Disability', 'Age Group', or any other column in the dataset.
For example, to over-represent individuals with disabilities, a higher weight is assigned to records where Disability == 'Yes'. Conversely, to under-represent middle-aged males, a lower weight can be assigned to records matching this criterion.
You can customize the bias criteria by editing the select_diverse_group_with_bias function in main.py.
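The weighting idea described above can be sketched as follows. This is not the code in main.py: the helper name `compute_bias_weights`, the multipliers 3.0 and 0.5, and the use of '40-49' as the "middle-aged" label are all assumptions chosen for illustration. The sketch uses weighted sampling without replacement via pandas `DataFrame.sample`, which may differ from how the script itself applies its weights.

```python
import numpy as np
import pandas as pd


def compute_bias_weights(df: pd.DataFrame) -> pd.Series:
    """Return one sampling weight per row (hypothetical weights)."""
    # Over-represent individuals with disabilities (assumed factor: 3.0).
    disability_factor = np.where(df["Disability"] == "Yes", 3.0, 1.0)
    # Under-represent middle-aged males (assumed label '40-49', factor 0.5).
    middle_aged_male = (df["Gender"] == "Male") & (df["Age Group"] == "40-49")
    male_factor = np.where(middle_aged_male, 0.5, 1.0)
    return pd.Series(disability_factor * male_factor, index=df.index)


# Tiny made-up dataset, only to show the mechanics.
people = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "Age Group": ["18-29", "40-49", "30-39", "40-49", "60+", "18-29"],
        "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
        "Disability": ["No", "No", "Yes", "No", "Yes", "No"],
    }
)

weights = compute_bias_weights(people)
# Rows with larger weights are more likely to be drawn; pandas normalizes
# the weights so they do not need to sum to 1.
selected = people.sample(n=3, weights=weights, random_state=0)
print(selected)
```

A weight above 1 raises a record's chance of selection relative to the baseline, and a weight below 1 lowers it; no record is excluded outright unless its weight is 0.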
Your dataset should be a CSV file with the following columns:
- ID: An incremental integer identifying each record.
- Age Group: Categorized age groups, e.g., '18-29', '30-39', etc.
- Education: Level of education, e.g., 'Elementary', 'Secondary', 'Higher'.
- Gender: Gender identification, e.g., 'Male', 'Female'.
- Residence: Type of residence, e.g., 'Capital', 'Non-Capital'.
- Disability: Disability status, e.g., 'Yes', 'No'.
- Interest: Level of interest, e.g., 'No', 'Some', 'Yes'.
Example CSV dataset structure:
ID,Age Group,Education,Gender,Residence,Disability,Interest
1,18-29,Higher,Female,Capital,No,Yes
2,30-39,Secondary,Male,Non-Capital,Yes,Some
...
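Before running the selection, it can help to confirm that a file has the expected header. The following is a minimal sketch using only the standard library; the helper `missing_columns` and the constant `REQUIRED_COLUMNS` are hypothetical names for this example and are not part of main.py.

```python
import csv
import io

# Column names taken from the dataset structure described above.
REQUIRED_COLUMNS = [
    "ID", "Age Group", "Education", "Gender",
    "Residence", "Disability", "Interest",
]


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


# In-memory stand-in for a real file opened with open(path, newline="").
sample = io.StringIO(
    "ID,Age Group,Education,Gender,Residence,Disability,Interest\n"
    "1,18-29,Higher,Female,Capital,No,Yes\n"
)
print(missing_columns(sample))
```

An empty list means every expected column is present; any names returned point to a header that is misspelled or missing.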
This script was made with the help of ChatGPT v4. The related conversations are linked below:
Simple diverse selection:
https://chat.openai.com/share/1f1c89a2-59ae-44a6-979c-ce720b279229
Extended dimensions and bias:
https://chat.openai.com/share/437498d5-7535-4788-87ca-c0dc1e35a2c2
Contributions are welcome! If you have suggestions or enhancements, please open an issue or submit a pull request.
MIT License - Feel free to use, modify, and distribute this script as you see fit.