Merged
14 changes: 14 additions & 0 deletions .github/scripts/clean_kernelspecs.py
@@ -0,0 +1,14 @@
+ import nbformat
+ import glob
+
+ for nb_path in glob.glob("**/*.ipynb", recursive=True):
+     with open(nb_path) as f:
+         nb = nbformat.read(f, as_version=4)
+     nb['metadata']['kernelspec'] = {
+         "name": "python3",
+         "display_name": "Python 3",
+         "language": "python"
+     }
+     with open(nb_path, 'w') as f:
+         nbformat.write(nb, f)
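
Review note: the script above rewrites every notebook unconditionally. As a rough illustration only, the sketch below lists which notebooks actually carry a different kernelspec, without modifying any file. It is not part of this PR. The function name `notebooks_needing_cleanup` and the checkpoint filter are assumptions made for the example, and it reads the notebooks with the standard-library `json` module instead of `nbformat` so that it runs without extra dependencies.

```python
import json
from pathlib import Path

# Kernelspec that clean_kernelspecs.py writes into every notebook
# (values copied from the script in this PR).
EXPECTED_KERNELSPEC = {
    "name": "python3",
    "display_name": "Python 3",
    "language": "python",
}


def notebooks_needing_cleanup(root="."):
    """Return paths of notebooks under `root` whose kernelspec differs.

    Hypothetical helper for illustration; it only reads files.
    """
    stale = []
    for nb_path in sorted(Path(root).rglob("*.ipynb")):
        # Path.rglob also descends into dot-directories, so Jupyter's
        # autosave copies are filtered out explicitly (an assumption of
        # this sketch, not something the PR does).
        if ".ipynb_checkpoints" in nb_path.parts:
            continue
        with open(nb_path, encoding="utf-8") as f:
            nb = json.load(f)
        if nb.get("metadata", {}).get("kernelspec") != EXPECTED_KERNELSPEC:
            stale.append(str(nb_path))
    return stale


if __name__ == "__main__":
    for path in notebooks_needing_cleanup("."):
        print("kernelspec differs:", path)
```

Run from the repository root before the cleanup step, this would be expected to report notebooks such as `finalbook_part1.ipynb`, whose kernelspec name in this PR is `conda-env-big_data_environment-py`.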

3 changes: 3 additions & 0 deletions .github/workflows/build_jb.yml
@@ -24,5 +24,8 @@ jobs:
    run: |
      pip install -r practicals_jn_book/requirements.txt

+ - name: Clean notebook kernelspecs
+   run: python .github/scripts/clean_kernelspecs.py
+
  - name: Build documentation (only on macos-latest)
    run: jupyter-book build practicals_jn_book --all -W
2 changes: 1 addition & 1 deletion big_data_environment.yml
@@ -3,7 +3,7 @@ channels:
    - conda-forge
  dependencies:
    - python>=3.9
-   - pandas>=2.2
+   - pandas>=3.0.1
    - numpy>=2.2
    - openpyxl>=3.1
    - pyarrow>=19.0
3 changes: 2 additions & 1 deletion practicals_jn_book/requirements.txt
@@ -1,8 +1,9 @@
- pandas>=2.2
+ pandas>=3.0.1
numpy>=2.2
scikit-learn==1.6.1
seaborn==0.13.2
scipy>=1.15
matplotlib==3.10.0
jupyter-book==1.0
pyarrow>=19.0
+ nbformat
124 changes: 20 additions & 104 deletions practicals_jn_book/week_1/finalbook_part1.ipynb
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 11,
"metadata": {
"tags": [
"hide-input"
@@ -13,9 +13,9 @@
"name": "stdout",
"output_type": "stream",
"text": [
"My Python version is: 3.11.1\n",
"My Numpy version is: 1.26.4\n",
"My Pandas version is: 2.2.2\n"
"My Python version is: 3.13.12\n",
"My Numpy version is: 2.4.2\n",
"My Pandas version is: 3.0.1\n"
]
}
],
@@ -86,8 +86,9 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 12,
"metadata": {
"scrolled": true,
"tags": [
"hide-input"
]
@@ -171,7 +172,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 13,
"metadata": {
"scrolled": true,
"tags": [
@@ -198,7 +199,7 @@
"4 Dennis Cornelius\n",
"5 Brett Gibbs\n",
"6 John Haack\n",
"Name: Name, dtype: object \n",
"Name: Name, dtype: str \n",
"\n",
" Name\n",
"0 Andrzej Stanaszek\n",
@@ -248,7 +249,7 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 14,
"metadata": {
"scrolled": true,
"tags": [
@@ -294,91 +295,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Copy\n",
"\n",
"We already briefly mentioned that slicing in Pandas (and most other Python objects) returns a view not a copy. This might be a little counter intuitive if you're not familiar with general purpose programming languages (MATLAB is not). We do not want to mess with our lifter_df, so we will make a new dataframe for this assignment with three columns, ranging from 1-10:\n",
"```python\n",
"df1 = pd.DataFrame({\"X\": list(range(10)), \"Y\": list(range(10)), \"Z\": list(range(10))})\n",
"```\n",
"\n",
"### Assignment 3\n",
"\n",
"- **Make a slice of the first five rows using .iloc or .loc and assign it to a new variable.**\n",
"\n",
"- **Select all samples with .iloc or .loc and set all samples in the new variable to 0 and print the DataFrame.**\n",
"\n",
"- **Now print the original DataFrame. What do you notice?**\n",
"\n",
"You should get something like this:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"scrolled": true,
"tags": [
"hide-input"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sliced df:\n",
" X Y Z\n",
"0 0 0 0\n",
"1 0 0 0\n",
"2 0 0 0\n",
"3 0 0 0\n",
"4 0 0 0 \n",
"\n",
"Original df:\n",
" X Y Z\n",
"0 0 0 0\n",
"1 0 0 0\n",
"2 0 0 0\n",
"3 0 0 0\n",
"4 0 0 0\n",
"5 5 5 5\n",
"6 6 6 6\n",
"7 7 7 7\n",
"8 8 8 8\n",
"9 9 9 9\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/d6/sgv22vx10fb8mj7yrljzpkch0000gn/T/ipykernel_9819/483482325.py:3: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" df2.loc[:] = 0\n"
]
}
],
"source": [
"df1 = pd.DataFrame({\"X\": list(range(10)), \"Y\": list(range(10)), \"Z\": list(range(10))})\n",
"df2 = df1.iloc[:5, :]\n",
"df2.loc[:] = 0\n",
"print(\"Sliced df:\\n\", df2, \"\\n\") # \\n gives you an empty line after your print statement for readability\n",
"print(\"Original df:\\n\", df1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{warning}\n",
"Oh no, you've not only altered df1, but also df2. This is because the slicing operation gave you a view into the DataFrame, but not a copy of the data. Again, this saves a lot of memory, but it can mess up your data! Luckily Pandas gives us a warning when we try to do this!\n",
"```\n",
"\n",
"To prevent this problem you can use the ``.copy()`` method which returns you a copy and not a view.\n",
"\n",
"Note: whether Pandas returns a copy or a view is actually a pretty delicate topic, but just assume you get a view and use ``.copy()`` when you plan on changing the contents of the DataFrame.\n",
"\n",
"## Accessors\n",
"\n",
@@ -398,7 +314,7 @@
"\n",
"Cleaning up strings is a common operation in data science. Always check (your column names) for unwanted whitespace!\n",
"\n",
"### Assignment 4\n",
"### Assignment 3\n",
"\n",
"Consider an entry like this: {\"Name\": \"ALEXEY Kuzmin\", \"Age\": 34, \"Totalkg\": 527.25}. We can add it to the dataframe using the [concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) method: \n",
"\n",
@@ -423,7 +339,7 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 15,
"metadata": {
"tags": [
"hide-input"
@@ -521,7 +437,7 @@
"7 Alexey Kuzmin 34.0 527.25"
]
},
"execution_count": 26,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
@@ -538,7 +454,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Assignment 5\n",
"### Assignment 4\n",
"\n",
"````{margin}\n",
"```{admonition} Tip\n",
@@ -556,7 +472,7 @@
},
{
"cell_type": "code",
"execution_count": 27,
"execution_count": 16,
"metadata": {
"tags": [
"hide-input"
@@ -672,7 +588,7 @@
"7 Alexey Kuzmin Alexey Kuzmin 34.0 527.25"
]
},
"execution_count": 27,
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
@@ -713,7 +629,7 @@
"df_lifters = df_lifters.dropna().sort_values(by=\"Totalkg\", ascending=False)\n",
"```\n",
"\n",
"### Assignment 6\n",
"### Assignment 5\n",
"\n",
"- **First sort all the data by Totalkg score, make sure the Totalkg is on top of your DataFrame. Print out the DataFrame. What do you notice?**\n",
"\n",
@@ -728,7 +644,7 @@
},
{
"cell_type": "code",
"execution_count": 28,
"execution_count": 17,
"metadata": {
"tags": [
"hide-input"
@@ -798,9 +714,9 @@
"celltoolbar": "Tags",
"hide_input": false,
"kernelspec": {
"display_name": "big_data_environment",
"display_name": "Python [conda env:big_data_environment]",
"language": "python",
"name": "python3"
"name": "conda-env-big_data_environment-py"
},
"language_info": {
"codemirror_mode": {
@@ -812,7 +728,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.2"
"version": "3.13.12"
},
"toc": {
"base_numbering": 1,