Skip to content

Feat: parallelize#3

Open
Murray2015 wants to merge 3 commits into
mainfrom
feat-parallelize
Open

Feat: parallelize#3
Murray2015 wants to merge 3 commits into
mainfrom
feat-parallelize

Conversation

@Murray2015

Copy link
Copy Markdown
Collaborator

Hey Evangelos. I've had a play with your code and got parallelization working. This PR does the following:

  1. Adds a requirements.txt so the repo can be installed more easily
  2. Removes all hard-coded paths from the codebase and moves them up several levels to main (because otherwise, only your laptop can run this code).
  3. Adds a command line interface in main, so that main can be called by other users. This is optional and defaults to all your hardcoded paths, which now live in main. However, if you want to call it via the command line, it can be called like so: python main.py --geojson_dir geojson_files --log_dir logs --output_path output/. This is different to the makefile approach, which doesn't play nicely with the command line for passing file paths.
  4. Parallelizes the code. This is done at the download and processing level rather than at the country level. i.e., it will sequentially run each country, but all the downloaders for that country will then run in parallel. This will make it faster to run for all countries, but it will also make it faster to run if just a single country is being run.
  5. Note that I haven't tested this extensively. I spot lots of warning and error messages in the output. Lots were pre-existing before I touched the code, but there may be bugs I've introduced hidden in there now. You'll probably have more context than I do to be able to debug them if so. For context, removing the paths, and refactoring to only read the geojson once were the key steps to getting parallelization working - the first errors I got were around resource contention as multiple downloaders tried to read the geojson file at the same time. However, this might be a source of bugs if any of the downloaders mutate the data frame (I don't think any do, but I haven't read every line of code in the codebase)
  6. I'm now off on holiday for a week, so will pick up any replies to this when I'm back!
  7. p.s. I didn't actually ask if it was ok to start messing around in your code, which is very bad etiquette - hopefully it's ok, and don't feel any pressure to merge the code in if you hate my approach!

Murray Hoggett added 3 commits May 3, 2024 20:21
…eers can work on the codebase. Kept defaults in main so it should keep working for Evangelos locally without changes
…ore I added this code. But I spot some complaining about mutating data in a dataframe copy, so possible one of the downloaders is mutating the geodataframe. Might be fine or might not be, needs investigation.
@Murray2015 Murray2015 requested a review from ediakatos May 4, 2024 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant