diff --git a/README.md b/README.md index 3e7c134..f733691 100644 --- a/README.md +++ b/README.md @@ -83,10 +83,10 @@ You will first need to create a Cornell lab account to [ask for access to and do - Run `sbatch filter2.awk raw_ebd_data_file.txt `script to group observations by locality. The script on the repository was the one used to create the USA dataset, where we kept checklists of June, July, December and January to prepare the USA-summer and USA-winter dataset. For Kenya no month filter was applied. This will create one file per hotspots, the files will be organized based on the hotspot name in `split-XXX/` folders where each folder should only contain .csv files for hotspots `L*XXX`. - Run `data_processing/ebird/find_checklists.py` to filter the checklists you want to keep based on the ebd sampling event data. Typically this will only keep complete checklists and will create (for the USA dataset) 2 .csv files with the hotspots to keep corresponding to observations made in the summer and winter. -- to get the targets, run `data_processing/ebird/get_winter_targets.py` . Change the paths to the one where you created the `split-XXX/` folders. +- to get the targets, run `data_processing/ebird/get_targets.py` . Change the paths to the one where you created the `split-XXX/` folders. -- Download the **Sentinel-2 data** using `data_processing/satellite/download_rasters_from_planetary_computer.py`. To reproduce all experiments, you will have to run it twice, specifying the BANDS as ["B02", "B03", "B04", "B08"] for the BGRNIR reflectance data, and "visual" for the RGB visual component. A useful function is `process_row` which will extract the least cloudy image (with your specified bands) over the specified period of time with a maximum of 10\% cloud cover. For some hotspots, it is possible that you will be able to extract the visual component but incomplte items of no item will be found for the R,B,G,NIR reflectance data with the cloud cover criterion of 10\%. For those hotspots, you can replace `process_row` with `process_row_mosaic`function to allow the extracted image to be a mosaic of the images found over the specified period of time. +- Download the **Sentinel-2 data** using `data_processing/satellite/download_rasters_from_planetary_computer.py`. To reproduce all experiments, you will have to run it twice, specifying the BANDS as ["B02", "B03", "B04", "B08"] for the BGRNIR reflectance data, and "visual" for the RGB visual component. A useful function is `process_row` which will extract the least cloudy image (with your specified bands) over the specified period of time with a maximum of 10\% cloud cover. For some hotspots, it is possible that you will be able to extract the visual component but incomplte items of no item will be found for the R,B,G,NIR reflectance data with the cloud cover criterion of 10\%. For those hotspots, you can replace `process_row` with `process_row_mosaic`function to allow the extracted image to be a mosaic of the images found over the specified period of time. We set the cloud cover criterion to 20\% in this second phase. - You can further clean the dataset using the functions in `data_processing/ebird/clean_hotspots.py` and filter out: - hotspots that are not withing the bounds of a given shapefile geometry. This was used for the USA dataset to filter out hotspots that are in the middle of the ocean and not in the contiguous USA @@ -98,6 +98,7 @@ This will create one file per hotspots, the files will be organized based on the - Get **range maps** using `data_processing/ebird/get_range_maps.py`. This will call for shapefiles that you can obtain through [ebirdst](https://ebird.github.io/ebirdst/). You can then save a csv of all combined range maps using `/satbird/data_processing/utils/save_range_maps_csv_v2.py`. - - For the environmental data variables, download the rasters of the country of insterest from [WorldClim](https://www.worldclim.org/) (and [SoilGrids](https://soilgrids.org/) for the USA dataset). +For the USA dataset, we used the USA rasters that were available from the GeoLifeCLEF 2020 dataset, and for Kenya, we used the procedure described in data_processing/environmental/extract_bioclimatic_variables_kenya.ipynb - Use `data_processing/environmental/get_csv_env.py` to get point data for the environmental variables (rasters of size 1 centered on the hotspots). Note that you should not fill nan values until you have done the train-validation-test split so you can fill the values with the means on the training set data. These point data variables are used for the mean encounter rates, environmental and MOSAIKS baselines diff --git a/data_processing/environmental/bound_data.py b/data_processing/environmental/bound_data.py index 5bffea7..3a3beda 100644 --- a/data_processing/environmental/bound_data.py +++ b/data_processing/environmental/bound_data.py @@ -1,5 +1,11 @@ """ post-processing to environmental data after extraction +This assumes the rasters have been originally extracted to a folder named "environmental" +and filled with interpolation using fill_env_nans.py, and saved to a folder named "environmental_filled". +The remaining NaN values in the rasters are filled with the mean of each environmental variable over the training set and the filled rasters are saved to "environmental_temp" +The min and max ranges of the original (not filled)rasters are computed and used to bound the data, and saved in a folder names "environmental_temp_2" + +We subsequently renamed "environmental_temp_2" to "environmental" and this is what is in the released dataset """ import argparse import functools @@ -27,7 +33,7 @@ def bound_env_data(root_dir, mini, maxi): bound env data after the interpolation """ - rasters = glob.glob(root_dir + "/environmental/*.npy") # '/network/projects/_groups/ecosystem- + rasters = glob.glob(root_dir + "/environmental_temp/*.npy") # '/network/projects/_groups/ecosystem- for raster_file in tqdm(rasters): file_name = os.path.basename(raster_file) @@ -47,7 +53,7 @@ def fill_nan_values(root_dir, dataframe_name="all_summer_hotspots_withnan.csv"): """ fill values that still have nans after interpolation with mean point values """ - rasters = glob.glob(os.path.join(root_dir, "environmental_bounded_2", "*.npy")) + rasters = glob.glob(os.path.join(root_dir, "environmental_filled", "*.npy")) dst = os.path.join(root_dir, "environmental_temp") train_df = pd.read_csv(os.path.join(root_dir, dataframe_name)) @@ -83,7 +89,7 @@ def compute_min_max_ranges(root_dir): """ computes minimum and maximum of env data """ - rasters = glob.glob(os.path.join(root_dir, "environmental_bounded_2", "*.npy")) + rasters = glob.glob(os.path.join(root_dir, "environmental", "*.npy")) nan_count = 0 @@ -133,24 +139,12 @@ def remove_files(root_dir): if __name__ == '__main__': + root_dir = "/network/projects/ecosystem-embeddings/SatBird_data_v2/USA_summer" fill_nan_values(root_dir=root_dir) # move_missing_file(root_dir=root_dir) # remove_files(root_dir=root_dir) mini, maxi = compute_min_max_ranges(root_dir=root_dir) - - # mini = [-7.0958333, 1., 21.74928093, 0., 0.5, - # -24.20000076, 1., -11.58333302, -15.30000019, -1.58333325, - # -15.41666698, 54., 9., 0., 5.00542307, - # 24., 1., 2., 13., 0., - # 221., 2., 0., 0., 34., - # 0., 0.] - # maxi = [2.56291656e+01, 2.22333336e+01, 1.00000000e+02, 1.36807947e+03, - # 4.62000008e+01, 1.87999992e+01, 5.16999969e+01, 3.36666679e+01, - # 3.36666679e+01, 3.63833351e+01, 2.18833332e+01, 3.40200000e+03, - # 5.59000000e+02, 1.75000000e+02, 1.10832680e+02, 1.55500000e+03, - # 5.50000000e+02, 6.52000000e+02, 1.46100000e+03, 1.12467000e+05, - # 1.81500000e+03, 2.50000000e+02, 8.10000000e+01, 5.24000000e+02, - # 9.80000000e+01, 8.30000000e+01, 9.90000000e+01] - bound_env_data(root_dir=root_dir, mini=mini, maxi=maxi) + + diff --git a/data_processing/environmental/extract_bioclimatic_variables_kenya.ipynb b/data_processing/environmental/extract_bioclimatic_variables_kenya.ipynb new file mode 100644 index 0000000..b6aae39 --- /dev/null +++ b/data_processing/environmental/extract_bioclimatic_variables_kenya.ipynb @@ -0,0 +1,684 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6AjAYCylQ6i8", + "outputId": "1b64665d-12a3-410a-aeef-85b045be18d8" + }, + "source": [ + "This notebook shows how to get the WorldClim rasters for Kenya. This was a google colab, so adapt package installation accordingly to your machine. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nXeZCt6-SJm8" + }, + "outputs": [], + "source": [ + "!pip install rasterio >> /dev/null\n", + "!pip install geopandas >> /dev/null\n", + "!pip install soilgrids >> /dev/null\n", + "!pip install geopy >> /dev/null" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qfwPbWGNSIb2" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import geopandas as gpd\n", + "from rasterio.mask import mask\n", + "import matplotlib.pyplot as plt\n", + "import geopandas as gpd\n", + "import rasterio\n", + "from rasterio.mask import mask\n", + "import fiona\n", + "import os\n", + "import geopy\n", + "import numpy as np\n", + "\n", + "import matplotlib.pyplot as plt\n", + "from soilgrids import SoilGrids\n", + "\n", + "import warnings\n", + "from pathlib import Path\n", + "\n", + "from rasterio.warp import calculate_default_transform, reproject, Resampling\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Download Bioclim vars from worldclim\n", + "!wget https://biogeo.ucdavis.edu/data/worldclim/v2.1/base/wc2.1_10m_bio.zip\n", + "!unzip /content/wc2.1_10m_bio.zip -d worldclim >> /dev/null" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Q44x-Z_tsesO", + "outputId": "d1c6ee44-84c7-4577-cd94-72b70eca31b6" + }, + "outputs": [], + "source": [ + "target_crs = 'EPSG:4326'\n", + "\n", + "# Define the URL to the Natural Earth shapefile for Kenya\n", + "url = 'https://naciscdn.org/naturalearth/10m/cultural/ne_10m_admin_0_countries.zip'\n", + "\n", + "# Read the shapefile using geopandas\n", + "gdf = gpd.read_file(url)\n", + "\n", + "# Filter the data to extract the boundary for Kenya\n", + "kenya_boundary = gdf[gdf['ADMIN'] == 'Kenya']\n", + "\n", + "# Save the Kenya boundary shapefile to a file\n", + "kenya_boundary.to_file('kenya_boundary.shp')\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iSDrkGPYUrvq" + }, + "outputs": [], + "source": [ + "\n", + "bioclimatic_raster_names = [\n", + " \"bio_1\", \"bio_2\", \"bio_3\", \"bio_4\", \"bio_5\", \"bio_6\", \"bio_7\", \"bio_8\", \"bio_9\",\n", + " \"bio_10\", \"bio_11\", \"bio_12\", \"bio_13\", \"bio_14\", \"bio_15\", \"bio_16\", \"bio_17\",\n", + " \"bio_18\", \"bio_19\"\n", + "]\n", + "\n", + "\n", + "raster_names = bioclimatic_raster_names" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5uKp0dKV5A2A" + }, + "outputs": [], + "source": [ + "def get_all_soilgrids_kenya():\n", + " # List of soil variable IDs\n", + " soil_variables = [\"bdod\", \"cec\", \"cfvo\", \"clay\", \"nitrogen\", \"phh2o\", \"sand\", \"silt\", \"soc\", \"ocd\", \"wrb\"]\n", + "\n", + " # get data from SoilGrids for each soil variable\n", + " for variable in soil_variables:\n", + " # Instantiate SoilGrids object\n", + " soil_grids = SoilGrids()\n", + " data = soil_grids.get_coverage_data(service_id=variable, coverage_id=f\"{variable}_0-5cm_mean\",\n", + " west=3752140, south=-537527, east=4696291, north=685077,\n", + " crs='urn:ogc:def:crs:EPSG::152160', output=f\"soil/{variable}.tif\")\n", + "\n", + " # show metadata\n", + " for key, value in soil_grids.metadata.items():\n", + " print('{}: {}'.format(key, value))\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "er01O3RFtbs5" + }, + "outputs": [], + "source": [ + "# Download worldclim Rasters for Kenya\n", + "\n", + "# Specify the path to the downloaded worldwide raster files\n", + "worldclim_folder = \"/content/worldclim\"\n", + "clipped_folder = \"/content/clips\"\n", + "\n", + "# Specify the path to the shapefile containing the polygons\n", + "shapefile = \"/content/kenya_boundary.shp\"\n", + "\n", + "# Read the polygons from the shapefile using GeoPandas\n", + "gdf = gpd.read_file(shapefile)\n", + "\n", + "# Iterate over the bioclim rasters\n", + "for filename in os.listdir(worldclim_folder):\n", + " if filename.endswith(\".tif\"):\n", + " # Extract the bioclim number from the filename\n", + " bioclim_number = filename.split(\"_\")[-1].split(\".\")[0]\n", + "\n", + " # Read the worldwide raster file using Rasterio\n", + " dataset = rasterio.open(os.path.join(worldclim_folder, filename))\n", + "\n", + " # Perform the clipping operation for each polygon\n", + " for index, polygon in gdf.geometry.items():\n", + " # Perform the clipping operation\n", + " clipped_data, clipped_transform = mask(dataset, [polygon], crop=True)\n", + "\n", + " # Access the first dimension of the clipped data\n", + " clipped_data = clipped_data[0]\n", + "\n", + " # Update the metadata of the clipped data\n", + " clipped_meta = dataset.meta.copy()\n", + " clipped_meta.update({\n", + " \"count\": 1, # Update the count to 1 since we have a single band\n", + " \"height\": clipped_data.shape[0],\n", + " \"width\": clipped_data.shape[1],\n", + " \"transform\": clipped_transform\n", + " })\n", + "\n", + " # Create the parent directories for the output file if they don't exist\n", + " output_folder = os.path.join(clipped_folder, f\"bio_{bioclim_number}\")\n", + " os.makedirs(output_folder, exist_ok=True)\n", + "\n", + " # Save the clipped data as a new raster file\n", + " clipped_filename = f\"bio_{bioclim_number}_Kenya.tif\"\n", + " clipped_filepath = os.path.join(output_folder, clipped_filename)\n", + " with rasterio.open(clipped_filepath, \"w\", **clipped_meta) as dest:\n", + " dest.write(clipped_data, indexes=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2sORahoSa8YO" + }, + "outputs": [], + "source": [ + "class Raster(object):\n", + " \"\"\"\n", + " Handles the loading and the patch extraction for a single raster\n", + " \"\"\"\n", + "\n", + " def __init__(self, path, country, size=256, nan=np.nan, out_of_bounds=\"error\"):\n", + " \"\"\"Loads a GeoTIFF file containing an environmental raster\n", + "\n", + " Parameters\n", + " ----------\n", + " path : string / pathlib.Path\n", + " Path to the folder containing all the rasters.\n", + " country : string, either \"FR\" or \"USA\"\n", + " Which country to load raster from.\n", + " size : integer\n", + " Size in pixels (size x size) of the patch to extract around each location.\n", + " nan : float or None\n", + " Value to use to replace missing data in original rasters, if None, leaves default values.\n", + " out_of_bounds : string, either \"error\", \"warn\" or \"ignore\"\n", + " If \"error\", raises an exception if the location requested is out of bounds of the rasters. Set to \"warn\" to only produces a warning and to \"ignore\" to silently ignore it and return a patch filled with missing data.\n", + " \"\"\"\n", + " path = Path(path)\n", + " if not path.exists():\n", + " raise ValueError(\n", + " \"path should be the path to a raster, given non-existant path: {}\".format(\n", + " path\n", + " )\n", + " )\n", + "\n", + " self.path = path\n", + " self.name = path.name\n", + " self.size = size\n", + " self.out_of_bounds = out_of_bounds\n", + " self.nan = nan\n", + "\n", + " # Loading the raster\n", + " filename = path / \"{}_{}.tif\".format(self.name, country)\n", + " print(filename)\n", + " with rasterio.open(filename) as dataset:\n", + " self.dataset = dataset\n", + " self.raster = dataset.read(1).astype(np.float32)\n", + " mask = self.dataset.read_masks(1).astype(bool)\n", + "\n", + " # Changes nodata to user specified value\n", + " if nan:\n", + " self.raster[np.isnan(self.raster)] = nan\n", + " self.raster[~mask] = nan\n", + "\n", + " # setting the shape of the raster\n", + " self.shape = self.raster.shape\n", + "\n", + "\n", + " def _extract_patch(self, coordinates):\n", + " \"\"\"Extracts the patch around the given GPS coordinates.\n", + " Avoid using this method directly.\n", + "\n", + " Parameter\n", + " ----------\n", + " coordinates : tuple containing two floats\n", + " GPS coordinates (latitude, longitude)\n", + "\n", + " Returns\n", + " -------\n", + " patch : 2d array of floats, [size, size], or single float if size == 1\n", + " Extracted patch around the given coordinates.\n", + " \"\"\"\n", + " row, col = self.dataset.index(coordinates[1], coordinates[0])\n", + "\n", + " if self.size == 1:\n", + " # Environmental vector\n", + " patch = self.raster[row, col]\n", + " else:\n", + " # FIXME: can it happen that part of the patch is outside the raster? (and how about the mask of the dataset?)\n", + " half_size = int(self.size / 2)\n", + " # FIXME: only way to trigger an exception? (slices don't)\n", + " self.raster[row, col]\n", + " patch = self.raster[\n", + " row - half_size:row + half_size,\n", + " col - half_size:col + half_size\n", + " ]\n", + "\n", + " patch = patch[np.newaxis]\n", + " return patch\n", + "\n", + " def __len__(self):\n", + " \"\"\"Number of bands in the raster (should always be equal to 1).\n", + "\n", + " Returns\n", + " -------\n", + " n_bands : integer\n", + " Number of bands in the raster\n", + " \"\"\"\n", + " return self.dataset.count\n", + "\n", + " def __getitem__(self, coordinates):\n", + " \"\"\"Extracts the patch around the given GPS coordinates.\n", + "\n", + " Parameters\n", + " ----------\n", + " coordinates : tuple containing two floats\n", + " GPS coordinates (latitude, longitude)\n", + "\n", + " Returns\n", + " -------\n", + " patch : 2d array of floats, [size, size], or single float if size == 1\n", + " Extracted patch around the given coordinates.\n", + " \"\"\"\n", + " try:\n", + " return self._extract_patch(coordinates)\n", + " except IndexError as e:\n", + " if self.out_of_bounds == \"error\":\n", + " raise e\n", + " else:\n", + " if self.out_of_bounds == \"warn\":\n", + " warnings.warn(\"GPS coordinates ({}, {}) out of bounds\".format(*coordinates))\n", + "\n", + " if self.size == 1:\n", + " return np.array([self.nan], dtype=np.float32)\n", + " else:\n", + " patch = np.empty((1, self.size, self.size), dtype=np.float32)\n", + " patch.fill(self.nan)\n", + " return patch\n", + "\n", + " def __repr__(self):\n", + " return str(self)\n", + "\n", + " def __str__(self):\n", + " return \"name: \" + self.name + \"\\n\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "22JRpmU-bML6" + }, + "outputs": [], + "source": [ + "class PatchExtractor(object):\n", + " \"\"\"\n", + " Handles the loading and extraction of an environmental tensor from multiple rasters given GPS coordinates.\n", + " \"\"\"\n", + "\n", + " def __init__(self, root_path, country, size=256):\n", + " \"\"\"Constructor\n", + "\n", + " Parameters\n", + " ----------\n", + " root_path : string or pathlib.Path\n", + " Path to the folder containing all the rasters.\n", + " size : integer\n", + " Size in pixels (size x size) of the patches to extract around each location.\n", + " \"\"\"\n", + " self.root_path = Path(root_path)\n", + " print(\"root_path:\",self.root_path)\n", + " if not self.root_path.exists():\n", + " raise ValueError(\n", + " \"root_path should be the directory containing the rasters, given a non-existant path: {}\".format(\n", + " root_path\n", + " )\n", + " )\n", + "\n", + " self.size = size\n", + "\n", + " #self.rasters_fr = []\n", + " self.rasters_us = []\n", + " self.country = \"Kenya\"\n", + "\n", + " def add_all_rasters(self, **kwargs):\n", + " \"\"\"Add all variables (rasters) available\n", + "\n", + " Parameters\n", + " ----------\n", + " kwargs : dict\n", + " Updates the default arguments passed to Raster (nan, out_of_bounds, etc.)\n", + " \"\"\"\n", + " for raster_name in raster_names:\n", + " self.append(raster_name, **kwargs)\n", + "\n", + " def add_all_bioclimatic_rasters(self, **kwargs):\n", + " \"\"\"Add all bioclimatic variables (rasters) available\n", + "\n", + " Parameters\n", + " ----------\n", + " kwargs : dict\n", + " Updates the default arguments passed to Raster (nan, out_of_bounds, etc.)\n", + " \"\"\"\n", + " for raster_name in bioclimatic_raster_names:\n", + " self.append(raster_name, **kwargs)\n", + "\n", + " def add_all_pedologic_rasters(self, **kwargs):\n", + " \"\"\"Add all pedologic variables (rasters) available\n", + "\n", + " Parameters\n", + " ----------\n", + " kwargs : dict\n", + " Updates the default arguments passed to Raster (nan, out_of_bounds, etc.)\n", + " \"\"\"\n", + " for raster_name in pedologic_raster_names:\n", + " self.append(raster_name, **kwargs)\n", + "\n", + " def append(self, raster_name, **kwargs):\n", + " \"\"\"Loads and appends a single raster to the rasters already loaded.\n", + "\n", + " Can be useful to load only a subset of rasters or to pass configurations specific to each raster.\n", + "\n", + " Parameters\n", + " ----------\n", + " raster_name : string\n", + " Name of the raster to load, should be a subfolder of root_path.\n", + " kwargs : dict\n", + " Updates the default arguments passed to Raster (nan, out_of_bounds, etc.)\n", + " \"\"\"\n", + "\n", + " r_us = Raster(self.root_path / raster_name, self.country, size=self.size, **kwargs)\n", + " #r_fr = Raster(self.root_path / raster_name, \"FR\", size=self.size, **kwargs)\n", + "\n", + " self.rasters_us.append(r_us)\n", + " #self.rasters_fr.append(r_fr)\n", + "\n", + " def clean(self):\n", + " \"\"\"Remove all rasters from the extractor.\n", + " \"\"\"\n", + " #self.rasters_fr = []\n", + " self.rasters_us = []\n", + "\n", + " def _get_rasters_list(self, coordinates):\n", + " \"\"\"Returns the list of rasters from the appropriate country\n", + "\n", + " Parameters\n", + " ----------\n", + " coordinates : tuple containing two floats\n", + " GPS coordinates (latitude, longitude)\n", + "\n", + " Returns\n", + " -------\n", + " rasters : list of Raster objects\n", + " All previously loaded rasters.\n", + " \"\"\"\n", + " #if coordinates[1] > -10.0:\n", + " # return self.rasters_fr\n", + " return(self.rasters_us)\n", + "\n", + " def __repr__(self):\n", + " return str(self)\n", + "\n", + " def __str__(self):\n", + " result = \"\"\n", + "\n", + " for rasters in [self.rasters_us]:\n", + " for raster in rasters:\n", + " result += \"-\" * 50 + \"\\n\"\n", + " result += str(raster)\n", + "\n", + " return result\n", + "\n", + " def __getitem__(self, coordinates):\n", + " \"\"\"Extracts the patches around the given GPS coordinates for all the previously loaded rasters.\n", + "\n", + " Parameters\n", + " ----------\n", + " coordinates : tuple containing two floats\n", + " GPS coordinates (latitude, longitude)\n", + "\n", + " Returns\n", + " -------\n", + " patch : 3d array of floats, [n_rasters, size, size], or 1d array of floats, [n_rasters,], if size == 1\n", + " Extracted patches around the given coordinates.\n", + " \"\"\"\n", + " rasters = self._get_rasters_list(coordinates)\n", + " return np.concatenate([r[coordinates] for r in rasters])\n", + "\n", + " def __len__(self):\n", + " \"\"\"Number of variables/rasters loaded.\n", + "\n", + " Returns\n", + " -------\n", + " n_rasters : integer\n", + " Number of loaded rasters\n", + " \"\"\"\n", + " return len(self.rasters_us)\n", + "\n", + " def plot(self, coordinates, return_fig=False, n_cols=5, fig=None, resolution=1.0):\n", + " \"\"\"Plot an environmental tensor (only works if size > 1)\n", + "\n", + " Parameters\n", + " ----------\n", + " coordinates : tuple containing two floats\n", + " GPS coordinates (latitude, longitude)\n", + " return_fig : boolean\n", + " If True, returns the created plt.Figure object\n", + " n_cols : integer\n", + " Number of columns to use\n", + " fig : plt.Figure or None\n", + " If not None, use the given plt.Figure object instead of creating a new one\n", + " resolution : float\n", + " Resolution of the created figure\n", + "\n", + " Returns\n", + " -------\n", + " fig : plt.Figure\n", + " If return_fig is True, the used plt.Figure object\n", + " \"\"\"\n", + " if self.size <= 1:\n", + " raise ValueError(\"Plot works only for tensors: size must be > 1\")\n", + "\n", + " rasters = self._get_rasters_list(coordinates)\n", + "\n", + " # Metadata are the name of the variables and the bounding boxes in latitude-longitude coordinates\n", + " metadata = [\n", + " (\n", + " raster.name,\n", + " [\n", + " coordinates[1] - (self.size // 2) * raster.dataset.res[0],\n", + " coordinates[1] + (self.size // 2) * raster.dataset.res[0],\n", + " coordinates[0] - (self.size // 2) * raster.dataset.res[1],\n", + " coordinates[0] + (self.size // 2) * raster.dataset.res[1],\n", + " ],\n", + " )\n", + " for raster in rasters\n", + " ]\n", + "\n", + " # Extracts the patch\n", + " patch = self[coordinates]\n", + "\n", + " # Computing number of rows and columns\n", + " n_rows = (patch.shape[0] + (n_cols - 1)) // n_cols\n", + "\n", + " if fig is None:\n", + " fig = plt.figure(figsize=(n_cols * 6.4 * resolution, n_rows * 4.8 * resolution))\n", + "\n", + " axes = fig.subplots(n_rows, n_cols)\n", + " axes = axes.ravel()\n", + "\n", + " for i, (ax, k) in enumerate(zip(axes, metadata)):\n", + " p = np.squeeze(patch[i])\n", + " im = ax.imshow(p, extent=k[1], aspect=\"equal\", interpolation=\"none\")\n", + "\n", + " ax.set_title(k[0], fontsize=20)\n", + " fig.colorbar(im, ax=ax)\n", + "\n", + " for ax in axes[len(metadata):]:\n", + " ax.axis(\"off\")\n", + "\n", + " fig.tight_layout()\n", + "\n", + " if return_fig:\n", + " return fig" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "HFmbWxdTbW3H", + "outputId": "dab62b43-4666-4cca-e97a-2e3af3f04577" + }, + "outputs": [], + "source": [ + "DATA_PATH = '/content/clips'\n", + "extractor = PatchExtractor(DATA_PATH, country=\"Kenya\", size = 200)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "MMSwaWHvbtwK", + "outputId": "dcec5f7f-804d-49af-837f-d5e0c027d9d5" + }, + "outputs": [], + "source": [ + "extractor.add_all_rasters()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 432 + }, + "id": "GHFoEqjqIZDT", + "outputId": "7e805ada-de7b-483a-e185-11258a21dd95" + }, + "outputs": [], + "source": [ + "with rasterio.open(\"/content/clips/bio_16/bio_16_Kenya.tif\") as src:\n", + " plt.imshow(src.read(1))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "AFLYqNYVIjzT", + "outputId": "80834ac8-60f3-45da-e2c1-8a85caacc920" + }, + "outputs": [], + "source": [ + "rasterio.open(\"/content/clips/bio_19/bio_19_Kenya.tif\").shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Klmm7qcVbwPM", + "outputId": "491ff29a-77bb-4628-d36d-28ae45601be9" + }, + "outputs": [], + "source": [ + "#visualize a patch\n", + "\n", + "lat, long = -1.3920, 37.9633\n", + "\n", + "patch = extractor[lat, long]\n", + "\n", + "print(\"Patch shape: {}\".format(patch.shape))\n", + "print(\"Data type: {}\".format(patch.dtype))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "gxo-Tz5MnERN", + "outputId": "5e8389c4-ebb6-45b7-b3ea-34cecfd39d85" + }, + "outputs": [], + "source": [ + "fig = plt.figure(figsize=(14, 16))\n", + "extractor.plot((lat, long), fig=fig)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/data_processing/satellite/download_rasters_from_planetary_computer.py b/data_processing/satellite/download_rasters_from_planetary_computer.py index 6fe016c..8011f92 100644 --- a/data_processing/satellite/download_rasters_from_planetary_computer.py +++ b/data_processing/satellite/download_rasters_from_planetary_computer.py @@ -12,8 +12,8 @@ import planetary_computer import argparse +#Get your API key from Planetary computer, and replace "api-key" accordingly planetary_computer.settings.set_subscription_key("api-key") -# This should be a secret!! ask me for mine # Incase Planetary computer sleeps off, # Define the number of retries and the wait interval between retries @@ -27,10 +27,11 @@ # modifier=planetary_computer.sign_inplace, --> this is depricated ?? ) -# Define the bands we are interested in --> r,g,b,nir and true color image or "visual" +# Define the bands we are interested in --> ["B02", "B03", "B04", "B08"] for r,g,b,nir or "visual" for true color image BANDS = ["B02", "B03", "B04", "B08"] -time_of_interest = "2022-06-01/2022-07-31" #this is for summer, if winter use "2022-12-01/2023-01-31" +# "2022-06-01/2022-07-31" for SatBird-USA-summer, "2022-12-01/2023-01-31" for SatBird-USA-winter, "2022-01-01/2023-01-01" for SatBird-Kenya +time_of_interest = "2022-06-01/2022-07-31" def convert_polygon(polygon_str): @@ -179,11 +180,13 @@ def process_row_mosaic(row, save_dir, cc=20, no_item_found_txt="no_item_found.t def main(): + #The following configuration was used for SatBird-USA, change paths accordingly. For Kenya there are no seasons considered. # Specify the directory to save the rasters WINTER_SAVE_DIRECTORY = "/network/projects/ecosystem-embeddings/ebird_new/rasters_new/winter_rasters/" SUMMER_SAVE_DIRECTORY = "/network/projects/_groups/ecosystem-embeddings/ebird_new/rasters_new/summer_rasters" + #polygons are obtained with the bounding_box_from_center function in data_processing/satellite/utils.py winter_polygons = "/network/projects/ecosystem-embeddings/ebird_new/polygons_winter.csv" summer_polygons = "/network/projects/ecosystem-embeddings/ebird_new/polygons_summer.csv"