Ingester
The Ingester program starts in the module main_ingester.py. The main program is invoked with a call to Ingester.run(). This routine loops over a range of years, downloading and importing a file for each year. The Ingester uses the configuration parameter start year as the beginning of the range; the end of the range is always the current year, taken from the system clock.
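The year range can be sketched as follows. This is a minimal illustration, not the actual run() signature; years_to_process is a hypothetical helper name.

```python
from datetime import date

def years_to_process(start_year, current_year=None):
    """Inclusive range of years the Ingester loops over; the end of
    the range is always the current year from the system clock."""
    if current_year is None:
        current_year = date.today().year
    return list(range(start_year, current_year + 1))
```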
To avoid downloading data files for years that have already been fetched, the Ingester keeps the downloaded zip files, which are named after their year. Before downloading, the store\zip folder is searched. If no file exists for, say, 2012, that year is downloaded. If a zip file for 2012 does exist, the Ingester checks the file's modification timestamp. If the file was modified after 2012, it must contain a complete year's worth of data, so it is skipped. If the system clock's year is still 2012, the 2012 file is downloaded again, since more data may have been published.
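That decision can be expressed roughly as below; needs_download is a hypothetical helper name used only for illustration.

```python
import os
from datetime import datetime

def needs_download(zip_path, year):
    """Decide whether the zip for `year` must be (re)downloaded.

    No file on disk -> download it. Modified after `year` -> the file
    holds a complete year of data, skip it. Modified during `year`
    (including when the clock still reads that year) -> download
    again, because more data may have been published since.
    """
    if not os.path.exists(zip_path):
        return True
    modified_year = datetime.fromtimestamp(os.path.getmtime(zip_path)).year
    return modified_year <= year
```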
Files are downloaded and unzipped using the class Ingester.Crawler. The main routine calls Crawler.download_COT_file(), which in turn calls Crawler.extract_zip(), which then calls Crawler.import_file().
Crawler.extract_zip() extracts the zip contents to a temporary folder, store\zip\tmp.
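A minimal sketch of the extraction step, using the standard zipfile module (the return value and exact behavior of the real Crawler.extract_zip() are assumptions):

```python
import os
import zipfile

def extract_zip(zip_path, tmp_dir):
    """Unpack the downloaded zip into the temporary folder and return
    the paths of the extracted files."""
    os.makedirs(tmp_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(tmp_dir)
        return [os.path.join(tmp_dir, name) for name in zf.namelist()]
```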
Crawler.import_file() reads the CSV file and imports it into a local array of rows, where each row is a dictionary. Dictionaries are used for the rows so that each value is named by its column header. The array of data is returned back to the main Ingester.run().
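The standard library's csv.DictReader does exactly this header-keyed mapping, so the step can be sketched as follows (the real method may differ in details):

```python
import csv

def import_file(csv_path):
    """Read the CSV into a list of row dictionaries; csv.DictReader
    keys each value by its column header."""
    with open(csv_path, newline="") as f:
        return [dict(row) for row in csv.DictReader(f)]
```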
Next the data is imported into an SQLite database. For this we use the class Ingester.Importer and call Importer.import_data(). The Importer creates a new database (*.db) file in the store folder if one does not already exist. When a database is first created, its tables are also created.
Two tables are used for storing the data. Besides the main future data table, a second table, market_names, is used solely for storing the market names. The future table references the market names relationally by id number. This saves disk space, since an integer is smaller than a long text value that would otherwise be repeated on every row.
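An illustrative schema with this two-table layout is shown below. The table and column names here are assumptions for the sketch, not the project's actual schema.

```python
import sqlite3

# Illustrative schema; the real table and column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS market_names (
    id           INTEGER PRIMARY KEY,
    name         TEXT UNIQUE NOT NULL,  -- normalized lookup key
    display_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS future (
    id              INTEGER PRIMARY KEY,
    market_name_id  INTEGER NOT NULL REFERENCES market_names(id),
    report_date     DATE NOT NULL,
    long_positions  INTEGER,
    short_positions INTEGER
);
"""

def open_database(db_path):
    """Open the .db file, creating it and its tables on first use."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn
```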
Importer.filter_names() is used to create a map from each market name to its row id in the market_names table; that id is added to each data row dictionary. While each row is being processed, the date is also converted from its original text representation into a Python date object and written back to the row.
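A rough sketch of that per-row transformation follows; the column names and date format are assumptions, not the actual CSV layout.

```python
from datetime import datetime

def filter_names(rows, name_to_id):
    """Swap each row's market-name text for its market_names row id
    and parse the date text into a Python date object, in place."""
    for row in rows:
        row["market_name_id"] = name_to_id[row.pop("Market_Name")]
        row["Date"] = datetime.strptime(row["Date"], "%Y-%m-%d").date()
    return rows
```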
Importer.update_market_names() is used to update the market_names table's display name field, in case the market name has changed slightly in a newer file year.
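One way to implement such an update is an SQLite upsert keyed on a normalized name, as sketched below; the normalization rule and column names are assumptions for illustration.

```python
def update_market_names(conn, names):
    """Upsert each market name keyed on a normalized form, so a
    slightly respelled name in a newer file refreshes the stored
    display name."""
    for display in names:
        key = " ".join(display.upper().split())  # illustrative normalization
        conn.execute(
            "INSERT INTO market_names (name, display_name) VALUES (?, ?) "
            "ON CONFLICT(name) DO UPDATE SET display_name = excluded.display_name",
            (key, display),
        )
```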
Finally, Importer.import_data() writes the data to the database. The data is appended to the table regardless of date or market name; with SQL databases, sorting and grouping can easily be done at query time. Only the raw data is stored, to keep the database small; all derived values are calculated at query time. To keep queries simple and allow easy viewing of results, a View named future_calc applies the calculated values to a virtual table.
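Such a view might be defined as below. The joined columns and the net_positions calculation are invented examples to show the pattern, not the view's actual definition.

```python
def create_future_calc_view(conn):
    """Layer derived values over the raw table as a virtual table;
    only raw data is stored, so the calculation runs at query time."""
    conn.execute("""
        CREATE VIEW IF NOT EXISTS future_calc AS
        SELECT f.report_date,
               m.display_name,
               f.long_positions,
               f.short_positions,
               f.long_positions - f.short_positions AS net_positions
        FROM future f
        JOIN market_names m ON m.id = f.market_name_id
    """)
```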