Ms scraper #508
Conversation
palewire left a comment
We've got the Michigan file being edited here again. Let's get those changes out so we can focus on the file you want to add.
My bad, done! I copied the current mi.py code.
```python
alllinks = ["https://mdes.ms.gov/" + link["href"] for link in a]
links = []
pdf_list = []
for link in alllinks:
    if "map" not in link:
        links.append(link)
```
Can we combine this filter with the list comprehension above? Something more like:
[f"https://mdes.ms.gov/{link['href']}" for link in a if "map" not in link["href"]]
```python
cache = Cache(cache_dir)
cache.write("xx/yy.html", html)
soup = BeautifulSoup(html, "html5lib")
a = soup.select("a[href*=pdf]")
```
Is `a` the best name for this variable?
Definitely not, my bad.
| if "map" not in link: | ||
| links.append(link) | ||
| for i in links: | ||
| cache_key = i.split("/")[-2] + i.split("/")[-1] |
I'm not exactly sure how to properly name the `cache_key` values. I tried to generate a unique key by taking the last two parts of every PDF link's URL, which contain the month, year, and quarter.
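For example (the URL below is made up to show the shape of the key, not a real link):

```python
# Hypothetical example URL; real links are scraped from mdes.ms.gov.
url = "https://mdes.ms.gov/media/152801/py2018_q4_warn_apr2019_jun2019.pdf"
parts = url.split("/")
# Join the last two path segments to form a (hopefully) unique cache key,
# e.g. "152801py2018_q4_warn_apr2019_jun2019.pdf"
cache_key = parts[-2] + parts[-1]
```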
```python
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        text = page.extract_tables()
        if text == []:
```
Can this just be a more straightforward falsiness test, like `if not text`?
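Something like this sketch, using the names from the diff:

```python
for page in pdf.pages:
    text = page.extract_tables()
    if not text:  # extract_tables() returns an empty list when no tables are found
        continue
```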
```python
for i in range(len(text)):
    if text[i][0][0] != "" and text[i][0][0][0] == "D":
        notices = text[i]
        break
```
The extracted tables contained a lot of useless lists; the notices would be in the table whose first cell contained "Date" or some variation of it. I've modified it to make more sense.
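Roughly, the revised approach looks like this sketch (same names as the diff; the check is meant to catch headers like "Date"):

```python
notices = None
for table in text:
    first_cell = table[0][0]
    # Keep the table whose header cell looks like "Date", "Date of Notice", etc.
    if first_cell and first_cell.startswith("D"):
        notices = table
        break
```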
```python
end_table = False
while notices[startrow][0] is None or "/" not in notices[startrow][0]:
    startrow += 1
    if startrow == len(notices):
        end_table = True
        break
if end_table:
    continue
```
Can you explain this bit to me?
Since there was often space between the header row and the information, `startrow` locates the index where the data begins. Sometimes a table contains no information at all; that case is handled when `startrow` runs out of rows to search, by setting `end_table` to `True` and ending the loop.
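In sketch form (same names as the diff, with the bounds check folded into the loop condition):

```python
startrow = 0
end_table = False
# Skip past the header and any blank padding rows: a data row is one
# whose first cell contains a date, i.e. a "/".
while startrow < len(notices) and (
    notices[startrow][0] is None or "/" not in notices[startrow][0]
):
    startrow += 1
if startrow == len(notices):
    end_table = True  # the table held no data rows at all
```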
```python
for row in range(startrow, len(notices) - 1, 2):
    final = []

    if notices[row][0] is None or "/" not in notices[row][0]:
        continue

    for i in notices[row]:
        if i is not None:
            final.append(i)

    for i in notices[row + 1]:
        if i is not None:
            final.append(i.strip())
    while len(final) != 9:
        if "" in final:
            final.remove("")
        else:
            break
    final_data.append(final)
```
What's going on in this chunk?
There are two rows of PDF data extracted by pdfplumber for each row that goes in the CSV, so I have to parse `row` and `row + 1`. There are often empty strings at the end of some of the rows, so I remove those just after.
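In other words, each CSV record is built from a pair of physical rows, roughly like this commented sketch (same names as the diff above):

```python
for row in range(startrow, len(notices) - 1, 2):
    # Skip stray rows that don't start with a date.
    if notices[row][0] is None or "/" not in notices[row][0]:
        continue
    # Merge the record's two physical rows, dropping None cells.
    final = [cell for cell in notices[row] if cell is not None]
    final += [cell.strip() for cell in notices[row + 1] if cell is not None]
    # Trim empty-string padding until the row has the expected 9 columns.
    while len(final) != 9 and "" in final:
        final.remove("")
    final_data.append(final)
```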
Triggering tests by closing and reopening.
OK, so for the record I've done some terrible things to @Ash1R's draft, and I hope to do more soon and get this into production.

To-do:
@Ash1R, I've got a bunch more validation in the scraper. I incorporated the fixes made by @jsvine but then had to go farther afield to patch an even weirder PDF. Still need to set up some of the historical data, but first I need to do some validation of the CSV. Looks like it picked up about 30 more rows than you were getting, which is ... weird.
Seeing some data integrity problems with edge cases that bump up against the "every other row has the layoff number" logic. A good example:

Another way to handle this might be to split the rows into sections (e.g., every section must have a "/" in the first cell of the first row, to show a date). That's likely overkill.
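If it ever seemed worth it, that might look something like this untested sketch (`notices` is the extracted table from the diff above):

```python
# Group extracted rows into sections, each anchored by a dated row
# ("/" in the first cell); following rows join the current section.
sections = []
current = None
for row in notices:
    if row[0] and "/" in row[0]:
        current = [row]
        sections.append(current)
    elif current is not None:
        current.append(row)
```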
The PDF parsing is still failing in some interesting ways. I tried to get the historical data cleaned up but found most of a page missing, e.g., 152801_py2018_q4_warn_apr2019_jun2019.pdf. I tweaked a couple of things in the Python to try to improve logging and readability in the output, but it does not affect the substance, only the sort order. Somewhat patched CSV:

Note pages set to "manual," which I only started after patching some in 2013-2015.
This is for issue #373, to add a scraper for Mississippi (never spelt that one wrong...).
Works correctly, but around 8 rows have two of the values switched, all for the same reason. Should I fix that or leave it for downstream?