Ms scraper #373 #508

Open
Ash1R wants to merge 12 commits into biglocalnews:main from Ash1R:ms-scraper

Conversation

@Ash1R (Contributor) commented Dec 25, 2022

This is for issue #373, adding a scraper for Mississippi (never spelled that one wrong...).

It works correctly, but around eight rows have two of their values switched, all for the same reason. Should I fix that here or leave it for downstream?

@palewire (Contributor) left a comment


We've got the Michigan file being edited here again. Let's get those changes out so we can focus on the file you want to add.

@Ash1R (Contributor, Author) commented Jan 8, 2023

My bad, done! I copied the current mi.py code.

Comment thread warn/scrapers/ms.py Outdated
Comment on lines +39 to +44
alllinks = ["https://mdes.ms.gov/" + link["href"] for link in a]
links = []
pdf_list = []
for link in alllinks:
if "map" not in link:
links.append(link)
Contributor

Can we combine this filter with the list comprehension above? Something more like:

[f"https://mdes.ms.gov/{link['href']}" for link in a if "map" not in link["href"]]

Contributor Author

👍

Comment thread warn/scrapers/ms.py Outdated
cache = Cache(cache_dir)
cache.write("xx/yy.html", html)
soup = BeautifulSoup(html, "html5lib")
a = soup.select("a[href*=pdf]")
Contributor

Is `a` the best name for this variable?

@Ash1R (Contributor, Author) commented Jan 20, 2023

Definitely not, my bad.

Comment thread warn/scrapers/ms.py Outdated
if "map" not in link:
links.append(link)
for i in links:
cache_key = i.split("/")[-2] + i.split("/")[-1]
Contributor

What's going on here?

Contributor Author

I'm not exactly sure how to properly name the cache keys. I tried to generate a unique key by taking the last two parts of each PDF link's URL, which together contain the month, year, and quarter.
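The approach described above can be sketched as a small helper (the function name is illustrative; the sample URL follows the mdes.ms.gov pattern seen later in this thread):

```python
def cache_key_for(url: str) -> str:
    """Build a cache key from the last two path segments of a PDF URL.

    For these WARN PDFs the final segments carry a numeric ID and the
    year/quarter/month-range filename, so together they are unique
    per report.
    """
    parts = url.rstrip("/").split("/")
    return parts[-2] + parts[-1]


key = cache_key_for(
    "https://mdes.ms.gov/media/26893/PY2011_Q1_WARN_July2011_Sep2011.pdf"
)
# key == "26893PY2011_Q1_WARN_July2011_Sep2011.pdf"
```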

Comment thread warn/scrapers/ms.py Outdated
with pdfplumber.open(file) as pdf:
for page in pdf.pages:
text = page.extract_tables()
if text == []:
Contributor

Can this just be a more straightforward emptiness test, like `if not text:`?

Contributor Author

👍

Comment thread warn/scrapers/ms.py
Comment on lines +71 to +74
for i in range(len(text)):
if text[i][0][0] != "" and text[i][0][0][0] == "D":
notices = text[i]
break
Contributor

What's going on here?

Contributor Author

The extracted tables contained a lot of useless lists; the notices are in the list whose first term contains "Date" or some variation of it. I've modified it to make more sense.
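A minimal sketch of that selection logic, using toy data in place of real pdfplumber output (the helper name is an assumption, not the code in the PR):

```python
def find_notices_table(tables):
    """Return the first extracted table whose top-left cell starts with
    'D' (matching 'Date' or variants); pdfplumber's extract_tables()
    often returns junk tables around the real one."""
    for table in tables:
        first_cell = table[0][0]
        if first_cell and first_cell.startswith("D"):
            return table
    return None


# Toy data standing in for pdfplumber's nested-list output:
tables = [
    [["", None]],                       # junk table
    [["Date of Notice", "Company"],     # real table header
     ["1/5/2015", "Acme Corp."]],
]
notices = find_notices_table(tables)
```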

Comment thread warn/scrapers/ms.py
Comment on lines +76 to +83
end_table = False
while notices[startrow][0] is None or "/" not in notices[startrow][0]:
startrow += 1
if startrow == len(notices):
end_table = True
break
if end_table:
continue
Contributor

Can you explain this bit to me?

Contributor Author

Since there was often space between the header row and the information, startrow locates the index where the data begins. Sometimes a table contains no information at all; that case is handled when startrow runs out of rows to search, by setting end_table to True and ending the loop.
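The same scan, restated as a standalone function for clarity (a sketch, not the PR's exact code; returning None stands in for the end_table flag):

```python
def find_start_row(notices, startrow=1):
    """Advance past blank/header rows until a cell that looks like a
    date (contains '/'); return None if the table has no data rows."""
    while startrow < len(notices):
        cell = notices[startrow][0]
        if cell is not None and "/" in cell:
            return startrow
        startrow += 1
    return None  # table had no data at all (the end_table case)


# Toy table: header, a blank spacer row, then the first data row.
table = [["Date", "Company"], [None, None], ["3/8/2019", "Acme Corp."]]
start = find_start_row(table)                              # 2
empty = find_start_row([["Date", "Company"], ["", ""]])    # None
```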

Comment thread warn/scrapers/ms.py Outdated
Comment on lines +84 to +102
for row in range(startrow, len(notices) - 1, 2):
final = []

if notices[row][0] is None or "/" not in notices[row][0]:
continue

for i in notices[row]:
if i is not None:
final.append(i)

for i in notices[row + 1]:
if i is not None:
final.append(i.strip())
while len(final) != 9:
if "" in final:
final.remove("")
else:
break
final_data.append(final)
Contributor

What's going on in this chunk?

Contributor Author

pdfplumber extracts two rows of PDF data for each row that goes into the CSV, so I have to parse row and row + 1. There are often empty strings at the end of some of the rows, so I remove those just after.
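The pair-merging described above can be sketched like this (toy rows in place of real pdfplumber output; the expected width of 9 comes from the diff under review):

```python
def merge_row_pair(row_a, row_b, width=9):
    """Merge a two-line pdfplumber row into one record: drop None cells,
    strip the continuation line, then trim empty strings until the
    record reaches the expected width (or no empties remain)."""
    final = [cell for cell in row_a if cell is not None]
    final += [cell.strip() for cell in row_b if cell is not None]
    while len(final) != width and "" in final:
        final.remove("")
    return final


record = merge_row_pair(
    ["1/5/2015", "Acme Corp.", None, "Closure", "120", "", "", "", "", ""],
    ["Jackson (Hinds)", " 3/1/2015 ", None],
)
```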

@stucka (Contributor) commented Aug 21, 2023

Triggering tests by closing and reopening.

@stucka stucka closed this Aug 21, 2023
@stucka stucka reopened this Aug 21, 2023
@stucka (Contributor) commented Oct 9, 2023

OK, so for the record I've done some terrible things to @Ash1R 's draft, and hope to do more soon and get this into production.

  • Realized naming scheme for HTML in cache was bad, then realized caching the HTML was unnecessary.
  • Realized caching of the PDFs was faulty -- it'd cache every page once, so the first layoff of a quarter would be seen as the complete quarter. I'm hoping to shift the older PDFs into an exported historical CSV to make everything run that much faster. That's partially implemented, but we still need clean CSV exports from which to generate that data. We're close there.
  • Realized some of the PDF data was coming through with bad or weird data -- line breaks in the middle of company names, odd unicode dashes. Wrote a simple function to clean that up.
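The kind of cleanup described in the last bullet might look roughly like this (a hedged sketch; the function name and exact character set are assumptions, not the code that was committed):

```python
import re


def clean_text(value: str) -> str:
    """Collapse mid-name line breaks and normalize odd unicode dashes,
    per the cleanup described above."""
    value = value.replace("\n", " ")
    # Map figure/en/em dashes and the minus sign to a plain hyphen.
    value = re.sub(r"[\u2012\u2013\u2014\u2212]", "-", value)
    # Collapse any resulting runs of whitespace.
    return re.sub(r"\s+", " ", value).strip()


cleaned = clean_text("Acme\nHolding \u2013 East")
# cleaned == "Acme Holding - East"
```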

To-do:

  • Improve logging
  • Revamp Excel logic -- skipping some cells is disastrous, as with Aramark March 2019 line getting merged with Palmer House April 2019. Might be something as basic as, if the last cell is empty, then drop in a null value, add the row, and we'll get on with life. 8/3/2012 and 11/21/2013 are going to be a useful edge case for validation.
  • Implement historical data download and import
  • Implement record duplicate checker
  • Evaluate whether it's possible to extract out city and county from this thing and if so, implement it. A sample suggests it's always that last line, typically with a city name, sometimes a garbage character, county in parenthesis, sometimes a ZIP code. Is county good enough?
  • Build out a transformer. Note bad date with a year of 208.
  • Improve text cleanup -- Ruth's Chris Steak House with odd apostrophe. P.F. Chang's.
  • Evaluate a number of rows from 2013-2015 with fields swapped around, and at least one with an extra field. Historical data can be patched up before export, but getting the scraper to work on them could improve future effectiveness.
  • Consider dropping pre-2016 rows if cleanup seems too unwieldy, then flag as an issue.
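The city/county extraction floated in the to-do list above could start from a pattern like this (a hypothetical helper; the "city (county) ZIP" shape is assumed from the sample described, and real rows with garbage characters would need more handling):

```python
import re

# City name, county in parentheses, optional trailing ZIP code.
LOCATION_RE = re.compile(
    r"^(?P<city>.+?)\s*\((?P<county>[^)]+)\)\s*(?P<zip>\d{5})?$"
)


def parse_location(line: str):
    """Return (city, county, zip_or_None), or None if the line
    doesn't match the expected pattern."""
    match = LOCATION_RE.match(line.strip())
    if not match:
        return None
    return match.group("city"), match.group("county"), match.group("zip")


loc = parse_location("Jackson (Hinds) 39201")
# loc == ("Jackson", "Hinds", "39201")
```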

@stucka (Contributor) commented Oct 16, 2023

@Ash1R, I've got a bunch more validation in the scraper. I incorporated the fixes made by @jsvine but then had to go further afield to patch an even weirder PDF. I still need to set up some of the historical data, but first need to do some validation of the CSV. It looks like it picked up about 30 more rows than you were getting, which is ... weird.

ms.csv

@stucka (Contributor) commented Oct 17, 2023

Seeing some data integrity problems with edge cases that bump up against the logic of "every other row has the layoff number" kind of thing. A good example:
https://mdes.ms.gov/media/26893/PY2011_Q1_WARN_July2011_Sep2011.pdf

Another way to handle this might be to split the rows up into sections (e.g., every section must have a "/" in the first cell of the first row, to show a date). That's likely overkill.
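For the record, the section-splitting idea could be sketched as follows (toy data, and a hypothetical helper rather than anything implemented here):

```python
def split_into_sections(rows):
    """Split table rows into sections, each starting at a row whose
    first cell contains '/' (i.e., looks like a date)."""
    sections = []
    for row in rows:
        if row[0] is not None and "/" in row[0]:
            sections.append([row])       # a date starts a new section
        elif sections:
            sections[-1].append(row)     # continuation of current section
    return sections


# Toy rows: two layoff entries, each spanning two extracted rows.
rows = [
    ["1/5/2015", "Acme Corp."], [None, "Jackson (Hinds)"],
    ["3/8/2019", "Widget Co."], [None, "Tupelo (Lee)"],
]
sections = split_into_sections(rows)
```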

@stucka (Contributor) commented Nov 15, 2023

The PDF parsing is still failing in some interesting ways. I tried to get the historical data cleaned up but found most of a page missing, e.g., 152801_py2018_q4_warn_apr2019_jun2019.pdf

I tweaked a couple things in the Python to try to improve logging and readability in the output, but it does not affect the substance, only the sort order.

Somewhat patched CSV:
ms.csv

Note pages set to "manual," which I only started after patching some in 2013-2015.
