Ms scraper #373 #508

Open
Ash1R wants to merge 12 commits into biglocalnews:main from Ash1R:ms-scraper

Conversation

@Ash1R (Contributor) commented Dec 25, 2022

This is for issue #373, adding a scraper for Mississippi (never spelled that one wrong...).

It works correctly, but around eight rows have two of their values switched, all for the same reason. Should I fix that here or leave it for downstream?

@palewire (Contributor) left a comment


We've got the Michigan file being edited here again. Let's get those changes out so we can focus on the file you want to add.

@Ash1R (Contributor, Author) commented Jan 8, 2023

My bad, done! I copied the current mi.py code.

Comment thread warn/scrapers/ms.py Outdated
Comment on lines +39 to +44
alllinks = ["https://mdes.ms.gov/" + link["href"] for link in a]
links = []
pdf_list = []
for link in alllinks:
if "map" not in link:
links.append(link)
Contributor

Can we combine this filter with the list comprehension above? Something more like:

[f"https://mdes.ms.gov/{link['href']}" for link in a if "map" not in link["href"]]

Contributor Author

👍

Comment thread warn/scrapers/ms.py Outdated
cache = Cache(cache_dir)
cache.write("xx/yy.html", html)
soup = BeautifulSoup(html, "html5lib")
a = soup.select("a[href*=pdf]")
Contributor

Is `a` the best name for this variable?

@Ash1R (Contributor, Author) commented Jan 20, 2023

Definitely not, my bad.

Comment thread warn/scrapers/ms.py Outdated
if "map" not in link:
links.append(link)
for i in links:
cache_key = i.split("/")[-2] + i.split("/")[-1]
Contributor

What's going on here?

Contributor Author

I'm not exactly sure how to properly name the cache keys. I tried to generate a unique key by taking the last two parts of each PDF link's URL, which together contain the month, year, and quarter.
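The approach described above can be sketched as a small helper (the function name is illustrative; the sample URL follows the mdes.ms.gov pattern seen later in this thread):

```python
def cache_key_for(url: str) -> str:
    """Build a cache key from the last two path segments of a PDF URL.

    For these WARN PDFs the final segments carry a numeric ID and the
    year/quarter/month-range filename, so together they are unique
    per report.
    """
    parts = url.rstrip("/").split("/")
    return parts[-2] + parts[-1]


key = cache_key_for(
    "https://mdes.ms.gov/media/26893/PY2011_Q1_WARN_July2011_Sep2011.pdf"
)
# key == "26893PY2011_Q1_WARN_July2011_Sep2011.pdf"
```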

Comment thread warn/scrapers/ms.py Outdated
with pdfplumber.open(file) as pdf:
for page in pdf.pages:
text = page.extract_tables()
if text == []:
Contributor

Can this just be a more straightforward emptiness test, like `if not text:`?

Contributor Author

👍

Comment thread warn/scrapers/ms.py
Comment on lines +71 to +74
for i in range(len(text)):
if text[i][0][0] != "" and text[i][0][0][0] == "D":
notices = text[i]
break
Contributor

What's going on here?

Contributor Author

The extracted tables contained a lot of useless lists; the notices are in the list whose first term contains "Date" or some variation of it. I've modified it to make more sense.
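A minimal sketch of that selection logic, using toy data in place of real pdfplumber output (the helper name is an assumption, not the code in the PR):

```python
def find_notices_table(tables):
    """Return the first extracted table whose top-left cell starts with
    'D' (matching 'Date' or variants); pdfplumber's extract_tables()
    often returns junk tables around the real one."""
    for table in tables:
        first_cell = table[0][0]
        if first_cell and first_cell.startswith("D"):
            return table
    return None


# Toy data standing in for pdfplumber's nested-list output:
tables = [
    [["", None]],                       # junk table
    [["Date of Notice", "Company"],     # real table header
     ["1/5/2015", "Acme Corp."]],
]
notices = find_notices_table(tables)
```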

Comment thread warn/scrapers/ms.py
Comment on lines +76 to +83
end_table = False
while notices[startrow][0] is None or "/" not in notices[startrow][0]:
startrow += 1
if startrow == len(notices):
end_table = True
break
if end_table:
continue
Contributor

Can you explain this bit to me?

Contributor Author

Since there was often space between the header row and the information, startrow locates the index where the data begins. Sometimes a table contains no information at all; that case is handled when startrow runs out of rows to search, by setting end_table to True and ending the loop.
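The same scan, restated as a standalone function for clarity (a sketch, not the PR's exact code; returning None stands in for the end_table flag):

```python
def find_start_row(notices, startrow=1):
    """Advance past blank/header rows until a cell that looks like a
    date (contains '/'); return None if the table has no data rows."""
    while startrow < len(notices):
        cell = notices[startrow][0]
        if cell is not None and "/" in cell:
            return startrow
        startrow += 1
    return None  # table had no data at all (the end_table case)


# Toy table: header, a blank spacer row, then the first data row.
table = [["Date", "Company"], [None, None], ["3/8/2019", "Acme Corp."]]
start = find_start_row(table)                              # 2
empty = find_start_row([["Date", "Company"], ["", ""]])    # None
```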

Comment thread warn/scrapers/ms.py Outdated
Comment on lines +84 to +102
for row in range(startrow, len(notices) - 1, 2):
final = []

if notices[row][0] is None or "/" not in notices[row][0]:
continue

for i in notices[row]:
if i is not None:
final.append(i)

for i in notices[row + 1]:
if i is not None:
final.append(i.strip())
while len(final) != 9:
if "" in final:
final.remove("")
else:
break
final_data.append(final)
Contributor

What's going on in this chunk?

Contributor Author

pdfplumber extracts two rows of PDF data for each row that goes into the CSV, so I have to parse row and row + 1. There are often empty strings at the end of some of the rows, so I remove those just after.
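The pair-merging described above can be sketched like this (toy rows in place of real pdfplumber output; the expected width of 9 comes from the diff under review):

```python
def merge_row_pair(row_a, row_b, width=9):
    """Merge a two-line pdfplumber row into one record: drop None cells,
    strip the continuation line, then trim empty strings until the
    record reaches the expected width (or no empties remain)."""
    final = [cell for cell in row_a if cell is not None]
    final += [cell.strip() for cell in row_b if cell is not None]
    while len(final) != width and "" in final:
        final.remove("")
    return final


record = merge_row_pair(
    ["1/5/2015", "Acme Corp.", None, "Closure", "120", "", "", "", "", ""],
    ["Jackson (Hinds)", " 3/1/2015 ", None],
)
```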

@stucka (Contributor) commented Aug 21, 2023

Triggering tests by closing and reopening.

@stucka stucka closed this Aug 21, 2023
@stucka stucka reopened this Aug 21, 2023
@stucka (Contributor) commented Oct 9, 2023

OK, so for the record I've done some terrible things to @Ash1R 's draft, and hope to do more soon and get this into production.

  • Realized naming scheme for HTML in cache was bad, then realized caching the HTML was unnecessary.
  • Realized caching of the PDFs was faulty -- it'd cache every page once, so the first layoff of a quarter would be seen as the complete quarter. I'm hoping to shift the older PDFs into an exported historical CSV to make everything run that much faster. That's partially implemented, but we still need clean CSV exports from which to generate that data. We're close there.
  • Realized some of the PDF data was coming through with bad or weird data -- line breaks in the middle of company names, odd unicode dashes. Wrote a simple function to clean that up.
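The kind of cleanup described in the last bullet might look roughly like this (a hedged sketch; the function name and exact character set are assumptions, not the code that was committed):

```python
import re


def clean_text(value: str) -> str:
    """Collapse mid-name line breaks and normalize odd unicode dashes,
    per the cleanup described above."""
    value = value.replace("\n", " ")
    # Map figure/en/em dashes and the minus sign to a plain hyphen.
    value = re.sub(r"[\u2012\u2013\u2014\u2212]", "-", value)
    # Collapse any resulting runs of whitespace.
    return re.sub(r"\s+", " ", value).strip()


cleaned = clean_text("Acme\nHolding \u2013 East")
# cleaned == "Acme Holding - East"
```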

To-do:

  • Improve logging
  • Revamp Excel logic -- skipping some cells is disastrous, as with Aramark March 2019 line getting merged with Palmer House April 2019. Might be something as basic as, if the last cell is empty, then drop in a null value, add the row, and we'll get on with life. 8/3/2012 and 11/21/2013 are going to be a useful edge case for validation.
  • Implement historical data download and import
  • Implement record duplicate checker
  • Evaluate whether it's possible to extract out city and county from this thing and if so, implement it. A sample suggests it's always that last line, typically with a city name, sometimes a garbage character, county in parenthesis, sometimes a ZIP code. Is county good enough?
  • Build out a transformer. Note bad date with a year of 208.
  • Improve text cleanup -- Ruth's Chris Steak House with odd apostrophe. P.F. Chang's.
  • Evaluate a number of rows from 2013-2015 with fields swapped around, and at least one with an extra field. Historical data can be patched up before export, but getting the scraper to work on them could improve future effectiveness.
  • Consider dropping pre-2016 rows if cleanup seems too unwieldy, then flag as an issue.
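The city/county extraction floated in the to-do list above could start from a pattern like this (a hypothetical helper; the "city (county) ZIP" shape is assumed from the sample described, and real rows with garbage characters would need more handling):

```python
import re

# City name, county in parentheses, optional trailing ZIP code.
LOCATION_RE = re.compile(
    r"^(?P<city>.+?)\s*\((?P<county>[^)]+)\)\s*(?P<zip>\d{5})?$"
)


def parse_location(line: str):
    """Return (city, county, zip_or_None), or None if the line
    doesn't match the expected pattern."""
    match = LOCATION_RE.match(line.strip())
    if not match:
        return None
    return match.group("city"), match.group("county"), match.group("zip")


loc = parse_location("Jackson (Hinds) 39201")
# loc == ("Jackson", "Hinds", "39201")
```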

@stucka (Contributor) commented Oct 16, 2023

@Ash1R, I've got a bunch more validation in the scraper. I incorporated the fixes made by @jsvine but then had to go further afield to patch an even weirder PDF. I still need to set up some of the historical data, but first need to do some validation of the CSV. It looks like it picked up about 30 more rows than you were getting, which is ... weird.

ms.csv

@stucka (Contributor) commented Oct 17, 2023

Seeing some data integrity problems with edge cases that bump up against the logic of "every other row has the layoff number" kind of thing. A good example:
https://mdes.ms.gov/media/26893/PY2011_Q1_WARN_July2011_Sep2011.pdf

Another way to handle this might be to split the rows up into sections (e.g., every section must have a "/" in the first cell of the first row, to show a date). That's likely overkill.
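For the record, the section-splitting idea could be sketched as follows (toy data, and a hypothetical helper rather than anything implemented here):

```python
def split_into_sections(rows):
    """Split table rows into sections, each starting at a row whose
    first cell contains '/' (i.e., looks like a date)."""
    sections = []
    for row in rows:
        if row[0] is not None and "/" in row[0]:
            sections.append([row])       # a date starts a new section
        elif sections:
            sections[-1].append(row)     # continuation of current section
    return sections


# Toy rows: two layoff entries, each spanning two extracted rows.
rows = [
    ["1/5/2015", "Acme Corp."], [None, "Jackson (Hinds)"],
    ["3/8/2019", "Widget Co."], [None, "Tupelo (Lee)"],
]
sections = split_into_sections(rows)
```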

@stucka (Contributor) commented Nov 15, 2023

The PDF parsing is still failing in some interesting ways. I tried to get the historical data cleaned up but found most of a page missing, e.g., 152801_py2018_q4_warn_apr2019_jun2019.pdf

I tweaked a couple things in the Python to try to improve logging and readability in the output, but it does not affect the substance, only the sort order.

Somewhat patched CSV:
ms.csv

Note pages set to "manual," which I only started after patching some in 2013-2015.
