
Wv scraper #496

Open. Wants to merge 11 commits into main.
Conversation

Ash1R (Contributor) commented Oct 28, 2022

This is for issue #375, for West Virginia.
The notices were in a large PDF on their workforce site. pdfplumber extracted the tables pretty well, although there were some irregularities in the PDF (for example, some tables had boxes that gave the specific sites of the layoffs).
There are a couple of errors on their end (switched-up values), but nothing too significant.
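For reference, the core of that extraction step looks roughly like this (a minimal sketch, assuming a placeholder local path; the real scraper also downloads and caches the PDF first):

import pdfplumber

# Hypothetical local path; the actual scraper fetches the PDF from the
# WorkForce West Virginia site and caches it before parsing.
pdf_path = "wv_warn.pdf"

rows = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # extract_tables() returns each table as a list of rows,
        # where every cell is a str or None.
        for table in page.extract_tables():
            rows.extend(table)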

warn/scrapers/mi.py: four outdated review comments (resolved)
palewire (Contributor) commented Dec 5, 2022

@Ash1R, I am still seeing MI being deleted from the repo, I think. Do you see the same thing on the Files tab?

https://github.com/biglocalnews/warn-scraper/pull/496/files
[Screenshot from 2022-12-05 10-39-36]

warn/scrapers/wv.py: outdated review comment (resolved)
companydone = False
row = []
for k in range(len(data)):
    if data[k][0] is not None:
Contributor: Why is it necessary to use range in the loop here? Can you not simply do something more like for row in data?

Ash1R (Contributor, Author): Each company's data is contained in two consecutive rows, with some blank rows between these company row-pairs. Alternative company names and addresses are stored on the second row. I used range so I can access the second row with an index of k + 1. I had also used range unnecessarily later on, so I removed that.
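A sketch of that row-pairing idea, assuming illustrative column positions and field names (not the scraper's actual schema):

# data: rows extracted by pdfplumber; each cell is a str or None.
# Each company spans two consecutive rows, with blank separator rows
# between the pairs.
records = []
k = 0
while k < len(data) - 1:
    if data[k][0] is None:
        k += 1  # skip a blank separator row
        continue
    first, second = data[k], data[k + 1]
    records.append(
        {
            "company": first[0],
            "alternate_name_or_address": second[0],  # hypothetical field
        }
    )
    k += 2  # consume the whole row-pair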

stucka (Contributor) commented Aug 21, 2023

Triggering tests by closing and reopening.

stucka closed this Aug 21, 2023
stucka reopened this Aug 21, 2023
stucka (Contributor) commented Aug 21, 2023

mypy is flagging some type errors:
warn/scrapers/wv.py:65: error: Item "None" of "Optional[str]" has no attribute "strip" [union-attr]
warn/scrapers/wv.py:66: error: Item "None" of "Optional[str]" has no attribute "strip" [union-attr]
warn/scrapers/wv.py:68: error: Item "None" of "Optional[str]" has no attribute "strip" [union-attr]
warn/scrapers/wv.py:72: error: Item "None" of "Optional[str]" has no attribute "strip" [union-attr]
warn/scrapers/wv.py:74: error: Item "None" of "Optional[str]" has no attribute "strip" [union-attr]
warn/scrapers/wv.py:75: error: Item "None" of "Optional[str]" has no attribute "strip" [union-attr]
warn/scrapers/la.py:170: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs [annotation-unchecked]

stucka (Contributor) commented Aug 21, 2023

@Ash1R, I think I see maybe an easy way to work around the mypy type conflict and also make this a bit more readable, something like:

if not data[k][0]:
    rowkey = None
else:
    rowkey = data[k][0].strip()

Then start folding those changes into the flagged rows, like if rowkey in header_whitelist:, and then keep working on down. The last bit might be more readable as elif ((not rowkey) and (k != 0)) ... ?
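Pulled together, that suggestion might look something like this (a sketch only; the whitelist contents and column layout are hypothetical, and header_whitelist and the k != 0 guard come from the comments above):

header_whitelist = {"Company", "Notice Date"}  # hypothetical header values

for k, row in enumerate(data):
    # mypy-safe: only call .strip() when the cell is a non-empty str.
    rowkey = row[0].strip() if row[0] else None
    if rowkey in header_whitelist:
        continue  # repeated page header
    elif (not rowkey) and (k != 0):
        continue  # blank separator row
    # ... process the data row here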

stucka (Contributor) commented Sep 22, 2023

The landing page has perhaps been killed off. This is the closest I could find, and I can't guarantee it would be updated in the same way, notice after notice: https://workforcewv.org/about-us/

I have not tried seeing if this scraper works with that PDF.
