Move data source collection tools to data-source-identification #252

josh-chamberlain · 2024-11-13T20:09:46Z

Context

https://github.com/Police-Data-Accessibility-Project/data-source-identification

The Scrapers repo is for collecting data from one or more sources at a time for use/analysis.

However, we also have some scraping tools whose express goal is to generate sources (lists of URLs) for submission to our database. The data source ID repo has tools which can parse those lists of URLs, for example by identifying agencies or sending the URLs to our annotation pipeline.

Requirements

- Move the scripts for collecting data sources from MuckRock and CKAN to the Data Source Identification repo.
  - This will require one PR to remove them and one to add them; make sure they are linked to each other and we'll approve both at once.
  - They probably belong in a subfolder with common_crawler, called something like "source_collectors".
- Optionally, develop a lightweight system for tracking sources which have been tried, so we can reduce duplication of effort. For example, every time we run the MuckRock or CKAN scraper, we should avoid submitting sources which we already collected the previous time.
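
For the optional tracking item, here is one possible shape, offered only as a rough sketch and not as a decided design. It assumes sources are plain URL strings and that a JSON file on disk is good enough for state. The class name `SeenSources`, the file name `seen_sources.json`, and the normalization rule (trim whitespace and a trailing slash) are all hypothetical and not taken from either repo.

```python
import json
from pathlib import Path


class SeenSources:
    """Hypothetical tracker of source URLs collected in earlier runs."""

    def __init__(self, path):
        self.path = Path(path)
        # Load URLs recorded by previous runs; start empty if there is no file yet.
        if self.path.exists():
            self.seen = set(json.loads(self.path.read_text()))
        else:
            self.seen = set()

    @staticmethod
    def _key(url):
        # Assumed normalization: trim whitespace and a trailing slash only.
        return url.strip().rstrip("/")

    def filter_new(self, urls):
        """Return URLs not seen before, in order, and mark them as seen."""
        new = []
        for url in urls:
            key = self._key(url)
            if key not in self.seen:
                self.seen.add(key)
                new.append(url)
        return new

    def save(self):
        # Persist the full set so the next run can skip these URLs.
        self.path.write_text(json.dumps(sorted(self.seen), indent=2))
```

A collector could create `SeenSources("seen_sources.json")`, pass each batch of scraped URLs through `filter_new` before submission, and call `save()` at the end of the run. A flat file keeps this lightweight; if several people run the scrapers, checking against the database itself might be a better fit.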


eddie-m-m · 2024-11-13T21:35:01Z

Clarification needed:
There is muckrock_tools, and then there is muckrock_scraper.py with templates. Should only muckrock_tools be moved (since the other doesn't appear to deal with source collection)?

josh-chamberlain · 2024-11-14T20:46:06Z

@eddie-m-m good catch, yes! The other one is for grabbing files from MuckRock, and it's in the right spot!

josh-chamberlain mentioned this issue Nov 13, 2024

MuckRock scraper enhancements Police-Data-Accessibility-Project/data-source-identification#105

Open

6 tasks

This was referenced Nov 16, 2024

Remove CKAN, MuckRock source collection scrapers #253

Draft

Add CKAN, MuckRock source collection scrapers Police-Data-Accessibility-Project/data-source-identification#106

Draft
