Dataset should be messier and larger #108

Open
8 tasks
bencomp opened this issue Jun 23, 2022 · 7 comments
Labels
help wanted (Looking for Contributors), type:enhancement (Propose enhancement to the lesson)

Comments

Contributor

bencomp commented Jun 23, 2022

The prepared SAFI dataset is, I think, not messy enough to really show OpenRefine's power.

I would like:

  • several cells with leading or trailing whitespace in rows that are far apart (to use with "Trim leading and trailing whitespace" transform)
  • more variants in the village names or another column (to use with clustering)
  • a date that is a clear outlier (to find using a timeline facet)
  • non-numeric data in a column that should be numeric (to find using a numeric facet)

See also #35. The number of columns doesn't make the data messy.


Summarising the to-dos from the discussion below:

  • Add leading and trailing spaces to (let's say 10) cells in the village and respondent_roof_type columns in rows that are far apart
  • Add accents, spaces in the middle of names, or typos to several cells in the village column
  • Change a few date values to a different date format (making sure that parsing works correctly, or fails completely, so that you don't think everything worked when it did not)
  • Change the year on a date value to make that value an (obvious) outlier, e.g., December 2016 becomes December 2017
  • Change a numeric value to a non-numeric value like NULL
  • Change a numeric (missing?) value to -99
  • Add a step on setting the character encoding to the project creation section
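
The edits above could be scripted so that they are reproducible. The following is a minimal sketch, not an agreed procedure: the helper names are hypothetical, the rows are made-up placeholders rather than SAFI data, and the date column name `interview_date` and its `YYYY-MM-DD` format are assumptions.

```python
def spread_indices(total, n):
    """Return n row indices spaced evenly across `total` rows."""
    return [i * total // n for i in range(n)]


def add_whitespace(rows, column, n):
    """Pad n evenly spaced cells, alternating leading and trailing spaces."""
    picked = spread_indices(len(rows), n)
    for k, i in enumerate(picked):
        value = rows[i][column]
        rows[i][column] = " " + value if k % 2 == 0 else value + " "
    return picked


def shift_year(rows, column, index, years=1):
    """Move the year of one YYYY-MM-DD value to create an outlier.

    No validity check is made, so avoid 29 February.
    """
    value = rows[index][column]
    rows[index][column] = f"{int(value[:4]) + years:04d}{value[4:]}"


# Made-up rows for illustration only; not taken from the SAFI data.
rows = [
    {"village": "Village A", "interview_date": f"2016-12-{day:02d}", "no_membrs": str(day + 2)}
    for day in range(1, 7)
]

picked = add_whitespace(rows, "village", 2)   # pads rows 0 and 3
shift_year(rows, "interview_date", 5)         # December 2016 becomes December 2017
rows[1]["no_membrs"] = "NULL"                 # non-numeric value in a numeric column
rows[4]["no_membrs"] = "-99"                  # missing-data code
```

Picking evenly spaced indices rather than random ones keeps the changed rows far apart and makes the result the same on every run.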

While we are making changes, I think this should (or could) be part of the update too, even though it was part of #29 and not explicitly mentioned now:

  • Add more rows from the original dataset
bencomp added the status:refer to cac (Curriculum Advisory Committee input needed), type:discussion (Discussion or feedback about the lesson), and type:enhancement (Propose enhancement to the lesson) labels Jun 23, 2022
Contributor Author

bencomp commented Jun 27, 2022

I got a question about the many NULL values during today's workshop. This hasn't been addressed in the lesson as far as I know.

There are other inconsistencies that are not covered by the lesson:

  • duplicate value 1 in the quest_no column
  • Bandula in the district column
  • Manica in the ward column
  • motorcyle instead of motorcycle in the items_owned column
  • _members_count as duplicate of no_membrs column
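
For anyone who wants to confirm these outside OpenRefine, here is a rough sketch of the three kinds of check involved: duplicated identifiers, values that occur only once in a column, and two columns that carry the same data. The function names are hypothetical, and the rows are invented placeholders that only reuse the column names and the value "Bandula" mentioned above.

```python
from collections import Counter


def duplicated_values(rows, column):
    """Values that appear more than once in `column`."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def rare_values(rows, column, max_count=1):
    """Values that appear at most `max_count` times in `column`."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, count in counts.items() if count <= max_count)


def columns_identical(rows, first, second):
    """True when two columns hold the same value in every row."""
    return all(row[first] == row[second] for row in rows)


# Invented rows for illustration only; "District X" is a placeholder.
rows = [
    {"quest_no": "1", "district": "District X", "no_membrs": "3", "_members_count": "3"},
    {"quest_no": "1", "district": "District X", "no_membrs": "7", "_members_count": "7"},
    {"quest_no": "2", "district": "Bandula", "no_membrs": "10", "_members_count": "10"},
    {"quest_no": "3", "district": "District X", "no_membrs": "7", "_members_count": "7"},
]

duplicate_ids = duplicated_values(rows, "quest_no")
odd_districts = rare_values(rows, "district")
same_columns = columns_identical(rows, "no_membrs", "_members_count")
```

On a larger table, `max_count` would need tuning, since a legitimately rare value is not necessarily an error.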

Maybe I should say that the data is messy in the wrong places ;)

Contributor Author

bencomp commented Jan 25, 2023

@datacarpentry/curriculum-advisors-social-science Your input would be very welcome.


ndporter commented Jan 26, 2023

> several cells with leading or trailing whitespace in rows that are far apart (to use with "Trim leading and trailing whitespace" transform)

Yes

> more variants in the village names or another column (to use with clustering)

Yes (perhaps with accents, whitespace, etc.)

> a date that is a clear outlier (to find using a timeline facet)

In addition to this, we could consider adding some dates in a different format to show that they can be processed too

> non-numeric data in a column that should be numeric (to find using a numeric facet)

Yes (something like NULL or NA would work). Also, we can add missing data codes (-99, etc) if we want the numeric facet to be especially useful.
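
To make the distinction concrete, the sketch below sorts cell values into three groups: numeric, non-numeric, and missing-data codes. It is only an analogy for what learners would see in a facet, not a description of how OpenRefine works internally; the function name, the sample values, and the choice of `-99` as the only missing code are assumptions.

```python
from collections import Counter


def classify_cell(value, missing_codes=("-99",)):
    """Label a cell as 'missing-code', 'numeric', or 'non-numeric'."""
    text = value.strip()
    if text in missing_codes:
        return "missing-code"
    try:
        # Caveat: float() also accepts strings such as "nan" and "inf".
        float(text)
    except ValueError:
        return "non-numeric"
    return "numeric"


# Invented values for a column that should be numeric.
values = ["3", "7", "NULL", "NA", "-99", "12"]
summary = Counter(classify_cell(value) for value in values)
```

A value such as `-99` parses as a number, which is why it shows up as an outlier in a numeric view, while `NULL` and `NA` fall outside it entirely.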

Contributor Author

bencomp commented Feb 16, 2023

Thanks for your comments, @ndporter! I edited them to get the formatting right.
I like all of your suggestions! Hopefully the rest of the CAC agrees!

I do wonder who would be responsible for actually updating the dataset...


ndporter commented Jul 6, 2023

The CAC agrees with all of the above recommendations and further suggests mentioning how to manually specify encodings (especially for work with non-English text), perhaps following the model in step 4 listed in the LC lesson. Thanks!
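
As a small illustration of why the encoding setting matters, the sketch below decodes the same bytes in two ways. The place name is a made-up example, not a value from the dataset, and Latin-1 stands in for whichever wrong encoding a tool might guess.

```python
# Hypothetical accented name, stored as UTF-8 bytes.
text = "São José"
raw = text.encode("utf-8")

as_utf8 = raw.decode("utf-8")       # correct encoding: text is unchanged
as_latin1 = raw.decode("latin-1")   # wrong encoding: each accented letter becomes two characters

# The damage is reversible only while the wrong decoding is still known.
repaired = as_latin1.encode("latin-1").decode("utf-8")
```

Here `as_latin1` comes out as `SÃ£o JosÃ©`, the kind of garbling learners would see if the encoding were not set when the project is created.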

Member

tobyhodges commented Jul 7, 2023

@bencomp I recommend adding the 'help wanted' label to this issue now, and removing the CAC and discussion labels so that would-be contributors know it is ready to be tackled. You could also pin the issue to the repo issue listing, so that it is even more visible. Finally, as the desired changes are spread across several posts here, it might be helpful to summarise what changes should be made to solve the issue, all in one post at the end of this thread?

If you would like to take additional steps to encourage community members to contribute, we could post about the issue on Slack, and offer help to anyone who is interested but not fully confident with making changes to a lesson.

[Edit: when it comes to it, the Curriculum Team will be able to log in and take care of updating the FigShare entry to include the new version of the dataset.]

bencomp added the help wanted (Looking for Contributors) label and removed the status:refer to cac (Curriculum Advisory Committee input needed) and type:discussion (Discussion or feedback about the lesson) labels Jul 7, 2023
Contributor Author

bencomp commented Jul 18, 2023

Thanks for the feedback, @ndporter! And thanks for the suggestions and explanation, @tobyhodges!

Edit: I'm moving the tasks to the top of the issue.

bencomp changed the title from "Dataset should be messier" to "Dataset should be messier and larger" Sep 20, 2023
3 participants