
Generate typos and error #86

Open
maelle opened this issue Jul 23, 2018 · 17 comments

@maelle
Member

maelle commented Jul 23, 2018

I haven't been able to find an example in the faker packages of other languages, but then maybe I have missed existing stuff.

The idea would be to have something similar to MissingDataProvider, but instead of replacing the picked values with NAs, it'd modify them slightly to make them invalid (for values that have a defined valid format, e.g. phone numbers) or just different (e.g. for people's names). I guess making an element different isn't too difficult, but making it invalid is a bit more effort.
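
Something like this, perhaps: a small sketch of the idea in plain R (perturb_values() is hypothetical, not charlatan's API), where a random subset of values gets perturbed instead of replaced with NA:

# Hypothetical sketch: replace a random subset of values with a perturbed
# version, analogous to how MissingDataProvider replaces values with NA.
perturb_values <- function(x, perturb, prop = 0.1) {
  idx <- sample(seq_along(x), max(1, round(length(x) * prop)))
  x[idx] <- perturb(x[idx])
  x
}

phones <- c("301-443-9878", "703-967-4000", "202-456-1111")
# e.g. append an extra digit so the picked numbers no longer match the format
perturb_values(phones, function(p) paste0(p, "9"), prop = 0.5)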

cc @isteves

@sckott sckott added the feature label Jul 23, 2018
@sckott
Collaborator

sckott commented Jul 23, 2018

Thanks @maelle!

Right, some will be easier than others for sure. It may even take some consulting with people knowledgeable in the area to tell us what's invalid :)

@maelle
Member Author

maelle commented Sep 19, 2018

Interesting: https://github.com/mdlincoln/salty (its examples of raw data are actually created with charlatan)

@sckott
Collaborator

sckott commented Sep 19, 2018

Very cool.

@sckott
Collaborator

sckott commented Oct 18, 2018

I'm not sure whether this would best be done within each generator, or via a separate function that you call, which applies the appropriate tweak to each variable the user selects.
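
For the second option, a rough sketch of what a standalone post-processing function could look like (messify() and everything about it is hypothetical, just to illustrate the shape of the API):

# Hypothetical: tweak user-selected columns of an already-generated data frame,
# instead of building the behaviour into each generator.
messify <- function(dat, vars, prop = 0.1) {
  for (v in vars) {
    idx <- sample(nrow(dat), max(1, round(nrow(dat) * prop)))
    if (is.numeric(dat[[v]])) {
      # push the picked values well outside the observed range
      dat[[v]][idx] <- dat[[v]][idx] + 10 * diff(range(dat[[v]]))
    } else {
      # make the picked values "just different" by tacking on a random letter
      dat[[v]][idx] <- paste0(dat[[v]][idx], sample(letters, length(idx), TRUE))
    }
  }
  dat
}

dat <- data.frame(name = c("Ana", "Bea", "Cal", "Dee"),
                  lat  = c(12.3, -45.6, 7.8, 60.1),
                  stringsAsFactors = FALSE)
messify(dat, vars = c("name", "lat"), prop = 0.5)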

@maelle
Member Author

maelle commented Oct 20, 2018

Not sure either 🤔

@sckott sckott added this to the v0.4 milestone Jan 19, 2019
@sckott
Collaborator

sckott commented Jan 19, 2019

@maelle I started playing with this a bit on the invalid branch. Just one provider so far:

z <- CoordinateProvider$new()
dat <- replicate(1000, z$lat())                  # valid latitudes
dat2 <- replicate(1000, z$lat(invalid = TRUE))   # out-of-range latitudes
summary(dat)
summary(dat2)

I was looking around for some sort of framework for invalid coordinate values, rather than just adjusting numbers, but haven't found anything yet.

@isteves

isteves commented Jan 21, 2019

@sckott in terms of common coordinate errors, there are some guidelines in the CoordinateCleaner package: https://github.com/ropensci/CoordinateCleaner

They include:

  • duplicate entries for long/lat (95, 95 instead of 95, 123)
  • country centroids (instead of exact coordinates)
  • 0's
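
For what it's worth, a hedged sketch of injecting the first and third kinds of mistakes into generated coordinates (the helper below is hypothetical; country centroids would need reference data, so they're left out):

# Hypothetical: corrupt a fraction of lat/lon pairs with common mistakes,
# here either duplicating lat into lon or zeroing the coordinates.
add_coord_mistakes <- function(coords, prop = 0.1) {
  idx <- sample(nrow(coords), max(1, round(nrow(coords) * prop)))
  for (i in idx) {
    if (sample(c(TRUE, FALSE), 1)) {
      coords$lon[i] <- coords$lat[i]           # lon accidentally copied from lat
    } else {
      coords$lat[i] <- 0; coords$lon[i] <- 0   # placeholder 0,0 coordinates
    }
  }
  coords
}

coords <- data.frame(lat = runif(10, -90, 90), lon = runif(10, -180, 180))
add_coord_mistakes(coords, prop = 0.3)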

@sckott
Collaborator

sckott commented Jan 22, 2019

Thanks @isteves!

Yeah, out-of-range values for lat and lon are what's built in thus far (your first bullet). However, you can imagine many ways to do this. If you set invalid = TRUE, should that generate all invalid data? Should it generate some invalid data and some valid data? Should it generate a distribution of data where there's some invalid data in each tail of the distribution? (Applies to numeric data only, I guess.)

The second two bullets, though, are valid values of lat and lon by themselves, but they definitely deserve a second look as to whether they are correct or not.

For "validity" itself, I think only the first bullet fits. I'd like to stick to strictly valid or invalid data generation, since if we want to do this across the package where applicable, I think it has to be somewhat consistent.

But we could think about generating something like "common mistakes", which I think would encompass your latter two bullets.

@isteves

isteves commented Jan 22, 2019

That's fair. For the first point, I really meant to give an example of another "common mistake" (20, 20 versus 20, 22), but I guess I inadvertently gave a totally invalid example 😬

I like the distinction between "valid" and "common mistakes" to keep it more general 👍

@sckott
Collaborator

sckott commented Jan 23, 2019

Ah okay, I see about the common mistakes.

So I guess we have the following use cases we could support:

  • invalid: only for data types where you can strictly determine valid and invalid values, e.g.:
    • lat values outside -90/90 and lon values outside -180/180
    • other examples?
  • typos: probably only for character data types, AND probably only for fact-based data, e.g. place names, dates, etc.; a good start might be https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings, e.g.:
    • Apirl instead of April
    • other examples?
  • common mistakes: only where we can define a clear set of mistakes that are commonly made for the data type, e.g.:
    • lat/lon at 0,0
    • other examples?

A question for all of the above is how to approach creating them (repeating from the above comment, and generalizing). Should invalid = TRUE, typos = TRUE, and common_mistakes = TRUE generate data that's all invalid/typo/mistake? Or should they generate some invalid/typo/mistake data and some clean data? Should they generate a distribution of data where there's some invalid/typo/mistake data in each tail of the distribution? (Applies to numeric data only, I guess.)
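
For the numeric case, one possible answer is to let a proportion argument decide how much of the output is bad; a minimal sketch (prob and random_lat() are hypothetical, not what the invalid branch currently does):

# Hypothetical: mix valid and invalid latitudes according to `prob`.
# prob = 0 gives all valid values, prob = 1 gives all invalid values.
random_lat <- function(n, prob = 0) {
  bad <- runif(n) < prob
  lat <- runif(n, -90, 90)
  # invalid values: pushed outside the -90/90 range on either side
  lat[bad] <- sample(c(-1, 1), sum(bad), replace = TRUE) * runif(sum(bad), 91, 180)
  lat
}

summary(random_lat(1000, prob = 0.1))  # roughly 10% of values out of range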

@thoughtfulbloke

For typos: from some projects I've done in the past, most computer data-entry typos are a letter substituted for a nearby key in the keyboard layout being used.

This is particularly the case for proper names of any kind, since any spellchecking tends to assume that, because of the capitalised first letter, it is a name it may not know about.
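
A minimal sketch of that idea, with a tiny hand-written neighbour table for a few QWERTY keys (a real implementation would want a full keyboard-layout map; the function and table below are illustrative only):

# Hypothetical keyboard-neighbour typo: swap one randomly chosen letter
# for an adjacent key on a QWERTY layout.
neighbours <- list(a = c("q", "s", "z"), e = c("w", "r", "d"),
                   i = c("u", "o", "k"), n = c("b", "m", "h"),
                   o = c("i", "p", "l"), s = c("a", "d", "w"))

keyboard_typo <- function(word) {
  chars <- strsplit(tolower(word), "")[[1]]
  hits <- which(chars %in% names(neighbours))
  if (length(hits) == 0) return(word)  # no letter we know neighbours for
  pos <- if (length(hits) == 1) hits else sample(hits, 1)
  substr(word, pos, pos) <- sample(neighbours[[chars[pos]]], 1)
  word
}

keyboard_typo("Wellington")  # e.g. "Wellingtin" or "Wrllington"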

@isteves

isteves commented Jan 24, 2019

I now feel like I need to backtrack a bit... I wonder if it's best to just focus on "common typos", whether that's a number that's way out of range or a misspelling. Perhaps lat/long-specific common mistakes are better suited to specialized packages (like CoordinateCleaner).

In terms of typos, common categorical variables (jobs, color, t/f, marital status, etc.; see https://github.com/trinker/wakefield for a bunch of examples) are probably the best way to go. With names/locations/etc., it's difficult to determine typos with certainty.

@sckott
Collaborator

sckott commented Jan 24, 2019

Thanks for your input @thoughtfulbloke!

I like that idea of a letter substituted by a nearby key. Do you know of any dataset/list of these?


@isteves wrote:

  "see https://github.com/trinker/wakefield for a bunch of examples"

Of what? It doesn't give typos, correct? Or does it?

@thoughtfulbloke

@thoughtfulbloke

Also, I just noticed https://github.com/colinmorris/reddit-dubious-spelling

@sckott
Collaborator

sckott commented Jan 25, 2019

Both look promising, thanks @thoughtfulbloke.

@isteves

isteves commented Jan 27, 2019

@sckott Nope, no typos; just some more examples of common categorical variables (in addition to what I saw in the charlatan README) that would be good typo candidates.

@sckott sckott modified the milestones: v0.4, v0.5 Oct 3, 2019
@sckott sckott modified the milestones: v0.5, v0.6 Aug 16, 2022
@RMHogervorst RMHogervorst removed this from the v0.6 milestone Oct 22, 2023