
Generate typos and error #86

Open
maelle opened this issue Jul 23, 2018 · 17 comments

@maelle
Member

maelle commented Jul 23, 2018

I haven't been able to find an example in the faker packages of other languages, but then maybe I have missed existing stuff.

The idea would be to have something similar to MissingDataProvider, but instead of replacing the picked values with NAs, it'd modify them slightly to make them invalid (for values that have a defined valid format, e.g. phone numbers) or just different (e.g. for people's names). I guess making an element different isn't too difficult, but making it invalid is a bit more effort.
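
Something like this, perhaps: a small sketch of the idea in plain R (perturb_values() is hypothetical, not charlatan's API), where a random subset of values gets perturbed instead of replaced with NA:

# Hypothetical sketch: replace a random subset of values with a perturbed
# version, analogous to how MissingDataProvider replaces values with NA.
perturb_values <- function(x, perturb, prop = 0.1) {
  idx <- sample(seq_along(x), max(1, round(length(x) * prop)))
  x[idx] <- perturb(x[idx])
  x
}

phones <- c("301-443-9878", "703-967-4000", "202-456-1111")
# e.g. append an extra digit so the picked numbers no longer match the format
perturb_values(phones, function(p) paste0(p, "9"), prop = 0.5)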

cc @isteves

@sckott sckott added the feature label Jul 23, 2018
@sckott
Collaborator

sckott commented Jul 23, 2018

Thanks @maelle!

Right, some will be easier than others for sure. It may even take some consulting with people knowledgeable in the area to tell us what's invalid :)

@maelle
Member Author

maelle commented Sep 19, 2018

Interesting: https://github.com/mdlincoln/salty (its examples of raw data are actually created with charlatan)

@sckott
Collaborator

sckott commented Sep 19, 2018

Very cool.

@sckott
Collaborator

sckott commented Oct 18, 2018

I'm not sure whether this would best be done within each generator, or via a separate function that you call, which applies the appropriate tweak to each variable the user selects.
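
For the second option, a rough sketch of what a standalone post-processing function could look like (messify() and everything about it is hypothetical, just to illustrate the shape of the API):

# Hypothetical: tweak user-selected columns of an already-generated data frame,
# instead of building the behaviour into each generator.
messify <- function(dat, vars, prop = 0.1) {
  for (v in vars) {
    idx <- sample(nrow(dat), max(1, round(nrow(dat) * prop)))
    if (is.numeric(dat[[v]])) {
      # push the picked values well outside the observed range
      dat[[v]][idx] <- dat[[v]][idx] + 10 * diff(range(dat[[v]]))
    } else {
      # make the picked values "just different" by tacking on a random letter
      dat[[v]][idx] <- paste0(dat[[v]][idx], sample(letters, length(idx), TRUE))
    }
  }
  dat
}

dat <- data.frame(name = c("Ana", "Bea", "Cal", "Dee"),
                  lat  = c(12.3, -45.6, 7.8, 60.1),
                  stringsAsFactors = FALSE)
messify(dat, vars = c("name", "lat"), prop = 0.5)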

@maelle
Member Author

maelle commented Oct 20, 2018

Not sure either 🤔

@sckott sckott added this to the v0.4 milestone Jan 19, 2019
@sckott
Collaborator

sckott commented Jan 19, 2019

@maelle I started playing with this a bit on the invalid branch. Just one provider so far:

z <- CoordinateProvider$new()
dat <- replicate(1000, z$lat())                  # valid latitudes
dat2 <- replicate(1000, z$lat(invalid = TRUE))   # out-of-range latitudes
summary(dat)
summary(dat2)

I was looking around for some sort of framework for invalid coordinate values, rather than just adjusting numbers, but haven't found anything yet.

@isteves

isteves commented Jan 21, 2019

@sckott in terms of common coordinate errors, there are some guidelines in the CoordinateCleaner package: https://github.com/ropensci/CoordinateCleaner

They include:

  • duplicate entries for long/lat (95, 95 instead of 95, 123)
  • country centroids (instead of exact coordinates)
  • 0's
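
For what it's worth, a hedged sketch of injecting the first and third kinds of mistakes into generated coordinates (the helper below is hypothetical; country centroids would need reference data, so they're left out):

# Hypothetical: corrupt a fraction of lat/lon pairs with common mistakes,
# here either duplicating lat into lon or zeroing the coordinates.
add_coord_mistakes <- function(coords, prop = 0.1) {
  idx <- sample(nrow(coords), max(1, round(nrow(coords) * prop)))
  for (i in idx) {
    if (sample(c(TRUE, FALSE), 1)) {
      coords$lon[i] <- coords$lat[i]           # lon accidentally copied from lat
    } else {
      coords$lat[i] <- 0; coords$lon[i] <- 0   # placeholder 0,0 coordinates
    }
  }
  coords
}

coords <- data.frame(lat = runif(10, -90, 90), lon = runif(10, -180, 180))
add_coord_mistakes(coords, prop = 0.3)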

@sckott
Collaborator

sckott commented Jan 22, 2019

Thanks @isteves!

Yeah, out-of-range values for lat and lon are what's built in thus far (your first bullet). However, you can imagine many ways to do this. If you set invalid = TRUE, should that generate all invalid data? Should it generate some invalid data and some valid data? Should it generate a distribution of data where there's some invalid data in each tail of the distribution? (Applies to numeric data only, I guess.)

The second two bullets, though, are valid values of lat and lon by themselves, but they definitely deserve a second look as to whether they are correct or not.

For "validity" itself, I think only the first bullet fits. I'd like to stick to strictly valid or invalid data generation, since if we want to do this across the package where applicable, I think it has to be somewhat consistent.

But we could think about generating something like "common mistakes", which I think would encompass your latter two bullets.

@isteves

isteves commented Jan 22, 2019

That's fair. For the first point, I really meant to give an example of another "common mistake" (20, 20 versus 20, 22), but I guess I inadvertently gave a totally invalid example 😬

I like the distinction between "valid" and "common mistakes" to keep it more general 👍

@sckott
Collaborator

sckott commented Jan 23, 2019

Ah okay, I see about the common mistakes.

So I guess we have the following use cases we could support:

  • invalid: only for data types where you can strictly determine valid and invalid values, e.g.:
    • lat values outside -90/90 and lon values outside -180/180
    • other examples?
  • typos: probably only for character data types, AND probably only for fact-based data, e.g. place names, dates, etc.; a good start might be https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings, e.g.:
    • Apirl instead of April
    • other examples?
  • common mistakes: only where we can define a clear set of mistakes that are commonly made for the data type, e.g.:
    • lat/lon at 0,0
    • other examples?

A question for all of the above is how to approach creating them (repeating from the above comment, and generalizing). Should invalid = TRUE, typos = TRUE, and common_mistakes = TRUE generate data that's all invalid/typo/mistake? Or should they generate some invalid/typo/mistake data and some clean data? Should they generate a distribution of data where there's some invalid/typo/mistake data in each tail of the distribution? (Applies to numeric data only, I guess.)
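
For the numeric case, one possible answer is to let a proportion argument decide how much of the output is bad; a minimal sketch (prob and random_lat() are hypothetical, not what the invalid branch currently does):

# Hypothetical: mix valid and invalid latitudes according to `prob`.
# prob = 0 gives all valid values, prob = 1 gives all invalid values.
random_lat <- function(n, prob = 0) {
  bad <- runif(n) < prob
  lat <- runif(n, -90, 90)
  # invalid values: pushed outside the -90/90 range on either side
  lat[bad] <- sample(c(-1, 1), sum(bad), replace = TRUE) * runif(sum(bad), 91, 180)
  lat
}

summary(random_lat(1000, prob = 0.1))  # roughly 10% of values out of range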

@thoughtfulbloke

For typos: from some projects I've done in the past, most computer data-entry typos are a letter substituted for a nearby key in the keyboard layout being used.

This is particularly the case for proper names of any kind, since any spellchecking tends to assume that, because of the capitalised first letter, it is a name it may not know about.
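
A minimal sketch of that idea, with a tiny hand-written neighbour table for a few QWERTY keys (a real implementation would want a full keyboard-layout map; the function and table below are illustrative only):

# Hypothetical keyboard-neighbour typo: swap one randomly chosen letter
# for an adjacent key on a QWERTY layout.
neighbours <- list(a = c("q", "s", "z"), e = c("w", "r", "d"),
                   i = c("u", "o", "k"), n = c("b", "m", "h"),
                   o = c("i", "p", "l"), s = c("a", "d", "w"))

keyboard_typo <- function(word) {
  chars <- strsplit(tolower(word), "")[[1]]
  hits <- which(chars %in% names(neighbours))
  if (length(hits) == 0) return(word)  # no letter we know neighbours for
  pos <- if (length(hits) == 1) hits else sample(hits, 1)
  substr(word, pos, pos) <- sample(neighbours[[chars[pos]]], 1)
  word
}

keyboard_typo("Wellington")  # e.g. "Wellingtin" or "Wrllington"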

@isteves

isteves commented Jan 24, 2019

I now feel like I need to backtrack a bit... I wonder if it's best to just focus on "common typos", whether that's a number that's way out of range or a misspelling. Perhaps lat/long-specific common mistakes are better suited to specialized packages (like CoordinateCleaner).

In terms of typos, common categorical variables (jobs, color, t/f, marital status, etc.; see https://github.com/trinker/wakefield for a bunch of examples) are probably the best way to go. With names/locations/etc., it's difficult to determine typos with certainty.

@sckott
Collaborator

sckott commented Jan 24, 2019

Thanks for your input @thoughtfulbloke!

I like that idea of a letter substituted by a nearby key. Do you know of any dataset/list of these?


@isteves wrote:

  "see https://github.com/trinker/wakefield for a bunch of examples"

Of what? It doesn't give typos, correct? Or does it?

@thoughtfulbloke

@thoughtfulbloke

Also, I just noticed https://github.com/colinmorris/reddit-dubious-spelling

@sckott
Collaborator

sckott commented Jan 25, 2019

Both look promising, thanks @thoughtfulbloke.

@isteves

isteves commented Jan 27, 2019

@sckott Nope, no typos; just some more examples of common categorical variables (in addition to what I saw in the charlatan README) that would be good typo candidates.

@sckott sckott modified the milestones: v0.4, v0.5 Oct 3, 2019
@sckott sckott modified the milestones: v0.5, v0.6 Aug 16, 2022
@RMHogervorst RMHogervorst removed this from the v0.6 milestone Oct 22, 2023