chef changes time concepts #124
Comments
Here is the thing: the time concept was converted to int16. For 4-digit year numbers this works well, but numbers like 20200520 are bigger than the upper bound of int16 (32767).
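To make the overflow concrete, here is a minimal sketch in plain Python. It assumes nothing about chef's internals: `fits_int16` is a hypothetical helper, and `ctypes.c_int16` is used only to show how a 16-bit signed integer silently wraps a value that is too large.

```python
import ctypes

# Bounds of a signed 16-bit integer.
INT16_MIN = -(2 ** 15)      # -32768
INT16_MAX = 2 ** 15 - 1     # 32767


def fits_int16(value: int) -> bool:
    """Return True if value can be stored in int16 without overflow (hypothetical helper)."""
    return INT16_MIN <= value <= INT16_MAX


year = 2020
day = 20200520

year_fits = fits_int16(year)   # a 4-digit year fits
day_fits = fits_int16(day)     # a yyyymmdd value does not

# ctypes does no overflow checking, so the value wraps modulo 2**16.
wrapped = ctypes.c_int16(day).value

print(year_fits, day_fits, wrapped)
```

If a conversion wraps like this instead of raising an error, the stored value no longer resembles the original date, which would be one possible explanation for a time column appearing "changed".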
I didn't use string as the dtype here because it's not space efficient, and in our datasets the time values are integers. Using But the decision was made before I used dask to load datapoints. Dask solves the data-bigger-than-memory issue, so for now it's possibly OK to just use string for time. Need to test it.
work around #124, need more consideration
Time concept is not necessarily numeric. See the documentation. Week, Quarter and Month contain If you can internally parse it to
related to #124: pandas set_index() and reset_index() have a bug that resets dtypes. uint64 gets reset to int64 and causes problems, so set it to int64 for now.
Ok, as we have defined the format for
After implementing reading time columns as datetime objects, we also need to change the filtering function in the ingredient filter and the filter procedure, because a datetime object is not comparable to an int (e.g. the filter time > 1900 does not work). But I found that the performance with
# df is population from WPP
>>> df.shape
(6130902, 5)
>>> df.columns
Index(['age1yearinterval', 'country', 'gender', 'time', 'population'], dtype='object')
>>> %time df.eval("time > '2020'")
CPU times: user 50.4 s, sys: 614 ms, total: 51 s
Wall time: 51 s
>>> %time df['time'] > '2020'
CPU times: user 7.45 ms, sys: 0 ns, total: 7.45 ms
Wall time: 7.05 ms
# below is performance when using string
>>> %time df['time'] > "2020"
CPU times: user 253 ms, sys: 0 ns, total: 253 ms
Wall time: 252 ms
>>> %time df.eval('time > "2020"')
CPU times: user 255 ms, sys: 23.8 ms, total: 279 ms
Wall time: 276 ms
... which I think is not acceptable. And there will be many changes to make if we don't use df.query(). So I think for now we should settle on strings for time concepts, because the filter should just work:
>>> '2010q4' < '2020q2'
True
>>> '2010w51' < '2020w01'
True
Comments or OK?
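The comparisons above work because Python compares strings character by character, so fixed-width, zero-padded time values of the same format sort in chronological order. A minimal sketch of a string-based filter follows; `filter_after` is a hypothetical helper for illustration, not chef's actual filter code.

```python
from typing import List


def filter_after(times: List[str], threshold: str) -> List[str]:
    """Keep the time values that compare greater than threshold as strings (hypothetical helper)."""
    return [t for t in times if t > threshold]


years = ["1999", "2019", "2020", "2021"]
quarters = ["2010q4", "2020q1", "2020q2"]
weeks = ["2010w51", "2020w01", "2020w10"]

after_2020 = filter_after(years, "2020")          # strictly after "2020"
after_2020q1 = filter_after(quarters, "2020q1")
after_2020w01 = filter_after(weeks, "2020w01")

# Caveat: this relies on zero-padding. Without it the order breaks,
# because "9" sorts after "1" when compared character by character.
unpadded_is_wrong = "2020w9" > "2020w10"

print(after_2020, after_2020q1, after_2020w01, unpadded_is_wrong)
```

The design trade-off is that correctness depends on every value in a column using the same zero-padded format, and the threshold in a filter has to be a string too.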
time values are strings, so we don't compare them to ints. Related issue: #124
chef will read time columns as strings now. I also sent a bug report to pandas.
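A minimal sketch of what reading a time column as strings can look like with pandas. The CSV content and column names below are made up for illustration, and this is not chef's actual loading code; it only shows that `read_csv` infers integers for values like 20200520 unless a `dtype` is given.

```python
import io

import pandas as pd

# Made-up sample in the shape of a datapoints file (illustrative values only).
csv_text = "country,day,stringency_index\nswe,20200519,32.4\nswe,20200520,32.4\n"

# Default behaviour: pandas infers an integer dtype for the day column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps the values exactly as written in the file.
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"day": str})

print(inferred["day"].dtype)
print(as_strings["day"].tolist())

# String time values can then be filtered with string thresholds.
mask = as_strings["day"] > "20200519"
print(mask.tolist())
```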
chef changes the day concept while it shouldn't, probably some parsing thing:
https://github.com/open-numbers/ddf--open_numbers--covid_government_response/blob/master/ddf--datapoints--stringency_index--by--geo--time.csv
https://github.com/open-numbers/ddf--oxford--covid_government_response/blob/master/ddf--datapoints--stringency_index--by--country--day.csv