chef changes time concepts #124

semio · 2020-05-27T01:53:49Z

chef changes the day concept while it shouldn't, probably some parsing thing:

https://github.com/open-numbers/ddf--open_numbers--covid_government_response/blob/master/ddf--datapoints--stringency_index--by--geo--time.csv

https://github.com/open-numbers/ddf--oxford--covid_government_response/blob/master/ddf--datapoints--stringency_index--by--country--day.csv

The text was updated successfully, but these errors were encountered:

semio · 2020-05-27T02:22:47Z

here is the thing : time concept was converted to int16

For 4 digits year number it works well but numbers like 20200520 is bigger than the upper bound of int16

semio · 2020-05-27T02:54:38Z

I didn't use string as dtype here because it's not space efficient and in our datasets the time values are integers. Using categorical is better choice which is more space efficient, but it's not easy to get all categories (i.e. all possible values of the time concept).

But the decision was made before I used dask to load datapoints. Dask solves the data-bigger-than-memory issue, so possibly for now it's ok to just use string for time. Need to test it

work around #124, need more consideration

jheeffer · 2020-05-27T12:54:26Z

Time concept is not neccesarily numeric. See the documentation. Week, Quarter and Month contain w, q and - respectively.
Also undocumented in DDF but following ISO8601, when adding time, there would be a t seperator between date and time and . separating seconds from milliseconds

If you can internally parse it to datetime and format it back when outputting that'd be fine too. Might be more memory efficient. But this may be hard, especially in case of time concept, whose format is ambiguous: should midnight 1st of january 2020 be formatted as 20200101 or 2020 or 2020-01? Ambiguity might be solved by using the format the data was in when you parsed it. Might be difficult to trace that?

related to #124 Because pandas set_index() and reset_index() have a bug that reset dtypes. uint64 will reset to int64 and cause problems so set it to int64 for now.

semio · 2020-06-10T02:06:10Z

Ok, as we have defined the format for year, month etc, I think it's possible to parse these columns to datetime object internally. For time, we can use string for now, or maybe add some way to specify the format in recipe or datapackage.json

semio · 2020-07-07T03:04:30Z

After implementing reading time columns as datetime objects, we also need to change the filtering function in ingredient filter and filter procedure, because datetime object is not comparable to int (e.g filter time > 1900 not working). But I found that the performance with eval() and query() (which are used in filter) is very poor

# df is population from WPP
>>> df.shape
(6130902, 5)

>>> df.columns
Index(['age1yearinterval', 'country', 'gender', 'time', 'population'], dtype='object')

>>> %time df.eval("time > '2020'")
CPU times: user 50.4 s, sys: 614 ms, total: 51 s
Wall time: 51 s

>>>  %time df['time'] > '2020'
CPU times: user 7.45 ms, sys: 0 ns, total: 7.45 ms
Wall time: 7.05 ms

# below is performance when using string
>>> %time df['time'] > "2020"
CPU times: user 253 ms, sys: 0 ns, total: 253 ms
Wall time: 252 ms

>>> %time df.eval('time > "2020"')
CPU times: user 255 ms, sys: 23.8 ms, total: 279 ms
Wall time: 276 ms

... which I think is not acceptable. And there will be many changes to make if we don't use df.query(). So I think we for now settle with strings for time concepts because filter should just work.

>>> '2010q4' < '2020q2'
True

>>> '2010w51' < '2020w01'
True

Comments or OK?

related issue #124

time values are strings, so we don't compare them to ints related issue #124

semio · 2020-07-20T01:11:58Z

chef will read time columns as strings now. I also sent a bug report to pandas

semio added a commit that referenced this issue May 27, 2020

use bigger int for time concept

c1fe41e

work around #124, need more consideration

semio added a commit that referenced this issue Jun 9, 2020

just use int64

01c3fb3

related to #124 Because pandas set_index() and reset_index() have a bug that reset dtypes. uint64 will reset to int64 and cause problems so set it to int64 for now.

semio added a commit that referenced this issue Jul 12, 2020

use string dtype for time for now

eed2e69

related issue #124

semio added a commit that referenced this issue Jul 12, 2020

query builder: handle input for query about time columns

db3a5e3

time values are strings, so we don't compare them to ints related issue #124

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chef changes time concepts #124

chef changes time concepts #124

semio commented May 27, 2020

semio commented May 27, 2020 •

edited

Loading

semio commented May 27, 2020 •

edited

Loading

jheeffer commented May 27, 2020 •

edited

Loading

semio commented Jun 10, 2020

semio commented Jul 7, 2020 •

edited

Loading

semio commented Jul 20, 2020

chef changes time concepts #124

chef changes time concepts #124

Comments

semio commented May 27, 2020

semio commented May 27, 2020 • edited Loading

semio commented May 27, 2020 • edited Loading

jheeffer commented May 27, 2020 • edited Loading

semio commented Jun 10, 2020

semio commented Jul 7, 2020 • edited Loading

semio commented Jul 20, 2020

semio commented May 27, 2020 •

edited

Loading

semio commented May 27, 2020 •

edited

Loading

jheeffer commented May 27, 2020 •

edited

Loading

semio commented Jul 7, 2020 •

edited

Loading