analyzer: on aggregation queries with indexes, do not call to table when possible. #215

ajnavarro · 2018-06-13T09:58:54Z

No description provided.

erizocosmico · 2018-06-28T10:34:18Z

Cases:

SELECT COUNT(x) FROM t where T has any index and x is any column or *. Replace with a static count of the index entries.
SELECT FN(x) FROM t [WHERE ...] where FN is an aggregation function, x is a column for which we have an index in table t, if there are other nodes, we must have an index for all the columns that appear. Replace t with a node that returns all values of the index in full scan.
SELECT FN(x)[, y, ...] FROM t GROUP BY foo[, bar, ...] where FN is an aggregation function and all the columns that appear belonging to table t have an index (for all them), replace t with a node that returns all values of the index in full scan (this is a more complicated case of the previous one).

WDYT @ajnavarro @src-d/data-retrieval? Any cases I missed with this?

erizocosmico · 2018-07-04T09:06:00Z

I've begin implementing this but I have noticed something that gets a little in the way of this:

All this can be generalized into a single rule (not only for aggregations): when we have N columns in use for table T and we have an index for table T for all these columns. If columns are inside a COUNT, they are ignored. If no other columns are used, but a count is, just search for whatever index we may have for this table.

With that, we replace the table node with an iterator that just reads the values from the index and puts nils wherever the index does not have them (those values will never be used, anyway).

That would not only work for aggregations but for any table that does not have an index yet.

What are the problems?

The problems come when we want to get the values from the index.

We can't decode the values. Values are gob-encoded and they must be decoded using the concrete type (which we don't have), we can only decode to interface{}, and that does not work. This could be solved doing our own "gob-like" encoder/decoder. go-mysql-server only works with a very well defined set of types, so it would be easy to do.
Reconstruct the values from pilosa and the mapping. There is no mapping from rowID to value, afaik, so it's kind of hard (and perhaps very, very slow) to do it that way.

What are the use cases of this?

Full-scan with no already available indexes (if it didn't have an index already, it won't have it with even more columns in the mix).

For aggregations it needs the values to be usable, which, as we discussed before, cannot be done (right now).

So we end up with this being only usable for queries with only this form: SELECT COUNT(whatever) FROM table. If the query has a WHERE it needs the values themselves (index filters are not erased, see #187. Also, no groupbys, other columns in the select, because that means having to read the actual rows and doing some work on them.

This case could be speed up by additionally storing in the mapping the number of rows for the index and replacing the whole query with a static number getting it from the index.

How can this be solved?

For starters, we could do the optimization for SELECT COUNT(whatever) FROM table, given the other optimization requires really heavy changes in the way we do things in the pilosa index.

We need the following things from the index for this to be possible:

Be able to get all values for the index.
Be able to get the actual values easily and if possible, not very costly (if it's really costly it may not be worth doing at all).
Be able to decode the values into interface{}.

Thoughts? @ajnavarro @src-d/data-retrieval

erizocosmico · 2018-07-04T10:25:34Z

Discussed via slack:

Things that should be done:

Implement a way to be able to decode-encode our go-mysql-server types

Add a new interface to be used by IndexLookup to return IndexKeyValueIter

Implement that interface on pilosa and add the missing mapping

This is kept as paused until pilosa new version is released and we can ditch the mapping altogether.

kuba-- · 2018-07-10T10:17:53Z

Be able to get all values for the index.
It should be as simple as iteration through BoltDB bucket
Be able to decode the values into interface{}.
What if we try to save these interfaces as a row's attributes in frames?
func (f *Frame) SetRowAttrs(rowID uint64, attrs map[string]interface{}) *PQLBaseQuery

ajnavarro added the enhancement New feature or request label Jun 13, 2018

ajnavarro added this to the Index-2 milestone Jun 13, 2018

erizocosmico self-assigned this Jun 28, 2018

erizocosmico assigned erizocosmico and unassigned erizocosmico Jun 28, 2018

erizocosmico removed their assignment Jul 4, 2018

erizocosmico added the blocked Some other issue is blocking this label Jul 4, 2018

ajnavarro removed this from the Index-2 milestone Aug 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

analyzer: on aggregation queries with indexes, do not call to table when possible. #215

analyzer: on aggregation queries with indexes, do not call to table when possible. #215

ajnavarro commented Jun 13, 2018

erizocosmico commented Jun 28, 2018

erizocosmico commented Jul 4, 2018 •

edited

Loading

erizocosmico commented Jul 4, 2018

kuba-- commented Jul 10, 2018 •

edited

Loading

analyzer: on aggregation queries with indexes, do not call to table when possible. #215

analyzer: on aggregation queries with indexes, do not call to table when possible. #215

Comments

ajnavarro commented Jun 13, 2018

erizocosmico commented Jun 28, 2018

erizocosmico commented Jul 4, 2018 • edited Loading

What are the problems?

What are the use cases of this?

How can this be solved?

erizocosmico commented Jul 4, 2018

kuba-- commented Jul 10, 2018 • edited Loading

erizocosmico commented Jul 4, 2018 •

edited

Loading

kuba-- commented Jul 10, 2018 •

edited

Loading