Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

analyzer: on aggregation queries with indexes, do not call to table when possible. #215

Open
ajnavarro opened this issue Jun 13, 2018 · 4 comments
Labels
blocked Some other issue is blocking this enhancement New feature or request

Comments

@ajnavarro
Copy link
Contributor

No description provided.

@ajnavarro ajnavarro added the enhancement New feature or request label Jun 13, 2018
@ajnavarro ajnavarro added this to the Index-2 milestone Jun 13, 2018
@erizocosmico erizocosmico self-assigned this Jun 28, 2018
@erizocosmico
Copy link
Contributor

Cases:

  • SELECT COUNT(x) FROM t where T has any index and x is any column or *. Replace with a static count of the index entries.
  • SELECT FN(x) FROM t [WHERE ...] where FN is an aggregation function, x is a column for which we have an index in table t, if there are other nodes, we must have an index for all the columns that appear. Replace t with a node that returns all values of the index in full scan.
  • SELECT FN(x)[, y, ...] FROM t GROUP BY foo[, bar, ...] where FN is an aggregation function and all the columns that appear belonging to table t have an index (for all them), replace t with a node that returns all values of the index in full scan (this is a more complicated case of the previous one).

WDYT @ajnavarro @src-d/data-retrieval? Any cases I missed with this?

@erizocosmico
Copy link
Contributor

erizocosmico commented Jul 4, 2018

I've begin implementing this but I have noticed something that gets a little in the way of this:

All this can be generalized into a single rule (not only for aggregations): when we have N columns in use for table T and we have an index for table T for all these columns. If columns are inside a COUNT, they are ignored. If no other columns are used, but a count is, just search for whatever index we may have for this table.

With that, we replace the table node with an iterator that just reads the values from the index and puts nils wherever the index does not have them (those values will never be used, anyway).

That would not only work for aggregations but for any table that does not have an index yet.

What are the problems?

The problems come when we want to get the values from the index.

  • We can't decode the values. Values are gob-encoded and they must be decoded using the concrete type (which we don't have), we can only decode to interface{}, and that does not work. This could be solved doing our own "gob-like" encoder/decoder. go-mysql-server only works with a very well defined set of types, so it would be easy to do.
  • Reconstruct the values from pilosa and the mapping. There is no mapping from rowID to value, afaik, so it's kind of hard (and perhaps very, very slow) to do it that way.

What are the use cases of this?

Full-scan with no already available indexes (if it didn't have an index already, it won't have it with even more columns in the mix).

For aggregations it needs the values to be usable, which, as we discussed before, cannot be done (right now).

So we end up with this being only usable for queries with only this form: SELECT COUNT(whatever) FROM table. If the query has a WHERE it needs the values themselves (index filters are not erased, see #187. Also, no groupbys, other columns in the select, because that means having to read the actual rows and doing some work on them.

This case could be speed up by additionally storing in the mapping the number of rows for the index and replacing the whole query with a static number getting it from the index.

How can this be solved?

For starters, we could do the optimization for SELECT COUNT(whatever) FROM table, given the other optimization requires really heavy changes in the way we do things in the pilosa index.

We need the following things from the index for this to be possible:

  • Be able to get all values for the index.
  • Be able to get the actual values easily and if possible, not very costly (if it's really costly it may not be worth doing at all).
  • Be able to decode the values into interface{}.

Thoughts? @ajnavarro @src-d/data-retrieval

@erizocosmico
Copy link
Contributor

Discussed via slack:

Things that should be done:

  • Implement a way to be able to decode-encode our go-mysql-server types
  • Add a new interface to be used by IndexLookup to return IndexKeyValueIter
  • Implement that interface on pilosa and add the missing mapping

This is kept as paused until pilosa new version is released and we can ditch the mapping altogether.

@erizocosmico erizocosmico removed their assignment Jul 4, 2018
@erizocosmico erizocosmico added the blocked Some other issue is blocking this label Jul 4, 2018
@kuba--
Copy link
Contributor

kuba-- commented Jul 10, 2018

  • Be able to get all values for the index.
    It should be as simple as iteration through BoltDB bucket

  • Be able to decode the values into interface{}.
    What if we try to save these interfaces as a row's attributes in frames?
    func (f *Frame) SetRowAttrs(rowID uint64, attrs map[string]interface{}) *PQLBaseQuery

@ajnavarro ajnavarro removed this from the Index-2 milestone Aug 1, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
blocked Some other issue is blocking this enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants