feat: Apply row limit transform to query in backend #1461

Open

baumandm wants to merge 4 commits into master from external/backend-limit

Conversation

@baumandm (Contributor) commented Jun 17, 2024

This PR addresses the issue where a user creates and tests a DataDoc manually with automatic row limits applied, but once the DataDoc is scheduled, it runs without those limits.

If the Query Engine has the (Experimental) Enable Row Limit feature enabled, all row limit transforms are now performed in the backend, for both adhoc and scheduled queries. If this feature is disabled, nothing changes.

Summary of changes:

  • Creating a new DataDoc cell now initializes it in the database with the default row limit (previously unset)
  • POST /query_execution now accepts an optional row_limit param and applies a limit transform if configured (a sketch of such a transform follows this list); this is provided for adhoc executions from either the adhoc editor or a DataDoc cell
  • The celery task run_datadoc now looks up the DataDoc cell metadata, and applies a limit transform if configured
  • The Run All feature also uses the run_datadoc task, so it behaves the same as scheduled DataDocs
  • The frontend no longer applies the row limit transform on the client side, but it still performs a limit check to enable the 'Your SELECT query is unbounded' popup; this feature continues to work as before, allowing the user to confirm or cancel the execution. The original query is sent to /query_execution along with the row_limit
  • Added an admin_logic.get_engine_feature_param() method to make it easier to get a single param
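
As a point of reference, the core of such a limit transform can be sketched with sqlglot (which comes up later in this thread). The function name, the SELECT-only guard, and the assumption that the engine language maps directly to a sqlglot dialect are illustrative, not the PR's actual implementation:

```python
import sqlglot
from sqlglot import exp


def apply_row_limit(query: str, row_limit: int, language: str) -> str:
    """Hypothetical sketch: add a LIMIT to a plain SELECT that has none."""
    # Assumes the engine language maps to a sqlglot dialect name (e.g. "mysql", "presto").
    statement = sqlglot.parse_one(query, read=language)
    # Leave non-SELECT statements and already-limited queries untouched.
    if isinstance(statement, exp.Select) and statement.args.get("limit") is None:
        statement = statement.limit(row_limit)
    return statement.sql(dialect=language)
```

For example, apply_row_limit("SELECT * FROM events", 1000, "mysql") would produce SELECT * FROM events LIMIT 1000, while an INSERT or an already-limited SELECT would pass through unchanged.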

I considered a number of different approaches for this PR and picked this one because it required the fewest, simplest changes, but I can refactor if needed. Happy to discuss!

rowLimit,
engine
);
await checkUnlimitedQuery(sampledQuery, rowLimit, engine);

Collaborator:

What about still keeping the old logic here, but only replacing the transformLimitedQuery call with the backend API, the way sampledQuery does? You probably thought about this; what is the concern?

By the way, I think we should keep the logic consistent for both the sampling and the limited query.

Contributor (author):

I had two thoughts:

  1. There would be three API calls required to prep a query before sending it to be executed, which seemed like too much. I was thinking about creating a single unified endpoint that combined templating/sampling/limiting, but that required even more changes
  2. Scheduled DataDocs don't go through that flow, so it doesn't make it simpler; it just adds a new endpoint purely for adhoc executions

That said, I'm fine with refactoring to use a new endpoint for this. If it returned a flag for unlimited queries, we could remove even more frontend code.

Collaborator:

You raised a good point. I was also thinking of merging all the transforms into one single API endpoint. Maybe we could try that route?

Contributor (author):

I combined sampling/limiting into POST /query/transform/, but left the individual endpoints: the sampling preview uses /query/transform/sampling/, and there may be some use for a standalone limit transform as well.
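
For the combined endpoint, here is one possible shape for the server-side core. The function names, the unbounded flag in the response, and the omission of sampling are assumptions made to keep the sketch short; it only illustrates how the backend could both report unboundedness (driving the existing popup) and apply the limit in one call:

```python
from typing import Optional

import sqlglot
from sqlglot import exp


def is_unbounded_select(query: str, language: str) -> bool:
    """Hypothetical check behind the 'Your SELECT query is unbounded' popup."""
    statement = sqlglot.parse_one(query, read=language)
    return isinstance(statement, exp.Select) and statement.args.get("limit") is None


def transform_query(query: str, language: str, row_limit: Optional[int] = None) -> dict:
    """Hypothetical core of a combined POST /query/transform/ handler."""
    unbounded = is_unbounded_select(query, language)
    transformed = query
    if unbounded and row_limit is not None and row_limit >= 0:
        # Safe to call .limit() here: the query parsed as a SELECT without a LIMIT.
        transformed = (
            sqlglot.parse_one(query, read=language).limit(row_limit).sql(dialect=language)
        )
    # Returning the flag lets the frontend keep its confirm/cancel popup without
    # re-implementing the check client-side.
    return {"query": transformed, "unbounded": unbounded}
```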

}

if (rowLimit != null && rowLimit >= 0) {
    return getLimitedQuery(query, rowLimit, engine.language);

Collaborator:

Maybe delete the getLimitedQuery function, since it's no longer needed.

baumandm force-pushed the external/backend-limit branch from f3fcd1f to b8a33e0 on June 20, 2024 at 16:54
baumandm force-pushed the external/backend-limit branch from b8a33e0 to 1164dde on June 20, 2024 at 17:08

@jczhong84 (Collaborator):

@baumandm thanks for the PR and all the updates! LGTM.

Just one more question: with this change, will all the scheduled docs automatically have the limit enabled, since most of the query cells have the default limit? Don't want to surprise the users.

@baumandm (Contributor, author):

> Just one more question: with this change, will all the scheduled docs automatically have the limit enabled, since most of the query cells have the default limit? Don't want to surprise the users.

Good question; that was something I thought about as well. Previously, DataDoc cells didn't store the limit value in the meta column unless the user manually changed it from the default. So after deploying this change, any scheduled DataDoc cells where the user has manually adjusted the row limit will automatically start being limited. But any existing cells that have never had the limit changed will continue running unlimited (when scheduled).

This PR includes a change so that new DataDoc cells will be initialized with the default row limit from the frontend, to avoid this kind of situation in the future.

If this change is something you are worried about, the only thing I can think of would be to run a SQL command to unset every DataCell's meta.limit field and let users adjust as needed. But that seems pretty dramatic and could have negative consequences in the other direction.

For us, since it only applies to SELECT queries, the impact should be low and the potential improvements are high, so we are just planning to communicate the change rather than try to mitigate it.
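
In other words, the scheduled path only limits cells whose meta already carries a limit. A minimal sketch of that lookup, assuming the meta key is "limit" and that a negative value means "no limit" (mirroring the frontend's rowLimit >= 0 check); this is illustrative, not the PR's actual code:

```python
from typing import Optional


def get_scheduled_row_limit(cell_meta: dict, row_limit_feature_enabled: bool) -> Optional[int]:
    """Hypothetical lookup of the row limit a scheduled run would apply (None = unlimited)."""
    if not row_limit_feature_enabled:
        # Engine feature flag off: behavior is unchanged by this PR.
        return None
    limit = cell_meta.get("limit")
    if limit is None:
        # Older cells never stored meta.limit, so they keep running unlimited when scheduled.
        return None
    if limit < 0:
        # Negative is treated as "no limit" (assumed, mirroring the frontend check).
        return None
    return int(limit)
```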

@jczhong84 (Collaborator):

How about we add a flag in the doc schedule modal, so people can explicitly choose to enable or disable query limiting and sampling? By default it would be disabled.

@baumandm (Contributor, author):

Does this sound correct:

  1. Schedule modal contains a toggle to apply query transforms (sampling, limiting), as sketched below this list
  2. Toggle defaults to disabled for all existing schedules (where it's not configured)
  3. Toggle defaults to disabled for new schedules (or should this be enabled?)
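
If we go this route, the toggle could be as simple as one extra flag read by the scheduled run; the flag name apply_transforms below is hypothetical, and defaults to disabled as discussed:

```python
def should_apply_transforms(schedule_config: dict) -> bool:
    """Hypothetical helper for the scheduled run: transforms stay off unless the
    schedule explicitly opts in, which also covers existing schedules where the
    flag was never configured."""
    return bool(schedule_config.get("apply_transforms", False))
```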

@jczhong84 (Collaborator):

That sounds exactly right! For the 3rd one, I don't have a strong opinion; maybe start with disabled by default.

@adamstruck (Contributor):

Any chance this will be merged soon?

@baumandm (Contributor, author):

I believe we ran into an issue where, in some scenarios, running a query through Sqlglot made breaking changes to the query instead of just adding a limit. So we put it on the back burner and got pulled into other priorities.

Our plan is to return to this feature and make it optional, as discussed above. That would provide a workaround for any queries that Sqlglot mangles. I don't have a timeline for this, but I'll try to refactor the PR and push an update.
