
SNOW-1643781: Async queries: unable to re-establish connection/keep connection alive #2038

Closed
Kache opened this issue Aug 28, 2024 · 8 comments
Assignees
Labels
question status-triage_done Initial triage done, will be further handled by the driver team

Comments

@Kache
Contributor

Kache commented Aug 28, 2024

Python version

Python 3.11.4 (main, Jan 10 2024, 15:34:31) [Clang 15.0.0 (clang-1500.1.0.2.5)]

Operating system and processor architecture

Linux-6.10.6-orbstack-00249-g92ad2848917c-x86_64-with-debian-10.13

Installed packages

snowflake-connector-python 3.2.0

What did you do?

I want to log the query id without needing to wait for the query to have completed.

Starting from sync usage:

    import logging

    import snowflake.connector
    from snowflake.connector.cursor import DictCursor

    with snowflake.connector.connect(**creds) as conn:
        with conn.cursor(cursor_class=DictCursor) as cur:
            cur.execute(sql_str)

            # logs after (potentially long) query completion
            logging.info(f"Query id: {cur.sfqid}")
            yield from cur

Attempt: use async

            cur.execute_async(sql_str)
            # logs right away
            logging.info(f"Query id: {cur.sfqid}")
            cur.get_results_from_sfqid(cur.sfqid)
            yield from cur

However, the above only works for fast queries. We use ABORT_DETACHED_QUERY = TRUE because we want queries to passively stop if/when the tasks/processes driving the original query stop or terminate, so queries running longer than 5 minutes get canceled:

snowflake.connector.errors.ProgrammingError: 000604: 604: SQL execution canceled

Although the docs say:

To perform asynchronous queries, you must ensure the ABORT_DETACHED_QUERY configuration parameter is FALSE (default value).

They also say:

For asynchronous queries ... If ABORT_DETACHED_QUERY is set to TRUE, Snowflake automatically aborts all in-progress asynchronous queries when a client connection is not re-established after five minutes.

Sounds like just need a way to either keep the connection alive or re-connect.

Attempt: use async, ABORT_DETACHED_QUERY, and active heartbeat queries to "keep alive"

    import itertools as it
    import logging
    import time

    import snowflake.connector
    from snowflake.connector.cursor import ASYNC_RETRY_PATTERN, DictCursor

    query_id = None
    wait_310_sql = "SELECT SYSTEM$WAIT(310)"
    retry_pattern = it.chain(ASYNC_RETRY_PATTERN, it.repeat(ASYNC_RETRY_PATTERN[-1]))
    tstart = time.monotonic()
    t_heartbeat = tstart

    with snowflake.connector.connect(**creds) as conn:
        with conn.cursor(cursor_class=DictCursor) as cur:
            cur.execute('ALTER SESSION SET ABORT_DETACHED_QUERY = TRUE')
            cur.execute_async(wait_310_sql)
            query_id = cur.sfqid
            assert query_id
            logging.info(f"Query ID: {query_id}")

            # wait loop based on wait_until_ready() from cursor.get_results_from_sfqid()
            while True:
                status = cur.connection.get_query_status_throw_if_error(query_id)
                now = time.monotonic()
                logging.info(f"{now - tstart:>6.1f} Query status: {status}")
                if not cur.connection.is_still_running(status):
                    break

                if now - t_heartbeat > 60 * 2:  # every ~2 mins
                    cur.execute("SELECT 'heartbeat'")  # try to keep connection alive
                    t_heartbeat = time.monotonic()

                time.sleep(0.5 * next(retry_pattern))

            cur.get_results_from_sfqid(query_id)
            logging.info(cur.fetchall())

However, the query gets cancelled all the same in 5 minutes.

A "re-connect" strategy, opening a new connection every couple of minutes rather than reusing the same one, is similarly ineffective.

What did you expect to see?

Expected to be able to prevent an async query from being canceled even though ABORT_DETACHED_QUERY = TRUE by either actively keeping the connection alive or by actively re-connecting.

Keeping ABORT_DETACHED_QUERY = TRUE is desirable because we want queries to passively stop if/when the tasks/processes driving the original query stop/terminate.

Can you set logging to DEBUG and collect the logs?

n/a

@github-actions github-actions bot changed the title Async queries: unable to re-establish connection/keep connection alive SNOW-1643781: Async queries: unable to re-establish connection/keep connection alive Aug 28, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Aug 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage and removed needs triage labels Aug 30, 2024
@sfc-gh-dszmolka
Contributor

Hi - thanks for submitting this issue, and especially for the reproduction. Taking a look.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Sep 2, 2024

I can confirm the async query gets cancelled anyway after 5 minutes, even with the additional 'heartbeat' queries happening every 2 minutes. Apparently, a single query will not be recognized as a heartbeat. We have a private function that sends a request to Snowflake to keep the connection alive (this happens when client_session_keep_alive is set) - I will check how it can be utilized in this scenario to properly send requests to /session/heartbeat.

Edit: replaced the SELECT 'heartbeat' with conn.rest._heartbeat(), which, confirmed from the logs, correctly reaches the heartbeat endpoint but still did not prevent the SQL from being canceled after 5 minutes.
We need to investigate further.

@sfc-gh-dszmolka sfc-gh-dszmolka removed their assignment Sep 2, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Sep 2, 2024
@sfc-gh-dszmolka
Contributor

Tested various methods to no avail, until my colleague @sfc-gh-hachouraria figured out that sending a request to /queries/<queryId>/result allows the session to continue past the 5-minute mark.

This is done by issuing cur.query_result(query_id); tested it with your repro program, and indeed these calls to query_result, issued instead of the heartbeat calls, allowed the detached async query to live more than 5 minutes even with ABORT_DETACHED_QUERY=TRUE and run to completion.

Of course there aren't really any query results in the query_result until the query actually finishes, so it probably needs to be wrapped in some logic when used as an alternative "heartbeat" to prevent the detached async query from being aborted, but perhaps it could help in your situation.

Can you please try it and let us know if it works for you?
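For illustration, that wrapping logic might look like the following sketch. `heartbeat_via_query_result` is a hypothetical helper; it assumes `cur.query_result()` raises `ProgrammingError` while the query is still executing and returns normally once results are available:

```python
def heartbeat_via_query_result(cur, query_id):
    """Use cur.query_result() as a keep-alive for a detached async query.

    Hypothetical helper, sketch only: assumes query_result() raises
    ProgrammingError while the query is still executing, and returns
    normally once results are available.
    """
    from snowflake.connector.errors import ProgrammingError
    try:
        cur.query_result(query_id)
        return True    # query finished; results are loaded into the cursor
    except ProgrammingError:
        return False   # still running; the underlying GET served as the heartbeat
```

The caller would invoke this periodically inside its wait loop instead of the plain `SELECT 'heartbeat'` query.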

@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Sep 3, 2024
@Kache
Contributor Author

Kache commented Sep 10, 2024

Switching out cur.execute("SELECT 'heartbeat'") for cur.query_result(query_id) alone didn't work for me. I observed:

    {
        'code': '333333',
        'data': {
            'getResultUrl': '/queries/01b6f149-030a-5b62-76fd-87016f53f77f/result',
            'progressDesc': None,
            'queryAbortsAfterSecs': 300,
            'queryId': '01b6f149-030a-5b62-76fd-87016f53f77f',
        },
        'message': 'Query execution in progress.  Query will be aborted unless request for result is received within 300 seconds.',
        'success': True,
    }

In addition:

  • letting it fall to the not ret.get("success") case will raise ProgrammingError with the message in the response
  • using as the heartbeat: cur.connection.rest.request(url=f"/queries/{query_id}/result", method="get") (i.e. cur.query_result(query_id) without response handling) at first try does seem to work

I'll keep experimenting a bit more
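Folding that raw request into the earlier wait loop might look roughly like this. A sketch only: `wait_for_query` is a hypothetical helper, and the `conn.rest.request()` signature is the one shown in the bullet above:

```python
import time

def wait_for_query(conn, query_id, heartbeat_every=120, poll_every=2.0):
    """Poll an async query's status, issuing a raw GET on
    /queries/<id>/result as a periodic keep-alive (hypothetical helper).
    """
    last_beat = time.monotonic()
    while True:
        status = conn.get_query_status_throw_if_error(query_id)
        if not conn.is_still_running(status):
            return status
        now = time.monotonic()
        if now - last_beat > heartbeat_every:
            # Raw GET: may block ~40s server-side, but appears to reset
            # the 5-minute ABORT_DETACHED_QUERY timer.
            conn.rest.request(url=f"/queries/{query_id}/result", method="get")
            last_beat = time.monotonic()
        time.sleep(poll_every)
```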

@Kache
Contributor Author

Kache commented Sep 11, 2024

Since the endpoint takes 30-50 sec to respond, I'm thinking of running it in a thread with daemon=True to "fire and forget" the heartbeat, rather than block on an unused response.

Just a little worried about potential bad resource cleanup though
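A minimal sketch of that fire-and-forget idea, assuming the raw heartbeat request discussed above (`fire_and_forget` is a hypothetical helper):

```python
import threading

def fire_and_forget(fn, *args, **kwargs):
    """Run fn in a daemon thread so the caller is not blocked on the
    ~30-50s heartbeat response (hypothetical helper).

    Caveat noted above: daemon threads are killed abruptly at interpreter
    exit, so fn must not hold resources that need orderly cleanup.
    """
    t = threading.Thread(target=fn, args=args, kwargs=kwargs, daemon=True)
    t.start()
    return t

# Hypothetical usage with the raw heartbeat request:
# fire_and_forget(conn.rest.request,
#                 url=f"/queries/{query_id}/result", method="get")
```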

@sfc-gh-yuwang
Contributor

Hi, we have confirmed that when ABORT_DETACHED_QUERY is set to TRUE, the only way to keep a query alive is to call cur.query_result(query_id) to fetch that query's result. But this turns the query into a sync query, which is not affected by the parameter. That being said, no async query can be kept alive indefinitely when ABORT_DETACHED_QUERY is set to TRUE.
In the meantime, we will update our docs to clarify the usage of ABORT_DETACHED_QUERY.
If you do want an async query to stay alive even when ABORT_DETACHED_QUERY is set to TRUE, please file a ticket with us. Thank you!

@Kache
Contributor Author

Kache commented Oct 14, 2024

In my testing, that GET returns early if the query completes, and otherwise blocks Snowflake-side for up to ~40 sec before returning "Query execution in progress".

This means that overall the query does remain async, and this is a valid (but clunky) workaround.

I would like to be able to keep async queries alive even if ABORT_DETACHED_QUERY is set to True, and there are a handful of possible options:

With zero Snowflake support:

  • Use the clunky "semi-blocking heartbeat", an undocumented but technically public API
    • have clients block ~40s at a time
    • block in a separate thread, to not affect main thread
      • with normal threads -- tricky/clunky thread joining/cleanup
      • with daemon threads -- risk of bad resource cleanup on process exit/kill
  • Reverse engineer private implementation of cur.connection.rest.request() to hit the "semi-blocking heartbeat" using alternative non-blocking HTTP client

Various things Snowflake could support:

  • Provide a similar heartbeat endpoint (or configure the existing one) to not block. While better, clients would ideally still want "fire and forget" usage via a non-blocking HTTP client, to avoid waiting for the request response
  • Officially provide the non-blocking HTTP heartbeat request in the Snowflake connector python client, e.g. conn.heartbeat()

Which of these (or any other) options can be considered?

@sfc-gh-yuwang
Contributor

Hi @Kache,
Both of the ways you mentioned look good to me; I'll file a ticket to the server side and let them decide what they should do. This is going to be a server-side change that Snowpark has no control over. Thanks for your advice!
