
SNOW-1643781: Async queries: unable to re-establish connection/keep connection alive #2038

Closed
Kache opened this issue Aug 28, 2024 · 8 comments
Assignees
Labels
question status-triage_done Initial triage done, will be further handled by the driver team

Comments

@Kache
Contributor

Kache commented Aug 28, 2024

Python version

Python 3.11.4 (main, Jan 10 2024, 15:34:31) [Clang 15.0.0 (clang-1500.1.0.2.5)]

Operating system and processor architecture

Linux-6.10.6-orbstack-00249-g92ad2848917c-x86_64-with-debian-10.13

Installed packages

snowflake-connector-python 3.2.0

What did you do?

I want to log the query id without needing to wait for the query to have completed.

Starting from sync usage:

    import logging

    import snowflake.connector
    from snowflake.connector.cursor import DictCursor

    with snowflake.connector.connect(**creds) as conn:
        with conn.cursor(cursor_class=DictCursor) as cur:
            cur.execute(sql_str)

            # logs after (potentially long) query completion
            logging.info(f"Query id: {cur.sfqid}")
            yield from cur

Attempt: use async

            cur.execute_async(sql_str)
            # logs right away
            logging.info(f"Query id: {cur.sfqid}")
            cur.get_results_from_sfqid(cur.sfqid)
            yield from cur

However, the above only works for fast queries. We use ABORT_DETACHED_QUERY = TRUE because we want queries to passively stop if/when the tasks/processes driving the original query stop or terminate, so queries running longer than 5 minutes get canceled:

snowflake.connector.errors.ProgrammingError: 000604: 604: SQL execution canceled

Although the docs say:

To perform asynchronous queries, you must ensure the ABORT_DETACHED_QUERY configuration parameter is FALSE (default value).

They also say:

For asynchronous queries ... If ABORT_DETACHED_QUERY is set to TRUE, Snowflake automatically aborts all in-progress asynchronous queries when a client connection is not re-established after five minutes.

Sounds like just need a way to either keep the connection alive or re-connect.

Attempt: use async, ABORT_DETACHED_QUERY, and active heartbeat queries to "keep alive"

    import itertools as it
    import logging
    import time

    import snowflake.connector
    from snowflake.connector.cursor import ASYNC_RETRY_PATTERN, DictCursor

    query_id = None
    wait_310_sql = "SELECT SYSTEM$WAIT(310)"
    retry_pattern = it.chain(ASYNC_RETRY_PATTERN, it.repeat(ASYNC_RETRY_PATTERN[-1]))
    tstart = time.monotonic()
    t_heartbeat = tstart

    with snowflake.connector.connect(**creds) as conn:
        with conn.cursor(cursor_class=DictCursor) as cur:
            cur.execute('ALTER SESSION SET ABORT_DETACHED_QUERY = TRUE')
            cur.execute_async(wait_310_sql)
            query_id = cur.sfqid
            assert query_id
            logging.info(f"Query ID: {query_id}")

            # wait loop based on wait_until_ready() from cursor.get_results_from_sfqid()
            while True:
                status = cur.connection.get_query_status_throw_if_error(query_id)
                now = time.monotonic()
                logging.info(f"{now - tstart:>6.1f} Query status: {status}")
                if not cur.connection.is_still_running(status):
                    break

                if now - t_heartbeat > 60 * 2:  # every ~2 mins
                    cur.execute("SELECT 'heartbeat'")  # try to keep connection alive
                    t_heartbeat = time.monotonic()

                time.sleep(0.5 * next(retry_pattern))

            cur.get_results_from_sfqid(query_id)
            logging.info(cur.fetchall())

However, the query gets cancelled all the same in 5 minutes.

A "re-connect" strategy, opening a new connection every couple of minutes rather than reusing the same one, is similarly ineffective.

What did you expect to see?

Expected to be able to prevent an async query from being canceled even though ABORT_DETACHED_QUERY = TRUE by either actively keeping the connection alive or by actively re-connecting.

Keeping ABORT_DETACHED_QUERY = TRUE is desirable because we want queries to passively stop if/when the tasks/processes driving the original query stop/terminate.

Can you set logging to DEBUG and collect the logs?

n/a

@github-actions github-actions bot changed the title Async queries: unable to re-establish connection/keep connection alive SNOW-1643781: Async queries: unable to re-establish connection/keep connection alive Aug 28, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Aug 30, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage Issue is under initial triage and removed needs triage labels Aug 30, 2024
@sfc-gh-dszmolka
Contributor

Hi - thanks for submitting this issue, and especially for the reproduction. Taking a look.

@sfc-gh-dszmolka
Contributor

sfc-gh-dszmolka commented Sep 2, 2024

I can confirm the async query gets cancelled anyway after 5 minutes, even with the additional 'heartbeat' queries happening every 2 minutes. Apparently, a single query will not be recognized as a heartbeat. We have a private function that sends a request to Snowflake to keep the connection alive (this happens when client_session_keep_alive is set) - I will check how it can be utilized in this scenario to properly send requests to /session/heartbeat.

Edit: replaced the SELECT 'heartbeat' with conn.rest._heartbeat(), which, confirmed from the logs, correctly reaches the heartbeat endpoint but still did not prevent the SQL from being canceled after 5 minutes.
We need to investigate further.

@sfc-gh-dszmolka sfc-gh-dszmolka removed their assignment Sep 2, 2024
@sfc-gh-dszmolka sfc-gh-dszmolka added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Sep 2, 2024
@sfc-gh-dszmolka
Contributor

Tested various methods to no avail, until my colleague @sfc-gh-hachouraria figured out that sending a request to /queries/<queryId>/result allows the session to continue past the 5-minute mark.

This is done by issuing cur.query_result(query_id); tested it with your repro program, and indeed these calls to query_result, issued instead of the heartbeat calls, allowed the detached async query to live more than 5 minutes even with ABORT_DETACHED_QUERY=TRUE and run to completion.

Of course there aren't really any query results in the query_result until the query actually finishes, so it probably needs to be wrapped in some logic when used as an alternative "heartbeat" to prevent the detached async query from being aborted, but perhaps it could help in your situation.

Can you please try it and let us know if it works for you?
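For illustration, that wrapping logic might look like the following sketch. `heartbeat_via_query_result` is a hypothetical helper; it assumes `cur.query_result()` raises `ProgrammingError` while the query is still executing and returns normally once results are available:

```python
def heartbeat_via_query_result(cur, query_id):
    """Use cur.query_result() as a keep-alive for a detached async query.

    Hypothetical helper, sketch only: assumes query_result() raises
    ProgrammingError while the query is still executing, and returns
    normally once results are available.
    """
    from snowflake.connector.errors import ProgrammingError
    try:
        cur.query_result(query_id)
        return True    # query finished; results are loaded into the cursor
    except ProgrammingError:
        return False   # still running; the underlying GET served as the heartbeat
```

The caller would invoke this periodically inside its wait loop instead of the plain `SELECT 'heartbeat'` query.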

@sfc-gh-dszmolka sfc-gh-dszmolka self-assigned this Sep 3, 2024
@Kache
Contributor Author

Kache commented Sep 10, 2024

Switching out cur.execute("SELECT 'heartbeat'") for cur.query_result(query_id) alone didn't work for me. I observed:

    {
        'code': '333333',
        'data': {
            'getResultUrl': '/queries/01b6f149-030a-5b62-76fd-87016f53f77f/result',
            'progressDesc': None,
            'queryAbortsAfterSecs': 300,
            'queryId': '01b6f149-030a-5b62-76fd-87016f53f77f',
        },
        'message': 'Query execution in progress.  Query will be aborted unless request for result is received within 300 seconds.',
        'success': True,
    }

In addition:

  • letting it fall to the not ret.get("success") case will raise ProgrammingError with the message in the response
  • using as the heartbeat: cur.connection.rest.request(url=f"/queries/{query_id}/result", method="get") (i.e. cur.query_result(query_id) without response handling) at first try does seem to work

I'll keep experimenting a bit more
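Folding that raw request into the earlier wait loop might look roughly like this. A sketch only: `wait_for_query` is a hypothetical helper, and the `conn.rest.request()` signature is the one shown in the bullet above:

```python
import time

def wait_for_query(conn, query_id, heartbeat_every=120, poll_every=2.0):
    """Poll an async query's status, issuing a raw GET on
    /queries/<id>/result as a periodic keep-alive (hypothetical helper).
    """
    last_beat = time.monotonic()
    while True:
        status = conn.get_query_status_throw_if_error(query_id)
        if not conn.is_still_running(status):
            return status
        now = time.monotonic()
        if now - last_beat > heartbeat_every:
            # Raw GET: may block ~40s server-side, but appears to reset
            # the 5-minute ABORT_DETACHED_QUERY timer.
            conn.rest.request(url=f"/queries/{query_id}/result", method="get")
            last_beat = time.monotonic()
        time.sleep(poll_every)
```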

@Kache
Contributor Author

Kache commented Sep 11, 2024

Since the endpoint takes 30-50 sec to respond, I'm thinking of running it in a thread with daemon=True to "fire and forget" the heartbeat, rather than block on an unused response.

Just a little worried about potential bad resource cleanup though
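A minimal sketch of that fire-and-forget idea, assuming the raw heartbeat request discussed above (`fire_and_forget` is a hypothetical helper):

```python
import threading

def fire_and_forget(fn, *args, **kwargs):
    """Run fn in a daemon thread so the caller is not blocked on the
    ~30-50s heartbeat response (hypothetical helper).

    Caveat noted above: daemon threads are killed abruptly at interpreter
    exit, so fn must not hold resources that need orderly cleanup.
    """
    t = threading.Thread(target=fn, args=args, kwargs=kwargs, daemon=True)
    t.start()
    return t

# Hypothetical usage with the raw heartbeat request:
# fire_and_forget(conn.rest.request,
#                 url=f"/queries/{query_id}/result", method="get")
```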

@sfc-gh-yuwang
Contributor

Hi, we have confirmed that when ABORT_DETACHED_QUERY is set to TRUE, the only way to keep a query alive is to call cur.query_result(query_id) to fetch that query's result. But this turns the query into a sync query, which is not affected by the parameter. That being said, no async query can be kept alive indefinitely when ABORT_DETACHED_QUERY is set to TRUE.
In the meantime, we will update our docs to clarify the usage of ABORT_DETACHED_QUERY.
If you do want an async query to stay alive even when ABORT_DETACHED_QUERY is set to TRUE, please file a ticket with us. Thank you!

@Kache
Contributor Author

Kache commented Oct 14, 2024

In my testing, that GET returns early if the query completes, and otherwise blocks Snowflake-side for up to ~40 sec before returning "Query execution in progress".

This means that overall the query does remain async, and this is a valid (but clunky) workaround.

I would like to be able to keep async queries alive even if ABORT_DETACHED_QUERY is set to True, and there are a handful of possible options:

With zero Snowflake support:

  • Use the clunky "semi-blocking heartbeat", an undocumented but technically public API
    • have clients block ~40s at a time
    • block in a separate thread, to not affect main thread
      • with normal threads -- tricky/clunky thread joining/cleanup
      • with daemon threads -- risk of bad resource cleanup on process exit/kill
  • Reverse engineer private implementation of cur.connection.rest.request() to hit the "semi-blocking heartbeat" using alternative non-blocking HTTP client

Various things Snowflake could support:

  • Provide a similar heartbeat endpoint (or configure the existing one) to not block. While better, clients would ideally still want "fire and forget" usage via a non-blocking HTTP client, to avoid waiting for the request response
  • Officially provide the non-blocking HTTP heartbeat request in the Snowflake connector python client, e.g. conn.heartbeat()

Which of these (or any other) options can be considered?

@sfc-gh-yuwang
Contributor

Hi @Kache,
Both of the ways you mentioned look good to me; I'll file a ticket to the server side and let them decide what they should do. This is going to be a server-side change that Snowpark has no control over. Thanks for your advice!
