[PECO-1263] Implement a .returned_as_direct_result property for AsyncExecution status #325

susodapop · 2024-01-22T20:05:11Z

Description

This PR implements an additional boolean flag on AsyncExecution that indicates whether the execution returned directResults. When it did, the results cannot be fetched from the server again, because the operation behind a directResults query is closed as soon as the query returns.

I determined this is necessary while documenting the new execute_async() behaviour. For context, when we send a TExecuteStatementReq to the thrift server, there are two possible return conditions:

1. If the query completes within five seconds, the resulting TExecuteStatementResp will actually include the results!
2. If the query takes longer than five seconds to complete, the resulting TExecuteStatementResp will only include the query_id for the query, which we can use to poll for the results until they are ready.
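
The two return conditions can be pictured with a small simulation. This is only a sketch: `FakeExecuteStatementResp`, `fake_execute_statement`, and `DIRECT_RESULTS_WINDOW_SECONDS` are hypothetical stand-ins invented for illustration, not the Thrift types or the connector's API, and the five-second threshold is simply the figure quoted above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: threshold copied from the description; the real value is server-controlled.
DIRECT_RESULTS_WINDOW_SECONDS = 5.0


@dataclass
class FakeExecuteStatementResp:
    """Hypothetical stand-in for TExecuteStatementResp (not the real Thrift type)."""

    query_id: str
    direct_results: Optional[List[tuple]] = None  # populated only in the first case

    @property
    def has_direct_results(self) -> bool:
        return self.direct_results is not None


def fake_execute_statement(
    query_id: str, runtime_seconds: float, rows: List[tuple]
) -> FakeExecuteStatementResp:
    """Simulate the two possible responses for a query with the given runtime."""
    if runtime_seconds <= DIRECT_RESULTS_WINDOW_SECONDS:
        # First case: the results ride along in the initial response.
        return FakeExecuteStatementResp(query_id=query_id, direct_results=rows)
    # Second case: only the query_id comes back; the caller must poll for results.
    return FakeExecuteStatementResp(query_id=query_id)


fast = fake_execute_statement("q-fast", runtime_seconds=1.2, rows=[(1,)])
slow = fake_execute_statement("q-slow", runtime_seconds=30.0, rows=[(1,)])
```

In this sketch only `fast` carries rows; `slow` carries nothing but its `query_id`, which is the situation where polling from another thread remains possible.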

In case 1, the results are sent within the initial TExecuteStatementResp and cannot be fetched a second time. What this means for users is that if they are going to call .serialize() to persist a query_id and secret for use by another thread, they need a way to verify that the results will actually be available to that thread.

This new .returned_as_direct_result property makes that easier.

One unfortunate implication of this behaviour is that users cannot use one thread to kick off all their queries and a separate thread to fetch the results, because not all AsyncExecution objects can actually be picked up separately.
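
A caller could work around this by checking the flag before handing an execution off. The sketch below assumes a hypothetical `StubAsyncExecution` that exposes only `.returned_as_direct_result` and `.serialize()`; the string returned by the stub's `serialize()` is made up, since the real serialized format is not shown in this PR. `split_for_handoff` is likewise an illustrative helper, not part of the connector.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class StubAsyncExecution:
    """Hypothetical stand-in for AsyncExecution; only the two members used here."""

    query_id: str
    returned_as_direct_result: bool

    def serialize(self) -> str:
        # Assumption: placeholder format, not the connector's real output.
        return f"stub:{self.query_id}"


def split_for_handoff(
    executions: Iterable[StubAsyncExecution],
) -> Tuple[List[str], List[StubAsyncExecution]]:
    """Separate executions another thread can pick up from ones that must be read here."""
    handoff: List[str] = []
    local: List[StubAsyncExecution] = []
    for ae in executions:
        if ae.returned_as_direct_result:
            # The operation is already closed server-side, so the results
            # exist only on this object and cannot be fetched elsewhere.
            local.append(ae)
        else:
            handoff.append(ae.serialize())
    return handoff, local


handoff, local = split_for_handoff(
    [
        StubAsyncExecution("q1", returned_as_direct_result=True),
        StubAsyncExecution("q2", returned_as_direct_result=False),
    ]
)
```

Here `q1` stays with the submitting thread, while `q2` is serialized for a separate fetcher.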

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

benc-db · 2024-01-22T20:45:42Z

If the query completes within five seconds, the resulting TExecuteStatementResp will actually include the results!
If the query takes longer than five seconds to complete, the resulting TExecuteStatementResp will only include the query_id for the query, which we can use to poll for the results until they are ready.

Oh man, why? This is really unfortunate behavior.

benc-db · 2024-01-22T20:48:33Z

src/databricks/sql/ae.py

@@ -225,6 +228,15 @@ def last_sync_timestamp(self) -> Optional[datetime]:
        """The timestamp of the last time self.status was synced with the server"""
        return self._last_sync_timestamp

+    @property
+    def is_available(self) -> bool:


How long is it available for? How does this limitation work in the CUJ of recovering from a user-space crash?

I'm going to check with the thrift server folks to get that answer. I believe it's available for a few hours. With cloud fetch enabled it's 24 hours.

What do you think of the name is_available? I'd like a name that's reflective but also easy enough to type.

It's really a tough one; I think this might be a little misleading, because isn't the answer 'Unknown' for the crash recovery CUJ?

I don't think so. In this situation, if the client crashes it will do so before the query_id is returned, which means there would be no way to pick up the execution anyway.

Ah, so if it gets a query_id, then it must be available, unless it's been a long time? Although, again, available doesn't sound quite right because the query could still be executing. Maybe just returned_as_direct_result?

Good idea. I've done the rename and pushed.

Note: we could get around this by allowing users to disable directResults when the TExecuteStatementReq is emitted.


jadewang-db · 2024-01-22T22:16:38Z

is really unfortunate behavior.

will this cause any issues?

benc-db · 2024-01-22T22:17:42Z

is really unfortunate behavior.

will this cause any issues?

Jade, this is the issue:

In case 1, the result is sent within the initial TExecuteStatementResp and they cannot be fetched a second time. What this means for users is that if they are going to call .serialize() to persist a query_id and secret for use by another thread, they need a way to verify that the results will actually be available to another thread.

jadewang-db · 2024-01-22T23:51:14Z

src/databricks/sql/ae.py

@@ -81,6 +81,7 @@ class AsyncExecution:
    ]
    _last_sync_timestamp: Optional[datetime] = None
    _result_set: Optional["ResultSet"] = None
+    _returned_as_direct_result: bool = False


Do we need to change the get_result method to also check this flag?
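
One way such a check could look is sketched below. `SketchAsyncExecution` is a hypothetical stand-in: the field names `_result_set` and `_returned_as_direct_result` mirror the diff above, but `_fetch_from_server`, `fetch_calls`, and the behaviour of `get_result` are assumptions made for illustration and do not describe what this PR actually implements.

```python
from typing import List, Optional


class SketchAsyncExecution:
    """Hypothetical stand-in; field names mirror the diff, the behaviour is assumed."""

    def __init__(self, direct_rows: Optional[List[tuple]] = None):
        self._result_set: Optional[List[tuple]] = direct_rows
        self._returned_as_direct_result: bool = direct_rows is not None
        self.fetch_calls = 0  # counts simulated round trips, for demonstration only

    def _fetch_from_server(self) -> List[tuple]:
        # Placeholder for a real network fetch.
        self.fetch_calls += 1
        return [("fetched",)]

    def get_result(self) -> Optional[List[tuple]]:
        if self._returned_as_direct_result:
            # Never re-fetch: the server has already closed the operation.
            return self._result_set
        if self._result_set is None:
            self._result_set = self._fetch_from_server()
        return self._result_set


direct = SketchAsyncExecution(direct_rows=[(1,)])
polled = SketchAsyncExecution()
direct_rows = direct.get_result()
polled_rows = polled.get_result()
polled.get_result()  # second call reuses the cached result set
```

The point of the guard is that a direct-result execution answers from its cached result set and never issues a fetch against a closed operation.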

Implement an .is_available property for AsyncExecution status

bc48178

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

susodapop marked this pull request as ready for review January 22, 2024 20:06

susodapop requested review from arikfr, yunbodeng-db and andrefurlan-db as code owners January 22, 2024 20:06

susodapop requested review from benc-db and jadewang-db January 22, 2024 20:09

susodapop mentioned this pull request Jan 22, 2024

[PECO-1263] Add documentation for execute_async #322

Open

benc-db reviewed Jan 22, 2024


Jesse Whitehouse added 5 commits January 22, 2024 16:42

Revise logic for checking for direct results after conversation with thrift server team

c0d35a7

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

Fix: when moving ResultSet to results.py I didn't instantiate a logger

6337a89

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

Rename .is_available to .returned_as_direct_result

dde90d9

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

Missed these in the last commit...whoops

8318df7

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

Fix outdated assertion

e4d6c1a

Signed-off-by: Jesse Whitehouse <jesse.whitehouse@databricks.com>

susodapop changed the title ~~[PECO-1263] Implement an .is_available property for AsyncExecution status~~ [PECO-1263] Implement a .returned_as_direct_result property for AsyncExecution status Jan 22, 2024

jadewang-db reviewed Jan 22, 2024


MeinAccount mentioned this pull request Aug 9, 2024

[Feature Request] Support async execution #402

Open
