Is your feature request related to a problem or challenge?
With apache/datafusion#14079 merged, we're a step closer to supporting INSERT INTO in Ballista.
The latest issue is that the scheduler can't find the table reference specified in DML.table_name (of type TableReference).
This happens because Ballista has two separate, unsynchronized session contexts: the client context and the corresponding scheduler context.
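To make the failure mode concrete, here is a minimal toy sketch. It does not use the DataFusion or Ballista APIs; `ToyContext` and its methods are hypothetical names standing in for a session context whose catalog is consulted by table name.

```rust
use std::collections::HashMap;

/// Toy stand-in for a session context: a map from table name to column names.
/// (Hypothetical type, not a DataFusion API.)
#[derive(Default)]
struct ToyContext {
    tables: HashMap<String, Vec<String>>,
}

impl ToyContext {
    fn register_table(&mut self, name: &str, columns: &[&str]) {
        let columns: Vec<String> = columns.iter().map(|c| c.to_string()).collect();
        self.tables.insert(name.to_string(), columns);
    }

    /// Resolve an INSERT target by name only, the way a table reference is resolved.
    fn resolve_insert_target(&self, table_name: &str) -> Result<&Vec<String>, String> {
        self.tables
            .get(table_name)
            .ok_or_else(|| format!("table '{}' not found", table_name))
    }
}

/// Registers a table on the client only, then resolves it on both sides.
/// Returns (resolved on client, resolved on scheduler).
fn demo_unsynchronized() -> (bool, bool) {
    let mut client = ToyContext::default();
    let scheduler = ToyContext::default();
    client.register_table("t", &["a", "b"]);
    (
        client.resolve_insert_target("t").is_ok(),
        scheduler.resolve_insert_target("t").is_ok(),
    )
}
```

Because the plan carries only the table name, the lookup succeeds on the client and fails on the scheduler, whose catalog never saw the registration.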
Describe the solution you'd like
I do not have a good or preferred solution at this point, so I'm asking for opinions.
Ideally, the solution should be flexible.
Describe alternatives you've considered
I have a few alternatives, none of which is ideal. Have I missed something?
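Replace TableReference with actual table in the LogicalPlan::DML
The initial idea was to replace TableReference with the actual table in the plan, but that would not work because the physical planner looks up the table provider to create the insert into exec: https://github.com/milenkovicm/arrow-datafusion-fork/blob/dc22b3fc846c23f69325be6e11c8ef204c3dc6be/datafusion/core/src/physical_planner.rs#L550
I'm not convinced it will work.
Propagate DDL Statements to QueryPlanner
BallistaQueryPlanner is in charge of client-scheduler communication; at the moment it does not propagate DDL statements from client to scheduler. It could be modified to handle DDL statements. The problem is that SessionContext executes DDL statements immediately and swaps LogicalPlan::DDL with LogicalPlan::Empty, so no DDL information reaches the planner.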
Looking at the DataFusion code, I'm not sure this could be changed in SessionContext without major disruption.
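The behaviour described above can be modelled roughly as follows. This is a toy sketch, not DataFusion code: `ToyPlan`, `ToyClientContext` and the `sent_to_planner` log are hypothetical, and the real control flow inside SessionContext is more involved.

```rust
/// Toy plan variants (hypothetical, loosely mirroring the DDL / DML / Empty cases).
#[derive(Debug, Clone, PartialEq)]
enum ToyPlan {
    /// e.g. CREATE EXTERNAL TABLE <name> ...
    Ddl(String),
    /// e.g. INSERT INTO <name> ...
    Dml(String),
    Empty,
}

/// Toy client context that applies DDL locally before anything is planned.
#[derive(Default)]
struct ToyClientContext {
    local_tables: Vec<String>,
    /// What a query planner hooked into this context would get to see.
    sent_to_planner: Vec<ToyPlan>,
}

impl ToyClientContext {
    fn execute(&mut self, plan: ToyPlan) {
        let forwarded = match plan {
            // DDL is applied eagerly to the local catalog and replaced by an
            // empty plan, so the table definition never leaves the client.
            ToyPlan::Ddl(table) => {
                self.local_tables.push(table);
                ToyPlan::Empty
            }
            other => other,
        };
        self.sent_to_planner.push(forwarded);
    }
}
```

In this model the planner receives `Empty` followed by the `Dml` plan, which is why a planner-level hook alone cannot forward the table definition to the scheduler.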
Synchronize Catalogs Between Client and Scheduler
INSERT INTO will work if the scheduler catalog has the table information, so some kind of remote catalog would help. As it would affect user experience if a remote catalog had to be set up, this option is not the first choice.
We could come up with a Ballista catalog (schema registry) that synchronizes catalog state between the client and the scheduler; it could be a fair amount of work given the non-async methods exposed by SchemaCatalog.
Finally, since SchemaProvider.table is async, a table could be lazily registered the first time a query plan needs it. This would require a custom SchemaProvider on the client side.
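One possible reading of the lazy registration idea is sketched below, assuming the client-side provider pushes a table to the scheduler on its first lookup. Everything here is hypothetical and synchronous for brevity: `ToyScheduler` stands in for a remote call, and `LazyClientSchema` is not the real SchemaProvider trait.

```rust
use std::collections::{HashMap, HashSet};

/// Toy table metadata; a real provider would carry much more.
#[derive(Debug, Clone, PartialEq)]
struct ToyTable {
    columns: Vec<String>,
}

/// Stand-in for the scheduler side; `register` models a remote call.
#[derive(Default)]
struct ToyScheduler {
    tables: HashMap<String, ToyTable>,
    register_calls: usize,
}

impl ToyScheduler {
    fn register(&mut self, name: &str, table: ToyTable) {
        self.register_calls += 1;
        self.tables.insert(name.to_string(), table);
    }
}

/// Hypothetical client-side schema that syncs a table on first use.
#[derive(Default)]
struct LazyClientSchema {
    local: HashMap<String, ToyTable>,
    synced: HashSet<String>,
}

impl LazyClientSchema {
    fn register(&mut self, name: &str, columns: &[&str]) {
        let columns: Vec<String> = columns.iter().map(|c| c.to_string()).collect();
        self.local.insert(name.to_string(), ToyTable { columns });
        // A re-registered table must be pushed again on its next lookup.
        self.synced.remove(name);
    }

    /// Look up a table; the first successful lookup also registers it remotely.
    fn table(&mut self, name: &str, scheduler: &mut ToyScheduler) -> Option<&ToyTable> {
        let table = self.local.get(name)?;
        if self.synced.insert(name.to_string()) {
            scheduler.register(name, table.clone());
        }
        Some(table)
    }
}
```

Tables that no query ever touches are never sent, and repeated lookups of the same table cost a single remote registration. Invalidation on the scheduler side (for example when a table is dropped) is deliberately left out of the sketch.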
Synchronize Contexts on ExecuteQuery
Implement some kind of tracking logic, triggered on ExecuteQuery, that synchronizes the SchemaRegistry between client and scheduler.
I'm not really keen on this solution as I believe it will get very complicated very quickly.
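For discussion, the simplest form of such tracking might look like the sketch below: the client logs catalog changes and ships the not-yet-delivered tail with the next ExecuteQuery. `CatalogChange` and `ChangeTracker` are hypothetical names, and the sketch ignores failures, retries and concurrent sessions, which is where the complexity would likely come from.

```rust
/// A catalog mutation the scheduler would need to replay (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum CatalogChange {
    Register(String),
    Deregister(String),
}

/// Append-only change log with a marker for what has already been delivered.
#[derive(Default)]
struct ChangeTracker {
    log: Vec<CatalogChange>,
    synced_upto: usize,
}

impl ChangeTracker {
    fn record(&mut self, change: CatalogChange) {
        self.log.push(change);
    }

    /// Changes not yet shipped; these would be attached to the next ExecuteQuery.
    fn pending(&self) -> &[CatalogChange] {
        &self.log[self.synced_upto..]
    }

    /// Mark everything currently in the log as delivered.
    fn mark_synced(&mut self) {
        self.synced_upto = self.log.len();
    }
}
```

A caller would read `pending()` when building the ExecuteQuery request and call `mark_synced()` only after the scheduler acknowledges it.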
Modify Ballista Protocol to send PhysicalPlans
At the moment the client sends a LogicalPlan to the scheduler, which then converts it to a physical plan. At this point we need the table reference. I was wondering whether we can resolve the physical plan on the client side but split it into stages on the server side.
This would be quite a big change, so I'm asking if anybody remembers why the logical plan was chosen as the exchange format instead of the physical plan.
Additional context