Problem Statement
While programming languages have come a long way in making it easy for developers to write data processing pipelines declaratively, relational algebra is still the most widely used and well-known method of processing large amounts of data. Even data processing solutions that offer native language integrations (such as Apache Spark) end up exposing APIs that abandon the semantics of the programming language and mimic the behavior of relational algebra instead. As a result, there's still friction between the programming language and the data processing pipelines.
Proposal
One of the most successful solutions to this problem is Microsoft's Language Integrated Query (LINQ), which seamlessly integrates relational algebra into a programming language. What if we did something similar in Morphir?
To be as non-intrusive as possible, we could start by adding a separate language building block that captures relational operations in a separate data structure and refers back to existing Morphir IR constructs for the column level operations.
Important Note: Initially we would define relational operations outside the core IR as an add-on. In other words, we would not be adding it as a new type of `Value`. This means that relational operations can't be referred to from other Morphir business logic. While this is a strong limitation, it is already widely accepted, since Spark DataFrames and SQL queries also return values that are not native to the language. We might lift this limitation in the future, but at that point we will need to ensure that we have proper implementations of relational operations in all of our existing backends (such as the Scala backend). Until then, this limitation decreases the risk of the proposed change significantly.
Benefits
There are a number of benefits to this approach:
- Since column level operations use standard Morphir building blocks, we get all the benefits of accurately modeled types (enums instead of strings), code reuse (reusable functions), testability, and many more.
- Data processing pipelines can be built using constructs that are already familiar and will behave exactly as expected.
- Mapping these constructs to relational execution environments (such as Apache Spark or relational databases) is trivial. In other words, it will take very little effort.
- Abstracts the pipeline to allow standardization of execution, error handling, and processing.
Implementation Details
We can start by defining a relation as a recursive data structure:
```elm
module Morphir.IR.Relation exposing (..)

import Dict exposing (Dict)
import Morphir.IR.Name exposing (Name)
import Morphir.IR.Value exposing (Value)


type Relation
    = From { source : Value }
    | Where { predicate : Value, source : Relation }
    | Select { fields : Dict Name Value, source : Relation }
    | Join { joinType : JoinType, on : Value, left : Relation, right : Relation }
    | GroupBy { keys : List Value, source : Relation }


{-| Not defined in the original sketch; a plausible shape. -}
type JoinType
    = Inner
    | Left
    | Right
    | Outer
```
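To make this concrete, here is a sketch of how a query like `SELECT * FROM Foo AS a WHERE a.amount < 100` might be captured as a `Relation` tree. The names `fooSource` and `amountPredicate` are hypothetical stand-ins for real `Morphir.IR.Value` expressions, since their exact encoding is not specified here:

```elm
-- Hypothetical sketch: SELECT * FROM Foo AS a WHERE a.amount < 100
-- `fooSource` would be a Value referencing the Foo dataset, and
-- `amountPredicate` a Value for the lambda \a -> a.amount < 100.
example : Relation
example =
    Where
        { predicate = amountPredicate
        , source = From { source = fooSource }
        }
```

Note how the column-level pieces stay as ordinary `Value`s while the relational shape lives entirely in the `Relation` tree.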
While we could represent these to a certain extent as function calls with the existing Value constructs, the advantage of having dedicated structures for this is that we can more accurately represent the semantics of relational algebra and map it to values efficiently. We will define the exact semantics later, but to demonstrate the concept we will examine what goes into the predicate of a Where node.
In SQL, what column/object names you can use in a where clause depends on what is available at that point in the relation, which in turn depends on what is in the from clause and what was joined. The most direct mapping of that behavior in Morphir is to treat the predicate as a function body where the variables in scope are derived from the source relation. For example, in the query below:
```sql
SELECT *
FROM Foo AS a
WHERE (a.amount < 100)
```
The expression inside the parentheses is a Morphir value where only the variable `a` is available, which is a record with fields that can be derived from `Foo`'s schema. On the other hand, in the query below:
```sql
SELECT *
FROM Foo AS a
JOIN Bar AS b
  ON a.id = b.id
WHERE (a.amount < b.amount)
```
Now `a` and `b` are both variables that are available in the predicate's scope.
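Viewed from the Elm side, these predicates are simply function bodies whose parameters are the variables the relation brings into scope. A sketch (field types would be inferred from the schemas):

```elm
-- Only `a` is in scope when the source is a single relation.
singleSourcePredicate a =
    a.amount < 100

-- After the join, both `a` and `b` are in scope.
joinPredicate a b =
    a.amount < b.amount
```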
We would need new name resolution and type inference tooling at the relation level that follows the semantics described here (and expanded later), but at the column level we could simply refer back to the existing tooling.
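As an illustration of what that relation-level tooling might look like, here is a hypothetical scope-derivation function over the `Relation` type above. `aliasOf` is an assumed helper that extracts the variable name from a source `Value`, and the sketch ignores how `Select` and `GroupBy` reshape the row type:

```elm
-- Hypothetical sketch: which variable names are in scope for a
-- predicate, derived from the structure of its source relation.
scopeOf : Relation -> List Name
scopeOf relation =
    case relation of
        From r ->
            [ aliasOf r.source ]

        Where r ->
            scopeOf r.source

        Select r ->
            scopeOf r.source

        Join r ->
            -- A join makes both sides' variables visible.
            scopeOf r.left ++ scopeOf r.right

        GroupBy r ->
            scopeOf r.source
```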
What does everyone think?