Problem Statement
While programming languages have come a long way in making it easy for developers to write data processing pipelines declaratively, relational algebra is still the most widely used and well-known method of processing large amounts of data. Even data processing solutions that offer native language integrations (such as Apache Spark) end up exposing APIs that abandon the semantics of the programming language and mimic the behavior of relational algebra instead. As a result, there's still friction between the programming language and the data processing pipelines.
Proposal
One of the most successful solutions to this problem is Microsoft's Language Integrated Query (LINQ), which seamlessly integrates relational algebra into a programming language. What if we did something similar in Morphir?
To be as non-intrusive as possible, we could start by adding a separate language building block that captures relational operations in a separate data structure and refers back to existing Morphir IR constructs for the column level operations.
Important Note: Initially we would define relational operations outside the core IR as an add-on. In other words, we would not be adding it as a new type of `Value`. This means that relational operations can't be referred to from other Morphir business logic. While this is a strong limitation, it is already widely accepted, since Spark DataFrames and SQL queries also return values that are not native to the language. We might lift this limitation in the future, but at that point we will need to ensure that we have proper implementations of relational operations in all of our existing backends (such as the Scala backend). Until then, this limitation decreases the risk of the proposed change significantly.
Benefits
There are a number of benefits to this approach:
- Since column level operations use standard Morphir building blocks, we get all the benefits of accurately modeled types (enums instead of strings), code reuse (reusable functions), testability, and many more.
- Data processing pipelines can be built using constructs that are already familiar and will behave exactly as expected.
- Mapping these constructs to relational execution environments (such as Apache Spark or relational databases) is trivial. In other words, it will take very little effort.
- Abstracts the pipeline to allow standardization of execution, error handling, and processing.
Implementation Details
We can start by defining a relation as a recursive data structure:
```elm
module Morphir.IR.Relation exposing (..)

import Dict exposing (Dict)
import Morphir.IR.Name exposing (Name)
import Morphir.IR.Value exposing (Value)


type Relation
    = From { source : Value }
    | Where { predicate : Value, source : Relation }
    | Select { fields : Dict Name Value, source : Relation }
    | Join { joinType : JoinType, on : Value, left : Relation, right : Relation }
    | GroupBy { keys : List Value, source : Relation }


{-| Not defined in the original sketch; a plausible shape. -}
type JoinType
    = Inner
    | Left
    | Right
    | Outer
```
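To make this concrete, here is a sketch of how a query like `SELECT * FROM Foo AS a WHERE a.amount < 100` might be captured as a `Relation` tree. The names `fooSource` and `amountPredicate` are hypothetical stand-ins for real `Morphir.IR.Value` expressions, since their exact encoding is not specified here:

```elm
-- Hypothetical sketch: SELECT * FROM Foo AS a WHERE a.amount < 100
-- `fooSource` would be a Value referencing the Foo dataset, and
-- `amountPredicate` a Value for the lambda \a -> a.amount < 100.
example : Relation
example =
    Where
        { predicate = amountPredicate
        , source = From { source = fooSource }
        }
```

Note how the column-level pieces stay as ordinary `Value`s while the relational shape lives entirely in the `Relation` tree.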
While we could represent these to a certain extent as function calls with the existing Value constructs, the advantage of having dedicated structures for this is that we can more accurately represent the semantics of relational algebra and map it to values efficiently. We will define the exact semantics later, but to demonstrate the concept we will examine what goes into the predicate of a Where node.
In SQL, what column/object names you can use in a where clause depends on what is available at that point in the relation, which in turn depends on what is in the from clause and what was joined. The most direct mapping of that behavior in Morphir is to treat the predicate as a function body where the variables in scope are derived from the source relation. For example, in the query below:
```sql
SELECT *
FROM Foo AS a
WHERE (a.amount < 100)
```
The expression inside the parentheses is a Morphir value where only the variable `a` is available, which is a record with fields that can be derived from `Foo`'s schema. On the other hand, in the query below:
```sql
SELECT *
FROM Foo AS a
JOIN Bar AS b
  ON a.id = b.id
WHERE (a.amount < b.amount)
```
Now `a` and `b` are both variables that are available in the predicate's scope.
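Viewed from the Elm side, these predicates are simply function bodies whose parameters are the variables the relation brings into scope. A sketch (field types would be inferred from the schemas):

```elm
-- Only `a` is in scope when the source is a single relation.
singleSourcePredicate a =
    a.amount < 100

-- After the join, both `a` and `b` are in scope.
joinPredicate a b =
    a.amount < b.amount
```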
We would need new name resolution and type inference tooling at the relation level that follows the semantics described here (and expanded later), but at the column level we could simply refer back to the existing tooling.
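As an illustration of what that relation-level tooling might look like, here is a hypothetical scope-derivation function over the `Relation` type above. `aliasOf` is an assumed helper that extracts the variable name from a source `Value`, and the sketch ignores how `Select` and `GroupBy` reshape the row type:

```elm
-- Hypothetical sketch: which variable names are in scope for a
-- predicate, derived from the structure of its source relation.
scopeOf : Relation -> List Name
scopeOf relation =
    case relation of
        From r ->
            [ aliasOf r.source ]

        Where r ->
            scopeOf r.source

        Select r ->
            scopeOf r.source

        Join r ->
            -- A join makes both sides' variables visible.
            scopeOf r.left ++ scopeOf r.right

        GroupBy r ->
            scopeOf r.source
```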
What does everyone think?