Skip to content

Latest commit

 

History

History
387 lines (279 loc) · 18.3 KB

README.md

File metadata and controls

387 lines (279 loc) · 18.3 KB

Nodely

Clojars Project

Declarative data dependency graph with conditionals. See the tutorial for guided code examples.

Status

There is a list of engines together with their state of maturity. The mature engines have been used in production in critical services and this is usually the main criteria for flagging an engine as mature. Recommended usage is through the api provided via namespace nodely.api.v0. We aim to avoid breaking the api, and if it becomes necessary we will create a new api in a new namespace (such as nodely.api.v1) instead.

Introduction

Nodely is a library for declaring and executing data dependent computations with support for conditional dependencies. Declaration of the graph is completely decoupled from its execution, allowing for multiple execution stategies (engines) to be implementable. Many tools exist for computation of DAGs, but few of them support conditional dependencies.

Main Features

  • Declarative data dependency graph
  • Conditional data dependency
  • Async data dependency resolution
  • Lazy data dependency resolution

Rationale & Motivation

The development of the library was motivated by those main problems:

  • How can we maximally decouple pure logic from effectful computation?
  • Separate data dependency declaration from the data fetching execution model
  • Support expression of conditional logic

There are many reasons why we would want to solve any of these problems. Here we mention just a few of them:

By decoupling the way we declare the work to be done and the way it is run gives the programmer the option to quickly switch execution engines with few code changes. Async errors, for instance are known to be hard to debug. Switching from async to lazy/sync engine is a way for the programmer to understand the source of an error and if they are caused by asyncrony or not.

By separating how we fetch the data from the logic that uses it enables unit testing your graph in a way that is decoupled from the means of fetching the data, which might involve a query to a database or a http request.

Finally, supporting the expression of conditional logic enables us to express that "Depending on the result of computation a that depends on X, I will do computation b that depends on Y or I will do computation c that depends on Z" without coupling the expression to the way X, Y and Z are acquired. It is a step towards further decoupling of the logic, its data dependencies and its actual execution.

Directed Acyclic Graphs

Nodely does not work in Directed Acyclic Graphs (DAGs), but the environments that run in Nodely are a superset of DAGs. Thus, DAGs are a useful basis for reasoning about Nodely environments.

Nodely environments are similar to DAGs in the following ways:

  • The edges between nodes are expressed in dependencies, and this expression creates a single unambiguous direction for the edge.
  • The graph may not contain cycles: Nodely uses a dependency resolution process to evaluate nodes. If Nodely accepted and attempted to evaluate a cycle, it would loop indefinitely (see Cycle Detection).
  • The combination of nodes and their dependencies instructs Nodely how to evaluate some target node.

Nodely environments differ from DAGs primarily in embodying branching logic. Some Nodes may be "branch" Nodes, and express an uncertainty of the graph that may not be resolved until run time. By expressing branches; Nodely allows for conditional dependencies that defer evaluation until run time checks have determined it is necessary to evaluate a dependency.

Definitions

Nodes

A node can be a:

  • Value
  • Leaf
  • Sequence
  • Branch

Value

This is the simplest node, it represents an immediately available value. It is used internally for representing computations that were already carried through and it can also be used to represent given values or simple calculations that do not depend on other nodes. Examples:

;; Simple value
(nodely/>value 1)
;; Simple calculation without dependencies on other nodes
(nodely/>value (* 2 3))

Leaf

Leaf nodes are the node equivalent of functions. A leaf may have no dependencies, representing some computation that is going to be evaluated at run time and it may have dependencies. All dependencies are strong dependencies (in opposition to conditional dependencies), meaning that they will definitely be necessary for the node to be evaluated. Examples:

;; Leaf with no dependencies
(nodely/>leaf (expensive-call))
;; Leaf that depends on node `a` and `b`
(nodely/>leaf (+ ?a ?b))

Sequence

Sequences nodes are roughly equivalent to a map over a sequential value. They are used to express a computation that needs to be carried over for each element of the sequence input. The key advantage of having that kind of computation expressed as a node is that nodely can optimize execution by running the computations concurrently. The exact execution strategy may change depending on the engine being used.

;; Env with sequence
(def my-env
  {:a (nodely/>value [1 2 3])
   :b (nodely/>sequence ?a (fn [each] (inc each)))})
(nodely/eval-key my-env :b)
;; => [2 3 4]

Branch

Branch nodes are how we express conditional data dependency. The reason for their existence is better explained with an example:

(defn my-fn
  [x y]
  (if (even? x)
    x
    y))

my-fn is a function that requires two arguments (x and y). In practice though, only x is always needed. If (even? x) is true, we don't need y at all. You might want to say that "y requirement is conditionally dependent on x". Typical function declaration on Clojure does not have enough metadata to describe that conditional dependency. Here comes nodely:

(def branch-node
  (nodely/>if (>leaf (even? ?x))
    (>leaf ?x)
    (>leaf ?y)))

This is syntactic sugar for declaring a branch node. By inspecting the var, we can see that it is just a map:

> branch-node
=> #:nodely.data{:type :branch,
                 :in [:x],
                 :condition #:nodely.data{:type :leaf,
                                          :in [:x],
                                          :fn #function[fn--15482]}
                 :truthy    #:nodely.data{:type :leaf,
                                          :in [:x],
                                          :fn #function[fn--15486]},
                 :falsey    #:nodely.data{:type :leaf,
                                          :in [:y],
                                          :fn #function[fn--15492]}}

The branch node has four fields:

type: The type of the node, in this case, :branch

in: The input keys of the node. Only what is needed to execute the condition appears here.

condition: the node representing the condition of the branch; (>leaf (even? x))

truthy: the node to replace the node branch in case condition returns truthy

falsey the node to replace the node branch in case condition returns falsey

The leaf nodes on the :truthy and :falsey keys do not have any :condition, so the :in field on leaf nodes specify strong dependency (not conditional). As you can see, in the whole map, the dependency on y only appears on the :falsey node and nowhere else. Internal details aside, what this enables is a declarative way of expressing weak/conditional data dependencies clearly.

Environment

An environment is just a map of nodes. Consider this branch node:

(def branch-node 
  (nodely/>if (>leaf (even? ?x))
    (>leaf ?x)
    (>leaf ?y)))
    
(defn my-env {:x (>leaf (request-for-x ...))
              :y (>leaf (request-for-y ...))
              :z branch-node})

Let's say that x and y are actually expensive http requests. What we want is to make sure we don't execute the request for y unless we actually need it. We can evaluate a node within an environment by running:

(nodely/eval-key my-env :z  {::nodely/engine :sync.lazy})

This will evaluate the node by realizing all required dependencies, without executing the dependencies that are not needed. For instance, if x is even, y is not evaluated at all.

Testing

Using eval-node-with-values is the most straightforward way to evaluate a node by supplying actual values for the data dependencies. If we want to test branch-node without actually making the expensive calls, it is as simple as that:

(nodely/eval-node-with-values branch-node {:x 2 :y 3})
;=> 2

Core Async Support

Nodely implements partial support for non blocking operations with core-async. The main helper for building async nodes is >channel-leaf. It is similar to >leaf, but expects the enclosured expression to return a core-async channel:

(def env {:a (>value 1)
          :b (nodely/>channel-leaf
               (async/go (+ ?a 5)))})
(nodely/eval-key env :b {::nodely/engine :core-async.lazy-scheduling})
;=> 6

The example above is just to demonstrate the syntax. The main benefit of using >channel-leaf is for wrapping it around async IO operations. >channel-leaf is compatible with all other engines, but to take full advantage of non-blocking IO you should use one of the core-async engines, such as :core-async.lazy-scheduling.

Evaluation

Defining graphs of data dependency is of marginal value unless we can apply those definitions to arrive at answers. Consequently, Nodely defines multiple engines which will evaluate graphs to arrive at a value for a targeted node.

Engines supported in Nodely include:

Engine Keyword Status
Lazy Synchronous :sync.lazy Mature
Core Async Lazy Scheduling :core-async.lazy-scheduling Mature
Applicative Virtual Threads :applicative.virtual-futures Mature
Async Virtual Threads :async.virtual-futures Experimental
Core Async Iterative Scheduling :core-async.iterative-scheduling Experimental
Async Manifold :async.manifold Experimental
Async Applicative :applicative.core-async Experimental
Promesa Async Applicative :applicative.promesa Experimental

Lazy Synchronous

The Nodely :sync.lazy engine executes in one thread of execution, visiting nodes in depth-first traversal from the targeted node to terminal nodes. Only those nodes necessary to evaluate the specified target will be evaluated. Blocks execution of the calling thread until a value is produced.

This engine should be the least complicated to reason about, and useful for debugging. It will schedule nodes to evaluate strictly head and tail, and may introduce accidental latencies consequently.

Core Async Lazy Scheduling

The Nodely :core-async/lazy-scheduling engine executes in Clojure's core.async library. Core async jobs are spawned to evaluate indvidual nodes lazily, as those nodes are determined to be required to satisfy the target node. When a branch condition is not yet resolved, one job will be spawned to evaluate the branch condition, and at each step of branching, new jobs will be scheduled to resolve all nested branch conditions.

This engine will minimize the number of blocking Threads, and afford good concurrency for graphs that do not imply sequencing of evaluation. However, in degenerate cases it may result in scheduling work head and tail, while incurring additional costs of running in core.async.

When functions are specified in nodes with the :core-async/lazy-scheduling engine, the functions must not block, or they will block in the core.async Dispatch Thread Pool and present a major performance penalty.

Core Async Iterative Scheduling

Iterative scheduling differs from lazy scheduling by trying to predict what branches will have to be resolved ahead of time. This strategy was found to be inefficient on environments with a large number of branches, since the complexity of analyzing the environment grows exponentially with the number of branches.

Async Manifold

The Nodely :async.manifold engine executes in the context of Manifold Deferred Futures. First, all branches in the environment are synchronously resolved to determine one DAG. Then, all nodes are immediately scheduled to eval in parallel, with dependencies managed by inter-thread blocking.

This engine may afford the greatest concurrency of evaluating nodes and will likely produce a result in the smallest amount of wall clock time. However, it is likely to create many blocking Threads in the process, and so may have poor consequences when, e.g. serving many hundreds or thousands of requests per minute.

Applicative

Applicative engines are implemented based on the category theory concept of Applicative Functors. Applicative Functors have been considered as good abstractions for asynchronous processes because parts of its structure do not imply a strict execution order. Because of that we were able to implement support to several engines with a single code base, just changing the required set of protocols. Some examples of engines we have implemented using the applicative framework:

  • Synchronous
  • Core Async
  • Promesa
  • Virtual threads

Setting a sub-engine (evaluation context) is done with the :nodely.engine.applicative/context option, which defaults to promesa.

(ns ...
  (:require [nodely.engine.applicative.core-async :as applicative.core-async])
(nodely/eval-key my-env :z  {::nodely/engine :sync.lazy
                             :nodely.engine.applicative/context applicative.core-async/context})

We don't recommend usage of functions that are not part of the api namespace, those are subject to breaking changes. Instead use the engines that are provided by nodely.api.v0.

Async Virtual Futures

This engine relies on support to Virtual Threads added in Java 21. This engine will only be loaded and enabled in the api if you are running Java 21 or above. This engine works similarly to :async.manifold engine but dispatching futures using virtual threads instead.

Risks

Nodely presently is built to support acyclic graphs; this means that the graph does not have a loop. Put another way, the following env would cause problems:

{:a (nodely/>leaf (+ ?b 1)) :b (nodely/>leaf (- ?a 1))}

:a depends on :b, and :b depends on :a. Nodely neither resolves this kind of loop in any way (as in the above, how could it?), nor does it detect these loops proactively and abort evaluation when provided them. Loops in environments will fail if they're on the path of evaluation, and should be avoided.

While the presence of loops in evaluation is not defined, the consequences in most nodely engines are that evaluation will hang and no result will be produced. Sometimes this may be accompanied by a stack overflow error from mutual recursion, other times this will result in the process hanging without even consuming much CPU as two core.async workers deadlock, waiting on each other to complete.

If you see evaluation hanging, start investigating for loops in your environment.

Cycle Detection

Nodely provides a handy function that throws an error when a cycle has been detected in your environment. To use this function, simply execute the following code:

(comment
  (require '[nodely.api.v0 :as nodely])
  (nodely/checked-env
   {:a (nodely/>value 1)
    :b (nodely/>leaf (even? ?a))})
) ;; When no cycle is detected, returns env {:a #:nodely.data{:type :value, :value 1}...}

Avoid using the checked-env function at runtime, as its calculations can be costly. It is recommended to perform this verification only during development, using the REPL.

It's important to be aware that this function may sometimes produce false positives when dealing with graphs that contain mutually exclusive conditions. Here's an example where the function might produce a false positive:

(comment
  (require '[nodely.api.v0 :as nodely])
  (nodely/checked-env
   {:f  (>if (>leaf ?c) :it-was-even! (>leaf ?e))
     :e (>if (>leaf ?c) (>leaf ?f) :it-was-odd!)
     :c (>leaf (even? (rand-int 2)))})
) ;; throws "Checked-env found cycles at compile time"

Keep in mind these limitations and double-check the results if your graph contains such scenarios.

Related Work

From our knowledge there are some libraries with similar functionality, but none of them do exactly what nodely proposes

Supports declaring data dependency graphs with lazy and async execution. Does not support expressing conditional data dependency. Does not support expression of data dependency over sequential data.

In Pathom3 we can define resolvers that represent knowledge of how to fetch/process some data. By defining resolvers we can express relationships between data attributes. Pathom then takes care of choosing the resolution paths.

Even though there is some similarity and intersection between Pathom3 and Nodely, as far as we can understand, we think that Pathom3 is much larger, complex, opinionated and tightly coupled with itself. Pathom3 relies a lot on dynamic behavior and mutability, something that Nodely tries to avoid.

Contrary to Pathom3, Nodely does not aim to be a full fledged out of the box framework. It is supposed to solve a particular problem in a functional declarative style. Nodely could be used as backend functionality for more complete data resolution engines or systems.

Manifold offers the let-flow macro, which enables data dependency declaration that are automatically resolved and can be executed asyncronously.

However, manifold does not decouple the data dependency declaration from the execution model and also does not support conditional data dependencies.

Lazy Map