marimo's text format is hard for humans/non-marimo tools to understand #1379

Ubehebe · 2024-05-15T15:12:01Z

One of the main advantages of marimo compared to other notebook formats is that marimo notebooks are syntactically valid Python files. This means that tools that analyze Python files (linters, formatters, type-checkers, IDEs) can generally do something useful with marimo notebooks without any setup.

As I've used marimo more, I've discovered some exceptions. marimo notebooks are syntactically valid Python, but they aren't idiomatic Python. This means that some tools can't analyze marimo notebooks in a useful way.

Here's an example. When you use an import in a notebook:

import pandas as pd

df = pd.DataFrame(...)

marimo serializes that to disk as something like:

@app.cell
def __():
  import pandas as pd
  return pd

@app.cell
def __(pd):
  df = pd.DataFrame(...)

This is basically a serialized DAG: the nodes (cells) are represented by top-level functions decorated with @app.cell, and the edges (dependencies) are represented by function params/return values.

The serialization is elegant, but tools other than marimo can't understand the indirection -- for example, they can't understand that the DataFrame constructor comes from the pandas import. This means that:

1. Tools like ruff can't remove unused imports from marimo notebooks. If you delete the DataFrame instantiation above and run the file through ruff, the `import pandas as pd` statement remains.
2. IDEs can't navigate from the DataFrame instantiation to its definition in pandas.
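
The first point can be reproduced with the standard-library `ast` module. The sketch below is a toy stand-in for an unused-import check, not ruff's actual implementation, and `unused_imports` is a hypothetical helper. It treats an imported name as used if it is loaded anywhere in the same function, so the `return pd` in the serialized cell counts as a use.

```python
import ast

# A serialized cell in the style shown above. It is only parsed, never run,
# so `app` and pandas do not need to exist.
SERIALIZED = '''
@app.cell
def __():
    import pandas as pd
    return pd
'''

# The same import without the synthetic return statement.
PLAIN = '''
def f():
    import pandas as pd
'''

def unused_imports(source: str) -> list[str]:
    """Return names imported inside a function but never loaded in it."""
    unused = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        imported, loaded = set(), set()
        for node in ast.walk(func):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                for alias in node.names:
                    imported.add(alias.asname or alias.name.split(".")[0])
            elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
        unused.extend(sorted(imported - loaded))
    return unused

print(unused_imports(SERIALIZED))  # []
print(unused_imports(PLAIN))       # ['pd']
```

From the checker's point of view, returning `pd` is a legitimate use, which is why the import survives even when no other cell needs it.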

I can see a few approaches we might take to improve this situation, but before proposing anything specific, I wanted to start a discussion. Maintainers, have you thought about this? How important do you think it is to improve?

My own view is that it's of medium importance. For (1), unused imports can significantly slow down notebook execution. And for (2), being able to use IDE features to edit marimo notebooks would make large codebases significantly more maintainable (refactoring, etc.).

Thanks for your time!

akshayka · 2024-05-20T18:50:48Z
Maintainer

@Ubehebe, sorry for the late response -- just saw your post.

> I can see a few approaches we might take to improve this situation, but before proposing anything specific, I wanted to start a discussion. Maintainers, have you thought about this? How important do you think it is to improve?

I've thought about this insofar as the indirection bothers me, too. But I haven't spent time trying to design something better. In particular, number 2 resonates with me -- it would be great to make editing in an IDE/text editor easier.

Definitely open to hearing your suggestions. Thanks so much for the thoughtful message!

Ubehebe · 2024-05-22T02:23:34Z
Author

I think the main question I have is: why does marimo have to serialize the dataflow graph into the source file? Why can't it be an in-memory data structure on the server?

The first problem is that marimo needs some way to partition a Python source file into cells. The @app.cell decoration and the synthetic functions are as good a way as any. (There are other possible approaches, but the synthetic functions don't confuse non-marimo tools, so I'm not that concerned about them.)

The edges of the dataflow graph (the parameters of the synthetic functions) are what confuse non-marimo tools. Why do they need to be in the source file? Couldn't marimo run the dataflow analysis once on startup and keep the graph in-memory? This might slow down the initial time to interactive, but I doubt it would be significant when import statements regularly take 1+ second.

If we're able to make marimo notebooks more idiomatic Python so that other tools can work on them seamlessly, I think that's a good tradeoff.
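
To make the in-memory idea concrete, here is a minimal sketch of recomputing the graph from cell bodies alone, using only the standard-library `ast` module. It is an illustration under simplifying assumptions, not marimo's analysis: `defs_and_refs` and `build_edges` are hypothetical names, and only plain assignments and imports at the top level of each cell are handled (no functions, classes, comprehensions, or deletions).

```python
import ast

def defs_and_refs(cell_source: str) -> tuple[set[str], set[str]]:
    """Names a cell defines, and names it reads without defining them itself."""
    defs, reads = set(), set()
    for node in ast.walk(ast.parse(cell_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defs.add(alias.asname or alias.name.split(".")[0])
    return defs, reads - defs

def build_edges(cells: list[str]) -> set[tuple[int, int]]:
    """Edges (i, j) meaning: cell j reads a name that cell i defines."""
    analyzed = [defs_and_refs(src) for src in cells]
    edges = set()
    for i, (defs, _) in enumerate(analyzed):
        for j, (_, refs) in enumerate(analyzed):
            if i != j and defs & refs:
                edges.add((i, j))
    return edges

# The cell bodies are only parsed, so pandas does not need to be installed.
cells = [
    "import pandas as pd",
    "df = pd.DataFrame({'a': [1, 2]})",
    "n = len(df)",
]
print(sorted(build_edges(cells)))  # [(0, 1), (1, 2)]
```

Names with no defining cell, such as the builtin `len`, simply produce no edge.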

akshayka May 22, 2024
Maintainer

> Why do they need to be in the source file? Couldn't marimo run the dataflow analysis once on startup and keep the graph in-memory?

@Ubehebe, you're totally right, they don't need to be in the source file. In fact, marimo doesn't even read the args and returns of the @app.cell-decorated functions; it just redoes the dataflow analysis, as you suggest.

The main reason references are included as cell/function args is so that the code is more legible to human eyes. For example, when designing the file format, I believed:

@app.cell
def __():
  x = 0
  return x

@app.cell
def __(x):
  y = x + 1
  return y

was more legible than

@app.cell
def __():
  x = 0

@app.cell
def __():
  y = x + 1

Because in the former, at least x is bound to something (the function argument). Plus, it makes using the cell.run() API a bit easier, since you can read off the references from the signature.

But, I might have been mistaken! I am open to not serializing the edges in the file format if it would make the Python more idiomatic. How does this improve the IDE experience? Do you have a suggestion based on removing the edges that would make the file format more idiomatic?
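
The point above about reading references off the signature can be illustrated without marimo at all. In this sketch, `cell_a` and `cell_b` are hypothetical stand-ins for serialized cells (plain functions, no `@app.cell` decorator), and `references` is a hypothetical helper built on the standard-library `inspect` module.

```python
import inspect

def cell_a():
    x = 0
    return x

def cell_b(x):
    y = x + 1
    return y

def references(cell) -> list[str]:
    """Names a cell expects from other cells, read from its parameter list."""
    return list(inspect.signature(cell).parameters)

print(references(cell_a))  # []
print(references(cell_b))  # ['x']

# Knowing the references, a caller can supply them explicitly.
print(cell_b(x=cell_a()))  # 1
```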

Ubehebe May 22, 2024
Author

> I am open to not serializing the edges in the file format if it would make the Python more idiomatic. How does this improve the IDE experience?

I realized I've been thinking only about imports. marimo currently serializes all names that appear in a Python file into the params of synthetic functions. Top-level imports are a kind of name. If marimo didn't serialize top-level imports into the function params, IDEs could navigate from the use of an import to its definition. Simple example:

# user writes
import pandas as pd

df = pd.DataFrame(...)

# current marimo serialization
import marimo

app = marimo.App()

@app.cell
def __():
  import pandas as pd
  return pd

@app.cell
def __(pd):
  df = pd.DataFrame(...) # bad: `pd` param "shadows" top-level import
  return df

# if marimo didn't serialize top-level imports
import marimo
import pandas as pd

app = marimo.App()

@app.cell
def __():
  df = pd.DataFrame(...) # good: IDE knows `pd` is pandas

This might be worth doing to get ruff unused imports working, plus limited IDE navigation. But this doesn't help tools with names that are not top-level imports (like local variables). If marimo doesn't serialize any names, then non-marimo tools can't do much with synthetic functions like this, as you point out.

@app.cell
def __():
  y = x + 2 # marimo knows where this x comes from, other tools do not

I think the fundamental problem is that marimo's choice of marker for cells (functions) introduces a lexical scope. This hides relationships that would otherwise be legible to both humans and non-marimo tools.

Did you consider using a cell marker that doesn't introduce a lexical scope?

# marimo:cell
x = 1

# marimo:cell
y = x + 2

Putting behavior in comments is grungy, but this has the advantage that the relationship between the names is immediately apparent.
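
A rough sketch of how a tool might split such a flat file into cells follows. The `# marimo:cell` marker is only the hypothetical syntax proposed above, not something marimo supports, and `split_cells` is a made-up helper name.

```python
MARKER = "# marimo:cell"  # hypothetical marker from the proposal above

def split_cells(source: str) -> list[str]:
    """Split a flat script into cell bodies at each marker line."""
    cells: list[list[str]] = []
    for line in source.splitlines():
        if line.strip() == MARKER:
            cells.append([])        # start a new cell
        elif cells:
            cells[-1].append(line)  # lines before the first marker are ignored
    return ["\n".join(body).strip() for body in cells]

flat = """\
# marimo:cell
x = 1

# marimo:cell
y = x + 2
"""

print(split_cells(flat))  # ['x = 1', 'y = x + 2']

# With no function scope in the way, the file is also an ordinary script,
# so `x` resolves the way any Python tool would expect.
namespace = {}
exec(flat, namespace)
print(namespace["y"])  # 3
```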

akshayka May 31, 2024
Maintainer

Sorry for the delayed response, @Ubehebe.

> Did you consider using a cell marker that doesn't introduce a lexical scope?

Yeah, we did consider this. The reason we didn't go down this route is that we wanted the notebook files to be importable as Python modules, for reusability -- to be able to support reusing cells, functions, or classes from one notebook in another.

Writing the notebook code as a flat script would either mean that the notebook would be executed on import, or the code wouldn't be reusable because it would be nested under an if __name__ == "__main__" guard.
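
This trade-off can be seen in a small simulation. The sketch below uses `exec` on a fresh `types.ModuleType` as a stand-in for a real `import`. The module names and the `load` helper are hypothetical, and the wrapped cell is a plain function rather than a real `@app.cell`.

```python
import types

# Flat script: top-level code with a visible side effect.
FLAT = """
calls.append("expensive work ran")
x = 1
"""

# Flat script hidden behind a main guard.
GUARDED = """
if __name__ == "__main__":
    def helper():
        return 1
"""

# Function-wrapped cell, in the spirit of the current format.
WRAPPED = """
def cell():
    calls.append("expensive work ran")
    x = 1
    return x
"""

def load(name: str, source: str, calls: list) -> types.ModuleType:
    """Simulate `import name` by executing source in a fresh module namespace."""
    module = types.ModuleType(name)  # sets __name__ to `name`, not "__main__"
    module.calls = calls             # shared log so side effects are observable
    exec(source, module.__dict__)
    return module

flat_calls, wrapped_calls = [], []
flat = load("flat_nb", FLAT, flat_calls)
guarded = load("guarded_nb", GUARDED, [])
wrapped = load("wrapped_nb", WRAPPED, wrapped_calls)

print(flat_calls)                  # ['expensive work ran']  (ran on import)
print(hasattr(guarded, "helper"))  # False  (hidden behind the guard)
print(wrapped_calls)               # []  (nothing ran on import)
print(wrapped.cell())              # 1  (the cell runs only when called)
```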

But I do agree that the flat version you suggest provides a much better editing experience. I wonder if we can somehow get the best of both worlds.

Ubehebe · 2024-05-31T13:33:12Z
Author

I renamed this topic to reflect the most important issue (and to focus less on specific solutions). marimo's text format is good compared to other notebook formats, but I think it can be even better.

