Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Children proxy interface #9477

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

Just a sketch for now - follows the same general idea of .coords.

cc @shoyer

@TomNicholas TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label Sep 11, 2024
@TomNicholas TomNicholas marked this pull request as draft September 11, 2024 02:24
@@ -44,6 +45,68 @@ def __init__(self, *pathsegments):
Tree = TypeVar("Tree", bound="TreeNode")


class Children(Mapping[str, Tree], Generic[Tree]):
"""
Dictionary-like container for the immediate children of a single DataTree node.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this class should allow path-like access to grandchildren or only allow access to immediate children.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question also for setting

Copy link
Member Author

@TomNicholas TomNicholas Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thoughts I think this class should only allow getting/setting immediate children (because it's basically a glorified dict).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thoughts I think this class should only allow getting/setting immediate children (because it's basically a glorified dict).

Agreed!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relates to #9485 though - why exactly do we want to support full paths for .coords but only local access for .children? We could do whatever, but what's the rationale for them being different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're basically thinking of .coords/.children as meaning "access to all coords/children defined anywhere on the subtree". I originally thought of them as "access to just coords/children defined on this node".

FYI path-like access like this means that list(dt.children) will return a subset of what can be accessed via .__getitem__, (and __contains__ could go either way, see #9354)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One compromise option of sorts (similar to what is done for Dataset for derived variables like time.year, or for coordinates in Dataset.__getitem__) is to only list immediate children in __iter__ and __contains__ but to support accessing them in __getitem__.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a detailed discussion about this on the old repo (xarray-contrib/datatree#240 (comment)). There the conclusion was that __contains__ should support paths if __getitem__ does, but it's okay for .keys()/.values()/.items() to be local only.

So you're basically thinking of .coords/.children as meaning "access to all coords/children defined anywhere on the subtree".

To play devil's advocate a bit: If .coords can access the whole tree then why not also .variables? I think they should all be consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a detailed discussion about this on the old repo (xarray-contrib/datatree#240 (comment)). There the conclusion was that __contains__ should support paths if __getitem__ does, but it's okay for .keys()/.values()/.items() to be local only.

Yes, that seems to be how Dataset works for coordinates.

To play devil's advocate a bit: If .coords can access the whole tree then why not also .variables? I think they should all be consistent.

Sure, we could do that. Hopefully it's all very straightforwardly reusing the same machinery?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that seems to be how Dataset works for coordinates.

I'm not sure what you mean. You're talking about virtual vs non-virtual coordinates on Dataset? That they are all in __contains__ and __getitem__, but only a subset (the non-virtual ones) are listed by .keys()?

If .coords can access the whole tree then why not also .variables?

Sure, we could do that.

I actually meant that as a example of counter-intuitive behaviour 😅 . But maybe it's fine too? That would mean we have really done a 180 degree turn in this issue... It would imply changing the current behaviour of dt.data_vars too.

Hopefully it's all very straightforwardly reusing the same machinery?

It's not right now (local vs path-supporting have different codepaths), but it could be made to: everything could go through some version of using TreeNode._set_item

def _set_item(

or ._walk_to and we could add a parameter like traverse_paths to generalize it for both cases. Generalizing .update to support paths might be the approach that is easiest to integrate with the Coordinates base class.

Also one final thing: We are talking here about supporting paths to descendant nodes only? Or upwards via dt['../../path/to/node/'] too?

@@ -224,6 +230,56 @@ def test_overwrite_child(self):
assert marys_evil_twin.parent is john


class TestChildren:
Copy link
Member Author

@TomNicholas TomNicholas Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit awkward to know whether to put tests for this class in test_treenode.py or test_datatree.py.

On the one hand the concept of access to .children is very much part of the TreeNode structure - it has nothing to do with .data. On the other hand it's hard to properly test it without full features that are only available on DataTree, such as .copy, a nice repr, and assertions. It's also a fully public-facing class, unlike TreeNode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a lot less awkward after #9482.

Comment on lines +295 to +297
# test constructor
john2 = TreeNode(children=children)
assert john2.children == children
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test failure is due to #9478

@@ -44,6 +45,68 @@ def __init__(self, *pathsegments):
Tree = TypeVar("Tree", bound="TreeNode")


class Children(Mapping[str, Tree], Generic[Tree]):
"""
Dictionary-like container for the immediate children of a single DataTree node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thoughts I think this class should only allow getting/setting immediate children (because it's basically a glorified dict).

Agreed!

Comment on lines +101 to +103
children = self._treenode._children.copy()
children.update(other)
self._treenode.children = children
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to verify, this is a no-op if children have incompatible indexes, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .children setter defined on DataTree (and TreeNode) will check for name collisions between variables and children, and if it fails it will avoid modification in-place (we can thank the anytree library for that bit of code 1). In the same place then check_alignment also gets called. Is that what you meant?

Footnotes

  1. Should we adjust the licensing part of the readme as we're bundling small portions of anytree code? We did already add it to the licenses directory here.

Comment on lines +98 to +99
if not len(other):
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's slightly more canonical Python to avoid the call to len():

Suggested change
if not len(other):
return
if not other:
return

Comment on lines +64 to +65
def _names(self) -> list[str]:
return list(self._treenode._children.keys())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not directly use the set-like DictKeys object?

Suggested change
def _names(self) -> list[str]:
return list(self._treenode._children.keys())
def _names(self) -> Set[str]:
return self._treenode._children.keys()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably use .keys() you're right, but I don't think I should type this as returning Set, because I originally tried return set(self._treenode._children) and that failed intermittently due to ordering differences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of collections.abc.Set, which is an abstract base class that includes DictKeys, not the specific built-in set class

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Set child nodes via DataTree.children
2 participants