
refactor(i): Rewrite document fetcher #3279

Open
wants to merge 2 commits into base: develop

Conversation

AndrewSisley (Contributor)

Relevant issue(s)

Resolves #3277

Description

Rewrites the document fetcher.

Motivation

The old document fetcher had grown over the last few years and had, IMO, turned into a bit of a mess: a large amount of mutable state, and a lot of different concepts blurred into the same type and interleaved with each other, which made maintaining the file pretty horrible.

Picking up #3275 motivated me to make this change now, so I can integrate the new package into something that I (hopefully we) can read.

What has not been done

The Fetcher interface has not changed (much - see the first commit). Changing it would involve touching more parts of the codebase and would be a lot more hassle, while having minimal impact on introducing CoreKV.

What has been done

The old code has been deleted and replaced with something written from scratch. Please review each line as if it were brand new code, and complain about anything you like. Please be quite critical, especially if you actually liked the old fetcher code.

Behavioural changes

Only one significant change has been made to the behaviour.

Previously, filtering would be done as soon as all the fields for the filter had been read into memory. Now filtering is done once all selected fields have been read.

The old behaviour makes very little sense to me: in the best-case scenario, the only scans avoided were of fields whose names were lexicographically larger than the last filter field, and only for the last document in the prefix. That saving is quite tiny, and the code required to achieve it is almost certainly more computationally expensive than the saving itself (across the database's lifetime). Note: the primary side of relations gained the most from this, until we add automatic indexing of relations.

Even once CoreKV is introduced, with its seek functionality, skipping past those key-values would require the iterator to seek backwards and forwards (or have two iterators, with extra seeking), and the savings would likely be minimal at best.

Medium-to-long term this means that doing any field-based filtering in the fetcher makes little sense, and we might as well do it outside of the fetcher. That is more hassle than it is worth to change in this PR, so, to keep the scope manageable, a filtered fetcher has been implemented to host fetcher-level filtering. Hopefully we can remove that too before long.
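To illustrate the new ordering with a generic sketch (the names and shape here are purely illustrative and are not the contents of filtered.go): documents are fully fetched first, and only then is the filter applied to the complete document.

// filterAfterFetch wraps a fetch function and applies the filter predicate
// only once a whole document has been materialised. Illustrative sketch only.
func filterAfterFetch[D any](fetchNext func() (D, bool), keep func(D) bool) func() (D, bool) {
	return func() (D, bool) {
		for {
			doc, ok := fetchNext()
			if !ok {
				var zero D
				return zero, false
			}
			if keep(doc) {
				return doc, true
			}
		}
	}
}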

It hasn't been used in 3 years; it can die and come back if/when needed.
@AndrewSisley added the area/query, refactor, and code quality labels on Nov 28, 2024
@AndrewSisley added this to the DefraDB v0.15 milestone on Nov 28, 2024
@AndrewSisley requested a review from a team on Nov 28, 2024 at 20:27
@AndrewSisley self-assigned this on Nov 28, 2024
codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 78.57143% with 81 lines in your changes missing coverage. Please review.

Project coverage is 78.04%. Comparing base (8509211) to head (f03a867).

Files with missing lines Patch % Lines
internal/db/fetcher/wrapper.go 78.95% 13 Missing and 7 partials ⚠️
internal/db/fetcher/deleted.go 65.91% 11 Missing and 4 partials ⚠️
internal/db/fetcher/document.go 85.58% 10 Missing and 5 partials ⚠️
internal/db/fetcher/prefix.go 75.81% 10 Missing and 5 partials ⚠️
internal/db/fetcher/filtered.go 70.00% 6 Missing and 3 partials ⚠️
internal/db/fetcher/permissioned.go 82.86% 4 Missing and 2 partials ⚠️
internal/db/fetcher/versioned.go 80.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3279      +/-   ##
===========================================
+ Coverage    77.95%   78.04%   +0.09%     
===========================================
  Files          382      388       +6     
  Lines        35364    35298      -66     
===========================================
- Hits         27568    27548      -20     
+ Misses        6148     6120      -28     
+ Partials      1648     1630      -18     
Flag Coverage Δ
all-tests 78.04% <78.57%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
internal/db/collection.go 69.87% <100.00%> (ø)
internal/db/collection_get.go 74.14% <100.00%> (ø)
internal/db/collection_index.go 87.43% <ø> (-0.03%) ⬇️
internal/db/fetcher/fetcher.go 100.00% <ø> (+22.78%) ⬆️
internal/db/fetcher/indexer.go 83.69% <ø> (-0.11%) ⬇️
internal/lens/fetcher.go 70.09% <ø> (-0.14%) ⬇️
internal/planner/scan.go 89.88% <100.00%> (-0.04%) ⬇️
internal/db/fetcher/versioned.go 81.77% <80.00%> (+1.55%) ⬆️
internal/db/fetcher/permissioned.go 82.86% <82.86%> (ø)
internal/db/fetcher/filtered.go 70.00% <70.00%> (ø)
... and 4 more

... and 19 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8509211...f03a867.

Comment on lines +19 to +20
// deleted is a fetcher that orchestrates the fetching of deleted and active documents.
type deleted struct {
@shahzadlone (Member) commented Nov 29, 2024

suggestion: Why not rename to deletedFetcher? I know the package name is fetcher, but I still prefer the more descriptive name, especially as you also have a newDeletedFetcher function.

Suggested change
// deleted is a fetcher that orchestrates the fetching of deleted and active documents.
type deleted struct {
// deleted is a fetcher that orchestrates the fetching of deleted and active documents.
type deletedFetcher struct {

Contributor Author

I dislike fetcher.deletedFetcher now, and I would have expected @fredcarle to complain if I had named it so. Happy to rename if the majority want to though, otherwise leaving as-is.

Member

Fred and I have rather similar views on this kind of naming stuff and, although I'm not speaking for him, I do prefer the explicitness of deletedFetcher here. deleted is just too generic a term, and the same goes for document, which, given the context of the package, could just as easily refer to an actual document.

As for the full import/package naming (i.e. fetcher.deletedFetcher), this is also OK and not against any language idioms or best practices, because the alternative (fetcher.deleted) is not a clear enough naming scheme (ignoring the fact that this isn't a public type).

@shahzadlone (Member) commented Dec 2, 2024

I dislike fetcher.deletedFetcher now, and I would have expected @fredcarle to complain if I had named it so. Happy to rename if the majority want to though, otherwise leaving as-is.

Why newDeletedFetcher over newDeleted then, haha? The only reason deleted is easy to understand for us is because we have context. A new developer just stumbling on this file, even though the package is called fetcher, might not grasp what a general term like deleted refers to here, let alone what a "deleted fetcher" is.

I do understand that this is subjective, so I'm happy to follow consensus. I do think I am biased towards longer, descriptive names haha; my ideal would definitely be deletedAndActiveDocumentFetcher or activeDocumentAndDeletedFetcher for this, and activeDocumentFetcher for the other one that is currently document.

Contributor

I don't like either of deleted and deletedFetcher. The documentation for the struct is not deletion-specific; it's for both deleted and active documents. If the main thing this fetcher does is orchestrating something, let's call it orchestratingFetcher.

And it's also not clear, just by looking at its fields, why deleted includes a deletedFetcher.

Another question is: who deleted this fetcher? :D

Collaborator

My preference after thinking about it over the weekend.

mainFetcher or docFetcher
deletedFetcher
versionedFetcher
filteredFetcher
permissionedFetcher
indexedFetcher

@@ -362,7 +362,7 @@ func TestQueryWithIndexOnOneToOnePrimaryRelation_IfFilterOnIndexedFieldOfRelatio
// we make 3 index fetch to get the 3 address with city == "Montreal"
// then we scan all 10 users to find one with matching "address_id" for each address
// after this we fetch the name of each user
Asserter: testUtils.NewExplainAsserter().WithFieldFetches(33).WithIndexFetches(3),
Asserter: testUtils.NewExplainAsserter().WithFieldFetches(60).WithIndexFetches(3),
Member

question: What's the context for the field fetch increase? Is it because all fields are read now?

@AndrewSisley (Contributor, Author) commented Nov 29, 2024

Noted in the description. It is a large increase due to the lack of indexing (and other potential optimisations), so we end up doing a full scan for every parent record - the field count here has gone from (3*10*1 + (1*3)) to (3*10*2). (Note: if you rename address to baddress on the develop branch you get the same result.)
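Breaking that arithmetic down (my reading of the factors, based on the test comment above, so treat the exact grouping as an assumption): the old count of 33 was 3 addresses x 10 users x 1 field scanned for the filter, plus the 1 name field x 3 matched users; the new count of 60 is 3 addresses x 10 users x 2 selected fields, because filtering now only happens once all selected fields have been read.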

Contributor

I missed the part in the description explaining why the number of fields fetched increased.

Previously, filtering would be done as soon as all the fields for the filter had been read into memory. Now filtering is done once all selected fields have been read.

Is it this one?

And what do you mean by "lack of indexing"?

I still don't understand why it fetches 6 fields for every User doc.

// document is the type responsible for fetching documents from the datastore.
//
// It does not filter the data in any way.
type document struct {
Member

suggestion: Similar renaming suggestion as above.

Contributor Author

Same answer as above

Member

same comment as above: I don't think document is the best naming here; it's too generic/general and could be interpreted in a myriad of ways (which I initially did).

In addition to my above comment, from a purely grammatical POV, fetcher conveys an action of some kind, which makes sense since the fetchers are a system for pulling and processing key-values into a consistent output form. Whereas document is a noun and indicates a static object. (This grammar argument isn't my primary concern with the naming, just an observation.)

@jsimnz (Member) commented Dec 2, 2024

reviewing now

"github.com/sourcenetwork/defradb/internal/keys"
)

// document is the type responsible for fetching documents from the datastore.
Member

question/suggestion: Is this only for active documents, or does it work for both?

Perhaps this suggestion, if it is only for active documents:

Suggested change
// document is the type responsible for fetching documents from the datastore.
// document is the type responsible for fetching only active documents from the datastore.

@fredcarle (Collaborator) left a comment

There is going to be a lot of refactoring needed when integrating core-kv but I understand this is a change in the right direction. Looking forward to the follow-up PR.

Make sure the other comments are resolved before merging.

Comment on lines +48 to +75
func NewDocumentFetcher() Fetcher {
return &wrapper{}
}

func (f *wrapper) Init(
ctx context.Context,
identity immutable.Option[acpIdentity.Identity],
txn datastore.Txn,
acp immutable.Option[acp.ACP],
col client.Collection,
fields []client.FieldDefinition,
filter *mapper.Filter,
docMapper *core.DocumentMapping,
showDeleted bool,
) error {
f.identity = identity
f.txn = txn
f.acp = acp
f.col = col
f.fields = fields
f.filter = filter
f.docMapper = docMapper
f.showDeleted = showDeleted

return nil
}

func (f *wrapper) Start(ctx context.Context, prefixes ...keys.Walkable) error {
Collaborator

thought: I know why you're doing this but the pattern New -> Init -> Start... 🤮 lol. Hopefully this will be changed soon.
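For readers following along, the lifecycle being reacted to looks roughly like this when wired together; the signatures are the ones shown in the wrapper above, but the argument variables are placeholders rather than a real call site:

// Hypothetical call sequence, for illustration only.
f := fetcher.NewDocumentFetcher()
if err := f.Init(ctx, identity, txn, acp, col, fields, filter, docMapper, showDeleted); err != nil {
	return err
}
if err := f.Start(ctx, prefixes...); err != nil {
	return err
}
// ...then iterate documents via the Fetcher interface.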

}
}

func (f *deleted) NextDoc() (immutable.Option[string], error) {
Collaborator

praise: I wasn't sure initially when I looked at this implementation but now I really like it! Nice simple way to keep the docs in order.
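For anyone else reviewing, the ordering idea being praised here can be sketched generically as a two-way merge of docID streams that are each already sorted (an illustrative sketch, not the actual NextDoc implementation):

// mergeSortedDocIDs merges two ascending docID sequences into one ascending
// sequence, always yielding the smaller head next. Illustrative sketch only.
func mergeSortedDocIDs(active, deleted []string) []string {
	out := make([]string, 0, len(active)+len(deleted))
	i, j := 0, 0
	for i < len(active) && j < len(deleted) {
		if active[i] <= deleted[j] {
			out = append(out, active[i])
			i++
		} else {
			out = append(out, deleted[j])
			j++
		}
	}
	out = append(out, active[i:]...)
	return append(out, deleted[j:]...)
}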

@islamaliev (Contributor) left a comment

Overall this looks really great. I had a hard time comprehending what was happening in the old code. This one is well-structured and easy to understand. Thanks, Andy.

I just have a serious concern about the number of fields that the executor is reporting now. I hope we can clarify/solve it before it's merged.

Comment on lines +21 to +27
activeFetcher fetcher
activeDocID immutable.Option[string]

deletedFetcher fetcher
deletedDocID immutable.Option[string]

currentFetcher fetcher
Contributor

suggestion: it would be nice to have per-field docs explaining the differences between the fetchers, especially between activeFetcher and currentFetcher.


Comment on lines +39 to +46
// The most recently yielded item from kvResultsIter.
currentKV keyValue
// nextKV may hold a datastore key value retrieved from kvResultsIter
// that was not yet ready to be yielded from the instance.
//
// When the next document is requested, this value should be yielded
// before resuming iteration through the kvResultsIter.
nextKV immutable.Option[keyValue]
Contributor

praise: great comments. Thanks
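As a side note for readers, the nextKV field documented above is a one-item look-ahead buffer; a generic, self-contained sketch of that pattern (not the defradb fetcher code itself) looks like this:

// lookahead wraps a pull-style iterator and can buffer one value that was
// read past a document boundary, so it is yielded before iteration resumes.
type lookahead[T any] struct {
	next     func() (T, bool)
	buffered *T
}

func (it *lookahead[T]) Next() (T, bool) {
	if it.buffered != nil {
		v := *it.buffered
		it.buffered = nil
		return v, true
	}
	return it.next()
}

// pushBack stores a value that was read but is not yet ready to be yielded.
func (it *lookahead[T]) pushBack(v T) { it.buffered = &v }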

Comment on lines +109 to +115
if dsKey.DocID != f.currentKV.Key.DocID {
f.currentKV = keyValue{
Key: dsKey,
Value: res.Value,
}
break
}
Contributor

question: if they are equal, don't we need to update f.currentKV.Value?

return immutable.Some[EncodedDocument](&doc), nil
}

func (f *document) appendKv(doc *encodedDocument, kv keyValue) error {
Contributor

nitpick: rename to appendKV

Comment on lines +78 to +81
prefixes = make([]keys.DataStoreKey, 0, len(uniquePrefixes))
for prefix := range uniquePrefixes {
prefixes = append(prefixes, prefix)
}
Contributor

suggestion: why not write it like this:

i := 0
for prefix := range uniquePrefixes {
	prefixes[i] = prefix
	i += 1
}
prefixes = prefixes[0:len(uniquePrefixes)]

This will reuse existing memory.

Collaborator

I'm pretty sure the way Andy did it also uses existing memory because the capacity has been set.

Contributor

capacity is good, but make always allocates a new array. The old one is discarded.

Collaborator

Ah, I had missed the reuse of prefixes. Given that it's always going to be a small array, I'm not sure it has much impact either way, but Andy can decide which one he prefers.
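To make the trade-off concrete, here is a small self-contained Go illustration of the two approaches being discussed (generic names, not the fetcher code): make with a capacity still allocates a fresh backing array, while writing into the existing slice and reslicing reuses its backing array.

package main

import "fmt"

func main() {
	prefixes := []string{"a", "a", "b"}
	unique := map[string]struct{}{"a": {}, "b": {}}

	// Approach 1: allocate a new backing array sized by capacity, then append.
	fresh := make([]string, 0, len(unique))
	for p := range unique {
		fresh = append(fresh, p)
	}

	// Approach 2: overwrite the existing slice in place and reslice, reusing
	// the original backing array (valid here because len(prefixes) >= len(unique)).
	i := 0
	for p := range unique {
		prefixes[i] = p
		i++
	}
	prefixes = prefixes[:len(unique)]

	fmt.Println(fresh, prefixes) // map iteration order is not deterministic
}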

Comment on lines +85 to +87
slices.SortFunc(prefixes, func(a, b keys.DataStoreKey) int {
return strings.Compare(a.ToString(), b.ToString())
})
Contributor

suggestion: I assume the array of prefixes is not going to be large in any reasonable case, so this approach would be fine.

ToString() is not the cheapest, though, and will be called on the order of 2*N*log(N) times. We could make it just N.

The sorting itself could also be done radix-style, where we first sort by CollectionRootID, then by InstanceType... This won't give us the exact scanner sequence though.
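One hedged way to get ToString() down to N calls, as suggested, is a decorate/sort/undecorate pass. This sketch reuses the identifiers visible in the snippet above (keys.DataStoreKey, slices.SortFunc, strings.Compare); everything else is illustrative:

// sortPrefixes precomputes each prefix's string form once, sorts by it,
// and writes the keys back in sorted order. Illustrative sketch only.
func sortPrefixes(prefixes []keys.DataStoreKey) {
	type decorated struct {
		key keys.DataStoreKey
		str string
	}
	items := make([]decorated, len(prefixes))
	for i, p := range prefixes {
		items[i] = decorated{key: p, str: p.ToString()} // ToString called N times
	}
	slices.SortFunc(items, func(a, b decorated) int {
		return strings.Compare(a.str, b.str)
	})
	for i, d := range items {
		prefixes[i] = d.key
	}
}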


Labels
area/query (Related to the query component), code quality (Related to improving code quality), refactor (This issue specific to or requires *notable* refactoring of existing codebases and components)

Development
Successfully merging this pull request may close this issue: Cleanup fetcher

5 participants