Story: Large Data Support / Incremental Summaries / Automatic Chunk Reuse (WIP) #11572
Comments
@DLehenbauer @noencke @milanro @vladsud, initial thoughts on using CDC to achieve incremental summaries. To evolve.
This is a great summary. Thank you for opening this issue! BTW - I removed 'M3' from the roadmap and renamed this issue to refer to M2.
FYI. Created and published the first draft of the cdc wasm library.
@DLehenbauer @noencke @vladsud, @justus-camp & @milanro FYI. Updated the Benchmarks section with the findings summary and links to the speed and stability benchmark repos, including results.
@DLehenbauer @vladsud @milanro Updated the initial comment above with notes about the current status, the concrete library for chunking, and POC bits for review. See the Elected Chunking Algorithm, Modularization Strategy, POC, and Chunk Deduplication paragraphs. At this stage, a few feedback and help topics stand out:
```
Publish blob summaryChunk_0/fc5b7d03ab1b62cb282e6bf23483609b5e008b5f 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_1/8c3637e1f9451cdb04b80e8229037d30fbc1dc06 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_2/55d38d2dc72c84b7d83e67a3d3289602ecbd0a1b 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_3/801daf7eac5abd985c542d448dbdd9f40ac7b222 125824 bytes summaryTreeUploadManager.js:76:25
Skip blob summaryChunk_4 b8f92c33f126806efa5f9591183fa9f2aba0a39a 57040 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_5 6d0b4cc31d30b54d99dece96515c4e155d6e3928 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_7 dbe71c48872f32042f9e17ade06f9301d68a1465 52216 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_6 7303a83c892539e3e9242cc954164b1197f6c181 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_8 3da1b465805486bfc50f90fe55adcaf259de3654 33392 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_9 a4b861322bb775bec5c6c2e6cf521c4668c84180 80404 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_10 57175dec0bd91aef34c86538d5c11fa2775db308 84788 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_11 498b6581aa9e039827bb324ed3c7bf12be9eeece 36928 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_12 9e0d970dfdbea6fc7a2c437bc7b8f5155216e706 56972 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_13 2e880a02d68c05eb628e4c63f0c4d56044af1278 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_14 c5ae10e1472f9290c32d2f08337e25bb634e2972 77852 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_15 b1b884fed584bd3ae06c5fdeeb03d15450f2981b 33328 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_16 98260eed7497a8032e88663a501e1f37ad02574f 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_17 cb49d29a504839a21565e5d7870197b03c86a87e 49464 bytes summaryTreeUploadManager.js:83:25
Publish blob summaryChunk_18/57ef150ffd292824ab3479b36c09b8e6516b197a 127260 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_19/1e5eb6e151f1a840384edb6538ef7e0b5c4b1d16 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_20/34abc96473227dc66fa63fb00ca3d8480a8c49aa 27552 bytes summaryTreeUploadManager.js:76:25
```
The branch hosting the POC is pdds-summary-chunking-poc.
Appears to me that the actual reason resides in the
Wow. This is really amazing work!
Thank you, Daniel! A few quick thoughts:
My assessment finds two missing links for the goal at hand:
A. The place we want to instrument now is in propertyTree#summarizeCore, which seems unreachable by the current propagation of
B. Could not find a way to provide
This state of affairs looks rather blocking for the feature unless some work is done for both A. and B. Is this a package we should consider taking in our group?
Thanks for the hint. Yes, the dependencies seemed to be alright, but I will revisit.
Indeed. This separation seems appropriate to me, especially when flagging it as experimental, but I wanted to know your opinion about it.
I'm sorry, I misunderstood your question about 1. Was the problem with the derived classes that the matrix of options was getting too big (i.e., too many subclasses required)?

FYI - I did successfully build '/server/routerlicious' on Ubuntu (WSL) by:

```sh
sudo apt update
sudo apt install -y python3 make git curl g++ openssl libssl-dev ca-certificates
cd server/routerlicious
npm i
npm run build
```

Were you doing the same? Regarding 4, we mostly use 'experimental' for things that have a risk of customer confusion and/or temporarily need special exemptions from our linting/testing policies (e.g., bulk code imports). I don't think your work necessarily needs to be in experimental, but it's fine to put it there if it's convenient. Regarding 6, we should loop in @znewton when he is back in office next week.
Yes, I did the same, and it now fails during the
Yes, this is the case, and the options I see are listed above.
One Fluid build problem seems to be the strict dependency on openssl 1.1.1; the openssl version in Ubuntu 22.04 is 3.0. Documenting the fix in case anyone else needs it: check your current version, and update openssl to 1.1.1 if it differs.
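A minimal sketch of those two steps, assuming a source build of the final 1.1.1 release (the 1.1.1w tarball and default install prefix are assumptions; the exact commands may differ):

```sh
# Check your current version
openssl version

# If it reports 3.x, build and install 1.1.1 from source
wget https://www.openssl.org/source/openssl-1.1.1w.tar.gz
tar -xzf openssl-1.1.1w.tar.gz
cd openssl-1.1.1w
./config
make
sudo make install
sudo ldconfig
```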
Next section works:
Incremental
Incremental
@dstanesc, @milanro - I was able to build
You can identify the type validation test failures by the path of the compilation error, but it's annoying. Using
Although I wasn't sure if your plans were to continue building deduping at the driver level, or if you were going to try implementing deduping as a layer above the driver, in which case I think you'd avoid the hassles of
Thank you! Still investigating the possibilities for the WholeSummary; however, I am strongly inclined to follow your suggestion and lift the implementation above the driver. That should reduce the hassle, as you mentioned. FYI, in case you have not seen it: I have meanwhile sent a pull request for the experimental chunking module alone.
@DLehenbauer @vladsud @milanro Just created a draft pull request for the deduplication, localized above the driver, in the runtime.
@DLehenbauer @vladsud Created a mergeable PR - 12699 - isolating only the relevant changes for the deduplication feature, for review convenience. Looks like the
This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
This PR looks interesting; I wonder why it wasn't merged into the master branch.
This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
M2.1: Large Data Support / Incremental Summaries / Automatic Chunk Reuse
This story provides execution details for the Epic: Large Data Support / M2 / Automatic chunk reuse strategy.
Introduction
The goal of the current work item is to reduce the size of periodic summary uploads by reusing portions of previous summaries for unchanged data. The proposed strategy is to identify and de-duplicate data redundancies on the client side using DDS-agnostic techniques, such as content-based slicing with a rolling hash.
This area is actively researched in academia and in industry, traditionally for differential backup, data reduction, and redundancy de-duplication software. More recently, similar techniques have been used to model the persistence of distributed database systems.
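To make the idea concrete, here is a minimal sketch of rolling-hash content-defined chunking in TypeScript; the gear table, mask, and size constants are illustrative assumptions, not the elected production parameters:

```ts
// Illustrative Gear-style rolling hash chunker (not the tuned FastCDC algorithm).
// Real implementations use a table of 256 random 32-bit values; this derivation
// is merely deterministic filler for the sketch.
const GEAR: Uint32Array = Uint32Array.from(
  { length: 256 },
  (_, i) => Math.imul(i + 1, 2654435761) >>> 0,
);

const MASK = 0x1fff;   // boundary mask => ~8 KiB average chunks
const MIN_SIZE = 2048; // never cut before this many bytes
const MAX_SIZE = 65536; // always cut at this many bytes

// Returns cut points (exclusive end offsets). Because boundaries depend only on
// local content, they realign after an edit, so unchanged regions hash to
// identical, reusable chunks.
function chunkBoundaries(data: Uint8Array): number[] {
  const boundaries: number[] = [];
  let hash = 0;
  let start = 0;
  for (let i = 0; i < data.length; i++) {
    hash = ((hash << 1) + GEAR[data[i]]) >>> 0;
    const size = i - start + 1;
    if ((size >= MIN_SIZE && (hash & MASK) === 0) || size >= MAX_SIZE) {
      boundaries.push(i + 1); // finished chunk
      start = i + 1;
      hash = 0;
    }
  }
  if (start < data.length) boundaries.push(data.length); // trailing remainder
  return boundaries;
}
```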
Scoping Down
Multiple algorithm choices satisfy the incremental summary use case. However, based on industry adoption, recent academic research, and available libraries for reuse, the investigation narrows down to the Gear fingerprint (i.e. FastCDC) and the Cyclic polynomial (i.e. Buzhash):
- Buzhash - has the widest adoption in backup software, e.g. Borg, Asuran, Casync, probabilistic B-trees, etc.
- FastCDC - fastest to date, e.g. Asuran, Rdedup, etc.
Wasm
Content-defined chunking, by the very nature of computing and evaluating hashes byte by byte, is CPU- and data-intensive. To address this concern, most of the reviewed solutions are implemented in natively compiled languages (e.g. C, Go, Rust). This aspect challenges Fluid's web/JavaScript-centric philosophy. The current thinking is to isolate the chunking process in a reusable WebAssembly module, ideally created as a wrapper over existing open-source CDC libraries. During the investigation phase it is convenient to combine multiple algorithms in a single module; this is, however, not ideal for production software, which should minimize the loaded code size to the opted-in features.
Evaluation Library
This repository is used to generate the CDC wasm library: a Wasm library for content-based slicing, built as a convenience wrapper over existing rolling hash implementations in Rust, such as those provided by fastcdc, asuran chunker, etc.
Usage NodeJS
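A sketch of what Node.js usage might look like; the package name and the compute_chunks signature (input buffer plus min/average/max chunk sizes, returning boundary offsets) are assumptions for illustration, not confirmed exports:

```ts
// Hypothetical Node.js usage of the wasm chunking library
import { compute_chunks } from "@dstanesc/wasm-chunking-fastcdc-node";
import { readFileSync } from "fs";

const buf = new Uint8Array(readFileSync("./summary.bin"));
const offsets = compute_chunks(buf, 16384, 32768, 65536); // min / avg / max bytes
console.log(`chunk boundaries: ${Array.from(offsets).join(", ")}`);
```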
Usage Webpack
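In a webpack-bundled browser app the wasm module generally loads asynchronously; a sketch under the same assumed API (whether an explicit init call is required depends on the wasm-pack target):

```ts
// Hypothetical browser/webpack usage; wasm instantiation is asynchronous
import init, { compute_chunks } from "@dstanesc/wasm-chunking-fastcdc-webpack";

export async function chunk(data: Uint8Array): Promise<Uint32Array> {
  await init(); // instantiate the wasm module once before first use
  return compute_chunks(data, 16384, 32768, 65536);
}
```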
API
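The essential surface is a single boundary-computation function; an assumed declaration:

```ts
// Assumed API shape: returns chunk boundary offsets into the input buffer
declare function compute_chunks(
  data: Uint8Array,
  minSize: number,
  avgSize: number,
  maxSize: number,
): Uint32Array;
```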
Benchmarks
Performance Results
FastCDC consistently stands out for relative performance, roughly 10x faster than Buzhash. More details at the chunking speed benchmark page.
Stability Results
The good
The bad
More details at the chunking stability benchmark page.
Elected Chunking Algorithm
After the evaluation phase, FastCDC's stability and speed appear to be the closest match for the content-defined chunking needs of summarization. A dedicated library for Rust-based wasm generation was created, relying on the leading Rust implementation of FastCDC.
The purpose-built library for Fluid usage is wasm-chunking-fastcdc
Modularization Strategy
Introduce a standalone @fluid-experimental/content-chunking Fluid component.
Usage:
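A sketch of what consuming the component could look like; the factory and interface names below are hypothetical placeholders, not the package's confirmed exports:

```ts
// Hypothetical usage of the experimental chunking component
import {
  createFastCdcChunker, // hypothetical factory
  IContentChunker,      // hypothetical interface
} from "@fluid-experimental/content-chunking";

const chunker: IContentChunker = createFastCdcChunker({ avgChunkSize: 32768 });
const payload = new Uint8Array(1024 * 1024); // e.g. a serialized summary blob
const chunks: Uint8Array[] = chunker.chunk(payload);
```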
POC
A PropertyDDS-based application and a convenience library are used to analyze functional feasibility and API evolution needs.
The SharedPropertyTree changes are minimal, but configuration needs (e.g. for chunk size and chunking strategy) stand out.
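The configuration surface could stay small; a hypothetical shape for the options that would be threaded into SharedPropertyTree at creation time (names and defaults are assumptions):

```ts
// Hypothetical chunking options surfaced on SharedPropertyTree creation
interface IChunkingConfig {
  chunkingStrategy: "fastcdc" | "buzhash" | "none";
  avgChunkSize: number; // bytes; min/max are typically derived (e.g. avg/2, avg*2)
}

const chunkingConfig: IChunkingConfig = {
  chunkingStrategy: "fastcdc",
  avgChunkSize: 32768,
};
```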
Chunk Deduplication
Two IDocumentStorageService implementations are relevant in the blob/chunk deduplication discussion:
- ShreddedSummaryDocumentStorageService is used in the context of the Tinylicious relay. Notably, it appears to already provide chunk deduplication, i.e. it only uploads the necessary blobs - those not referenced by the prior summary snapshot.
- WholeSummaryDocumentStorageService is used in the context of Routerlicious (and inherently FRS) but lacks the ability to deduplicate chunks.
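The deduplication step itself reduces to comparing content hashes against the blobs referenced by the prior summary, publishing only new chunks and re-referencing the rest - which is exactly what the Publish/Skip log above shows. A sketch of this split (type and function names are illustrative, not the actual runtime API):

```ts
import { createHash } from "crypto";

// Hashes of blobs referenced by the previous summary snapshot: hash -> blob id.
type PriorBlobs = Map<string, string>;

// Partition chunks into those that must be published and those whose prior
// blob ids can simply be reused - mirroring the Publish/Skip log above.
function dedupeChunks(
  chunks: Uint8Array[],
  prior: PriorBlobs,
): { publish: Uint8Array[]; reuse: string[] } {
  const publish: Uint8Array[] = [];
  const reuse: string[] = [];
  for (const chunk of chunks) {
    const hash = createHash("sha1").update(chunk).digest("hex");
    const existingId = prior.get(hash);
    if (existingId !== undefined) {
      reuse.push(existingId); // skip upload, reference the existing blob
    } else {
      publish.push(chunk); // new content, upload required
    }
  }
  return { publish, reuse };
}
```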