
Story: Large Data Support / Incremental Summaries / Automatic Chunk Reuse (WIP) #11572

Closed
dstanesc opened this issue Aug 17, 2022 · 21 comments

@dstanesc
Contributor

dstanesc commented Aug 17, 2022

M2.1 : Large Data Support / Incremental Summaries / Automatic Chunk Reuse

This story provides execution details for the Epic: Large Data Support / M2 / Automatic chunk reuse strategy.

Introduction

The goal of the current work item is to reduce the size of periodic summary uploads by reusing portions of previous summaries for unchanged data. The proposed strategy is to identify and de-duplicate redundant data on the client side using DDS-agnostic techniques such as content-based slicing with a rolling hash.

This area is actively researched in academia and industry, traditionally for differential backup, data reduction and de-duplication software. More recently, similar techniques have been used to model the persistence layer of distributed database systems.
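
To make the technique concrete, below is a minimal TypeScript sketch of content-defined slicing driven by a Buzhash-style rolling hash; the window size, mask and byte table are illustrative choices for the sketch, not the parameters used by the evaluated libraries.

// Minimal content-defined chunking sketch (illustrative, not the production algorithm).
// A cut point is declared whenever the rolling hash of the last WINDOW bytes
// matches a bit mask, so identical content produces identical boundaries
// regardless of where earlier insertions or deletions happened.
const WINDOW = 48;             // sliding window size in bytes (illustrative)
const MASK = (1 << 13) - 1;    // ~8 KiB average chunk size (illustrative)

// Deterministic pseudo-random per-byte values, good enough for the sketch.
const TABLE = new Uint32Array(256).map((_, i) => (i * 2654435761) >>> 0);

function rotl(v: number, n: number): number {
  return ((v << n) | (v >>> (32 - n))) >>> 0;
}

function computeBoundaries(buf: Uint8Array): number[] {
  const offsets: number[] = [];
  let hash = 0;
  for (let i = 0; i < buf.length; i++) {
    // Slide the window: rotate, add the incoming byte, remove the outgoing byte.
    hash = (rotl(hash, 1) ^ TABLE[buf[i]]) >>> 0;
    if (i >= WINDOW) {
      hash = (hash ^ rotl(TABLE[buf[i - WINDOW]], WINDOW % 32)) >>> 0;
    }
    if (i >= WINDOW && (hash & MASK) === 0) {
      offsets.push(i + 1);     // cut point after the current byte
    }
  }
  if (offsets[offsets.length - 1] !== buf.length) {
    offsets.push(buf.length);  // final partial chunk
  }
  return offsets;
}

Because cut points depend only on local content, a buffer edited in one place keeps most of its boundaries, and therefore most of its chunks, unchanged; that is the property the summary reuse builds on.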

Scoping Down

Multiple algorithms satisfy the incremental summary use case. However, based on industry adoption, recent academic research and the libraries available for reuse, the scope of the investigation is narrowed down to the Gear fingerprint (i.e. FastCDC) and the cyclic polynomial (i.e. Buzhash).

Wasm

Content-defined chunking, by the nature of computing and evaluating hashes byte by byte, is CPU and data intensive. To address this concern, most of the reviewed solutions are implemented in natively compiled languages (e.g. C, Go, Rust). This challenges Fluid's web / JavaScript-centric philosophy. The current thinking is to isolate the chunking process in a reusable WebAssembly module, ideally created as a wrapper over existing open-source CDC libraries. During the investigation phase it is convenient to combine multiple algorithms in a single module; this is, however, not ideal for production software, which should limit the loaded code size to the opted-in features.

Evaluation Library

This repository is used to generate the CDC wasm library.

A wasm library for content-based slicing: a convenience wrapper over existing rolling hash implementations in Rust, such as those provided by fastcdc, asuran chunker, etc.

Usage NodeJS

npm install @dstanesc/wasm-chunking-node-eval
import {compute_chunks_buzhash, compute_chunks_fastcdc} from "@dstanesc/wasm-chunking-node-eval";

Usage Webpack

npm install @dstanesc/wasm-chunking-webpack-eval
import {compute_chunks_buzhash, compute_chunks_fastcdc} from "@dstanesc/wasm-chunking-webpack-eval";

API

const buf = ...
// mask 0b11111111111111
const offsets_buz = compute_chunks_buzhash(buf, 15).values(); 
// chunk sizes: min 16 KiB, avg 32 KiB, max 64 KiB
const offsets_fast = compute_chunks_fastcdc(buf, 16384, 32768, 65536).values();   
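
A possible follow-up step, assuming the returned offsets are ascending cut points that end at buf.length (the exact contract of the eval library may differ), is to slice the buffer into the corresponding chunks:

const toChunks = (buf: Uint8Array, offsets: Iterable<number>): Uint8Array[] => {
  const chunks: Uint8Array[] = [];
  let start = 0;
  for (const end of offsets) {
    chunks.push(buf.subarray(start, end));  // zero-copy view of the chunk
    start = end;
  }
  if (start < buf.length) {
    chunks.push(buf.subarray(start));       // trailing remainder, if any
  }
  return chunks;
};

const chunks_fast = toChunks(buf, offsets_fast);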

Benchmarks

Performance Results

  1. FastCDC consistently stands out for relative performance, roughly 10x faster than Buzhash.
  2. Absolute performance is also good, chunking roughly 0.5 MiB / ms in my local Firefox environment.

More details on the chunking speed benchmark page.

Stability Results

The good

  1. Both FastCDC and Buzhash display remarkable reuse behavior on json and serialized (msgpack encoded) data.
  2. One surprising insight is that LZ4 compression also preserves a very good reuse rate, almost equivalent to that of the json and msgpack data.
  3. Sparse modifications (see the bottom charts) applied to the json, msgpack and LZ4 data formats also show very good reusability.

The bad

  1. CDC on the LZ77+Huffman compression format (i.e. Pako) shows NO ability to reuse blocks.
  2. Base64 encoding also collapses the reuse rate.

More details on the chunking stability benchmark page.

Elected Chunking Algorithm

After the evaluation phase, FastCDC's stability and speed appear to be the closest match for the content-defined chunking needs of summarization. A dedicated library was created for Rust-based wasm generation; it relies on the leading Rust implementation of FastCDC.

The purpose-built library for Fluid usage is wasm-chunking-fastcdc.

Modularization Strategy

Introduce a standalone @fluid-experimental/content-chunking Fluid component.

  1. The module will be evaluated first in the context of PropertyDDS. This activity should require minimal changes.
  2. The same module can be used to experiment with content-defined chunking in a larger context.
  3. The module can be used to experiment with content-defined chunking combined with LZ4 compression (appealing, considering the stability benchmark results above).

Usage:

const buffer = ...
const chunkingConfig: ChunkingConfig = { avgChunkSize: 64 * 1024, chunkingStrategy: ChunkingStrategy.ContentDefined };
const contentChunker: IContentChunker = createChunkingMethod(chunkingConfig);
const chunks: Uint8Array[] = contentChunker.computeChunks(buffer);

Note: two chunking strategies are proposed at this stage: the new ChunkingStrategy.ContentDefined and ChunkingStrategy.FixedSize for backward compatibility.
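
For orientation, here is a hedged sketch of the type shapes the usage snippet above assumes; the actual declarations in @fluid-experimental/content-chunking may differ.

// Hedged sketch of the assumed API surface; actual declarations may differ.
enum ChunkingStrategy {
  FixedSize,        // existing behavior, kept as the default
  ContentDefined,   // FastCDC-based content-defined chunking
}

interface ChunkingConfig {
  avgChunkSize: number;               // target average chunk size in bytes
  chunkingStrategy: ChunkingStrategy;
}

interface IContentChunker {
  computeChunks(buffer: Uint8Array): Uint8Array[];
}

declare function createChunkingMethod(config: ChunkingConfig): IContentChunker;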

POC

A PropertyDDS-based application and a convenience library are used to analyze the functional feasibility and the API evolution needs.

The SharedPropertyTree changes are minimal, but configuration needs (e.g. for chunk size and chunking strategy) stand out.

Chunk Deduplication

Two IDocumentStorageService implementations are relevant to the blob/chunk deduplication discussion:

  1. ShreddedSummaryDocumentStorageService, using the SummaryTreeUploadManager to upload the blobs. ShreddedSummaryDocumentStorageService is used in the context of the Tinylicious relay. Notably, it appears to already provide chunk deduplication, i.e. it only uploads the blobs that are not referenced by the prior summary snapshot (a minimal illustration of the check follows this list).
  2. WholeSummaryDocumentStorageService, using the WholeSummaryUploadManager to upload the blobs. WholeSummaryDocumentStorageService is used in the context of Routerlicious and inherently FRS, but lacks the ability to deduplicate chunks.
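
A minimal TypeScript illustration of the deduplication check referenced in item 1, independent of the concrete upload manager; previousBlobShas and upload are illustrative stand-ins for whatever index of prior blob hashes and upload primitive are available.

import { createHash } from "crypto";

// Hedged illustration: hash each chunk blob and upload only those whose hashes
// are not already referenced by the prior summary snapshot.
async function uploadIfNew(
  blob: Uint8Array,
  previousBlobShas: Set<string>,
  upload: (blob: Uint8Array) => Promise<void>,
): Promise<{ sha: string; reused: boolean }> {
  const sha = createHash("sha1").update(blob).digest("hex");
  if (previousBlobShas.has(sha)) {
    return { sha, reused: true };   // skip: identical blob already in storage
  }
  await upload(blob);
  return { sha, reused: false };
}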
@dstanesc
Contributor Author

@DLehenbauer @noencke @milanro @vladsud, initial thoughts on using CDC to achieve incremental summaries. To evolve.

@DLehenbauer
Contributor

DLehenbauer commented Aug 17, 2022

This is a great summary. Thank you for opening this issue!

BTW - I removed 'M3' from the roadmap and renamed this issue to refer to M2.

@dstanesc
Contributor Author

FYI, I created and published a first draft of the CDC wasm module. See the details above.

@dstanesc
Contributor Author

dstanesc commented Sep 2, 2022

@DLehenbauer @noencke @vladsud, @justus-camp & @milanro FYI: updated the Benchmarks section with a summary of the findings and links to the speed and stability benchmark repos, including results.

@dstanesc
Contributor Author

dstanesc commented Sep 27, 2022

@DLehenbauer @vladsud @milanro Updated the initial comment above with notes about the current status, the concrete library for chunking and the POC bits for review. See the Elected Chunking Algorithm, Modularization Strategy, POC and Chunk Deduplication paragraphs. At this stage a few feedback and help topics stand out:

  1. Feature Configuration.
    A. The app and/or feature rollout needs the ability to switch between content-defined and fixed-size chunking. Fixed-size chunking is the current behavior and needs to be preserved as the default.
    B. The (average) chunk size is important for tuning chunking stability and is domain specific most of the time.
    The solution we adopted for configuring compressed trees (aka derived property tree classes) does not seem to be the most appropriate vehicle for configuring chunking.

  2. WholeSummary -- Blob Deduplication Instrumentation Priority. My review indicates that the ShreddedSummary and its supporting stack already have the ability to deduplicate summary blobs. Technically, IMHO, ShreddedSummary represents the superior way forward. The question remains whether porting the dedup pattern to the WholeSummary stack is still a desirable investment.

  3. Building Service & Client Side. The relevant upload managers (shredded summary and whole summary) are colocated in the second build. It would be extremely useful if we (as open source developers) could access more information on:
    A. The prerequisites to build the server. In my case the manual build fails (error log attached); the Docker build succeeds but does not deliver the libraries.
    B. How to build the client and server/routerlicious/packages/services-client together so that the build results are usable for development (testing, debugging).

Note: In my current context the classes get pulled in indirectly (i.e. webpack://@fluid-experimental/shared-property-map-hello/../../../../node_modules/@fluidframework/routerlicious-driver/node_modules/@fluidframework/server-services-client/lib/summaryTreeUploadManager.js), even if directly referenced in the client's package.json. I believe another variable is the hoisting outcome, which favors older versions because of version inconsistencies in main (e.g. 0.1038.1000 vs 0.1038.2000).

  4. Does the team support the proposed modularization strategy? See the @fluid-experimental/content-chunking package.

  5. The actual chunking is performed in wasm. We created an external library to be used by @fluid-experimental/content-chunking. Unlike the evaluation libs, which are chunkier -- pun intended :) -- and also suffer from particular licensing drawbacks, the new library targets solely the winning algorithm (see the chunking speed and stability benchmarks). It is written in Rust. It is part of my open source contribution.

  6. Blob deduplication evidence. I was working around the inability to regenerate the shredded summary uploader by directly modifying the .js files nested under the root node_modules (i.e. webpack://@fluid-experimental/shared-property-map-hello/../../../../node_modules/@fluidframework/routerlicious-driver/node_modules/@fluidframework/server-services-client/lib/summaryTreeUploadManager.js) to analyze the current behavior.
    A. Below are the results of adding 0.05 MB of data to a preexisting summary of 1.2 MB, with a classic SharedPropertyTree (no compression) configured with avgChunkSize: 48 KB. The hello app uses synthetic material data to trigger the incremental summary behavior.
    B. It deserves mentioning that after some inactivity all blobs need to be published (no reuse), which likely relates to the history pruning in PropertyDDS, but other factors may play a role as well. Still an open investigation.

Publish blob summaryChunk_0/fc5b7d03ab1b62cb282e6bf23483609b5e008b5f 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_1/8c3637e1f9451cdb04b80e8229037d30fbc1dc06 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_2/55d38d2dc72c84b7d83e67a3d3289602ecbd0a1b 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_3/801daf7eac5abd985c542d448dbdd9f40ac7b222 125824 bytes summaryTreeUploadManager.js:76:25
Skip blob summaryChunk_4 b8f92c33f126806efa5f9591183fa9f2aba0a39a 57040 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_5 6d0b4cc31d30b54d99dece96515c4e155d6e3928 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_7 dbe71c48872f32042f9e17ade06f9301d68a1465 52216 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_6 7303a83c892539e3e9242cc954164b1197f6c181 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_8 3da1b465805486bfc50f90fe55adcaf259de3654 33392 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_9 a4b861322bb775bec5c6c2e6cf521c4668c84180 80404 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_10 57175dec0bd91aef34c86538d5c11fa2775db308 84788 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_11 498b6581aa9e039827bb324ed3c7bf12be9eeece 36928 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_12 9e0d970dfdbea6fc7a2c437bc7b8f5155216e706 56972 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_13 2e880a02d68c05eb628e4c63f0c4d56044af1278 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_14 c5ae10e1472f9290c32d2f08337e25bb634e2972 77852 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_15 b1b884fed584bd3ae06c5fdeeb03d15450f2981b 33328 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_16 98260eed7497a8032e88663a501e1f37ad02574f 131072 bytes summaryTreeUploadManager.js:83:25
Skip blob summaryChunk_17 cb49d29a504839a21565e5d7870197b03c86a87e 49464 bytes summaryTreeUploadManager.js:83:25
Publish blob summaryChunk_18/57ef150ffd292824ab3479b36c09b8e6516b197a 127260 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_19/1e5eb6e151f1a840384edb6538ef7e0b5c4b1d16 131072 bytes summaryTreeUploadManager.js:76:25
Publish blob summaryChunk_20/34abc96473227dc66fa63fb00ca3d8480a8c49aa 27552 bytes summaryTreeUploadManager.js:76:25

The branch hosting the POC is pdds-summary-chunking-poc

@dstanesc
Contributor Author

B. It deserves mentioning that after some inactivity all blobs need to be published (no reuse), which likely relates to the history pruning in PropertyDDS, but other factors may play a role as well. Still an open investigation.

It appears to me that the actual reason resides in the blobsShaCache lifecycle, and it looks more like a bug. The cache becomes empty after a period of idleness. While called a cache, in my view it actually represents a convenient projection/transformation/index on the previous summary snapshot that helps with blob existence queries. I think a construct associating the ISnapshotTree with the set of hashes of its blob leaves, possibly encapsulated by the already existing ISnapshotTreeEx, would allow better lifecycle correlation between the summary snapshot and its corresponding blob hash set. Any other thoughts?
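
A hedged sketch of the proposed association; apart from its resemblance to ISnapshotTree, the names below are illustrative, not existing Fluid APIs.

// Illustrative association between a snapshot and the hashes of its blob leaves,
// so the hash set shares the snapshot's lifecycle instead of living in a cache.
interface ISnapshotTreeLike {
  blobs: { [path: string]: string };            // blob id per path
  trees: { [path: string]: ISnapshotTreeLike }; // nested subtrees
}

interface ISnapshotTreeWithBlobShas {
  tree: ISnapshotTreeLike;
  blobShas: Set<string>;
}

function collectBlobShas(tree: ISnapshotTreeLike, acc = new Set<string>()): Set<string> {
  for (const id of Object.values(tree.blobs)) {
    acc.add(id);
  }
  for (const subtree of Object.values(tree.trees)) {
    collectBlobShas(subtree, acc);
  }
  return acc;
}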

@DLehenbauer
Contributor

Wow. This is really amazing work!

  1. For configuration, I think you'll want to add to IContainerRuntimeOptions similar to what @justus-camp did for op compression.

  2. For a POC, I think it's up to you if it's easier to start with the shredded or whole document summarizer and add CDC. Longer-term, I think your work might converge the two? (i.e., you could perhaps configure the summarizer to produce as many or as few chunks as desired, including a whole document summary?)

  3. I wasn't able to resolve this, but I'll have another look tomorrow when I'm at my desktop. As a starting point, did you install the node-gyp dependencies listed in the Dockerfile?

  4. Could you explain a bit more? Are you asking about separating the CDC implementation into its own small module?

  5. Nice!

  6. I have a hypothesis. I'll ask some questions and get back to you soon.

@dstanesc
Contributor Author

dstanesc commented Oct 4, 2022

@DLehenbauer @milanro

Thank you Daniel!

A few quick thoughts:

  1. For configuration, I think you'll want to add to IContainerRuntimeOptions similar to what @justus-camp did for op compression.

My assessment finds 2 missing links for the goal at hand:

A. The place we want to instrument now is propertyTree#summarizeCore, which seems unreachable by the current propagation of IContainerRuntimeOptions or derived options. The closest layer to receive ISummarizeOptions (the relevant fragment of IContainerRuntimeOptions) stays in the runtime, namely the SummaryGenerator#summarizeCore component, which won't propagate it further to the DDS itself.

B. I could not find a way to provide IContainerRuntimeOptions when instantiating a container via the AzureClient (this is how HxGN Nexus uses Fluid today). If I understand correctly, the limitation lies in the AzureClient and its provider for the runtime factory, which creates its own runtime options.

This state of affairs looks rather blocking for the feature unless some work is done on both A and B. Is this a work package we should consider taking on in our group?
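
To make the gap concrete, below is a hypothetical configuration surface; the chunking option and its propagation do not exist today, this only illustrates what A and B would require.

// Hypothetical shape -- not an existing Fluid API.
interface IChunkingOptions {
  chunkingStrategy: "fixedSize" | "contentDefined";
  avgChunkSize?: number;   // bytes; domain-specific tuning knob
}

interface IContainerRuntimeOptionsWithChunking {
  // ...existing IContainerRuntimeOptions fields...
  chunking?: IChunkingOptions;
}

// A. the options would have to reach the DDS summarization path
//    (e.g. propertyTree#summarizeCore), not stop at the runtime layer;
// B. AzureClient would need to accept and forward such runtime options
//    instead of creating its own.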

  3. ... As a starting point, did you install the node-gyp dependencies listed in the Dockerfile?

Thanks for the hint. Yes, the dependencies seemed to be all right, but I will revisit.

  4. Could you explain a bit more? Are you asking about separating the CDC implementation into its own small module?

Indeed. This separation seems appropriate to me, especially when flagged as experimental, but I wanted to know your opinion about it.

@DLehenbauer
Contributor

I'm sorry, I misunderstood your question about 1. Was the problem with the derived classes that the matrix of options is getting too big (i.e., too many subclasses required)?

FYI - I did successfully build '/server/routerlicious' on Ubuntu (WSL) by:

sudo apt update
sudo apt install -y python3 make git curl g++ openssl libssl-dev ca-certificates
cd server/routerlicious
npm i
npm run build

Were you doing the same?

Regarding 4, we mostly use 'experimental' for things that have a risk of customer confusion and/or temporarily need special exemptions from our linting/testing policies (e.g., bulk code imports). I don't think your work necessarily needs to be in experimental, but it's fine to put it there if it's convenient.

Regarding 6, we should loop in @znewton when he is back in office next week.

@dstanesc
Contributor Author

dstanesc commented Oct 4, 2022

FYI - I did successfully build '/server/routerlicious' on Ubuntu

Yes, I did the same, and it now fails during the npm i phase with the attached error. Interestingly enough, I recall having it working in the more remote past. I can only blame the kernel upgrades (and the Ubuntu version, now 22.04 on this machine) done since then. The stretch-slim image used by Docker seems antique (i.e. kernel 4.9 compared to my 5.15.0), which alone can justify header files diverging. The other option I am trying, without much success for now, is to build selectively, as I really only need routerlicious/packages/services-client in this exercise. Otherwise the solution would be to downgrade my environment somehow or virtualize it.

@dstanesc
Contributor Author

dstanesc commented Oct 4, 2022

I'm sorry, I misunderstood your question about 1. Was the problem with the derived classes that the matrix of options is getting too big (i.e., too many subclasses required?)

Yes, this is the case, and the options I see are listed above.

@dstanesc
Contributor Author

dstanesc commented Oct 5, 2022

One Fluid build problem seems to be the strict dependency on openssl 1.1.1. The openssl version in Ubuntu 22.04 is 3.0. Documenting the fix in case anyone else needs it:

Check your current version

dpkg -l | grep libssl-dev

Update openssl to 1.1.1 if different

wget http://security.ubuntu.com/ubuntu/pool/main/o/openssl/openssl_1.1.1f-1ubuntu2.16_amd64.deb
wget http://security.ubuntu.com/ubuntu/pool/main/o/openssl/libssl-dev_1.1.1f-1ubuntu2.16_amd64.deb
sudo dpkg -i libssl-dev_1.1.1f-1ubuntu2.16_amd64.deb
sudo dpkg -i openssl_1.1.1f-1ubuntu2.16_amd64.deb

@dstanesc
Contributor Author

dstanesc commented Oct 5, 2022

The following sequence works:

npm run clean
git clean -xfd
npm install
npm run build:fast -- --nolint --install 
cd server/routerlicious
npm i
npm run build

Incremental fluid-build with symlinks within the monorepo works.

alias fb='clear && node "$(git rev-parse --show-toplevel)/node_modules/.bin/fluid-build"'
fb --install --symlink
fb @fluid-experimental/property-inspector 

Incremental fluid-build with symlinks across monorepos fails. This is relevant as I need to change @fluidframework/server-services-client (pulled in via @fluidframework/azure-client and @fluidframework/routerlicious-driver in my POC).
It is unclear whether I am doing something wrong or whether this needs to be reported as a bug.

alias fb='clear && node "$(git rev-parse --show-toplevel)/node_modules/.bin/fluid-build"'
fb --install --symlink:full
fb --all @fluid-experimental/property-inspector
Symlink in full mode
ERROR: Unexpected error. @fluidframework/build-tools: tsc not found for project reference
Error: @fluidframework/build-tools: tsc not found for project reference
    at TscTask.addDependentTasks (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/leaf/tscTask.js:88:23)
    at TscTask.initializeDependentTask (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/leaf/leafTask.js:76:14)
    at NPMTask.initializeDependentTask (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/npmTask.js:20:18)
    at NPMTask.initializeDependentTask (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/npmTask.js:20:18)
    at NPMTask.initializeDependentTask (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/npmTask.js:20:18)
    at ConcurrentNPMTask.initializeDependentTask (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/npmTask.js:20:18)
    at NPMTask.initializeDependentTask (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/tasks/npmTask.js:20:18)
    at /home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/buildGraph.js:271:27
    at Map.forEach (<anonymous>)
    at BuildGraph.filterPackagesAndInitializeTasks (/home/dstanesc/code/FluidFramework.Debug/build-tools/packages/build-tools/dist/fluidBuild/buildGraph.js:263:28)

@DLehenbauer
Contributor

@dstanesc, @milanro - I made some progress on the build failures reported above (see #12400).

They turned out to be unrelated to synchronizing the client/server monorepos (i.e., they repro without 'symlink:full'.)

@DLehenbauer
Contributor

@dstanesc, @milanro - I was able to build --symlink:full with the above fix yesterday, with the caveat that it fails our type validation tests. The type validation tests are generated *.ts files that mix the use of Fluid types from the current source and the previously published packages in order to catch unintentional build breaks.

You can identify the type validation test failures by the path of the compilation error, but it's annoying. Using fb --all --script tsc will build only the commonjs module (no tests), which might be an acceptable workaround.

I wasn't sure if your plans were to continue building deduping at the driver level, or if you were going to try implementing deduping as a layer above the driver, in which case I think you'd avoid the hassles of --symlink:full?

@dstanesc
Contributor Author

I wasn't sure if your plans were to continue building deduping at the driver level, or if you were going to try implementing deduping as a layer above the driver, in which case I think you'd avoid the hassles of --symlink:full?

Thank you! I am still investigating the possibilities for the WholeSummary. However, I am strongly inclined to follow your suggestion and lift the implementation above the driver. That should reduce the hassle, as you mentioned.

FYI, in case you have not seen it: in the meantime I sent a pull request for the experimental chunking module alone.

@dstanesc
Contributor Author

dstanesc commented Oct 14, 2022

@DLehenbauer @vladsud @milanro

Just created a draft pull request for the deduplication, localized above the driver, in the runtime.
Please review at your convenience. Looking for your feedback before polishing the implementation and sending an official pull request.

@dstanesc
Contributor Author

@DLehenbauer @vladsud Created a mergeable PR (#12699) isolating only the changes relevant to the deduplication feature, for review convenience. It looks like the repo-policy-check still needs attention from a Fluid maintainer to run the workflow, apparently due to first-time contributor status.

@microsoft-github-policy-service
Contributor

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@polar1shu

This PR looks interesting; I wonder why it wasn't merged into the master branch.

@microsoft-github-policy-service
Contributor

This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
