Epic: Large Data Support #10494
Comments
Hello @DLehenbauer, should we create a dedicated discussion / design ticket for this epic? Base64 removal optimization:
It looks like Property DDS already uses SummaryTreeBuilder but still encodes the blobs with base64. When this is disabled and the blobs are in binary form, it appears that the routerlicious upload manager converts them to base64 automatically. Is there any trick to force binary blobs in the SummaryTree (or is there another SummaryTreeBuilder available besides the one used by Property DDS)? The WholeSummaryUploadManager is used for uploading to routerlicious; the implementation is located in
The following method is called in order to upload
which generates the transfer form of the tree at
function
This method encodes the binary blobs with base64
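For context, a minimal sketch of the pattern described above, assuming the SummaryTreeBuilder API from @fluidframework/runtime-utils (the blob key name is illustrative):

```ts
import { SummaryTreeBuilder } from "@fluidframework/runtime-utils";

// Build a summary that attaches the payload as a binary blob instead of a
// base64 string; SummaryTreeBuilder.addBlob accepts string | Uint8Array.
function buildBinarySummary(payload: Uint8Array) {
    const builder = new SummaryTreeBuilder();
    builder.addBlob("propertySets", payload); // binary at this layer, no base64 yet
    return builder.getSummaryTree();
}

// The issue described above: even when the blob is added as a Uint8Array, the
// whole-summary upload path to routerlicious was observed to re-encode it as
// base64 before transmission, negating the saving.
```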
@milanro - Creating a new GitHub issue about base64 would be very helpful. Would you mind doing that and tagging me in it?
@milanro, @dstanesc - FYI, some initial feedback from @vladsud that I'll factor into the plan soon:

We're thinking of starting the summary work (M1) and a subset of the ops work (M1.5) in parallel. The thought is that the two of you would drive large summaries and MSFT devs would drive ops (at least initially). The reason is that MSFT has more pain around large ops due to batching, while I believe you're most interested in large summaries and the single large op scenario.

Vlad would prefer to decouple binary encoding from eliminating base64. His reasoning is that we'll need to continue to support downgrading to base64 until the protocol update has sufficient penetration on both the client and the service, and therefore he sees it as a lower priority than landing the other benefits of binary & compression, even with the 33% overhead of base64.

Vlad corrected me that incremental summaries (M3) do not depend on switching to shredded summaries (M2). Even though the "whole document" summarizer uploads a single monolithic blob, the service will still decompose that blob and assign IDs to the fragments. This means we'll have reusable blob IDs for the interior nodes and can deprioritize shredded summaries, which would then only become relevant when a document has 30+ MB of changes to upload.
I was reading through @vladsud's notes on op chunking, and one observation he made is that because chunking will amplify the number of ops a client sends, clients will be more likely to hit service rate limits (e.g., too many ops in a given period of time). I believe the current throttling algorithm tolerates short bursts, but we may need to add some form of backpressure to the large data roadmap so that clients can locally throttle transmission and/or production to avoid exceeding these limits.
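To make the backpressure idea concrete, here is a hypothetical sketch (not an existing Fluid API) of a client-side token bucket that delays sends once a burst allowance is exhausted:

```ts
// Hypothetical client-side throttle: allow bursts up to `capacity` ops,
// refill at `refillPerSecond`, and delay sends once the bucket is empty.
class OpThrottle {
    private tokens: number;
    private lastRefill = Date.now();

    constructor(
        private readonly capacity: number,
        private readonly refillPerSecond: number,
    ) {
        this.tokens = capacity;
    }

    private refill(): void {
        const now = Date.now();
        const elapsedSeconds = (now - this.lastRefill) / 1000;
        this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
        this.lastRefill = now;
    }

    /** Resolves when it is safe to send the next op. */
    public async acquire(): Promise<void> {
        this.refill();
        while (this.tokens < 1) {
            const waitMs = ((1 - this.tokens) / this.refillPerSecond) * 1000;
            await new Promise((resolve) => setTimeout(resolve, waitMs));
            this.refill();
        }
        this.tokens -= 1;
    }
}

// Usage sketch: await throttle.acquire() before submitting each chunked op.
```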
@DLehenbauer @vladsud Created a first draft for material-like synthetic data generation: fake-material-data. Experimenting with template-based data generation, which is probably a reusable pattern for other data domains. Looking forward to feedback and contributions :).
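For illustration, a hypothetical sketch of the template-based approach (not the actual fake-material-data implementation; all names are invented):

```ts
import { randomUUID } from "crypto";

// Hypothetical template-driven generator: a template maps each field name to
// a sampler function, so the same pattern can be reused for other data
// domains (materials, metrology, ...).
type Template = Record<string, () => unknown>;

function generate(template: Template, count: number): Record<string, unknown>[] {
    return Array.from({ length: count }, () =>
        Object.fromEntries(Object.entries(template).map(([field, sample]) => [field, sample()])),
    );
}

// Example: a tiny "material" record template producing plausible values.
const materialTemplate: Template = {
    id: () => randomUUID(),
    name: () => `alloy-${Math.floor(Math.random() * 1000)}`,
    density: () => 1000 + Math.random() * 9000, // kg/m^3
};

const records = generate(materialTemplate, 10_000);
console.log(records.length, records[0]);
```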
Derived from the above, I just published a materials data compression benchmark repository. Currently evaluating
@DLehenbauer @vladsud @milanro One more synthetic data generator: fake-metrology-data. Also available on npmjs. Pretty large payloads can be created with it.
... and also the associated metrology data compression benchmark. Worth noting that while
Status overview on HxGN large data contributions:
Open PRs
Fruition depends on
This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
We should probably keep this open.
This issue has been automatically marked as stale because it has had no activity for 180 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!
Roadmap for Large Ops and Summaries
Milestones
M0: Design and POC
While we generally understand the path for M1 (efficient use of storage and network), many Fluid customers are highly sensitive to application startup and document load times.
The outcome of this deliverable is to show that the additional code and computation required for compression yield a neutral to positive impact across the spectrum of Fluid customers, and to select the specific algorithms we will use in M1 and M1.5.
An additional desirable outcome is that the synthetic benchmarks developed for measuring compression can be used for additional performance tuning with data that statistically resembles real-world scenarios.
Deliverables
Develop benchmark to evaluate existing serializers (e.g., MsgPackR)
Develop benchmark to evaluate existing compression algorithms (e.g., LZ4)
Develop benchmark to evaluate effects of holistic UUID compression for Fluid runtime
Analysis of summary structure and metadata
Analysis of op message structure and metadata
Proof of concept using PropertyTree DDS as tactical solution
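The serializer and compression benchmarks above could start as something like the following sketch. It assumes the msgpackr package (pack) and uses Node's built-in zlib codecs as stand-ins for LZ4, which would be swapped in for the real evaluation; a real benchmark would also measure encode/decode time:

```ts
import { pack } from "msgpackr"; // assumed dependency: msgpackr's pack(value) -> Buffer
import { gzipSync, brotliCompressSync } from "zlib"; // stand-ins for an LZ4 binding

// Compare encoded sizes of a representative payload across candidate formats.
function compareSizes(payload: unknown): void {
    const json = Buffer.from(JSON.stringify(payload));
    const msgpack = pack(payload);
    console.table({
        json: json.byteLength,
        "json+gzip": gzipSync(json).byteLength,
        msgpack: msgpack.byteLength,
        "msgpack+gzip": gzipSync(msgpack).byteLength,
        "msgpack+brotli": brotliCompressSync(msgpack).byteLength,
    });
}

compareSizes({ points: Array.from({ length: 1000 }, (_, i) => ({ x: i, y: Math.sin(i) })) });
```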
M1: Summaries make efficient use of network & storage (~10x)
This milestone tracks pragmatic work to improve the efficiency with which we use our current network and storage limits. Changes to the storage format require backwards compatibility with prior versions. Therefore, it is desirable to bundle changes to minimize the number of versions that the runtime must support.
Deliverables
Replace JSON serialization with a faster/compact binary alternative
Modify runtime to avoid base64 encoding with binary payloads
Apply alternative serialization and/or general compression to summaries
Implement ID compression
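As a rough illustration of the ID compression deliverable (hypothetical code, not the Fluid runtime's actual ID compressor): long UUIDs are interned to small session-local integers, with a lookup table to translate back.

```ts
// Hypothetical ID compressor: maps verbose UUID strings to small integers so
// that repeated references in ops and summaries cost a few bytes instead of 36.
class SimpleIdCompressor {
    private readonly uuidToShort = new Map<string, number>();
    private readonly shortToUuid: string[] = [];

    public compress(uuid: string): number {
        let short = this.uuidToShort.get(uuid);
        if (short === undefined) {
            short = this.shortToUuid.length;
            this.uuidToShort.set(uuid, short);
            this.shortToUuid.push(uuid);
        }
        return short;
    }

    public decompress(short: number): string {
        const uuid = this.shortToUuid[short];
        if (uuid === undefined) {
            throw new Error(`Unknown short id: ${short}`);
        }
        return uuid;
    }
}
```

The genuinely hard part of the deliverable is reconciling IDs allocated concurrently by different clients, which this local sketch ignores.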
M1.5: Ops make efficient use of network & storage (~0.5x..15x)
This milestone tracks pragmatic work to improve the efficiency with which we use our current network and storage limits. Changes to the storage format require backwards compatibility with prior versions. Therefore, it is desirable to bundle changes to minimize the number of versions that the runtime must support.
Deliverables
Replace JSON serialization with a faster/compact binary alternative
Modify runtime to support binary payloads in op messages
Apply alternative serialization and/or general compression to payload
Remove duplicate serialization of op message content
Process msg without deserializing or decompressing payload
Implement ID compression
Remove redundant metadata within batched ops
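A rough sketch of the batching-related deliverables above, with hypothetical message shapes (the real op envelope differs): metadata that repeats across a batch is hoisted onto the batch, and each op carries only its contents.

```ts
// Hypothetical op envelope: per-op metadata that is identical across a batch
// (clientId, referenceSequenceNumber, ...) is stored once on the batch.
interface Op {
    clientId: string;
    referenceSequenceNumber: number;
    contents: unknown;
}

interface CompactBatch {
    clientId: string;
    referenceSequenceNumber: number;
    contents: unknown[]; // one entry per op, shared metadata removed
}

function packBatch(ops: Op[]): CompactBatch {
    if (ops.length === 0) {
        throw new Error("Cannot pack an empty batch");
    }
    const [first] = ops; // assumes all ops in the batch share the same metadata
    return {
        clientId: first.clientId,
        referenceSequenceNumber: first.referenceSequenceNumber,
        contents: ops.map((op) => op.contents),
    };
}

function unpackBatch(batch: CompactBatch): Op[] {
    return batch.contents.map((contents) => ({
        clientId: batch.clientId,
        referenceSequenceNumber: batch.referenceSequenceNumber,
        contents,
    }));
}
```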
M2: Incremental Summaries (#11572)
This milestone tracks work items that reduce the size of summary uploads by reusing portions of previous summaries for unchanged data. We plan to tackle this problem at two levels. The first is through automatic identification and de-duplication on the client using techniques like Content-Defined Chunking. The second is by exposing the original blob handles to the DDS so that a sophisticated DDS (SharedTree) can improve reuse with less cost by manually tracking and reusing chunks from previous summaries.
Deliverables
Automatic chunk reuse:
Develop benchmark to evaluate Content-Defined Chunking (CDC) algorithms
Modify runtime to apply CDC to summaries prior to compression
Implement client-side blob deduplication
Explicit chunk reuse (SharedTree)
Expose summary blob handles to DDS
Track and reuse summary blobs
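To illustrate the automatic chunk-reuse path above, a minimal Content-Defined Chunking sketch; the rolling-sum boundary test is a simplistic stand-in for a production rolling hash (Rabin, Gear), and all names are illustrative:

```ts
import { createHash } from "crypto";

// Simplistic CDC: declare a chunk boundary whenever a rolling sum over the
// last `window` bytes satisfies a mask test, so boundaries depend on content
// rather than absolute offsets (an insertion does not shift every later chunk).
function chunkBoundaries(data: Uint8Array, window = 48, mask = 0x7ff): number[] {
    const boundaries: number[] = [];
    let rolling = 0;
    for (let i = 0; i < data.length; i++) {
        rolling += data[i];
        if (i >= window) {
            rolling -= data[i - window];
        }
        if ((rolling & mask) === mask || i === data.length - 1) {
            boundaries.push(i + 1); // exclusive end offset of the chunk
        }
    }
    return boundaries;
}

// Deduplication: hash each chunk; only chunks whose hash has not been
// uploaded before need to be included in the next summary.
function chunkHashes(data: Uint8Array): string[] {
    let start = 0;
    return chunkBoundaries(data).map((end) => {
        const digest = createHash("sha256").update(data.subarray(start, end)).digest("hex");
        start = end;
        return digest;
    });
}
```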
M3: Chunked Ops
This milestone tracks the work items necessary to process operations that exceed service limits (Socket.IO, Kafka, etc.). This work requires some initial analysis to find the right balance between allowing concurrent edits and overall system complexity.
Complexity and performance analysis of three alternatives:
Use of a side-channel for large ops
Synthetically assigning seq#s at the client
Applying multiple operations with the seq# of the final chunk
Modify runtime to apply chunking strategy to large op payloads
Detection of large ops
Partitioning of large batches into multiple messages
Re-assembly of multiple messages into a single batch
Orchestration of incoming/outgoing messages to preserve ordering guarantees
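A hypothetical sketch of the partitioning and re-assembly items above (the message shape is invented for illustration; ordering guarantees and the seq# alternatives listed earlier are the genuinely hard parts):

```ts
// Hypothetical chunk envelope: a large serialized op is split into pieces
// small enough for service limits and re-assembled once all pieces arrive.
interface OpChunk {
    batchId: string; // correlates chunks of the same original op
    index: number;   // 0-based position of this chunk
    total: number;   // total number of chunks
    payload: string; // slice of the serialized op
}

function splitOp(batchId: string, serializedOp: string, maxChunkSize: number): OpChunk[] {
    const total = Math.max(1, Math.ceil(serializedOp.length / maxChunkSize));
    return Array.from({ length: total }, (_, index) => ({
        batchId,
        index,
        total,
        payload: serializedOp.slice(index * maxChunkSize, (index + 1) * maxChunkSize),
    }));
}

// Reassembly: buffer chunks per batchId; return the full op once complete.
const pending = new Map<string, (string | undefined)[]>();
function onChunk(chunk: OpChunk): string | undefined {
    const parts =
        pending.get(chunk.batchId) ?? new Array<string | undefined>(chunk.total).fill(undefined);
    parts[chunk.index] = chunk.payload;
    pending.set(chunk.batchId, parts);
    if (parts.every((p) => p !== undefined)) {
        pending.delete(chunk.batchId);
        return parts.join("");
    }
    return undefined;
}
```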
Future: Distributed Summaries (SharedTree)
Future: Cloud Summaries (SharedTree)