RFD 153 Incremental metadata expansion for Manta buckets #117
Hi, I've re-read RFD 149 and then this RFD and have several questions around the sharding design and flow:
Overall it'll be helpful to have a description of how a bucket object is processed after it hits muskie.
Thanks for reading those over and for the questions, @askfongjojo.
I think at this point it'll probably end up being a different set of electric-morays that function similarly, but that use the single source of truth hash ring server. It's possible we could change the existing electric-moray to work with multiple hash rings, but I think it is probably best to use separate instances. The input to the hashing function will include the account owner information, the bucket name, and the object name so that an object in a particular bucket for a particular account always hashes to the same location.
Definitely agree. There will be separate hash rings. Ideally the deployment of buckets will mostly leave the existing Manta deployment unperturbed, and the new expansion behavior is different enough that I don't want to attempt to blend the two hash rings together.
That's great to know. I think we're close to having enough pieces sorted out to provide this now. That feels a bit beyond the scope of this RFD so it maybe deserves another document, but I would like to put that together very soon.
Thanks for writing this up! It's clearly well thought out. I think it's good that we're considering the architecture from the point of view of adding shards without downtime, but I'd strongly recommend separating the implementation into two phases: one which requires (hopefully minimal) downtime, followed by one which supports expansion without downtime.

Today, in practice, we take about 30 minutes of read downtime per shard, but there's good reason to think that even with what we've got today, we could shrink this to much less time. I'm imagining a process based on what you describe under "Add a new pnode", except that we roll out two phases of ring updates. In step 2, we distribute a ring update that marks a vnode read-only (without moving it). In step 6, since it's read-only, we can just restore the pg_dump directly. Afterwards, we distribute a ring update that marks the vnode writable and on a different pnode. I think the total read downtime for the shard would be dominated by the pg_dump and pg_restore. There's also the advantage that if anything goes wrong at any point (e.g., a failure or partition of some electric-morays or the orchestration process), it's trivial to roll everything back: we just drop the new data from the new pnode and reactivate the original ring version.

I imagine phase 2 could implement the process you describe. There are a lot more pieces to implement here, they seem quite tricky (I'm still unclear on how conflict resolution works if writes have been issued to the new pnode before dependent data has been moved to it), and there are a bunch more failure modes to consider (including how we might roll back if something went wrong). I still think this is a good approach -- I'd just suggest doing these more complex pieces after we've delivered and gained operational confidence with the more basic pieces. What do you think?

I think we should more strongly emphasize the automation.
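The two-phase ring update outlined above can be sketched as immutable ring versions, where rolling back is simply reactivating an earlier version. The ring representation and function names below are invented for illustration; this is not the actual resharder or electric-moray code.

```javascript
// Sketch of the two-phase ring update.  Each update produces a new
// ring version; the old version is kept so rollback is trivial.
function markReadOnly(ring, vnode) {
    // Phase 1: distribute this version; writes to the vnode fail fast.
    const next = JSON.parse(JSON.stringify(ring)); // cheap deep copy
    next.version += 1;
    next.vnodes[vnode].readOnly = true;
    return next;
}

function remapVnode(ring, vnode, newPnode) {
    // Phase 2 happens between the ring updates: pg_dump on the old
    // pnode, pg_restore on the new one -- safe because no writes can
    // land on the vnode in the meantime.
    // Phase 3: distribute this version; reads and writes resume
    // against the new pnode.
    const next = JSON.parse(JSON.stringify(ring));
    next.version += 1;
    next.vnodes[vnode].pnode = newPnode;
    next.vnodes[vnode].readOnly = false;
    return next;
}
// Rollback at any point: drop the copied data from the new pnode and
// reactivate the ring version from before phase 1.
```

The read downtime for the affected vnode is then bounded by the dump/restore window, and every intermediate state has a well-defined predecessor to fall back to.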
We've discussed a number of principles (much more broadly than this RFD) that we haven't written down, so I figured I'd take this opportunity to write them down. You've implicitly covered a lot of these already.
Like I said, you've got most of this implicitly in the RFD, but I'd suggest eliminating the suggestion of manual execution and maybe emphasizing the autonomy of the automation. (I'll plan to write these down somewhere more useful than this issue comment, but I wanted to include them here so readers are on the same page about what we mean by automation here.)

While not spelled out, the current resharder is built around these principles, and I think we may want to leverage a lot of it (specifically, the state machine execution engine that it defines, along with pause/pause-at-step/resume, status reporting, etc.). (It's the ideas that I think are important rather than the specific code.) I think this approach has been critical for our ability to expand to hundreds of shards. There have been quite a number of unexpected cases where the system came to rest, whereas previous types of automation we have built would often barrel on and create cascading failures.

Anyway, thanks again for the great write-up!
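The state-machine style of automation described above (explicit steps, pause-at-step, resume, status reporting) might look roughly like the following minimal sketch. It borrows the ideas, not the resharder's actual engine; the `Plan` type and step names are hypothetical.

```javascript
// Minimal sketch of a resumable step-runner.  Progress is tracked by a
// cursor so a paused (or crashed-and-restarted) run picks up exactly
// where it left off rather than barrelling on.
function Plan(steps) {
    this.steps = steps;   // ordered list of { name, run } objects
    this.cursor = 0;      // index of the next step to execute
    this.pauseAt = null;  // step name at which to stop, if set
}

Plan.prototype.status = function () {
    return {
        completed: this.steps.slice(0, this.cursor).map((s) => s.name),
        next: this.cursor < this.steps.length ?
            this.steps[this.cursor].name : null
    };
};

Plan.prototype.resume = function (state) {
    while (this.cursor < this.steps.length) {
        const step = this.steps[this.cursor];
        if (step.name === this.pauseAt) {
            return ('paused');
        }
        step.run(state);
        this.cursor++;
    }
    return ('done');
};
```

An operator-facing tool built on this can always answer "what has run, what is next", stop safely before a risky step, and hand control to a human when something unexpected happens instead of cascading.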
I would argue that we already have a consensus mechanism as part of the
So, I think using LevelDB in the way that we do here is unfortunate -- it's not
I don't agree that coordinated distribution of the hash ring archive, as we do
Electric Moray presently collects the current version of the hash ring at
As with so much Manta, I think the actual specific lines of code we've written
I think the onus of coordination has to be placed on the operation that isn't
Though it's not an HTTP API, we do have an API for this today. The new hash
I fully agree that this should be an HTTP API, rather than requiring the resharder to
We actually have this today as well:
The stamp file includes the zone in which it was created, and the time at which
The update process downloads the current ring version (a tar file, stored in
The status API in Electric Moray today could definitely be extended to expose
Today, we provide only a simple "this pnode is read only" intermediate state,
To summarise, a lot of this RFD sounds extremely promising, it's just that I
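One use of stamp-like metadata mentioned above is confirming that every electric-moray instance is running the same ring version before proceeding. The sketch below assumes a `{ zone, created, version }` shape for the stamp; the actual stamp format is not specified here, so these fields are illustrative only.

```javascript
// Hypothetical check that all electric-moray instances report the same
// ring version.  Each entry would come from an instance's status API;
// the field names are assumptions, not Manta's real stamp format.
function sameRingEverywhere(stamps) {
    // stamps: array of { zone, created, version } objects.
    if (stamps.length === 0) {
        return (true);
    }
    return (stamps.every((s) => s.version === stamps[0].version));
}
```

An orchestrator could poll each instance's status endpoint and refuse to advance a ring rollout until this predicate holds.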
Thanks for the comments, Dave! Sorry for the delayed response.
I think a phased approach should be fine. As you say, the correctness and
This is good to hear. I also feel like it is a very important part of this. I'll
I can try to explain this a bit better or perhaps draw a diagram that
Thanks for the feedback, Josh! Again, sorry for the delay in responding.
I am not extremely familiar with the details of how the resharder handles
I don't think anything is unsound with the current process and I don't want to
One of the benefits I see of adding a new service to persist and manage the
It's also an excellent opportunity to hide the internals of consistent
As far as just the effort to write this new service, I really just envision it as
The point about electric-moray currently being able to continue operation in the
Now we could do a hybrid approach whereby we have a ring service, but also
Yet another way is to stay with something more similar to the current approach
There seem to be multiple paths to achieve the end goals. I really do like the
I wouldn't say that the onus of coordination would be shifted to the
Thanks for highlighting a lot of the existing features that might be used or
This is for discussion of RFD 153 Incremental metadata expansion for Manta buckets.