RFD 174 Manta storage efficiency discussion #142
Thank you for writing up this RFD. I just want to share what I know about some of the open questions about mako service discovery and TTL:
Probably not. The storage endpoint resolution uses nameservice. This is what a storage record looks like in zk within the nameservice zone:
In a failover situation, we can update the record to point to the new IP address of the passive server.
The TTL defaults to 1800 seconds and is written to the zk record as shown above. I'd defer to others who understand this area better on how to force muskie to expire the record and do a lookup again.
We can probably model this after the metadata failover mechanism. It also involves updating the corresponding zk record in nameservice. |
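To illustrate the TTL concern: a resolver-side cache keeps serving the old address until the TTL expires, so with an 1800 second TTL a client could keep using the failed server's IP for up to 30 minutes after the zk record is updated. A minimal sketch, assuming a generic TTL cache; the `TtlCache` class, the injected `lookup` and `now` callables, and any names or addresses passed to it are hypothetical and are not muskie's actual resolver code:

```python
from typing import Callable, Dict, Tuple


class TtlCache:
    """Toy resolver cache: answers are reused until their TTL expires."""

    def __init__(self,
                 lookup: Callable[[str], Tuple[str, float]],
                 now: Callable[[], float]) -> None:
        self._lookup = lookup      # returns (address, ttl_seconds) for a name
        self._now = now            # injected clock, so the sketch is testable
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        t = self._now()
        entry = self._entries.get(name)
        if entry is not None and t < entry[1]:
            # Still inside the TTL window: the cached (possibly stale) address wins.
            return entry[0]
        addr, ttl = self._lookup(name)
        self._entries[name] = (addr, t + ttl)
        return addr
```

In this model the worst-case staleness after a failover equals the TTL, which is why the TTL value and the ability to force a re-lookup both matter.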
This is 5.5% overhead (with 36 disks), or worse for systems with fewer disks. Perhaps it would be better to partition the disks and use a small amount (less than 5 GiB) from each and construct a pool across them. I don't understand the justification for setting WCE=disabled when not in whole disk mode, so that would need to be explored. It seems as though WCE=enabled is safe if the consumers of each partition issue flushes at the times critical to each. If it weren't for needing a place for dump files, I'd be inclined to persist networking and iscsi configuration (both of which should be quite static) as boot modules and give all local storage to iscsi LUNs. |
Can you clarify what layout is problematic? The diagram makes it look like there will be 20 shrimp, each one mapping each disk to one stripe of a 17+3 raidz3 pool. If the shrimp have N data disks, I'd expect that there are N pools in a matching number of active servers. |
There are a number of things to consider with this approach, chief among them are ZFS on ZFS issues. Previous exploration of an architecture that would layer ZFS on ZFS with iscsi in the mix turned up a number of issues.
|
Could you not slice up SSDs in a couple boxes and mirror slogs across them? |
It seems rather expensive to have 2N servers, with half of them as standbys. It would seem better to have fewer standby servers with the ability to float iscsi initiators around the pool of servers. Such an architecture would probably need to be able to cope with more failures than expected by doubling up pools on servers, which has implications that may or may not be hard to deal with. |
Routing tables shouldn't need to be updated, just ARP caches. I think that when an address is plumbed, a gratuitous ARP is broadcast. Presuming the systems that need to know about the change accept gratuitous ARPs and are not so busy that it gets dropped, the delay should be quite small. By the time the failover happens, TCP sessions may have backed off quite a bit. This could cause an extended delay (10s of seconds or worse?) because the healthy end of the connection will not know that it needs to reestablish the connection until it sends a packet that is received by the standby IP stack and that stack sends a RST because of the unknown connection. This problem exists in pretty much any failover scenario, regardless of IP failover strategy. |
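The backoff effect can be sketched with a toy model. This assumes a retransmission timer that starts at `initial_rto` and doubles after every unanswered retransmit up to `max_rto`; both values are hypothetical defaults for illustration and real stacks differ:

```python
def recovery_delay(outage: float,
                   initial_rto: float = 1.0,
                   max_rto: float = 60.0) -> float:
    """Seconds between the end of the outage and the sender's next retransmit.

    Time is measured from the first lost packet. Until that next retransmit
    reaches the standby, the healthy peer cannot receive the RST that tells
    it to reconnect.
    """
    t = 0.0
    rto = initial_rto
    while True:
        t += rto                      # next retransmission fires at time t
        if t >= outage:
            return t - outage
        rto = min(rto * 2, max_rto)   # exponential backoff, capped


# With these assumed timers, retransmits fire at 1, 3, 7, 15, 31, 63 seconds.
print(recovery_delay(16))   # 15.0
print(recovery_delay(32))   # 31.0
```

Under these assumptions the extra wait can be nearly as long as the outage itself, which is consistent with the "10s of seconds" estimate.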
Earlier you talked about unique networks (and maybe switches) for iscsi traffic. Failure of a heartbeat network does not necessarily imply iscsi is dead. Failover is much easier when failure of one or more nodes leaves a machine capable of taking over that is part of a majority (quorum). Pairs of servers, each failing over to the other in the pair may be harder to get right than having all servers trying to form a quorum or having an odd number of management nodes that are responsible for maintaining a quorum and orchestrating management operations from a member of that quorum. My gut says we should not be writing clustering software from scratch. |
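The quorum argument comes down to a strict-majority test. A minimal sketch, with a hypothetical `has_quorum` helper that is not taken from any existing clustering software:

```python
def has_quorum(reachable: int, total: int) -> bool:
    """True when `reachable` members (including self) are a strict majority."""
    if total <= 0 or not 0 <= reachable <= total:
        raise ValueError("need total > 0 and 0 <= reachable <= total")
    return 2 * reachable > total


# A partitioned pair: each side sees only itself, so neither has a majority
# and neither can tell a dead peer from a dead link.
print(has_quorum(1, 2))   # False

# Three members with one lost: the remaining two can still agree on a takeover.
print(has_quorum(2, 3))   # True
```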
This is an open question. I think @bahamat mentioned that we have had experiences where small TTLs for storage instances caused pain in zookeeper. I would prefer this approach over updating ARP caches on switches due to the mysterious stale ARP cache bugs we've seen in production in the last few years. As a note to my future self, the registrar README has a section describing how TTLs behave. Relying on TTLs in the way described in this RFD goes against one of the assumptions outlined in the README ('However, the TTLs on the "A" resolutions can be much longer, because it's almost unheard of for the IP address to change for a specific Triton or Manta zone'). From the RFD in regards to dense shrimp (4u60+ boxes vs the 4u36 we use today):
Could we allow one shrimp to be a target for multiple storage groups? For example, instead of creating 20 volumes (whether these are zvols or bare disks over iSCSI), create 60 volumes with 20 volumes each consumed by one storage group. This would mean a shrimp with 60 or more disks is a member of three storage groups, possibly with three different local raidz1 pools and 20 zvols on each pool. Maybe this is a 'walk before you run' problem that needs to be solved though. A relatively small open question is how we handle configuration for the mako service. I think we would need to duplicate the SAPI instance metadata for each of the active/passive pairs if we intend for the mako zones to look the same to the outside world. Either that or we would need to assign the same SAPI instance uuid to both of the active/passive instances so they look up the same metadata in SAPI.
Do we need to worry about putting a pool on top of zvols when we think about object deletion activity? For example, say we write 100G of data to the upper raidz3 pool. Now we delete 50G of data from the upper raidz3 pool. I imagine the capacity used by the sum of the raidz1 pools remains at 100G (5G each for the 20 raidz1 pools) because delete operations aren't immediately sent down to the underlying zvols, but the capacity used by the raidz3 pool has become 50G. If we keep writing and deleting data, would we run into a situation where the raidz1 pools fill up before the upper raidz3 pool has reached capacity? TRIM can prevent us from hitting this problem, right? I expect we might see a similar problem today with KVM/bhyve machines running on top of the zvols we provide as data disks.
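The accounting question can be written down as a toy model. It assumes the lower pools only learn about frees when TRIM/UNMAP is passed down, and it ignores parity overhead and block reuse; the `LayeredPool` class and `run` helper are hypothetical and do not reflect how ZFS tracks space internally:

```python
from dataclasses import dataclass


@dataclass
class LayeredPool:
    """Toy model of an upper pool built on zvols from lower pools."""
    lower_count: int = 20      # number of lower raidz1 pools (one zvol each)
    trim: bool = False         # whether frees are passed down to the zvols
    upper_used: float = 0.0    # GiB referenced by the upper pool
    lower_used: float = 0.0    # GiB allocated across all lower pools

    def write(self, gib: float) -> None:
        # Worst case: new writes land on blocks the zvols have never seen.
        self.upper_used += gib
        self.lower_used += gib

    def delete(self, gib: float) -> None:
        self.upper_used -= gib
        if self.trim:
            # With TRIM/UNMAP the zvols can release the freed blocks.
            self.lower_used -= gib

    def lower_used_per_pool(self) -> float:
        return self.lower_used / self.lower_count


def run(trim: bool) -> LayeredPool:
    """Replay the example: write 100G to the upper pool, then delete 50G."""
    pool = LayeredPool(trim=trim)
    pool.write(100)
    pool.delete(50)
    return pool
```

With `trim=False` the model ends at 50G used in the upper pool and 100G (5G per lower pool) below it; with `trim=True` both layers end at 50G.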
Yeah, this does seem expensive. Assuming that each storage group has an active/passive pair then we need
One of the tradeoffs that you outline here is that we move from a very flexible design where storage nodes are independent of one another in almost every way to a design where storage nodes are very interconnected and may have specific network requirements. It would be good to hear from the operations team(s) to know if this would make pre-flight checks, DC expansions, etc. prohibitively difficult in the long term. If there are some anticipated problems maybe we can work out a solution or design change before we face the problem in reality. |
Yes, I think that is correct. The thing that is problematic is having an iscsi target with 36 disks when we can only make use of 20 data disks and the 2 system disks since there are 14 disks we cannot use. |
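The arithmetic, spelled out as a small helper (the function name and parameters are hypothetical; the 20 data disks and 2 system disks are the figures from the comment above):

```python
def unused_disks(total: int, data: int = 20, system: int = 2) -> int:
    """Disks in an iSCSI target box that this layout cannot use."""
    unused = total - data - system
    if unused < 0:
        raise ValueError("box has fewer disks than the layout needs")
    return unused


print(unused_disks(36))                  # 14
print(round(unused_disks(36) / 36, 3))   # 0.389
```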
Yes, that might be an option if the slog over iscsi performance is acceptable, along with the added cost. I'll make a note of this as an alternative to investigate. |
Having distinct pairs of active/passive machines is much easier to install, manage and reason about when it comes to failures. It is simple and clear to understand what happens when an active machine dies. Once we lose that pairing, there has to be some other HA thing that can manage the failover, configure the passive machine appropriately, etc. I think this should be considered under a complete storage unit cost analysis, although the solution you're proposing will add new failure modes and cost in other ways, along with additional time before the solution could be ready. |
Sorry I wasn't clear. There is no heartbeat network. My assumption is that the network communication within the storage group is all on the same vlan or switch which is isolated and separate from the network connections to the rest of Manta. Thus, when a primary cannot ack heartbeats because its network is dead, it also cannot talk to its iscsi targets. Likewise, when a passive machine cannot see heartbeat acks because its network is dead, it also cannot successfully import the zpool. I'll try to make that clearer in the RFD. I do not understand your suggestion that we should use a more complex quorum solution vs active/passive. I'm not sure what that would be or how it would be better. I also do not know what "clustering software" you're referencing. Is that the whole concept presented in the RFD, the heartbeater, how the mako IP is flipped, something else? |
I think what @mgerdts meant is that the 1:1 active/passive model may have the drawback of unnecessary churn when there is a single connectivity issue between them. Having more members to form a quorum can reduce the likelihood of that, but it's a more complex thing to do (cluster management). I think it's going to be more expensive too (more than 1 passive server per cluster?). |
OK, I hope I addressed that by clarifying that a network issue for the heartbeat also implies a network issue for the iscsi traffic, so it depends on exactly what failed. I don't see the active/passive constantly flipping as a significant risk. Maybe I just can't think of a failure mode that could cause that. A bigger issue is the failure of the entire network within the storage group since that takes out 20 makos at once. I have that listed under the failure testing section but I will highlight it more in the layout discussion. |
Just adding a note for posterity that I had a conversation with Mike on this and I now understand his point and my confusion. I'll be updating the RFD to explain things better, along the lines of what Mike is describing. |
Closed by mistake. |
As you mention if a shrimp goes down all zpools will be resilvering. Do we need to be worried about all of this happening at once? Worth explicitly calling out in testing? |
How would we do rolling shrimp maintenance? Would this involve 20 "update; wait for resilver across the storage group;" cycles? |
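The cycle being asked about could be sketched like this. The `update` and `wait_for_resilver` callables are hypothetical hooks standing in for whatever the real tooling would do:

```python
from typing import Callable, Iterable, List


def rolling_maintenance(shrimps: Iterable[str],
                        update: Callable[[str], None],
                        wait_for_resilver: Callable[[], None]) -> List[str]:
    """Update one shrimp at a time, letting the storage group heal in between."""
    done: List[str] = []
    for shrimp in shrimps:
        update(shrimp)         # shrimp goes away; every pool loses one vdev member
        wait_for_resilver()    # block until all pools in the group are healthy again
        done.append(shrimp)
    return done
```

In this model the total wall-clock time is roughly the number of shrimp multiplied by the sum of the update time and the resilver time, so resilver duration dominates if it is long.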
With ZFS sitting on top of iSCSI we lose FMA in both directions (faults, blinkenlights etc). Would we need to consider a transport between the shrimps and the makos? |
Heartbeat: other systems have kept the heart-beat on disk on the basis that it's a closer reflection of the health of the system (it's not much use responding to ping if your storage stack is piled up on some iscsi CV). Worth consideration? |
MMP: admittedly I'm going off old slides, so I have no idea of the current status, but it sounds like this essentially snoops on the uberblock and such to decide on forced takeovers. And more importantly, there's no disk-level IO fencing on takeover. Given how catastrophic a dual import would now be, I'm wondering if it's worth considering a sideband way to properly fence off the storage. |
(To expand on that last comment a little: it's entirely feasible that a mako's iscsi stack gets completely stuck for multiple minutes, thus appearing to be dormant. Post takeover, we'd want something for the loser to lose all access to storage, in case it decides to unglue itself and start doing I/O again. AIUI from the MMP slides, there is no active post-import multihost detection.) |
Mako maintenance: just for clarification, we are stating that when we need to maintain makos, users can expect an outage of 1 minute (or whatever we end up with) ? Presumably this is predominantly import time. Do we have numbers for what this looks like currently? |
There seem to be several options here.
FWIW, Veritas cluster uses a mechanism where a kernel module heartbeats on a dedicated (LLT - Low Latency Transport) network with a protocol called gab. If a node stops hearing from other members for a set number of seconds (30 by default, I think), the We could build a similar mechanism using etcd leases and watches. Surely there's a way to do something quite similar with zookeeper.
|
Is there a typo in the paragraph starting with, "An additional point of comparison"? I think you meant to say 34, but 54 is in that paragraph. |
It's somewhat complicated by the fact that there are basically 3 separate illumos entities performing some form of FMA for disks. There's ZFS FMA, which specifically detects faults related to ZFS constructs (vdevs, pools). Then there's the scsi target driver, which produces FMA telemetry as part of handling SCSI transactions. And finally there's an fmd module that periodically polls the SMART status on disks to check for various conditions that indicate a bad disk (predicted failure asserted or overtemp or drive POST failure). In fact it's not uncommon for a drive failure to result in two or three FMA faults as the different entities each independently diagnose the failure from their vantage point. On the makos, the latter two cases should continue to work as before. I honestly don't know how well the ZFS FMA stuff would work on the initiator hosts in this scenario. What is also different from the current situation is that a drive failure in this scenario will result in fault events spread across multiple machines - potentially ZFS and SCSI diagnosed faults on the initiator hosts and SMART and SCSI faults diagnosed on the target host. So definitely some implications for OPS and for our monitoring software. |
Apologies if I don't have the Manta fundamentals down properly. If anything here is invalidated by me not understanding Manta, I withdraw. (Maybe pointers to how manta works?) So after pass 1, I think I can't say MUCH more beyond what's been said above, except for two things. First, I think you need two more pictures. The first would show the current situation (with its 38% efficiency) showing two datacenters, and each shrimp (and its matching mako server) having an evil twin in another DC^H^HAZ. And the second would show the new world order of having a shrimp, showing that it services N virtual mako servers, and how the disks are assigned to the different mako servers. The second thing: An in-one-place breakdown of resilience and the costs incurred. If I were to guess: Old way: Survives disk failures per raidz2. Survives shrimp failure by having whole other copy in the other AZ. High performing, simple to diagnose, and has fewer failure modes (good) but fewer failure points (bad). New way: Survives disk failures AND shrimp failures per raidz3. Has no-other-AZ backup, but could be added. Adds iSCSI failure modes to existing ones, but gains more points of failure that need to take down the virtualized mako server. I'd be very curious to see the probabilities of failures (as a function of what things CAN fail and their cost) in both ways. |
Yes, we do need to be sure this is fine. I'll add it to the testing section. |
I think that would be right. I'll add this to the maintenance section. |
This is a good point. I'll call it out as an explicit area for complete testing and potential future project work. I'm also wondering if any of the Nexenta iscsi work might have improved this situation? I need to explore that work fully. |
This is exactly how mmp works so we do have that heartbeat in the zpool already. It seems simpler to use a network connection heartbeat to trigger the attempt to import on a flip instead of just trying to import on the passive every few seconds. |
So far I don't see this as necessary. mmp does the right thing here and suspends the zpool on the active machine if it hasn't been able to update the uberblock within the heartbeat window. I have tested this and it works as expected, allowing us to import on the old passive machine even though the old active still has the zpool imported. There are things we'll want to do to make this work seamlessly, but mmp makes sure there is no dual-writer problem. |
No, mmp detects this if the IO comes back and suspends the zpool. We will probably want to reboot the old active machine so it starts fresh as a passive mako. I'll add more details in the RFD to make this clear. |
We do not have numbers yet. The import time is longer, but on the order of 10 seconds (this is tunable). The zone boot and svc startup is also on the order of a few seconds. I am mostly worried about network propagation for the new mako IP or MAC address transition so muskie starts talking to the correct server. |
So far, I haven't seen any reason for this added complexity since mmp is doing the "right thing" but if we find situations which need more handling, we'll have to revisit this. |
The typo is actually in the previous paragraph which should say 17. I'll fix this. |
I'll work on a new section which details the various failure modes we have thought of so far and compares them to the old/new approach. |
This is probably fine if we tie the heartbeat into the storage stack somehow (i.e. we only beat if we're actively able to r/w storage in some manner). Otherwise we end up with the healthy-heartbeat, dead-I/O path situation. |
Re: MMP, I guess maybe it's bulletproof? I'm not sure quite how MMP could prevent corruption in this case - it has zero control over the recovered other node without STONITH or fencing - but perhaps the nature of how zfs writes work means that this is in fact totally safe and we can rely on this never breaking in the future. (I think it has to be "never" as this is a massive-data-loss scenario, right?) |
Yes, I think we want to make sure the heartbeat responder is periodically doing zfs status checks too so if the zpool is suspended, we stop acking the heartbeat. I'll make this explicit. |
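A minimal sketch of a responder gated on pool health, assuming the pool state arrives through an injected callable. The `HeartbeatResponder` class and the choice of which states count as healthy are assumptions for illustration, not the RFD's design:

```python
from typing import Callable

# Pool states in which the responder keeps acking (an assumption for this sketch).
HEALTHY_STATES = {"ONLINE", "DEGRADED"}


class HeartbeatResponder:
    """Acks heartbeats only while the storage stack reports a usable pool."""

    def __init__(self, get_pool_state: Callable[[], str]) -> None:
        # Hypothetical hook that returns the pool state as a string.
        self._get_pool_state = get_pool_state

    def handle_heartbeat(self) -> bool:
        """Return True to ack the heartbeat, False to stay silent."""
        try:
            state = self._get_pool_state()
        except Exception:
            # If the status check itself fails, treat storage as unhealthy.
            return False
        return state in HEALTHY_STATES
```

A real responder would also need to bound how long the status check may take, since a stuck iscsi stack could cause the check itself to hang rather than fail.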
I'd hesitate to call any piece of SW bulletproof, but so far it seems to do exactly what it should. It's actually not that complex. If the txg sync can't write within the mmp timeout, it suspends the zpool in memory and the pool is no longer able to accept any write activity. This zpool state is now valid for a zpool forced import on another machine. If the txg's are being written when multihost is enabled, forced import is not allowed on another machine. Of course, if something goes around zfs and writes to the raw iscsi device, mmp cannot help us. |
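The behaviour described above can be captured in a toy state model. This is only a sketch of the description in this comment, not the logic in mmp.c; the `MultihostPool` class, its `mmp_timeout` parameter, and the method names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class MultihostPool:
    """Toy model of a multihost-enabled pool as seen from the active host."""
    mmp_timeout: float        # seconds allowed between successful writes
    last_write: float = 0.0   # time of the last successful write
    suspended: bool = False

    def sync(self, now: float, write_ok: bool) -> bool:
        """Attempt a txg sync at time `now`; return True if data was written."""
        if self.suspended:
            return False      # a suspended pool accepts no write activity
        if now - self.last_write > self.mmp_timeout:
            # Too long since the last successful write: suspend, even if
            # the I/O path has just come back.
            self.suspended = True
            return False
        if write_ok:
            self.last_write = now
            return True
        return False

    def forced_import_allowed(self, now: float) -> bool:
        """Another host may force-import only after a full quiet timeout."""
        return now - self.last_write > self.mmp_timeout
```

In this model a forced import becomes possible only after the same quiet window that makes the old active host suspend on its next sync attempt, which is the property that rules out two writers. Writes that bypass zfs and go to the raw iscsi device are outside the model, as noted above.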
Per mmp.c, once enough time has passed without being able to update uberblocks,
With that, when MMP detects trouble the panic should happen quickly enough that a syncing txg will be quickly interrupted and writes are prevented before the failover machine takes over. |
Setting up to panic like this sounds like a good idea. We can't export the zpool since it is suspended, so the only recovery (as zpool status even tells us) is to reboot. One thing in the comment above that I am confused about is the resilvering one. I am not sure how zfs would be able to initiate resilver writes outside of a txg. As far as I know, the only two write paths are txg sync and zil commit, but maybe there is something I am missing. |
I know nothing about the resilver I/O path. There may be nothing special to worry about there at all. |
Thanks for RFDs, it's very interesting. . Manta compute . (no)SLOG . iscsi ? |
https://github.com/joyent/rfd/tree/master/rfd/0174