
Alternate strategies for outdated commitments #1838

Open · wants to merge 1 commit into master

Conversation

Member

@pm47 pm47 commented Jun 7, 2021

This PR adds strategies to handle two typical disaster scenarios: outdated commitments and unhandled exceptions.

The default strategies may be the best choice for smaller, loosely administered nodes, while the alternative strategies can avoid an unnecessary mass force-close (but are reserved for advanced users who closely monitor their node).

Strategies for outdated commitments:

  • request the counterparty to close the channel (default).
  • if the node was restarted less than 10 min ago, log an error message and stop the node

Strategies for unhandled exceptions:

  • local force close of the channel (default)
  • log an error message and stop the node

Default settings maintain the same behavior as before.
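
As an illustration only, here is a sketch of how these two strategy families could be modeled; the names below are hypothetical and not necessarily the identifiers or configuration values used in the actual change:

```scala
// Hypothetical sketch of the two strategy families described above.
sealed trait OutdatedCommitmentStrategy
object OutdatedCommitmentStrategy {
  case object RequestRemoteClose extends OutdatedCommitmentStrategy // default: ask the peer to publish their commitment
  case object StopIfRecentlyRestarted extends OutdatedCommitmentStrategy // log an error and stop if the node restarted < 10 min ago
}

sealed trait UnhandledExceptionStrategy
object UnhandledExceptionStrategy {
  case object LocalClose extends UnhandledExceptionStrategy // default: force-close the channel
  case object Stop extends UnhandledExceptionStrategy // log an error and stop the node
}
```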

Member

@t-bast t-bast left a comment


I don't think these two commits should be done in the same PR, I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.

As for the first commit, what do you do when the node stops and you do have your most up-to-date backup? If you restart it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...

@pm47
Member Author

pm47 commented Jun 7, 2021

I don't think these two commits should be done in the same PR

Agreed, I got a bit carried away and kept experimenting.

I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.

IIUC lightning/bolts#834 deals with protocol errors, but here I'm trying to address the case where we have an internal error: disk full, an OOM exception, that sort of thing.

As for the first commit, what do you do when the node stops and you do have your most up-to-date backup? If you restart it will stop again. So you'd need to change your config from "halt-if-just-restarted" to "always-request-remote-close" before restarting, otherwise a single faulty channel would DoS the whole node...

Exactly, it requires human intervention; that's why this mode is for more advanced users. But it's less susceptible to a blind mass force-close, which is what this PR aims to address. I feel like any other in-between solution would be a trade-off between convenience and security; that's how I ended up with this rather basic mechanism.

@t-bast
Member

t-bast commented Jun 7, 2021

Exactly, it requires human intervention, that's why this mode is for more advanced users.

So the flow if the node stops because of an outdated commitment would be:

  1. The node stops
  2. Investigate your backups
  3. If you had a DB issue, restart after fixing your DB
  4. Otherwise if you find nothing and think you really are late, change your config to always-request-remote-close and restart, and accept losing channels

Is that right?
I think it should be documented clearly, probably a new document in the docs folder.

I feel like any other solution in-between would be a trade-off between convenience and security, that's how I ended with this rather basic mechanism.

I agree, the other solutions I can think of are just proxies for almost the same behavior, so it's better to choose the simplest one.
I don't think it's an area where we'll want many different strategies though, so maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?
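
To illustrate, a minimal sketch of reading such a boolean with Typesafe Config, reusing the hypothetical OutdatedCommitmentStrategy type from the sketch above (the key name is the one suggested here, not necessarily the final one):

```scala
import com.typesafe.config.Config

// Hypothetical sketch: a single boolean selects one of the two behaviors discussed in this thread.
def outdatedCommitmentStrategy(config: Config): OutdatedCommitmentStrategy =
  if (config.getBoolean("stop-node-on-outdated-commitment-after-restart")) {
    OutdatedCommitmentStrategy.StopIfRecentlyRestarted
  } else {
    OutdatedCommitmentStrategy.RequestRemoteClose
  }
```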

@pm47
Member Author

pm47 commented Jun 8, 2021

I think it should be documented clearly, probably a new document in the docs folder.

Agree!

maybe the new configuration parameter should just be a boolean stop-node-on-outdated-commitment-after-restart = false or something similar?

It was a leftover from my previous attempt. I flip-flopped on this and thought keeping an explicit strategy name would be easier to understand... but it probably isn't.

@pm47 pm47 changed the title from "Add a strategy to deal with outdated commitments" to "Add a strategy to deal with outdated commitments and unhandled exceptions" on Sep 9, 2021
@pm47 pm47 marked this pull request as ready for review September 9, 2021 14:40
@pm47 pm47 requested a review from t-bast September 9, 2021 14:47
@pm47
Member Author

pm47 commented Sep 9, 2021

I don't think these two commits should be done in the same PR,

Agreed, I got a bit carried away and kept experimenting.

On second thought, I kept both changes in the same PR; I think it makes sense since they are related topics.

I'm not sure the second one makes a lot of sense. In the long term it makes sense to change how we deal with remote errors and not always broadcast our local-commit, but that's probably better addressed by working on lightning/bolts#834 and reviewing in which case we don't have a choice and really need to force-close.

I could add a third strategy: not force-closing even if our peer sends an error, like lnd does (or used to do?). But that wouldn't be spec-compliant, especially since we have warning messages now.

I think it should be documented clearly, probably a new document in the docs folder.

Done with 93d07c4.

@codecov-commenter

codecov-commenter commented Oct 14, 2021

Codecov Report

Merging #1838 (8fda63e) into master (765a0c5) will decrease coverage by 0.10%.
The diff coverage is 76.41%.

```
@@            Coverage Diff             @@
##           master    #1838      +/-   ##
==========================================
- Coverage   87.61%   87.51%   -0.11%     
==========================================
  Files         161      161              
  Lines       12575    12609      +34     
  Branches      552      532      -20     
==========================================
+ Hits        11018    11035      +17     
- Misses       1557     1574      +17     
```

| Impacted Files | Coverage Δ |
| --- | --- |
| ...air-core/src/main/scala/fr/acinq/eclair/Logs.scala | 81.35% <0.00%> (-1.41%) ⬇️ |
| ...c/main/scala/fr/acinq/eclair/channel/Channel.scala | 87.56% <67.39%> (-1.14%) ⬇️ |
| ...re/src/main/scala/fr/acinq/eclair/NodeParams.scala | 91.30% <71.42%> (-0.63%) ⬇️ |
| ...c/main/scala/fr/acinq/eclair/channel/Helpers.scala | 94.78% <83.72%> (-0.79%) ⬇️ |
| ...in/scala/fr/acinq/eclair/channel/Commitments.scala | 94.65% <100.00%> (+0.14%) ⬆️ |
| ...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 79.77% <0.00%> (-0.91%) ⬇️ |
| ...ore/src/main/scala/fr/acinq/eclair/Timestamp.scala | 100.00% <0.00%> (ø) |
| ...chain/bitcoind/rpc/BasicBitcoinJsonRPCClient.scala | 100.00% <0.00%> (ø) |
| ...main/scala/fr/acinq/eclair/io/PeerConnection.scala | 84.58% <0.00%> (+0.41%) ⬆️ |
| ... and 1 more | |

@pm47 pm47 changed the title from "Add a strategy to deal with outdated commitments and unhandled exceptions" to "Alternate strategies for outdated commitments and unhandled exceptions" on Oct 14, 2021
pm47 added a commit that referenced this pull request Oct 15, 2021
Discovered this while working on #1838.

In the following scenario, at reconnection:
- `localCommit.index = 7`
- `nextRemoteRevocationNumber = 6`

So when `localCommit.index == nextRemoteRevocationNumber + 1` we must retransmit the revocation.

```
          local              remote
            |                   |
            |  (no pending sig) |
 commit = 6 |                   | next rev = 6
            |<----- sig 7 ------|
 commit = 7 |                   |
            |-- rev 6 --> ?     |
            |                   |
            |  (disconnection)  |
            |                   |
```
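
A minimal, self-contained sketch of that retransmission decision, assuming a hypothetical helper (not eclair's actual code) that compares our local commitment index with the peer's next_revocation_number from channel_reestablish:

```scala
// Hypothetical sketch: decide whether we must retransmit our last revoke_and_ack.
def mustRetransmitRevocation(localCommitIndex: Long, nextRemoteRevocationNumber: Long): Boolean =
  if (localCommitIndex == nextRemoteRevocationNumber) {
    // the peer already received our last revocation: nothing to resend
    false
  } else if (localCommitIndex == nextRemoteRevocationNumber + 1) {
    // we received their sig for commitment N and revoked N-1, but our revocation was lost: resend it
    true
  } else {
    // any other gap is a commitment sync error, handled elsewhere
    throw new IllegalStateException("unexpected commitment numbers during sync")
  }
```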
@pm47
Member Author

pm47 commented Oct 15, 2021

So, while testing this PR I realized that in an "outdated commitment" scenario where we are on the up-to-date side, we always react by force-closing the channel immediately, not giving our peer a chance to fix their data and restart. On top of that, we consider this a commitment sync error, instead of clearly logging that our counterparty is using outdated data.

Addressing this turned out to be rabbit-holey: our sync code is quite complicated and a bit redundant, because we separately handle:

  • checking whether we are late
  • deciding what messages we need to retransmit

I also discovered a missing corner case when syncing in the SHUTDOWN state.

Obviously this code is very sensitive, because a bug there could trigger a mass force close, which is what we are trying to avoid with this PR after all. If we choose to update our sync code, we could first deploy a version that only compares legacy/new sync results for a few weeks on our prod nodes to get a level of confidence.
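
A hedged sketch of how those two concerns could be folded into a single sync-result type. `SyncResult.LocalLateProven` and its two fields appear later in this thread; the other variants and their exact shapes are assumptions for illustration:

```scala
// Hypothetical sketch: compute "are we late?" and "what do we retransmit?" in one place.
sealed trait SyncResult
object SyncResult {
  // everything is consistent; flags describe what must be retransmitted to the peer
  case class Success(retransmitRevocation: Boolean, retransmitCommitSig: Boolean) extends SyncResult
  // the peer proved that our local commitment is outdated (we are the late side)
  case class LocalLateProven(ourLocalCommitmentNumber: Long, theirRemoteCommitmentNumber: Long) extends SyncResult
  // the peer claims we are late, but did not provide a valid proof
  case class RemoteLying(ourLocalCommitmentNumber: Long, theirRemoteCommitmentNumber: Long) extends SyncResult
}
```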

@pm47 pm47 requested a review from t-bast October 15, 2021 14:25
pm47 added a commit that referenced this pull request Oct 18, 2021
* use a map for feature->channelType resolution

Instead of explicitly listing all the combinations of features, and risking inconsistencies, we may as well build the reverse map using the channel type objects.

* better and less spammy logs

We can switch the "funding tx already spent" router log from _warn_ to 
_debug_ because as soon as there are more than 10 of them, the peer's
announcements will be ignored and there will be a warning message about
that.

* timedOutHtlcs -> trimmedOrTimedOutHtlcs

Clarify that trimmed htlcs can be failed as soon as the commitment tx has been confirmed.

* proper logging of outgoing messages

It is also logical to make `Outgoing` a command of `Peer`. It would have been done this way from the start if `Peer` had been a typed actor.

* fixed mixed up comments

Member

@t-bast t-bast left a comment


The end result feels much cleaner than before: it's easier to simply read through the functions in Helpers.Syncing and check that all cases are properly handled.

The behavior looks good to me, I just added many comments to help clarify the code.

docs/Advanced.md (review comment, outdated, resolved)
eclair-core/src/main/resources/reference.conf (review comment, resolved)
```scala
// ... and ask bob to publish its current commitment

// bob sees that alice is late and stays in stand-by
alice2bob.forward(bob)
```
Member


I think the test makes more sense if we change the order of some operations and explicitly name each channel_reestablish:

```scala
val reestablishA = alice2bob.expectMsgType[ChannelReestablish]
val reestablishB = bob2alice.expectMsgType[ChannelReestablish]

// bob sees that alice is late and stays in stand-by
alice2bob.forward(bob, reestablishA)
bob2alice.expectNoMessage(100 millis)

// alice realizes she has an old state when receiving Bob's reestablish
bob2alice.forward(alice, reestablishB)
// alice asks bob to publish its current commitment
val error = alice2bob.expectMsgType[Error]
```

Member Author

@pm47 pm47 Oct 26, 2021


Done with 12a9184 + 8fda63e.

Member

@t-bast t-bast left a comment


Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)

```
@@ -875,6 +875,22 @@ object Commitments {
    (commitTx, htlcTxs)
  }

  /**
   * When reconnecting, we drop all outgoing unsigned changes.
```
Member


Suggested change:

```diff
- * When reconnecting, we drop all outgoing unsigned changes.
+ * When reconnecting, we drop all unsigned changes.
```
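
For context, a minimal sketch (with simplified, hypothetical data structures rather than eclair's actual types) of what dropping unsigned changes on reconnection amounts to:

```scala
// Hypothetical sketch: updates that were proposed but never covered by a commitment
// signature are forgotten on both sides at reconnection, and re-proposed from scratch
// if they are still relevant.
case class Changes[A](proposed: List[A], signed: List[A], acked: List[A])

def discardUnsignedUpdates[A](local: Changes[A], remote: Changes[A]): (Changes[A], Changes[A]) =
  (local.copy(proposed = Nil), remote.copy(proposed = Nil))
```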

Member Author

@pm47 pm47 Oct 26, 2021


Done with b4f5548.

```scala
case res: SyncResult.LocalLateProven =>
  val msg = s"counterparty proved that we have an outdated (revoked) local commitment!!! ourLocalCommitmentNumber=${res.ourLocalCommitmentNumber} theirRemoteCommitmentNumber=${res.theirRemoteCommitmentNumber}"
  log.error(msg)
  context.system.eventStream.publish(NotifyNodeOperator(NotificationsLogger.Error, msg))
```
Member


For these 4 node operator logs, you must use NotificationsLogger.logFatalError or create a different direct function call in NotificationsLogger.

Otherwise if the node is stopped because of the strategy, the message will likely not be logged because of the pub/sub delay.

Also, the message should explicitly include the nodeId and channelId since the NotificationsLogger doesn't have access to the channel's MDC.

Member Author


Done with 9696436. I'm not even sure we should include the nodeId/channelId info, if the goal is to give general guidance to the operator on how to proceed.

@pm47
Member Author

pm47 commented Oct 26, 2021

Don't forget to also add an entry in the release notes for this, it's really newsworthy ;)

Done with 9957a83. I could include the full doc, but I don't think we should inflate the release notes too much? On the other hand, it would save us from using a relative link, which will break at some point.

@pm47
Member Author

pm47 commented Oct 27, 2021

Rebased on top of #2036, the diff is now much smaller.

This PR adds strategies to handle outdated commitments.

The default strategies may be the best choice for smaller, loosely administered nodes, while the alternative strategies can avoid an unnecessary mass force-close (but are reserved for advanced users who closely monitor their node).

Strategies for outdated commitments:
- request the counterparty to close the channel (default).
- if the node was restarted less than 10 min ago, log an error message and stop the node

Default settings maintain the same behavior as before.
@pm47 pm47 changed the title from "Alternate strategies for outdated commitments and unhandled exceptions" to "Alternate strategies for outdated commitments" on Oct 27, 2021
@pm47
Member Author

pm47 commented Oct 27, 2021

Rebased on top of #2037.

This PR is not immediately useful due to how other implementations react when they detect that the peer is using an outdated commitment: they immediately force-close. It would actually be dangerous not to use the default strategy with a non-static-remote-key, non-anchor-output commitment, because the peer will not resend their channel_reestablish, which tells us how to spend our main balance.
