watchtower: Add retry logic for fetching blocks #8381
Conversation
Thanks for the PR @anibilthare! Left some comments
watchtower/lookout/lookout.go
Outdated
```go
block, err := fetchBlockWithRetries(l.cfg.BlockFetcher,
	epoch.Hash, 3)
```
nit: formatting
```go
block, err := fetchBlockWithRetries(
	l.cfg.BlockFetcher, epoch.Hash, 3,
)
```
also - you could probably make `fetchBlockWithRetries` a method instead, since then there is no need to pass the block fetcher. Then you can also more easily listen on the quit channel between retries.
> also - you could probably make `fetchBlockWithRetries` a method instead, since then there is no need to pass the block fetcher. Then you can also more easily listen on the quit channel between retries.
Could you help me understand a bit more about your approach? How exactly are we planning to keep an eye on the quit channel? It kind of sounds like you're suggesting that every channel needs constant monitoring for breaches. Maybe I'm missing something here. Can you break it down a bit more for me? Thanks!
ah - so the channel I'm referring to here is a Go channel, not a Lightning channel. I mean the Lookout's `quit chan struct{}` member. It will signal on that channel (by closing it) when it is exiting.
watchtower/lookout/lookout.go
Outdated
```go
var block *wire.MsgBlock
var err error

for attempt := 0; attempt <= maxRetries; attempt++ {
```
what do you think about instead retrying indefinitely with an exponential back-off between retries?
```go
const baseDelay = 1 // base delay in seconds
const maxDelay = 60 // maximum delay in seconds

for attempt := 0; ; attempt++ {
	block, err = fetcher.GetBlock(hash)
	if err == nil {
		return block, nil
	}

	// Calculate delay with exponential back-off.
	delay := time.Duration(math.Min(
		math.Pow(2, float64(attempt))*float64(baseDelay),
		float64(maxDelay),
	)) * time.Second

	if attempt > 0 {
		log.Infof("Block fetch failed, retrying attempt %d "+
			"after %v", attempt, delay)
	}

	// Wait for the calculated delay before the next attempt.
	time.Sleep(delay)
}
```
Does something like this make sense? In case this project has preferences about how delays are handled, please let me know; for example, is calling `time.Sleep` okay?
yeah something like this but with a few tweaks:

- I think you can just do something simple like this for the back-off (copied from the session negotiator):

```go
updateBackoff := func() {
	if backoff == 0 {
		backoff = n.cfg.MinBackoff
	} else {
		backoff *= 2
		if backoff > n.cfg.MaxBackoff {
			backoff = n.cfg.MaxBackoff
		}
	}
}
```

- We should not use `time.Sleep` in the code base, since we can't listen on a quit channel while sleeping. So I suggest something like this at the end of your for-loop instead:

```go
select {
case <-time.After(backoff):
case <-l.quit:
	return
}
```
watchtower/lookout/lookout.go
Outdated
```go
// in sync with the backend.
block, err := fetchBlockWithRetries(l.cfg.BlockFetcher,
	epoch.Hash, 3)

if err != nil {
	// TODO(conner): add retry logic?
```
should remove this TODO in this PR
watchtower/lookout/lookout.go
Outdated
```go
block, err = fetcher.GetBlock(hash)
if err == nil {
```
i think we still want to log the retry errors.
Thanks for the review @ellemouton ! I'll address nits later once main logic has been finalised.
Force-pushed from d57e9f0 to 35598e5
@ellemouton I have implemented the change you suggested. I have further added a bool to the list of returned values; its purpose is early exit, so that we don't run into something weird (we might not, but I've placed it there for safety).
Force-pushed from 35598e5 to 8f4981e
Thanks @anibilthare :) Here is my suggested patch to your diff (I haven't tested it yet, so please do so). I've re-added the max-retries as well, since we probably don't want a failure in one block to prevent us from fetching more blocks forever (my bad - I should have thought of this before 🙈).

So this improves what we have today (which tries once and then moves on), but eventually (in a different PR) we should be able to handle the case where one block fetch continues to fail: we should not let that prevent us from moving on, but should also not just never get back to that block. So we should either handle retries of past blocks explicitly, or we should do a hard error out (i.e., LND will be forced to restart) if we cannot fetch the block after x number of retries, since then clearly something is wrong.
Force-pushed from 8f4981e to 5b37b13
Thanks for the patch @ellemouton. I have pushed the changes according to your suggestions. PTAL. I have tested it and the only issue I could see was
Force-pushed from 5b37b13 to d8cf3f1
@ellemouton Does this look okay?
Code looks good!
Just need to rebase and add something to the release notes entry to show that this is a watchtower server change.
Also, in future, just click the re-request review button, otherwise it doesn't show up in my review queue 🙏
```
* [Add retry logic for block fetching](https://github.com/lightningnetwork/lnd/pull/8381)
  block fetching to retry indefinitely with an exponential back-off between
  retries.
```
I think you should mention that this is for a watchtower server
@anibilthare - do you still plan to continue on this?
Closing as it seems the author has disappeared.
@ellemouton can I continue on this? Will it be possible for you to reopen this?
@anibilthare - re-opened 👍
Force-pushed from fbbd408 to ab80d28
@ellemouton I have rebased my changes and updated the release notes as well. Can you please let me know if I need to move my release notes to some other file? Apart from this, can you please also tell me how we decide where to put these release notes?
@ellemouton are these tests locally runnable?
nice! I think this is good to go.
Last thing is to move the release notes entry to the 19 doc 🙏 (can re-request from me once this is done).
Re your question about running the tests locally: yes, you can run them locally, but all of these here look like flakes and not related to this diff.
@ellemouton I have updated the docs. Thanks for the review.
lgtm, thanks! 🚀
cc @saubyk for second reviewer here 🙏
@saubyk pls take a look
Code LGTM, a single nit is left 🎉. Thanks for the PR! Could you rebase as described in the comment to have a clean commit history?
Force-pushed from 0a19667 to 1482146
@anibilthare - this still contains a merge commit
@anibilthare, remember to re-request review from reviewers when ready
Force-pushed from b7b94a2 to b2cbfe3
Cool - looks good. I think we can put the release notes entry under Code Health though.
Force-pushed from c942b69 to 78246c0
By default, try to fetch the blocks 3 more times in case of error.
Force-pushed from 78246c0 to ecd4480
LGTM 🗼, thank you!
cc @guggero for override merge 🙏 (failing required check is unrelated)
Change Description
Fixes #8205. This change sets the total retry attempt count for fetching blocks to 4 (1 original attempt + 3 retries).
Steps to Test
I'm uncertain how reviewers might naturally replicate a scenario in which the backend fails to deliver blocks to lnd. For testing purposes, I introduced a probabilistic failure into the `GetBlock` function (by returning a non-nil error). Below is the code snippet I used to implement this probabilistic error injection. I used bitcoind as the backend for testing.
Testing
This represents a scenario where the block is finally fetched after all attempts have been exhausted.
This represents a scenario where the block is fetched after 2 attempts (1 original + 1 retry).
This represents a scenario where block fetching fails entirely.
I've been contemplating the practicality of implementing a unit test for this function. Given that we'll be mocking the `GetBlock` function to return either a nil or a non-nil error, I'm curious about the overall usefulness of such a test. Could someone help clarify whether there's a significant benefit to this approach? I'm open to insights or suggestions on how we might enhance the testing strategy for this part of the code.