watchtower: Add retry logic for fetching blocks #8381
Conversation
Thanks for the PR @anibilthare! Left some comments
watchtower/lookout/lookout.go
Outdated
```go
block, err := fetchBlockWithRetries(l.cfg.BlockFetcher,
	epoch.Hash, 3)
```
nit: formatting
```go
block, err := fetchBlockWithRetries(
	l.cfg.BlockFetcher, epoch.Hash, 3,
)
```
also - you could probably make `fetchBlockWithRetries` a method instead, since then there is no need to pass the block fetcher. Then you can also more easily listen on the quit channel between retries.
> also - you could probably make `fetchBlockWithRetries` a method instead, since then there is no need to pass the block fetcher. Then you can also more easily listen on the quit channel between retries.
Could you help me understand a bit more about your approach? How exactly are we planning to keep an eye on the quit channel? It kind of sounds like you're suggesting that every channel needs constant monitoring for breaches. Maybe I'm missing something here. Can you break it down a bit more for me? Thanks!
ah - so the channel I'm referring to here is a Go channel, not a Lightning channel. I mean the Lookout's `quit chan struct{}` member. It will signal on that channel (by closing it) when it is exiting.
watchtower/lookout/lookout.go
Outdated
```go
var block *wire.MsgBlock
var err error

for attempt := 0; attempt <= maxRetries; attempt++ {
```
what do you think about instead retrying indefinitely with an exponential back-off between retries?
```go
const baseDelay = 1 // base delay in seconds
const maxDelay = 60 // maximum delay in seconds

for attempt := 0; ; attempt++ {
	block, err = fetcher.GetBlock(hash)
	if err == nil {
		return block, nil
	}

	// Calculate delay with exponential back-off.
	delay := time.Duration(math.Min(
		math.Pow(2, float64(attempt))*float64(baseDelay),
		float64(maxDelay),
	)) * time.Second

	if attempt > 0 {
		log.Infof("Block fetch failed, retrying attempt %d "+
			"after %v", attempt, delay)
	}

	// Wait for the calculated delay before the next attempt.
	time.Sleep(delay)
}
```
Does something like this make sense? In case this project has preferences about how delays are handled, please let me know; for example, is calling `time.Sleep` okay?
yeah something like this but with a few tweaks:

- I think you can just do something simple like this for the back-off (copied from the session negotiator):

```go
updateBackoff := func() {
	if backoff == 0 {
		backoff = n.cfg.MinBackoff
	} else {
		backoff *= 2
		if backoff > n.cfg.MaxBackoff {
			backoff = n.cfg.MaxBackoff
		}
	}
}
```

- We should not use `time.Sleep` in the code base, since we can't listen on a quit channel while sleeping. So I suggest something like this at the end of your for-loop instead:

```go
select {
case <-time.After(backoff):
case <-l.quit:
	return
}
```
watchtower/lookout/lookout.go
Outdated
```go
// in sync with the backend.
block, err := fetchBlockWithRetries(l.cfg.BlockFetcher,
	epoch.Hash, 3)

if err != nil {
	// TODO(conner): add retry logic?
```
should remove this TODO in this PR
watchtower/lookout/lookout.go
Outdated
```go
block, err = fetcher.GetBlock(hash)
if err == nil {
```
i think we still want to log the retry errors.
Thanks for the review @ellemouton ! I'll address nits later once main logic has been finalised.
Force-pushed from d57e9f0 to 35598e5
@ellemouton I have implemented the change you suggested. I have further added a bool to the list of returned values; its purpose is early exit, so that we don't run into something weird (we might not, but I've placed it there for safety).
Force-pushed from 35598e5 to 8f4981e
Thanks @anibilthare :) Here is my suggested patch to your diff (I haven't tested it yet, so please do so). I've re-added the max-retries as well, since we probably don't want a failure in one block to prevent us from fetching more blocks forever (my bad - I should have thought of this before 🙈).

So this improves what we have today (which tries once and then moves on), but eventually (in a different PR) we should be able to handle the case where one block fetch continues to fail: we should not let that prevent us from moving on, but should also not just never get back to that block. So we should either handle retries of past blocks explicitly, or we should do a hard error out (i.e., LND will be forced to restart) if we cannot fetch the block after x number of retries, since then clearly something is wrong.
Force-pushed from 8f4981e to 5b37b13
Thanks for the patch @ellemouton. I have pushed the changes according to your suggestions. PTAL. I have tested it and the only issue I could see was
Force-pushed from 5b37b13 to d8cf3f1
@ellemouton Does this look okay?
Code looks good!
Just need to rebase and add something to the release notes entry to show that this is a watchtower server change.
Also, in future, just click the re-request review button, otherwise it doesn't show up in my review queue 🙏
```
* [Add retry logic for block fetching](https://github.com/lightningnetwork/lnd/pull/8381)
  block fetching to retry indefinitely with an exponential back-off between
  retries.
```
I think you should mention that this is for a watchtower server
@anibilthare - do you still plan to continue on this?
Closing as it seems the author has disappeared.
@ellemouton can I continue on this? Will it be possible for you to reopen this?
@anibilthare - re-opened 👍
Force-pushed from fbbd408 to ab80d28
@ellemouton I have rebased my changes and updated the release notes as well. Can you please let me know if I need to move my release notes to some other file? Apart from this, can you please also tell me how we decide where to put these release notes?
@ellemouton are these tests locally runnable?
nice! I think this is good to go.
Last thing is to move the release notes entry to the 19 doc 🙏 (can re-request from me once this is done).
Re your question about running the tests locally: yes, you can run them locally, but all of these here look like flakes and not related to this diff.
@ellemouton I have updated the docs. Thanks for the review.
lgtm, thanks! 🚀
cc @saubyk for second reviewer here 🙏
@saubyk pls take a look
Code LGTM, a single nit is left 🎉. Thanks for the PR! Could you rebase as described in the comment to have a clean commit history?
Force-pushed from 0a19667 to 1482146
@anibilthare - this still contains a merge commit
@anibilthare, remember to re-request review from reviewers when ready
Force-pushed from b7b94a2 to b2cbfe3
Cool - looks good. I think we can put the release notes entry under Code Health though.
Force-pushed from c942b69 to 78246c0
By default, try to fetch the blocks 3 more times in case of error.
Force-pushed from 78246c0 to ecd4480
LGTM 🗼, thank you!
cc @guggero for override merge 🙏 (failing required check is unrelated)
Change Description
Fixes #8205. This change sets the total retry attempt count for fetching blocks to 4 (1 original attempt + 3 retries).
Steps to Test
I'm uncertain how reviewers might naturally replicate a scenario in which the backend fails to deliver blocks to lnd. For testing purposes, I introduced a probabilistic failure into the `GetBlock` function (by returning a non-nil error). Below is the code snippet I used to implement this probabilistic error injection. I used bitcoind as the backend for testing.
Testing
This represents a scenario where the block is finally fetched after all attempts have been exhausted.
This represents a scenario where the block is fetched after 2 attempts (1 original + 1 retry).
This represents a scenario where block fetching fails entirely.
I've been contemplating the practicality of implementing a unit test for this function. Given that we'll be mocking the `GetBlock` function to return either a nil or a non-nil error, I'm curious about the overall usefulness of such a test. Could someone help clarify whether there's a significant benefit to this approach? I'm open to insights or suggestions on how we might enhance the testing strategy for this part of the code.