
Conversation

Contributor

@anibilthare anibilthare commented Jan 14, 2024

Change Description

Fixes #8205. This change sets the total attempt count for fetching blocks to 4 (1 original attempt + 3 retries).

Steps to Test

I'm uncertain how reviewers might naturally replicate a scenario in which the backend fails to deliver blocks to lnd. For testing purposes, I introduced a probabilistic failure in the GetBlock function (by returning a non-nil error). Below is the code snippet I used to implement this error injection; I used bitcoind as the backend.

type MyCustomError struct {
	Msg string
}

func (e MyCustomError) Error() string {
	return fmt.Sprintf("Error: %s", e.Msg)
}

func (b *BtcWallet) GetBlock(blockHash *chainhash.Hash) (*wire.MsgBlock, error) {
	// math/rand is auto-seeded since Go 1.20, so no Seed call is
	// needed. Fail roughly 80% of the time to exercise the retry
	// path.
	randomNumber := rand.Intn(100) + 1
	_, ok := b.chain.(*chain.NeutrinoClient)
	if !ok {
		block, err := b.blockCache.GetBlock(blockHash, b.chain.GetBlock)
		if randomNumber < 20 {
			return block, err
		}
		return block, MyCustomError{Msg: "Animesh: Pseudo Error"}
	}

	// For the neutrino implementation of lnwallet.BlockChainIO the neutrino
	// GetBlock function can be called directly since it uses the same block
	// cache. However, it does not lock the block cache mutex for the given
	// block hash, so that is done here.
	b.blockCache.HashMutex.Lock(lntypes.Hash(*blockHash))
	defer b.blockCache.HashMutex.Unlock(lntypes.Hash(*blockHash))

	return b.chain.GetBlock(blockHash)
}

Testing

[screenshot]

This represents the scenario where the block is fetched on the final attempt, after all earlier attempts failed.

[screenshot]

This represents the scenario where the block is fetched on the second attempt (1 original + 1 retry).

[screenshot]

This represents the scenario where block fetching fails even after all retries are exhausted.

I've been contemplating the practicality of a unit test for this function. Given that we'd be mocking the GetBlock function to return either a nil or a non-nil error, I'm unsure how useful such a test would be. Could someone clarify whether there's a significant benefit to this approach? I'm open to suggestions on how we might strengthen the testing strategy for this part of the code.
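As a sketch of what such a test could still usefully assert, here is a minimal, self-contained example (not the PR's actual code; `mockFetcher` and `fetchWithRetries` are hypothetical stand-ins): beyond nil/non-nil errors, a mock lets the test verify the attempt count itself, i.e. that one original attempt plus up to maxRetries retries are made.

```go
package main

import (
	"errors"
	"fmt"
)

// mockFetcher is a hypothetical stand-in for the real block fetcher:
// it fails a fixed number of times before succeeding, which lets a
// test assert the retry count as well as the final result.
type mockFetcher struct {
	failures int
	calls    int
}

func (m *mockFetcher) GetBlock() (string, error) {
	m.calls++
	if m.calls <= m.failures {
		return "", errors.New("transient failure")
	}
	return "block", nil
}

// fetchWithRetries mirrors the helper under discussion: one original
// attempt plus up to maxRetries retries, so maxRetries=3 means 4
// attempts in total.
func fetchWithRetries(f *mockFetcher, maxRetries int) (string, error) {
	var (
		blk string
		err error
	)
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if blk, err = f.GetBlock(); err == nil {
			return blk, nil
		}
	}
	return "", err
}

func main() {
	// Succeeds on the 3rd call: two failures, then a success.
	m := &mockFetcher{failures: 2}
	blk, err := fetchWithRetries(m, 3)
	fmt.Println(blk, err == nil, m.calls) // block true 3

	// Never succeeds: all 4 attempts are used and the error surfaces.
	m = &mockFetcher{failures: 100}
	_, err = fetchWithRetries(m, 3)
	fmt.Println(err != nil, m.calls) // true 4
}
```

So the value of the test is less about the error value and more about pinning down the retry contract (exactly maxRetries+1 calls on persistent failure, early return on success).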

@ellemouton ellemouton self-requested a review January 15, 2024 08:18
Collaborator

@ellemouton ellemouton left a comment


Thanks for the PR @anibilthare! Left some comments

Comment on lines 177 to 178
block, err := fetchBlockWithRetries(l.cfg.BlockFetcher,
epoch.Hash, 3)
Collaborator

nit: formatting

block, err := fetchBlockWithRetries(
l.cfg.BlockFetcher, epoch.Hash, 3,
)

Collaborator

also - you could probably make fetchBlockWithRetries a method instead since then there is no need to pass the block fetcher.

Then you can also more easily listen on the quit channel between retries

Contributor Author

also - you could probably make fetchBlockWithRetries a method instead since then there is no need to pass the block fetcher.

Then you can also more easily listen on the quit channel between retries

Could you help me understand a bit more about your approach? How exactly are we planning to keep an eye on the quit channel? It kind of sounds like you're suggesting that every channel needs constant monitoring for breaches. Maybe I'm missing something here. Can you break it down a bit more for me? Thanks!

Collaborator

ah - so the channel I'm referring to here is a Go channel, not a Lightning channel. I mean the Lookout's quit chan struct member: it signals on that channel when it is exiting.

var block *wire.MsgBlock
var err error

for attempt := 0; attempt <= maxRetries; attempt++ {
Collaborator

what do you think about instead retrying indefinitely with an exponential back-off between retries?

Contributor Author

const baseDelay = 1 // base delay in seconds
const maxDelay = 60 // maximum delay in seconds

for attempt := 0; ; attempt++ {
	block, err = fetcher.GetBlock(hash)
	if err == nil {
		return block, nil
	}

	// Calculate the delay with exponential back-off, capped at
	// maxDelay.
	delay := time.Duration(math.Min(
		math.Pow(2, float64(attempt))*float64(baseDelay),
		float64(maxDelay),
	)) * time.Second

	if attempt > 0 {
		log.Infof("Block fetch failed, retrying attempt %d "+
			"after %v", attempt, delay)
	}

	// Wait for the calculated delay before the next attempt.
	time.Sleep(delay)
}

Does something like this make sense? If the project has preferences about how delays are handled, please let me know; for example, is calling Sleep okay?

Collaborator

@ellemouton ellemouton Jan 15, 2024

yeah, something like this but with a few tweaks:

1. I think you can just do something simple like this for the back-off (copied from the session negotiator):

updateBackoff := func() {
	if backoff == 0 {
		backoff = n.cfg.MinBackoff
	} else {
		backoff *= 2
		if backoff > n.cfg.MaxBackoff {
			backoff = n.cfg.MaxBackoff
		}
	}
}

2. We should not use time.Sleep in the code base, since we can't listen on a quit channel while sleeping. So I suggest something like this at the end of your for-loop instead:

select {
case <-time.After(backoff):
case <-l.quit:
	return
}

// in sync with the backend.
block, err := fetchBlockWithRetries(l.cfg.BlockFetcher,
epoch.Hash, 3)

if err != nil {
// TODO(conner): add retry logic?
Collaborator

We should remove this TODO in this PR.

Comment on lines 144 to 145
block, err = fetcher.GetBlock(hash)
if err == nil {
Collaborator

I think we still want to log the retry errors.

Contributor Author

@anibilthare anibilthare left a comment

Thanks for the review @ellemouton! I'll address the nits later, once the main logic has been finalised.


@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch from d57e9f0 to 35598e5 Compare January 16, 2024 08:35
@anibilthare
Contributor Author

@ellemouton I have implemented the change you suggested. I have also added a bool to the list of returned values; its purpose is an early exit, so that we don't run into something weird (we might not, but I've placed it there for safety).

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch from 35598e5 to 8f4981e Compare January 16, 2024 10:08
@ellemouton
Collaborator

ellemouton commented Jan 16, 2024

Thanks @anibilthare :) Here is my suggested patch to your diff (I haven't tested it yet, so please do so):
towerRetry.patch

I've re-added the max-retries as well, since we probably don't want a failure in 1 block to prevent us from fetching more blocks forever (my bad - I should have thought of this before 🙈)

So this improves what we have today (which tries once and then moves on), but eventually (in a different PR) we should handle the case where one block fetch continues to fail: we should not let that prevent us from moving on, but we should also not simply never get back to that block. So we should either handle retries of past blocks explicitly, or do a hard error out (i.e. LND will be forced to restart) if we cannot fetch the block after x number of retries, since then clearly something is wrong.

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch from 8f4981e to 5b37b13 Compare January 18, 2024 08:12
Contributor

coderabbitai bot commented Jan 18, 2024

Review skipped: auto reviews are limited to PRs with specific labels (llm-review).

@anibilthare
Contributor Author

anibilthare commented Jan 18, 2024

Thanks @anibilthare :) Here is my suggested patch to your diff (I haven't tested it yet, so please do so): towerRetry.patch

I've re-added the max-retries as well, since we probably don't want a failure in 1 block to prevent us from fetching more blocks forever (my bad - I should have thought of this before 🙈)

So this improves what we have today (which tries once and then moves on), but eventually (in a different PR) we should handle the case where one block fetch continues to fail: we should not let that prevent us from moving on, but we should also not simply never get back to that block. So we should either handle retries of past blocks explicitly, or do a hard error out (i.e. LND will be forced to restart) if we cannot fetch the block after x number of retries, since then clearly something is wrong.

[screenshot]

Thanks for the patch @ellemouton, I have pushed the changes according to your suggestions. PTAL.

I have tested it, and the only issue I could see was the "Retrying is x seconds" part in the logs, which I have fixed.

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch from 5b37b13 to d8cf3f1 Compare January 21, 2024 10:49
@anibilthare
Contributor Author

@ellemouton Does this look okay?

Collaborator

@ellemouton ellemouton left a comment

Code looks good!
Just need to rebase and add something to the release notes entry to show that this is a watchtower server change.

Also, in future, just click the re-request review button, otherwise it doesn't show up in my review queue 🙏

Comment on lines 132 to 305
* [Add retry logic for block fetching](https://github.com/lightningnetwork/lnd/pull/8381)
block fetching to retry indefinitely with an exponential back-off between
retries.
Collaborator

I think you should mention that this is for a watchtower server

@ellemouton
Collaborator

@anibilthare - do you still plan to continue on this?

@ellemouton
Collaborator

closing as it seems the author has disappeared

@ellemouton ellemouton closed this Apr 22, 2024
@anibilthare
Contributor Author

@ellemouton can I continue on this? Would it be possible for you to reopen it?

@ellemouton ellemouton reopened this Aug 1, 2024
@ellemouton
Collaborator

@anibilthare - re-opened 👍

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch 2 times, most recently from fbbd408 to ab80d28 Compare August 4, 2024 05:28
@anibilthare
Contributor Author

@ellemouton I have rebased my changes and updated the release notes as well.

Can you please let me know if I need to move my release notes to some other file?

Apart from this, can you please also tell me how we decide where to put these release notes?

@anibilthare anibilthare requested a review from ellemouton August 4, 2024 05:30
@anibilthare
Contributor Author

anibilthare commented Aug 23, 2024

@ellemouton are these tests runnable locally?

Collaborator

@ellemouton ellemouton left a comment

nice! I think this is good to go.

Last thing is to move the release notes entry to the 0.19 doc 🙏 (you can re-request a review from me once this is done)

Re your question about running the tests locally: yes, you can run them locally, but all of these failures look like flakes and are not related to this diff

@anibilthare
Contributor Author

@ellemouton I have updated the docs. Thanks for the review

Collaborator

@ellemouton ellemouton left a comment

lgtm, thanks! 🚀

cc @saubyk for second reviewer here 🙏

@anibilthare
Contributor Author

@saubyk pls take a look

@saubyk saubyk added this to the v0.19.0 milestone Sep 11, 2024
@saubyk saubyk requested a review from bitromortac September 11, 2024 18:26
Collaborator

@bitromortac bitromortac left a comment

Code LGTM, a single nit is left 🎉. Thanks for the PR! Could you rebase as described in the comment to have a clean commit history?

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch 3 times, most recently from 0a19667 to 1482146 Compare September 16, 2024 17:20
Collaborator

@ellemouton ellemouton left a comment

@anibilthare - this still contains a merge commit

@lightninglabs-deploy

@anibilthare, remember to re-request review from reviewers when ready

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch from b7b94a2 to b2cbfe3 Compare October 17, 2024 11:31
Collaborator

@ellemouton ellemouton left a comment

Cool - looks good. I think we can put the release notes entry under Code Health, though.

@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch 2 times, most recently from c942b69 to 78246c0 Compare October 26, 2024 11:28
By default, try to fetch the blocks 3 more times
in case of error.
@anibilthare anibilthare force-pushed the issue_8205_num_retries_2 branch from 78246c0 to ecd4480 Compare October 26, 2024 11:29
Collaborator

@bitromortac bitromortac left a comment

LGTM 🗼, thank you!

@ellemouton
Collaborator

cc @guggero for override merge 🙏 (failing required check is unrelated)

@guggero guggero merged commit c0e85d3 into lightningnetwork:master Oct 31, 2024
27 of 33 checks passed
Successfully merging this pull request may close these issues:

[bug]: Watchtower gives up to fetch block in case of timeout