Benchmark Prometheus restart scenarios (for WAL + snapshot cost and timing) #820

@bwplotka

I propose we restart the Prometheus instances during the standard prombench runs, e.g.:

  • graceful restart (kubectl delete pod) after 3h of the prombench run.
  • forceful restart (kubectl delete pod --grace-period=0 --force) after 6h of the prombench run (so 3h after the first restart).

This would let us test important Prometheus features, such as the use of WAL checkpoints and memory snapshots during replay, which in the past caused resource spikes and can take some time. We have also planned more work to improve this flow, so reliable metrics would be nice to have.

This killing logic could perhaps be implemented in the scaler, which already has access to the Kube API.

On top of that, I would ensure we:

  • Add a dashboard panel for a startup time metric (if such a metric does not exist, we might want to add one, e.g. time to readiness).
  • Add some vertical lines/thresholds in the dashboards to show that the drop in all metrics is expected, or maybe another panel/metric? (This could perhaps be done with some events?)

WDYT? @bboreham @kakkoyun
