Benchmark Prometheus restart scenarios (for WAL + snapshot cost and timing) #820

I propose we restart the Prometheus instances during the standard prombench runs, e.g.:

- graceful restart (`kubectl delete pod`) after 3h of the prombench run.
- forceful restart (`kubectl delete pod --grace-period=0`) after 6h of the prombench run (so 3h after the first restart).

This allows us to test important Prometheus features, such as the use of WAL checkpoints and memory snapshots during replay, which in the past caused resource spikes and can take some time. We have also planned more work to improve this flow, so reliable metrics would be nice to have.

This killing logic could perhaps be implemented in `scaler`, which already has access to the Kube API.

On top of that I would ensure we:

* Add a dashboard panel for a startup time metric (if no such metric exists, we might want to add one, e.g. time to readiness).
* Add some vertical lines/thresholds to the dashboards to show that the drop in all metrics is expected, or maybe another panel/metric? (Perhaps this could be done with some events?)

WDYT? @bboreham @kakkoyun 
