Description
I propose we restart the Prometheus instances during the standard prombench runs, e.g.:
- graceful restart (kubectl delete pod) after 3h of the prombench run.
- forceful restart (kubectl delete pod --grace-period=0 --force) after 6h of the prombench run (so 3h after the first restart).
This allows us to test important Prometheus features, such as the use of WAL checkpoints and memory snapshots during replay, which in the past caused resource spikes and can take some time. We also plan more work to improve this flow, so reliable metrics would be nice to have.
This restart logic could perhaps be implemented in scaler, which already has access to the Kubernetes API.
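To make the proposed schedule concrete, here is a rough sketch of the decision logic only. It is not how scaler works today: the names `Restart`, `SCHEDULE`, `kubectl_args` and `due_restarts` are made up for illustration, and a real implementation in scaler would presumably call the Kubernetes API directly instead of building kubectl command lines.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Restart:
    """One planned restart (hypothetical type, for illustration only)."""
    after: timedelta  # time elapsed since the start of the prombench run
    graceful: bool    # False means a forceful delete with no grace period


# Schedule taken from the proposal: graceful at 3h, forceful at 6h.
SCHEDULE = (
    Restart(after=timedelta(hours=3), graceful=True),
    Restart(after=timedelta(hours=6), graceful=False),
)


def kubectl_args(pod, namespace, graceful):
    """Build the kubectl command line that deletes the given pod."""
    args = ["kubectl", "delete", "pod", pod, "--namespace", namespace]
    if not graceful:
        # kubectl only accepts a zero grace period together with --force.
        args += ["--grace-period=0", "--force"]
    return args


def due_restarts(elapsed, already_done):
    """Return scheduled restarts that are due and have not been executed yet."""
    return [r for r in SCHEDULE if r.after <= elapsed and r not in already_done]
```

A loop in the component that owns this logic could call `due_restarts` periodically with the run's elapsed time, execute each result, and record it in `already_done` so that every restart happens exactly once per run.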
On top of that I would ensure we:
- Add a dashboard panel for a startup time metric (if no such metric exists, we might want to add one, e.g. time to readiness).
- Add vertical lines or thresholds to the dashboards to show that the drop in all metrics is expected, or maybe add another panel/metric. (Perhaps this could be done with events?)
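Until a startup time metric exists, time to readiness could be approximated from the outside by polling Prometheus' `/-/ready` endpoint after the restart. A minimal sketch, assuming the restarted instance is reachable at some base URL; the function names, the polling interval and the deadline are all hypothetical:

```python
import time
import urllib.request


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/-/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/-/ready", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts and non-2xx answers
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False


def time_to_readiness(probe, interval=1.0, deadline=3600.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True; return the seconds waited.

    Returns None if the deadline passes first. clock and sleep are
    injectable so the loop can be exercised without real waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline:
            return None
        sleep(interval)
```

For example, `time_to_readiness(lambda: is_ready("http://localhost:9090"))` would start measuring at the moment of the call, so it should be invoked right after the pod is deleted. The result is only as precise as the polling interval, which is one reason a metric exposed by Prometheus itself would be preferable.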