- abstract={Cloud providers install mitigations to reduce the impact of network failures in their datacenters. To determine the best action, existing automatic network mitigation systems rely on simple local criteria or global proxy metrics. In this paper, we show that we can explicitly optimize end-to-end flow-level metrics and analyze actions holistically to support a broader range of actions and select much more effective mitigations. To this end, we develop novel techniques to quickly estimate the impact of different mitigations and rank them with high fidelity. Our results on incidents from a large cloud provider show orders of magnitude improvements in flow completion time and throughput. We also show our approach scales to large datacenters.}