TRT has identified scenarios in which the CI cluster running the tests (as opposed to the cluster under test) occasionally loses networking, causing disruption tests to fire. We noticed this when several aggregated job runs across clouds and platforms all logged disruption at exactly the same time. To help tell these cases apart, we should add a new disruption backend that polls an external service. If the external service appears down at the same time as the cluster under test, we know the disruption is not real.
This external-service check is already failing too often, and the failures do not even correspond to observed disruption in the cluster under test. We are not sure what is going on, so for now the test allows up to 10 minutes of disruption before failing and flakes if it sees any disruption at all. That lets us gather data and compare.