Description of problem:
As of today we are seeing e2e-azure and azure-serial pass relatively consistently for 4.2.0-0.nightly runs. But a ~20% failure rate still leaves a lot to be desired for release. See the latest runs at: https://openshift-release.svc.ci.openshift.org/.
For failing runs, the situation generally boils down to two cases:
1. The cluster never initialized. This is generally due to initialization hitting the 30min timeout. This happens less often than the second case but is still a cause for concern (though not directly related to this BZ).
2. The cluster initialized, but some tests failed due to timeouts. If we take a look at failing tests at:
We can see that there's a cascading bar of red, indicating that on every run a different subset of tests fails. Those failures are generally timeouts.
About a month ago we did some triaging, which led us to discover that disk/network latency was very high, often causing etcd to become overloaded: https://bugzilla.redhat.com/show_bug.cgi?id=1737660
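For reference, the disk side of that triage can be reproduced without the full cluster: etcd issues a small write followed by a sync on every WAL commit, and its docs suggest that sync latency needs to stay in the low single-digit milliseconds for etcd to keep up. Below is a minimal, hypothetical Python sketch of that access pattern (the function name and parameters are mine, not from the bug):

```python
import os
import statistics
import tempfile
import time

def fsync_latencies(path, writes=50, block=2048):
    """Time small append+fsync cycles, roughly the pattern etcd's WAL
    performs on every commit. (etcd actually uses fdatasync; on Linux,
    os.fdatasync can be substituted to match it exactly.)"""
    lat = []
    with open(path, "wb") as f:
        for _ in range(writes):
            f.write(os.urandom(block))  # small sequential append, like a WAL entry
            f.flush()
            start = time.perf_counter()
            os.fsync(f.fileno())        # the sync is what dominates on slow disks
            lat.append(time.perf_counter() - start)
    return lat

with tempfile.NamedTemporaryFile() as tmp:
    lat = fsync_latencies(tmp.name)
    print(f"median: {statistics.median(lat) * 1e3:.2f} ms, "
          f"max: {max(lat) * 1e3:.2f} ms")
```

Run against the etcd data directory's disk on an affected node, consistently high numbers here would point at storage latency rather than etcd itself.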
I took a look at some recent failures and it seems that etcd "overloaded"/"took too long" messages are even more prevalent than before. See "Test job failure etcd logs" at the bottom of this description for examples.
I'm not 100% certain that this is the root cause of the flakiness of running e2e/serial on Azure, but it is certainly concerning that etcd is reporting so many overload warnings. I think this is a good place to start debugging overall platform performance.
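To quantify "more prevalent" across runs, a quick sketch like the following could tally the overload-related warnings in gathered etcd pod logs. The substrings are the ones I'd expect from etcd's log messages; treat them as assumptions and adjust to whatever the actual logs contain:

```python
from collections import Counter

# Warning substrings etcd emits when it falls behind on disk/network.
# Assumed patterns; extend after inspecting real failure logs.
PATTERNS = ("took too long", "server is likely overloaded", "lost leader")

def count_overload_warnings(log_text):
    """Count occurrences of each overload-related warning in an etcd log."""
    counts = Counter({p: 0 for p in PATTERNS})
    for line in log_text.splitlines():
        for p in PATTERNS:
            if p in line:
                counts[p] += 1
    return counts

sample = (
    "W | etcdserver: read-only range request ... took too long (1.2s) to execute\n"
    "W | etcdserver: server is likely overloaded\n"
    "I | raft: elected leader abc123 at term 4\n"
)
print(count_overload_warnings(sample))
```

Running this over the etcd logs from each failing CI job would make it easy to compare warning counts between passing and failing runs.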
Version-Release number of selected component (if applicable):
Test job failure etcd logs:
I think we can close this bug per the comments above; feel free to reopen it if there are further issues. Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.