Description of problem:

TRT has identified a problem that seems fairly common on OVN (particularly on Azure) where the etcd operator fails to upgrade:

https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Some snapshots of results from today:

periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade (all) - 211 runs, 85% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 437 runs, 79% failed, 9% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 28 runs, 100% failed, 7% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade (all) - 28 runs, 68% failed, 5% of failures match = 4% impact

If you exclude OVN job results, there are NO hits; this problem seemingly never happens on SDN.

In the main examples TRT tried to analyze, the etcd operator reports Available=false for this reason for hours:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516832201574977536

We dug through the logs but were not able to determine why the members are unhealthy, though the logs of the main etcd pods on the bad nodes contained some suspicious "apply request took too long" entries throughout.

TRT is going to investigate breaking out a clear test for this specific case.
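For anyone else retracing this, here is a rough sketch of the kind of checks involved. The artifact path and node name are illustrative placeholders, not the exact ones from the linked run:

# Count the "apply request took too long" warnings per etcd member in the gathered
# pod logs (assumes the usual gather-extra layout under the job's artifacts/ dir).
$ grep -c "apply request took too long" \
    artifacts/*/gather-extra/artifacts/pods/openshift-etcd_etcd-*_etcd.log

# On a live cluster stuck in this state, member health can be checked from the
# etcdctl container of any etcd pod (substitute a real control-plane node name).
$ oc -n openshift-etcd exec etcd-<master-node> -c etcdctl -- etcdctl endpoint health --cluster
$ oc -n openshift-etcd exec etcd-<master-node> -c etcdctl -- etcdctl member list -w table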
No hits for three days; I'll keep watching and report back as soon as we see something.
*** Bug 2016574 has been marked as a duplicate of this bug. ***
It's back:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519289813402914816
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519110556793966592
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade/1519110557616050176
With bug 2016574 closed as a dup, this bug is also now aiming to fix:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=120h&type=junit&search=event+happened.*times.*something+is+wrong.*deployment/etcd-operator.*Degraded+message+changed.*EndpointsDegraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact

To avoid issues like:

: [sig-arch] events should not repeat pathologically	0s
2 events happened too frequently

event happened 32 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"

event happened 29 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
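For reference, a quick way to eyeball the same churn on a cluster that is reproducing it (just a sketch; it relies on the standard core/v1 Event reason and count fields and assumes the events haven't aged out yet):

# List the repeated OperatorStatusChanged events in the etcd operator namespace,
# sorted by how many times each one fired.
$ oc -n openshift-etcd-operator get events --field-selector reason=OperatorStatusChanged \
    -o jsonpath='{range .items[*]}{.count}{"\t"}{.message}{"\n"}{end}' | sort -rn | head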