Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12

We saw the installer fail with this:

```
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-stqltndy-eb682.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Working towards 4.3.0-0.nightly-2019-12-04-054458: 100% complete, waiting on authentication"
2019/12/04 06:44:10 Container setup in pod e2e-aws-ovn failed, exit code 1, reason Error
Another process exited
```

Investigating, wking found that the server seems to have been flaky:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep -c 'connection refused'
188
```

even as late as ~5 minutes before the timeout:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'connection refused' | tail -n2
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499730 1 memcache.go:189] couldn't get current server API group list: Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499793 1 task.go:77] error running apply for servicemonitor "openshift-insights/insights-operator" (335 of 492): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(monitoring.coreos.com/v1, Kind=ServiceMonitor): Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
```
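(For anyone re-running this triage against the archived CVO log: a minimal sketch of how one might bucket those 'connection refused' errors by minute, to see whether they were a steady drip or bursts around apiserver restarts. The CVO_LOG_URL variable is a placeholder for the current.log URL quoted above; the awk/sort/uniq pipeline is just an illustration, not part of the original investigation.)

```
# Placeholder: set this to the cluster-version-operator current.log URL quoted above.
CVO_LOG_URL='https://storage.googleapis.com/origin-ci-test/.../current.log'

# Count 'connection refused' lines per minute; substr($1, 1, 16) keeps
# YYYY-MM-DDTHH:MM from the leading timestamp on each log line.
curl -s "$CVO_LOG_URL" |
  grep 'connection refused' |
  awk '{print substr($1, 1, 16)}' |
  sort | uniq -c
```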
When those 06:38 'connection refused' happened, the kube-apiserver operator was reporting things were fine [1]:

```
- lastTransitionTime: "2019-12-04T06:36:26Z"
  message: 'NodeControllerDegraded: All master node(s) are ready'
  reason: AsExpected
  status: "False"
  type: Degraded
- lastTransitionTime: "2019-12-04T06:42:06Z"
  message: 'Progressing: 3 nodes are at revision 5'
  reason: AsExpected
  status: "False"
  type: Progressing
- lastTransitionTime: "2019-12-04T06:12:36Z"
  message: 'Available: 3 nodes are active; 3 nodes are at revision 5'
  reason: AsExpected
  status: "True"
  type: Available
```

Hmm. Actually, it had been reporting Available=True for a long time and Degraded=False for over 1m30s. But it was maybe still Progressing=True; not sure if that's significant.

[1] https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/cluster-scoped-resources/config.openshift.io/clusteroperators/kube-apiserver.yaml
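(Side note for reproducing this check on a live cluster rather than from must-gather: the conditions above can be pulled straight from the clusteroperator object. This is just the generic command, assuming oc access; it wasn't part of the original triage.)

```
# Dump the kube-apiserver ClusterOperator, including .status.conditions, as YAML.
oc get clusteroperator kube-apiserver -o yaml

# Or list just the conditions (type, status, last transition time, message):
oc get clusteroperator kube-apiserver \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.lastTransitionTime}) {.message}{"\n"}{end}'
```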
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1776402 ?
Adding a dependency on bug #1775878. The timeout issues noted in the logs appear to overlap. Not positive, but wanted to draw the link.
This is targeted at 4.3.z but has no 4.4 or 4.5 clone. What is the current thinking around the impact of this bug?
The number of leader elections in this test is excessive (10)[1], but the disk I/O metrics do not seem to be the cause. In 4.3, etcd usually has 1 leader election for a CI run, so this is a big deal. More recent runs appear more sane[2], showing 1 leader election.

My thoughts are that these failures were seen during the LUKS RHCOS problems, where the signature was leader elections while the fsync metrics appeared fine; the reported data from 2019-12-04 lines up with this. Closing.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/871
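(For posterity, since the election count and fsync data aren't pasted here: a rough sketch of the kind of queries used for that comparison, against the standard etcd metric names via the in-cluster Prometheus. This assumes the usual openshift-monitoring prometheus-k8s-0 pod layout and that etcd metrics are being scraped; it's not the exact tooling used above.)

```
# Leader elections observed over the run; a healthy 4.3 CI run usually shows ~1.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])' \
  http://localhost:9090/api/v1/query

# 99th-percentile WAL fsync latency, the usual disk I/O suspect when elections churn.
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))' \
  http://localhost:9090/api/v1/query
```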
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days