Bug 1779810 - Flaky kube-apiserver causing operators to take time to become Available on AWS OVN 4.3
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.z
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1775878
Blocks:
 
Reported: 2019-12-04 19:04 UTC by Jonathan Lebon
Modified: 2023-09-14 05:48 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-30 01:51:14 UTC
Target Upstream Version:
Embargoed:



Description Jonathan Lebon 2019-12-04 19:04:57 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12

We saw the installer fail with this:

```
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-stqltndy-eb682.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Working towards 4.3.0-0.nightly-2019-12-04-054458: 100% complete, waiting on authentication"
2019/12/04 06:44:10 Container setup in pod e2e-aws-ovn failed, exit code 1, reason Error
Another process exited
```

Investigating, wking found:

The kube-apiserver seems to have been flaky:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep -c 'connection refused'
188

even as late as ~5 min before the timeout:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'connection refused' | tail -n2
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499730       1 memcache.go:189] couldn't get current server API group list: Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499793       1 task.go:77] error running apply for servicemonitor "openshift-insights/insights-operator" (335 of 492): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(monitoring.coreos.com/v1, Kind=ServiceMonitor): Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
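
For a rough picture of how long the flakiness persisted, the same log can be bucketed by minute with something like the following (a sketch; cut -c1-16 just keeps the YYYY-MM-DDTHH:MM prefix of each line):

# Count 'connection refused' errors per minute in the CVO log
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log \
    | grep 'connection refused' | cut -c1-16 | sort | uniq -c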

Comment 1 W. Trevor King 2019-12-04 19:11:55 UTC
When those 06:38 'connection refused' errors happened, the kube-apiserver operator was reporting that things were fine [1]:

  - lastTransitionTime: "2019-12-04T06:36:26Z"
    message: 'NodeControllerDegraded: All master node(s) are ready'
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-04T06:42:06Z"
    message: 'Progressing: 3 nodes are at revision 5'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-04T06:12:36Z"
    message: 'Available: 3 nodes are active; 3 nodes are at revision 5'
    reason: AsExpected
    status: "True"
    type: Available
 
Hmm.  Actually, it had been reporting Available=True for a long time, and Degraded=False for over 1m30s.  But it may still have been Progressing=True at 06:38 (that condition only went to False at 06:42:06); not sure if that's significant.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/cluster-scoped-resources/config.openshift.io/clusteroperators/kube-apiserver.yaml
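
For reference, the same conditions can be pulled from a live cluster with something like the following (a sketch, assuming a logged-in oc session; here they were read from the must-gather YAML linked above):

# Print type/status/reason for each kube-apiserver clusteroperator condition
$ oc get clusteroperator kube-apiserver \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'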

Comment 10 Greg Blomquist 2019-12-10 15:36:15 UTC
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1776402 ?

Comment 11 Greg Blomquist 2019-12-11 19:21:38 UTC
Adding dependency on bug #1775878

The timeout issues noted in the logs appear to overlap.  I'm not positive, but I wanted to draw the link.

Comment 13 Ben Parees 2020-04-29 20:40:32 UTC
This is targeted at 4.3.z but has no 4.4 or 4.5 clone; what is the current thinking around the impact of this bug?

Comment 14 Sam Batschelet 2020-04-30 01:51:14 UTC
The number of leader elections in this test is excessive (10)[1], but the disk I/O metrics do not seem to be the cause. In 4.3, etcd usually has 1 leader election for a CI run, so this is a big deal. More recent runs appear more sane[2], showing 1 leader election. My thinking is that these failures were seen during the LUKS RHCOS problems, where the signature was excess leader elections while the fsync metrics appeared fine; the reported date of 2019-12-04 lines up with this. Closing.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/871
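
For reference, the leader-election and fsync signals mentioned above map to standard etcd metrics (etcd_server_leader_changes_seen_total and etcd_disk_wal_fsync_duration_seconds). A minimal sketch for querying them against the in-cluster Prometheus; the prometheus-k8s-0 pod/container names and the availability of curl inside that container are assumptions that may vary by release:

# How many leader elections happened over the last hour?
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])' \
    http://localhost:9090/api/v1/query
# p99 WAL fsync latency; sustained values above ~10ms usually point at slow disks
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))' \
    http://localhost:9090/api/v1/query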

Comment 15 Red Hat Bugzilla 2023-09-14 05:48:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

