Bug 1779810
| Summary: | Flaky kube-apiserver causing operators to take time to become Available on AWS OVN 4.3 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jonathan Lebon <jlebon> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED NOTABUG | QA Contact: | ge liu <geliu> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.0 | CC: | aos-bugs, bparees, deads, gblomqui, jokerman, mfojtik, sbatsche, scuppett, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.3.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-04-30 01:51:14 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1775878 | | |
| Bug Blocks: | | | |
Description
Jonathan Lebon
2019-12-04 19:04:57 UTC
When those 06:38 'connection refused' happened, the kube-apiserver operator was reporting things were fine [1]:
- lastTransitionTime: "2019-12-04T06:36:26Z"
message: 'NodeControllerDegraded: All master node(s) are ready'
reason: AsExpected
status: "False"
type: Degraded
- lastTransitionTime: "2019-12-04T06:42:06Z"
message: 'Progressing: 3 nodes are at revision 5'
reason: AsExpected
status: "False"
type: Progressing
- lastTransitionTime: "2019-12-04T06:12:36Z"
message: 'Available: 3 nodes are active; 3 nodes are at revision 5'
reason: AsExpected
status: "True"
type: Available
Hmm. Actually, it had been reporting Available=True for a long time, and Degraded=False for over 1m30s. But it was maybe still Progressing=True; not sure if that's significant.
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/cluster-scoped-resources/config.openshift.io/clusteroperators/kube-apiserver.yaml
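For illustration, the Available/Degraded reasoning above can be sketched in a few lines of Python. The condition dicts mirror the YAML quoted in this report; the `operator_healthy` helper and its health rule (Available=True and not Degraded) are my own shorthand, not part of any OpenShift API:

```python
# Sketch: evaluate ClusterOperator-style conditions like those quoted above.
# The "healthy" rule here is an informal reading, not an official definition.

def operator_healthy(conditions):
    """Return True when Available is "True" and Degraded is not "True"."""
    status = {c["type"]: c["status"] for c in conditions}
    return status.get("Available") == "True" and status.get("Degraded") != "True"

# Conditions as reported by the kube-apiserver operator in this run.
conditions = [
    {"type": "Degraded", "status": "False"},
    {"type": "Progressing", "status": "False"},
    {"type": "Available", "status": "True"},
]

print(operator_healthy(conditions))  # True
```

Note that by this rule the operator looks healthy even while individual connections to the apiserver are being refused, which is exactly the gap the report describes.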
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1776402 ?

Adding a dependency on bug #1775878. The timeout issues noted in the logs appear to overlap. Not positive, but I wanted to draw the link.

This is targeted at 4.3.z but has no 4.4 or 4.5 clone; what is the current thinking around the impact of this bug?

The number of leader elections in this test is excessive (10) [1], but the disk I/O metrics do not seem to be the cause. In 4.3, etcd usually has one leader election per CI run, so this is a big deal.

More recent runs appear more sane [2], showing a single leader election. My thinking is that these failures were seen during the LUKS RHCOS problems, where the signature was leader elections while the fsync metrics appeared fine; the reported date of 2019-12-04 lines up with this. Closing.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/871

The needinfo request(s) on this closed bug have been removed as they have been unresolved for 1000 days.
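As an aside, the "count the leader elections in a run" check above can be sketched as a simple log scan. The "elected leader" phrase and the threshold below are illustrative assumptions (real etcd log wording varies by version), not the triage tooling actually used here:

```python
# Sketch: count etcd leader elections seen in a log stream and flag runs
# that exceed the expected baseline of one election per CI run.
# The log phrasing is hypothetical; adapt it to the etcd version in use.

def count_leader_elections(log_lines):
    """Count lines that record a leader being elected."""
    return sum(1 for line in log_lines if "elected leader" in line)

sample_logs = [
    "raft: abc123 became candidate at term 2",
    "raft: abc123 elected leader abc123 at term 2",
    "raft: def456 became candidate at term 3",
    "raft: def456 elected leader def456 at term 3",
]

elections = count_leader_elections(sample_logs)
print(elections)                 # 2
print(elections > 1)             # True: more than the usual one per run
```

In practice the same signal is available from the `etcd_server_leader_changes_seen_total` Prometheus metric, which avoids depending on log wording at all.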