Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1779810

Summary: Flaky kube-apiserver causing operators to take time to become Available on AWS OVN 4.3
Product: OpenShift Container Platform
Component: Etcd
Version: 4.3.0
Reporter: Jonathan Lebon <jlebon>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
CC: aos-bugs, bparees, deads, gblomqui, jokerman, mfojtik, sbatsche, scuppett, wking
Target Milestone: ---
Target Release: 4.3.z
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-04-30 01:51:14 UTC
Bug Depends On: 1775878

Description Jonathan Lebon 2019-12-04 19:04:57 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12

We saw the installer fail with this:

```
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-stqltndy-eb682.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Working towards 4.3.0-0.nightly-2019-12-04-054458: 100% complete, waiting on authentication"
2019/12/04 06:44:10 Container setup in pod e2e-aws-ovn failed, exit code 1, reason Error
Another process exited
```
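For anyone retracing this kind of failure, here is a minimal sketch of how one might see which operator the installer is stuck on. The kubeconfig path is the installer's usual `auth/` output location, and `authentication` is simply the operator named in the error above; adjust both to the actual install:

```
# Point oc at the cluster the installer was driving (installer's default
# auth output path; adjust to the install dir actually used).
export KUBECONFIG=./auth/kubeconfig

# Overview of every cluster operator's Available/Progressing/Degraded state.
oc get clusteroperators

# Drill into the operator the installer reported waiting on.
oc get clusteroperator authentication \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'
```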

Investigating, wking found:

Server seems to have been flaky:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep -c 'connection refused'
188
```

even as late as ~5 min before the timeout:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'connection refused' | tail -n2
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499730       1 memcache.go:189] couldn't get current server API group list: Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499793       1 task.go:77] error running apply for servicemonitor "openshift-insights/insights-operator" (335 of 492): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(monitoring.coreos.com/v1, Kind=ServiceMonitor): Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
```
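To see how the flakiness was distributed over the run, rather than just the last occurrence, a rough sketch that buckets the same log by minute (`LOG_URL` stands for the full current.log URL used in the two commands above):

```
# Count 'connection refused' errors per minute; the timestamp prefix is
# YYYY-MM-DDTHH:MM, i.e. the first 16 characters of each line.
curl -s "$LOG_URL" | grep 'connection refused' | cut -c1-16 | sort | uniq -c
```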

Comment 1 W. Trevor King 2019-12-04 19:11:55 UTC
When those 06:38 'connection refused' errors happened, the kube-apiserver operator was reporting that things were fine [1]:

```
  - lastTransitionTime: "2019-12-04T06:36:26Z"
    message: 'NodeControllerDegraded: All master node(s) are ready'
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-04T06:42:06Z"
    message: 'Progressing: 3 nodes are at revision 5'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-04T06:12:36Z"
    message: 'Available: 3 nodes are active; 3 nodes are at revision 5'
    reason: AsExpected
    status: "True"
    type: Available
```
 
Hmm.  Actually, it had been reporting Available=True for a long time and Degraded=False for over 1m30s.  But Progressing only transitioned to False at 06:42:06, so it may still have been Progressing=True at 06:38; not sure if that's significant.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/cluster-scoped-resources/config.openshift.io/clusteroperators/kube-apiserver.yaml
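As a side note, a quick way to dump those three conditions from a live cluster (a sketch assuming `jq` is available; against the must-gather dump you would read the YAML linked above instead):

```
# Print type=status plus the transition time and message for each condition.
oc get clusteroperator kube-apiserver -o json \
  | jq -r '.status.conditions[] | "\(.type)=\(.status) since \(.lastTransitionTime): \(.message)"'
```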

Comment 10 Greg Blomquist 2019-12-10 15:36:15 UTC
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1776402 ?

Comment 11 Greg Blomquist 2019-12-11 19:21:38 UTC
Adding dependency on bug #1775878

The timeout issues noted in the logs appear to overlap.  Not positive, but wanted to draw the link.

Comment 13 Ben Parees 2020-04-29 20:40:32 UTC
This is targeted at 4.3.z but has no 4.4 or 4.5 clone; what is the current thinking around the impact of this bug?

Comment 14 Sam Batschelet 2020-04-30 01:51:14 UTC
The number of leader elections in this test is excessive (10)[1], but disk I/O metrics do not seem to be the cause. In 4.3, etcd usually has a single leader election for a CI run, so this is a big deal. More recent runs appear saner[2], showing 1 leader election. My thinking is that these failures were seen during the LUKS RHCOS problems, where the signature was excess leader elections while the fsync metrics appeared fine; the reported data from 2019-12-04 lines up with this. Closing.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/871
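For reference, a sketch of how one could check those two etcd signals against a live cluster's Prometheus. The metric names are the standard etcd ones, the route/token lookup follows the usual openshift-monitoring conventions, and the ~10ms fsync guideline is etcd's general tuning rule of thumb rather than a number from this bug:

```
# Locate the in-cluster Prometheus and grab a bearer token (sketch; assumes
# the standard openshift-monitoring route and a logged-in oc session).
PROM="https://$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')"
TOKEN="$(oc whoami -t)"

# Leader elections seen per etcd member over the run; a healthy 4.3 CI run shows ~1.
curl -skG -H "Authorization: Bearer $TOKEN" "$PROM/api/v1/query" \
  --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])'

# p99 WAL fsync latency; sustained values above ~10ms would point at slow disks.
curl -skG -H "Authorization: Bearer $TOKEN" "$PROM/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
```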

Comment 15 Red Hat Bugzilla 2023-09-14 05:48:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.