Bug 1779810 - Flaky kube-apiserver causing operators to take time to become Available on AWS OVN 4.3
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.z
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1775878
Blocks:
 
Reported: 2019-12-04 19:04 UTC by Jonathan Lebon
Modified: 2023-09-14 05:48 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-04-30 01:51:14 UTC
Target Upstream Version:
Embargoed:



Description Jonathan Lebon 2019-12-04 19:04:57 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12

We saw the installer fail with this:

```
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 30m0s for the cluster at https://api.ci-op-stqltndy-eb682.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
level=info msg="Cluster operator insights Disabled is False with : "
level=fatal msg="failed to initialize the cluster: Working towards 4.3.0-0.nightly-2019-12-04-054458: 100% complete, waiting on authentication"
2019/12/04 06:44:10 Container setup in pod e2e-aws-ovn failed, exit code 1, reason Error
Another process exited
```

Investigating, wking found:

The kube-apiserver seems to have been flaky:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep -c 'connection refused'
188

even as late as ~5 min before the timeout:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'connection refused' | tail -n2
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499730       1 memcache.go:189] couldn't get current server API group list: Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
2019-12-04T06:38:03.499930732Z E1204 06:38:03.499793       1 task.go:77] error running apply for servicemonitor "openshift-insights/insights-operator" (335 of 492): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(monitoring.coreos.com/v1, Kind=ServiceMonitor): Get https://127.0.0.1:6443/api?timeout=32s: dial tcp 127.0.0.1:6443: connect: connection refused
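
For a rough picture of how long the flakiness persisted, the same log can be bucketed by minute with something like the following (a sketch; cut -c1-16 just keeps the YYYY-MM-DDTHH:MM prefix of each line):

# Count 'connection refused' errors per minute in the CVO log
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/namespaces/openshift-cluster-version/pods/cluster-version-operator-666f64dd9c-9pgr5/cluster-version-operator/cluster-version-operator/logs/current.log \
    | grep 'connection refused' | cut -c1-16 | sort | uniq -c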

Comment 1 W. Trevor King 2019-12-04 19:11:55 UTC
When those 06:38 'connection refused' errors happened, the kube-apiserver operator was reporting that things were fine [1]:

  - lastTransitionTime: "2019-12-04T06:36:26Z"
    message: 'NodeControllerDegraded: All master node(s) are ready'
    reason: AsExpected
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-12-04T06:42:06Z"
    message: 'Progressing: 3 nodes are at revision 5'
    reason: AsExpected
    status: "False"
    type: Progressing
  - lastTransitionTime: "2019-12-04T06:12:36Z"
    message: 'Available: 3 nodes are active; 3 nodes are at revision 5'
    reason: AsExpected
    status: "True"
    type: Available
 
Hmm.  Actually, it had been reporting Available=True for a long time, and Degraded=False for over 1m30s.  But it may still have been Progressing=True at 06:38 (that condition only went to False at 06:42:06); not sure if that's significant.

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12/artifacts/e2e-aws/must-gather/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-13205e1b645979118b3ba5f60cb1a4b3c73e43a4745340e30dc441d13f646851/cluster-scoped-resources/config.openshift.io/clusteroperators/kube-apiserver.yaml
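
For reference, the same conditions can be pulled from a live cluster with something like the following (a sketch, assuming a logged-in oc session; here they were read from the must-gather YAML linked above):

# Print type/status/reason for each kube-apiserver clusteroperator condition
$ oc get clusteroperator kube-apiserver \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'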

Comment 10 Greg Blomquist 2019-12-10 15:36:15 UTC
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=1776402 ?

Comment 11 Greg Blomquist 2019-12-11 19:21:38 UTC
Adding dependency on bug #1775878

The timeout issues noted in the logs appear to overlap.  I'm not positive, but I wanted to draw the link.

Comment 13 Ben Parees 2020-04-29 20:40:32 UTC
This is targeted at 4.3.z but has no 4.4 or 4.5 clone; what is the current thinking around the impact of this bug?

Comment 14 Sam Batschelet 2020-04-30 01:51:14 UTC
The number of leader elections in this test is excessive (10)[1], but the disk I/O metrics do not seem to be the cause. In 4.3, etcd usually has 1 leader election for a CI run, so this is a big deal. More recent runs appear more sane[2], showing 1 leader election. My thinking is that these failures were seen during the LUKS RHCOS problems, where the signature was excess leader elections while the fsync metrics appeared fine; the reported date of 2019-12-04 lines up with this. Closing.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/12
[2] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.3/871
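
For reference, the leader-election and fsync signals mentioned above map to standard etcd metrics (etcd_server_leader_changes_seen_total and etcd_disk_wal_fsync_duration_seconds). A minimal sketch for querying them against the in-cluster Prometheus; the prometheus-k8s-0 pod/container names and the availability of curl inside that container are assumptions that may vary by release:

# How many leader elections happened over the last hour?
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s --data-urlencode 'query=increase(etcd_server_leader_changes_seen_total[1h])' \
    http://localhost:9090/api/v1/query
# p99 WAL fsync latency; sustained values above ~10ms usually point at slow disks
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -s --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))' \
    http://localhost:9090/api/v1/query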

Comment 15 Red Hat Bugzilla 2023-09-14 05:48:08 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

