Bug 1853889 - [ovirt] test case "Managed cluster should have no crashlooping pods in core namespaces over four minutes" 100% failure
Summary: [ovirt] test case "Managed cluster should have no crashlooping pods in core n...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Gal Zaidman
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks: 1858498
 
Reported: 2020-07-05 09:38 UTC by Gal Zaidman
Modified: 2020-10-27 16:12 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:12:20 UTC
Target Upstream Version:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift baremetal-runtimecfg pull 71 0 None closed BUG 1853889: Move haproxy port to 9445 due to conflict with KCM 2020-12-18 11:06:14 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:12:53 UTC

Description Gal Zaidman 2020-07-05 09:38:01 UTC
Description of problem:

On the oVirt CI we see the test case:
"Managed cluster should have no crashlooping pods in core namespaces over four minutes" failing 100% of the time because the 'kube-controller-manager-recovery-controller' container is crash looping:

"

fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
    <[]string | len:3, cap:4>: [
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-2 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-2_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-0 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-0_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-1 is not healthy: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
    ]
to be empty
"

In the logs we see:

back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller

"
    {
      "name": "kube-controller-manager-recovery-controller",
      "state": {
        "waiting": {
          "reason": "CrashLoopBackOff",
          "message": "back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)"
        }
      },
      "lastState": {
        "terminated": {
          "exitCode": 124,
          "reason": "Error",
          "message": "                [::ffff:192.168.216.1]:35060               timer:(timewait,44sec,0)\nESTAB      0      0        [::1]:9443                 [::1]:48768              \nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:35090               timer:(timewait,45sec,0)\nESTAB      0      0        [::1]:9443                 [::1]:49620              \nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34220               timer:(timewait,24sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34010               timer:(timewait,6.938ms,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:35496               timer:(timewait,58sec,0)\nESTAB      0      0        [::1]:9443                 [::1]:56764              \nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34750               timer:(timewait,37sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34638               timer:(timewait,35sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34530               timer:(timewait,31sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34142               timer:(timewait,14sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:33950               timer:(timewait,4.753ms,0)\nESTAB      0      0        [::1]:9443                 [::1]:50332              \nESTAB      0      0        [::1]:9443                 [::1]:42128              \nESTAB      0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34778              \nESTAB      0      0        [::1]:9443                 [::1]:54912              \nESTAB      0      0        [::1]:9443                 [::1]:49762              ' ']'\n+ sleep 1\n",
          "startedAt": "2020-07-04T20:17:16Z",
          "finishedAt": "2020-07-04T20:20:16Z",
          "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca"
        }
      },
      "ready": false,
      "restartCount": 6,
      "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
      "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
      "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca",
      "started": false
    }
"

We started seeing failures due to this from 26-6 (June 26), around the time this commit was merged:
https://github.com/openshift/cluster-kube-controller-manager-operator/commit/88dc303df2fd687540f9d80d3a2b32561fb22eb4


You can see this in any oVirt run since 26-6, for example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1279687231723802624

Comment 1 Gal Zaidman 2020-07-05 09:39:58 UTC
I know that there is an open bug on that test case [1], but I believe the failure here has a different cause.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1842002

Comment 3 Gal Zaidman 2020-07-06 09:35:22 UTC
This is just a wild guess because I don't know the code, but I see that the other containers define:
ports.containerPort
startupProbe
livenessProbe
readinessProbe
while kube-controller-manager-recovery-controller doesn't.
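
If it helps, the per-container spec can be compared directly. A small diagnostic sketch (the pod name is taken from the failure output above; jq is assumed to be available):

# Show, for each container in the KCM static pod, its ports and whether
# probes are defined.
oc get pod -n openshift-kube-controller-manager \
    kube-controller-manager-ovirt16-4f6vr-master-0 -o json \
  | jq '.spec.containers[]
      | {name, ports, hasLiveness: has("livenessProbe"),
         hasReadiness: has("readinessProbe"), hasStartup: has("startupProbe")}'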

Comment 4 Maciej Szulik 2020-07-06 11:09:44 UTC
This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389 and backports to older versions are on the way.

*** This bug has been marked as a duplicate of bug 1851389 ***

Comment 5 Gal Zaidman 2020-07-06 12:07:10 UTC
(In reply to Maciej Szulik from comment #4)
> This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389
> and backports to older versions are on the way.
> 
> *** This bug has been marked as a duplicate of bug 1851389 ***

I think this failure is caused by the fix [2] for bug [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1851389
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421

Comment 6 Maciej Szulik 2020-07-06 15:32:37 UTC
Both are needed and both are in-progress. The de-duplication still makes sense.

*** This bug has been marked as a duplicate of bug 1851389 ***

Comment 7 Gal Zaidman 2020-07-06 15:34:57 UTC
Sorry, I edited the fields and didn't see that you had closed it again.

*** This bug has been marked as a duplicate of bug 1851389 ***

Comment 8 Gal Zaidman 2020-07-07 07:23:56 UTC
Reopening this after a talk with Maciej Szulik.
The new suspect is the combination of [1] and [2]:
PR [1] added port-availability checking logic to the recovery-controller.
PR [2] changed the HAProxy port to 9443, due to yet another port conflict.
Together, the two PRs cause kube-controller-manager-recovery-controller to crash loop.

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421
[2] https://github.com/openshift/baremetal-runtimecfg/pull/59
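
The conflict should be easy to confirm on an affected master by looking at what holds port 9443. A diagnostic sketch (node name taken from the failure output above):

# Open a debug shell on the master and list port 9443 sockets with their
# owning processes.
oc debug node/ovirt16-4f6vr-master-0 -- chroot /host ss -tanp '( sport = :9443 )'
# On affected clusters this should show haproxy holding 9443, which is the
# port the recovery-controller waits for; the eventual baremetal-runtimecfg
# fix (PR 71 in the links above) moves haproxy to 9445 to resolve the conflict.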

Comment 9 Maciej Szulik 2020-07-07 08:37:22 UTC
Yeah, the links that Gal pointed to in the previous comment are the main reason this is failing consistently.
I wonder why this popped up only now, when cluster-policy-controller has been using port 9443 since at least version 4.3.
I'm moving this to the oVirt team to fix it.

Comment 10 Maciej Szulik 2020-07-07 08:40:32 UTC
My bad, it's the recovery controller that is using 9443, not the cluster-policy-controller (cpc).

Comment 14 Gal Zaidman 2020-07-20 15:08:28 UTC
Verified with CI run results

Comment 15 Sonigra Saurab 2020-08-24 06:19:20 UTC
Hello Team,

Will the fix for this issue be backported to 4.4?

Comment 17 errata-xmlrpc 2020-10-27 16:12:20 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

