Bug 1853889

Summary: [ovirt] test case "Managed cluster should have no crashlooping pods in core namespaces over four minutes" 100% failure
Product: OpenShift Container Platform
Component: Installer
Installer sub component: OpenShift on RHV
Reporter: Gal Zaidman <gzaidman>
Assignee: Gal Zaidman <gzaidman>
QA Contact: Lucie Leistnerova <lleistne>
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
CC: aos-bugs, hpopal, maszulik, mfojtik, ssonigra, wking, xtian
Version: 4.4
Keywords: Reopened
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-10-27 16:12:20 UTC
Bug Blocks: 1858498

Description Gal Zaidman 2020-07-05 09:38:01 UTC
Description of problem:

On ovirt CI we see the test case:
"Managed cluster should have no crashlooping pods in core namespaces over four minutes" failing 100% of the times due to 'kube-controller-manager-recovery-controller' crash looping:

"

fail [github.com/openshift/origin/test/extended/operators/cluster.go:115]: Expected
    <[]string | len:3, cap:4>: [
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-2 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-2_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-0 is not healthy: back-off 2m40s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-0_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
        "Pod openshift-kube-controller-manager/kube-controller-manager-ovirt16-4f6vr-master-1 is not healthy: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)",
    ]
to be empty
"

In the logs we see:

back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller

"
    {
      "name": "kube-controller-manager-recovery-controller",
      "state": {
        "waiting": {
          "reason": "CrashLoopBackOff",
          "message": "back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-ovirt16-4f6vr-master-1_openshift-kube-controller-manager(dd1b3f8e9a8c376ad2f3815f0b73a67b)"
        }
      },
      "lastState": {
        "terminated": {
          "exitCode": 124,
          "reason": "Error",
          "message": "                [::ffff:192.168.216.1]:35060               timer:(timewait,44sec,0)\nESTAB      0      0        [::1]:9443                 [::1]:48768              \nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:35090               timer:(timewait,45sec,0)\nESTAB      0      0        [::1]:9443                 [::1]:49620              \nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34220               timer:(timewait,24sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34010               timer:(timewait,6.938ms,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:35496               timer:(timewait,58sec,0)\nESTAB      0      0        [::1]:9443                 [::1]:56764              \nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34750               timer:(timewait,37sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34638               timer:(timewait,35sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34530               timer:(timewait,31sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34142               timer:(timewait,14sec,0)\nTIME-WAIT  0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:33950               timer:(timewait,4.753ms,0)\nESTAB      0      0        [::1]:9443                 [::1]:50332              \nESTAB      0      0        [::1]:9443                 [::1]:42128              \nESTAB      0      0         [::ffff:192.168.216.111]:9443                [::ffff:192.168.216.1]:34778              \nESTAB      0      0        [::1]:9443                 [::1]:54912              \nESTAB      0      0        [::1]:9443                 [::1]:49762              ' ']'\n+ sleep 1\n",
          "startedAt": "2020-07-04T20:17:16Z",
          "finishedAt": "2020-07-04T20:20:16Z",
          "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca"
        }
      },
      "ready": false,
      "restartCount": 6,
      "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
      "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:eebf5267f7a72f62c2ff0addf89b986c2a21699ddb044334d0f53feb11a6fa84",
      "containerID": "cri-o://f26d1743479c119c8f7f8352f911f18def3c63fde5b418d3d920517238015cca",
      "started": false
    }
"

We started seeing these failures on 26-6, around the time this commit was merged:
https://github.com/openshift/cluster-kube-controller-manager-operator/commit/88dc303df2fd687540f9d80d3a2b32561fb22eb4


You can see this in any ovirt run since 26-6, for example:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-ovirt-4.6/1279687231723802624

Comment 1 Gal Zaidman 2020-07-05 09:39:58 UTC
I know there is an open bug on that test case [1], but I believe this failure has a different cause.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1842002

Comment 3 Gal Zaidman 2020-07-06 09:35:22 UTC
This is just a wild guess because I don't know the code, but I see that the other pods have:
ports.containerPort
startupProbe
livenessProbe
readinessProbe
while kube-controller-manager-recovery-controller doesn't (a way to check this is sketched below).
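
A hypothetical way to compare this across the containers of the pod (assuming oc access to the cluster and jq installed; the pod name is copied from the log above):

  oc -n openshift-kube-controller-manager get pod \
      kube-controller-manager-ovirt16-4f6vr-master-1 -o json \
    | jq '.spec.containers[] | {name,
          livenessProbe:  (.livenessProbe  != null),
          readinessProbe: (.readinessProbe != null),
          startupProbe:   (.startupProbe   != null),
          containerPorts: [.ports[]?.containerPort]}'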

Comment 4 Maciej Szulik 2020-07-06 11:09:44 UTC
This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389 and backports to older versions are on the way.

*** This bug has been marked as a duplicate of bug 1851389 ***

Comment 5 Gal Zaidman 2020-07-06 12:07:10 UTC
(In reply to Maciej Szulik from comment #4)
> This is being handled in https://bugzilla.redhat.com/show_bug.cgi?id=1851389
> and backports to older versions are on the way.
> 
> *** This bug has been marked as a duplicate of bug 1851389 ***

I think this failure is caused by the fix [2] for bug [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1851389
[2] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421

Comment 6 Maciej Szulik 2020-07-06 15:32:37 UTC
Both fixes are needed and both are in progress. Marking this as a duplicate still makes sense.

*** This bug has been marked as a duplicate of bug 1851389 ***

Comment 7 Gal Zaidman 2020-07-06 15:34:57 UTC
Sorry, I edited the fields and didn't see that you had closed it again.

*** This bug has been marked as a duplicate of bug 1851389 ***

Comment 8 Gal Zaidman 2020-07-07 07:23:56 UTC
Reopening this after a talk with Maciej Szulik.
The new suspect is the combination of [1] and [2]:
PR [1] added logic for checking port availability to the recovery-controller.
PR [2] changed the HAProxy port to 9443, due to yet another port conflict.
Together, the two PRs cause kube-controller-manager-recovery-controller to crash loop (see the sketch after the links below).

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/421
[2] https://github.com/openshift/baremetal-runtimecfg/pull/59
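
A minimal sketch of the suspected failure mode (illustrative only, the real script from PR [1] may differ in detail): the recovery controller waits for port 9443 to become free, but after PR [2] HAProxy permanently owns that port, so the wait never ends and the surrounding timeout kills the container. Note that the container status above shows a run of exactly three minutes (20:17:16 to 20:20:16) ending in exit code 124, which is what timeout(1) returns when it kills its command.

  # Hypothetical port-wait loop; the 3m timeout is a guess based on the
  # observed three-minute run time.
  timeout 3m bash -c '
    # Something (now HAProxy, after PR [2]) keeps listening on 9443, so the
    # loop never exits; timeout kills it and reports exit code 124.
    while ss -tln | grep -q ":9443 "; do
      ss -tn    # connection dumps like the one captured in the log message
      sleep 1
    done
  '

Every restart of the container then hits the same timeout, which produces the CrashLoopBackOff reported by the test.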

Comment 9 Maciej Szulik 2020-07-07 08:37:22 UTC
Yeah, the links that Gal pointed to in the previous comment are the main reason this is failing consistently.
I wonder why this only popped up now, when cluster-policy-controller has been using port 9443 since at least version 4.3. I'm moving this to the oVirt team to fix it.

Comment 10 Maciej Szulik 2020-07-07 08:40:32 UTC
My bad, it's the recovery controller that is using 9443, not cpc.

Comment 14 Gal Zaidman 2020-07-20 15:08:28 UTC
Verified with CI run results.

Comment 15 Sonigra Saurab 2020-08-24 06:19:20 UTC
Hello team,

Will the fix for this issue be backported to 4.4?

Comment 17 errata-xmlrpc 2020-10-27 16:12:20 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196