Bug 1753040 - Operator fails to report itself degraded in certain cases
Summary: Operator fails to report itself degraded in certain cases
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.3.0
Assignee: Abhinav Dahiya
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-17 22:02 UTC by Alex Crawford
Modified: 2019-09-30 18:54 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-30 18:54:24 UTC
Target Upstream Version:
Embargoed:



Description Alex Crawford 2019-09-17 22:02:48 UTC
Description of problem: During an upgrade, the storage operator reported that it was available, not progressing, and not degraded, yet it was still reporting the previous version number. After digging into the issue, I noticed that one of the operator pods was in a failing state:

    $ oc get pods -n openshift-cluster-storage-operator
    NAME                                        READY   STATUS                     RESTARTS   AGE
    cluster-storage-operator-58d5595667-98nr8   0/1     UnexpectedAdmissionError   0          97m
    cluster-storage-operator-644f8ff4b9-rqd7q   1/1     Running                    0          62m

    $ oc describe pod -n openshift-cluster-storage-operator cluster-storage-operator-58d5595667-98nr8
    Name:               cluster-storage-operator-58d5595667-98nr8
    Namespace:          openshift-cluster-storage-operator
    Priority:           2000000000
    PriorityClassName:  system-cluster-critical
    Node:               crawford-libvirt-xqscg-master-0/
    Start Time:         Tue, 17 Sep 2019 12:52:56 -0700
    Labels:             name=cluster-storage-operator
                        pod-template-hash=58d5595667
    Annotations:        openshift.io/scc: restricted
    Status:             Failed
    Reason:             UnexpectedAdmissionError
    Message:            Pod Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: memory, q: 11067392), ]
    IP:
    Controlled By:      ReplicaSet/cluster-storage-operator-58d5595667
    Containers:
      <snip>
    Events:
      Type     Reason                    Age                  From                                      Message
      ----     ------                    ----                 ----                                      -------
      Normal   Scheduled                 112m                 default-scheduler                         Successfully assigned openshift-cluster-storage-operator/cluster-storage-operator-58d5595667-98nr8 to crawford-libvirt-xqscg-master-0
      Normal   Pulling                   112m                 kubelet, crawford-libvirt-xqscg-master-0  Pulling image "registry.svc.ci.openshift.org/ocp/4.2-2019-09-16-131750@sha256:86faaa55d764559af238ebcf914df0c99dfdd4e606ba2e6ad7d771836e1332c9"
      Normal   Pulled                    111m                 kubelet, crawford-libvirt-xqscg-master-0  Successfully pulled image "registry.svc.ci.openshift.org/ocp/4.2-2019-09-16-131750@sha256:86faaa55d764559af238ebcf914df0c99dfdd4e606ba2e6ad7d771836e1332c9"
      Normal   Created                   111m (x2 over 111m)  kubelet, crawford-libvirt-xqscg-master-0  Created container cluster-storage-operator
      Normal   Started                   111m (x2 over 111m)  kubelet, crawford-libvirt-xqscg-master-0  Started container cluster-storage-operator
      Normal   Pulled                    111m                 kubelet, crawford-libvirt-xqscg-master-0  Container image "registry.svc.ci.openshift.org/ocp/4.2-2019-09-16-131750@sha256:86faaa55d764559af238ebcf914df0c99dfdd4e606ba2e6ad7d771836e1332c9" already present on machine
      Warning  UnexpectedAdmissionError  104m                 kubelet, crawford-libvirt-xqscg-master-0  Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: memory, q: 11067392), ]

Looking at the replica sets showed that this pod shouldn't have been scheduled anyway:

    oc get rs -n openshift-cluster-storage-operator
    NAME                                  DESIRED   CURRENT   READY   AGE
    cluster-storage-operator-58d5595667   0         0         0       4h23m
    cluster-storage-operator-644f8ff4b9   1         1         1       77m
    cluster-storage-operator-66d6c59d85   0         0         0       25h
    cluster-storage-operator-6b9bb58499   0         0         0       4d

After deleting the pod, the operator then started reporting the latest version.
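
For reference, the cleanup was roughly the following; the ClusterOperator name "storage" is assumed here:

    $ oc delete pod -n openshift-cluster-storage-operator cluster-storage-operator-58d5595667-98nr8
    # Once the deployment settles, the reported version should catch up:
    $ oc get clusteroperator storage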

Version-Release number of selected component (if applicable): 4.2.0-0.ci-2019-09-16-131750

How reproducible: Unsure

Steps to Reproduce: Unknown

Actual results: Operator reported no issues.

Expected results: Operator should have reported that it was waiting on something before it could transition; eventually marking itself degraded.

Comment 1 Jan Safranek 2019-09-18 08:21:20 UTC
It looks like all masters are full and have no free memory left to run the updated cluster-storage-operator.

> Expected results: Operator should have reported that it was waiting on something before it could transition; eventually marking itself degraded.

In this case, operator = the pod that failed to run. It can't mark itself as degraded, because it's not running. In addition, the old version of the operator is still running and working correctly. It is not degraded, it's just old. CVO should capture that somehow.
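
For anyone trying to confirm this state, comparing the version and conditions that the ClusterOperator reports should show the mismatch; the ClusterOperator name "storage" is assumed here:

    # Version the operator claims vs. its Available/Progressing/Degraded conditions
    $ oc get clusteroperator storage
    # Full detail, including .status.versions and .status.conditions
    $ oc get clusteroperator storage -o yaml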

For the CVO folks: cluster-storage-operator has this deployment: https://github.com/openshift/cluster-storage-operator/blob/ba0e3ad4d0de561ead106c4967b8bd818e84c539/manifests/02-deployment.yaml. CVO updates it, but it does not wait until the rolling update succeeds. A failed rolling update should be more visible to users somehow.
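
A rough way to see whether the rollout actually completed (i.e. what CVO would need to wait on) is something like:

    # Blocks until the deployment's rolling update finishes or fails
    $ oc rollout status -n openshift-cluster-storage-operator deployment/cluster-storage-operator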

[Moving off the 4.2 blocker list: we don't support libvirt, and the documented minimum is 16 GB of memory on masters.]

Comment 2 Scott Dodson 2019-09-30 18:54:24 UTC
This shouldn't happen if the control plane hosts were provisioned with the required 16 GiB of memory.
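
For anyone hitting this, one way to sanity-check master memory capacity is something like:

    # List master nodes with their memory capacity
    $ oc get nodes -l node-role.kubernetes.io/master= \
        -o custom-columns=NAME:.metadata.name,MEMORY:.status.capacity.memory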

