Description of problem:

During an upgrade, the storage operator reported that it was available, not progressing, and not degraded, yet it still reported the previous version number. After digging into the issue, I noticed that one of the operator pods was in a failing state:

$ oc get pods -n openshift-cluster-storage-operator
NAME                                        READY   STATUS                     RESTARTS   AGE
cluster-storage-operator-58d5595667-98nr8   0/1     UnexpectedAdmissionError   0          97m
cluster-storage-operator-644f8ff4b9-rqd7q   1/1     Running                    0          62m

$ oc describe pod -n openshift-cluster-storage-operator cluster-storage-operator-58d5595667-98nr8
Name:               cluster-storage-operator-58d5595667-98nr8
Namespace:          openshift-cluster-storage-operator
Priority:           2000000000
PriorityClassName:  system-cluster-critical
Node:               crawford-libvirt-xqscg-master-0/
Start Time:         Tue, 17 Sep 2019 12:52:56 -0700
Labels:             name=cluster-storage-operator
                    pod-template-hash=58d5595667
Annotations:        openshift.io/scc: restricted
Status:             Failed
Reason:             UnexpectedAdmissionError
Message:            Pod Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: memory, q: 11067392), ]
IP:
Controlled By:      ReplicaSet/cluster-storage-operator-58d5595667
Containers:
<snip>
Events:
  Type     Reason                    Age                  From                                       Message
  ----     ------                    ----                 ----                                       -------
  Normal   Scheduled                 112m                 default-scheduler                          Successfully assigned openshift-cluster-storage-operator/cluster-storage-operator-58d5595667-98nr8 to crawford-libvirt-xqscg-master-0
  Normal   Pulling                   112m                 kubelet, crawford-libvirt-xqscg-master-0   Pulling image "registry.svc.ci.openshift.org/ocp/4.2-2019-09-16-131750@sha256:86faaa55d764559af238ebcf914df0c99dfdd4e606ba2e6ad7d771836e1332c9"
  Normal   Pulled                    111m                 kubelet, crawford-libvirt-xqscg-master-0   Successfully pulled image "registry.svc.ci.openshift.org/ocp/4.2-2019-09-16-131750@sha256:86faaa55d764559af238ebcf914df0c99dfdd4e606ba2e6ad7d771836e1332c9"
  Normal   Created                   111m (x2 over 111m)  kubelet, crawford-libvirt-xqscg-master-0   Created container cluster-storage-operator
  Normal   Started                   111m (x2 over 111m)  kubelet, crawford-libvirt-xqscg-master-0   Started container cluster-storage-operator
  Normal   Pulled                    111m                 kubelet, crawford-libvirt-xqscg-master-0   Container image "registry.svc.ci.openshift.org/ocp/4.2-2019-09-16-131750@sha256:86faaa55d764559af238ebcf914df0c99dfdd4e606ba2e6ad7d771836e1332c9" already present on machine
  Warning  UnexpectedAdmissionError  104m                 kubelet, crawford-libvirt-xqscg-master-0   Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: memory, q: 11067392), ]

Looking at the replica sets showed that this pod shouldn't have been scheduled anyway:

$ oc get rs -n openshift-cluster-storage-operator
NAME                                  DESIRED   CURRENT   READY   AGE
cluster-storage-operator-58d5595667   0         0         0       4h23m
cluster-storage-operator-644f8ff4b9   1         1         1       77m
cluster-storage-operator-66d6c59d85   0         0         0       25h
cluster-storage-operator-6b9bb58499   0         0         0       4d

After deleting the failed pod, the operator started reporting the latest version.

Version-Release number of selected component (if applicable):
4.2.0-0.ci-2019-09-16-131750

How reproducible:
Unsure

Steps to Reproduce:
Unknown

Actual results:
The operator reported no issues.

Expected results:
The operator should have reported that it was waiting on something before it could transition, eventually marking itself degraded.
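For anyone hitting the same symptom, the workaround amounts to the following minimal sketch. It assumes the failed pod name from above and that the storage ClusterOperator object is named "storage"; verify the name with "oc get clusteroperators" first.

$ # Remove the orphaned pod; its ReplicaSet is already scaled to 0, so nothing recreates it
$ oc delete pod -n openshift-cluster-storage-operator cluster-storage-operator-58d5595667-98nr8

$ # Confirm the operator now reports the expected version and conditions
$ oc get clusteroperator storage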
It looks like all the masters are full and have no memory left to run the updated cluster-storage-operator.

> Expected results: Operator should have reported that it was waiting on something before it could transition; eventually marking itself degraded.

In this case, "operator" = the pod that failed to run. It can't mark itself as degraded, because it's not running. In addition, the old version of the operator is still running and working correctly; it is not degraded, it's just old. The CVO should capture that somehow.

For the CVO folks: cluster-storage-operator has this deployment: https://github.com/openshift/cluster-storage-operator/blob/ba0e3ad4d0de561ead106c4967b8bd818e84c539/manifests/02-deployment.yaml. The CVO updates it, but it does not wait until the rolling update succeeds. A failed rolling update should be more visible to users somehow.

[Moving off the 4.2 blocker list; we don't support libvirt, and the documented minimum is 16 GB of memory on masters.]
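A stuck rollout is already visible on the Deployment itself. As a sketch of what the CVO (or a user) could check before treating the update as applied, using the namespace and name from the manifest linked above:

$ # Waits until the Deployment's new ReplicaSet is fully rolled out; exits non-zero if the timeout passes first
$ oc rollout status deployment/cluster-storage-operator -n openshift-cluster-storage-operator --timeout=300s

$ # The same signal from the Deployment's Progressing condition (the reason typically becomes ProgressDeadlineExceeded on a stuck rollout)
$ oc get deployment cluster-storage-operator -n openshift-cluster-storage-operator -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}'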
This shouldn't happen if the control plane hosts were provisioned with the required 16 GiB of memory.
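A quick way to confirm that on a live cluster (a sketch, assuming the masters carry the usual node-role.kubernetes.io/master label):

$ # Compare each master's total and allocatable memory against the documented 16 GiB minimum
$ oc get nodes -l node-role.kubernetes.io/master -o custom-columns=NAME:.metadata.name,CAPACITY:.status.capacity.memory,ALLOCATABLE:.status.allocatable.memory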