Tomas and Vadim noticed that, when a Deployment manifest leaves 'replicas' unset, the CVO ignores the property. This means that cluster admins can scale those Deployments up or, worse, down to 0, and the CVO will happily continue on without stomping them.

Auditing 4.9.10:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.9.10-x86_64
  Extracted release payload from digest sha256:e1853d68d8ff093ec353ca7078b6b6df1533729688bb016b8208263ee7423f66 created at 2021-12-01T09:19:24Z
  $ for F in $(grep -rl 'kind: Deployment' manifests); do yaml2json < "${F}" | jq -r '.[] | select(.kind == "Deployment" and .spec.replicas == null).metadata | .namespace + " " + .name'; done | sort | uniq
  openshift-cluster-machine-approver machine-approver
  openshift-insights insights-operator
  openshift-network-operator network-operator

Those are all important operators, and I'm fairly confident that none of their maintainers expect "cluster admin scales them down to 0" to be a supported UX. We should have the CVO default Deployment replicas to 1 (the type's default [1]), so admins who decide they don't want a network operator pod, etc., have to use some more explicit, alarming API to remove those pods (e.g. setting spec.overrides in the ClusterVersion object to assume control of the resource themselves).

[1]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec
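The explicit, alarming alternative mentioned above would look something like the following ClusterVersion override (a minimal sketch, using the network-operator Deployment as the example; 'unmanaged: true' tells the CVO to stop reconciling that resource, so the admin takes full responsibility for it):

  spec:
    overrides:
    - kind: Deployment
      group: apps
      namespace: openshift-network-operator
      name: network-operator
      unmanaged: true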
*** Bug 2028599 has been marked as a duplicate of this bug. ***
Test plan is something like:

1. Install a nightly with the fix (of which none exist yet? Not clear why ART swept this into ON_QA so quickly).
2. Scale down an operator with a vulnerable manifest:

     $ oc -n openshift-network-operator scale --replicas 0 deployment/network-operator

3. Wait a few minutes while the CVO walks the manifest graph to notice the divergence.
4. Confirm that:

     $ oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator

   has been returned to 1 (see the loop sketch below).
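For step 4, rather than polling by hand, a simple loop works (a sketch; the 30-second interval is arbitrary):

  $ while sleep 30; do oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator; done

Once the CVO's next sync pass reaches the Deployment manifest, the printed value should flip from 0 back to 1.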
Reproducing with 4.9.10:

1. Install a v4.9.10 cluster.
2. Check the pod status of network-operator:

     # oc project openshift-network-operator
     # oc get all
     NAME                                    READY   STATUS    RESTARTS   AGE
     pod/network-operator-797978c5db-7tkkj   1/1     Running   0          3h54m

     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   1/1     1            1           3h54m

     NAME                                          DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-797978c5db   1         1         1       3h54m

3. Scale down to 0 replicas:

     # oc scale --replicas 0 deployment/network-operator
     deployment.apps/network-operator scaled
     # oc get all
     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   0/0     0            0           3h59m

     NAME                                          DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-797978c5db   0         0         0       3h59m

   The CVO is happy to leave it scaled down to 0.

=================================================================================

Verifying with 4.10.0-0.nightly-2021-12-03-213835:

1. Install a cluster with build 4.10.0-0.nightly-2021-12-03-213835.
2. Check the pod status of network-operator:

     # oc project openshift-network-operator
     # oc get all
     NAME                                  READY   STATUS    RESTARTS   AGE
     pod/network-operator-f5b59798-2f4xn   1/1     Running   0          4h7m

     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   1/1     1            1           4h7m

     NAME                                        DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-f5b59798   1         1         1       4h7m

3. Scale down to 0 replicas:

     # oc scale --replicas 0 deployment/network-operator
     deployment.apps/network-operator scaled
     # oc get all
     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   0/0     0            0           4h8m

     NAME                                        DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-f5b59798   0         0         0       4h8m

     # oc get all
     NAME                                  READY   STATUS    RESTARTS   AGE
     pod/network-operator-f5b59798-m48c6   1/1     Running   0          19s

     NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
     deployment.apps/network-operator   1/1     1            1           4h9m

     NAME                                        DESIRED   CURRENT   READY   AGE
     replicaset.apps/network-operator-f5b59798   1         1         1       4h9m

So, for manifests that leave 'replicas' unset, the CVO now defaults them to 1 replica. Moving this to the verified state.
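For completeness, the same read-back could be run against all three Deployments flagged in the audit above (a sketch; it only prints each Deployment's current .spec.replicas):

  $ for PAIR in openshift-cluster-machine-approver/machine-approver \
                openshift-insights/insights-operator \
                openshift-network-operator/network-operator; do
        echo -n "${PAIR}: "
        oc -n "${PAIR%/*}" get deployment "${PAIR#*/}" -o jsonpath='{.spec.replicas}{"\n"}'
    done

After the fix, all three should report 1, even after an admin scales them down.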
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056