Bug 2028217

Summary: Cluster-version operator does not default Deployment replicas to one
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cluster Version Operator
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: Yang Yang <yanyang>
Severity: low
Priority: medium
Version: 4.1.z
CC: aos-bugs, openshift-bugzilla-robot, yanyang
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-03-10 16:31:35 UTC
Type: Bug
Bug Blocks: 2028602

Description W. Trevor King 2021-12-01 18:13:00 UTC
Tomas and Vadim noticed that, when a Deployment manifest leaves 'replicas' unset, the CVO ignores the property.  This means that cluster admins can scale those Deployments up or, worse, down to 0, and the CVO will happily continue on without stomping them.  Auditing 4.9.10:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.9.10-x86_64
  Extracted release payload from digest sha256:e1853d68d8ff093ec353ca7078b6b6df1533729688bb016b8208263ee7423f66 created at 2021-12-01T09:19:24Z
  $ for F in $(grep -rl 'kind: Deployment' manifests); do yaml2json < "${F}" | jq -r '.[] | select(.kind == "Deployment" and .spec.replicas == null).metadata | .namespace + " " + .name'; done | sort | uniq
  openshift-cluster-machine-approver machine-approver
  openshift-insights insights-operator
  openshift-network-operator network-operator

Those are all important operators, and I'm fairly confident that none of their maintainers expect "cluster admin scales them down to 0" to be a supported UX.  We should have the CVO default Deployment replicas to 1 (the type's default [1]), so admins who decide they don't want a network operator pod, etc., have to use some more explicit, alarming API to remove those pods (e.g. setting spec.overrides in the ClusterVersion object to assume control of the resource themselves).

[1]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/deployment-v1/#DeploymentSpec
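
For reference, the explicit API mentioned above is spec.overrides on the ClusterVersion object; a rough sketch (the override entry here is only illustrative, using the network operator as the example) of the deliberate step an admin would have to take to pull one of these Deployments out of CVO management:

  $ oc patch clusterversion version --type merge -p '{"spec":{"overrides":[{"kind":"Deployment","group":"apps","namespace":"openshift-network-operator","name":"network-operator","unmanaged":true}]}}'

With an override like that in place, the CVO stops reconciling the Deployment, which is the sort of loud, explicit action this bug argues should be required before scaling a core operator to 0.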

Comment 1 W. Trevor King 2021-12-02 18:07:42 UTC
*** Bug 2028599 has been marked as a duplicate of this bug. ***

Comment 3 W. Trevor King 2021-12-02 18:49:39 UTC
Test plan is something like:

1. Install a nightly with the fix (of which none exist yet?  Not clear why ART swept this into ON_QA so quickly)
2. Scale down an operator with a vulnerable manifest:
     $ oc -n openshift-network-operator scale --replicas 0 deployment/network-operator
3. Wait a few minutes while the CVO walks the manifest graph to notice the divergence.
4. Confirm that:
     $ oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator
   has been returned to 1.
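
Scripted together, steps 2 through 4 look roughly like this (a sketch; the 30-second polling interval is arbitrary, and the CVO may need a few minutes per sync cycle to notice the divergence):

  oc -n openshift-network-operator scale --replicas 0 deployment/network-operator
  # poll until the CVO stomps spec.replicas back to 1
  until [ "$(oc -n openshift-network-operator get deployment network-operator -o jsonpath='{.spec.replicas}')" = "1" ]; do
    sleep 30
  done
  oc -n openshift-network-operator get -o jsonpath='{.spec.replicas}{"\n"}' deployment network-operator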

Comment 5 Yang Yang 2021-12-06 07:21:51 UTC
Reproducing with 4.9.10:

1. Install a v4.9.10 cluster
2. Check the pod status of network-operator

# oc project openshift-network-operator
# oc get all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/network-operator-797978c5db-7tkkj   1/1     Running   0          3h54m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   1/1     1            1           3h54m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-797978c5db   1         1         1       3h54m

3. Scale down to 0 replicas
# oc scale --replicas 0 deployment/network-operator
deployment.apps/network-operator scaled

# oc get all
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   0/0     0            0           3h59m

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-797978c5db   0         0         0       3h59m

The CVO is happy to leave it scaled down to 0, so the bug reproduces.

=================================================================================

Verifying with 4.10.0-0.nightly-2021-12-03-213835

1. Install a cluster with build 4.10.0-0.nightly-2021-12-03-213835
2. Check the pod status of network-operator

# oc project openshift-network-operator
# oc get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/network-operator-f5b59798-2f4xn   1/1     Running   0          4h7m

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   1/1     1            1           4h7m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-f5b59798   1         1         1       4h7m

3. Scale down to 0 replicas
# oc scale --replicas 0 deployment/network-operator
deployment.apps/network-operator scaled
# oc get all
NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   0/0     0            0           4h8m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-f5b59798   0         0         0       4h8m

A few minutes later the CVO has restored the Deployment:

# oc get all
NAME                                  READY   STATUS    RESTARTS   AGE
pod/network-operator-f5b59798-m48c6   1/1     Running   0          19s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator   1/1     1            1           4h9m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-f5b59798   1         1         1       4h9m

So, for a manifest that leaves 'replicas' unset, the CVO defaults it to 1 replica. Moving it to the verified state.
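
The same spot-check can be applied to the other two Deployments flagged by the audit in comment 0; a rough sketch (the 5-minute sleep is just an allowance for a CVO sync cycle or two):

  for T in openshift-cluster-machine-approver/machine-approver openshift-insights/insights-operator; do
    NS="${T%/*}"; NAME="${T#*/}"
    oc -n "$NS" scale --replicas 0 "deployment/$NAME"
    sleep 300
    oc -n "$NS" get deployment "$NAME" -o jsonpath='{.spec.replicas}{"\n"}'
  done

Each line of output should read 1 once the CVO has stomped the manual scale-down.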

Comment 9 errata-xmlrpc 2022-03-10 16:31:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056