Description of problem:
The cluster-autoscaler-operator deployment is managed by the Cluster Version Operator (CVO), so if it is modified it should be replaced with a clean manifest.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-11-24-164029

How reproducible:
Always

Steps to Reproduce:
1. Update the cluster-autoscaler-operator deployment. The added volumeMount uses the name "cert" (the same as the existing one) and a mountPath without a leading "/".

$ oc patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

Before update:
$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

After update:
$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

2. Check the cluster-autoscaler-operator deployment; it was not synced by the CVO.

3. Check the CVO log:
I1125 01:59:49.344354       1 sync_worker.go:925] Update error 330 of 818: UpdatePayloadResourceInvalid Could not update deployment "openshift-machine-api/cluster-autoscaler-operator" (330 of 818): the object is invalid, possibly due to local cluster configuration (*errors.StatusError: Deployment.apps "cluster-autoscaler-operator" is invalid: spec.template.spec.containers[1].volumeMounts[1].mountPath: Invalid value: "/etc/cluster-autoscaler-operator/tls": must be unique)

Actual results:
The cluster-autoscaler-operator pod is stuck in CreateContainerError, and the cluster-autoscaler-operator deployment was not synced by the CVO.

$ oc get po
NAME                                           READY   STATUS                 RESTARTS        AGE
cluster-autoscaler-operator-55d787667d-9q9dx   1/2     CreateContainerError   0               26m
cluster-autoscaler-operator-7647f547dd-6bvr7   2/2     Running                0               30m
cluster-baremetal-operator-7f69cdf84b-8k9pn    2/2     Running                1 (6h47m ago)   6h51m
machine-api-controllers-557d766f7d-dt2l8       7/7     Running                0               6h48m
machine-api-operator-596bf465c7-pcpdr          2/2     Running                0               23m

$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

$ oc edit po cluster-autoscaler-operator-55d787667d-9q9dx
    waiting:
      message: |
        container create failed: time="2021-11-24T14:02:37Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting \"/var/lib/kubelet/pods/46a9fff8-8b01-46cc-96e3-dbd638d3bc15/volumes/kubernetes.io~secret/cert\" to rootfs at \"/etc/cluster-autoscaler-operator/tls/service-ca\" caused: mkdir /var/lib/containers/storage/overlay/4f052047a94b1b618351bfa56ed9753e7cd132f4d619098499c7d5b95abb019d/merged/etc/cluster-autoscaler-operator/tls/service-ca: read-only file system"
      reason: CreateContainerError

Expected results:
Either the patch should return an error, or the cluster-autoscaler-operator deployment should be synced back by the CVO soon and the pod should reach Running status.

Additional info:
Added more info here.

> $ oc patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value": {"mountPath":"etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
> deployment.apps/cluster-autoscaler-operator patched

There are two issues with this patch:

1) The mountPath, which causes the pod to hit `CreateContainerError`. After changing "etc/cluster-autoscaler-operator/tls/service-ca" to "/etc/cluster-autoscaler-operator/tls/service-ca", the pod runs well.

2) The duplicate volumeMounts, which cause the CVO to fail to remove the unrecognized volume mount. The two volumeMounts have the same name `cert`.

$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

So the CVO keeps reporting the invalid config when trying to restore the deployment:

I1125 03:17:42.281040       1 sync_worker.go:925] Update error 330 of 818: UpdatePayloadResourceInvalid Could not update deployment "openshift-machine-api/cluster-autoscaler-operator" (330 of 818): the object is invalid, possibly due to local cluster configuration (*errors.StatusError: Deployment.apps "cluster-autoscaler-operator" is invalid: spec.template.spec.containers[1].volumeMounts[1].mountPath: Invalid value: "/etc/cluster-autoscaler-operator/tls": must be unique)

After changing the name of the added volumeMount to a different name, the CVO updates the deployment to the required state as expected.

So the issue here is that, if the deployment is patched successfully (even with an invalid config), the CVO should restore it to the required state regardless of whether the in-cluster resource is valid, as long as the patch succeeded.
(In reply to liujia from comment #1)
> So the issue here is that, if the deployment is patched successfully (even
> with an invalid config), the CVO should restore it to the required state
> regardless of whether the in-cluster resource is valid, as long as the
> patch succeeded.

This is standard CVO behaviour whenever the API rejects a resource as being invalid. If I'm understanding correctly what you mean by restoring the resource to the required state: if the CVO simply stomped on the change and restored the resource, a less engaged user might not notice and would believe their change was applied. Even if we were to make such a change to the CVO, it would be an RFE and not a bug.
From my understanding, the CVO is always stomping on changes unless the resource is set to unmanaged in spec.overrides. The CVO's main responsibility is to drive and maintain the cluster toward the target state (the required state according to the manifests). So from an end user's perspective, the CVO should always remove "unrecognized volume mounts".

Now, when adding an unrecognized volume mount with name cert (/etc/cluster-autoscaler-operator-invalid/service-ca):

# ./oc -n openshift-machine-api patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"/etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

# ./oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

the CVO does nothing about the added unrecognized volume mount.

But when adding an unrecognized volume mount with name auth-proxy-config (/etc/cluster-autoscaler-operator-invalid/service-ca):

# ./oc -n openshift-machine-api patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"/etc/cluster-autoscaler-operator/tls/service-ca","name":"auth-proxy-config","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

# ./oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "auth-proxy-config",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

the CVO stomps on it directly (also without any notice to an engaged user), and the unrecognized volume mount is erased:

# ./oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

So the inconsistent behavior made us think this is a bug.
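As a side note on the spec.overrides mechanism mentioned above: setting the deployment to unmanaged is what actually stops the CVO from stomping on it. A rough, illustrative example of such an override for this deployment (not taken from this bug's reproduction, and generally only intended for debugging) would be something like:

$ oc patch clusterversion version --type='json' -p='[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-machine-api", "name": "cluster-autoscaler-operator", "unmanaged": true}]}]'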
Ah, because the current logic assumes that volumeMount entries are uniquely identified by name [1], but instead we should have been using mountPath [2].

[1]: https://github.com/openshift/cluster-version-operator/blob/85f767cd663e738b6b9aacec74d010cfbc32993a/lib/resourcemerge/core.go#L325
[2]: https://github.com/kubernetes/api/blob/1d6faf224f146dd002553f55cd9fcaaaa0dc00cb/core/v1/types.go#L2367
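To illustrate the distinction, here is a minimal standalone sketch in Go. It is not the actual resourcemerge implementation; the local VolumeMount type and the reconcile helper are invented for the example. It only shows why name is not a unique key for volumeMounts: when matching by name, the injected mount and the manifest's mount are indistinguishable (both are named "cert"), so the stray entry survives, while matching by mountPath drops it.

// Minimal standalone sketch (not the actual CVO resourcemerge code) of why
// keying volumeMounts by Name cannot tell the injected mount apart from the
// legitimate one, while keying by MountPath can.
package main

import "fmt"

// VolumeMount is a local stand-in for corev1.VolumeMount, kept minimal here.
type VolumeMount struct {
	Name      string
	MountPath string
}

// reconcile keeps existing mounts whose key matches some required mount and
// drops the rest ("remove unrecognized volume mounts"), then appends any
// required mount that is still missing.
func reconcile(existing, required []VolumeMount, key func(VolumeMount) string) []VolumeMount {
	requiredByKey := map[string]VolumeMount{}
	for _, r := range required {
		requiredByKey[key(r)] = r
	}
	var out []VolumeMount
	present := map[string]bool{}
	for _, e := range existing {
		if _, ok := requiredByKey[key(e)]; ok {
			out = append(out, e)
			present[key(e)] = true
		}
	}
	for _, r := range required {
		if !present[key(r)] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	required := []VolumeMount{
		{Name: "cert", MountPath: "/etc/cluster-autoscaler-operator/tls"},
	}
	existing := []VolumeMount{
		{Name: "cert", MountPath: "etc/cluster-autoscaler-operator/tls/service-ca"}, // injected by the patch
		{Name: "cert", MountPath: "/etc/cluster-autoscaler-operator/tls"},           // from the manifest
	}

	// Keyed by Name, both existing entries look "recognized", so the injected
	// mount survives reconciliation.
	fmt.Println(reconcile(existing, required, func(v VolumeMount) string { return v.Name }))

	// Keyed by MountPath, only the manifest's mount is recognized; the
	// injected one is dropped.
	fmt.Println(reconcile(existing, required, func(v VolumeMount) string { return v.MountPath }))
}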
Verified clusterversion: 4.10.0-0.nightly-2021-12-06-201335

$ oc patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

After the update, the autoscaler pod and deployment were checked and work as expected.

$ oc get po
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5ff8cd67f-lg58p   2/2     Running   0          33m
cluster-baremetal-operator-7bcc96c877-r7sgm   2/2     Running   0          33m
machine-api-controllers-5c58b566b7-mc5jl      7/7     Running   0          33m
machine-api-operator-644d9f94c5-ndl2c         2/2     Running   0          33m

$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056