Description of problem:
The cluster-autoscaler-operator deployment is managed by the Cluster Version Operator (CVO), so if it is modified it should be replaced with a clean manifest.

Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-11-24-164029

How reproducible:
Always

Steps to Reproduce:
1. Update the cluster-autoscaler-operator deployment. The added volumeMount uses the name "cert" (the same as the existing one) and a mountPath without a leading "/".

$ oc patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

Before update:
$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

After update:
$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

2. Check the cluster-autoscaler-operator deployment; it was not synced by the CVO.

3. Check the CVO log:
I1125 01:59:49.344354       1 sync_worker.go:925] Update error 330 of 818: UpdatePayloadResourceInvalid Could not update deployment "openshift-machine-api/cluster-autoscaler-operator" (330 of 818): the object is invalid, possibly due to local cluster configuration (*errors.StatusError: Deployment.apps "cluster-autoscaler-operator" is invalid: spec.template.spec.containers[1].volumeMounts[1].mountPath: Invalid value: "/etc/cluster-autoscaler-operator/tls": must be unique)

Actual results:
The cluster-autoscaler-operator pod is stuck in CreateContainerError, and the cluster-autoscaler-operator deployment was not synced by the CVO.

$ oc get po
NAME                                           READY   STATUS                 RESTARTS        AGE
cluster-autoscaler-operator-55d787667d-9q9dx   1/2     CreateContainerError   0               26m
cluster-autoscaler-operator-7647f547dd-6bvr7   2/2     Running                0               30m
cluster-baremetal-operator-7f69cdf84b-8k9pn    2/2     Running                1 (6h47m ago)   6h51m
machine-api-controllers-557d766f7d-dt2l8       7/7     Running                0               6h48m
machine-api-operator-596bf465c7-pcpdr          2/2     Running                0               23m

$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

$ oc edit po cluster-autoscaler-operator-55d787667d-9q9dx
    waiting:
      message: |
        container create failed: time="2021-11-24T14:02:37Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting \"/var/lib/kubelet/pods/46a9fff8-8b01-46cc-96e3-dbd638d3bc15/volumes/kubernetes.io~secret/cert\" to rootfs at \"/etc/cluster-autoscaler-operator/tls/service-ca\" caused: mkdir /var/lib/containers/storage/overlay/4f052047a94b1b618351bfa56ed9753e7cd132f4d619098499c7d5b95abb019d/merged/etc/cluster-autoscaler-operator/tls/service-ca: read-only file system"
      reason: CreateContainerError

Expected results:
Either the patch should return an error, or the cluster-autoscaler-operator deployment should be synced back by the CVO soon and the pod should reach Running status.

Additional info:
Added more info here.

> $ oc patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value": {"mountPath":"etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
> deployment.apps/cluster-autoscaler-operator patched

There are two issues with this patch:

1) The mountPath, which causes the pod to hit `CreateContainerError`. After changing "etc/cluster-autoscaler-operator/tls/service-ca" to "/etc/cluster-autoscaler-operator/tls/service-ca", the pod runs well.

2) The duplicate volumeMounts, which cause the CVO to fail to remove the unrecognized volume mount. The two volumeMounts have the same name `cert`.

$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

So the CVO keeps reporting the invalid config when trying to restore the deployment:

I1125 03:17:42.281040       1 sync_worker.go:925] Update error 330 of 818: UpdatePayloadResourceInvalid Could not update deployment "openshift-machine-api/cluster-autoscaler-operator" (330 of 818): the object is invalid, possibly due to local cluster configuration (*errors.StatusError: Deployment.apps "cluster-autoscaler-operator" is invalid: spec.template.spec.containers[1].volumeMounts[1].mountPath: Invalid value: "/etc/cluster-autoscaler-operator/tls": must be unique)

After changing the name of the added volumeMount to a different name, the CVO updates the deployment to the required state as expected.

So the issue here is that, if the deployment is patched successfully (even with an invalid config), the CVO should restore it to the required state regardless of whether the in-cluster resource is valid, as long as the patch succeeded.
(In reply to liujia from comment #1)
> So the issue here is that, if the deployment is patched successfully (even
> with an invalid config), the CVO should restore it to the required state
> regardless of whether the in-cluster resource is valid, as long as the
> patch succeeded.

This is standard CVO behaviour whenever the API rejects a resource as being invalid. If I'm understanding correctly what you mean by restoring the resource to the required state: if the CVO simply stomped on the change and restored the resource, a less engaged user might not notice and would believe their change was applied. Even if we were to make such a change to the CVO, it would be an RFE and not a bug.
From my understanding, the CVO is always stomping on changes unless the resource is set to unmanaged in spec.overrides. The CVO's main responsibility is to drive and maintain the cluster toward the target state (the required state according to the manifests). So from an end user's perspective, the CVO should always remove "unrecognized volume mounts".

Now, when adding an unrecognized volume mount with name cert (/etc/cluster-autoscaler-operator-invalid/service-ca):

# ./oc -n openshift-machine-api patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"/etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

# ./oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "cert",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

the CVO does nothing about the added unrecognized volume mount.

But when adding an unrecognized volume mount with name auth-proxy-config (/etc/cluster-autoscaler-operator-invalid/service-ca):

# ./oc -n openshift-machine-api patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"/etc/cluster-autoscaler-operator/tls/service-ca","name":"auth-proxy-config","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

# ./oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls/service-ca",
    "name": "auth-proxy-config",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

the CVO stomps on it directly (also without any notice to an engaged user), and the unrecognized volume mount is erased:

# ./oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]

So the inconsistent behavior made us think this is a bug.
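As a side note on the spec.overrides mechanism mentioned above: setting the deployment to unmanaged is what actually stops the CVO from stomping on it. A rough, illustrative example of such an override for this deployment (not taken from this bug's reproduction, and generally only intended for debugging) would be something like:

$ oc patch clusterversion version --type='json' -p='[{"op": "add", "path": "/spec/overrides", "value": [{"kind": "Deployment", "group": "apps", "namespace": "openshift-machine-api", "name": "cluster-autoscaler-operator", "unmanaged": true}]}]'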
Ah, because the current logic assumes that volumeMount entries are uniquely identified by name [1], but instead we should have been using mountPath [2].

[1]: https://github.com/openshift/cluster-version-operator/blob/85f767cd663e738b6b9aacec74d010cfbc32993a/lib/resourcemerge/core.go#L325
[2]: https://github.com/kubernetes/api/blob/1d6faf224f146dd002553f55cd9fcaaaa0dc00cb/core/v1/types.go#L2367
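To illustrate the distinction, here is a minimal standalone sketch in Go. It is not the actual resourcemerge implementation; the local VolumeMount type and the reconcile helper are invented for the example. It only shows why name is not a unique key for volumeMounts: when matching by name, the injected mount and the manifest's mount are indistinguishable (both are named "cert"), so the stray entry survives, while matching by mountPath drops it.

// Minimal standalone sketch (not the actual CVO resourcemerge code) of why
// keying volumeMounts by Name cannot tell the injected mount apart from the
// legitimate one, while keying by MountPath can.
package main

import "fmt"

// VolumeMount is a local stand-in for corev1.VolumeMount, kept minimal here.
type VolumeMount struct {
	Name      string
	MountPath string
}

// reconcile keeps existing mounts whose key matches some required mount and
// drops the rest ("remove unrecognized volume mounts"), then appends any
// required mount that is still missing.
func reconcile(existing, required []VolumeMount, key func(VolumeMount) string) []VolumeMount {
	requiredByKey := map[string]VolumeMount{}
	for _, r := range required {
		requiredByKey[key(r)] = r
	}
	var out []VolumeMount
	present := map[string]bool{}
	for _, e := range existing {
		if _, ok := requiredByKey[key(e)]; ok {
			out = append(out, e)
			present[key(e)] = true
		}
	}
	for _, r := range required {
		if !present[key(r)] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	required := []VolumeMount{
		{Name: "cert", MountPath: "/etc/cluster-autoscaler-operator/tls"},
	}
	existing := []VolumeMount{
		{Name: "cert", MountPath: "etc/cluster-autoscaler-operator/tls/service-ca"}, // injected by the patch
		{Name: "cert", MountPath: "/etc/cluster-autoscaler-operator/tls"},           // from the manifest
	}

	// Keyed by Name, both existing entries look "recognized", so the injected
	// mount survives reconciliation.
	fmt.Println(reconcile(existing, required, func(v VolumeMount) string { return v.Name }))

	// Keyed by MountPath, only the manifest's mount is recognized; the
	// injected one is dropped.
	fmt.Println(reconcile(existing, required, func(v VolumeMount) string { return v.MountPath }))
}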
Verified clusterversion: 4.10.0-0.nightly-2021-12-06-201335

$ oc patch deploy cluster-autoscaler-operator --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/1/volumeMounts/0","value":{"mountPath":"etc/cluster-autoscaler-operator/tls/service-ca","name":"cert","readOnly":true}}]'
deployment.apps/cluster-autoscaler-operator patched

After the update, the autoscaler pod and deployment were checked and work as expected.

$ oc get po
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-5ff8cd67f-lg58p   2/2     Running   0          33m
cluster-baremetal-operator-7bcc96c877-r7sgm   2/2     Running   0          33m
machine-api-controllers-5c58b566b7-mc5jl      7/7     Running   0          33m
machine-api-operator-644d9f94c5-ndl2c         2/2     Running   0          33m

$ oc get deployment.apps/cluster-autoscaler-operator -n openshift-machine-api -ojson|jq .spec.template.spec.containers[1].volumeMounts
[
  {
    "mountPath": "/etc/cluster-autoscaler-operator/tls",
    "name": "cert",
    "readOnly": true
  }
]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056