Bug 2002834
| Summary: | Cluster-version operator does not remove unrecognized volume mounts |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Cluster Version Operator |
| Version: | 4.1.z |
| Target Release: | 4.10.0 |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| Reporter: | Christoph Blecker <cblecker> |
| Assignee: | W. Trevor King <wking> |
| QA Contact: | Yang Yang <yanyang> |
| CC: | aos-bugs, lmohanty, mimccune, vrutkovs, wking, yanyang |
| Keywords: | ServiceDeliveryBlocker, Upgrades |
| Target Milestone: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Bug Blocks: | 2004568 |
| Last Closed: | 2022-03-12 04:38:07 UTC |

Doc Text:
Cause: The cluster-version operator (CVO) did not remove volumes and volume mounts that were not requested in the manifest.
Consequence: Extra volumes and volume mounts persisted without the CVO stomping them, whether they had been added by cluster administrators or left behind by manifests from earlier versions. If the resource backing such a volume was later removed, new pods could no longer be created.
Fix: The cluster-version operator now removes volumes and volume mounts that do not appear in the manifest.
Result: The cluster-version operator's manifest reconciliation no longer gets stuck when a non-manifest volume fails.
Workaround:

$ oc delete deploy -n openshift-machine-api cluster-autoscaler-operator

This deployment is managed by the Cluster Version Operator, and will be replaced with a clean manifest if it is deleted. This workaround has been confirmed on a 4.9.0-rc.0 cluster.

Looks like the autoscaler operator stopped asking for the volume mount in 4.4:

$ git log --oneline origin/release-4.3 | grep 'Rely on service-ca-operator'
$ git log --oneline origin/release-4.4 | grep 'Rely on service-ca-operator'
f08589d4 webhooks: Rely on service-ca-operator for CA injection

And the issue is that, since the CVO started caring about volumes, we require manifest entries to exist in-cluster, but allow not-in-manifest entries to persist without stomping them out [1]. We can't think of any valid use cases for admins injecting additional volume mounts (and they can use spec.overrides in ClusterVersion to take complete control if they need to in an emergency; a rough example is sketched below), so this bug is moving to the Cluster Version Operator component to tighten up our volume management.

The loose volume management is generic, but this specific autoscaler mount will be an issue for clusters born in 4.3 and earlier (which will have originally asked the CVO to add this volume) updating to 4.9 and later (where the autoscaler now asks the CVO to remove the ConfigMap).

[1]: https://github.com/openshift/cluster-version-operator/blob/3b1047de4d2a82d2a1f4f92c224a1c46eba72ca9/lib/resourcemerge/core.go#L40-L55

> This specific autoscaler mount will be an issue for clusters born in 4.3 and earlier (which will have originally asked the CVO to add this volume) updating to 4.9 and later (where the autoscaler now asks the CVO to remove the ConfigMap).
We need to add this scenario to the tests as well.
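For reference, the spec.overrides escape hatch mentioned above would look roughly like this. It is not needed for this bug; it is only the emergency option for an admin who really does need to manage this Deployment themselves. This is a sketch against the ClusterVersion overrides schema (kind/group/namespace/name/unmanaged), pointed at the Deployment discussed here; remove the override again afterwards, because the CVO stops reconciling an overridden resource:

$ oc patch clusterversion version --type json -p '[
    {"op": "add", "path": "/spec/overrides", "value": [
      {"kind": "Deployment", "group": "apps",
       "namespace": "openshift-machine-api",
       "name": "cluster-autoscaler-operator",
       "unmanaged": true}
    ]}
  ]'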
4.3 -> ... -> 4.9 should be part of our usual pre-GA QE test matrix. But it will be hard to cover in CI, because it's going to take a long time for all of those updates, and a hiccup that upsets the update suite during any leg would fail the test before it completed the full chain.

I've been testing 4.2/4.3 -> 4.9 upgrade recently and it worked as expected. Does this bug need any other steps to reproduce, e.g. have an autoscaler enabled?

Huh, I expected it to just be "CVO installed the autoscaler operator" (all 4.3?) and "CVO deletes the ConfigMap" (all 4.9?). Looking more closely, I see [1] was missing cluster-profile labels, so the CVO would ignore those manifests. self-managed-high-availability was added later in [2], so any 4.9 target that includes that commit should have a crash-looping autoscaler operator (unless a cluster admin takes steps like comment 2).

[1]: https://github.com/openshift/cluster-autoscaler-operator/pull/214/files
[2]: https://github.com/openshift/cluster-autoscaler-operator/pull/216

Reproducing the major exposure, cluster-bot 'launch 4.3.40'. It has the mount:
$ oc -n openshift-machine-api get -o json deployment cluster-autoscaler-operator | jq '.spec.template.spec | {volumes, containers: ([.containers[] | {name, volumeMounts}])}'
{
"volumes": [
{
"configMap": {
"defaultMode": 420,
"items": [
{
"key": "service-ca.crt",
"path": "ca-cert.pem"
}
],
"name": "cluster-autoscaler-operator-ca"
},
"name": "ca-cert"
},
{
"name": "cert",
"secret": {
"defaultMode": 420,
"items": [
{
"key": "tls.crt",
"path": "tls.crt"
},
{
"key": "tls.key",
"path": "tls.key"
}
],
"secretName": "cluster-autoscaler-operator-cert"
}
},
{
"configMap": {
"defaultMode": 420,
"name": "kube-rbac-proxy-cluster-autoscaler-operator"
},
"name": "auth-proxy-config"
}
],
"containers": [
{
"name": "kube-rbac-proxy",
"volumeMounts": [
{
"mountPath": "/etc/kube-rbac-proxy",
"name": "auth-proxy-config",
"readOnly": true
},
{
"mountPath": "/etc/tls/private",
"name": "cert",
"readOnly": true
}
]
},
{
"name": "cluster-autoscaler-operator",
"volumeMounts": [
{
"mountPath": "/etc/cluster-autoscaler-operator/tls/service-ca",
"name": "ca-cert",
"readOnly": true
},
{
"mountPath": "/etc/cluster-autoscaler-operator/tls",
"name": "cert",
"readOnly": true
}
]
}
]
}
Update to 4.4:
$ oc adm upgrade channel fast-4.4 # requires 4.9 oc binary, otherwise 'oc patch ...' or whatever.
$ oc adm upgrade --to 4.4.33
$ # polling oc adm upgrade until the update completes
$ oc adm upgrade
Cluster version is 4.4.33
...
$ oc -n openshift-machine-api get pods
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-cfcd79bd4-hg6zb 2/2 Running 0 33m
machine-api-controllers-6c6f45f69-vgsjg 4/4 Running 0 33m
machine-api-operator-7955c5d9b9-fpzqs 2/2 Running 0 28m
And then jump to the 4.9-style issue by removing the ConfigMap (in 4.9 the CVO would do this automatically, but I don't have time to take this now-4.4 cluster to 4.9 before cluster-bot reaps it, so I'm doing it manually):
$ oc -n openshift-machine-api delete configmap cluster-autoscaler-operator-ca
A few minutes later, the autoscaler is still happy:
$ oc -n openshift-machine-api get pods
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-cfcd79bd4-hg6zb 2/2 Running 0 36m
machine-api-controllers-6c6f45f69-vgsjg 4/4 Running 0 36m
machine-api-operator-7955c5d9b9-fpzqs 2/2 Running 0 31m
Ah, Vadim, I bet you didn't hit this because things are probably ok if you remove the ConfigMap after the pod is up, and you only get into trouble when the cluster tries to create a new pod after the ConfigMap is gone. Forcing a fresh autoscaler operator pod:
$ oc -n openshift-machine-api delete pod cluster-autoscaler-operator-cfcd79bd4-hg6zb
pod "cluster-autoscaler-operator-cfcd79bd4-hg6zb" deleted
And it starts to get sad:
$ oc -n openshift-machine-api get pods
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-cfcd79bd4-jnv27 0/2 ContainerCreating 0 27s
machine-api-controllers-6c6f45f69-vgsjg 4/4 Running 0 37m
machine-api-operator-7955c5d9b9-fpzqs 2/2 Running 0 32m
$ oc -n openshift-machine-api get events | grep -i volume
118s Warning FailedMount pod/cluster-autoscaler-operator-cfcd79bd4-hg6zb MountVolume.SetUp failed for volume "ca-cert" : configmap "cluster-autoscaler-operator-ca" not found
27s Warning FailedMount pod/cluster-autoscaler-operator-cfcd79bd4-jnv27 MountVolume.SetUp failed for volume "ca-cert" : configmap "cluster-autoscaler-operator-ca" not found
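Aside: to check whether a cluster still carries the stale volume before updating, one rough convenience check (using the volume name from the output above; this is just an illustrative one-liner, not part of the official reproduction or verification steps) is:

$ oc -n openshift-machine-api get deployment cluster-autoscaler-operator -o json \
  | jq '[.spec.template.spec.volumes[].name] | index("ca-cert") != null'
$ # prints "true" while the stale ca-cert volume is still present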
Reproducing it by starting a 4.9 cluster, creating configmap cluster-autoscaler-operator-ca, injecting the configmap into the cluster-autoscaler-operator deployment as a volume, and then upgrading to a 4.10 release which does not include the fix.
[root@preserve-yangyangmerrn-1 tmp]# oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.9.0-0.nightly-2021-09-15-125245 True False 74m Cluster version is 4.9.0-0.nightly-2021-09-15-125245
A freshly installed 4.9 cluster does not have the configmap cluster-autoscaler-operator-ca as a volume.
[root@preserve-yangyangmerrn-1 tmp]# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
"name": "cert",
"secret": {
"defaultMode": 420,
"items": [
{
"key": "tls.crt",
"path": "tls.crt"
},
{
"key": "tls.key",
"path": "tls.key"
}
],
"secretName": "cluster-autoscaler-operator-cert"
}
}
{
"configMap": {
"defaultMode": 420,
"name": "kube-rbac-proxy-cluster-autoscaler-operator"
},
"name": "auth-proxy-config"
}
A freshly installed 4.9 cluster does not have the configmap cluster-autoscaler-operator-ca, either.
[root@preserve-yangyangmerrn-1 tmp]# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
Error from server (NotFound): configmaps "cluster-autoscaler-operator-ca" not found
So, manually create configmap cluster-autoscaler-operator-ca.
[root@preserve-yangyangmerrn-1 tmp]# oc create -f 4.3/0000_50_cluster-autoscaler-operator_05_configmap.yaml
configmap/cluster-autoscaler-operator-ca created
[root@preserve-yangyangmerrn-1 tmp]# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
NAME DATA AGE
cluster-autoscaler-operator-ca 1 9s
Inject the configmap cluster-autoscaler-operator-ca into the cluster-autoscaler-operator deployment as a volume:
# oc edit deploy cluster-autoscaler-operator -n openshift-machine-api
deployment.apps/cluster-autoscaler-operator edited
        name: cluster-autoscaler-operator
        ports:
        - containerPort: 8443
          protocol: TCP
        resources:
          requests:
            cpu: 20m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - name: ca-cert
          mountPath: /etc/cluster-autoscaler-operator/tls/service-ca
          readOnly: true
        - mountPath: /etc/cluster-autoscaler-operator/tls
          name: cert
          readOnly: true
...
      volumes:
      - name: ca-cert
        configMap:
          name: cluster-autoscaler-operator-ca
          items:
          - key: service-ca.crt
            path: ca-cert.pem
We find a new pod is created.
[root@preserve-yangyangmerrn-1 tmp]# oc get po -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-74b6fd47cf-hmz62 2/2 Running 0 9s
cluster-baremetal-operator-6cb68fdfcf-vsh5q 2/2 Running 1 (134m ago) 141m
machine-api-controllers-b994fb746-82r5h 7/7 Running 0 133m
machine-api-operator-75476f6cc5-2mt27 2/2 Running 0 141m
# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
"configMap": {
"defaultMode": 420,
"items": [
{
"key": "service-ca.crt",
"path": "ca-cert.pem"
}
],
"name": "cluster-autoscaler-operator-ca"
},
"name": "ca-cert"
}
{
"name": "cert",
"secret": {
"defaultMode": 420,
"items": [
{
"key": "tls.crt",
"path": "tls.crt"
},
{
"key": "tls.key",
"path": "tls.key"
}
],
"secretName": "cluster-autoscaler-operator-cert"
}
}
{
"configMap": {
"defaultMode": 420,
"name": "kube-rbac-proxy-cluster-autoscaler-operator"
},
"name": "auth-proxy-config"
}
Upgrade to a 4.10 release which does not include the fix.
# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:4a786f68b1dfa11f2a8202a6fc23f57dbc78700687e6fa1b0440f69437d99e4e --allow-explicit-upgrade --force
The configmap cluster-autoscaler-operator-ca is removed by the CVO.
# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
Error from server (NotFound): configmaps "cluster-autoscaler-operator-ca" not found
But the cluster-autoscaler-operator deployment still has the configmap cluster-autoscaler-operator-ca as a volume.
# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
"configMap": {
"defaultMode": 420,
"items": [
{
"key": "service-ca.crt",
"path": "ca-cert.pem"
}
],
"name": "cluster-autoscaler-operator-ca"
},
"name": "ca-cert"
}
{
"name": "cert",
"secret": {
"defaultMode": 420,
"items": [
{
"key": "tls.crt",
"path": "tls.crt"
},
{
"key": "tls.key",
"path": "tls.key"
}
],
"secretName": "cluster-autoscaler-operator-cert"
}
}
{
"configMap": {
"defaultMode": 420,
"name": "kube-rbac-proxy-cluster-autoscaler-operator"
},
"name": "auth-proxy-config"
}
The cluster-autoscaler-operator pod gets stuck in ContainerCreating.
# oc get po -n openshift-machine-api
NAME READY STATUS RESTARTS AGE
cluster-autoscaler-operator-64b9c65cc7-fjf8r 0/2 ContainerCreating 0 66m
cluster-baremetal-operator-7d75b97d4d-zcg25 2/2 Running 0 66m
machine-api-controllers-7b97c84bf-cqtv8 7/7 Running 0 66m
machine-api-operator-775cdcb686-h5477 2/2 Running 0 60m
# oc get event -n openshift-machine-api | grep cluster-autoscaler-operator-64b9c65cc7-fjf8r
67m Normal Scheduled pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r Successfully assigned openshift-machine-api/cluster-autoscaler-operator-64b9c65cc7-fjf8r to yangyangbz-nnkqr-master-0.c.openshift-qe.internal
6m Warning FailedMount pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r MountVolume.SetUp failed for volume "ca-cert" : configmap "cluster-autoscaler-operator-ca" not found
46m Warning FailedMount pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[ca-cert auth-proxy-config cert kube-api-access-z4kgm]: timed out waiting for the condition
79s Warning FailedMount pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[auth-proxy-config cert kube-api-access-z4kgm ca-cert]: timed out waiting for the condition
49m Warning FailedMount pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[kube-api-access-z4kgm ca-cert auth-proxy-config cert]: timed out waiting for the condition
10m Warning FailedMount pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[cert kube-api-access-z4kgm ca-cert auth-proxy-config]: timed out waiting for the condition
67m Normal SuccessfulCreate replicaset/cluster-autoscaler-operator-64b9c65cc7 Created pod: cluster-autoscaler-operator-64b9c65cc7-fjf8r
It's reproduced.
Following the procedure described in comment #11 to verify it. After the cluster gets upgraded to 4.10.0-0.nightly-2021-09-16-034325:

[root@preserve-yangyangmerrn-1 tmp]# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-16-034325   True        False         10m     Cluster version is 4.10.0-0.nightly-2021-09-16-034325

Volume ca-cert with configmap cluster-autoscaler-operator-ca gets removed from the cluster-autoscaler-operator deployment.

# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

Configmap cluster-autoscaler-operator-ca is removed by the CVO.

[root@preserve-yangyangmerrn-1 tmp]# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
Error from server (NotFound): configmaps "cluster-autoscaler-operator-ca" not found

The cluster-autoscaler-operator pod is running well.

[root@preserve-yangyangmerrn-1 tmp]# oc get po -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-7568d4948c-wdntm   2/2     Running   0          12m
cluster-baremetal-operator-6dddfcb777-mxh9c    2/2     Running   0          12m
machine-api-controllers-7b97c84bf-6l2vj        7/7     Running   0          15m
machine-api-operator-86658fbb65-j485n          2/2     Running   0          17m

Moving it to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
Description of problem:

The cluster-autoscaler-operator deployment still requires the configmap cluster-autoscaler-operator-ca as a volume, even though this configmap has been deleted/tombstoned in 4.9.

Version-Release number of selected component (if applicable):

4.9.0-rc.0

How reproducible:

Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:

$ oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "service-ca.crt",
        "path": "ca-cert.pem"
      }
    ],
    "name": "cluster-autoscaler-operator-ca"
  },
  "name": "ca-cert"
}
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

Expected results:

The cluster-autoscaler-operator deployment should not require a configmap that has been configured to be deleted.

Additional info:

https://github.com/openshift/cluster-autoscaler-operator/pull/214 was the PR that tombstoned/deleted the configmap in question.