Description of problem:

The cluster-autoscaler-operator deployment still requires the configmap cluster-autoscaler-operator-ca as a volume, even though this configmap has been deleted/tombstoned in 4.9.

Version-Release number of selected component (if applicable):

4.9.0-rc.0

How reproducible:

Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:

$ oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "service-ca.crt",
        "path": "ca-cert.pem"
      }
    ],
    "name": "cluster-autoscaler-operator-ca"
  },
  "name": "ca-cert"
}
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

Expected results:

The cluster-autoscaler-operator deployment should not require a configmap that has been configured for deletion.

Additional info:

https://github.com/openshift/cluster-autoscaler-operator/pull/214 was the PR that tombstoned/deleted the configmap in question.
Workaround:

oc delete deploy -n openshift-machine-api cluster-autoscaler-operator

This deployment is managed by the Cluster Version Operator and will be replaced with a clean manifest if it is deleted. This workaround has been confirmed on a 4.9.0-rc.0 cluster.
Looks like the autoscaler operator stopped asking for the volume mount after 4.3:

$ git log --oneline origin/release-4.3 | grep 'Rely on service-ca-operator'
$ git log --oneline origin/release-4.4 | grep 'Rely on service-ca-operator'
f08589d4 webhooks: Rely on service-ca-operator for CA injection

And the issue is that, since the CVO started caring about volumes, we require manifest entries to exist in-cluster, but allow not-in-manifest entries to persist without stomping them out [1]. We can't think of any valid use cases for admins injecting additional volume mounts (and they can use spec.overrides in ClusterVersion to take complete control if they need to in an emergency), so we're moving this to the CVO to tighten up our volume management.

The loose volume management is generic, but this specific autoscaler mount will be an issue for clusters born in 4.3 and earlier (which will have originally asked the CVO to add this volume) updating to 4.9 and later (where the autoscaler now asks the CVO to remove the ConfigMap).

[1]: https://github.com/openshift/cluster-version-operator/blob/3b1047de4d2a82d2a1f4f92c224a1c46eba72ca9/lib/resourcemerge/core.go#L40-L55
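To illustrate the loose merge behavior pointed at by [1], here is a minimal Go sketch (not the actual CVO code; the volume type and mergeVolumes helper are made up for illustration): every volume named in the manifest is added or updated, but in-cluster volumes that the manifest no longer names are left alone, which is how the stale ca-cert volume survives updates.

package main

import "fmt"

// volume is a simplified stand-in for corev1.Volume.
type volume struct {
    Name   string
    Source string
}

// mergeVolumes mimics the loose behavior: every required (manifest) volume
// is added or updated in place, but existing in-cluster volumes that the
// manifest no longer mentions are NOT pruned.
func mergeVolumes(existing, required []volume) []volume {
    merged := append([]volume(nil), existing...)
    for _, req := range required {
        found := false
        for i := range merged {
            if merged[i].Name == req.Name {
                merged[i] = req // update the existing entry
                found = true
                break
            }
        }
        if !found {
            merged = append(merged, req)
        }
    }
    return merged
}

func main() {
    // A cluster born in 4.3 still carries the ca-cert volume; the 4.4+ manifest no longer lists it.
    inCluster := []volume{
        {Name: "ca-cert", Source: "configmap/cluster-autoscaler-operator-ca"},
        {Name: "cert", Source: "secret/cluster-autoscaler-operator-cert"},
    }
    manifest := []volume{
        {Name: "cert", Source: "secret/cluster-autoscaler-operator-cert"},
        {Name: "auth-proxy-config", Source: "configmap/kube-rbac-proxy-cluster-autoscaler-operator"},
    }
    fmt.Println(mergeVolumes(inCluster, manifest)) // ca-cert survives the merge
}

Tightening this up on the CVO side means pruning volumes the manifest no longer declares, which is what clears the stale ca-cert entry on clusters born in 4.3 and earlier.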
> This specific autoscaler mount will be an issue for clusters born in 4.3 and earlier (which will have originally asked the CVO to add this volume) updating to 4.9 and later (where the autoscaler now asks the CVO to remove the ConfigMap).

We need to add this scenario to the tests as well.
4.3 -> ... -> 4.9 should be part of our usual pre-GA QE test matrix. But it will be hard to cover in CI, because it's going to take a long time for all of those updates, and a hiccup that upsets the update suite during any leg would fail the test before it completed the full chain.
I've been testing 4.2/4.3 -> 4.9 upgrades recently and they worked as expected. Does this bug need any other steps to reproduce, e.g. having an autoscaler enabled?
Huh, I expected it to just be "CVO installed the autoscaler operator" (all 4.3?) and "CVO deletes the ConfigMap" (all 4.9?). Looking more closely, I see [1] was missing cluster-profile labels, so the CVO would ignore those manifests. self-managed-high-availability was added later in [2], so any 4.9 target that includes that commit should have a crash-looping autoscaler operator (unless a cluster admin takes steps like comment 2).

[1]: https://github.com/openshift/cluster-autoscaler-operator/pull/214/files
[2]: https://github.com/openshift/cluster-autoscaler-operator/pull/216
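For reference, a rough Go sketch of the cluster-profile gating involved (an assumption about the mechanism's shape, not the CVO's actual implementation): a manifest is only applied for a profile when it carries the matching include.release.openshift.io/<profile> annotation, so the removal manifest from [1] (no annotation) was skipped until [2] added one.

package main

import "fmt"

// manifest is a simplified stand-in for a release-payload manifest.
type manifest struct {
    Name        string
    Annotations map[string]string
}

// includedForProfile reports whether a manifest opts in to the given cluster
// profile via an include.release.openshift.io/<profile> annotation.
func includedForProfile(m manifest, profile string) bool {
    return m.Annotations["include.release.openshift.io/"+profile] == "true"
}

func main() {
    profile := "self-managed-high-availability"
    // The ConfigMap-removal manifest as originally merged in [1]: no profile annotation.
    before := manifest{Name: "cluster-autoscaler-operator-ca removal", Annotations: map[string]string{}}
    // After [2] added the annotation.
    after := manifest{Name: "cluster-autoscaler-operator-ca removal", Annotations: map[string]string{
        "include.release.openshift.io/" + profile: "true",
    }}
    fmt.Println(includedForProfile(before, profile)) // false: the CVO ignores the manifest
    fmt.Println(includedForProfile(after, profile))  // true: the CVO applies it and deletes the ConfigMap
}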
Reproducing the major exposure, cluster-bot 'launch 4.3.40'. It has the mount:

$ oc -n openshift-machine-api get -o json deployment cluster-autoscaler-operator | jq '.spec.template.spec | {volumes, containers: ([.containers[] | {name, volumeMounts}])}'
{
  "volumes": [
    {
      "configMap": {
        "defaultMode": 420,
        "items": [
          {
            "key": "service-ca.crt",
            "path": "ca-cert.pem"
          }
        ],
        "name": "cluster-autoscaler-operator-ca"
      },
      "name": "ca-cert"
    },
    {
      "name": "cert",
      "secret": {
        "defaultMode": 420,
        "items": [
          {
            "key": "tls.crt",
            "path": "tls.crt"
          },
          {
            "key": "tls.key",
            "path": "tls.key"
          }
        ],
        "secretName": "cluster-autoscaler-operator-cert"
      }
    },
    {
      "configMap": {
        "defaultMode": 420,
        "name": "kube-rbac-proxy-cluster-autoscaler-operator"
      },
      "name": "auth-proxy-config"
    }
  ],
  "containers": [
    {
      "name": "kube-rbac-proxy",
      "volumeMounts": [
        {
          "mountPath": "/etc/kube-rbac-proxy",
          "name": "auth-proxy-config",
          "readOnly": true
        },
        {
          "mountPath": "/etc/tls/private",
          "name": "cert",
          "readOnly": true
        }
      ]
    },
    {
      "name": "cluster-autoscaler-operator",
      "volumeMounts": [
        {
          "mountPath": "/etc/cluster-autoscaler-operator/tls/service-ca",
          "name": "ca-cert",
          "readOnly": true
        },
        {
          "mountPath": "/etc/cluster-autoscaler-operator/tls",
          "name": "cert",
          "readOnly": true
        }
      ]
    }
  ]
}

Update to 4.4:

$ oc adm upgrade channel fast-4.4  # requires 4.9 oc binary, otherwise 'oc patch ...' or whatever.
$ oc adm upgrade --to 4.4.33
$ # polling oc adm upgrade until the update completes
$ oc adm upgrade
Cluster version is 4.4.33
...
$ oc -n openshift-machine-api get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-cfcd79bd4-hg6zb   2/2     Running   0          33m
machine-api-controllers-6c6f45f69-vgsjg       4/4     Running   0          33m
machine-api-operator-7955c5d9b9-fpzqs         2/2     Running   0          28m

And then jump to the 4.9-style issue by removing the ConfigMap (in 4.9, the CVO would do this automatically. I don't have time to take this now-4.4 cluster to 4.9 before cluster-bot reaps it, so doing it manually):

$ oc -n openshift-machine-api delete configmap cluster-autoscaler-operator-ca

A few minutes later, the autoscaler is still happy:

$ oc -n openshift-machine-api get pods
NAME                                          READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-cfcd79bd4-hg6zb   2/2     Running   0          36m
machine-api-controllers-6c6f45f69-vgsjg       4/4     Running   0          36m
machine-api-operator-7955c5d9b9-fpzqs         2/2     Running   0          31m

Ah, Vadim, I bet you didn't hit this, because maybe things are ok if you remove the ConfigMap after the pod is up? But only get into trouble when the cluster tries to create a new pod after the ConfigMap is gone? Forcing a fresh autoscaler operator pod:

$ oc -n openshift-machine-api delete pod cluster-autoscaler-operator-cfcd79bd4-hg6zb
pod "cluster-autoscaler-operator-cfcd79bd4-hg6zb" deleted

And it starts to get sad:

$ oc -n openshift-machine-api get pods
NAME                                          READY   STATUS              RESTARTS   AGE
cluster-autoscaler-operator-cfcd79bd4-jnv27   0/2     ContainerCreating   0          27s
machine-api-controllers-6c6f45f69-vgsjg       4/4     Running             0          37m
machine-api-operator-7955c5d9b9-fpzqs         2/2     Running             0          32m
$ oc -n openshift-machine-api get events | grep -i volume
118s   Warning   FailedMount   pod/cluster-autoscaler-operator-cfcd79bd4-hg6zb   MountVolume.SetUp failed for volume "ca-cert" : configmap "cluster-autoscaler-operator-ca" not found
27s    Warning   FailedMount   pod/cluster-autoscaler-operator-cfcd79bd4-jnv27   MountVolume.SetUp failed for volume "ca-cert" : configmap "cluster-autoscaler-operator-ca" not found
Reproducing it by starting a 4.9 cluster, creating configmap cluster-autoscaler-operator-ca, injecting the configmap into cluster-autoscaler-operator as a volume, and then upgrading to a 4.10 release which does not include the fix.

[root@preserve-yangyangmerrn-1 tmp]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-15-125245   True        False         74m     Cluster version is 4.9.0-0.nightly-2021-09-15-125245

A freshly installed 4.9 cluster does not have configmap cluster-autoscaler-operator-ca as a volume.

[root@preserve-yangyangmerrn-1 tmp]# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

A freshly installed 4.9 cluster does not have configmap cluster-autoscaler-operator-ca.

[root@preserve-yangyangmerrn-1 tmp]# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
Error from server (NotFound): configmaps "cluster-autoscaler-operator-ca" not found

So, manually create configmap cluster-autoscaler-operator-ca.

[root@preserve-yangyangmerrn-1 tmp]# oc create -f 4.3/0000_50_cluster-autoscaler-operator_05_configmap.yaml
configmap/cluster-autoscaler-operator-ca created
[root@preserve-yangyangmerrn-1 tmp]# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
NAME                             DATA   AGE
cluster-autoscaler-operator-ca   1      9s

Inject the configmap cluster-autoscaler-operator-ca into cluster-autoscaler-operator as a volume:

# oc edit deploy cluster-autoscaler-operator -n openshift-machine-api
deployment.apps/cluster-autoscaler-operator edited

        name: cluster-autoscaler-operator
        ports:
        - containerPort: 8443
          protocol: TCP
        resources:
          requests:
            cpu: 20m
            memory: 50Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: FallbackToLogsOnError
        volumeMounts:
        - name: ca-cert
          mountPath: /etc/cluster-autoscaler-operator/tls/service-ca
          readOnly: true
        - mountPath: /etc/cluster-autoscaler-operator/tls
          name: cert
          readOnly: true
      ...
      volumes:
      - name: ca-cert
        configMap:
          name: cluster-autoscaler-operator-ca
          items:
          - key: service-ca.crt
            path: ca-cert.pem

We find a new pod is built.

[root@preserve-yangyangmerrn-1 tmp]# oc get po -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS       AGE
cluster-autoscaler-operator-74b6fd47cf-hmz62   2/2     Running   0              9s
cluster-baremetal-operator-6cb68fdfcf-vsh5q    2/2     Running   1 (134m ago)   141m
machine-api-controllers-b994fb746-82r5h        7/7     Running   0              133m
machine-api-operator-75476f6cc5-2mt27          2/2     Running   0              141m

# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "service-ca.crt",
        "path": "ca-cert.pem"
      }
    ],
    "name": "cluster-autoscaler-operator-ca"
  },
  "name": "ca-cert"
}
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

Upgrade to a 4.10 release which does not include the fix.
# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release@sha256:4a786f68b1dfa11f2a8202a6fc23f57dbc78700687e6fa1b0440f69437d99e4e --allow-explicit-upgrade --force

The configmap cluster-autoscaler-operator-ca is removed by the CVO.

# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
Error from server (NotFound): configmaps "cluster-autoscaler-operator-ca" not found

But cluster-autoscaler-operator still has the configmap cluster-autoscaler-operator-ca as a volume.

# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "configMap": {
    "defaultMode": 420,
    "items": [
      {
        "key": "service-ca.crt",
        "path": "ca-cert.pem"
      }
    ],
    "name": "cluster-autoscaler-operator-ca"
  },
  "name": "ca-cert"
}
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

The autoscaler-operator pod gets stuck.

# oc get po -n openshift-machine-api
NAME                                           READY   STATUS              RESTARTS   AGE
cluster-autoscaler-operator-64b9c65cc7-fjf8r   0/2     ContainerCreating   0          66m
cluster-baremetal-operator-7d75b97d4d-zcg25    2/2     Running             0          66m
machine-api-controllers-7b97c84bf-cqtv8        7/7     Running             0          66m
machine-api-operator-775cdcb686-h5477          2/2     Running             0          60m

# oc get event -n openshift-machine-api | grep cluster-autoscaler-operator-64b9c65cc7-fjf8r
67m   Normal    Scheduled          pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r   Successfully assigned openshift-machine-api/cluster-autoscaler-operator-64b9c65cc7-fjf8r to yangyangbz-nnkqr-master-0.c.openshift-qe.internal
6m    Warning   FailedMount        pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r   MountVolume.SetUp failed for volume "ca-cert" : configmap "cluster-autoscaler-operator-ca" not found
46m   Warning   FailedMount        pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r   Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[ca-cert auth-proxy-config cert kube-api-access-z4kgm]: timed out waiting for the condition
79s   Warning   FailedMount        pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r   Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[auth-proxy-config cert kube-api-access-z4kgm ca-cert]: timed out waiting for the condition
49m   Warning   FailedMount        pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r   Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[kube-api-access-z4kgm ca-cert auth-proxy-config cert]: timed out waiting for the condition
10m   Warning   FailedMount        pod/cluster-autoscaler-operator-64b9c65cc7-fjf8r   Unable to attach or mount volumes: unmounted volumes=[ca-cert], unattached volumes=[cert kube-api-access-z4kgm ca-cert auth-proxy-config]: timed out waiting for the condition
67m   Normal    SuccessfulCreate   replicaset/cluster-autoscaler-operator-64b9c65cc7   Created pod: cluster-autoscaler-operator-64b9c65cc7-fjf8r

It's reproduced.
Following the procedure described in comment#11 to verify it.

After the cluster gets upgraded to 4.10.0-0.nightly-2021-09-16-034325:

[root@preserve-yangyangmerrn-1 tmp]# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-16-034325   True        False         10m     Cluster version is 4.10.0-0.nightly-2021-09-16-034325

Volume ca-cert with configmap cluster-autoscaler-operator-ca gets removed from the cluster-autoscaler-operator deployment.

# oc get deploy -n openshift-machine-api cluster-autoscaler-operator -o json | jq .spec.template.spec.volumes[]
{
  "name": "cert",
  "secret": {
    "defaultMode": 420,
    "items": [
      {
        "key": "tls.crt",
        "path": "tls.crt"
      },
      {
        "key": "tls.key",
        "path": "tls.key"
      }
    ],
    "secretName": "cluster-autoscaler-operator-cert"
  }
}
{
  "configMap": {
    "defaultMode": 420,
    "name": "kube-rbac-proxy-cluster-autoscaler-operator"
  },
  "name": "auth-proxy-config"
}

Configmap cluster-autoscaler-operator-ca is removed by the CVO.

[root@preserve-yangyangmerrn-1 tmp]# oc get cm cluster-autoscaler-operator-ca -n openshift-machine-api
Error from server (NotFound): configmaps "cluster-autoscaler-operator-ca" not found

The cluster-autoscaler-operator pod is running well.

[root@preserve-yangyangmerrn-1 tmp]# oc get po -n openshift-machine-api
NAME                                           READY   STATUS    RESTARTS   AGE
cluster-autoscaler-operator-7568d4948c-wdntm   2/2     Running   0          12m
cluster-baremetal-operator-6dddfcb777-mxh9c    2/2     Running   0          12m
machine-api-controllers-7b97c84bf-6l2vj        7/7     Running   0          15m
machine-api-operator-86658fbb65-j485n          2/2     Running   0          17m

Moving it to verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056