Description of problem:

After upgrading the cluster from 4.5.22 to 4.6.8, the prometheus pods go into CrashLoopBackOff because they fail to mount the NFS-backed volumes. Checking the "id" inside the prometheus pod, it now runs as uid 65534 (due to the introduction of the nonroot SCC), whereas the existing data on the mount, including the NFS share, still possesses the old uid/gid ownership.

~~~
# oc rsh prometheus-k8s-0 id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

# ls -lR nfs2
nfs2:
total 0
drwxrwxrwx. 3 root root 39 Dec 26 05:04 prometheus-db

nfs2/prometheus-db:
total 20
-rw-r--r--. 1 1000080000 root 20001 Dec 26 06:24 queries.active
drwxr-xr-x. 2 1000080000 root    70 Dec 26 06:01 wal

nfs2/prometheus-db/wal:
total 361216
-rw-r--r--. 1 1000080000 root         0 Dec 26 05:04 00000000
-rw-r--r--. 1 1000080000 root 134119424 Dec 26 05:26 00000001
-rw-r--r--. 1 1000080000 root 134152192 Dec 26 06:01 00000002
-rw-r--r--. 1 1000080000 root 101613568 Dec 26 06:24 00000003
~~~

Here, the ownership of the data on the volume is not rewritten to the new uid/gid of 65534 (the nobody user).

Version-Release number of selected component (if applicable):
OCP 4.6.8, with NFS PVs used by the prometheus pods.

How reproducible:
100%

Steps to Reproduce:
1. Spin up a 4.5 cluster. Create 2 NFS PVs for the prometheus pods.
2. After attaching the volumes through the cluster-monitoring-config ConfigMap (see the sketch after this comment), note the UID/GID on the volume.
3. Upgrade to 4.6.8. The prometheus pods go into an error state and the upgrade halts.

Actual results:
The prometheus pods are unable to mount the volume due to the permission mismatch.
~~~
prometheus-k8s-0   5/6   Running   19   1h17m
prometheus-k8s-1   5/6   Running   19   1h17m
~~~

Expected results:
The prometheus pods are able to mount the volume so that the prometheus container starts.

Additional info:
Referencing the discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1868976, it appears that the SCC for the prometheus pods was changed from restricted to nonroot, so the new uid/gid is nobody (as also set in the Prometheus base image) for all pods, irrespective of their storage backend.
~~~
"Config": {
    "User": "nobody",
    "ExposedPorts": {
~~~
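For step 2 above, a minimal sketch of the cluster-monitoring-config ConfigMap used to attach persistent storage to Prometheus, following the documented layout; the storageClassName nfs215 is taken from the verification output later in this bug, so adjust it to your environment:

~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs215
          resources:
            requests:
              storage: 20Gi
~~~

The Prometheus Operator derives the PVC names (prometheus-k8s-db-prometheus-k8s-0/1) from this template, which is why two NFS PVs are needed in step 1.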
The upgrade from 4.6.16 to 4.7.0-0.nightly-2021-02-09-192846 succeeded, and after the upgrade the prometheus pods run well and mount the NFS volume normally.

@Peter Hunt, I am not sure whether we can verify the fix by upgrading from 4.6 to 4.7, because in 4.6 the user id in the prometheus pod is already 65534, and after the upgrade to 4.7 the user id stays the same. In the original bug, however, the user id in 4.5 was 0 (root), and on upgrade to 4.6 it changed to 65534. Can you confirm this? If yes, I think this bug is verified.

before upgrade
================
~~~
  volumeMounts:
  - mountPath: /prometheus
    name: prometheus-k8s-db
    subPath: prometheus-db
  volumes:
  - name: prometheus-k8s-db
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-0

$ oc get pvc
NAME                                 STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    nfspv6   20Gi       RWO,ROX,RWX    nfs215         82s
prometheus-k8s-db-prometheus-k8s-1   Bound    nfspv5   20Gi       RWO,ROX,RWX    nfs215         82s

$ oc rsh prometheus-k8s-0
Defaulting container name to prometheus.
Use 'oc describe pod/prometheus-k8s-0 -n openshift-monitoring' to see all of the containers in this pod.
sh-4.4$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
sh-4.4$ pwd
/prometheus
sh-4.4$ ls
chunks_head  queries.active  wal
sh-4.4$ ls -lR
.:
total 20
drwxr-xr-x. 2 nobody nobody     6 Feb 10 09:22 chunks_head
-rw-r--r--. 1 nobody nobody 20001 Feb 10 09:27 queries.active
drwxr-xr-x. 2 nobody nobody    22 Feb 10 09:22 wal
~~~

after upgrade
============
~~~
prometheus-k8s-0   7/7   Running   1   37m
prometheus-k8s-1   7/7   Running   1   42m

$ oc rsh prometheus-k8s-0
Defaulting container name to prometheus.
Use 'oc describe pod/prometheus-k8s-0 -n openshift-monitoring' to see all of the containers in this pod.
sh-4.4$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
sh-4.4$ pwd
/prometheus
sh-4.4$ ls
chunks_head  queries.active  wal
sh-4.4$ ls -lR
.:
total 20
drwxr-xr-x. 2 nobody nobody    48 Feb 10 10:20 chunks_head
-rw-r--r--. 1 nobody nobody 20001 Feb 10 10:52 queries.active
drwxr-xr-x. 2 nobody nobody    54 Feb 10 10:15 wal
~~~
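For anyone re-checking this, one way to confirm which SCC admitted the pod (and therefore which uid it runs as) is to read the openshift.io/scc annotation that the SCC admission controller sets on the pod; a sketch:

~~~
$ oc -n openshift-monitoring get pod prometheus-k8s-0 \
    -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'
~~~

This should print nonroot on 4.6 and later (and restricted on 4.5), matching the uid=65534 seen in the id output above.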
I believe the only reason it was failing in the upgrade from 4.5 to 4.6 was that the directory permissions weren't correctly handled for the new ID, not necessarily that there was a switch in ID. In other words, the old ID (root) had permission but the new one did not. Thus, I think the upgrade from 4.6 to 4.7 succeeding verifies the bug is fixed, since that ID does have the correct permissions.
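For clusters already stuck mid-upgrade before the fix, the practical workaround along these lines was to give the new ID that permission by re-owning the existing data on the NFS server. A sketch, assuming a hypothetical export path of /exports/nfs2 and an export that does not squash the test uid:

~~~
# On the NFS server: hand the existing prometheus data to the new pod uid/gid
chown -R 65534:65534 /exports/nfs2/prometheus-db

# Confirm the unprivileged uid can now traverse and write the tree
sudo -u '#65534' touch /exports/nfs2/prometheus-db/wal/.write-test
sudo -u '#65534' rm /exports/nfs2/prometheus-db/wal/.write-test
~~~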
Thanks, Peter, for the confirmation. Marking it verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633