Bug 1911016 - Prometheus unable to mount NFS volumes after upgrading to 4.6
Summary: Prometheus unable to mount NFS volumes after upgrading to 4.6
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Peter Hunt
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-27 01:19 UTC by Yash Chouksey
Modified: 2024-03-25 17:41 UTC
CC List: 19 users

Fixed In Version: runc-1.0.0-82.rhaos4.6.git086e841.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:48:29 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Knowledge Base (Solution) 5670841 (last updated 2020-12-31 10:42:17 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:48:48 UTC)

Description Yash Chouksey 2020-12-27 01:19:46 UTC
Description of problem:

After upgrading the cluster from 4.5.22 to 4.6.8, the prometheus pods are going into CrashLoopBackOff state because they fail to mount the NFS-backed volumes.
Checking the "id" inside the prometheus pod, it now runs as uid "65534" (due to the introduction of the nonroot SCC), whereas the old data on the mount, including the NFS share, still possesses the old uid/gid permissions.
~~~
# oc rsh prometheus-k8s-0 id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

# ls -lR nfs2
nfs2:
total 0
drwxrwxrwx. 3 root root 39 Dec 26 05:04 prometheus-db

nfs2/prometheus-db:
total 20
-rw-r--r--. 1 1000080000 root 20001 Dec 26 06:24 queries.active
drwxr-xr-x. 2 1000080000 root    70 Dec 26 06:01 wal

nfs2/prometheus-db/wal:
total 361216
-rw-r--r--. 1 1000080000 root         0 Dec 26 05:04 00000000
-rw-r--r--. 1 1000080000 root 134119424 Dec 26 05:26 00000001
-rw-r--r--. 1 1000080000 root 134152192 Dec 26 06:01 00000002
-rw-r--r--. 1 1000080000 root 101613568 Dec 26 06:24 00000003
~~~

Here, the permissions of the volume are not being rewritten to have the uid/gid of 65534 (nobody user).
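
As an interim measure (a hedged sketch, not the product fix; it assumes direct access to the NFS server, and the export path below is hypothetical), the on-disk ownership can be aligned with the uid/gid the nonroot SCC now assigns:
~~~
# Run on the NFS server. /exports/nfs2 is a placeholder for the export
# backing the prometheus-k8s-db PV; 65534:65534 is the nobody uid/gid the
# prometheus pods now run with.
chown -R 65534:65534 /exports/nfs2/prometheus-db
~~~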

Version-Release number of selected component (if applicable):
OCP 4.6.8
NFS PVs used by prometheus pods.

How reproducible:
100%

Steps to Reproduce:
1. Spin up a 4.5 cluster. Create 2 NFS PVs for the prometheus pods.
2. After adding the volumes through the cluster-monitoring-config ConfigMap, note the UID/GID on the volume (a sketch of this ConfigMap follows these steps).
3. Upgrade to 4.6.8. The prometheus pods go into an error state and the upgrade halts.
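
For step 2, a minimal sketch of the cluster-monitoring-config ConfigMap used to request the PVs (the storage class name and size are assumptions here, chosen to match the PVC listing shown later; adjust them to the NFS PVs created in step 1):
~~~
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs215   # assumed; must match the NFS PVs
          resources:
            requests:
              storage: 20Gi
~~~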

Actual results:
Prometheus pods are unable to mount the volume due to a permission mismatch.
~~~
prometheus-k8s-0                              5/6    Running  19        1h17m
prometheus-k8s-1                              5/6    Running  19        1h17m
~~~

Expected results:
Prometheus pods are able to mount the volume so that the prometheus container starts.

Additional info:
Referencing the discussion in https://bugzilla.redhat.com/show_bug.cgi?id=1868976, it appears that the SCC for the prometheus pods has been changed from restricted to nonroot, so the new uid/gid will be nobody (as also set in the Prometheus base image) for all pods, irrespective of their storage backends.
~~~
        "Config": {
            "User": "nobody",
            "ExposedPorts": {
~~~
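
For reference, one way to confirm which SCC admitted the pod and which UID/fsGroup it was assigned (a sketch; the grep pattern simply picks out the openshift.io/scc annotation and the security-context fields from the pod manifest):
~~~
oc -n openshift-monitoring get pod prometheus-k8s-0 -o yaml \
  | grep -E 'openshift.io/scc|runAsUser|fsGroup'
~~~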

Comment 13 MinLi 2021-02-10 13:29:04 UTC
The upgrade from 4.6.16 to 4.7.0-0.nightly-2021-02-09-192846 succeeded, and after the upgrade the prometheus pods run well and mount the NFS volume normally.

@Peter Hunt, I am not sure whether we can verify the fix by upgrading from 4.6 to 4.7, because in 4.6 the user id in the prometheus pod is already 65534 and it stays the same when upgrading to 4.7. In the original bug, however, the user id in 4.5 was 0 (root) and changed to 65534 on the upgrade to 4.6.
Can you confirm this? If yes, I think this bug is verified.



before upgrade ================
    volumeMounts:
    - mountPath: /prometheus
      name: prometheus-k8s-db
      subPath: prometheus-db
  
volumes:
  - name: prometheus-k8s-db
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-0

$ oc get pvc 
NAME                                 STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    nfspv6   20Gi       RWO,ROX,RWX    nfs215         82s
prometheus-k8s-db-prometheus-k8s-1   Bound    nfspv5   20Gi       RWO,ROX,RWX    nfs215         82s

$ oc rsh prometheus-k8s-0
Defaulting container name to prometheus.
Use 'oc describe pod/prometheus-k8s-0 -n openshift-monitoring' to see all of the containers in this pod.
sh-4.4$ id 
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
sh-4.4$ pwd
/prometheus
sh-4.4$ ls
chunks_head  queries.active  wal
sh-4.4$ ls -lR
.:
total 20
drwxr-xr-x. 2 nobody nobody     6 Feb 10 09:22 chunks_head
-rw-r--r--. 1 nobody nobody 20001 Feb 10 09:27 queries.active
drwxr-xr-x. 2 nobody nobody    22 Feb 10 09:22 wal

after upgrade ============
prometheus-k8s-0                               7/7     Running   1          37m
prometheus-k8s-1                               7/7     Running   1          42m

$ oc rsh prometheus-k8s-0
Defaulting container name to prometheus.
Use 'oc describe pod/prometheus-k8s-0 -n openshift-monitoring' to see all of the containers in this pod.
sh-4.4$ id 
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
sh-4.4$ pwd
/prometheus
sh-4.4$ ls
chunks_head  queries.active  wal
sh-4.4$ ls -lR
.:
total 20
drwxr-xr-x. 2 nobody nobody    48 Feb 10 10:20 chunks_head
-rw-r--r--. 1 nobody nobody 20001 Feb 10 10:52 queries.active
drwxr-xr-x. 2 nobody nobody    54 Feb 10 10:15 wal

Comment 14 Peter Hunt 2021-02-10 16:32:58 UTC
I believe the only reason it was failing in the upgrade from 4.5 to 4.6 was that the directory permissions weren't correctly handled for the new ID, not necessarily that there was a switch in ID. As in, the old ID (root) had permission but the new one did not. Thus, I think the upgrade from 4.6 to 4.7 succeeding verifies the bug is fixed, as this ID does have the correct permissions.
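
A quick way to double-check this (a hypothetical spot check, not part of the original verification) is to confirm that the UID the container runs as can actually write under the mounted volume:
~~~
# The file name is arbitrary and removed immediately; "writable" is printed
# only if both operations succeed.
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  sh -c 'touch /prometheus/.permcheck && rm /prometheus/.permcheck && echo writable'
~~~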

Comment 15 Sunil Choudhary 2021-02-12 07:35:14 UTC
Thanks, Peter, for the confirmation; marking it verified.

Comment 18 errata-xmlrpc 2021-02-24 15:48:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

