Description of problem (please be as detailed as possible and provide log snippets):

The issue arose in longevity testing while running the script 'tests/e2e/longevity/test_stage4.py' in the PR: https://github.com/red-hat-storage/ocs-ci/pull/5943

Getting a Permission denied error while writing IO (or while creating any file) on the PVC of storage class - cephrbd, access mode - RWO and volume mode - FS.

```
~ $ fio --name=fio-rand-readwrite --filename=/mnt/fio_25 --readwrite=randrw --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=20 --size=500M --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --end_fsync=1
fio-rand-readwrite: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=4
fio-3.28
Starting 1 process
fio-rand-readwrite: Laying out IO file (1 file / 500MiB)
fio: pid=0, err=13/file:filesetup.c:174, func=open, error=Permission denied

Run status group 0 (all jobs):
```

This issue does not arise only with FIO; it also arises while creating a file.

```
~ $ touch /mnt/abc.txt
touch: /mnt/abc.txt: Permission denied
~ $ touch abc.txt
touch: abc.txt: Permission denied
```

```
~ $ ls -ld /mnt
drwxr-xr-x 3 root root 4096 Jun 20 09:29 /mnt
```

The IO operation is working fine on all the other PVC types:
- Cephfs - (RWO, RWX)
- Cephrbd - (RWO-block, RWX-block)

The pod on which FIO is performed was created using the following yaml:

```
---
apiVersion: v1
kind: Pod
metadata:
  name: perf-pod
  namespace: default
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: performance
      image: quay.io/ocsci/perf:latest
      imagePullPolicy: IfNotPresent
      command: ['/bin/sh']
      stdin: true
      tty: true
      volumeMounts:
        - name: mypvc
          mountPath: /mnt
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        runAsUser: 1000
        capabilities:
          drop:
            - ALL
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc
        readOnly: false
```

Version of all relevant components (if applicable):
OCP-4.10.15, ODF-4.10.3
OCP-4.11.0, ODF-4.11.0

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
Not able to perform IO on the PVC of storage class - cephrbd, access mode - RWO and volume mode - FS.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Always

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a PVC of storage class - cephrbd, access mode - RWO and volume mode - FS
2. Create a pod using the yaml provided in the description and attach it to this PVC
3. Run IO on the pod (a sketch of these steps is included at the end of this comment)

Actual results:
Permission denied error while running IO on the PVC of storage class - cephrbd, access mode - RWO and volume mode - FS. The details of the error can be found in the description above.

Expected results:
IO should run completely without any error on the PVC of storage class - cephrbd, access mode - RWO and volume mode - FS.

Additional info:
must-gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/tdesala-long-testd/tdesala-long-testd_20220525T080711/logs/failed_testcase_ocs_logs_1655465100/test_longevity_stage4_ocs_logs/ocs_must_gather/
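For reference, a minimal reproduction sketch of the steps above. The storage class name (ocs-storagecluster-ceph-rbd), the PVC size, and the perf-pod.yaml filename are assumptions, not taken from this report; the claim name `pvc` and namespace `default` match the pod yaml above.

```
# Hypothetical reproduction sketch -- storage class name and size are assumptions.
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 5Gi
EOF

# Create the pod from the yaml in the description (saved as perf-pod.yaml), then run IO inside it.
oc create -f perf-pod.yaml
oc rsh -n default perf-pod
# inside the pod, run the same fio command shown in the description
```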
The extra additions you are making to the pod spec via the below (i.e. trying to use non-privileged/completely restricted pod execution) look to be causing this:

https://github.com/red-hat-storage/ocs-ci/pull/5943/files#diff-c3679703e785f8a8ae14abbe4b97f354fc00aab50332d5592a9750e929d51d55R8
https://github.com/red-hat-storage/ocs-ci/pull/5943/files#diff-c3679703e785f8a8ae14abbe4b97f354fc00aab50332d5592a9750e929d51d55R24

If you are running the OCP setup with restricted or non-privileged pods, the SCC etc. has to be configured correctly. Can you try the longevity test without those changes in the pod yaml?

Also, as a second thing, please add the fsGroup* settings in the pod yaml and give it a try. An example snippet can be found here: https://bugzilla.redhat.com/show_bug.cgi?id=1988284#c2

The important/required part is the addition of `fsGroup` and `fsGroupChangePolicy` matching the `runAsUser`. You can leave out `selinuxOptions` though.

[...]
  securityContext:
    fsGroup: 1000510000
    fsGroupChangePolicy: OnRootMismatch
    runAsUser: 1000510000
...
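To confirm how the restricted pod was admitted, a quick check like the following can help. This is a hypothetical sketch; the pod name and namespace are taken from the yaml in the description.

```
# Which SCC admitted the pod (recorded in the openshift.io/scc annotation):
oc get pod perf-pod -n default -o yaml | grep 'openshift.io/scc'
# Effective pod-level securityContext that was applied:
oc get pod perf-pod -n default -o jsonpath='{.spec.securityContext}{"\n"}'
```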
Looks like a CI issue; moving to 4.12 while we work on the RCA.
After updating the yaml with the fsGroup and fsGroupChangePolicy params, IO runs completely without any error on the PVC of storage class - cephrbd, access mode - RWO and volume mode - FS. IO is also running fine on all the other PVCs.

Updated yaml:

```
---
apiVersion: v1
kind: Pod
metadata:
  name: perf-pod
  namespace: default
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 1000
    fsGroupChangePolicy: OnRootMismatch
  containers:
    - name: performance
      image: quay.io/ocsci/perf:latest
      imagePullPolicy: IfNotPresent
      command: ['/bin/sh']
      stdin: true
      tty: true
      volumeMounts:
        - name: mypvc
          mountPath: /mnt
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        fsGroupChangePolicy: OnRootMismatch
        capabilities:
          drop:
            - ALL
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: pvc
        readOnly: false
```
@Humble,

Even without the below fsGroup entries in the securityContext, we are able to write IO on all other supported PVC types successfully, without any issues/errors. The permission denied error is observed only with this specific volume: ceph-rbd-RWO. What could be the reason that we are seeing this issue only on ceph-rbd-RWO?

[...]
  securityContext:
    fsGroup: 1000510000
    fsGroupChangePolicy: OnRootMismatch
    runAsUser: 1000510000
...
(In reply to Prasad Desala from comment #5)
> @Humble,
>
> Even without the below fsGroup entries in the securityContext, we are able to
> write IO on all other supported PVC types successfully, without any
> issues/errors. The permission denied error is observed only with this
> specific volume: ceph-rbd-RWO. What could be the reason that we are seeing
> this issue only on ceph-rbd-RWO?

Because this RBD volume has a filesystem on top, issues with the filesystem need to be checked as well. In case the volume (or the RBD connection to Ceph) had problems, that could cause the filesystem to become read-only. You would need to inspect the kernel logs from the time the issue occurred. Moving the Pod to another node may show details about a corrupt filesystem too (mkfs execution in the csi-rbdplugin logs on the new node).

Logs of the node where the problem happened do not seem to be available, or at least I am not able to find them linked in this BZ. Steps to reproduce this (i.e. get a volume into this error state) in another environment or with another volume would help.
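A rough sketch of how those logs could be gathered. The node and pod names are placeholders, not taken from this BZ; the openshift-storage namespace and csi-rbdplugin container name are assumptions based on a default ODF deployment.

```
# Kernel log on the node where the pod ran, looking for the filesystem going
# read-only or RBD errors around the time of the failure:
oc debug node/<node-name> -- chroot /host journalctl -k --no-pager | grep -iE 'rbd|ext4|read-only'

# csi-rbdplugin logs on the node the pod moved to, looking for a mkfs run that
# would point at a corrupt/recreated filesystem:
oc logs -n openshift-storage <csi-rbdplugin-pod-on-that-node> -c csi-rbdplugin | grep -i mkfs
```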
Please reopen when we have enough data to move ahead.