Description of problem (please be as detailed as possible and provide log snippets):

While running the ocs-ci test 'tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expand_expanded_pvc', I see the rook-ceph-mgr-a Pod restarting. The test itself may pass, or it fails on a timeout of the fio command checking the successful second PVC expansion. I also watched 'ceph status' during test execution and saw health warnings, first because of slow OSDs and then because no mgr daemon was available. From monitoring I see that OOM kills have occurred, which I assume refer to the rook-ceph-mgr.

Version of all relevant components (if applicable):
OCS: 4.8.0-175.ci
OCP: 4.8.2 from candidate-4.8 stream
ocs-ci: stable-ocs-4.8-202107251413

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Possible user impact is that applications using a PVC which has been expanded twice may fail. Also, since the whole Ceph cluster seems affected and goes into a health warning state, further storage issues are possible.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, by running the respective test from ocs-ci.

Can this issue be reproduced from the UI?
Not tried.

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:
The test fails or passes, depending on timing, I assume. The Ceph cluster shows HEALTH_WARN during test execution and the rook-ceph-mgr-a Pod is restarted.

Expected results:
The test passes, no health warning during execution, and all Pods remain running.

Additional info:
* All logfiles, reports and must-gathers from a test run where the test passes, but restarts are detected: https://drive.google.com/file/d/1JSCj9id4hhF3cHmup7Pc2GNnWOX4dGMr/view?usp=sharing
* All logfiles, reports and must-gathers from a test run where the test fails, first on timeout of the fio command, then on deletion of the PVC: https://drive.google.com/file/d/1a0FtRYiK7g7f97XgNTeJrl8daHNgv9jg/view?usp=sharing
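For reference, I start the test through the usual ocs-ci entry point; roughly like this (cluster path, cluster name and config file are placeholders, and the exact flags may differ between ocs-ci versions):

run-ci --cluster-path /path/to/cluster-dir \
       --cluster-name <cluster-name> \
       --ocsci-conf <ocsci-config>.yaml \
       tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expand_expanded_pvc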
In the must-gather, the mgr-a pod does show that it was OOMKilled:

  lastState:
    terminated:
      containerID: cri-o://7c2637404d0f700c65f551d219794e83632d2c8458868b56b93c87d4113f9e9f
      exitCode: 137
      finishedAt: "2021-07-29T08:11:35Z"
      reason: OOMKilled
      startedAt: "2021-07-29T08:03:01Z"

Can you repro this outside of ocs-ci? Exactly what operations were being performed? Besides the expansion, were other operations completed recently or still in progress? To troubleshoot we really need to know how to repro. All I see in the must-gather is the oom-kill, which is just a side effect of the load.
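On a live cluster you can pull the same information directly from the pod status; a quick check along these lines (assuming the default openshift-storage namespace):

oc -n openshift-storage get pod -l app=rook-ceph-mgr \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated}'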
I will try to reproduce the behaviour outside of ocs-ci.
Unfortunately, so far I could not reproduce a similar situation, so possibly it is indeed a load issue. ocs-ci runs the following procedure in parallel for different PVCs from different OCS storage classes. The general recipe I went through manually so far is:

1. Create a 10Gi RWO PVC in the ocs-storagecluster-ceph-rbd storage class.
2. Create a Pod running quay.io/ocsci/nginx:latest as its only image and mounting the PVC at /var/lib/www/html.
3. rsh into the Pod with bash and run:
   - apt update
   - apt install fio
4. Run fio:
   fio --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_pre_expand --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
5. Once fio has finished, expand the PVC to 20Gi:
   oc patch PersistentVolumeClaim pvc-test -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'
6. Run another fio:
   fio --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_post_expand --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
7. Once finished, repeat step 5 to expand to 25Gi, then repeat step 6.

With this manual walkthrough for one PVC/Pod combination, I could not see mgr-a restarts or HEALTH_WARN states yet.
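For reference, the walkthrough above corresponds roughly to the following commands (the Pod name, container name and the combined YAML are my own illustration; the PVC name pvc-test matches the patch command above, and the storage class, image, mount path and fio options are as listed):

cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test
spec:
  containers:
  - name: nginx
    image: quay.io/ocsci/nginx:latest
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-test
EOF

# install fio inside the Pod (step 3)
oc exec pod-test -- bash -c 'apt update && apt install -y fio'

# fio before expansion (step 4)
oc exec pod-test -- fio --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_pre_expand \
  --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G \
  --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json

# first expansion (step 5) and fio after expansion (step 6)
oc patch PersistentVolumeClaim pvc-test -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'
oc exec pod-test -- fio --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_post_expand \
  --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G \
  --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json

# second expansion (step 7), then repeat the fio run from step 6
oc patch PersistentVolumeClaim pvc-test -p '{"spec": {"resources": {"requests": {"storage": "25Gi"}}}}'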
To further understand the issue, I am currently running the whole test from the ocs-ci suite on a cluster where I have separated the OCS nodes (infra) from the "normal" workers, so that the fio jobs are not executed on the same workers that run the OCS components. I still see the mgr restarted. This is the health warning as shown by 'ceph status':

  cluster:
    id:     b5034972-6644-4819-bd43-74343a5ba8ad
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            20521 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.1,osd.2,osd.3] have slow ops.

While the mgr is in CrashLoopBackOff state and restarting:

  cluster:
    id:     b5034972-6644-4819-bd43-74343a5ba8ad
    health: HEALTH_WARN
            no active mgr
            4825 slow ops, oldest one blocked for 457 sec, daemons [osd.0,osd.1,osd.3] have slow ops.

  services:
    mon: 3 daemons, quorum a,b,c (age 30m)
    mgr: no daemons active (since 2m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 29m), 4 in (since 29m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 12.24k objects, 46 GiB
    usage:   143 GiB used, 32 TiB / 32 TiB avail
    pgs:     176 active+clean

  io:
    client: 50 MiB/s wr, 0 op/s rd, 4.50k op/s wr

Once the mgr is up again, the health warning is cleared.
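I watch the cluster state from the Rook toolbox pod; roughly like this (enabling the toolbox via OCSInitialization is how I recall it from the OCS documentation, so treat the patch as an assumption):

oc patch OCSInitialization ocsinit -n openshift-storage --type json \
  --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'

# then query Ceph from the toolbox deployment
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status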
I am assuming that the mgr pod is still getting OOMKilled in the case above (comment 5), correct? If so, that is fairly strong evidence that this is a CI configuration issue.
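To confirm, it is worth comparing the mgr's memory limit with its actual usage; something along these lines (the deployment name rook-ceph-mgr-a and the app=rook-ceph-mgr label follow the usual Rook naming):

oc -n openshift-storage get deploy rook-ceph-mgr-a \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'

oc adm top pod -n openshift-storage -l app=rook-ceph-mgr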
https://bugzilla.redhat.com/show_bug.cgi?id=1987549#c4 and https://bugzilla.redhat.com/show_bug.cgi?id=1987549#c5 strongly suggest that this is a CI issue.