Description of problem (please be as detailed as possible and provide log snippets):

During tier1 test execution (tests/manage/pv_services/pvc_resize/test_pvc_expansion.py), OSD containers were OOM killed and the Ceph cluster went into an error state during the PVC expansion tests, leaving 7 PGs in "active+recovery_unfound+undersized+degraded+remapped" state in the "ocs-storagecluster-cephfilesystem-data0" pool.

OSDs have resource limits as follows:
  cpu: 2
  memory: 5Gi

Version of all relevant components (if applicable):
OCP 4.12
OCS 4.6.2 (4.6.2-233.ci)
Ceph 14.2.8-115.el8cp

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
All the test cases after the FIO tests have been skipped, as the Ceph cluster is in an unhealthy state.

Is there any workaround available to the best of your knowledge?
Will try to increase the memory limits of the OSDs and rerun the tests.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible? Yes

Can this issue be reproduced from the UI? NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install OCP 4.12 and OCS 4.6.2 (4.6.2-233.ci) with 4 workers.
2. The 4 OSDs should have resource limits as follows:
   Limits:
     cpu: 2
     memory: 5Gi
3. Run ocs-ci with the "tier1" marker as follows:
   run-ci -m 'tier1' --ocsci-conf config.yaml --cluster-path <cluster_path> --html=<path> --self-contained-html

Actual results:
OSDs are OOM killed during the tests and the Ceph cluster goes into an error state.

rook-ceph-osd-0-5799ff866-8gkl2    1/1   Running   4   15h
rook-ceph-osd-1-f9b7d58bd-7ln25    1/1   Running   5   15h
rook-ceph-osd-2-86767d765b-txz9s   1/1   Running   5   15h
rook-ceph-osd-3-55b9ccbd7-qhz84    1/1   Running   5   15h

sh-4.4# ceph -s
  cluster:
    id:     ea4006aa-b962-4b12-b0b6-440fd0095ae8
    health: HEALTH_ERR
            293/15401 objects unfound (1.902%)
            Possible data damage: 7 pgs recovery_unfound
            Degraded data redundancy: 2595/46203 objects degraded (5.617%), 7 pgs degraded, 7 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 15h)
    mgr: a(active, since 15h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 9h), 4 in (since 15h); 7 remapped pgs
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
      mds.ocs-storagecluster-cephfilesystem-a: idle
      mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 15.40k objects, 58 GiB
    usage:   137 GiB used, 1.8 TiB / 2.0 TiB avail
    pgs:     2595/46203 objects degraded (5.617%)
             293/15401 objects unfound (1.902%)
             169 active+clean
             7   active+recovery_unfound+undersized+degraded+remapped

  io:
    client: 1.1 KiB/s rd, 12 KiB/s wr, 2 op/s rd, 1 op/s wr

sh-4.4# ceph pg dump_stuck unclean
ok
PG_STAT  STATE                                                  UP       UP_PRIMARY  ACTING  ACTING_PRIMARY
4.18     active+recovery_unfound+undersized+degraded+remapped  [1,0,3]  1           [0,3]   0
4.8      active+recovery_unfound+undersized+degraded+remapped  [2,0,3]  2           [0,3]   0
4.b      active+recovery_unfound+undersized+degraded+remapped  [2,3,1]  2           [2,3]   2
4.2      active+recovery_unfound+undersized+degraded+remapped  [1,0,3]  1           [0,3]   0
4.10     active+recovery_unfound+undersized+degraded+remapped  [2,3,1]  2           [2,3]   2
4.14     active+recovery_unfound+undersized+degraded+remapped  [1,0,3]  1           [0,3]   0
4.11     active+recovery_unfound+undersized+degraded+remapped  [0,1,2]  0           [0,1]   0

sh-4.4# rados df
POOL_NAME  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD  WR_OPS  WR  USED COMPR  UNDER COMPR
.rgw.root  2.8 MiB  16  0  48  0  0  0  7165  8.7 MiB  28  24 KiB  0 B  0 B
ocs-storagecluster-cephblockpool  24 GiB  5620  0  16860  0  0  0  2237290  23 GiB  2128178  114 GiB  0 B  0 B
ocs-storagecluster-cephfilesystem-data0  109 GiB  9476  0  28428  293  293  2595  1909378  38 GiB  6164676  95 GiB  0 B  0 B
ocs-storagecluster-cephfilesystem-metadata  84 MiB  34  0  102  0  0  0  106737  62 MiB  2805  29 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data  384 KiB  2  0  6  0  0  0  166445  1.1 GiB  330054  1.8 GiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index  0 B  22  0  66  0  0  0  336134  328 MiB  183756  80 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec  0 B  0  0  0  0  0  0  0  0 B  0  0 B  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.control  0 B  8  0  24  0  0  0  0  0 B  0  0 B  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.log  6.4 MiB  211  0  633  0  0  0  85608  83 MiB  59341  4.1 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.meta  1.9 MiB  12  0  36  0  0  0  20333  17 MiB  9769  3.0 MiB  0 B  0 B

Expected results:
OSDs should not be OOM killed during the tests and the Ceph cluster should be in a healthy state.

sh-4.4# ceph -s
  cluster:
    id:     ea4006aa-b962-4b12-b0b6-440fd0095ae8
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 7m)
    mgr: a(active, since 6m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 6m), 4 in (since 6m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
      mds.ocs-storagecluster-cephfilesystem-a: idle
      mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 317 objects, 82 MiB
    usage:   4.1 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     176 active+clean

  io:
    client: 853 B/s rd, 6.3 KiB/s wr, 1 op/s rd, 0 op/s wr

Additional info:
Worker node resources are as follows:
  No. of workers: 4
  Memory: 64GB
  CPU: 16
  Disk: 500 GB each for OSDs

Attaching the container logs of the OSDs, the must-gather logs and the tier1 test execution log for more details:
https://drive.google.com/file/d/1hiiUlJNljo0KVubfbh82kmK1zgJ3WIHN/view?usp=sharing
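To confirm that the OSD restarts are indeed OOM kills, a minimal sketch (assuming the default openshift-storage namespace and the usual app=rook-ceph-osd pod label; <node-name> is a placeholder taken from the -o wide output):

  oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide
  oc -n openshift-storage describe pod rook-ceph-osd-0-5799ff866-8gkl2 | grep -A7 "Last State"
  # On the node that hosted the killed container:
  oc debug node/<node-name> -- chroot /host dmesg -T | grep -i "killed process"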
@sbalusu could you provide some more information about these tests? In what repository can I find the test definitions?
From looking at the must-gather, there is only so much information I have. I think this issue most properly belongs with Ceph.

Looking at rook-ceph-osd-0 as an example, one of the init containers issues the following command:

  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0

This command reports the following info:

  2021-01-25T22:09:52.167825497Z inferring bluefs devices from bluestore path
  2021-01-25T22:09:56.779923017Z 1 : device size 0x7d00000000 : own 0x[3c00000000~500000000] = 0x500000000 : using 0xb2b0000(179 MiB)
  2021-01-25T22:09:56.779923017Z Expanding...

This seems to be where the OOM kill is happening. The output should report more info after "Expanding...", as in the example below:

  inferring bluefs devices from bluestore path
  1 : device size 0x4ffe00000 : own 0x[11ff00000~40000000] = 0x40000000 : using 0x470000(4.4 MiB) : bluestore has 0x23fdd0000(9.0 GiB) available
  Expanding DB/WAL...
  Expanding Main...
  1 : expanding  from 0x27fe00000 to 0x4ffe00000
  1 : size label updated to 21472739328

The OOM kill happens in around 7 seconds, so memory usage reaches the 5Gi limit very quickly. ceph-bluestore-tool doesn't have any documented ability to set memory/CPU limits on its process.

The fix for this could take a few different shapes:
- Rook could run the expand-bluefs container without memory/CPU limits, but this could get other pods OOM killed on nodes that are near memory capacity. As a note, we should probably not allow multiple OSDs to expand simultaneously, since they will use lots of memory.
- The better long-term solution would be to make a change in Ceph to allow setting a memory limit on the `ceph-bluestore-tool bluefs-bdev-expand` command, which Rook could specify or which the tool could read from POD_MEMORY_LIMITS as the `ceph-osd` process does.
Do you mean the pvc expansion tests? You can find them in ocs-ci (https://github.com/red-hat-storage/ocs-ci) "tests/manage/pv_services/pvc_resize/test_pvc_expansion.py" as mentioned in the description.
Small update: I was reminded that Rook needs to set resource requests on all containers of a pod to get the benefits of pod priority, which Kubernetes gives only to pods where all containers have resource limits set. I think we need someone from the Ceph team to take a look at why bluefs expansion is using so much memory. I'll change the BZ's component to reflect this.
@sbalusu I am sure the Ceph team will wish to get more log data. I don't see any documented `verbose` or `log-level` options for `ceph-bluestore-tool` unfortunately. It would be a good exercise to try increasing the pod memory requests and limits to see if there is a value above which OOM killing doesn't happen. I might suggest trying 10Gi, 15Gi, 20Gi, etc. Also potentially of note, does this only happen on IBM-Z and not on regular x86_64? (I also updated the BZ status to NEW since I am passing this off to the Ceph team.)
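A sketch of how that experiment could look, assuming OSD resources are carried under spec.storageDeviceSets[0].resources of the default ocs-storagecluster StorageCluster CR (verify the exact field path against the CRD in your cluster, and use a JSON-patch "add" op if the resources field is not set yet); repeat with 15Gi, 20Gi, etc.:

  oc -n openshift-storage patch storagecluster ocs-storagecluster --type json -p \
    '[{"op":"replace","path":"/spec/storageDeviceSets/0/resources","value":{"limits":{"cpu":"2","memory":"10Gi"},"requests":{"cpu":"2","memory":"10Gi"}}}]'
  # After the operator rolls the OSD deployments, confirm the new limit:
  oc -n openshift-storage get deployment -l app=rook-ceph-osd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}'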
This is likely caused by lack of tcmalloc - https://bugzilla.redhat.com/show_bug.cgi?id=1917815
@brgardne, @jdurgin : sure, I can try with increased pod memory limits if I still see this behaviour after the tcmalloc issue is resolved.
@jdurgin, @brgardne : A question here please: does this also mean that PVC disk expansion is not supported on IBM Z until the tcmalloc/OOM issue (https://bugzilla.redhat.com/show_bug.cgi?id=1917815) is resolved? If so, this needs to be documented in the release notes before GA, and I will inform the team to add this to the documentation.
(In reply to Sravika from comment #10)
> @jdurgin, @brgardne : A question here please, does this also mean that the
> pvc disk expansion is not supported on IBM Z until the tcmalloc/OOM issue
> (https://bugzilla.redhat.com/show_bug.cgi?id=1917815) is resolved? If so,
> this needs to be documented in the release notes before GA and will inform
> the team to add this in the documentation.

Nothing should be supported until tcmalloc linking is resolved.
Same as https://bugzilla.redhat.com/show_bug.cgi?id=1917815, providing dev ack. tcmalloc linking BZ is already ON_QA
The tcmalloc fix is in the build, but we are stuck because of https://bugzilla.redhat.com/show_bug.cgi?id=1928471. Moving this to MODIFIED.
Elad, please provide qa_ack. This should be fixed with the tcmalloc library.
@tstober @sbalusu Can this be moved to verified?
I ran the command:

  run-ci -m 'tier1' --ocsci-conf config.yaml --cluster-path <cluster_path> --html=<path> --self-contained-html

100% of the tests passed and Ceph health was HEALTH_OK after the test. The run failed after that because I used the wrong value for the html path. What should I use for that path in this command? Is the result I got sufficient to demonstrate that this bug is fixed? Thanks
I have OCP 4.7.1 and OCS 4.7 installed. I ran the following command:

  run-ci -m "tier1" --ocsci-conf ~/ocs-ci/config.yaml --cluster-path /root/ocp4-workdir/ tests --no-print-logs --capture=no --html ~/testtier1Fixed_17thMarch2021.html --self-contained-html

I got 2 failures related to fio: 1) fio could not be found and 2) fio was not executed. There are also 52 error messages saying that 'routes.route.openshift.io "rook-ceph-rgw-ocs-storagecluster-cephobjectstore" already exists'.
My OSDs didn't have any problems after the tier1 run:

rook-ceph-osd-0-6f779b8d65-p9f2x   2/2   Running   1   23h
rook-ceph-osd-1-7c46b87cb5-fvjp7   2/2   Running   0   23h
rook-ceph-osd-2-687dcd65f7-b6vqk   2/2   Running   0   23h
rook-ceph-osd-3-fddd688d8-dp8q5    2/2   Running   0   12h
rook-ceph-osd-4-84d6c9797b-g9qw9   2/2   Running   0   12h
rook-ceph-osd-5-65bdfc8b6c-zw6j6   2/2   Running   0   12h
I would say after my test run, this BZ can be closed as "resolved".
This issue has been fixed in the 4.7 release, and the PVC expansion tests in tier1 run through successfully.
verified
Running the tier1 test on OCP 4.7.6 (and 4.7.8) on RHCOS 4.7.7 OOM kills rook-ceph-mgr-a again.

From describe rook-ceph-mgr-a-8545946864-mrxgs:

    State:          Running
      Started:      Thu, 29 Apr 2021 15:57:43 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 29 Apr 2021 15:51:24 +0200
      Finished:     Thu, 29 Apr 2021 15:54:58 +0200
    Ready:          True
    Restart Count:  6

The scenario is to run 'tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expand_expanded_pvc'. After the second PVC expansion, the respective fio job which tests the expanded filesystem does not start, does not execute, or just hangs (not really clear which). The test eventually fails with a command timeout after 600s. This happens in the pod which mounts the PVC from the ocs-storagecluster-ceph-rbd StorageClass in Filesystem mode.

Uploaded logfiles from the test run (test-logs-1619703324) and logs from a successful tier1 run: https://drive.google.com/file/d/1e25f1fEhjC8ZtyDchDW5mjdysrchTW-M/view?usp=sharing
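For tracking the mgr memory consumption while the test runs, a rough sketch (assumes the metrics API is available and the usual app=rook-ceph-mgr pod label):

  watch -n 30 'oc adm top pod -n openshift-storage -l app=rook-ceph-mgr --containers'
  # Check whether the last restart was an OOM kill:
  oc -n openshift-storage get pod -l app=rook-ceph-mgr \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'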
@rcyriac : Could you please reopen the bug, as it has been reproduced again on the very latest version according to @mschaefe.

@mschaefe: Could you also mention the OCP, OCS and RHCOS versions where the PVC expansion test in the tier1 run passed?
Blaine/Josh, PTAL
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1929188, https://bugzilla.redhat.com/show_bug.cgi?id=1945016, and https://bugzilla.redhat.com/show_bug.cgi?id=1953430. The issues happen around FIO testing.

Do the nodes become NotReady at any time during the test?
@mschaefe: I think I see from the recorded events that this doesn't happen, but please confirm.

It may be that we need to increase the recommended resource requirements/limits for Ceph MGR pods. The limits from the test are 3Gi.
@mschaefe: As a debugging measure, increase the limits on the mgr pods to 6Gi to see if the issue persists.

@jdurgin: Could there be a mgr plugin that is consuming too many resources, or a memory leak somewhere in the mgr? How far above 3Gi do you expect MGR resource usage to go?
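A sketch of that debugging step, assuming the mgr resources are set through the "mgr" key of the StorageCluster resources map (verify against the CR in the cluster before applying; a JSON merge patch leaves the other fields untouched):

  oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p \
    '{"spec":{"resources":{"mgr":{"limits":{"memory":"6Gi"},"requests":{"memory":"6Gi"}}}}}'
  # Confirm the rolled-out mgr deployment picked up the new values:
  oc -n openshift-storage get deployment rook-ceph-mgr-a -o jsonpath='{.spec.template.spec.containers[0].resources}'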
Sravika, please let us know whether this is being reproduced only on IBM Z systems or on both P and Z systems. We already have BZ #1929188 where two cases are mentioned:
1. Pods restarting on P systems - this is happening because one or more worker nodes are going down (BZ #1945016).
2. Pods restarting on Z systems - here we don't know the root cause yet, but the worker nodes are not going down.

So, if this BZ is reproducible only on IBM Z systems, then it is a duplicate of BZ #1929188 and we should focus on why this is happening only on Z systems.
@sbalusu Cannot find the RHCOS version for the PASSing runs, but OCS was 4.7.0-801.ci and OCP was 4.7.4. The FAILing versions are: RHCOS 4.7.7, OCP 4.7.8, OCS 4.7.0-801.ci. So the OCP version differs, and very likely the RHCOS version too. I will repeat the test runs with a 6G memory limit; however, we found that the OCS version I should use for the testing is OCS 4.7.0-372.ci.
@Mudit: Some background info: this issue was initially observed and reported on OCS 4.6. With the tcmalloc fix delivered in 4.7 it was resolved, and the test passed when @mschaefe initially verified on OCS 4.7 and OCP 4.7.4. However, the issue has come back in the last test execution, as mentioned in the comment above by @mschaefe. I am not sure whether it is reproducible on IBM P; @svenkat, could you please comment on this? Thank you.
Uploaded results from the last test run to Google Drive: https://drive.google.com/file/d/1gjVHqCwgwpNztkr_jMWxiCfQfIYt5hI_/view?usp=sharing

OCP version: 4.7.8
OCS version: 4.7.0-364.ci
RHCOS: 4.7.7-s390x

For convenience, I added the output of some oc commands (*.out files) to the uploaded zip file, too.

Observations:
- The test case keeps failing with a ThreadExecutor timeout on the fio command for the test pod which writes to the PVC in Filesystem mode.
- Other pods can successfully finish their fio jobs on their double-expanded PVCs.
- rook-ceph-mgr-a is NOT restarting with a memory limit AND request size of 6GB.
- must_gather seems to be failing for OCP; it is not clear why.

For further verification, I am planning to set up a regression test on RHCOS 4.7.0.
Unfortunately, it seems I cannot reproduce the PASSing test. However, logs for the PASSing tier1 suite can be found here: https://drive.google.com/file/d/1zmV9Bv4J72GR3zDPcSZSINAZPZSaD7dE/view?usp=sharing

Latest run environment:
- OCP 4.7.4
- OCS 4.7.0-377.ci
- RHCOS 4.7.0
- ocs-ci commit 34ce7028ac1736234e011d76e5c6c2b07b0ebeea

Logfiles from that run are uploaded here: https://drive.google.com/file/d/1J633IWjLgQ4uieFO8TGj2U_JsxXit7wg/view?usp=sharing

Observations:
- The test is again FAILing with a timeout on the fio command which writes to the Filesystem PVC.
- rook-ceph-mgr-a is NOT restarting, although the memory limit/request was kept at 3G.
- Again, must_gather reported some failures.
Elad,
From Michael's comment (https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c36) it seems like the pods are not restarting but the tests are failing. Can someone from the CI squads check why the tests are failing?

Venkat,
As asked in https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c33 and https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c35, do you know if this is reproducible on P systems or not?

Honestly, this gets confusing with every update.
==> Initially the bug was filed on 4.6 and pods were restarting. [01/26]
==> We applied the tcmalloc fix. [02/03]
==> Tests started passing with OCS 4.7 and the bug was marked VERIFIED. [04/28]
==> Then the tests started failing again, with pods getting restarted (but nothing changed in OCS 4.7/RHCS which could cause this). [04/30]
==> Now, in the recent runs, tests are failing but pods are not restarting. [05/04]

Can we also verify this manually instead of relying on the CI runs? Raz, please share if you have any suggestions.
Ok, I analysed the failure from the last run. It is failing while deleting a PVC:

E TimeoutError: Timeout when waiting for pvc-66a8c2ac-494d-480c-9789-5001e67befbc to delete. Describe output: Name: pvc-66a8c2ac-494d-480c-9789-5001e67befbc
E Labels:          <none>
E Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
E Finalizers:      [kubernetes.io/pv-protection]
E StorageClass:    ocs-storagecluster-ceph-rbd
E Status:          Released
E Claim:           namespace-test-c6dd50de4a0545dcb1ece68b0/pvc-test-8d58774ffecc44fda86a34190079db2
E Reclaim Policy:  Delete
E Access Modes:    RWO
E VolumeMode:      Filesystem
E Capacity:        25Gi
E Node Affinity:   <none>
E Message:
E Source:
E     Type:              CSI (a Container Storage Interface (CSI) volume source)
E     Driver:            openshift-storage.rbd.csi.ceph.com
E     FSType:            ext4
E     VolumeHandle:      0001-0011-openshift-storage-0000000000000001-ec0e973f-acb6-11eb-b03f-0a580a81020c
E     ReadOnly:          false
E     VolumeAttributes:  clusterID=openshift-storage
E                        csi.storage.k8s.io/pv/name=pvc-66a8c2ac-494d-480c-9789-5001e67befbc
E                        csi.storage.k8s.io/pvc/name=pvc-test-8d58774ffecc44fda86a34190079db2
E                        csi.storage.k8s.io/pvc/namespace=namespace-test-c6dd50de4a0545dcb1ece68b0
E                        imageFeatures=layering
E                        imageFormat=2
E                        imageName=csi-vol-ec0e973f-acb6-11eb-b03f-0a580a81020c
E                        journalPool=ocs-storagecluster-cephblockpool
E                        pool=ocs-storagecluster-cephblockpool
E                        radosNamespace=
E                        storage.kubernetes.io/csiProvisionerIdentity=1620114679356-8081-openshift-storage.rbd.csi.ceph.com
E Events:
E   Type     Reason              Age                 From     Message
E   ----     ------              ----                ----     -------
E   Warning  VolumeFailedDelete  55s (x9 over 3m5s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-55b9744ff9-bxx67_057881f5-24f0-4e03-ab27-4f5f489748fb  rpc error: code = Internal desc = rbd csi-vol-ec0e973f-acb6-11eb-b03f-0a580a81020c is still being used

If you check the logs for the CSI plugin, it says that a request to unstage the volume came in, but the response never came back from the storage, or probably we didn't wait long enough for the response. This might be expected because the test case expands the volume, writes data, and deletes it immediately. Need to check the similar bugs we closed in the recent past; this could be another instance.
Plugin logs, as mentioned in the last update: Req ID 1207 (NodeUnstageVolume) never came back within the required time, as there was still data to sync.

2021-05-04T09:29:35.022089170Z I0504 09:29:35.022066 4225 utils.go:132] ID: 1207 Req-ID: 0001-0011-openshift-storage-0000000000000001-ec0e973f-acb6-11eb-b03f-0a580a81020c GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-05-04T09:29:35.022112206Z I0504 09:29:35.022100 4225 utils.go:133] ID: 1207 Req-ID: 0001-0011-openshift-storage-0000000000000001-ec0e973f-acb6-11eb-b03f-0a580a81020c GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-66a8c2ac-494d-480c-9789-5001e67befbc/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-ec0e973f-acb6-11eb-b03f-0a580a81020c"}
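To check whether the RBD image behind this request still has active watchers at that point, a possible check from the toolbox (assumes the rook-ceph-tools deployment is enabled; pool and image names are the ones reported in the PV above):

  oc -n openshift-storage exec deploy/rook-ceph-tools -- \
    rbd status ocs-storagecluster-cephblockpool/csi-vol-ec0e973f-acb6-11eb-b03f-0a580a81020c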
(In reply to Blaine Gardner from comment #32)
> @jdurgin: Could there be a mgr plugin that is consuming too many resources
> or a memory leak somewhere in the mgr? How much do you expect the MGR
> resource usage to be above 3Gi?

We haven't seen any cases of the mgr requiring a lot of memory. The initial report showed osd/mon/mgr all crashed - do we have any data to say which process was using more memory / reaching its memory limit? If it was the mgr, reproducing with different mgr modules disabled to isolate the cause would be the next step.
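A possible way to do that isolation, assuming the rook-ceph-tools toolbox deployment is available (<module-name> is a placeholder; the always-on modules cannot be disabled):

  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph mgr module ls
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph mgr module disable <module-name>
  # re-run the reproducer, then re-enable before trying the next module:
  oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph mgr module enable <module-name>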
@Mudit: Similar behaviour has been observed in one of the scale tests, where the test errored because PVC deletion was not successful during teardown. Although the PVC has been deleted, the CSI volume was not unstaged and the data sync to the pool was still in progress. Uploaded the logs of the test case and the must-gather logs to the drive:
https://drive.google.com/file/d/1lKBWboEyYedjjMGXUxtPlz5vra8e2ayM/view?usp=sharing

# oc get pvc -A | grep pvc-test-2266586c3a6e48109eafdf76ab884e
#
# oc get pv -A | grep pvc-8b0ce676-4e18-4c38-a3cb-2178667ecf87
pvc-8b0ce676-4e18-4c38-a3cb-2178667ecf87  1Ti  RWO  Delete  Released  namespace-test-62180a33c675434d88ca31af9/pvc-test-2266586c3a6e48109eafdf76ab884e3  ocs-storagecluster-ceph-rbd  15m

Before test execution:

sh-4.4# rados df
POOL_NAME  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD  WR_OPS  WR  USED COMPR  UNDER COMPR
.rgw.root  2.8 MiB  16  0  48  0  0  0  2034  2.3 MiB  28  24 KiB  0 B  0 B
ocs-storagecluster-cephblockpool  11 GiB  1498  0  4494  0  0  0  435904  1.6 GiB  167964  21 GiB  0 B  0 B
ocs-storagecluster-cephfilesystem-data0  23 GiB  1949  0  5847  0  0  0  857048  3.3 GiB  304270  41 GiB  0 B  0 B
ocs-storagecluster-cephfilesystem-metadata  28 MiB  55  0  165  0  0  0  14427  17 MiB  3863  11 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data  192 KiB  1  0  3  0  0  0  2087  2.3 MiB  3623  302 KiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index  0 B  22  0  66  0  0  0  5600  5.5 MiB  7706  876 KiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec  0 B  0  0  0  0  0  0  0  0 B  0  0 B  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.control  0 B  8  0  24  0  0  0  0  0 B  0  0 B  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.log  6.6 MiB  181  0  543  0  0  0  14114  14 MiB  9545  1.1 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.meta  1.9 MiB  12  0  36  0  0  0  5540  4.7 MiB  2929  923 KiB  0 B  0 B

total_objects  3742

After test execution, the usage of the "ocs-storagecluster-cephblockpool" pool still keeps increasing:

sh-4.4# rados df
POOL_NAME  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD  WR_OPS  WR  USED COMPR  UNDER COMPR
.rgw.root  2.8 MiB  16  0  48  0  0  0  14298  17 MiB  28  24 KiB  0 B  0 B
ocs-storagecluster-cephblockpool  269 GiB  49619  0  148857  0  0  0  440143  1.6 GiB  2051219  29 GiB  0 B  0 B
ocs-storagecluster-cephfilesystem-data0  23 GiB  1949  0  5847  0  0  0  857048  3.3 GiB  304270  41 GiB  0 B  0 B
ocs-storagecluster-cephfilesystem-metadata  28 MiB  55  0  165  0  0  0  137924  77 MiB  3871  11 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data  192 KiB  1  0  3  0  0  0  9955  11 MiB  17111  1.4 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index  0 B  22  0  66  0  0  0  26418  26 MiB  36628  4.0 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec  0 B  0  0  0  0  0  0  0  0 B  0  0 B  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.control  0 B  8  0  24  0  0  0  0  0 B  0  0 B  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.log  6.6 MiB  213  0  639  0  0  0  66179  64 MiB  44837  5.1 MiB  0 B  0 B
ocs-storagecluster-cephobjectstore.rgw.meta  1.9 MiB  12  0  36  0  0  0  26779  23 MiB  13871  4.2 MiB  0 B  0 B

total_objects  51895
total_used     297 GiB
total_avail    5.7 TiB
total_space    6 TiB
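One way to watch whether data is still being flushed into the pool and whether the backing image eventually gets removed, assuming the rook-ceph-tools toolbox deployment is available (pool name as reported above):

  oc -n openshift-storage exec deploy/rook-ceph-tools -- rbd ls ocs-storagecluster-cephblockpool
  oc -n openshift-storage exec deploy/rook-ceph-tools -- rbd trash ls ocs-storagecluster-cephblockpool
  watch -n 60 'oc -n openshift-storage exec deploy/rook-ceph-tools -- rados df'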
Looking further into test_expand_expanded_pvc, I isolated the actions taken for the pod using the PVC from the ocs-storagecluster-ceph-rbd StorageClass in Filesystem mode and ran the respective fio jobs manually. While doing so, no errors or hangups occurred and all the fio jobs terminated in a reasonable time.
Following the hints given above, I modified test_expand_expanded_pvc by adding a two-minute delay between expanding the PVC the second time and actually starting the fio jobs.

Observations:
- With that little pause, the test is PASSing.
- It is, however, now erroring in teardown when trying to remove the PV for the PVC in the ocs-storagecluster-ceph-rbd StorageClass - same as noted above.
- In an earlier run, I saw that the rook-ceph-mgr pod must have been restarted (but I didn't notice it in time).
- When the fio jobs are started by the ThreadExecutors, the cluster goes into HEALTH_WARN state, showing:
    2 MDSs report slow metadata IOs
    485 slow ops, oldest one blocked for 160 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
  The number of slow ops and the blocking time steadily increase until they begin going down after some minutes. As soon as this happens, I can see increasing write throughput in the "io: client:" section of the ceph -s output.
- This HEALTH_WARN state occurs whenever the test starts the fio jobs on the test pods.

Added this pause to the test:

[schaefm@m1308001 pvc_resize]$ git diff test_pvc_expansion.py
diff --git a/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py b/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py
index 5ccdd8be..22323227 100644
--- a/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py
+++ b/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py
@@ -1,5 +1,7 @@
 import logging
 import pytest
+import time
+
 from concurrent.futures import ThreadPoolExecutor
 
 from ocs_ci.ocs import constants
@@ -244,6 +246,8 @@ class TestPvcExpand(ManageTest):
         log.info("Expanding all PVCs for the second time.")
         self.expand_and_verify(pvc_size_expanded_2)
 
+        log.info("MS/OCPonZ: Sleeping two minutes to let things calm down.")
+        time.sleep(120)
         # Run IO and verify
         log.info("Starting post-second-expand IO on all pods.")
         self.run_io_and_verify(6, "post_expand")
Rejy, I agree with the above comment. This BZ can be moved to VERIFIED again. We don't even need to open a new BZ; the new issue we are hitting here (mon/mgr pods restarting) is already being looked at via BZ #1929188, and we can continue the investigation/discussion there.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days