Bug 1920498 - [IBM Z] OSDs are OOM killed and storage cluster goes into error state during ocs-ci tier1 pvc expansion tests
Summary: [IBM Z] OSDs are OOM killed and storage cluster goes into error state during ocs-ci tier1 pvc expansion tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Josh Durgin
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On: 1917815
Blocks:
 
Reported: 2021-01-26 12:22 UTC by Sravika
Modified: 2023-09-15 00:59 UTC (History)
18 users

Fixed In Version: 4.7.0-278.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-19 09:18:35 UTC
Embargoed:




Links:
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:19:02 UTC)

Description Sravika 2021-01-26 12:22:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

During the tier1 test execution (tests/manage/pv_services/pvc_resize/test_pvc_expansion.py), OSD containers were OOM killed and the Ceph cluster went into an error state during the PVC expansion tests, leaving 7 PGs in the "active+recovery_unfound+undersized+degraded+remapped" state in the "ocs-storagecluster-cephfilesystem-data0" pool.


OSDs have resource limits as follows:
      cpu:     2
      memory:  5Gi

Version of all relevant components (if applicable):

OCP 4.12
OCS 4.6.2 (4.6.2-233.ci) 
Ceph 14.2.8-115.el8cp

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

All the test cases after the FIO tests were skipped because the Ceph cluster is in an unhealthy state.


Is there any workaround available to the best of your knowledge?

Will try to increase the memory limits of OSDs and rerun the tests

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP 4.12 and OCS 4.6.2 (4.6.2-233.ci) with 4 workers
2. The 4 OSDs should have resource limits as follows:
      Limits:
      cpu:     2
      memory:  5Gi

3. Run ocs-ci with "tier1" marker as follows:
run-ci -m 'tier1' --ocsci-conf config.yaml --cluster-path <cluster_path> --html=<path> --self-contained-html  

Actual results:

OSDs are OOM killed during the tests and the Ceph cluster goes into an error state.

rook-ceph-osd-0-5799ff866-8gkl2                                   1/1     Running     4          15h
rook-ceph-osd-1-f9b7d58bd-7ln25                                   1/1     Running     5          15h
rook-ceph-osd-2-86767d765b-txz9s                                  1/1     Running     5          15h
rook-ceph-osd-3-55b9ccbd7-qhz84                                   1/1     Running     5          15h
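
A minimal sketch (assuming the kubernetes Python client and a kubeconfig with read access to the openshift-storage namespace) of one way to confirm which containers were OOM killed from the pod status:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("openshift-storage").items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated if cs.last_state else None
        if term and term.reason == "OOMKilled":
            # Report the container, its restart count and when it was last killed.
            print(f"{pod.metadata.name}/{cs.name}: "
                  f"restarts={cs.restart_count}, last OOMKilled at {term.finished_at}")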


sh-4.4# ceph -s
  cluster:
    id:     ea4006aa-b962-4b12-b0b6-440fd0095ae8
    health: HEALTH_ERR
            293/15401 objects unfound (1.902%)
            Possible data damage: 7 pgs recovery_unfound
            Degraded data redundancy: 2595/46203 objects degraded (5.617%), 7 pgs degraded, 7 pgs undersized

  services:
    mon: 3 daemons, quorum a,b,c (age 15h)
    mgr: a(active, since 15h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 9h), 4 in (since 15h); 7 remapped pgs
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 15.40k objects, 58 GiB
    usage:   137 GiB used, 1.8 TiB / 2.0 TiB avail
    pgs:     2595/46203 objects degraded (5.617%)
             293/15401 objects unfound (1.902%)
             169 active+clean
             7   active+recovery_unfound+undersized+degraded+remapped

  io:
    client:   1.1 KiB/s rd, 12 KiB/s wr, 2 op/s rd, 1 op/s wr


sh-4.4# ceph pg dump_stuck unclean
ok
PG_STAT STATE                                                UP      UP_PRIMARY ACTING ACTING_PRIMARY
4.18    active+recovery_unfound+undersized+degraded+remapped [1,0,3]          1  [0,3]              0
4.8     active+recovery_unfound+undersized+degraded+remapped [2,0,3]          2  [0,3]              0
4.b     active+recovery_unfound+undersized+degraded+remapped [2,3,1]          2  [2,3]              2
4.2     active+recovery_unfound+undersized+degraded+remapped [1,0,3]          1  [0,3]              0
4.10    active+recovery_unfound+undersized+degraded+remapped [2,3,1]          2  [2,3]              2
4.14    active+recovery_unfound+undersized+degraded+remapped [1,0,3]          1  [0,3]              0
4.11    active+recovery_unfound+undersized+degraded+remapped [0,1,2]          0  [0,1]              0



sh-4.4# rados df
POOL_NAME                                                USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED  RD_OPS      RD  WR_OPS      WR USED COMPR UNDER COMPR
.rgw.root                                             2.8 MiB      16      0     48                  0       0        0    7165 8.7 MiB      28  24 KiB        0 B         0 B
ocs-storagecluster-cephblockpool                       24 GiB    5620      0  16860                  0       0        0 2237290  23 GiB 2128178 114 GiB        0 B         0 B
ocs-storagecluster-cephfilesystem-data0               109 GiB    9476      0  28428                293     293     2595 1909378  38 GiB 6164676  95 GiB        0 B         0 B
ocs-storagecluster-cephfilesystem-metadata             84 MiB      34      0    102                  0       0        0  106737  62 MiB    2805  29 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data   384 KiB       2      0      6                  0       0        0  166445 1.1 GiB  330054 1.8 GiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index      0 B      22      0     66                  0       0        0  336134 328 MiB  183756  80 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec     0 B       0      0      0                  0       0        0       0     0 B       0     0 B        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.control            0 B       8      0     24                  0       0        0       0     0 B       0     0 B        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.log            6.4 MiB     211      0    633                  0       0        0   85608  83 MiB   59341 4.1 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.meta           1.9 MiB      12      0     36                  0       0        0   20333  17 MiB    9769 3.0 MiB        0 B         0 B



Expected results:

OSDs should not be OOM killed during the tests, and the Ceph cluster should remain in a healthy state.

sh-4.4# ceph -s
  cluster:
    id:     ea4006aa-b962-4b12-b0b6-440fd0095ae8
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 7m)
    mgr: a(active, since 6m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-b=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 6m), 4 in (since 6m)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 317 objects, 82 MiB
    usage:   4.1 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     176 active+clean

  io:
    client:   853 B/s rd, 6.3 KiB/s wr, 1 op/s rd, 0 op/s wr


Additional info:

Worker node resources are as follows:

No. of workers: 4
Memory: 64GB
CPU: 16
Disk: 500 GB each for OSDs


Attaching the container logs of the OSDs, the must-gather logs, and the tier1 test execution log for more details.

https://drive.google.com/file/d/1hiiUlJNljo0KVubfbh82kmK1zgJ3WIHN/view?usp=sharing

Comment 2 Blaine Gardner 2021-02-02 17:00:35 UTC
@sbalusu could you provide some more information about these tests? In what repository can I find the test definitions?

Comment 3 Blaine Gardner 2021-02-02 17:27:19 UTC
From looking at the must-gather, I only have so much information. I think this is an issue that most properly belongs with the Ceph team.

Looking at rook-ceph-osd-0 as an example, one of the init containers issues the following command:
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0

This command reports the following info:
  2021-01-25T22:09:52.167825497Z inferring bluefs devices from bluestore path
  2021-01-25T22:09:56.779923017Z 1 : device size 0x7d00000000 : own 0x[3c00000000~500000000] = 0x500000000 : using 0xb2b0000(179 MiB)
  2021-01-25T22:09:56.779923017Z Expanding...

This seems to be where the OOM kill is happening. The output should report more info after "Expanding..." as in the example below.
  inferring bluefs devices from bluestore path
  1 : device size 0x4ffe00000 : own 0x[11ff00000~40000000] = 0x40000000 : using 0x470000(4.4 MiB) : bluestore has 0x23fdd0000(9.0 GiB) available
  Expanding DB/WAL...
  Expanding Main...
  1 : expanding  from 0x27fe00000 to 0x4ffe00000
  1 : size label updated to 21472739328

The OOM kill seems to happen within around 7 seconds, so the memory usage reaches 5Gi pretty quickly. ceph-bluestore-tool doesn't have any documented ability to specify memory/CPU limits for its process. The fix for this could take a few different shapes:
 - Rook could run the expand-bluefs container without memory/CPU limits, but this could OOM kill other pods on nodes that are near memory capacity
   - as a note, we should probably not allow multiple OSDs to expand simultaneously since they will use lots of memory
 - The better long-term solution would be to make a change in Ceph to allow setting a memory limit on the `ceph-bluestore-tool bluefs-bdev-expand` command, which Rook could specify or which the tool could read from POD_MEMORY_LIMITS as the `ceph-osd` process does.
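
A rough, hypothetical sketch of that second option, assuming the limit is exposed to the container as a plain byte count in POD_MEMORY_LIMITS (the variable name, the 0.8 safety factor, and the default are assumptions, not existing Ceph behaviour):

import os

def derive_memory_target(default_bytes=4 * 1024**3, safety_factor=0.8):
    """Return a memory cap in bytes derived from the pod limit, if one is set."""
    raw = os.environ.get("POD_MEMORY_LIMITS")
    if not raw:
        return default_bytes
    # Leave headroom below the cgroup limit so transient allocation spikes
    # don't trip the kernel OOM killer.
    return int(int(raw) * safety_factor)

if __name__ == "__main__":
    print(f"memory target: {derive_memory_target()} bytes")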

Comment 4 Sravika 2021-02-02 17:28:22 UTC
Do you mean the pvc expansion tests? You can find them in ocs-ci (https://github.com/red-hat-storage/ocs-ci) "tests/manage/pv_services/pvc_resize/test_pvc_expansion.py" as 
mentioned in the description.

Comment 5 Blaine Gardner 2021-02-02 17:53:52 UTC
Small update: I was reminded that Rook needs to set resource requests on all containers of a pod to get the benefits of pod priority, which Kubernetes gives only to pods for which all containers have resource limits set. I think we need someone from the Ceph team to take a look at why bluefs expansion is using so much memory. I'll change the BZ's component to reflect this.

Comment 6 Blaine Gardner 2021-02-02 18:01:52 UTC
@sbalusu I am sure the Ceph team will wish to get more log data. I don't see any documented `verbose` or `log-level` options for `ceph-bluestore-tool` unfortunately. It would be a good exercise to try increasing the pod memory requests and limits to see if there is a value above which OOM killing doesn't happen. I might suggest trying 10Gi, 15Gi, 20Gi, etc. 
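
As a rough illustration of that stepping approach (not something from this BZ), one could patch the StorageCluster CR between runs; the CR path below (spec/storageDeviceSets/0/resources) is an assumption and may differ between OCS versions:

import json
import subprocess

def set_osd_memory_limit(size):
    """Patch the OSD memory request/limit in the StorageCluster CR (assumed path)."""
    patch = [
        {"op": "replace",
         "path": "/spec/storageDeviceSets/0/resources/limits/memory",
         "value": size},
        {"op": "replace",
         "path": "/spec/storageDeviceSets/0/resources/requests/memory",
         "value": size},
    ]
    subprocess.run(
        ["oc", "-n", "openshift-storage", "patch", "storagecluster",
         "ocs-storagecluster", "--type", "json", "-p", json.dumps(patch)],
        check=True,
    )

# Try progressively larger limits between tier1 runs, e.g.:
set_osd_memory_limit("10Gi")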

Also potentially of note, does this only happen on IBM-Z and not on regular x86_64?

(I also updated the BZ status to NEW since I am passing this off to the Ceph team.)

Comment 7 Josh Durgin 2021-02-03 01:34:52 UTC
This is likely caused by lack of tcmalloc - https://bugzilla.redhat.com/show_bug.cgi?id=1917815

Comment 8 Sravika 2021-02-03 09:13:34 UTC
@brgardne, @jdurgin : sure, I can try with increased pod memory limits if I still see this behaviour after the tcmalloc issue is resolved.

Comment 9 Sravika 2021-02-05 14:36:53 UTC
@jdurgin, @

Comment 10 Sravika 2021-02-05 14:42:31 UTC
@jdurgin, @brgardne : A question here, please: does this also mean that PVC disk expansion is not supported on IBM Z until the tcmalloc/OOM issue (https://bugzilla.redhat.com/show_bug.cgi?id=1917815) is resolved? If so, this needs to be documented in the release notes before GA, and I will inform the team to add this to the documentation.

Comment 11 Josh Durgin 2021-02-05 23:40:42 UTC
(In reply to Sravika from comment #10)
> @jdurgin, @brgardne : A question here, please: does this also mean that PVC
> disk expansion is not supported on IBM Z until the tcmalloc/OOM issue
> (https://bugzilla.redhat.com/show_bug.cgi?id=1917815) is resolved? If so,
> this needs to be documented in the release notes before GA, and I will
> inform the team to add this to the documentation.

Nothing should be supported until tcmalloc linking is resolved.

Comment 12 Mudit Agarwal 2021-02-19 08:54:26 UTC
Same as https://bugzilla.redhat.com/show_bug.cgi?id=1917815; providing dev ack. The tcmalloc linking BZ is already ON_QA.

Comment 13 Mudit Agarwal 2021-02-23 12:54:14 UTC
The tcmalloc fix is in the build, but we are stuck because of https://bugzilla.redhat.com/show_bug.cgi?id=1928471

Moving this to MODIFIED

Comment 14 Mudit Agarwal 2021-03-02 08:52:15 UTC
Elad, please provide qa_ack. This should be fixed with the tcmalloc library.

Comment 15 lmcfadde 2021-03-10 19:28:56 UTC
@tstober @sbalusu Can this be moved to verified?

Comment 16 Warren 2021-03-10 20:39:52 UTC
I ran the command:

run-ci -m 'tier1' --ocsci-conf config.yaml --cluster-path <cluster_path> --html=<path> --self-contained-html

100% of the tasks passed and ceph health was HEALTH_OK after the test.

The run failed after that because I used the wrong value for the html path.

What should I use for path in this command?

Is the result that I got sufficient to demonstrate that this bug is fixed?

Thanks

Comment 20 Sarah Julia Kriesch 2021-03-18 07:44:47 UTC
I have got OCP 4.7.1 and OCS 4.7 installed.


I ran the following command:
run-ci -m "tier1"  --ocsci-conf ~/ocs-ci/config.yaml --cluster-path /root/ocp4-workdir/ tests --no-print-logs --capture=no --html ~/testtier1Fixed_17thMarch2021.html --self-contained-html 

I got 2 failures because fio 1) could not be found and 2) was not executed.
52 error messages say that "routes.route.openshift.io "rook-ceph-rgw-ocs-storagecluster-cephobjectstore" already exists".

Comment 21 Sarah Julia Kriesch 2021-03-18 07:47:53 UTC
My OSDs didn't have any problems after the tier1 run:
rook-ceph-osd-0-6f779b8d65-p9f2x                                  2/2     Running     1          23h
rook-ceph-osd-1-7c46b87cb5-fvjp7                                  2/2     Running     0          23h
rook-ceph-osd-2-687dcd65f7-b6vqk                                  2/2     Running     0          23h
rook-ceph-osd-3-fddd688d8-dp8q5                                   2/2     Running     0          12h
rook-ceph-osd-4-84d6c9797b-g9qw9                                  2/2     Running     0          12h
rook-ceph-osd-5-65bdfc8b6c-zw6j6                                  2/2     Running     0          12h

Comment 22 Sarah Julia Kriesch 2021-03-22 08:41:35 UTC
I would say after my test run, this BZ can be closed as "resolved".

Comment 27 Sravika 2021-04-07 09:13:52 UTC
This issue has been fixed in the 4.7 release, and the PVC expansion tests in tier1 run through successfully.

Comment 28 tstober 2021-04-28 13:45:08 UTC
verified

Comment 29 Michael Schaefer 2021-04-29 15:58:16 UTC
Running the tier1 test on OCP 4.7.6 (and 4.7.8) on RHCOS 4.7.7, rook-ceph-mgr-a is OOM killed again.

From describe rook-ceph-mgr-a-8545946864-mrxgs:

    State:          Running
      Started:      Thu, 29 Apr 2021 15:57:43 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 29 Apr 2021 15:51:24 +0200
      Finished:     Thu, 29 Apr 2021 15:54:58 +0200
    Ready:          True
    Restart Count:  6

The scenario is to run 'tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expand_expanded_pvc'; after the second PVC expansion, the respective fio job which tests the expanded filesystem does not start, does not execute, or just hangs (not really clear). The test eventually fails with a command timeout after 600s. This happens in the Pod which mounts the PVC from the ocs-storagecluster-ceph-rbd StorageClass in Filesystem mode.

Uploaded logfiles from testrun (test-logs-1619703324) and logs from successful tier1 run:

https://drive.google.com/file/d/1e25f1fEhjC8ZtyDchDW5mjdysrchTW-M/view?usp=sharing

Comment 30 Sravika 2021-04-29 16:09:15 UTC
@rcyriac : Could you please reopen the bug, as it has been reproduced again on the very latest version according to @mschaefe.

@mschaefe: Could you also mention the OCP, OCS, and RHCOS versions where the PVC expansion test in tier1 has passed?

Comment 31 Mudit Agarwal 2021-04-30 13:44:20 UTC
Blaine/Josh, PTAL

Comment 32 Blaine Gardner 2021-04-30 19:41:16 UTC
This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=1929188, https://bugzilla.redhat.com/show_bug.cgi?id=1945016, and https://bugzilla.redhat.com/show_bug.cgi?id=1953430. Issues happen around FIO testing. Do the nodes become NotReady at any time during the test? 
@mschaefe: I think I see from the recorded events that this doesn't happen, but please confirm.

It may be that we need to increase the recommended resource requirements/limits for Ceph MGR pods. The limits from the test are 3Gi. 
@mschaefe: As a debugging measure, increase limits on the mgr pods to 6Gi to see if the issue persists.

@jdurgin: Could there be a mgr plugin that is consuming too many resources or a memory leak somewhere in the mgr? How much do you expect the MGR resource usage to be above 3Gi?

Comment 33 Mudit Agarwal 2021-05-03 11:06:17 UTC
Sravika, please let us know whether this is being reproduced only on IBM Z systems or on both P and Z systems.

We already have a BZ #1929188 where two cases are mentioned:

1. Pods restarting on P systems --- This is happening because one or more worker nodes are going down (BZ #1945016)
2. Pods restarting on Z systems -- Here we don't know the root cause yet but worker nodes are not going down.

So, if this BZ is reproducible only on IBM Z systems then this is a dup of BZ #1929188 and we should focus on why this is happening only on Z systems.

Comment 34 Michael Schaefer 2021-05-03 14:21:29 UTC
@sbalusu

I cannot find the RHCOS version for the PASSing runs, but OCS was 4.7.0-801.ci and OCP was 4.7.4.
The FAILing versions are: RHCOS 4.7.7, OCP 4.7.8, OCS 4.7.0-801.ci.
So, a different OCP and very likely a different RHCOS.

I will repeat the test runs with a 6G memory limit; however, we found that the OCS version I should use for the testing is OCS 4.7.0-372.ci.

Comment 35 Sravika 2021-05-04 07:02:00 UTC
@Mudit: Some background info: this issue was initially observed and reported on OCS 4.6. With the tcmalloc fix delivered in 4.7 it was resolved, and the test initially passed when @mschaefe verified on OCS 4.7 and OCP 4.7.4.
However, this issue has come back in the last test execution, as mentioned in the above comment by @mschaefe.


I am not sure whether it is reproducible on IBM P; @svenkat, could you please comment on this? Thank you.

Comment 36 Michael Schaefer 2021-05-04 10:07:57 UTC
Uploaded results from last test run to Google Drive: https://drive.google.com/file/d/1gjVHqCwgwpNztkr_jMWxiCfQfIYt5hI_/view?usp=sharing

OCP version: 4.7.8
OCS version: 4.7.0-364.ci
RHCOS: 4.7.7-s390x

For convenience, I added output of some oc commands (*.out files) to the uploaded zip file, too.

Observations:
- Testcase keeps failing with ThreadExecutor timeout on the fio command for the test pod which writes to the PVC in Filesystem mode.
- Other Pods can successfully finish their fio jobs on their double-expanded PVCs
- rook-ceph-mgr-a is NOT restarting with memory limit AND request size of 6GB
- must_gather seems to be failing for OCP, not clear why

For further verification, I am planning to set up a regression test on RHCOS 4.7.0.

Comment 37 Michael Schaefer 2021-05-04 13:06:47 UTC
Unfortunately, it seems I cannot reproduce the PASSing test.
However, logs for the PASSing tier1 suite can be found here: https://drive.google.com/file/d/1zmV9Bv4J72GR3zDPcSZSINAZPZSaD7dE/view?usp=sharing

Latest run environment:
- OCP 4.7.4
- OCS 4.7.0-377.ci
- RHCOS 4.7.0
- ocs-ci commit 34ce7028ac1736234e011d76e5c6c2b07b0ebeea

Logfiles from that run are uploaded here: https://drive.google.com/file/d/1J633IWjLgQ4uieFO8TGj2U_JsxXit7wg/view?usp=sharing

Observations:
- Test is again FAILing with timeout on the fio command which writes to the Filesystem PVC
- rook-ceph-mgr-a is NOT restarting, although the memory limit/request was kept at 3G
- again must_gather reported some failures

Comment 38 Mudit Agarwal 2021-05-04 13:45:48 UTC
Elad, 
From Michael's comment (https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c36) it seems like the pods are not restarting but the tests are failing.
Can someone from ci squads check why the tests are failing?


Venkat,
As asked in https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c33 and https://bugzilla.redhat.com/show_bug.cgi?id=1920498#c35, do you know if this is reproducible on P systems or not?


Honestly, this gets confusing with every update.

==> Initially the bug was filed on 4.6 and pods were restarting. [01/26]
==> We applied the tc_malloc fix [02/03]
==> Tests started passing with OCS 4.7 and the bug was marked VERIFIED. [04/28]
==> Again the tests started failing, with pods getting restarted (but nothing changed in OCS4.7/RHCS which could cause this) [04/30]
==> Now, in the recent runs tests are failing but pods are not restarting. [05/04]

Can we also verify this manually instead of relying on the ci runs?
Raz, if you have some suggestions.

Comment 39 Mudit Agarwal 2021-05-04 15:03:21 UTC
OK, I analysed the failure from the last run. It's failing while deleting a PVC:

E               TimeoutError: Timeout when waiting for pvc-66a8c2ac-494d-480c-9789-5001e67befbc to delete. Describe output: Name:            pvc-66a8c2ac-494d-480c-9789-5001e67befbc
E               Labels:          <none>
E               Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
E               Finalizers:      [kubernetes.io/pv-protection]
E               StorageClass:    ocs-storagecluster-ceph-rbd
E               Status:          Released
E               Claim:           namespace-test-c6dd50de4a0545dcb1ece68b0/pvc-test-8d58774ffecc44fda86a34190079db2
E               Reclaim Policy:  Delete
E               Access Modes:    RWO
E               VolumeMode:      Filesystem
E               Capacity:        25Gi
E               Node Affinity:   <none>
E               Message:         
E               Source:
E                   Type:              CSI (a Container Storage Interface (CSI) volume source)
E                   Driver:            openshift-storage.rbd.csi.ceph.com
E                   FSType:            ext4
E                   VolumeHandle:      0001-0011-openshift-storage-0000000000000001-ec0e973f-acb6-11eb-b03f-0a580a81020c
E                   ReadOnly:          false
E                   VolumeAttributes:      clusterID=openshift-storage
E                                          csi.storage.k8s.io/pv/name=pvc-66a8c2ac-494d-480c-9789-5001e67befbc
E                                          csi.storage.k8s.io/pvc/name=pvc-test-8d58774ffecc44fda86a34190079db2
E                                          csi.storage.k8s.io/pvc/namespace=namespace-test-c6dd50de4a0545dcb1ece68b0
E                                          imageFeatures=layering
E                                          imageFormat=2
E                                          imageName=csi-vol-ec0e973f-acb6-11eb-b03f-0a580a81020c
E                                          journalPool=ocs-storagecluster-cephblockpool
E                                          pool=ocs-storagecluster-cephblockpool
E                                          radosNamespace=
E                                          storage.kubernetes.io/csiProvisionerIdentity=1620114679356-8081-openshift-storage.rbd.csi.ceph.com
E               Events:
E                 Type     Reason              Age                 From                                                                                                                Message
E                 ----     ------              ----                ----                                                                                                                -------
E                 Warning  VolumeFailedDelete  55s (x9 over 3m5s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-55b9744ff9-bxx67_057881f5-24f0-4e03-ab27-4f5f489748fb  rpc error: code = Internal desc = rbd csi-vol-ec0e973f-acb6-11eb-b03f-0a580a81020c is still being used

If you check the CSI plugin logs, they show that a request to unstage the volume came in but the response never came back from the storage, or perhaps we didn't wait long enough for the response to come back.
This might be expected because the test case expands the volume, writes data, and deletes it immediately. We need to check the similar bugs we closed in the recent past; this should be another instance.
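
As an illustrative sketch only (not the ocs-ci implementation), the teardown could poll for PV deletion with a longer, configurable timeout instead of giving up while the unstage/flush is still in progress; the 1200s timeout and 15s interval below are assumptions:

import subprocess
import time

def wait_for_pv_deletion(pv_name, timeout=1200, interval=15):
    """Return True once the PV is gone, False if it persists past the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["oc", "get", "pv", pv_name, "-o", "name"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:  # 'oc get' fails once the PV no longer exists
            return True
        time.sleep(interval)
    return False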

Comment 40 Mudit Agarwal 2021-05-04 15:19:46 UTC
Plugin logs, as mentioned in the last update: Req ID 1207 never came back within the required time because there was still data to sync.

2021-05-04T09:29:35.022089170Z I0504 09:29:35.022066    4225 utils.go:132] ID: 1207 Req-ID: 0001-0011-openshift-storage-0000000000000001-         ec0e973f-acb6-11eb-b03f-0a580a81020c GRPC call: /csi.v1.Node/NodeUnstageVolume
2021-05-04T09:29:35.022112206Z I0504 09:29:35.022100    4225 utils.go:133] ID: 1207 Req-ID: 0001-0011-openshift-storage-0000000000000001-         ec0e973f-acb6-11eb-b03f-0a580a81020c GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-66a8c2ac-494d-480c-  9789-5001e67befbc/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-ec0e973f-acb6-11eb-b03f-0a580a81020c"}

Comment 42 Josh Durgin 2021-05-04 17:43:33 UTC
(In reply to Blaine Gardner from comment #32)
> @jdurgin: Could there be a mgr plugin that is consuming too many
> resources or a memory leak somewhere in the mgr? How much do you expect the
> MGR resource usage to be above 3Gi?

We haven't seen any cases of the mgr requiring a lot of memory. The initial report showed osd/mon/mgr all crashed - do we have any data to say what process was using more memory/reaching its memory limit?

If it was the mgr, reproducing with different mgr modules disabled to isolate the cause would be the next step.

Comment 43 Sravika 2021-05-05 08:09:20 UTC
@Mudit: Similar behaviour has been observed in one of the scale tests, where the test errored because PVC deletion was not successful during teardown.
Although the PVC has been deleted, the CSI volume was not unstaged and the data sync to the pool was still in progress.
Uploaded the logs of the test case and the must-gather logs to the drive.

https://drive.google.com/file/d/1lKBWboEyYedjjMGXUxtPlz5vra8e2ayM/view?usp=sharing


# oc get pvc -A | grep  pvc-test-2266586c3a6e48109eafdf76ab884e
#

# oc get pv -A | grep pvc-8b0ce676-4e18-4c38-a3cb-2178667ecf87
pvc-8b0ce676-4e18-4c38-a3cb-2178667ecf87   1Ti        RWO            Delete           Released    namespace-test-62180a33c675434d88ca31af9/pvc-test-2266586c3a6e48109eafdf76ab884e3   ocs-storagecluster-ceph-rbd            15m


Before test execution:

sh-4.4# rados df
POOL_NAME                                                USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS      RD WR_OPS      WR USED COMPR UNDER COMPR
.rgw.root                                             2.8 MiB      16      0     48                  0       0        0   2034 2.3 MiB     28  24 KiB        0 B         0 B
ocs-storagecluster-cephblockpool                       11 GiB    1498      0   4494                  0       0        0 435904 1.6 GiB 167964  21 GiB        0 B         0 B
ocs-storagecluster-cephfilesystem-data0                23 GiB    1949      0   5847                  0       0        0 857048 3.3 GiB 304270  41 GiB        0 B         0 B
ocs-storagecluster-cephfilesystem-metadata             28 MiB      55      0    165                  0       0        0  14427  17 MiB   3863  11 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data   192 KiB       1      0      3                  0       0        0   2087 2.3 MiB   3623 302 KiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index      0 B      22      0     66                  0       0        0   5600 5.5 MiB   7706 876 KiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec     0 B       0      0      0                  0       0        0      0     0 B      0     0 B        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.control            0 B       8      0     24                  0       0        0      0     0 B      0     0 B        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.log            6.6 MiB     181      0    543                  0       0        0  14114  14 MiB   9545 1.1 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.meta           1.9 MiB      12      0     36                  0       0        0   5540 4.7 MiB   2929 923 KiB        0 B         0 B

total_objects    3742
sh-4.4#




After test execution, usage of the "ocs-storagecluster-cephblockpool" pool still keeps increasing:


sh-4.4# rados df
POOL_NAME                                                USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS      RD  WR_OPS      WR USED COMPR UNDER COMPR
.rgw.root                                             2.8 MiB      16      0     48                  0       0        0  14298  17 MiB      28  24 KiB        0 B         0 B
ocs-storagecluster-cephblockpool                      269 GiB   49619      0 148857                  0       0        0 440143 1.6 GiB 2051219  29 GiB        0 B         0 B
ocs-storagecluster-cephfilesystem-data0                23 GiB    1949      0   5847                  0       0        0 857048 3.3 GiB  304270  41 GiB        0 B         0 B
ocs-storagecluster-cephfilesystem-metadata             28 MiB      55      0    165                  0       0        0 137924  77 MiB    3871  11 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.data   192 KiB       1      0      3                  0       0        0   9955  11 MiB   17111 1.4 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.index      0 B      22      0     66                  0       0        0  26418  26 MiB   36628 4.0 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec     0 B       0      0      0                  0       0        0      0     0 B       0     0 B        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.control            0 B       8      0     24                  0       0        0      0     0 B       0     0 B        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.log            6.6 MiB     213      0    639                  0       0        0  66179  64 MiB   44837 5.1 MiB        0 B         0 B
ocs-storagecluster-cephobjectstore.rgw.meta           1.9 MiB      12      0     36                  0       0        0  26779  23 MiB   13871 4.2 MiB        0 B         0 B

total_objects    51895
total_used       297 GiB
total_avail      5.7 TiB
total_space      6 TiB

Comment 44 Michael Schaefer 2021-05-05 12:04:55 UTC
Looking further into test_expand_expanded_pvc, I isolated the actions taken for the pod using the PVC from the ocs-storagecluster-ceph-rbd StorageClass in Filesystem mode and ran the respective fio jobs manually.
While doing so, no errors or hangups occurred and all the fio jobs terminated in a reasonable time.

Comment 45 Michael Schaefer 2021-05-05 13:17:51 UTC
Following the hints given above, I modified test_expand_expanded_pvc by adding a two-minute delay between expanding the PVCs the second time and actually starting the fio jobs.

Observations:
- With that little pause, the test is PASSing.
- However, it is now erroring in teardown when trying to remove the PV for the PVC in the ocs-storagecluster-ceph-rbd StorageClass - same as noted above.
- In an earlier run, I saw that the rook-ceph-mgr pod must have been restarted (but I didn't actually notice it in time).
- When the fio jobs are started by the ThreadExecutors, the cluster goes into HEALTH_WARN state, showing:
    2 MDSs report slow metadata IOs
    485 slow ops, oldest one blocked for 160 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
  The # of slow ops and the blocking time steadily increase until they begin going down after some minutes. As soon as this happens, I can see increasing write throughput in the io: client: section of the ceph -s output.
- This HEALTH_WARN state occurs whenever the test starts the fio jobs on the test pods.


Added this pause to the test:
[schaefm@m1308001 pvc_resize]$ git diff test_pvc_expansion.py
diff --git a/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py b/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py
index 5ccdd8be..22323227 100644
--- a/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py
+++ b/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py
@@ -1,5 +1,7 @@
 import logging
 import pytest
+import time
+
 from concurrent.futures import ThreadPoolExecutor
 
 from ocs_ci.ocs import constants
@@ -244,6 +246,8 @@ class TestPvcExpand(ManageTest):
         log.info("Expanding all PVCs for the second time.")
         self.expand_and_verify(pvc_size_expanded_2)
 
+        log.info("MS/OCPonZ: Sleeping two minutes to let things calm down.")
+        time.sleep(120)
         # Run IO and verify
         log.info("Starting post-second-expand IO on all pods.")
         self.run_io_and_verify(6, "post_expand")

Comment 48 Mudit Agarwal 2021-05-06 15:30:46 UTC
Rejy, I agree with the above comment. This BZ can be moved to VERIFIED again.

We don't even need to open a new BZ; the new issue we are hitting here (mon/mgr pods restarting) is already being looked at via BZ #1929188, and we can continue the investigation/discussion there.

Comment 52 errata-xmlrpc 2021-05-19 09:18:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

Comment 53 Red Hat Bugzilla 2023-09-15 00:59:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

