Bug 1987549 - rook-ceph-mgr-a restarting during PVC expansion
Summary: rook-ceph-mgr-a restarting during PVC expansion
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: s390x
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-29 15:13 UTC by Michael Schaefer
Modified: 2023-08-09 16:37 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-20 02:50:34 UTC
Embargoed:



Description Michael Schaefer 2021-07-29 15:13:01 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
While running the ocs-ci test 'tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expand_expanded_pvc', I see the rook-ceph-mgr-a Pod restarting. The test itself may pass, or it may fail on a timeout of the fio command that verifies the second PVC expansion.

I also watched ceph status during test execution and saw health warnings, first because of slow OSDs and then because no mgr daemon was available.

Monitoring shows that OOM kills have occurred, which I assume refer to the rook-ceph-mgr Pod.

Version of all relevant components (if applicable):
OCS: 4.8.0-175.ci
OCP: 4.8.2 from candidate-4.8 stream
ocs-ci: stable-ocs-4.8-202107251413

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Possible user impact is that applications using a PVC that has been expanded twice may fail. Also, since the whole Ceph cluster seems affected and goes into a health warning state, further storage issues are possible.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes. By running the respective test from ocs-ci.

Can this issue be reproduced from the UI?
Not tried.

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:
The test fails or passes depending on timing, I assume. The Ceph cluster shows HEALTH_WARN during test execution and the rook-ceph-mgr-a Pod is restarted.

Expected results:
Test passes, no health warning during execution, all Pods remain running.

Additional info:
* All logfiles, reports and must_gathers from a test run where the test passes, but restarts are detected: https://drive.google.com/file/d/1JSCj9id4hhF3cHmup7Pc2GNnWOX4dGMr/view?usp=sharing

* All logfiles, reports and must_gathers from a test run where the test fails, first on timeout of the fio command, then on deletion of the PVC: https://drive.google.com/file/d/1a0FtRYiK7g7f97XgNTeJrl8daHNgv9jg/view?usp=sharing

Comment 2 Travis Nielsen 2021-07-29 20:18:53 UTC
In the must-gather, the mgr-a pod does show that it was OOMKilled.

    lastState:
      terminated:
        containerID: cri-o://7c2637404d0f700c65f551d219794e83632d2c8458868b56b93c87d4113f9e9f
        exitCode: 137
        finishedAt: "2021-07-29T08:11:35Z"
        reason: OOMKilled
        startedAt: "2021-07-29T08:03:01Z"

Can you repro this outside of ocs-ci? Exactly what operations were being performed? Besides the expansion, were other operations completed recently or still in progress? To troubleshoot we really need to know how to repro. All I see in the must-gather is the OOM kill, which is just a side effect of the load.
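
For reference, the termination reason can also be read straight from the live pod rather than the must-gather. A minimal sketch, assuming the default openshift-storage namespace; the mgr pod is looked up by its app=rook-ceph-mgr label:

    # Find the mgr pod (name suffix differs per cluster).
    MGR_POD=$(oc -n openshift-storage get pods -l app=rook-ceph-mgr -o name | head -n1)

    # Last termination state of the mgr container (reason, exit code, timestamps).
    oc -n openshift-storage get "$MGR_POD" -o jsonpath='{.status.containerStatuses[0].lastState.terminated}{"\n"}'

    # Restart count as a quick sanity check.
    oc -n openshift-storage get "$MGR_POD" -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'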

Comment 3 Michael Schaefer 2021-07-30 08:06:23 UTC
I will try to reproduce the behaviour outside of ocs-ci.

Comment 4 Michael Schaefer 2021-07-30 12:43:02 UTC
Unfortunately, I have not yet been able to reproduce a similar situation, so it may actually be a load issue. ocs-ci runs the following procedure in parallel for different PVCs from different OCS storage classes.

The general recipe I went through manually so far is:
1. Create a 10Gi RWO PVC in the ocs-storagecluster-ceph-rbd storage class
2. Create a Pod running quay.io/ocsci/nginx:latest as its only container, mounting the PVC at /var/lib/www/html
3. oc rsh into the Pod and run:
   - apt update
   - apt install fio
4. Run fio:
    fio   --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_pre_expand   --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G   --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
5. Once fio has finished, expand the PVC to 20Gi:
    oc patch PersistentVolumeClaim pvc-test -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'
6. Run another fio:
    fio   --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_post_expand   --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G   --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
7. Once finished, repeat step 5 to expand to 25Gi, then repeat step 6.

With this manual walkthrough for a single PVC/Pod combination, I have not yet seen mgr-a restarts or health_warn states.
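
For completeness, steps 1 and 2 of the recipe above can be scripted roughly as follows. This is only a sketch: the PVC name matches the oc patch command in step 5, while the Pod name is illustrative; the image and mount path are taken from step 2, and it assumes the current project is a plain test namespace.

cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd
spec:
  containers:
  - name: nginx
    image: quay.io/ocsci/nginx:latest
    volumeMounts:
    - name: html
      mountPath: /var/lib/www/html
  volumes:
  - name: html
    persistentVolumeClaim:
      claimName: pvc-test
EOF

# After the expansion in step 5, the new size shows up once the filesystem resize has completed:
oc get pvc pvc-test -o jsonpath='{.status.capacity.storage}{"\n"}'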

Comment 5 Michael Schaefer 2021-08-02 13:28:46 UTC
To get a better understanding of the issue, I am currently running the whole test from the ocs-ci suite on a cluster where I have separated the OCS (infra) nodes from the "normal" workers, so that the fio jobs do not run on the same workers as the OCS components. I still see the mgr restarted; this is the ceph health warning:

ceph status
  cluster:
    id:     b5034972-6644-4819-bd43-74343a5ba8ad
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            20521 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.1,osd.2,osd.3] have slow ops.

When the mgr is in CrashLoopBackOff state and restarting:

ceph status
  cluster:
    id:     b5034972-6644-4819-bd43-74343a5ba8ad
    health: HEALTH_WARN
            no active mgr
            4825 slow ops, oldest one blocked for 457 sec, daemons [osd.0,osd.1,osd.3] have slow ops.

  services:
    mon: 3 daemons, quorum a,b,c (age 30m)
    mgr: no daemons active (since 2m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 29m), 4 in (since 29m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 12.24k objects, 46 GiB
    usage:   143 GiB used, 32 TiB / 32 TiB avail
    pgs:     176 active+clean

  io:
    client:   50 MiB/s wr, 0 op/s rd, 4.50k op/s wr

Once it is up again, health warning is cleared.
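
For anyone reproducing this, the status output above can be gathered from inside the cluster with the rook-ceph toolbox. A sketch, assuming the openshift-storage namespace and that the rook-ceph-tools toolbox deployment has been enabled:

    # Run ceph status through the toolbox pod.
    oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status

    # Watch it during the test to catch the slow-ops warnings as they appear.
    watch -n 10 'oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status'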

Comment 6 Scott Ostapovicz 2021-08-05 13:25:35 UTC
I am assuming that the mgr pod is still getting OOMKilled in the case above (comment 5), correct?  Fairly strong evidence that this is a CI configuration issue.
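
If it is still getting OOMKilled, the mgr memory limit and the overall node memory pressure are the first things to compare. A sketch, assuming the default openshift-storage namespace and that cluster metrics are available for oc adm top:

    # Memory request/limit currently set on the mgr deployment.
    oc -n openshift-storage get deployment rook-ceph-mgr-a -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'

    # Node-level memory usage while the test is running.
    oc adm top nodes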

Comment 7 Mudit Agarwal 2021-08-20 02:50:34 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1987549#c4 and https://bugzilla.redhat.com/show_bug.cgi?id=1987549#c5 strongly suggest that this is a CI issue.

