Bug 1987549
| Summary: | rook-ceph-mgr-a restarting during PVC expansion | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Michael Schaefer <mschaefe> |
| Component: | ceph | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Raz Tamir <ratamir> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | bniver, madam, mschaefe, muagarwa, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-08-20 02:50:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michael Schaefer
2021-07-29 15:13:01 UTC
In the must-gather, the mgr-a pod does show that it was OOMKilled.
lastState:
  terminated:
    containerID: cri-o://7c2637404d0f700c65f551d219794e83632d2c8458868b56b93c87d4113f9e9f
    exitCode: 137
    finishedAt: "2021-07-29T08:11:35Z"
    reason: OOMKilled
    startedAt: "2021-07-29T08:03:01Z"
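For reference, the same last-state information can be read back from a live cluster. The namespace and the app=rook-ceph-mgr label selector below are assumptions based on a default openshift-storage deployment, not taken from this bug:

```bash
# Print each mgr pod name together with the reason its container last terminated
# (e.g. OOMKilled). Assumes the default openshift-storage namespace.
oc get pods -n openshift-storage -l app=rook-ceph-mgr \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```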
Can you repro this outside of ocs-ci? Exactly what operations were being performed? Besides the expansion, were other operations completed recently or still in progress? To troubleshoot we really need to know how to repro. All I see in the must-gather is the OOM kill, which is just a side effect of the load.
I will try to reproduce the behaviour outside of ocs-ci. Unfortunately, I have not been able to reproduce a similar situation so far, so possibly it is actually a load issue. ocs-ci runs the following procedure in parallel for different PVCs from different OCS storage classes.
The general recipe I went through manually so far is (a scripted sketch of these steps follows the list):
1. Create a 10G RWO PVC in ocs-storagecluster-ceph-rbd storage class
2. Create a Pod running quay.io/ocsci/nginx:latest as its only image, mounting the PVC at /var/lib/www/html
3. oc rsh into the Pod and run:
- apt update
- apt install fio
4. Run fio:
fio --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_pre_expand --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
5. Once fio has finished, expand the PVC to 20G:
oc patch PersistentVolumeClaim pvc-test -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'
6. Run another fio:
fio --name=fio-rand-write --filename=/var/lib/www/html/pod-test-rbd_post_expand --rw=randwrite --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
7. Once finished, repeat step 5 to expand to 25G, then repeat step 6.
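A scripted sketch of the recipe above, assuming the default project and illustrative names for the PVC wait step and the Pod (pod-test-rbd); the fio and oc patch invocations are copied verbatim from the steps, everything else is an assumption:

```bash
#!/usr/bin/env bash
# Sketch of the manual recipe; the namespace and pod name are assumptions.
set -euo pipefail
NS=default

# Step 1: 10G RWO PVC in the ocs-storagecluster-ceph-rbd storage class.
oc apply -n "$NS" -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ocs-storagecluster-ceph-rbd
  resources:
    requests:
      storage: 10Gi
EOF

# Step 2: Pod running quay.io/ocsci/nginx:latest, PVC mounted at /var/lib/www/html.
oc apply -n "$NS" -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pod-test-rbd
spec:
  containers:
  - name: nginx
    image: quay.io/ocsci/nginx:latest
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: pvc-test
EOF
oc wait -n "$NS" --for=condition=Ready pod/pod-test-rbd --timeout=300s

# Steps 3-4: install fio inside the pod and run the pre-expansion workload.
oc rsh -n "$NS" pod-test-rbd bash -c 'apt update && apt install -y fio'
oc rsh -n "$NS" pod-test-rbd fio --name=fio-rand-write \
  --filename=/var/lib/www/html/pod-test-rbd_pre_expand --rw=randwrite --bs=4K \
  --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G \
  --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json

# Step 5: expand the PVC to 20G.
oc patch -n "$NS" PersistentVolumeClaim pvc-test \
  -p '{"spec": {"resources": {"requests": {"storage": "20Gi"}}}}'

# Step 6: post-expansion workload.
oc rsh -n "$NS" pod-test-rbd fio --name=fio-rand-write \
  --filename=/var/lib/www/html/pod-test-rbd_post_expand --rw=randwrite --bs=4K \
  --direct=0 --numjobs=1 --time_based=1 --runtime=60 --size=8G \
  --ioengine=libaio --iodepth=4 --rate=1m --rate_process=poisson --output-format=json
```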
With this manual walkthrough for one PVC-Pod combination, I could not see mgr-a restarts or health_warn states yet.
To further understand the issue, I am currently running the whole test from the ocs-ci suite on a cluster where I have separated the OCS nodes (infra) from the "normal" workers, so that fio is not executed on the same workers where the OCS components are running. I still see the mgr restarting, and this is the ceph health warning:
ceph health
  cluster:
    id:     b5034972-6644-4819-bd43-74343a5ba8ad
    health: HEALTH_WARN
            2 MDSs report slow metadata IOs
            20521 slow ops, oldest one blocked for 322 sec, daemons [osd.0,osd.1,osd.2,osd.3] have slow ops.
When the mgr is in CrashLoopBackOff state and restarting:
ceph health
  cluster:
    id:     b5034972-6644-4819-bd43-74343a5ba8ad
    health: HEALTH_WARN
            no active mgr
            4825 slow ops, oldest one blocked for 457 sec, daemons [osd.0,osd.1,osd.3] have slow ops.

  services:
    mon: 3 daemons, quorum a,b,c (age 30m)
    mgr: no daemons active (since 2m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 4 osds: 4 up (since 29m), 4 in (since 29m)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  data:
    pools:   10 pools, 176 pgs
    objects: 12.24k objects, 46 GiB
    usage:   143 GiB used, 32 TiB / 32 TiB avail
    pgs:     176 active+clean

  io:
    client: 50 MiB/s wr, 0 op/s rd, 4.50k op/s wr
Once it is up again, the health warning is cleared.
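For reference, status output like the above can be collected on a running ODF cluster through the Rook toolbox; the namespace and the app=rook-ceph-tools label below are assumptions based on a default deployment (the toolbox may need to be enabled first):

```bash
# Query Ceph status and health through the rook-ceph-tools pod, assuming it is
# deployed in openshift-storage with the usual app=rook-ceph-tools label.
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)
oc rsh -n openshift-storage "$TOOLS_POD" ceph status
oc rsh -n openshift-storage "$TOOLS_POD" ceph health detail
```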
I am assuming that the mgr pod is still getting OOMKilled in the case above (comment 5), correct? Fairly strong evidence that this is a CI configuration issue. https://bugzilla.redhat.com/show_bug.cgi?id=1987549#c4 and https://bugzilla.redhat.com/show_bug.cgi?id=1987549#c5 strongly suggest that it is a CI issue.
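For reference, a quick way to check whether the mgr is still running into its memory limit under the CI load; the rook-ceph-mgr-a deployment name and the namespace below are assumptions based on a default ODF install:

```bash
# Show the resource requests/limits configured on the mgr deployment and the
# current memory usage of the mgr pod (requires metrics to be available).
oc get deployment rook-ceph-mgr-a -n openshift-storage \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
oc adm top pod -n openshift-storage -l app=rook-ceph-mgr
```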