Bug 2051199
| Summary: | [ROKS] "ceph -s" reports warning if storage space on OCP nodes where monitor pods are scheduled is less than 30% | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Elvir Kuric <ekuric> |
| Component: | rook | Assignee: | Travis Nielsen <tnielsen> |
| Status: | CLOSED NOTABUG | QA Contact: | Elad <ebenahar> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.8 | CC: | jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-02-14 18:43:13 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The mons are using the host path in this scenario, and the default warning for ceph mons is at 30%. This is expected. Do you want to change the warning limit? Otherwise there is nothing that can be done; it is a valid warning.

(In reply to Travis Nielsen from comment #3)
> The mons are using the host path in this scenario, and the default warning
> for ceph mons is the warning at 30%. This is expected. Do you want to change
> the warning limit? Otherwise there is nothing that can be done, it is a
> valid warning.

I understand that the host path is used and that is OK. What is not clear to me is why so much data is generated on that node, and whether a full node filesystem could trigger a serious issue on the ODF side, e.g. a mon not working.

How big is the /var/lib/rook/ directory on that node? If it's bigger than 50GB, perhaps the mons could use some investigation into the cause. The health warning turns into an error when the disk usage gets even higher, apparently when free space drops below 5%: https://docs.ceph.com/en/latest/rados/operations/health-checks/#mon-disk-crit

If the system disk fills up, the node in general will have issues, not just the mons.

(In reply to Travis Nielsen from comment #5)
> How big is the /var/lib/rook/ directory on that node? If it's bigger than
> 50GB, perhaps the mons could use some investigation into the cause.

It is not critical at all:

```
sh-4.2# pwd
/var/lib/rook
sh-4.2# du -h .
38M     ./mon-c/data/store.db
38M     ./mon-c/data
38M     ./mon-c
28K     ./openshift-storage/ocs-deviceset-1-data-1z4ptm/ceph-5
32K     ./openshift-storage/ocs-deviceset-1-data-1z4ptm
19M     ./openshift-storage/log
4.0K    ./openshift-storage/crash/posted
8.0K    ./openshift-storage/crash
32K     ./openshift-storage/ocs-deviceset-2-data-0cxbnc/ceph-2
36K     ./openshift-storage/ocs-deviceset-2-data-0cxbnc
19M     ./openshift-storage
57M     .
```
> The health warning turns into an error when the disk usage gets even higher,
> looks like when free space drops below 5%:
> https://docs.ceph.com/en/latest/rados/operations/health-checks/#mon-disk-crit
>
> If the system disk fills up, the node in general will have issues, not just
> the mons.

Yes. What is confusing here is that nobody used the OCP node for writing anything, yet this warning is generated; and even though /var/lib/rook/ is not big, ODF propagates an error/warning which can become critical over time. On this node the critical consumer appears to be:

```
sh-4.2# pwd
/var/data/crash
sh-4.2# du -h .
46G     .
```

where crio (core-crio-6-0-0-100556-1642883029) was crashing.

Good to see the mon directory is small. But the warning is still a good thing. If the disk fills up for some other reason, it will affect the function of the mons and the cluster. Shall we close this, or what are you suggesting should be done? Suppress the warning?

(In reply to Travis Nielsen from comment #7)
> Good to see the mon directory is small. But the warning is still a good
> thing. If the disk fills up for some other reason, it will affect the
> function of the mons and the cluster. Shall we close this, or what are you
> suggesting should be done? Suppress the warning?

If the node is out of storage space it will be in a bad state and affect all pods, but the warning surfaced on the ODF side originates from a component outside ODF/Ceph (nor is it caused by one), so from the ODF side I think we can close this BZ.

Agreed. While it might not specifically be something ODF caused or can fix, let's keep the warning.
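For reference, the "change the warning limit" option discussed above corresponds, in upstream Ceph, to the monitor options `mon_data_avail_warn` (default 30) and `mon_data_avail_crit` (default 5). A minimal sketch of inspecting and adjusting the warning threshold from the rook-ceph toolbox, assuming the Rook/OCS operator does not itself manage or override these values:

```
# Show the current thresholds (defaults: warn at 30% avail, crit at 5%)
ceph config get mon mon_data_avail_warn
ceph config get mon mon_data_avail_crit

# Lower the warning threshold to 15% (example value, not a recommendation)
ceph config set mon mon_data_avail_warn 15
```

As the thread concludes, suppressing or loosening the warning is usually the wrong fix: the check exists because a full system disk affects the mons and the whole cluster.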
Description of problem (please be detailed as possible and provide log snippets):

"ceph -s" reports that a monitor "pod" is low on available space:

```
sh-4.4$ ceph -s
  cluster:
    id:     356ae4de-1225-434d-aa5e-fe26a16a4045
    health: HEALTH_WARN
            mon c is low on available space

sh-4.4$ ceph health detail
HEALTH_WARN mon c is low on available space
MON_DISK_LOW mon c is low on available space
    mon.c has 30% avail
```

In reality this refers to space on the ODF/OCP node where the monitor pod is running. From the OCP/ODF node where the monitor pod (mon-c) is scheduled:

```
sh-4.2# df -h | more
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2        98G   63G   30G  69% /
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G     0   32G   0% /sys/fs/cgroup
tmpfs            32G   15M   32G   1% /run
/dev/vda1       976M   93M  833M  10% /boot
```

Version of all relevant components (if applicable): OCP v4.8, ODF v4.8

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? No

Is there any workaround available to the best of your knowledge? Probably yes: either add more space to the OCP node or delete some data, though it is not clear what to delete.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 3

Is this issue reproducible? I am not sure; I noticed it after about 20 days of intensive usage of the ceph cluster.

Can this issue reproduce from the UI? NA

If this is a regression, please provide more details to justify this: NA

Steps to Reproduce:
1. I do not have clear steps. I was creating app pods and writing GB/TB to the storage backend. No direct access or write operation was issued on the OCP node where mon-c is scheduled.
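The df output above lines up with the warning: Ceph checks the percentage of space still available on the filesystem backing the mon store (/var/lib/rook on this host path setup), and 30G available out of 98G on /dev/vda2 is right at the 30% default. Note this differs slightly from df's Use% column, which is computed from used/(used+avail). A quick way to recompute the available percentage on a node, using only GNU coreutils df and awk:

```shell
# Percentage of space still available on the root filesystem, which in
# this scenario backs the mon's host-path store at /var/lib/rook.
df --output=size,avail -B1 / | awk 'NR==2 { printf "%d\n", 100*$2/$1 }'
```

On the node in this report this would print approximately 30, matching "mon.c has 30% avail" in `ceph health detail`.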
Actual results: "ceph -s" reports that mon-c has low available storage.

Expected results: Free space on OCP nodes should not affect storage pods / storage health.

Additional info: Could be similar to / a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1964055

must-gather from the cluster where the issue was visible: http://perf148b.perf.lab.eng.bos.redhat.com/mon-disk/
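The warn-then-error behaviour discussed in this bug can be sketched as a small shell function. The 30% and 5% cutoffs are Ceph's documented defaults (`mon_data_avail_warn`, `mon_data_avail_crit`); the function name `mon_disk_state` is illustrative, not a real Ceph command, and the use of `<=` rather than `<` is an assumption consistent with this report, where exactly 30% avail triggered the warning:

```shell
# Map the percentage of available space on the mon's filesystem to the
# health state Ceph would report. Thresholds follow the defaults:
# mon_data_avail_warn=30, mon_data_avail_crit=5.
mon_disk_state() {
    avail=$1   # percent of filesystem space still free
    if [ "$avail" -le 5 ]; then
        echo MON_DISK_CRIT
    elif [ "$avail" -le 30 ]; then
        echo MON_DISK_LOW
    else
        echo HEALTH_OK
    fi
}
```

For the node in this report, `mon_disk_state 30` yields MON_DISK_LOW, the warning seen in `ceph -s`; only if free space kept shrinking below 5% would it escalate to the MON_DISK_CRIT error.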