Bug 2051199

Summary: [ROKS] "ceph -s" reports a warning if available storage space on OCP nodes where monitor pods are scheduled is less than 30%
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Elvir Kuric <ekuric>
Component: rook
Assignee: Travis Nielsen <tnielsen>
Status: CLOSED NOTABUG
QA Contact: Elad <ebenahar>
Severity: unspecified
Priority: unspecified
Version: 4.8
CC: jrivera, madam, muagarwa, ocs-bugs, odf-bz-bot
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Regression: ---
Last Closed: 2022-02-14 18:43:13 UTC

Description Elvir Kuric 2022-02-06 15:32:50 UTC
Description of problem (please be as detailed as possible and provide log snippets):

"ceph -s" reports that a monitor "pod" is low on available space.

sh-4.4$ ceph -s
  cluster:
    id:     356ae4de-1225-434d-aa5e-fe26a16a4045
    health: HEALTH_WARN
            mon c is low on available space
            
            
 sh-4.4$ ceph health detail
HEALTH_WARN mon c is low on available space
MON_DISK_LOW mon c is low on available space
    mon.c has 30% avail



In reality this refers to free space on the ODF/OCP node where this monitor pod is running.

From the OCP/ODF node where the monitor pod (mon-c) is scheduled:

mon-c node

sh-4.2# df -h |more
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda2        98G   63G   30G  69% /
devtmpfs         32G     0   32G   0% /dev
tmpfs            32G     0   32G   0% /dev/shm
tmpfs            32G     0   32G   0% /sys/fs/cgroup
tmpfs            32G   15M   32G   1% /run
/dev/vda1       976M   93M  833M  10% /boot
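
For reference: 30G available on the 98G root filesystem is roughly 30%, which is exactly Ceph's default mon_data_avail_warn threshold and matches the "mon.c has 30% avail" line above. A minimal way to confirm this correlation without SSH, assuming "oc debug" access to the node (the node name below is a placeholder):

$ oc debug node/<node-running-mon-c>
sh-4.2# chroot /host
sh-4.2# df -h /var/lib/rook

The Avail/Use% of the filesystem backing /var/lib/rook should line up with the percentage reported by "ceph health detail".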


Version of all relevant components (if applicable):

OCP v4.8, ODF v4.8 


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No


Is there any workaround available to the best of your knowledge?
Probably yes: either add more space to the OCP node or delete some data, but it is not clear what to delete.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
I am not sure; I noticed this issue after approximately 20 days of intensive use of the Ceph cluster.

Can this issue be reproduced from the UI?
NA 

If this is a regression, please provide more details to justify this:
NA 

Steps to Reproduce:
1. I do not have clear steps. I was creating app pods and writing GBs/TBs to the storage backend. No direct access or write operation was issued on the OCP node where mon-c is scheduled.

Actual results:
"ceph -s" reports that mon-c has low available storage.

Expected results:
Free space on OCP nodes should not affect storage pods / storage health.


Additional info:
Could be similar / duplicate of 
https://bugzilla.redhat.com/show_bug.cgi?id=1964055

must-gather from cluster where issue was visible : http://perf148b.perf.lab.eng.bos.redhat.com/mon-disk/

Comment 3 Travis Nielsen 2022-02-07 16:06:56 UTC
The mons are using the host path in this scenario, and the default warning threshold for Ceph mons is 30% available space. This is expected. Do you want to change the warning limit? Otherwise there is nothing to be done; it is a valid warning.
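
For reference, a minimal sketch of how the warning limit could be changed, assuming the standard ODF toolbox deployment (rook-ceph-tools) is available in openshift-storage; mon_data_avail_warn defaults to 30 (percent):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-4.4$ ceph config set mon mon_data_avail_warn 15
sh-4.4$ ceph config get mon mon_data_avail_warn

Lowering the threshold only delays the warning; it does not address the underlying disk usage on the node.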

Comment 4 Elvir Kuric 2022-02-07 16:10:23 UTC
(In reply to Travis Nielsen from comment #3)
> The mons are using the host path in this scenario, and the default warning
> threshold for Ceph mons is 30% available space. This is expected. Do you
> want to change the warning limit? Otherwise there is nothing to be done;
> it is a valid warning.

I understand that the host path is used and that is OK. What I am not clear about is why so much data is generated on that node, and whether, if the node file system fills up, this can trigger a serious issue on the ODF side, e.g. the mon not working?

Comment 5 Travis Nielsen 2022-02-07 19:17:48 UTC
How big is the /var/lib/rook/ directory on that node? If it's bigger than 50GB, perhaps the cause deserves some investigation.

The health warning turns into an error when the disk usage gets even higher, looks like when free space drops below 5%:
https://docs.ceph.com/en/latest/rados/operations/health-checks/#mon-disk-crit

If the system disk fills up, the node in general will have issues, not just the mons.
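
For reference, the thresholds behind MON_DISK_LOW and MON_DISK_CRIT can be checked from the toolbox, assuming the standard rook-ceph-tools deployment is available; the defaults are 30 and 5 percent respectively:

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-4.4$ ceph config get mon mon_data_avail_warn
sh-4.4$ ceph config get mon mon_data_avail_crit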

Comment 6 Elvir Kuric 2022-02-08 07:02:14 UTC
(In reply to Travis Nielsen from comment #5)
> How big is the /var/lib/rook/ directory on that node? If it's bigger than
> 50GB, perhaps the cause deserves some investigation.
It is not critical at all:

# pwd
/var/lib/rook
sh-4.2# du -h .
38M	./mon-c/data/store.db
38M	./mon-c/data
38M	./mon-c
28K	./openshift-storage/ocs-deviceset-1-data-1z4ptm/ceph-5
32K	./openshift-storage/ocs-deviceset-1-data-1z4ptm
19M	./openshift-storage/log
4.0K	./openshift-storage/crash/posted
8.0K	./openshift-storage/crash
32K	./openshift-storage/ocs-deviceset-2-data-0cxbnc/ceph-2
36K	./openshift-storage/ocs-deviceset-2-data-0cxbnc
19M	./openshift-storage
57M	.


> 
> The health warning turns into an error when the disk usage gets even higher,
> looks like when free space drops below 5%:
> https://docs.ceph.com/en/latest/rados/operations/health-checks/#mon-disk-crit
> 
> If the system disk fills up, the node in general will have issues, not just
> the mons.

Yes. What is confusing here is that nobody wrote anything directly to this OCP node, yet the warning is generated; and even though /var/lib/rook/ is not big, ODF gets a warning/error propagated that can become critical over time.

On this node, what seems critical is:

# pwd
/var/data/crash
sh-4.2# du -h .
46G	.

where crio (core-crio-6-0-0-100556-1642883029) was crashing.
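
For anyone hitting the same warning, a minimal sketch for finding what is actually consuming the node's root filesystem, assuming "oc debug" access (the node name is a placeholder):

$ oc debug node/<node-running-mon-c>
sh-4.2# chroot /host
sh-4.2# du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20

This is how directories such as /var/data/crash (46G of crio core dumps here) show up even though /var/lib/rook itself is small.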

Comment 7 Travis Nielsen 2022-02-08 14:40:12 UTC
Good to see the mon directory is small. But the warning is still a good thing. If the disk fills up for some other reason, it will affect the function of the mons and the cluster. Shall we close this, or what are you suggesting should be done? Suppress the warning?
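
For reference, if the warning is understood and expected, it could be muted for a period instead of suppressed permanently. A sketch, assuming the rook-ceph-tools deployment is available and that the Ceph release in use supports "ceph health mute" (added in Ceph Octopus):

$ oc -n openshift-storage rsh deploy/rook-ceph-tools
sh-4.4$ ceph health mute MON_DISK_LOW 4h
sh-4.4$ ceph health unmute MON_DISK_LOW

The mute expires after the given TTL (4h here), so the underlying condition is not hidden indefinitely.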

Comment 8 Elvir Kuric 2022-02-14 16:33:43 UTC
(In reply to Travis Nielsen from comment #7)
> Good to see the mon directory is small. But the warning is still a good
> thing. If the disk fills up for some other reason, it will affect the
> function of the mons and the cluster. Shall we close this, or what are you
> suggesting should be done? Suppress the warning?
If the node is out of storage space it will be in a bad state and affect all pods, but the warning generated on the ODF side originates from a component that is not ODF/Ceph (nor is it caused by one), so from the ODF side I think we can close this BZ.

Comment 9 Travis Nielsen 2022-02-14 18:43:13 UTC
Agreed, while it might not specifically be something ODF caused or can fix, let's keep the warning.