Bug 2117398

Summary: Ceph is in a warning state right after deployment even though it has enough space.
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Alexander Chuzhoy <sasha>
Component: ceph
ceph sub component: RADOS
Assignee: Neha Ojha <nojha>
QA Contact: Elad <ebenahar>
Status: CLOSED WORKSFORME
Docs Contact:
Severity: high
Priority: unspecified
CC: bniver, madam, muagarwa, ocs-bugs, odf-bz-bot, pdhange, pdhiran
Version: 4.10
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-26 13:55:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Alexander Chuzhoy 2022-08-10 22:18:02 UTC
Versions:
mcg-operator.v4.10.5
ocs-operator.v4.10.5
odf-csi-addons-operator.v4.10.5
odf-operator.v4.10.5

OCP: 4.10.24


Deployed ODF in a cluster running on KVM virtual machines, where each VM has 3 disks of 120G for ODF.
So in total we have 9 * 120 GB = 1080 GB of space.


With approximately 60% usage, we see 1 nearfull osd:


 cluster:
   id:     5acf4f06-4381-446c-b64f-4acfb26006ea
   health: HEALTH_WARN
           1 nearfull osd(s)
           Degraded data redundancy: 811/295932 objects degraded (0.274%), 2 pgs degraded, 3 pgs undersized
           11 pool(s) nearfull

 services:
   mon: 3 daemons, quorum a,b,c (age 2h)
   mgr: a(active, since 2h)
   mds: 1/1 daemons up, 1 hot standby
   osd: 9 osds: 9 up (since 2h), 9 in (since 2h); 3 remapped pgs
   rgw: 1 daemon active (1 hosts, 1 zones)

 data:
   volumes: 1/1 healthy
   pools:   11 pools, 369 pgs
   objects: 98.64k objects, 211 GiB
   usage:   646 GiB used, 434 GiB / 1.1 TiB avail
   pgs:     811/295932 objects degraded (0.274%)
            1971/295932 objects misplaced (0.666%)
            366 active+clean
            2   active+recovery_wait+undersized+degraded+remapped
            1   active+recovering+undersized+remapped

 io:
   client:   19 MiB/s rd, 1.2 MiB/s wr, 30 op/s rd, 119 op/s wr
   recovery: 19 MiB/s, 9 objects/s

 progress:
   Global Recovery Event (36m)
     [===========================.] (remaining: 18s)

Comment 2 Alexander Chuzhoy 2022-08-18 15:15:16 UTC
The issue reproduced again:
 cluster:
   id:     599932f3-90b2-49c4-a6c0-a2531dd2e694
   health: HEALTH_WARN
           Degraded data redundancy: 8899/299829 objects degraded (2.968%), 8 pgs degraded, 10 pgs undersized

 services:
   mon: 3 daemons, quorum a,b,c (age 102m)
   mgr: a(active, since 101m)
   mds: 1/1 daemons up, 1 hot standby
   osd: 9 osds: 9 up (since 101m), 9 in (since 101m); 10 remapped pgs
   rgw: 1 daemon active (1 hosts, 1 zones)

 data:
   volumes: 1/1 healthy
   pools:   11 pools, 369 pgs
   objects: 99.94k objects, 213 GiB
   usage:   655 GiB used, 425 GiB / 1.1 TiB avail
   pgs:     8899/299829 objects degraded (2.968%)
            5961/299829 objects misplaced (1.988%)
            359 active+clean
            8   active+recovery_wait+undersized+degraded+remapped
            2   active+recovering+undersized+remapped
 io:
   client:   30 MiB/s rd, 1.1 MiB/s wr, 36 op/s rd, 66 op/s wr
   recovery: 28 MiB/s, 12 objects/s

Comment 3 Alexander Chuzhoy 2022-08-18 15:15:45 UTC
I checked the cluster after a few hours and the status had recovered:


  cluster:
    id:     599932f3-90b2-49c4-a6c0-a2531dd2e694
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 13h)
    mgr: a(active, since 13h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 9 osds: 9 up (since 13h), 9 in (since 13h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 369 pgs
    objects: 99.98k objects, 214 GiB
    usage:   649 GiB used, 431 GiB / 1.1 TiB avail
    pgs:     369 active+clean
 
  io:
    client:   852 B/s rd, 19 KiB/s wr, 1 op/s rd, 1 op/s wr

Comment 4 Prashant Dhange 2022-08-23 04:02:54 UTC
@sasha The cluster was in HEALTH_WARN because one of the OSDs reached 75% of its capacity (the nearfull ratio is set to 0.75 by default in ODF). This is expected behavior. Check the "ceph osd df"/"ceph osd df tree" output to track the %USE (percentage usage) of each individual OSD.
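
For example, a minimal sketch of the checks mentioned above, run from the rook-ceph toolbox pod (the exact pod name varies per deployment; output shown by these commands will differ on your cluster):

  # Per-OSD utilization; the %USE column shows how close each OSD is to the nearfull threshold
  ceph osd df tree

  # Configured thresholds; nearfull_ratio defaults to 0.75
  ceph osd dump | grep ratio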

Check the Ceph documentation [1] on nearfull OSDs:

[1] https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html-single/troubleshooting_guide/index#near-full-osds_diag

Can you provide the "ceph osd df tree" output if the cluster is still available?

Let me know if you have any further queries. Feel free to close this BZ as not a bug.

Comment 5 Alexander Chuzhoy 2022-08-23 13:24:39 UTC
Hi Prashant.

Is the data redundancy message related to the same?
"
Degraded data redundancy: 811/295932 objects degraded (0.274%), 2 pgs degraded, 3 pgs undersized
"

It seems like it also raises a warning.

Comment 6 Prashant Dhange 2022-09-01 02:42:48 UTC
(In reply to Alexander Chuzhoy from comment #5)
> Hi Prashant.
> 
> Is the data redundancy message related to the same?
> "
> Degraded data redundancy: 811/295932 objects degraded (0.274%), 2 pgs
> degraded, 3 pgs undersized
> "
> 
> Seems like it also raises warning.

The data redundancy message is related to recovery, and that warning will clear once recovery completes. The HEALTH_WARN due to the nearfull OSD notifies the end user to add more OSDs to the cluster, as the cluster is getting full.
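
As a rough sketch of how to confirm this (run from the toolbox pod; exact output will vary):

  # Detailed breakdown of the current warnings, including which OSDs/pools are nearfull
  ceph health detail

  # Recovery progress; the degraded/undersized PG counts should drop to zero over time
  ceph status

If the nearfull warning remains after recovery completes, the proper fix is to add more OSDs; the threshold can also be raised temporarily with "ceph osd set-nearfull-ratio <ratio>", but that only postpones the problem.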

Comment 7 Prashant Dhange 2022-09-21 15:48:54 UTC
Hi Alexander,

Are we good to close this BZ? Let me know if you have any further questions.

Comment 8 Alexander Chuzhoy 2022-09-26 13:54:42 UTC
Hi Prashant.

Let's close it. 
Thank you :)