Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This project is now read-only. Starting Monday, February 2, please use https://ibm-ceph.atlassian.net/ for all bug tracking management.

Bug 2231346

Summary: [GSS][ODF 4.12] When OSD goes down PG's are not redistributed to other OSDs
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Rafrojas <rafrojas>
Component: Rook
Assignee: Santosh Pillai <sapillai>
Status: ASSIGNED
QA Contact: Vivek Das <vdas>
Severity: low
Docs Contact:
Priority: unspecified
Version: 6.1
CC: assingh, bhubbard, ceph-eng-bugs, cephqe-warriors, kelwhite, mduasope, nojha, pjagtap, rsachere, rzarzyns, sapillai, sostapov, tnielsen, trchakra, vereddy, vumrao
Target Milestone: ---
Flags: mduasope: needinfo-; trchakra: needinfo-; tnielsen: needinfo? (rzarzyns)
Target Release: Backlog
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Rafrojas 2023-08-11 10:25:02 UTC
Description of problem:
When taking down one storage node hosting an OSD, PGs are not redistributed to other OSDs.

Version-Release number of selected component (if applicable):
ODF 4.12

How reproducible:
Manually taking one OSD down

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Meanwhile, in the rook-ceph-operator logs I just see the flag being set:

2023-06-28T13:23:33.675488974Z 2023-06-28 13:23:33.675473 D | ceph-cluster-controller: updating ceph cluster "openshift-storage" status and condition to &{Health:{Status:HEALTH_WARN Checks:map[MDS_SLOW_METADATA_IO:{Severity:HEALTH_WARN Summary:{Message:1 MDSs report slow metadata IOs}} OSDMAP_FLAGS:{Severity:HEALTH_WARN Summary:{Message:norecover flag(s) set}} OSD_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 osds down}} OSD_HOST_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 hosts (2 osds) down}} OSD_RACK_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 racks (2 osds) down}} PG_AVAILABILITY:{Severity:HEALTH_WARN Summary:{Message:Reduced data availability: 98 pgs inactive}} PG_DEGRADED:{Severity:HEALTH_WARN Summary:{Message:Degraded data redundancy: 602/1569 objects degraded (38.368%), 118 pgs degraded, 311 pgs undersized}}]} FSID:c4699502-9c8b-4fcf-8040-38115e2cc0a1 ElectionEpoch:110 Quorum:[0 1 2] QuorumNames:[c d e] MonMap:{Epoch:7 FSID: CreatedTime: ModifiedTime: Mons:[]} OsdMap:{OsdMap:{Epoch:0 NumOsd:0 NumUpOsd:0 NumInOsd:0 Full:false NearFull:false NumRemappedPgs:0}} PgMap:{PgsByState:[{StateName:active+undersized Count:137} {StateName:active+undersized+degraded Count:76} {StateName:undersized+peered Count:56} {StateName:active+clean Count:42} {StateName:undersized+degraded+peered Count:42}] Version:0 NumPgs:353 DataBytes:728470648 UsedBytes:13220044800 AvailableBytes:2671134515200 TotalBytes:2684354560000 ReadBps:0 WriteBps:0 ReadOps:0 WriteOps:0 RecoveryBps:0 RecoveryObjectsPerSec:0 RecoveryKeysPerSec:0 CacheFlushBps:0 CacheEvictBps:0 CachePromoteBps:0} MgrMap:{Epoch:0 ActiveGID:0 ActiveName: ActiveAddr: Available:true Standbys:[]} Fsmap:{Epoch:6173 ID:1 Up:1 In:1 Max:1 ByRank:[{FilesystemID:1 Rank:0 Name:ocs-storagecluster-cephfilesystem-b Status:up:active Gid:144641} {FilesystemID:1 Rank:0 Name:ocs-storagecluster-cephfilesystem-a Status:up:standby-replay Gid:906740}] UpStandby:0}}, True, ClusterCreated, Cluster created successfully
2023-06-28T13:23:33.794875301Z 2023-06-28 13:23:33.794863 D | ceph-cluster-controller: Health: "HEALTH_WARN", code: "OSDMAP_FLAGS", message: "norecover flag(s) set"

In the logs of the OSD pod, you can see how the flag is set and unset:

[amanzane@supportshell-1 pods]$ grep -ir NORECOVER rook-ceph-osd-1-5469f6d9c9-b6rfx/osd/osd/logs/current.log
2023-06-28T13:23:12.575619884Z debug 2023-06-28T13:23:12.574+0000 7f9074980700  1 osd.1 704 pausing recovery (NORECOVER flag set)
2023-06-28T13:24:05.002745490Z debug 2023-06-28T13:24:05.002+0000 7f9074980700  1 osd.1 705 unpausing recovery (NORECOVER flag unset)

In the mon logs:

2023-07-13T10:50:00.000317521Z debug 2023-07-13T10:49:59.998+0000 7f611f2cd700  0 log_channel(cluster) log [WRN] :     rack rack0 has flags noout
2023-07-13T10:50:00.842077591Z cluster 2023-07-13T10:50:00.000165+0000 mon.a (mon.0) 2275 : cluster [WRN] Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 279/1464 objects degraded (19.057%), 86 pgs degraded, 210 pgs undersized
2023-07-13T10:50:00.842141364Z cluster 2023-07-13T10:50:00.000251+0000 mon.a (mon.0) 2278 : cluster [WRN] [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
2023-07-13T10:50:00.842153981Z cluster 2023-07-13T10:50:00.000295+0000 mon.a (mon.0) 2279 : cluster [WRN]     rack rack0 has flags noout
2023-07-13T11:00:00.000327680Z debug 2023-07-13T10:59:59.998+0000 7f611f2cd700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 279/1467 objects degraded (19.018%), 86 pgs degraded, 210 pgs undersized
2023-07-13T11:00:00.000327680Z debug 2023-07-13T10:59:59.998+0000 7f611f2cd700  0 log_channel(cluster) log [WRN] : [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

I prepared a lab and ran the same test (same version, 4.12), but in my case the noout flag is not set.

Comment 4 Raimund Sacherer 2023-08-14 11:36:12 UTC
Do we have any updates for this BZ?

Thank you, 

Raimund

Comment 5 Travis Nielsen 2023-08-14 22:30:10 UTC
The noout flag is expected to clear automatically after a timeout, such as 30 minutes.
Santosh PTAL

Comment 11 Santosh Pillai 2023-08-21 02:51:32 UTC
Thanks for the details. Here are my thoughts on this.

Rook workflow for node maintenance:
1. Rook uses a PodDisruptionBudget (PDB) to ensure that only one OSD is down in a failure domain (say, rack0) at a time.
2. If the customer brings down a node in, say, `rack0` and `osd.0` goes down, Rook adds blocking PDBs in the other failure domains (say, `rack1` and `rack2`), preventing customers from draining any OSDs in `rack1` and `rack2`.
3. Ceph marks `osd.0` as `DOWN`. Rook delays the `DOWN/OUT` process for `osd.0` by placing a `NOOUT` flag on the `rack0` failure domain. This prevents CRUSH from automatically rebalancing the cluster while customers stop OSDs for maintenance. Rook then expects the customer to upgrade or perform maintenance on the node and add it back, which starts `osd.0` again.
4. When `osd.0` comes back up and all the PGs are `active+clean`, Rook removes the blocking PDBs in `rack1` and `rack2` so that customers can drain nodes there as well.
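The per-failure-domain flag handling in steps 3 and 4 maps onto Ceph's group flag commands. A minimal sketch of what this looks like from the toolbox pod (the `rack0` name is illustrative; Rook derives the actual bucket name from the CRUSH map):

```shell
# Pause automatic out-marking for the failure domain under maintenance,
# so CRUSH does not start rebalancing while the node is down.
ceph osd set-group noout rack0

# Once maintenance finishes (or the timeout elapses), Rook removes the flag:
ceph osd unset-group noout rack0

# Inspect which CRUSH buckets currently carry flags:
ceph osd dump | grep -i flags
```

These commands require a running Ceph cluster; the mon log lines above ("rack rack0 has flags noout") are what a bucket-level flag looks like from the cluster side.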


In step 3 above, the `NOOUT` flag is removed after a maintenance timeout elapses. This timeout equals `cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout` if set; otherwise it defaults to 30 minutes.
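For reference, the timeout is set on the CephCluster CR. A sketch of raising it to 45 minutes (the namespace and cluster name below are the usual ODF ones, but verify them in your environment; the 45-minute value is illustrative):

```shell
# Patch the CephCluster CR to extend the OSD maintenance timeout.
# spec.disruptionManagement.osdMaintenanceTimeout is expressed in minutes.
kubectl -n openshift-storage patch cephcluster ocs-storagecluster-cephcluster \
  --type merge \
  -p '{"spec":{"disruptionManagement":{"osdMaintenanceTimeout":45}}}'
```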

From the BZ description so far, I believe that this `OSDMaintenanceTimeout` is not working for the customer and they have to remove the flag manually. I would like to see the must-gather logs for both scenarios that the customer has tried:
1. `Customer tried to suspend one of the VM running the node having the OSD`
2. `also they taken down OSDs by scaling down the deployment for one of them.`

Comment 13 Santosh Pillai 2023-08-24 04:59:14 UTC
Alicia, Thanks for the must-gather. 

I started with `must-gather-scale-down-osd-20230822.tar.xz`. 

Rook set `NOOUT` on `rack-3`, which is clear from the `rook-ceph-pdbstatemap` configmap:

`rack3-noout-last-set-at: "2023-08-22T12:20:05Z"`

Now that 30 minutes have passed, I need to verify whether `noout` was removed by Rook.
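Whether the timeout window has elapsed can be checked directly from that configmap timestamp. A minimal sketch, assuming GNU `date` and the default 30-minute timeout:

```shell
# Compute minutes elapsed since the noout flag was set on the failure domain.
# Timestamp taken from the rook-ceph-pdbstatemap configmap entry above.
noout_set_at="2023-08-22T12:20:05Z"
timeout_minutes=30   # default when osdMaintenanceTimeout is unset

set_epoch=$(date -u -d "$noout_set_at" +%s)
now_epoch=$(date -u +%s)
elapsed=$(( (now_epoch - set_epoch) / 60 ))

if [ "$elapsed" -ge "$timeout_minutes" ]; then
  echo "timeout elapsed (${elapsed}m): Rook should have removed noout by now"
else
  echo "still within the ${timeout_minutes}m maintenance window"
fi
```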


But with the current logs I can't verify whether `noout` was removed. I need the `ceph` logs for that, and unfortunately this must-gather does not include them: the `ceph` directory is missing.

So I would ask you to check why the `ceph` logs are missing. I assume it's because this must-gather was taken for the entire OCP cluster. If we fetch only the ODF-specific must-gather, we should get the `ceph` directory as well.
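If it helps, the ODF-specific must-gather can be collected with the version-matched image. The image path below follows the documented pattern for ODF 4.12, but please verify it against the release notes for your exact version:

```shell
# Collect the ODF-specific must-gather, which includes the ceph/ directory
# (image path per the ODF 4.12 documentation; confirm for your version).
oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.12
```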

Comment 31 Scott Ostapovicz 2024-03-06 15:53:03 UTC
While this is an interesting topic, we have yet to propose what I consider to be a credible solution to what is at best an off-to-the-side configuration problem. Given that we are only working on bugs at this point in the release, I am retargeting this to the next release.