Bug 2231346
| Summary: | [GSS][ODF 4.12] When OSD goes down, PGs are not redistributed to other OSDs | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Rafrojas <rafrojas> |
| Component: | Rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | ASSIGNED --- | QA Contact: | Vivek Das <vdas> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | assingh, bhubbard, ceph-eng-bugs, cephqe-warriors, kelwhite, mduasope, nojha, pjagtap, rsachere, rzarzyns, sapillai, sostapov, tnielsen, trchakra, vereddy, vumrao |
| Target Milestone: | --- | Flags: | mduasope: needinfo-, trchakra: needinfo-, tnielsen: needinfo? (rzarzyns) |
| Target Release: | Backlog | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Do we have any updates for this BZ? Thank you, Raimund

The noout flag is expected to clear out after some timeout, such as 30m. Santosh, PTAL.

Thanks for the details. Here are my thoughts on this.

Rook workflow for node maintenance:

1. Rook uses a PodDisruptionBudget (PDB) to ensure that only one OSD is down in a failure domain (say `rack0`) at a time.
2. If the customer brings down a node in, say, `rack0` and `osd.0` goes down, Rook adds blocking PDBs in the other zones (say, `rack1` and `rack2`), preventing customers from draining or deleting any OSDs in `rack1` and `rack2`.
3. Ceph marks `osd.0` as `DOWN`. Rook delays the `DOWN/OUT` process for `osd.0` by placing a `NOOUT` flag on the `rack0` failure domain. This prevents CRUSH from automatically rebalancing the cluster while customers stop OSDs for maintenance. Rook now expects the customer to upgrade or perform maintenance on the node and add it back, which starts `osd.0` again.
4. When `osd.0` comes back up and all the PGs are `active+clean`, Rook removes the blocking PDBs in `rack1` and `rack2`, so that customers can drain nodes there as well.

In step 3 above, the `NOOUT` flag is removed after a certain maintenance timeout has elapsed. This timeout is equal to `cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout`, if set; otherwise it defaults to 30 minutes (a configuration sketch for this setting follows at the end of this comment thread). From the BZ description so far, I believe this `OSDMaintenanceTimeout` is not working for the customer and they have to remove the flag manually.

I would like to see the must-gather logs for both scenarios the customer has tried:

1. The customer suspended one of the VMs running the node hosting the OSD.
2. The customer also took OSDs down by scaling down the deployment for one of them.

Alicia, thanks for the must-gather. I started with `must-gather-scale-down-osd-20230822.tar.xz`. Rook set the `NOOUT` on `rack-3`, which is clear from the `rook-ceph-pdbstatemap` configmap: `rack3-noout-last-set-at: "2023-08-22T12:20:05Z"`. More than 30 minutes have passed since then, so I need to verify whether the `noout` was removed by Rook. With the current logs I cannot verify that; I need the `ceph` logs. Unfortunately the must-gather does not contain any `ceph`-related logs; the `ceph` directory is missing. I would ask you to check why the `ceph` logs are missing. I assume it is because this must-gather covers the entire OCP cluster; if we fetch an ODF-specific must-gather, we should get the `ceph` directory as well.

While this is an interesting topic, we have yet to propose what I consider a credible solution to what is at best an off-to-the-side configuration problem. Given that we are only working on bugs at this point in the release, I am retargeting this to the next release.
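Relating to the `OSDMaintenanceTimeout` discussion in the comments above, the sketch below shows one way to inspect the disruption-management settings and the Rook PDB state on an ODF cluster. It is only a hedged example: the namespace `openshift-storage`, the CephCluster name `ocs-storagecluster-cephcluster`, the configmap name, and the field path `spec.disruptionManagement.osdMaintenanceTimeout` follow the usual Rook/ODF defaults and are assumptions here, not values confirmed in this BZ; on ODF the ocs-operator may also reconcile a direct edit of the CephCluster CR back to its own defaults.

```shell
# Hedged sketch: inspect the maintenance timeout and the Rook PDB state on an ODF cluster.
# Resource names below are the usual ODF defaults and are assumptions, not values from this BZ.

NS=openshift-storage

# Current disruption-management settings (osdMaintenanceTimeout is in minutes; unset means the 30m default)
oc -n "$NS" get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.disruptionManagement}{"\n"}'

# Which failure domains Rook currently holds a noout on, and since when
oc -n "$NS" get configmap rook-ceph-pdbstatemap -o yaml | grep -i noout

# Blocking PDBs Rook created for the other failure domains
oc -n "$NS" get pdb

# Raise the timeout to 45 minutes (direct edit; the ocs-operator may revert this on reconcile)
oc -n "$NS" patch cephcluster ocs-storagecluster-cephcluster --type merge \
  -p '{"spec":{"disruptionManagement":{"osdMaintenanceTimeout":45}}}'
```

The `rack3-noout-last-set-at` timestamp quoted in the comment above comes from the same `rook-ceph-pdbstatemap` configmap, so comparing that timestamp against the configured timeout is one way to tell whether Rook should already have cleared the flag.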
Description of problem:
When taking down one storage node hosting an OSD, PGs are not redistributed to other OSDs.

Version-Release number of selected component (if applicable):
ODF 4.12

How reproducible:
Manually taking one OSD down

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Meanwhile, in the rook-ceph-operator logs I only see the flag being set:

2023-06-28T13:23:33.675488974Z 2023-06-28 13:23:33.675473 D | ceph-cluster-controller: updating ceph cluster "openshift-storage" status and condition to &{Health:{Status:HEALTH_WARN Checks:map[MDS_SLOW_METADATA_IO:{Severity:HEALTH_WARN Summary:{Message:1 MDSs report slow metadata IOs}} OSDMAP_FLAGS:{Severity:HEALTH_WARN Summary:{Message:norecover flag(s) set}} OSD_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 osds down}} OSD_HOST_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 hosts (2 osds) down}} OSD_RACK_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 racks (2 osds) down}} PG_AVAILABILITY:{Severity:HEALTH_WARN Summary:{Message:Reduced data availability: 98 pgs inactive}} PG_DEGRADED:{Severity:HEALTH_WARN Summary:{Message:Degraded data redundancy: 602/1569 objects degraded (38.368%), 118 pgs degraded, 311 pgs undersized}}]} FSID:c4699502-9c8b-4fcf-8040-38115e2cc0a1 ElectionEpoch:110 Quorum:[0 1 2] QuorumNames:[c d e] MonMap:{Epoch:7 FSID: CreatedTime: ModifiedTime: Mons:[]} OsdMap:{OsdMap:{Epoch:0 NumOsd:0 NumUpOsd:0 NumInOsd:0 Full:false NearFull:false NumRemappedPgs:0}} PgMap:{PgsByState:[{StateName:active+undersized Count:137} {StateName:active+undersized+degraded Count:76} {StateName:undersized+peered Count:56} {StateName:active+clean Count:42} {StateName:undersized+degraded+peered Count:42}] Version:0 NumPgs:353 DataBytes:728470648 UsedBytes:13220044800 AvailableBytes:2671134515200 TotalBytes:2684354560000 ReadBps:0 WriteBps:0 ReadOps:0 WriteOps:0 RecoveryBps:0 RecoveryObjectsPerSec:0 RecoveryKeysPerSec:0 CacheFlushBps:0 CacheEvictBps:0 CachePromoteBps:0} MgrMap:{Epoch:0 ActiveGID:0 ActiveName: ActiveAddr: Available:true Standbys:[]} Fsmap:{Epoch:6173 ID:1 Up:1 In:1 Max:1 ByRank:[{FilesystemID:1 Rank:0 Name:ocs-storagecluster-cephfilesystem-b Status:up:active Gid:144641} {FilesystemID:1 Rank:0 Name:ocs-storagecluster-cephfilesystem-a Status:up:standby-replay Gid:906740}] UpStandby:0}}, True, ClusterCreated, Cluster created successfully
2023-06-28T13:23:33.794875301Z 2023-06-28 13:23:33.794863 D | ceph-cluster-controller: Health: "HEALTH_WARN", code: "OSDMAP_FLAGS", message: "norecover flag(s) set"

The logs of the OSD pod show how the flag is set and unset:

[amanzane@supportshell-1 pods]$ grep -ir NORECOVER rook-ceph-osd-1-5469f6d9c9-b6rfx/osd/osd/logs/current.log
2023-06-28T13:23:12.575619884Z debug 2023-06-28T13:23:12.574+0000 7f9074980700 1 osd.1 704 pausing recovery (NORECOVER flag set)
2023-06-28T13:24:05.002745490Z debug 2023-06-28T13:24:05.002+0000 7f9074980700 1 osd.1 705 unpausing recovery (NORECOVER flag unset)

In the mon logs:

2023-07-13T10:50:00.000317521Z debug 2023-07-13T10:49:59.998+0000 7f611f2cd700 0 log_channel(cluster) log [WRN] : rack rack0 has flags noout
2023-07-13T10:50:00.842077591Z cluster 2023-07-13T10:50:00.000165+0000 mon.a (mon.0) 2275 : cluster [WRN] Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 279/1464 objects degraded (19.057%), 86 pgs degraded, 210 pgs undersized
2023-07-13T10:50:00.842141364Z cluster 2023-07-13T10:50:00.000251+0000 mon.a (mon.0) 2278 : cluster [WRN] [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
2023-07-13T10:50:00.842153981Z cluster 2023-07-13T10:50:00.000295+0000 mon.a (mon.0) 2279 : cluster [WRN] rack rack0 has flags noout
2023-07-13T11:00:00.000327680Z debug 2023-07-13T10:59:59.998+0000 7f611f2cd700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 279/1467 objects degraded (19.018%), 86 pgs degraded, 210 pgs undersized
2023-07-13T11:00:00.000327680Z debug 2023-07-13T10:59:59.998+0000 7f611f2cd700 0 log_channel(cluster) log [WRN] : [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

I have prepared a lab and run the same test (same version, 4.12), but in my case the noout flag is not set.
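For the flag behaviour seen in the logs above, here is a hedged diagnostic sketch. It assumes the rook-ceph-tools toolbox pod is deployed in `openshift-storage` (an assumption, not confirmed in this BZ). `ceph osd unset-group` is the bucket-level counterpart of the per-failure-domain `noout` that Rook sets; clearing flags manually bypasses Rook's maintenance handling, so it only serves to confirm the behaviour the customer reported, not as a fix.

```shell
# Hedged sketch: confirm which flags are keeping PGs from being redistributed, and clear them manually if needed.
# Assumes a rook-ceph-tools toolbox deployment exists in openshift-storage (an assumption here).

NS=openshift-storage
TOOLS=$(oc -n "$NS" get pod -l app=rook-ceph-tools -o name | head -n1)

# Cluster-wide flags (e.g. the norecover flag seen in the rook-ceph-operator log)
oc -n "$NS" exec "$TOOLS" -- ceph osd dump | grep ^flags

# Per-failure-domain flags (e.g. "rack rack0 has flags noout" from the mon log)
oc -n "$NS" exec "$TOOLS" -- ceph health detail

# Manual cleanup if Rook has not removed the flags after the maintenance timeout
oc -n "$NS" exec "$TOOLS" -- ceph osd unset-group noout rack0
oc -n "$NS" exec "$TOOLS" -- ceph osd unset norecover
```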