Bug 2231346
| Summary: | [GSS][ODF 4.12] When OSD goes down PGs are not redistributed to other OSDs | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Rafrojas <rafrojas> |
| Component: | Rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | ASSIGNED --- | QA Contact: | Tejas <tchandra> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.2 | CC: | bhubbard, ceph-eng-bugs, cephqe-warriors, nojha, rsachere, rzarzyns, sapillai, tnielsen, vumrao |
| Target Milestone: | --- | Flags: | sapillai: needinfo? (rafrojas) |
| Target Release: | 7.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Do we have any updates for this BZ? Thank you, Raimund

The noout flag is expected to clear after some timeout, such as 30m. Santosh, PTAL.
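For reference, a minimal sketch of how the flag state can be inspected and cleared manually, assuming standard `ceph` CLI access from the toolbox pod; the rack name `rack0` is taken from the mon logs quoted below:

```sh
# Cluster-wide OSD flags (noout, norecover, nobackfill, ...)
ceph osd dump | grep flags

# Per-CRUSH-bucket flags (e.g. "rack rack0 has flags noout") appear in health detail
ceph health detail

# Clear a cluster-wide flag
ceph osd unset noout

# Clear a flag applied to a CRUSH bucket (host/rack) rather than the whole cluster
ceph osd unset-group noout rack0
```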
Description of problem:
When taking down one storage node hosting an OSD, PGs are not redistributed to other OSDs.

Version-Release number of selected component (if applicable):
ODF 4.12

How reproducible:
Manually taking one OSD down

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Meanwhile, in the rook-ceph-operator logs I just see the flag being set:

```
2023-06-28T13:23:33.675488974Z 2023-06-28 13:23:33.675473 D | ceph-cluster-controller: updating ceph cluster "openshift-storage" status and condition to &{Health:{Status:HEALTH_WARN Checks:map[MDS_SLOW_METADATA_IO:{Severity:HEALTH_WARN Summary:{Message:1 MDSs report slow metadata IOs}} OSDMAP_FLAGS:{Severity:HEALTH_WARN Summary:{Message:norecover flag(s) set}} OSD_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 osds down}} OSD_HOST_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 hosts (2 osds) down}} OSD_RACK_DOWN:{Severity:HEALTH_WARN Summary:{Message:2 racks (2 osds) down}} PG_AVAILABILITY:{Severity:HEALTH_WARN Summary:{Message:Reduced data availability: 98 pgs inactive}} PG_DEGRADED:{Severity:HEALTH_WARN Summary:{Message:Degraded data redundancy: 602/1569 objects degraded (38.368%), 118 pgs degraded, 311 pgs undersized}}]} FSID:c4699502-9c8b-4fcf-8040-38115e2cc0a1 ElectionEpoch:110 Quorum:[0 1 2] QuorumNames:[c d e] MonMap:{Epoch:7 FSID: CreatedTime: ModifiedTime: Mons:[]} OsdMap:{OsdMap:{Epoch:0 NumOsd:0 NumUpOsd:0 NumInOsd:0 Full:false NearFull:false NumRemappedPgs:0}} PgMap:{PgsByState:[{StateName:active+undersized Count:137} {StateName:active+undersized+degraded Count:76} {StateName:undersized+peered Count:56} {StateName:active+clean Count:42} {StateName:undersized+degraded+peered Count:42}] Version:0 NumPgs:353 DataBytes:728470648 UsedBytes:13220044800 AvailableBytes:2671134515200 TotalBytes:2684354560000 ReadBps:0 WriteBps:0 ReadOps:0 WriteOps:0 RecoveryBps:0 RecoveryObjectsPerSec:0 RecoveryKeysPerSec:0 CacheFlushBps:0 CacheEvictBps:0 CachePromoteBps:0} MgrMap:{Epoch:0 ActiveGID:0 ActiveName: ActiveAddr: Available:true Standbys:[]} Fsmap:{Epoch:6173 ID:1 Up:1 In:1 Max:1 ByRank:[{FilesystemID:1 Rank:0 Name:ocs-storagecluster-cephfilesystem-b Status:up:active Gid:144641} {FilesystemID:1 Rank:0 Name:ocs-storagecluster-cephfilesystem-a Status:up:standby-replay Gid:906740}] UpStandby:0}}, True, ClusterCreated, Cluster created successfully
2023-06-28T13:23:33.794875301Z 2023-06-28 13:23:33.794863 D | ceph-cluster-controller: Health: "HEALTH_WARN", code: "OSDMAP_FLAGS", message: "norecover flag(s) set"
```

In the logs of the OSD pod you can see how the flag is set and unset:

```
[amanzane@supportshell-1 pods]$ grep -ir NORECOVER rook-ceph-osd-1-5469f6d9c9-b6rfx/osd/osd/logs/current.log
2023-06-28T13:23:12.575619884Z debug 2023-06-28T13:23:12.574+0000 7f9074980700 1 osd.1 704 pausing recovery (NORECOVER flag set)
2023-06-28T13:24:05.002745490Z debug 2023-06-28T13:24:05.002+0000 7f9074980700 1 osd.1 705 unpausing recovery (NORECOVER flag unset)
```

In the mon logs:

```
2023-07-13T10:50:00.000317521Z debug 2023-07-13T10:49:59.998+0000 7f611f2cd700 0 log_channel(cluster) log [WRN] : rack rack0 has flags noout
2023-07-13T10:50:00.842077591Z cluster 2023-07-13T10:50:00.000165+0000 mon.a (mon.0) 2275 : cluster [WRN] Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 279/1464 objects degraded (19.057%), 86 pgs degraded, 210 pgs undersized
2023-07-13T10:50:00.842141364Z cluster 2023-07-13T10:50:00.000251+0000 mon.a (mon.0) 2278 : cluster [WRN] [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
2023-07-13T10:50:00.842153981Z cluster 2023-07-13T10:50:00.000295+0000 mon.a (mon.0) 2279 : cluster [WRN] rack rack0 has flags noout
2023-07-13T11:00:00.000327680Z debug 2023-07-13T10:59:59.998+0000 7f611f2cd700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN 1 osds down; 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 279/1467 objects degraded (19.018%), 86 pgs degraded, 210 pgs undersized
2023-07-13T11:00:00.000327680Z debug 2023-07-13T10:59:59.998+0000 7f611f2cd700 0 log_channel(cluster) log [WRN] : [WRN] OSD_FLAGS: 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
```

I've prepared a lab and ran the same test (same version, 4.12), but in my case the noout flag is not set.
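For triage, a minimal sketch of the checks that show whether rebalancing is blocked by the flags above or simply because the down OSD has not yet been marked out; this assumes the standard `ceph` CLI from the toolbox pod, and names such as `rack0` come from the logs above:

```sh
# Which OSDs are down, and are they still "in"? PGs are only redistributed
# (backfilled to other OSDs) after a down OSD is marked out.
ceph osd tree down

# Flags that block recovery/backfill: noout, norecover, nobackfill, ...
ceph osd dump | grep flags
ceph health detail

# The mons mark a down OSD out automatically after this many seconds
# (default 600, i.e. 10 minutes) unless noout applies to it.
ceph config get mon mon_osd_down_out_interval

# Watch PG states: undersized/degraded PGs should transition to
# backfilling/recovering once the OSD is out and no blocking flags remain.
ceph pg stat
```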