Bug 2100747

Summary: [Tracker BZ #2110008][GSS] Ceph placement groups in unknown status for a very long time, OSDs show "fault on lossy channel, failing" in debug log
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Emmanuel Kasper <ekasprzy>
Component: ceph Assignee: Brad Hubbard <bhubbard>
ceph sub component: RADOS QA Contact: Prasad Desala <tdesala>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: high    
Priority: high CC: assingh, bhubbard, bniver, hklein, hnallurv, jdurgin, kelwhite, kramdoss, lsantann, mgokhool, mmanjuna, muagarwa, nojha, ocs-bugs, odf-bz-bot, owasserm, pdhiran, sarora, sostapov, vumrao
Version: 4.8 Keywords: Reopened
Target Milestone: ---   
Target Release: ODF 4.12.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: 4.12.0-65 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-02-08 14:06:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1647624, 2110008    
Bug Blocks:    

Description Emmanuel Kasper 2022-06-24 07:45:12 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Ceph placement groups are stuck in unknown status.

sh-4.4$ ceph -s
  cluster:
    id:    
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            Reduced data availability: 260 pgs inactive, 78 pgs stale
            19 daemons have recently crashed
  services:
    mon: 3 daemons, quorum bs,bw,bz (age 20h)
    mgr: a(active, since 14h)
    mds: ocs-storagecluster-cephfilesystem:0/1 2 up:standby, 1 damaged
    osd: 27 osds: 27 up (since 12h), 27 in (since 21h)
  data:
    pools:   10 pools, 1136 pgs
    objects: 1.93M objects, 1.4 TiB
    usage:   4.5 TiB used, 69 TiB / 73 TiB avail
    pgs:     22.887% pgs unknown
             798 active+clean
             260 unknown
             78  stale+active+clean
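
For reference, the individual PGs behind those counts can be listed from the toolbox with something like the following (a minimal sketch; exact output format varies between Ceph releases):

sh-4.4$ ceph health detail | grep -E 'stuck (inactive|stale)'
sh-4.4$ ceph pg dump_stuck inactive
sh-4.4$ ceph pg dump_stuck stale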


Version of all relevant components (if applicable): ODF 4.8


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
Unable to create any CephFS PVCs on that cluster.
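
For illustration, the stuck provisioning can be confirmed from the OpenShift side roughly like this (sketch; the namespace and PVC name are placeholders):

$ oc -n <app-namespace> get pvc
$ oc -n <app-namespace> describe pvc <pvc-name>    # events from the CephFS provisioner show the create request not completing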


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Check your grammar :)
After restarting ceph-mgr multiple times and testing pod-to-pod connectivity between ceph-mgr and the OSDs, the issue is still present.
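
For reference, restarting the mgr and checking basic TCP reachability in an ODF cluster can be done along these lines (a rough sketch; pod names and hashes are placeholders, and nc may not be available in every image):

$ oc -n openshift-storage get pods -o wide | grep -E 'rook-ceph-(mgr|osd)'
$ oc -n openshift-storage delete pod rook-ceph-mgr-a-<hash>                    # the mgr deployment recreates the pod
$ oc -n openshift-storage rsh rook-ceph-mgr-a-<hash> nc -vz <osd-pod-ip> 6800  # OSDs listen in the 6800-7300 range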


Steps to Reproduce: Unclear.


Actual results: Ceph reports HEALTH_ERR


Expected results: Ceph reports HEALTH_OK

Comment 4 Scott Ostapovicz 2022-06-24 15:06:15 UTC
Without a reproduction of the cause, I am not sure what we can do with this. @jdurgin, perhaps you can figure something out by looking at the system itself.

Comment 18 Brad Hubbard 2022-06-30 06:27:47 UTC
Looking at recent output in the support case.

    pg 1.102 is stuck inactive for 16659.549710, current state unknown, last acting []
    pg 1.104 is stuck inactive for 16659.549710, current state unknown, last acting []
    pg 1.108 is stuck inactive for 16659.549710, current state unknown, last acting []

That means they had already been stuck for over four and a half hours when that output was captured, and much longer by now.

Repeat that command to see how long they have been stuck now.
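
For example, to re-check just the PGs quoted above (a sketch, assuming the output came from ceph health detail):

sh-4.4$ ceph health detail | grep -E 'pg 1\.(102|104|108) '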

For each of those PGs that is still stuck, issue a pg query to find the primary, then gather as much log history as we can get before doing anything else (we definitely need the time they got stuck).
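
As a sketch of that step for one of the PGs above (the query may not return for a PG with an empty acting set, and the deployment/log names below are assumptions for a Rook/ODF environment):

sh-4.4$ ceph pg map 1.102                               # shows the current up/acting OSD sets, if any
sh-4.4$ ceph pg 1.102 query > /tmp/pg-1.102-query.json  # info.stats.acting_primary identifies the primary OSD
$ oc -n openshift-storage logs deploy/rook-ceph-osd-<primary-id> --since=24h > osd-<primary-id>.log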

Then actually follow the procedure from comment #256 for all stuck PGs and upload that data as well (not sure why they haven't tried this, since it got the cluster back to all active+clean approx. 24-36 hours ago, but they didn't gather/upload the data Michael requested).