Bug 2100747 - [Tracker BZ #2110008][GSS] Ceph placement groups in unknown status for a very long time, OSDs show "fault on lossy channel, failing" in debug log
Summary: [Tracker BZ #2110008][GSS] Ceph placement groups in unknown status for a very...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Brad Hubbard
QA Contact: Prasad Desala
URL:
Whiteboard:
Depends On: 1647624 2110008
Blocks:
 
Reported: 2022-06-24 07:45 UTC by Emmanuel Kasper
Modified: 2023-08-09 16:37 UTC
CC List: 20 users

Fixed In Version: 4.12.0-65
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:



Description Emmanuel Kasper 2022-06-24 07:45:12 UTC
Description of problem (please be as detailed as possible and provide log snippets):

Ceph placement groups are stuck in unknown status.

sh-4.4$ ceph -s
  cluster:
    id:    
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            Reduced data availability: 260 pgs inactive, 78 pgs stale
            19 daemons have recently crashed
  services:
    mon: 3 daemons, quorum bs,bw,bz (age 20h)
    mgr: a(active, since 14h)
    mds: ocs-storagecluster-cephfilesystem:0/1 2 up:standby, 1 damaged
    osd: 27 osds: 27 up (since 12h), 27 in (since 21h)
  data:
    pools:   10 pools, 1136 pgs
    objects: 1.93M objects, 1.4 TiB
    usage:   4.5 TiB used, 69 TiB / 73 TiB avail
    pgs:     22.887% pgs unknown
             798 active+clean
             260 unknown
             78  stale+active+clean
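
For anyone hitting a similar state, the unknown and stale PGs above can be listed individually from the toolbox. This is only a sketch of standard Ceph CLI usage, not commands taken from the case data:

    sh-4.4$ ceph health detail              # prints each stuck/inactive PG with its current state
    sh-4.4$ ceph pg dump_stuck inactive     # the unknown PGs are inactive, so they show up here
    sh-4.4$ ceph pg dump_stuck stale        # the 78 stale+active+clean PGs

A PG state of unknown generally means ceph-mgr has not received stats for that PG from any OSD, which would be consistent with the "fault on lossy channel, failing" messenger errors in the OSD debug logs mentioned in the title.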


Version of all relevant components (if applicable): ODF 4.8


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Unable to create any CephFS PVCs on that cluster.


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?
Check your grammar :)
After restarting ceph-mgr multiple times and testing pod-to-pod connectivity between ceph-mgr and the OSDs, the issue is still present.
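
For reference, a rough sketch of how the mgr restarts and connectivity checks would be driven from the OCP side, assuming the default openshift-storage namespace and Rook's standard pod labels (neither is confirmed from this case):

    $ oc -n openshift-storage get pods -l app=rook-ceph-mgr -o wide   # mgr pod name, node and pod IP
    $ oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide   # OSD pod IPs used for the pod-to-pod checks
    $ oc -n openshift-storage delete pod -l app=rook-ceph-mgr         # the deployment recreates the mgr pod

As noted above, the PGs remained unknown even after the recreated mgr came up.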


Steps to Reproduce: Unclear.


Actual results: Ceph health is HEALTH_ERR


Expected results: Ceph health is HEALTH_OK

Comment 4 Scott Ostapovicz 2022-06-24 15:06:15 UTC
Without a reproduction of the cause, I am not sure what we can do with this. @jdurgin, perhaps you can figure something out by looking at the system itself.

Comment 18 Brad Hubbard 2022-06-30 06:27:47 UTC
Looking at recent output in the support case:

    pg 1.102 is stuck inactive for 16659.549710, current state unknown, last acting []
    pg 1.104 is stuck inactive for 16659.549710, current state unknown, last acting []
    pg 1.108 is stuck inactive for 16659.549710, current state unknown, last acting []

That means they had already been stuck for over four and a half hours when that output was captured, and much longer by now.

Repeat that command to see how long they have been stuck at this point.

For each of those PGs that is still stuck, issue a pg query, identify the primary OSD, and gather as much log history as we can before doing anything else (we definitely need the time at which they got stuck).
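
A sketch of that collection step, assuming one of the PG IDs from the health output and that jq is available in the toolbox; the OSD id and pod name below are placeholders:

    sh-4.4$ ceph pg 1.102 query > pg-1.102-query.json   # may fail or hang while no OSD claims the PG
    sh-4.4$ ceph pg 1.102 query | jq '.acting, .up'     # the first OSD listed in "acting" is the primary
    $ oc -n openshift-storage logs <rook-ceph-osd-N-pod> --previous > osd-N.previous.log   # previous container, if it restarted
    $ oc -n openshift-storage logs <rook-ceph-osd-N-pod> > osd-N.current.log
    $ grep -n 'fault on lossy channel, failing' osd-N.current.log   # the messenger faults from the bug title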

Then actually follow the procedure from comment #256 for all stuck PGs and upload that data as well. (I am not sure why they have not tried this, since it got the cluster back to all active+clean approximately 24-36 hours ago, but they did not gather or upload the data Michael requested.)

