Description    Emmanuel Kasper    2022-06-24 07:45:12 UTC
Description of problem (please be as detailed as possible and provide log snippets):
Ceph placement groups are in unknown status.
sh-4.4$ ceph -s
  cluster:
    id:
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            Reduced data availability: 260 pgs inactive, 78 pgs stale
            19 daemons have recently crashed

  services:
    mon: 3 daemons, quorum bs,bw,bz (age 20h)
    mgr: a(active, since 14h)
    mds: ocs-storagecluster-cephfilesystem:0/1 2 up:standby, 1 damaged
    osd: 27 osds: 27 up (since 12h), 27 in (since 21h)

  data:
    pools:   10 pools, 1136 pgs
    objects: 1.93M objects, 1.4 TiB
    usage:   4.5 TiB used, 69 TiB / 73 TiB avail
    pgs:     22.887% pgs unknown
             798 active+clean
             260 unknown
             78  stale+active+clean
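For reference, the damaged MDS, the recent crashes, and the inactive PGs reported above can be inspected further from the same toolbox shell with standard Ceph commands (a minimal sketch; <crash-id> is a placeholder and exact output will differ on this cluster):

sh-4.4$ ceph health detail                                 # expands HEALTH_ERR into per-PG / per-daemon detail
sh-4.4$ ceph crash ls                                      # list the 19 recently crashed daemons
sh-4.4$ ceph crash info <crash-id>                         # backtrace and metadata for one crash
sh-4.4$ ceph fs status ocs-storagecluster-cephfilesystem   # MDS ranks, standbys, and damage state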
Version of all relevant components (if applicable): ODF 4.8
Does this issue impact your ability to continue to work with the product
(please explain the user impact in detail)?
Unable to create any CephFS PVCs on that cluster.
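A quick way to demonstrate the impact is to create a throwaway CephFS PVC and watch it: while the filesystem is offline it is expected to stay Pending. A minimal sketch, assuming the default ODF CephFS storage class name ocs-storagecluster-cephfs and the openshift-storage namespace (adjust to the actual cluster):

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-test-pvc
  namespace: openshift-storage
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
EOF
$ oc get pvc -n openshift-storage cephfs-test-pvc -w   # expected to remain Pending while the MDS is damaged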
Is there any workaround available to the best of your knowledge?
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Can this issue be reproduced?
After restarting ceph-mgr multiple times and testing pod-to-pod connectivity between ceph-mgr and the OSDs, the issue is still present.
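For the record, the restart and connectivity checks described above look roughly like the following. The label selectors, the rook-ceph-mgr-a deployment name, the openshift-storage namespace, and port 6800 are assumptions about a default Rook/ODF deployment, and curl may not be available in the mgr image:

$ oc delete pod -n openshift-storage -l app=rook-ceph-mgr       # the operator recreates the mgr pod
$ OSD_IP=$(oc get pod -n openshift-storage -l app=rook-ceph-osd \
    -o jsonpath='{.items[0].status.podIP}')                     # pick one OSD pod IP
$ oc exec -n openshift-storage deploy/rook-ceph-mgr-a -- \
    curl -sv -m 5 "telnet://${OSD_IP}:6800"                     # OSD messenger ports fall in 6800-7300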
Steps to Reproduce: Unclear.
Actual results: Ceph HEALTH_ERR
Expected results: Ceph HEALTH_OK
Without a reproduction of the cause, I am not sure what we can do with this. @jdurgin, perhaps you can figure something out by looking at the system itself.
Looking at the recent output in the support case:
pg 1.102 is stuck inactive for 16659.549710, current state unknown, last acting []
pg 1.104 is stuck inactive for 16659.549710, current state unknown, last acting []
pg 1.108 is stuck inactive for 16659.549710, current state unknown, last acting []
That is over four and a half hours of being stuck at the time that output was captured, and much longer by now.
Repeat that command to see how long they have been stuck now.
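A sketch of how to re-check from the toolbox pod (the grep pattern matches the health detail lines quoted above):

sh-4.4$ ceph health detail | grep 'stuck inactive'   # each line shows an updated "stuck inactive for <seconds>"
sh-4.4$ ceph pg dump_stuck inactive                  # alternative view of the same stuck set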
For each of those PGs, if they are still stuck, issue a pg query, find the primary OSD, and then gather as much log history as we can get before doing anything else (we definitely need the time at which they got stuck).
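A sketch of that per-PG data gathering, running the queries from the toolbox pod and pulling OSD logs with oc (<primary-id> is a placeholder for the primary OSD id taken from the map or query output):

for pg in 1.102 1.104 1.108; do
    ceph pg map "$pg"                             # CRUSH-computed up/acting set even while the PG is unknown
    ceph pg "$pg" query > "pg-${pg}-query.json"   # full peering history; may fail while no OSD reports the PG
done

$ oc logs -n openshift-storage deploy/rook-ceph-osd-<primary-id> --since=48h > osd-<primary-id>.log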
Then actually follow the procedure from comment #256 for all stuck PGs and upload that data as well. (It is not clear why this has not been tried again, since it brought the cluster back to all active+clean approximately 24-36 hours ago; however, the data Michael requested was not gathered or uploaded at the time.)