So this is a cleanup issue that eventually fixes itself as the PG count reaches zero, and that does not degrade the system? In any case, have you tracked how long it takes for this state to clear up?
(In reply to Scott Ostapovicz from comment #5)
> So this is a cleanup issue that eventually fixes itself as the PG count
> reaches zero, and that does not degrade the system? In any case, have you
> tracked how long it takes for this state to clear up?

Once the cluster ends up in this state, actively using it (creating Pods/PVCs, doing reads/writes) means it takes a very long time to return to a healthy state (hours or days). Leaving the cluster idle, without issuing reads/writes against it, will eventually move it back to a HEALTHY state. Other than this, the cluster is operational: it is possible to create images and issue r/w operations against them. I am running a lot of tests with the same scenario (create pod(s), attach PVC(s), execute load against the pods, and once the test is done, delete the pods/PVCs), and I see this problem only when rbd replication / mirroring is involved.
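For reference, a minimal sketch of that repro loop, assuming a kubectl-driven environment with a rook-ceph toolbox deployment available and fio installed in the test pod; the manifest files, pod/namespace names, and fio arguments below are illustrative placeholders, not taken from the actual test suite:

    #!/usr/bin/env python3
    """Hypothetical sketch of the repro cycle described above: create a PVC-backed
    pod, run a write load against it, tear everything down, then watch the PG
    state summary while the cluster works through the resulting backlog."""
    import subprocess
    import time

    MANIFESTS = ["pvc.yaml", "pod.yaml"]     # placeholder manifest files
    POD, NAMESPACE = "load-pod", "test-rbd"  # placeholder names

    def kubectl(*args):
        # Thin wrapper around kubectl; raises if the command fails.
        return subprocess.run(["kubectl", "-n", NAMESPACE, *args],
                              check=True, capture_output=True, text=True).stdout

    def run_cycle():
        for m in MANIFESTS:                  # create the PVC, then the pod using it
            kubectl("apply", "-f", m)
        kubectl("wait", "--for=condition=Ready", f"pod/{POD}", "--timeout=300s")
        # Drive I/O inside the pod; the fio invocation is illustrative only.
        kubectl("exec", POD, "--", "fio", "--name=w", "--rw=randwrite",
                "--size=1G", "--filename=/data/testfile")
        for m in reversed(MANIFESTS):        # delete the pod, then the PVC
            kubectl("delete", "-f", m, "--wait=true")

    def pg_summary():
        # Query the PG state summary via the toolbox deployment (name assumed).
        return kubectl("exec", "deploy/rook-ceph-tools", "--", "ceph", "pg", "stat")

    if __name__ == "__main__":
        run_cycle()
        # Poll for up to an hour to see how long the PG backlog takes to drain.
        for _ in range(60):
            print(pg_summary())
            time.sleep(60)

The idea is simply to repeat run_cycle() and observe whether the reported PG states drain between iterations with and without rbd mirroring enabled.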
Not a 4.9 blocker, moving it out while we continue the discussion.
This is a symptom of an overloaded cluster, not a bug. We need to test to determine what configuration and workload we can support on the given hardware, as described here: https://docs.google.com/document/d/1lLSf2GzdBIt9EATcqMx9jYcX5ylgIu6rzPISVmOYtIA/edit?usp=sharing
This turned out to be a bug in the scrub/snap trim interaction; marking this as a duplicate instead.

*** This bug has been marked as a duplicate of bug 2067056 ***