Description of problem:
An OSD node is taken offline for 15 minutes (with the noout flag set) and then brought back online. When recovery IO starts, RadosGW client IO halts on the cluster due to a large number of slow requests stuck in "currently waiting for degraded object". While the OSD node was offline, client IO remained active on the cluster for those 15 minutes.

The node houses 2 SSDs and 14 SATA HDDs; the SSD OSDs back the radosgw index pool. The degraded objects were in the radosgw index pool, which must be accessed for every RGW op, so a large range of RGW operations is affected. Rack replication is in use with 3 racks, 6 SSDs per rack (2 per OSD node).

Per the upstream tracker http://tracker.ceph.com/issues/13104 this is expected behavior: the object is degraded, and the OSD waits for it to be repaired before serving the op. Writes to degraded objects (present on the primary) are not allowed in Hammer and below; this changed in Infernalis.

Initially the recovery threads were throttled to 1 across the entire cluster to limit client IO impact during recovery. A recommendation was made to raise this back to the default of 15 on the SSD OSDs, but that did not alleviate the issue, which was seen again on the next node move.

Is recovery operating as expected here for Hammer? Do we have any method to prevent this impact from occurring during an OSD node move?

Version-Release number of selected component (if applicable):
ceph-0.94.5-14.el7cp.x86_64

How reproducible:
Consistent

Steps to Reproduce:
1. Set noout on the cluster.
2. Write several GB to the cluster.
3. Down one of the OSD nodes.
4. Write several GB to the cluster.
5. Bring the OSD node back into the cluster to trigger recovery.
6. While recovery is ongoing, generate further IO to the cluster and verify that client IO has halted.
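For reference, the recovery throttling described above was presumably applied via the OSD recovery/backfill options; a minimal ceph.conf sketch of the relevant Hammer-era knobs is below. The values shown are illustrative only (matching the "throttled to 1" state from this case), not recommendations, and "recovery threads" in the report is assumed to refer to osd recovery max active:

```ini
[osd]
; concurrent recovery ops per OSD (Hammer default: 15)
osd recovery max active = 1
; concurrent backfill operations per OSD (Hammer default: 10)
osd max backfills = 1
; priority of recovery ops relative to client ops (Hammer default: 10)
osd recovery op priority = 1
```

The same settings can be changed at runtime without a restart, e.g. `ceph tell osd.* injectargs '--osd-recovery-max-active 15'`, which is how a per-device-class value (SSD OSDs only, by targeting specific OSD ids) could be applied.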
Logs from the issue are here:
https://api.access.redhat.com/rs/cases/01703018/attachments/ceb38cac-0a54-4781-9d9c-4498f37abddb
https://api.access.redhat.com/rs/cases/01703018/attachments/db1e4598-6e5e-4808-9592-a904738f408f
https://api.access.redhat.com/rs/cases/01703018/attachments/38fd0ded-07d5-4f4e-b71b-10601f9ee58b
Adding needinfo back, as the last update cleared it.