Description of problem:
Causing a client->osd connection reset (by killing the client, for example) while a message is blocked waiting on a map on that connection can leak a connection->session->message->connection reference cycle. The osd will then constantly complain about a request being blocked for an ever-increasing amount of time (a slow request warning whose time value keeps growing).

Version-Release number of selected component (if applicable):
1.3.0 (upstream, giant and newer)

How reproducible:
Not very. It tends to leak a little, very infrequently. The osd complains really loudly to the central log whenever it happens, so it is unlikely to be happening without someone noticing.

Steps to Reproduce:
1. Create a new cluster and start it up.
2. While that is happening, as soon as possible:
   while true; do <start radosgw>; sleep 10; <kill radosgw>; done
3. Hopefully that will cause a message to get stuck in that state.

Actual results:
Stuck slow request.

Expected results:
Request correctly cleaned up.

Additional info:
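The restart loop in step 2 can be sketched as a small shell helper. This is only a sketch: the start and stop commands are placeholders (how radosgw is launched varies by environment), and the iteration count and pause are parameters instead of the infinite loop above so the churn eventually terminates.

```shell
#!/bin/sh
# Repeatedly start a client, let it run briefly, then kill it, so that a
# connection reset can race with a message blocked waiting on an osdmap.
# start_cmd/stop_cmd are placeholders; substitute whatever starts and kills
# radosgw in your environment.
churn() {
    start_cmd=$1
    stop_cmd=$2
    iterations=$3
    pause=$4
    i=1
    while [ "$i" -le "$iterations" ]; do
        $start_cmd
        sleep "$pause"
        $stop_cmd
        i=$((i + 1))
    done
}

# Example invocation (placeholders, not real commands):
# churn "<start radosgw>" "<kill radosgw>" 100 10
```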
Shipped in v0.94.4 - will be in RHCS 1.3.2
This is a race condition which I was not able to reproduce on the older builds. After talking to Sam, we concluded that we have done enough automated and manual regression testing in the areas surrounding the fix, hence marking this bug as verified. The following tests were also run specifically for this bug.

As per Sam's instruction, I ran map-changing commands from different terminals, like the following:

T1:
===
ceph osd set noout
ceph osd unset noout

T2:
===
ceph osd set noin
ceph osd unset noin

T3:
===
ceph osd scrub 1
ceph osd deep-scrub 1

T4:
===
for i in {1..1000}; do sudo ceph osd pool create pool$i 1 1 replicated replicated_ruleset; sudo ceph osd pool mksnap pool$i snappy$i; sudo ceph osd pool rmsnap pool$i snappy$i; done

T5:
===
for i in {101..110}; do for j in {1..100}; do sudo ceph osd pool mksnap p$i s$j; sudo ceph osd pool rmsnap p$i s$j; done; done

[ubuntu@magna028 ~]$ cat snap.sh
#!/bin/bash
val=$RANDOM
for i in {1..100}
do
  for j in {1..100}
  do
    sudo ceph osd pool mksnap p$i sna$i$val
    sudo ceph osd pool rmsnap p$i sna$i$val
  done
done

The above script was run from 4 different terminals (= 4 different clients, so 400 ops) like below:

for i in {1..100}
do
  ./snap.sh &
done

Simultaneously, the ceph-radosgw process was restarted continuously. Even so, I was not able to see any blocked messages in ceph -w.

Verified on ceph-0.94.5-8.el7cp.x86_64.
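Instead of watching ceph -w by eye, the verification check can be automated by grepping the health output for the blocked-request warning. This is a sketch under an assumption: the exact warning text ("N requests are blocked > M sec") varies between Ceph releases, so the match is a loose case-insensitive grep, and the health output is passed in as a parameter so the check is testable without a live cluster.

```shell
#!/bin/sh
# Count lines mentioning blocked requests in the given health output.
# Pass in the output of `ceph health detail` (or a captured ceph -w line);
# a count of 0 means no slow/blocked request warnings were seen.
count_blocked() {
    echo "$1" | grep -ci 'blocked'
}

# Example (against a live cluster):
# count_blocked "$(ceph health detail)"
```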
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:0313