Bug 1245333 - rare leaked connection->session->message->connection cycle on client restart with blocked requests
Summary: rare leaked connection->session->message->connection cycle on client restart with blocked requests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.0
Hardware: All
OS: All
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: 1.3.2
Assignee: Samuel Just
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-21 19:10 UTC by Samuel Just
Modified: 2017-07-30 15:14 UTC
CC: 7 users

Fixed In Version: RHEL: ceph-0.94.5-2.el7cp Ubuntu: ceph_0.94.5-2redhat1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-29 14:42:30 UTC
Embargoed:




Links:
Ceph Project Bug Tracker 12338
Red Hat Product Errata RHBA-2016:0313 (SHIPPED_LIVE): Red Hat Ceph Storage 1.3.2 bug fix and enhancement update (last updated 2016-02-29 19:37:43 UTC)

Description Samuel Just 2015-07-21 19:10:07 UTC
Description of problem:

Resetting a client->osd connection (for example, by killing the client) while a message on that connection is blocked waiting on a map can leak a connection->session->message->connection cycle. The OSD then complains indefinitely about a request being blocked for an ever-increasing amount of time (a slow request warning whose reported time keeps growing).
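
When the leak is present, the symptom is visible in the cluster health and log output. A minimal way to watch for it (a sketch, assuming a node with an admin keyring) is:

# The slow request warning shows up in health output and the cluster log;
# when the cycle has leaked, the reported blocked time keeps growing and never clears.
ceph health detail | grep -i blocked
ceph -w | grep -i 'slow request'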

Version-Release number of selected component (if applicable):

1.3.0 (upstream: Giant and newer)

How reproducible:

Not very. It tends to leak only a little, and very infrequently, but the OSD complains loudly to the central log whenever it happens, so it is unlikely to be happening without someone noticing.

Steps to Reproduce:
1. Create new cluster and start up
2. While that is happening, as soon as possible, run (see the sketch after this list): while true; do <start radosgw>; sleep 10; <kill radosgw>; done
3. Hopefully that will cause a message to get stuck in that state.
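
A concrete expansion of the restart loop in step 2 might look like the following sketch. The instance name client.rgw.gateway is an assumption, and the exact start command depends on how the gateway is deployed:

#!/bin/bash
# Hypothetical step-2 loop: start radosgw, let it submit requests for ~10s,
# then kill it to reset its client->osd connections.
# "client.rgw.gateway" is an assumed instance name -- adjust to the deployment.
while true
do
    sudo radosgw -n client.rgw.gateway
    sleep 10
    sudo pkill -x radosgw
    sleep 1
done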

Actual results:

stuck slow request

Expected results:

request correctly cleaned up

Additional info:

Comment 2 Ken Dreyer (Red Hat) 2015-12-11 21:41:43 UTC
Shipped in v0.94.4 - will be in RHCS 1.3.2

Comment 4 shylesh 2016-02-10 07:28:56 UTC
This is a race condition that I was not able to reproduce on older builds. After talking to Sam, we concluded that enough automated and manual regression testing has been done in the areas surrounding the fix, hence this bug is being marked as verified.

The following tests were also run specifically for this bug.

As per Sam's instruction, I ran map-changing commands from different terminals, as follows (looped variants of the T1/T2 flag toggles are sketched after the list):

T1:
===
ceph osd set noout
ceph osd unset noout

T2:
===
ceph osd set noin
ceph osd unset noin

T3:
===
ceph osd scrub 1
ceph osd deep-scrub 1

T4:
===
for i in {1..1000}; do sudo ceph osd pool create pool$i 1 1 replicated replicated_ruleset; sudo ceph osd pool mksnap pool$i snappy$i; sudo ceph osd pool rmsnap pool$i snappy$i; done

T5:
===
for i in {101..110}; do for j in {1..100}; do sudo ceph osd pool mksnap p$i s$j; sudo ceph osd pool rmsnap p$i s$j; done; done
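
To keep the osdmap churning, the T1/T2 flag toggles can be wrapped in loops, one per terminal (a sketch; the sleep interval is arbitrary):

# T1: repeatedly toggle the noout flag so each change produces a new osdmap epoch.
while true; do ceph osd set noout; sleep 2; ceph osd unset noout; sleep 2; done

# T2: the same with the noin flag.
while true; do ceph osd set noin; sleep 2; ceph osd unset noin; sleep 2; done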



[ubuntu@magna028 ~]$ cat snap.sh
#!/bin/bash

val=$RANDOM

for i in {1..100}
do
    for j in {1..100}
    do
        sudo ceph osd pool mksnap p$i sna$i$val
        sudo ceph osd pool rmsnap p$i sna$i$val
    done
done


The above script was run from 4 different terminals (= 4 different clients, so 400 ops) as shown below:

for i in {1..100}
do
    ./snap.sh &
done

Simultaneously, the ceph-radosgw process was restarted continuously.

But I was still not able to see blocked messages in ceph -w.
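
For completeness, the absence of stuck ops can also be confirmed directly on an OSD through its admin socket (a sketch, assuming osd.0 runs on the local node with the default admin socket path):

# Should report "num_ops": 0 once the test load has drained, and no entries
# with an ever-growing duration in the historic ops dump.
sudo ceph daemon osd.0 dump_ops_in_flight
sudo ceph daemon osd.0 dump_historic_ops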


Verified on ceph-0.94.5-8.el7cp.x86_64.

Comment 6 errata-xmlrpc 2016-02-29 14:42:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313

