Bug 1245333 - rare leaked connection->session->message->connection cycle on client restart with blocked requests
Summary: rare leaked connection->session->message->connection cycle on client restart with blocked requests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.0
Hardware: All
OS: All
Priority: unspecified
Severity: low
Target Milestone: rc
Target Release: 1.3.2
Assignee: Samuel Just
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-07-21 19:10 UTC by Samuel Just
Modified: 2017-07-30 15:14 UTC
CC: 7 users

Fixed In Version: RHEL: ceph-0.94.5-2.el7cp Ubuntu: ceph_0.94.5-2redhat1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-29 14:42:30 UTC
Embargoed:




Links:
Ceph Project Bug Tracker 12338
Red Hat Product Errata RHBA-2016:0313 (SHIPPED_LIVE): Red Hat Ceph Storage 1.3.2 bug fix and enhancement update (last updated 2016-02-29 19:37:43 UTC)

Description Samuel Just 2015-07-21 19:10:07 UTC
Description of problem:

Resetting a client->osd connection (for example, by killing the client) while a message on that connection is blocked waiting on a map can leak a connection->session->message->connection cycle. The OSD then complains indefinitely about a request being blocked for an ever-increasing amount of time (a slow request warning whose reported time keeps growing).
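
When the leak is present, the symptom is visible in the cluster health and log output. A minimal way to watch for it (a sketch, assuming a node with an admin keyring) is:

# The slow request warning shows up in health output and the cluster log;
# when the cycle has leaked, the reported blocked time keeps growing and never clears.
ceph health detail | grep -i blocked
ceph -w | grep -i 'slow request'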

Version-Release number of selected component (if applicable):

1.3.0 (upstream: Giant and newer)

How reproducible:

Not very. It tends to leak only a little, and very infrequently, but the OSD complains loudly to the central log whenever it happens, so it is unlikely to be happening without someone noticing.

Steps to Reproduce:
1. Create new cluster and start up
2. While that is happening, as soon as possible, run (see the sketch after this list): while true; do <start radosgw>; sleep 10; <kill radosgw>; done
3. Hopefully that will cause a message to get stuck in that state.
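
A concrete expansion of the restart loop in step 2 might look like the following sketch. The instance name client.rgw.gateway is an assumption, and the exact start command depends on how the gateway is deployed:

#!/bin/bash
# Hypothetical step-2 loop: start radosgw, let it submit requests for ~10s,
# then kill it to reset its client->osd connections.
# "client.rgw.gateway" is an assumed instance name -- adjust to the deployment.
while true
do
    sudo radosgw -n client.rgw.gateway
    sleep 10
    sudo pkill -x radosgw
    sleep 1
done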

Actual results:

stuck slow request

Expected results:

request correctly cleaned up

Additional info:

Comment 2 Ken Dreyer (Red Hat) 2015-12-11 21:41:43 UTC
Shipped in v0.94.4 - will be in RHCS 1.3.2

Comment 4 shylesh 2016-02-10 07:28:56 UTC
This is a race condition that I was not able to reproduce on older builds. After talking to Sam, we concluded that enough automated and manual regression testing has been done in the areas surrounding the fix, hence this bug is being marked as verified.

The following tests were also run specifically for this bug.

As per Sam's instruction, I ran map-changing commands from different terminals, as follows (looped variants of the T1/T2 flag toggles are sketched after the list):

T1:
===
ceph osd set noout
ceph osd unset noout

T2:
===
ceph osd set noin
ceph osd unset noin

T3:
===
ceph osd scrub 1
ceph osd deep-scrub 1

T4:
===
for i in {1..1000}; do sudo ceph osd pool create pool$i 1 1 replicated replicated_ruleset; sudo ceph osd pool mksnap pool$i snappy$i; sudo ceph osd pool rmsnap pool$i snappy$i; done

T5:
===
for i in {101..110}; do for j in {1..100}; do sudo ceph osd pool mksnap p$i s$j; sudo ceph osd pool rmsnap p$i s$j; done; done
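
To keep the osdmap churning, the T1/T2 flag toggles can be wrapped in loops, one per terminal (a sketch; the sleep interval is arbitrary):

# T1: repeatedly toggle the noout flag so each change produces a new osdmap epoch.
while true; do ceph osd set noout; sleep 2; ceph osd unset noout; sleep 2; done

# T2: the same with the noin flag.
while true; do ceph osd set noin; sleep 2; ceph osd unset noin; sleep 2; done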



[ubuntu@magna028 ~]$ cat snap.sh
#!/bin/bash

val=$RANDOM

for i in {1..100}
do
    for j in {1..100}
    do
        sudo ceph osd pool mksnap p$i sna$i$val
        sudo ceph osd pool rmsnap p$i sna$i$val
    done
done


The above script was run from 4 different terminals (= 4 different clients, so 400 ops) as shown below:

for i in {1..100}
do
    ./snap.sh &
done

Simultaneously, the ceph-radosgw process was restarted continuously.

But I was still not able to see blocked messages in ceph -w.
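
For completeness, the absence of stuck ops can also be confirmed directly on an OSD through its admin socket (a sketch, assuming osd.0 runs on the local node with the default admin socket path):

# Should report "num_ops": 0 once the test load has drained, and no entries
# with an ever-growing duration in the historic ops dump.
sudo ceph daemon osd.0 dump_ops_in_flight
sudo ceph daemon osd.0 dump_historic_ops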


Verified on ceph-0.94.5-8.el7cp.x86_64.

Comment 6 errata-xmlrpc 2016-02-29 14:42:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313

