+++ This bug was initially created as a clone of Bug #962619 +++

Description of problem:
glusterd crashes on running the volume-stop command. This crash was observed once while running the regression tests that are part of the codebase.

Version-Release number of selected component (if applicable):

How reproducible:
Inconsistent

Steps to Reproduce:
1. Run the regression tests [1] (see the sketch below)
2.
3.

Actual results:
glusterd crashes.

Expected results:
glusterd shouldn't crash.

Additional info:
[1] - For further information on running regression tests, see https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/tests/README

--- Additional comment from krishnan parthasarathi on 2013-05-14 00:51:24 EDT ---

Created attachment 747495 [details]
Back trace of the crash

--- Additional comment from Anand Avati on 2013-05-14 00:53:22 EDT ---

REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)
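For reference, a minimal sketch of step 1, assuming a source checkout with the top-level run-tests.sh wrapper as shipped in upstream glusterfs (the checkout path is a placeholder; the suite generally needs to run as root on a disposable machine):

#!/bin/bash
# Sketch: run the glusterfs regression suite from a source checkout.
# Assumes a top-level run-tests.sh wrapper exists, as in upstream glusterfs.
cd /path/to/glusterfs        # adjust to your checkout
sudo ./run-tests.sh          # runs the tests under tests/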
It's not good to have crashes, so I'm marking this high priority.
Krishnan, please provide steps to verify this bug.
Satheesaran, this crash happens due to a race in the way we free up the resources associated with a brick when it is being stopped, so there is no deterministic way to recreate the issue. That said, running volume-stop and volume-start in quick succession increases the chance of the race surfacing. With the fix, you shouldn't see the crash. Unfortunately, you can only build confidence that the current implementation is race-free by repeated execution.
Here is what I am running for the reproducer:

#!/bin/bash
VOLUME_NAME=testvol      # note: no leading '$' when assigning a variable
gluster volume start $VOLUME_NAME
for number in `seq 1 10000`
do
    gluster volume stop $VOLUME_NAME
    if [ $? -ne 0 ]; then
        echo "There was a problem stopping the volume"
        break
    else
        gluster volume start $VOLUME_NAME
    fi
done
/me forgot the --mode=script above (corrected loop sketched below). I will do 10,000 iterations of stop/start and see where we stand. FYI, the systems are set up to email the storage QEs if a crash occurs. Are 10,000 iterations enough, or is there anything else you wanted for verification?
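A sketch of the reproducer with the forgotten flag folded in; --mode=script makes the gluster CLI non-interactive, so "volume stop" doesn't block waiting for a y/n confirmation:

#!/bin/bash
# Same loop as above, with --mode=script on every gluster invocation
# so the CLI never prompts for confirmation.
VOLUME_NAME=testvol
gluster --mode=script volume start $VOLUME_NAME
for number in `seq 1 10000`
do
    gluster --mode=script volume stop $VOLUME_NAME
    if [ $? -ne 0 ]; then
        echo "There was a problem stopping the volume"
        break
    fi
    gluster --mode=script volume start $VOLUME_NAME
done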
Krishnan, I have removed the NEEDINFO on you, as you have provided a way to verify this bug, but I am raising it again for the question raised by Ben Turner in comment 6.
Ben, I think 10,000 iterations would be a good test, but there is no deterministic way to confirm that 10,000 iterations are enough; runtime race detection is only best-effort with the tools we have today. During development we tested this by running the volume-start and volume-stop commands in a loop for a couple of hours and didn't observe any crashes, which increased our confidence in the fix. (A time-bounded variant of the loop is sketched below.)
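If you'd rather bound the soak by wall-clock time than by iteration count, a hypothetical variant of the same loop (the two-hour duration is an arbitrary choice, mirroring the "couple of hours" used during development):

#!/bin/bash
# Time-bounded variant of the reproducer. SECONDS is bash's built-in
# counter of seconds since the shell started.
VOLUME_NAME=testvol
END=$((SECONDS + 2 * 60 * 60))
gluster --mode=script volume start $VOLUME_NAME
while [ $SECONDS -lt $END ]
do
    gluster --mode=script volume stop $VOLUME_NAME || break
    gluster --mode=script volume start $VOLUME_NAME
done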
Been running for one day with no problems. /me should have printed the iteration number...
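For anyone rerunning the reproducer, a one-line tweak (sketch) inside the loop body would make progress visible:

    echo "iteration $number: $(date)"    # print the loop counter each pass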
Made it through all 10,000 iterations without a crash. Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html