Bug 962621 - glusterd crashes on volume-stop
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 2.1
Priority: high
Severity: high
Assigned To: krishnan parthasarathi
QA Contact: Ben Turner
Depends On: 962619
Reported: 2013-05-14 00:57 EDT by krishnan parthasarathi
Modified: 2013-09-23 18:43 EDT
CC: 8 users

Fixed In Version: glusterfs-3.4.0.12rhs.beta4_1
Doc Type: Bug Fix
Clone Of: 962619
Last Closed: 2013-09-23 18:39:46 EDT
Type: Bug


Description krishnan parthasarathi 2013-05-14 00:57:07 EDT
+++ This bug was initially created as a clone of Bug #962619 +++

Description of problem:
glusterd crashes on running the volume-stop command. The crash was observed once while running the regression tests that are part of the codebase.

Version-Release number of selected component (if applicable):


How reproducible:
Inconsistent

Steps to Reproduce:
1. Run regression tests [1]
Actual results:
Glusterd crashes.

Expected results:
Glusterd shouldn't crash.


Additional info:
[1] - For further information on running regression tests, see https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/tests/README

--- Additional comment from krishnan parthasarathi on 2013-05-14 00:51:24 EDT ---

Created attachment 747495 [details]
Back trace of the crash

--- Additional comment from Anand Avati on 2013-05-14 00:53:22 EDT ---

REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on master by Krishnan Parthasarathi (kparthas@redhat.com)
Comment 1 Nagaprasad Sathyanarayana 2013-05-17 00:24:03 EDT
It is not good to have crashes; hence marking this as high priority.
Comment 3 SATHEESARAN 2013-08-12 07:54:33 EDT
Krishnan,

Please provide the steps to verify this bug.
Comment 4 krishnan parthasarathi 2013-08-12 08:21:44 EDT
Satheesaran,

This crash happens due to a race in the way we free up the resources associated with a brick when it is being stopped. So there is no deterministic way of recreating the issue. That said, running volume-stop and volume-start in quick succession might increase the chance of the race surfacing. With the fix, you shouldn't see the crash. Unfortunately, you can only increase your confidence that the current implementation is race-free by repeated execution.
Comment 5 Ben Turner 2013-08-12 16:17:45 EDT
Here is what I am running for the reproducer:

#!/bin/bash
VOLUME_NAME=testvol

gluster volume start $VOLUME_NAME

for number in `seq 1 10000`
do
    gluster volume stop $VOLUME_NAME
    if [ $? -ne 0 ]; then
        echo "There was a problem stopping the volume"
        break
    else
        gluster volume start $VOLUME_NAME
    fi
done
Comment 6 Ben Turner 2013-08-12 16:26:26 EDT
/me forgot the --mode=script above.  I will do 10,000 iterations of stop/start and see where we stand.  The systems are set up to email the storage QEs if a crash occurs.  Is 10,000 iterations enough, or is there anything else you wanted for verification?
Comment 7 SATHEESARAN 2013-08-13 01:52:48 EDT
Krishnan,

I have removed the NEEDINFO on you, as you have provided a way to verify this bug,
but I am raising it again for the question Ben Turner asked in comment 6.
Comment 8 krishnan parthasarathi 2013-08-13 02:18:46 EDT
Ben,

I think 10,000 iterations would be a good test, but there isn't a deterministic way to confirm that 10,000 iterations are good enough. Race detection at runtime is only best-effort with the tools we have today.

How we tested it during development was by running the volume-start and volume-stop commands in a loop for a couple of hours. We didn't observe any crashes, which increased our confidence in the fix.
Comment 9 Ben Turner 2013-08-13 11:48:59 EDT
It has been running for one day with no problems. /me should have printed the iteration number...
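For reference, a variant of the loop from comment 5 that prints the iteration number (so a crash can be correlated with a specific cycle) and passes --mode=script, as noted in comment 6, might look like the sketch below. The no-op gluster stub is only there so the loop logic can be dry-run on a machine without the CLI installed; the volume name and iteration count are illustrative assumptions.

```shell
#!/bin/bash
# Sketch of an improved reproducer: the same stop/start cycle as the
# script in comment 5, but it echoes the iteration number and uses
# --mode=script so the CLI never prompts for confirmation.
# Stub for dry-running on a box without the gluster CLI installed.
if ! command -v gluster >/dev/null 2>&1; then
    gluster() { return 0; }
fi

VOLUME_NAME=testvol              # assumed volume name
ITERATIONS=${ITERATIONS:-10000}  # assumed iteration count

gluster --mode=script volume start "$VOLUME_NAME"

for number in $(seq 1 "$ITERATIONS"); do
    if ! gluster --mode=script volume stop "$VOLUME_NAME"; then
        echo "stop failed on iteration $number"
        break
    fi
    gluster --mode=script volume start "$VOLUME_NAME"
    echo "completed iteration $number"
done
```

On a real test machine, delete the stub block and point VOLUME_NAME at an existing volume; the iteration number in the output then identifies which stop/start cycle triggered a crash.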
Comment 10 Ben Turner 2013-08-13 14:53:14 EDT
Made it through all 10,000 iterations without a crash.  Marking as verified.
Comment 11 Scott Haines 2013-09-23 18:39:46 EDT
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
