+++ This bug was initially created as a clone of Bug #962619 +++

Description of problem:
glusterd crashes on running the volume-stop command. This crash was observed once while running the regression tests that are part of the codebase.

Version-Release number of selected component (if applicable):

How reproducible:
Inconsistent

Steps to Reproduce:
1. Run the regression tests [1] (see the sketch below)
2.
3.

Actual results:
glusterd crashes.

Expected results:
glusterd shouldn't crash.

Additional info:
[1] - For further information on running regression tests, see https://forge.gluster.org/glusterfs-core/glusterfs/blobs/master/tests/README

--- Additional comment from krishnan parthasarathi on 2013-05-14 00:51:24 EDT ---

Created attachment 747495 [details]
Back trace of the crash

--- Additional comment from Anand Avati on 2013-05-14 00:53:22 EDT ---

REVIEW: http://review.gluster.org/5000 (glusterd: Disable transport before cleaning up rpc object) posted (#1) for review on master by Krishnan Parthasarathi (kparthas)
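For reference, a minimal sketch of step 1, assuming a source checkout with the top-level run-tests.sh wrapper as shipped in upstream glusterfs (the checkout path is a placeholder; the suite generally needs to run as root on a disposable machine):

#!/bin/bash
# Sketch: run the glusterfs regression suite from a source checkout.
# Assumes a top-level run-tests.sh wrapper exists, as in upstream glusterfs.
cd /path/to/glusterfs        # adjust to your checkout
sudo ./run-tests.sh          # runs the tests under tests/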
It's not good to have crashes, so I'm marking this high priority.
Krishnan, please provide steps to verify this bug.
Satheesaran, this crash happens due to a race in the way we free up the resources associated with a brick when it is being stopped, so there is no deterministic way to recreate the issue. That said, running volume-stop and volume-start in quick succession increases the chance of the race surfacing. With the fix, you shouldn't see the crash. Unfortunately, you can only build confidence that the current implementation is race-free by repeated execution.
Here is what I am running for the reproducer:

#!/bin/bash
VOLUME_NAME=testvol      # note: no leading '$' when assigning a variable
gluster volume start $VOLUME_NAME
for number in `seq 1 10000`
do
    gluster volume stop $VOLUME_NAME
    if [ $? -ne 0 ]; then
        echo "There was a problem stopping the volume"
        break
    else
        gluster volume start $VOLUME_NAME
    fi
done
/me forgot the --mode=script above (corrected loop sketched below). I will do 10,000 iterations of stop/start and see where we stand. FYI, the systems are set up to email the storage QEs if a crash occurs. Are 10,000 iterations enough, or is there anything else you wanted for verification?
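A sketch of the reproducer with the forgotten flag folded in; --mode=script makes the gluster CLI non-interactive, so "volume stop" doesn't block waiting for a y/n confirmation:

#!/bin/bash
# Same loop as above, with --mode=script on every gluster invocation
# so the CLI never prompts for confirmation.
VOLUME_NAME=testvol
gluster --mode=script volume start $VOLUME_NAME
for number in `seq 1 10000`
do
    gluster --mode=script volume stop $VOLUME_NAME
    if [ $? -ne 0 ]; then
        echo "There was a problem stopping the volume"
        break
    fi
    gluster --mode=script volume start $VOLUME_NAME
done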
Krishnan, I have removed the NEEDINFO on you, as you have provided a way to verify this bug, but I am raising it again for the question raised by Ben Turner in comment 6.
Ben, I think 10,000 iterations would be a good test, but there is no deterministic way to confirm that 10,000 iterations are enough; runtime race detection is only best-effort with the tools we have today. During development we tested this by running the volume-start and volume-stop commands in a loop for a couple of hours and didn't observe any crashes, which increased our confidence in the fix. (A time-bounded variant of the loop is sketched below.)
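If you'd rather bound the soak by wall-clock time than by iteration count, a hypothetical variant of the same loop (the two-hour duration is an arbitrary choice, mirroring the "couple of hours" used during development):

#!/bin/bash
# Time-bounded variant of the reproducer. SECONDS is bash's built-in
# counter of seconds since the shell started.
VOLUME_NAME=testvol
END=$((SECONDS + 2 * 60 * 60))
gluster --mode=script volume start $VOLUME_NAME
while [ $SECONDS -lt $END ]
do
    gluster --mode=script volume stop $VOLUME_NAME || break
    gluster --mode=script volume start $VOLUME_NAME
done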
Been running for one day with no problems. /me should have printed the iteration number...
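For anyone rerunning the reproducer, a one-line tweak (sketch) inside the loop body would make progress visible:

    echo "iteration $number: $(date)"    # print the loop counter each pass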
Made it through all 10,000 iterations without a crash. Marking as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html