Bug 1134690
| Summary: | [SNAPSHOT]: glusterd crash while snapshot creation was in progress | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | senaik |
| Component: | snapshot | Assignee: | Avra Sengupta <asengupt> |
| Status: | CLOSED ERRATA | QA Contact: | senaik |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.0 | CC: | asrivast, rhs-bugs, rjoseph, smohan, storage-qa-internal, vagarwal |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | SNAPSHOT | | |
| Fixed In Version: | glusterfs-3.7.0-3.el6rhs | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| | 1138577 (view as bug list) | Environment: | |
| Last Closed: | 2015-07-29 04:35:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1138577, 1202842, 1223636 | | |
Description
senaik
2014-08-28 06:21:37 UTC
I was able to hit the crash, but the bt looks different. Followed the same steps as mentioned in "Steps to Reproduce", along with the additional step of trying to create another volume (which was immediately interrupted) while snapshot creation was in progress, and glusterd crashed immediately.

```
Connection failed. Please check if gluster daemon is operational.

real    0m0.098s
user    0m0.077s
sys     0m0.019s
Connection failed. Please check if gluster daemon is operational.
```

I tried the same steps on a 4 node cluster but did not face the issue. Retried on a 12 node cluster and hit the crash.

bt:

```
#0  __gf_free (free_ptr=0x504e280) at mem-pool.c:252
#1  0x000000380244d694 in mem_put (ptr=0x504e29c) at mem-pool.c:526
#2  0x0000003802808e56 in rpcsvc_submit_generic (req=0x504e29c, proghdr=0xfd25770, hdrcount=<value optimized out>, payload=0x0, payloadcount=0, iobref=0x7ff8d0663820) at rpcsvc.c:1266
#3  0x00000038028092f6 in rpcsvc_error_reply (req=0x504e29c) at rpcsvc.c:1285
#4  0x000000380280936b in rpcsvc_check_and_reply_error (ret=-1, frame=<value optimized out>, opaque=0x504e29c) at rpcsvc.c:547
#5  0x000000380245b9ea in synctask_wrap (old_task=<value optimized out>) at syncop.c:335
#6  0x0000003a26e43bf0 in ?? () from /lib64/libc-2.12.so
#7  0x0000000000000000 in ?? ()
(gdb)
```

The steps are already updated in the Description; nevertheless, reposting the steps:

12 node cluster, 6x2 distributed-replicate volume

1. Fuse and NFS mount the volume and create IO:

```
for i in {101..140}; do dd if=/dev/urandom of=file_"$i" bs=1024M count=1; done
for i in {101..140}; do dd if=/dev/urandom of=file_nfs_"$i" bs=1024M count=1; done
```

   Tried a few snapshot operations like snapshot create, list, delete and restore.

2. Created some more snapshots in a loop. While snapshot creation was in progress, tried to create a new volume with existing bricks (which was immediately interrupted), and glusterd crashed immediately. (The snapshot command failed as it took more than 2 minutes, crossing the CLI timeout, and the remaining snapshots failed with the error "Connection failed. Please check if gluster daemon is operational".)

5.

```
[root@dhcp-8-29-179 ~]# service glusterd status
glusterd dead but pid file exists
```

Following is the initial analysis of the bug:

The crash happens during a network disconnect, when the brick-op handler is sending a response back to the originator node. The response is sent by calling glusterd_submit_reply. This function deletes the req object upon completion; it returns an error code (-1) if it fails to send the reply, but it deletes the req object anyway. The brick-op handler is called from the sync-op framework. On error, the sync-op framework calls rpcsvc_check_and_reply_error, which in turn calls glusterd_submit_reply to send the error back to the originator, which deletes the req object again. Therefore, if a network disconnect happens during a brick-op handler, it can lead to a double deletion of the req object and hence the crash.

Version: glusterfs-3.7.1-11.el6rhs.x86_64

Retried the steps as mentioned in the Description on an 8 node cluster and did not observe any crash. Marking the bug 'verified'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html
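To make the analysis above easier to follow, here is a minimal C sketch of the double-free pattern it describes. This is not glusterd code: fake_req_t, submit_reply, brickop_handler and check_and_reply_error are hypothetical stand-ins for rpcsvc_request_t, glusterd_submit_reply, the glusterd brick-op handler and rpcsvc_check_and_reply_error, and the transport_up flag models the network disconnect. Running it typically aborts inside the allocator with a double-free error, mirroring the crash seen in __gf_free/mem_put in the backtrace.

```c
/*
 * Sketch of the double-free described in the analysis above.
 * All names are simplified stand-ins, not real glusterfs APIs.
 */
#include <stdlib.h>

typedef struct {
    int id;
} fake_req_t;

/* Stand-in for glusterd_submit_reply(): frees the request even when it
 * fails to send the reply, and reports the failure with -1. */
static int
submit_reply(fake_req_t *req, int transport_up)
{
    int ret = transport_up ? 0 : -1;   /* -1: reply could not be sent   */
    free(req);                         /* ...but req is freed anyway    */
    return ret;
}

/* Stand-in for the brick-op handler running under the sync-op framework:
 * it replies itself, so on a disconnect it returns -1 after freeing req. */
static int
brickop_handler(fake_req_t *req, int transport_up)
{
    return submit_reply(req, transport_up);
}

/* Stand-in for rpcsvc_check_and_reply_error(): on error it tries to send
 * an error reply, which frees the same request a second time. */
static void
check_and_reply_error(int ret, fake_req_t *req, int transport_up)
{
    if (ret != 0)
        submit_reply(req, transport_up);   /* second free of req */
}

int
main(void)
{
    fake_req_t *req = calloc(1, sizeof(*req));
    int transport_up = 0;              /* simulate the network disconnect */

    int ret = brickop_handler(req, transport_up);
    check_and_reply_error(ret, req, transport_up);  /* double free -> crash */
    return 0;
}
```

One way out of this pattern is to give the request a single owner: once the reply path has consumed (freed) the request, the error should not be propagated in a way that triggers a second reply, or the error path must be able to tell that a reply was already attempted. The actual change shipped in glusterfs-3.7.0-3.el6rhs may differ from this sketch.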