Bug 1231635
| Summary: | glusterd crashed when testing heal full on replaced disks | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | spandura |
| Component: | glusterd | Assignee: | Anand Nekkunti <anekkunt> |
| Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.1 | CC: | amukherj, anekkunt, asrivast, atalur, kparthas, nlevinki, nsathyan, rcyriac, sasundar, senaik, spandura, ssampat, vagarwal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | GlusterD | | |
| Fixed In Version: | glusterfs-3.7.1-8 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1232693 1233041 (view as bug list) | Environment: | |
| Last Closed: | 2015-07-29 05:02:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1202842, 1232693, 1233041 | | |
Description
spandura
2015-06-15 06:46:55 UTC
Shwetha spoke to me about this before raising this bug. The sos-reports for this beaker run are missing for some unknown reason, so we did not have enough information to confirm the cause of this bug. Shwetha volunteered to run the test again and check whether it fails again; hopefully we can get the sos-reports if it happens again. Until then we cannot proceed further with the bug. Shwetha, could you provide the steps performed to hit this issue?

RCA: the crash is mainly due to rcu_read_unlock() being called from a different thread, i.e. the RCU read lock is taken in one thread and the unlock is issued from another. To avoid this I sent patch http://review.gluster.org/#/c/10285/14, but a few places were still missed. Upstream mainline patch http://review.gluster.org/#/c/11276/ has been posted for review. (A minimal standalone sketch of this lock/unlock rule is included at the end of this report.)

I have observed this crash in my setup too. I cannot say exactly which steps led to the crash, but I was doing the following operations:

1. On a volume that was mounted on a FUSE client, I tried to disable quota and unmount the volume. The `umount` command was stuck and did not complete for a long time.
2. I tried to stop the volume, but the stop request timed out and failed.
3. After the volume stop, I was unable to see volume status; it failed with "Another transaction in progress".
4. The volume was now in an inconsistent state: it was shown as started in the volume info output, but attempts to stop it failed with a "volume not in started state" message.
5. I removed the volume directory from /var/lib/glusterd/vols and tried to restart glusterd. The restart failed and I checked the logs; glusterd had crashed.

From the glusterd logs:

```
[2015-07-03 07:43:07.474874] E [MSGID: 106069] [glusterd-volgen.c:1021:volgen_write_volfile] 0-management: failed to create volfile /var/lib/glusterd/vols/rep/rep-snapd.vol
[2015-07-03 07:43:07.474925] E [MSGID: 106069] [glusterd-snapd-svc.c:254:glusterd_snapdsvc_start] 0-management: Couldn't create snapd volfile for volume: rep
[2015-07-03 07:43:07.475000] E [MSGID: 106113] [glusterd-snapd-svc.c:341:glusterd_snapdsvc_restart] 0-management: Couldn't start snapd for vol: rep
[2015-07-03 07:43:07.562023] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a1b83030-3890-45fc-9489-0815341722a3
[2015-07-03 07:43:08.153177] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-03 07:43:08.304427] W [glusterfsd.c:1219:cleanup_and_exit] (--> 0-: received signum (15), shutting down

pending frames:
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-07-03 07:43:08
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fe6234c8826]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fe6234e83ef]
/lib64/libc.so.6(+0x38072326a0)[0x7fe621e676a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7fe617925de6]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_friend_update+0x701)[0x7fe617ed0bd1]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fe617eb5a0f]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x295)[0x7fe623291ee5]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7fe623292123]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fe623293ad8]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xa255)[0x7fe616557255]
/usr/lib64/glusterfs/3.7.1/rpc-transport/socket.so(+0xbe4d)[0x7fe616558e4d]
/usr/lib64/libglusterfs.so.0(+0x89970)[0x7fe62352c970]
/lib64/libpthread.so.0(+0x3807607a51)[0x7fe6225b3a51]
/lib64/libc.so.6(clone+0x6d)[0x7fe621f1d96d]
```

The patch is available upstream, so moving the bug to POST.

Version: glusterfs-3.7.1-7.el6rhs.x86_64

Faced a similar glusterd crash while probing another node.

```
[2015-07-06 20:53:06.046749] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 3ed85019-8616-4581-8115-22a2554dea26
[2015-07-06 20:53:06.046775] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend
[2015-07-06 20:53:06.048525] I [MSGID: 106492] [glusterd-handler.c:2706:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7c1cd644-75ea-4d2d-b228-09e25827cd45
[2015-07-06 20:53:06.048562] I [MSGID: 106502] [glusterd-handler.c:2751:__glusterd_handle_friend_update] 0-management: Received my uuid as Friend

pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2015-07-06 20:53:06
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.7.1
/usr/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb6)[0x7fd560ce7826]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x33f)[0x7fd560d073ef]
/lib64/libc.so.6(+0x3da84326a0)[0x7fd55f6866a0]
/usr/lib64/liburcu-bp.so.1(rcu_read_unlock_bp+0x16)[0x7fd555144de6]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(+0x58a57)[0x7fd5556f4a57]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x189)[0x7fd5556f4479]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(__glusterd_handle_cli_probe+0x14f)[0x7fd5556f0a6f]
/usr/lib64/glusterfs/3.7.1/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd5556d4a6f]
/usr/lib64/libglusterfs.so.0(synctask_wrap+0x12)[0x7fd560d2f1f2]
/lib64/libc.so.6(+0x3da84438f0)[0x7fd55f6978f0]
```

gdb backtrace:

```
(gdb) bt
#0  0x00007fd555144de6 in rcu_read_unlock_bp () from /usr/lib64/liburcu-bp.so.1
#1  0x00007fd5556f4a57 in glusterd_ac_send_friend_update (event=<value optimized out>, ctx=<value optimized out>) at glusterd-sm.c:592
#2  0x00007fd5556f4479 in glusterd_friend_sm () at glusterd-sm.c:1257
#3  0x00007fd5556f0a6f in __glusterd_handle_cli_probe (req=0x7fd561eb45cc) at glusterd-handler.c:1220
#4  0x00007fd5556d4a6f in glusterd_big_locked_handler (req=0x7fd561eb45cc, actor_fn=0x7fd5556f0920 <__glusterd_handle_cli_probe>) at glusterd-handler.c:83
#5  0x00007fd560d2f1f2 in synctask_wrap (old_task=<value optimized out>) at syncop.c:381
#6  0x00007fd55f6978f0 in ?? () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()
```

Since this issue has been hit at least 3 times as of now, and the fix is available upstream, proposing this issue as a blocker for RHGS 3.1.

As suggested by Dev (Anand Nekkunti), this is a rarely hit race. The probability of hitting it is expected to increase when there are more activated snapshots and a peer probe is performed. I tried a case where snapshot creation was running in one script while peer probe/deprobe was happening concurrently.
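A rough sketch of what such a concurrent workload can look like is below. This is not the actual QA script: the volume name `testvol`, the peer hostname `server2`, and the iteration counts are assumptions, snapshot creation additionally requires thin-provisioned LVM bricks, and the peer being probed/detached is assumed to hold no bricks of the volume so that detach can succeed.

```bash
#!/bin/bash
# Hypothetical reproduction of the verification workload: snapshot creation
# in the background while peers are probed/detached in a loop, exercising the
# friend-update path seen in the backtraces above.

# Background loop: keep creating snapshots of the (assumed) volume testvol.
(
    for i in $(seq 1 100); do
        gluster snapshot create snap_"$i" testvol
    done
) &

# Foreground loop: concurrently probe and detach an (assumed) brick-less peer.
for i in $(seq 1 50); do
    gluster peer probe server2
    gluster peer detach server2
done

wait
```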
With this workload I created 100+ snapshots along with peer probing/deprobing, and no glusterd crash was found. I also tried the disk-replacement procedure and triggered 'heal full' on the replaced disks; that works well and there are no issues. Marking this bug as VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
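As background to the RCA noted earlier (the RCU read lock taken in one thread and rcu_read_unlock() called from another), here is a minimal standalone sketch of the read-side rule involved. It is illustrative only and is not glusterd code; it assumes liburcu-bp is installed and is built with something like `gcc rcu_sketch.c -o rcu_sketch -lurcu-bp -lpthread`.

```c
/*
 * Illustrative only -- not glusterd code. With liburcu-bp, the unprefixed
 * rcu_read_lock()/rcu_read_unlock() map to the _bp variants seen in the
 * backtraces (rcu_read_unlock_bp). The bulletproof flavor registers reader
 * threads lazily, so no explicit rcu_register_thread() call is needed.
 */
#include <urcu-bp.h>
#include <pthread.h>
#include <stdio.h>

static int shared_value = 42;   /* stand-in for RCU-protected data */

static void *reader(void *arg)
{
        (void)arg;

        rcu_read_lock();                  /* correct: lock ...              */
        printf("read %d\n", shared_value);
        rcu_read_unlock();                /* ... and unlock in the SAME thread */

        /*
         * The crash in this bug corresponds to the broken pattern: thread A
         * calls rcu_read_lock() and a different thread later issues the
         * matching rcu_read_unlock(). liburcu keeps the read-side nesting
         * state per thread, so the cross-thread unlock manipulates the wrong
         * (possibly unregistered) reader state and can segfault inside
         * rcu_read_unlock_bp(), which is what the backtraces above show.
         */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, reader, NULL);
        pthread_join(t, NULL);
        return 0;
}
```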