Bug 858487

Summary: [8b6534031ab9b60da293e9c2ffb95141d714f973]: glusterfs server crashed due to memory corruption
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Scott Haines <shaines>
Component: glusterfs
Assignee: Raghavendra Bhat <rabhat>
Status: CLOSED CURRENTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: high
Version: 2.0
CC: gluster-bugs, pkarampu, rabhat, rhs-bugs, rmainz, rwheeler, vbellur
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0qa8
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 852820
Environment:
Last Closed: 2015-08-10 07:43:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 817468, 852820
Bug Blocks:

Comment 2 Pranith Kumar K 2012-11-21 10:09:40 UTC
Got the same crash while running https://github.com/pranithk/gluster-tests/blob/master/afr/self-heal.sh

Core was generated by `/usr/local/sbin/glusterfsd -s localhost --volfile-id vol.pranithk-laptop.tmp-0'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f4ad769841e in rpcsvc_get_listener (svc=0x1ddd9e0, port=65535, trans=0x1dee6e0) at rpcsvc.c:1238
1238	                                if (listener->trans == trans) {
Missing separate debuginfos, use: debuginfo-install glibc-2.15-37.fc17.x86_64 keyutils-libs-1.5.5-2.fc17.x86_64 krb5-libs-1.10-5.fc17.x86_64 libaio-0.3.109-5.fc17.x86_64 libcom_err-1.42-4.fc17.x86_64 libgcc-4.7.0-5.fc17.x86_64 libselinux-2.1.10-3.fc17.x86_64 openssl-1.0.0j-1.fc17.x86_64 zlib-1.2.5-6.fc17.x86_64
(gdb) bt
#0  0x00007f4ad769841e in rpcsvc_get_listener (svc=0x1ddd9e0, port=65535, trans=0x1dee6e0) at rpcsvc.c:1238
#1  0x00007f4ad769715e in rpcsvc_notify (trans=0x1dfb070, mydata=0x1ddd9e0, event=RPC_TRANSPORT_CLEANUP, data=0x0) at rpcsvc.c:651
#2  0x00007f4ad769be1c in rpc_transport_unref (this=0x1dfb070) at rpc-transport.c:476
#3  0x00007f4ad4729f90 in socket_event_handler (fd=7, idx=3, data=0x1dfb070, poll_in=1, poll_out=0, poll_err=16) at socket.c:2106
#4  0x00007f4ad7915806 in event_dispatch_epoll_handler (event_pool=0x1ddcea0, events=0x1df9a10, i=1) at event-epoll.c:384
#5  0x00007f4ad79159eb in event_dispatch_epoll (event_pool=0x1ddcea0) at event-epoll.c:445
#6  0x00007f4ad78ed40d in event_dispatch (event_pool=0x1ddcea0) at event.c:113
#7  0x000000000040857f in main (argc=19, argv=0x7fff46640108) at glusterfsd.c:1893
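
For context, the faulting line (rpcsvc.c:1238) compares each registered listener's transport against the transport being cleaned up. Below is a minimal, compilable sketch of that kind of lookup; the types, field names and list layout are assumptions for illustration only, not the actual glusterfs source, but they show why walking the listener list after a node (or the transport it points at) has been freed dereferences poisoned memory at exactly this comparison.

/* Minimal sketch (assumed types) of the lookup that faults at rpcsvc.c:1238.
 * Not glusterfs code. */
#include <stdio.h>

typedef struct transport {
        int fd;
} transport_t;

typedef struct listener {
        transport_t     *trans;   /* transport this listener owns */
        struct listener *next;    /* illustrative list link */
} listener_t;

typedef struct {
        listener_t *listeners;
} rpcsvc_t;

/* Walk the listener list looking for the listener that owns 'trans'.
 * If a node in this list (or the transport it points at) has already
 * been freed by another code path, the 'listener->trans == trans'
 * comparison reads freed memory -- the SIGSEGV seen in frame #0 above. */
static listener_t *
get_listener (rpcsvc_t *svc, transport_t *trans)
{
        listener_t *listener = svc->listeners;

        while (listener != NULL) {
                if (listener->trans == trans)   /* cf. rpcsvc.c:1238 */
                        return listener;
                listener = listener->next;
        }
        return NULL;
}

int
main (void)
{
        transport_t t   = { .fd = 7 };
        listener_t  l   = { .trans = &t, .next = NULL };
        rpcsvc_t    svc = { .listeners = &l };

        printf ("listener found: %s\n", get_listener (&svc, &t) ? "yes" : "no");
        return 0;
}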

Comment 3 Raghavendra Bhat 2012-12-04 09:43:07 UTC
http://review.gluster.org/4230 : the issue has been root-caused to a race between freeing the RPC object and handling the socket's POLL_ERR event.

It may take a couple of review iterations before the patch is accepted; once it is accepted upstream, it will be backported to 2.0.z.
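
The race described above is the classic free-versus-use pattern: one path tears down the RPC object while the poller thread, reacting to POLL_ERR on the socket, is still walking the same structures during cleanup notification. A generic two-thread sketch of that hazard class follows; all names are illustrative and this is not glusterfs code. Building it with -fsanitize=address (or running it under valgrind) makes the use-after-free visible.

/* Illustration of the race class only: one thread frees the listener while
 * the other, handling POLL_ERR, still dereferences it. Assumed types. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
        int port;
} listener_t;

typedef struct {
        listener_t *listener;   /* owned by the "RPC" side */
} transport_t;

/* Thread A: RPC-side teardown frees the listener. */
static void *
rpc_teardown (void *arg)
{
        transport_t *trans = arg;

        free (trans->listener);   /* listener memory is gone...             */
        trans->listener = NULL;   /* ...but the poller may already hold the
                                     old pointer; the two threads are not
                                     synchronised, on purpose.              */
        return NULL;
}

/* Thread B: the poller sees POLL_ERR and runs cleanup, still reading
 * through the listener pointer it loaded before the free happened. */
static void *
poll_err_cleanup (void *arg)
{
        transport_t *trans = arg;
        listener_t  *listener = trans->listener;   /* races with the free */

        usleep (1000);                              /* widen the window    */
        if (listener != NULL)
                printf ("cleaning up listener on port %d\n",
                        listener->port);            /* use-after-free      */
        return NULL;
}

int
main (void)
{
        transport_t trans = { .listener = malloc (sizeof (listener_t)) };
        pthread_t   a, b;

        if (trans.listener == NULL)
                return 1;
        trans.listener->port = 65535;

        pthread_create (&b, NULL, poll_err_cleanup, &trans);
        pthread_create (&a, NULL, rpc_teardown, &trans);
        pthread_join (a, NULL);
        pthread_join (b, NULL);
        return 0;
}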

Comment 4 Vijay Bellur 2012-12-05 00:28:15 UTC
CHANGE: http://review.gluster.org/4230 (rpc: check the ctx->listener before accessing rpcsvc object) merged in master by Anand Avati (avati)
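
Judging by the change title, the guard is a validity check on the transport's listener pointer before the rpcsvc object is dereferenced during cleanup. A hedged sketch of that pattern is below; the types, field names and the helper function are assumptions for illustration, not the actual patch.

/* Sketch of the guard pattern named in the change title: check the
 * transport's listener before touching the rpcsvc object. Assumed types. */
#include <stdio.h>

typedef struct rpcsvc {
        int nprogs;
} rpcsvc_t;

typedef struct listener {
        rpcsvc_t *svc;
} listener_t;

typedef struct transport {
        listener_t *listener;   /* cleared when the RPC side detaches */
} transport_t;

/* Resolve the rpcsvc object only while the listener is still attached;
 * otherwise bail out instead of walking freed state during cleanup. */
static rpcsvc_t *
rpcsvc_from_transport (transport_t *trans)
{
        if (trans == NULL || trans->listener == NULL)
                return NULL;
        return trans->listener->svc;
}

int
main (void)
{
        transport_t detached = { .listener = NULL };

        if (rpcsvc_from_transport (&detached) == NULL)
                printf ("listener already detached, skipping cleanup\n");
        return 0;
}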

Comment 6 Raghavendra Bhat 2013-01-31 10:30:17 UTC
https://code.engineering.redhat.com/gerrit/#/c/1763/ has been submitted for review.

Comment 7 SATHEESARAN 2013-08-13 09:50:18 UTC
Verified with glusterfs-3.4.0.18rhs-1

Test steps
==========

1. Ran the test case - https://tcms.engineering.redhat.com/case/243098/?from_plan=7656

2. Created a distributed volume with 3 bricks and FUSE mounted it on 2 clients

From the clients, ran intense I/O (creating multiple files using the dd command); on the RHS server side, kickstarted continuous graph changes (by setting/unsetting a volume option, stat-prefetch on/off for instance) while collecting sosreports from the RHS machines.

No crashes or errors were observed.