Created attachment 682207 [details]
mnt and brick logs

Description of problem:
Brick process of a distributed-replicate volume crashed.

Version-Release number of selected component (if applicable):
[root@rhs-gp-srv9 core]# rpm -qa | grep gluster
glusterfs-fuse-3.3.0.5rhs-40.el6rhs.x86_64
vdsm-gluster-4.9.6-17.el6rhs.noarch
gluster-swift-plugin-1.0-5.noarch
gluster-swift-account-1.4.8-4.el6.noarch
glusterfs-3.3.0.5rhs-40.el6rhs.x86_64
gluster-swift-doc-1.4.8-4.el6.noarch
glusterfs-debuginfo-3.3.0.5rhs-40.el6rhs.x86_64
glusterfs-server-3.3.0.5rhs-40.el6rhs.x86_64
glusterfs-rdma-3.3.0.5rhs-40.el6rhs.x86_64
gluster-swift-object-1.4.8-4.el6.noarch
gluster-swift-container-1.4.8-4.el6.noarch
org.apache.hadoop.fs.glusterfs-glusterfs-0.20.2_0.2-1.noarch
gluster-swift-1.4.8-4.el6.noarch
gluster-swift-proxy-1.4.8-4.el6.noarch
glusterfs-geo-replication-3.3.0.5rhs-40.el6rhs.x86_64

How reproducible:

Steps to Reproduce:
1. Created a distributed-replicate volume of 2x2 configuration.
2. The volume was used as a VM store; around 5 VMs were created on it.

Actual results:
After some time, one of the bricks of a replica pair crashed.

Expected results:

Additional info:
Core was generated by `/usr/sbin/glusterfsd -s localhost --volfile-id dis-rep.rhs-gp-srv9.lab.eng.blr.'.
Program terminated with signal 11, Segmentation fault.
#0  server_alloc_frame (req=0x7ff23b06402c) at server-helpers.c:774
774             state->itable = conn->bound_xl->itable;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.12.x86_64 libaio-0.3.107-10.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  server_alloc_frame (req=0x7ff23b06402c) at server-helpers.c:774
#1  get_frame_from_request (req=0x7ff23b06402c) at server-helpers.c:799
#2  0x00007ff23b7acab8 in server_lookup (req=0x7ff23b06402c) at server3_1-fops.c:5540
#3  0x00007ff24582b443 in rpcsvc_handle_rpc_call (svc=0x1bf0030, trans=<value optimized out>, msg=<value optimized out>) at rpcsvc.c:513
#4  0x00007ff24582b5b3 in rpcsvc_notify (trans=0x1c05900, mydata=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpcsvc.c:612
#5  0x00007ff24582c018 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:489
#6  0x00007ff24177e954 in socket_event_poll_in (this=0x1c05900) at socket.c:1677
#7  0x00007ff24177ea37 in socket_event_handler (fd=<value optimized out>, idx=4, data=0x1c05900, poll_in=1, poll_out=0, poll_err=<value optimized out>) at socket.c:1792
#8  0x00007ff245a76d34 in event_dispatch_epoll_handler (event_pool=0x1bc7e10) at event.c:785
#9  event_dispatch_epoll (event_pool=0x1bc7e10) at event.c:847
#10 0x00000000004076b0 in main (argc=<value optimized out>, argv=0x7ffffc7b7858) at glusterfsd.c:1782
(gdb) l
769             state = GF_CALLOC (1, sizeof (*state), gf_server_mt_state_t);
770             if (!state)
771                     goto out;
772
773             if (conn->bound_xl)
774                     state->itable = conn->bound_xl->itable;
775
776             state->xprt = rpc_transport_ref (req->trans);
777             state->conn = conn;
778
(gdb) p state->itable
$1 = (inode_table_t *) 0x0
==========================================
[root@rhs-gp-srv6 logs]# gluster v info

Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 0754067e-bc25-41c7-aa3c-2d3ef2a0f94c
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: rhs-gp-srv6.lab.eng.blr.redhat.com:/brick1
Brick2: rhs-gp-srv9.lab.eng.blr.redhat.com:/brick1   ===> this is the crashed brick
Brick3: rhs-gp-srv6.lab.eng.blr.redhat.com:/brick2
Brick4: rhs-gp-srv9.lab.eng.blr.redhat.com:/brick2
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
cluster.subvols-per-directory: 1
cluster.eager-lock: enable
storage.linux-aio: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
==================================================
Method to reproduce:
1) Have two machines peer probed (say A and B). Create a simple single-brick volume with the brick on one of the machines (say A; assume the brick process runs on port 24009). Machine B will be running the NFS server process for the volume.
2) Kill all the gluster processes on machine A and wipe out /var/lib/glusterd. Start glusterd on machine A and create a new volume on machine A (a single-brick volume is best, as it is easy to verify). Start the volume, making sure the brick process uses the same port as before (24009 in this example). If it uses a different port, repeat step 2 until the brick uses the same port as in step 1.
3) Kill the NFS server process running on machine B. This crashes the brick process running on machine A.

Reason:
In server_setvolume, req->trans->xl_private is set to conn at the beginning. Later, if getting bound_xl via get_xlator_by_name fails, the connection is shut down by calling server_connection_put, which frees the conn object. But transport->xl_private is still pointing to the freed conn. Now, when a DISCONNECT is received in server_rpc_notify on the corresponding transport (whose xl_private still points to the freed conn), xl_private is accessed and the process segfaults. In the above case, the transport and the conn object correspond to machine B's NFS server process.
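The lifecycle described above can be sketched in a few lines of C. This is a minimal illustration of the dangling-pointer pattern and of the fix that the patch takes (clearing the transport's back-pointer when the connection is released); the type and function names here (connection_t, transport_t, connection_put, etc.) are hypothetical stand-ins, not the actual glusterfs API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the server connection object. */
typedef struct connection {
    int refcount;
} connection_t;

/* Hypothetical stand-in for the rpc transport; xl_private is the
 * back-pointer that server_setvolume sets to the connection. */
typedef struct transport {
    connection_t *xl_private;
} transport_t;

/* Drop a reference; free the connection when the count hits zero. */
static void connection_put(connection_t *conn) {
    if (--conn->refcount == 0)
        free(conn);
}

/* Buggy teardown (pre-fix behavior): the conn is freed, but
 * trans->xl_private still points at the freed memory, so a later
 * DISCONNECT on this transport dereferences garbage. */
static void teardown_buggy(transport_t *trans) {
    connection_put(trans->xl_private);
    /* trans->xl_private is now dangling */
}

/* Fixed teardown, mirroring the merged patch: clear the transport's
 * back-pointer so later events see NULL instead of a stale conn. */
static void teardown_fixed(transport_t *trans) {
    connection_put(trans->xl_private);
    trans->xl_private = NULL;
}

/* Disconnect handler with the NULL guard the fix makes possible.
 * Returns 0 if the connection was already torn down, 1 otherwise. */
static int handle_disconnect(transport_t *trans) {
    if (trans->xl_private == NULL)
        return 0; /* nothing to clean up; avoids the segfault */
    /* ... cleanup that dereferences trans->xl_private ... */
    return 1;
}
```

With teardown_fixed, the later DISCONNECT path sees a NULL xl_private and bails out safely, instead of dereferencing freed memory as in the backtrace above.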
http://review.gluster.org/#change,4411 has been submitted for review.
*** Bug 901472 has been marked as a duplicate of this bug. ***
CHANGE: http://review.gluster.org/4411 ( protocol/server: upon server_connection_put, set xl_private of the transport to NULL) merged in master by Anand Avati (avati)
Verified on 3.3.0.5rhs-43.