765373 – (GLUSTER-3641) [glusterfs-3.3.0qa11]: glustershd crashed

Summary: [glusterfs-3.3.0qa11]: glustershd crashed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	GLUSTER-3641
Product:	GlusterFS
Classification:	Community
Component:	replicate
Sub Component:
Version:	pre-release
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Pranith Kumar K
QA Contact:
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	GLUSTER-3083
Depends On:
Blocks:	817967

Reported:	2011-09-27 06:58 UTC by Raghavendra Bhat
Modified:	2015-12-01 16:45 UTC
CC List:	2 users
Fixed In Version:	glusterfs-3.4.0
Clone Of:
Environment:
Last Closed:	2013-07-24 17:16:52 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:	glusterfs-3.3.0qa40
Embargoed:
Dependent Products:

Description Raghavendra Bhat 2011-09-27 06:58:26 UTC

glustershd crashed because every entry of fresh_children was -1.

Setup:

Same setup as bugs 3637 and 3639: a 2-replica volume with 1 fuse client and 1 nfs client.
Ran the following tests:

On the fuse client: kernel untar in a while loop.
On the nfs client: rm -rf of the untarred kernel.

Killed one of the bricks, slept, and brought the brick back up. On the other machine, volume set was running in a loop.

This is the backtrace of the core.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /etc/'.
Program terminated with signal 6, Aborted.
#0  0x00000030b8e30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00000030b8e30265 in raise () from /lib64/libc.so.6
#1  0x00000030b8e31d10 in abort () from /lib64/libc.so.6
#2  0x00000030b8e296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaacf2ba12 in afr_inode_set_read_ctx (this=0x15659800, inode=0x2aaab29e3b38, read_child=1, fresh_children=0x1570a380)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:419
#4  0x00002aaaacf06dbd in afr_sh_inode_set_read_ctx (sh=0x156fb7e8, this=0x15659800)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-data.c:642
#5  0x00002aaaacf070e4 in afr_sh_data_fix (frame=0x2b771b93cad8, this=0x15659800)
    at ../../../../../xlators/cluster/afr/src/afr-self-heal-data.c:705
#6  0x00002aaaacf079b0 in afr_sh_data_fstat_cbk (frame=0x2b771b93cad8, cookie=0x1, this=0x15659800, op_ret=-1, op_errno=107, 
    buf=0x7fff6d689d70) at ../../../../../xlators/cluster/afr/src/afr-self-heal-data.c:885
#7  0x00002aaaaccb5248 in client3_1_fstat_cbk (req=0x2aaaad36c534, iov=0x7fff6d689ee0, count=1, myframe=0x2b771b6b2790)
    at ../../../../../xlators/protocol/client/src/client3_1-fops.c:1198
#8  0x00002b771a9f6410 in saved_frames_unwind (saved_frames=0x15666290) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:385
#9  0x00002b771a9f652f in saved_frames_destroy (frames=0x15666290) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:403
#10 0x00002b771a9f6a03 in rpc_clnt_connection_cleanup (conn=0x1565fa60) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:559
#11 0x00002b771a9f7455 in rpc_clnt_notify (trans=0x1565fd60, mydata=0x1565fa60, event=RPC_TRANSPORT_DISCONNECT, data=0x1565fd60)
    at ../../../../rpc/rpc-lib/src/rpc-clnt.c:863
#12 0x00002b771a9f39f3 in rpc_transport_notify (this=0x1565fd60, event=RPC_TRANSPORT_DISCONNECT, data=0x1565fd60)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#13 0x00002aaaaab59006 in socket_event_poll_err (this=0x1565fd60) at ../../../../../rpc/rpc-transport/socket/src/socket.c:694
#14 0x00002aaaaab5d47c in socket_event_handler (fd=27, idx=20, data=0x1565fd60, poll_in=1, poll_out=0, poll_err=24)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1797
#15 0x00002b771a79f84c in event_dispatch_epoll_handler (event_pool=0x1564c960, events=0x156513f0, i=0)
    at ../../../libglusterfs/src/event.c:794
#16 0x00002b771a79fa51 in event_dispatch_epoll (event_pool=0x1564c960) at ../../../libglusterfs/src/event.c:856
#17 0x00002b771a79fdab in event_dispatch (event_pool=0x1564c960) at ../../../libglusterfs/src/event.c:956
#18 0x000000000040784d in main (argc=11, argv=0x7fff6d68a518) at ../../../glusterfsd/src/glusterfsd.c:1592
(gdb) f 3
#3  0x00002aaaacf2ba12 in afr_inode_set_read_ctx (this=0x15659800, inode=0x2aaab29e3b38, read_child=1, fresh_children=0x1570a380)
    at ../../../../../xlators/cluster/afr/src/afr-common.c:419
419             GF_ASSERT (afr_is_child_present (fresh_children, priv->child_count,
(gdb) l
414             afr_private_t      *priv  = NULL;
415
416             priv = this->private;
417             GF_ASSERT (read_child >= 0);
418             GF_ASSERT (fresh_children);
419             GF_ASSERT (afr_is_child_present (fresh_children, priv->child_count,
420                                              read_child));
421
422             params.op = AFR_INODE_SET_READ_CTX;
423             params.u.read_ctx.read_child     = read_child;
(gdb) l afr_is_child_present
455     }
456
457     gf_boolean_t
458     afr_is_child_present (int32_t *success_children, int32_t child_count,
459                           int32_t child)
460     {
461             gf_boolean_t             success_child = _gf_false;
462             int                      i = 0;
463
464             GF_ASSERT (child < child_count);
(gdb) 
465
466             for (i = 0; i < child_count; i++) {
467                     if (success_children[i] == -1)
468                             break;
469                     if (child == success_children[i]) {
470                             success_child = _gf_true;
471                             break;
472                     }
473             }
474             return success_child;
(gdb) p fresh_children
$1 = (int32_t *) 0x1570a380
(gdb) p fresh_children[0]
$2 = -1
(gdb) p fresh_children[1]
$3 = -1
(gdb)  p read_child
$4 = 1
(gdb)
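
The gdb values above account for the abort: with both entries of fresh_children equal to -1, the loop in afr_is_child_present breaks on its first iteration and returns false for any read_child, so the GF_ASSERT at afr-common.c:419 fails. Below is a minimal standalone sketch of that condition. The function body follows the listing above; the names is_child_present and core_state_would_pass_assert, the use of bool and assert in place of gf_boolean_t and GF_ASSERT, and the child count of 2 (taken from the 2-replica setup) are assumptions for illustration, not GlusterFS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone copy of the afr_is_child_present logic from the listing.
 * bool and assert stand in for gf_boolean_t and GF_ASSERT. */
bool
is_child_present (const int32_t *success_children, int32_t child_count,
                  int32_t child)
{
        bool    success_child = false;
        int     i = 0;

        assert (child < child_count);

        for (i = 0; i < child_count; i++) {
                if (success_children[i] == -1)
                        break;          /* -1 terminates the list */
                if (child == success_children[i]) {
                        success_child = true;
                        break;
                }
        }
        return success_child;
}

/* Values printed from the core: fresh_children = {-1, -1}, read_child = 1.
 * child_count = 2 is assumed from the 2-replica volume in the report. */
bool
core_state_would_pass_assert (void)
{
        int32_t fresh_children[2] = { -1, -1 };
        int32_t read_child = 1;

        return is_child_present (fresh_children, 2, read_child);
}
```

With the values from the core, the check returns false, which is the condition that makes the assertion in frame 3 abort the process.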

Comment 1 Pranith Kumar K 2011-09-28 07:34:41 UTC

This bug is observed because afr_build_sources does not take valid_children into account when computing the sources.
afr_sh_data_fix should check for errors in fxattrop and fstat, and proceed with the data fix only if there is at least one source and at least one sink.
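
The guard described above might look roughly like the following sketch. It is not the merged patch: sh_data_can_proceed, the sources array (non-zero means the child is a source) and the valid array (non-zero means fxattrop and fstat both succeeded on that child) are hypothetical names chosen for illustration, and treating a failed child as neither source nor sink is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not the actual AFR code: decide whether data
 * self-heal may proceed. sources[i] != 0 marks child i as a source;
 * valid[i] != 0 means fxattrop and fstat both succeeded on child i. */
bool
sh_data_can_proceed (const int32_t *sources, const int32_t *valid,
                     int32_t child_count)
{
        int32_t nsources = 0;
        int32_t nsinks   = 0;
        int32_t i        = 0;

        assert (sources && valid);

        for (i = 0; i < child_count; i++) {
                if (!valid[i])
                        continue;       /* failed child: neither source nor sink */
                if (sources[i])
                        nsources++;
                else
                        nsinks++;
        }
        /* Heal only when there is something to copy from and to. */
        return (nsources >= 1 && nsinks >= 1);
}
```

In the crash above, fstat on child 1 returned op_ret=-1 with op_errno=107, so under these assumptions child 1 would be excluded, no sink would remain, and the heal would be abandoned instead of reaching afr_inode_set_read_ctx.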

Comment 2 Pranith Kumar K 2011-10-19 10:22:38 UTC

*** Bug 3083 has been marked as a duplicate of this bug. ***

Comment 3 Anand Avati 2012-03-28 18:40:54 UTC

CHANGE: http://review.gluster.com/2662 (cluster/afr: Handle afr data self-heal failures gracefully) merged in master by Vijay Bellur (vijay)

Comment 4 Raghavendra Bhat 2012-05-08 12:59:51 UTC

Tested with glusterfs-3.3.0qa40. Repeated the same test: untar and rm -rf of the linux kernel in parallel, while bringing a brick down and up. Also ran the volume heal command and volume set operations.

The self-heal daemon did not crash.
