Bug 763798 (GLUSTER-2066)

Summary: glusterd crashed while trying to restore volumes
Product: [Community] GlusterFS
Reporter: Raghavendra G <raghavendra>
Component: glusterd
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Version: mainline
CC: gluster-bugs, rabhat, vijay
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Attachments:
cmd_log_history (flags: none)

Description Raghavendra G 2010-11-08 22:53:44 EST
Created attachment 373
Comment 1 Raghavendra G 2010-11-09 01:52:42 EST
It crashed while trying to restore a brick named ":". Below are the contents of the file:

raghu@booradley:/etc/glusterd/vols/local/bricks$ cat /etc/glusterd/vols/local/bricks/:
hostname=
path=
listen-port=0
hostname=
path=
listen-port=0

I've attached .cmd_log_history.

Below is the backtrace:
(gdb) bt
#0  0xb7d65490 in strncpy () from /lib/libc.so.6
#1  0xb6945fc6 in glusterd_store_retrieve_bricks (volinfo=0x8084c68)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-store.c:961
#2  0xb6946760 in glusterd_store_retrieve_volume (volname=0x807aebb "local")
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-store.c:1108
#3  0xb6946a13 in glusterd_store_retrieve_volumes (this=0x8076808)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-store.c:1153
#4  0xb6947dfd in glusterd_restore () at ../../../../../xlators/mgmt/glusterd/src/glusterd-store.c:1536
#5  0xb690f705 in init (this=0x8076808) at ../../../../../xlators/mgmt/glusterd/src/glusterd.c:404
#6  0xb7e994fa in __xlator_init (xl=0x8076808) at ../../../libglusterfs/src/xlator.c:875
#7  0xb7e9960a in xlator_init (xl=0x8076808) at ../../../libglusterfs/src/xlator.c:903
#8  0xb7ec67b9 in glusterfs_graph_init (graph=0x80725e0) at ../../../libglusterfs/src/graph.c:328
#9  0xb7ec6cb3 in glusterfs_graph_activate (graph=0x80725e0, ctx=0x8071008)
    at ../../../libglusterfs/src/graph.c:491
#10 0x0804d07f in glusterfs_process_volfp (ctx=0x8071008, fp=0x80723c8)
    at ../../../glusterfsd/src/glusterfsd.c:1316
#11 0x0804d1ab in glusterfs_volumes_init (ctx=0x8071008) at ../../../glusterfsd/src/glusterfsd.c:1362
#12 0x0804d2ad in main (argc=2, argv=0xbfab3464) at ../../../glusterfsd/src/glusterfsd.c:1407
(gdb) f 1
#1  0xb6945fc6 in glusterd_store_retrieve_bricks (volinfo=0x8084c68)
    at ../../../../../xlators/mgmt/glusterd/src/glusterd-store.c:961
961                                     strncpy (brickinfo->hostname, value, 1024);
(gdb) p value
$16 = 0x0
(gdb) p key
$17 = 0x8086718 "hostname"

raghu@booradley:~/work/gluster@sv.gnu.org/git/current/glusterfs.git/build$ cat /etc/hosts
#
# hosts         This file describes a number of hostname-to-address
#               mappings for the TCP/IP subsystem.  It is mostly
#               used at boot time, when no name servers are running.
#               On small systems, this file can be used instead of a
#               "named" name server.  Just add the names, addresses
#               and any aliases to this file...
#
# By the way, Arnt Gulbrandsen <agulbra@nvg.unit.no> says that 127.0.0.1
# should NEVER be named with the name of the machine.  It causes problems
# for some (stupid) programs, irc and reputedly talk. :^)
#

# For loopbacking.
127.0.0.1               localhost
127.0.0.1               booradley
#192.168.1.13           #booradley.zillionresearch.com booradley
192.168.1.201           n1
192.168.1.202           n2
192.168.1.203           n3
192.168.1.204           n4
Comment 2 Anand Avati 2011-02-22 02:11:51 EST
PATCH: http://patches.gluster.com/patch/6224 in master (mgmt/glusterd: In store-retrieve exit with error message instead of crashing.)
Comment 3 Pranith Kumar K 2011-03-11 02:48:16 EST
(In reply to comment #2)
> PATCH: http://patches.gluster.com/patch/6224 in master (mgmt/glusterd: In
> store-retrieve exit with error message instead of crashing.)

An intermediate fix that handled this crash already went in with the fix for 2271. I made it a little more robust: if any of the entries in any of the stores/, or the files themselves, are missing, glusterd should print the error and exit.
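
The idea described here (report an error for a missing store entry instead of passing a NULL value to strncpy, as seen in frame #1 of the backtrace) can be sketched roughly as below. This is not the actual patch: brick_entry, split_store_line, copy_field and load_brick_line are hypothetical names made up for illustration, and the real glusterd store parser may be structured quite differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BRICK_FIELD_MAX 1024 /* same size as the strncpy length in the backtrace */

/* Hypothetical stand-in for a brick record; not the real glusterd type. */
struct brick_entry {
    char hostname[BRICK_FIELD_MAX];
    char path[BRICK_FIELD_MAX];
};

/*
 * Split "key=value" in place. Returns 0 and sets *key and *value on success,
 * or -1 if the line has no '=', an empty key, or an empty value.
 */
static int split_store_line(char *line, char **key, char **value)
{
    char *eq = strchr(line, '=');

    if (eq == NULL || eq == line)
        return -1;
    *eq = '\0';
    /* Cut the value at the first newline, if there is one. */
    eq[1 + strcspn(eq + 1, "\r\n")] = '\0';
    if (eq[1] == '\0')
        return -1; /* e.g. "hostname=" as in the bricks file above */
    *key = line;
    *value = eq + 1;
    return 0;
}

/* Copy src into a fixed-size field; refuse NULL or oversized input. */
static int copy_field(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && dst_size > 0);
    if (src == NULL || strlen(src) >= dst_size)
        return -1;
    memcpy(dst, src, strlen(src) + 1); /* includes the terminating NUL */
    return 0;
}

/* Parse one store line into the brick record; -1 means "log and bail out". */
int load_brick_line(struct brick_entry *brick, char *line)
{
    char *key = NULL;
    char *value = NULL;

    if (split_store_line(line, &key, &value) != 0) {
        fprintf(stderr, "brick store: entry with missing key or value\n");
        return -1;
    }
    if (strcmp(key, "hostname") == 0)
        return copy_field(brick->hostname, sizeof(brick->hostname), value);
    if (strcmp(key, "path") == 0)
        return copy_field(brick->path, sizeof(brick->path), value);
    return 0; /* other keys such as listen-port are ignored in this sketch */
}
```

In this sketch a caller would treat a -1 return as fatal for the restore and exit with the logged message, which mirrors the behaviour described above without claiming to match the committed code.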
Comment 4 Raghavendra Bhat 2011-03-11 03:07:43 EST
Probed a peer, stopped glusterd, and then removed the entry of the other peer from the /etc/glusterd/peers directory. Then started glusterd again. It logs the error message.

[2011-03-11 16:35:13.165866] D [glusterd-store.c:1610:glusterd_store_retrieve_peers] 0-: Returning with 0
[2011-03-11 16:35:13.165882] D [glusterd-store.c:1640:glusterd_resolve_all_bricks] 0-: Returning with 0
[2011-03-11 16:35:13.165896] D [glusterd-store.c:1667:glusterd_restore] 0-: Returning 0
Given volfile:
+------------------------------------------------------------------------------+
  1: volume management
  2:     type mgmt/glusterd
  3:     option working-directory /etc/glusterd
  4:     option transport-type socket,rdma
  5:     option transport.socket.keepalive-time 10
  6:     option transport.socket.keepalive-interval 2
  7: end-volume
  8: 

+------------------------------------------------------------------------------+
[2011-03-11 16:35:13.214601] I [glusterd-handler.c:2611:glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: eaae880d-fa3d-4ba9-a53d-417323598df0
[2011-03-11 16:35:13.214682] I [glusterd-handler.c:379:glusterd_friend_find] 0-glusterd: Unable to find peer by uuid
[2011-03-11 16:35:13.248038] I [glusterd-handler.c:391:glusterd_friend_find] 0-glusterd: Unable to find hostname: 192.168.1.104
[2011-03-11 16:35:13.248298] I [glusterd-handler.c:3267:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 192.168.1.104 (24007), ret: 0