glusterd crashed on startup with the backtrace below.

Operations performed:
1) With some previous git head, created a stripe-replicate volume:

   gluster volume create dsr replica 3 stripe 2 10.1.12.172:/export/dsr1 10.1.12.172:/export/dsr2 10.1.12.170:/export/dsr1 10.1.12.170:/export/dsr2 10.1.12.173:/export/dsr1 10.1.12.173:/export/dsr2

2) After some operations and testing, did a git pull; when glusterd was started again, it crashed on one of the machines.

#0  gf_print_trace (signum=58) at common-utils.c:347
#1  <signal handler called>
#2  0x00002aaaabfe49ce in client_graph_builder (graph=0x7fffffff9e70, volinfo=0x641330, set_dict=0x63bf40, param=0x0) at glusterd-volgen.c:1610
#3  0x00002aaaabfe33f3 in build_graph_generic (graph=0x7fffffff9e70, volinfo=0x641330, mod_dict=0x63aaf0, param=0x0, builder=0x2aaaabfe444a <client_graph_builder>) at glusterd-volgen.c:1113
#4  0x00002aaaabfe4d8f in build_client_graph (graph=0x7fffffff9e70, volinfo=0x641330, mod_dict=0x63aaf0) at glusterd-volgen.c:1711
#5  0x00002aaaabfe56fe in build_nfs_graph (graph=0x7fffffffaf80, mod_dict=0x0) at glusterd-volgen.c:1973
#6  0x00002aaaabfe65d2 in glusterd_create_nfs_volfile () at glusterd-volgen.c:2288
#7  0x00002aaaabfccc18 in glusterd_check_generate_start_nfs () at glusterd-utils.c:2284
#8  0x00002aaaabfcd2ba in glusterd_restart_bricks (conf=0x6379e0) at glusterd-utils.c:2413
#9  0x00002aaaabf8bd2b in init (this=0x6341f0) at glusterd.c:731
#10 0x00002aaaaaac8fe2 in __xlator_init (xl=0x6341f0) at xlator.c:1369
#11 0x00002aaaaaac9103 in xlator_init (xl=0x6341f0) at xlator.c:1392
#12 0x00002aaaaab01e0c in glusterfs_graph_init (graph=0x62fed0) at graph.c:328
#13 0x00002aaaaab02476 in glusterfs_graph_activate (graph=0x62fed0, ctx=0x62e010) at graph.c:501
#14 0x0000000000406e9a in glusterfs_process_volfp (ctx=0x62e010, fp=0x62fbe0) at glusterfsd.c:1423
#15 0x0000000000406fd2 in glusterfs_volumes_init (ctx=0x62e010) at glusterfsd.c:1475
#16 0x00000000004070f7 in main (argc=2, argv=0x7fffffffe438) at glusterfsd.c:1523

(gdb) f 2
#2  0x00002aaaabfe49ce in client_graph_builder (graph=0x7fffffff9e70, volinfo=0x641330, set_dict=0x63bf40, param=0x0) at glusterd-volgen.c:1610
1610            if (i % sub_count == 0) {
(gdb) p i
$17 = 0
(gdb) p sub_count
$18 = 0
(gdb)

This is the /etc/glusterd/vols/dsr/info file on the crashing machine; it says the stripe count is 0:

type=3
count=6
status=1
sub_count=6
stripe_count=0
version=8
transport-type=0
volume-id=a7cfdba5-292d-46e0-ad2f-458798b77253
brick-0=10.1.12.172:-export-dsr1
brick-1=10.1.12.172:-export-dsr2
brick-2=10.1.12.170:-export-dsr1
brick-3=10.1.12.170:-export-dsr2
brick-4=10.1.12.173:-export-dsr1
brick-5=10.1.12.173:-export-dsr2

This is the /etc/glusterd/vols/dsr/info file from the other machine of the cluster; here the stripe count is 2:

type=3
count=6
status=1
sub_count=6
stripe_count=2
version=15
transport-type=0
volume-id=a7cfdba5-292d-46e0-ad2f-458798b77253
brick-0=10.1.12.172:-export-dsr1
brick-1=10.1.12.172:-export-dsr2
brick-2=10.1.12.170:-export-dsr1
brick-3=10.1.12.170:-export-dsr2
brick-4=10.1.12.173:-export-dsr1
brick-5=10.1.12.173:-export-dsr2

In the glusterd_store_retrieve_volume function we do this:

    if (volinfo->stripe_count)
        volinfo->replica_count = (volinfo->sub_count / volinfo->stripe_count);

Since volinfo->stripe_count is zero here, the check fails and volinfo->replica_count is never set, so it remains 0.

In client_graph_builder from volgen we do this:

    case GF_CLUSTER_TYPE_STRIPE_REPLICATE:
        /* Replicate after the clients, then stripe */
        sub_count = volinfo->replica_count;
        cluster_args = replicate_args;
        break;

That makes sub_count zero, and the "i % sub_count" expression then divides by zero, causing the crash. We will have to check volinfo->replica_count for 0 and return before assigning it to sub_count. We also have to investigate why stripe_count was written as zero.
Raghavendra Bhat has sent a patch for this, which should fix the crash. I am keeping this bug open so I can figure out the root cause of why the stripe count was written as '0' in the first place.
PATCH: http://patches.gluster.com/patch/7743 in master (glusterd: check replica_count for 0 before using it for volume creation in stripe replicate volume)
This is now fixed: we check volinfo->replica_count for zero before assigning it to sub_count and performing the division operation.