+++ This bug was initially created as a clone of Bug #1097217 +++

Description of problem:
While dbench was running on the mountpoint, I was taking snapshots in a loop. glusterd crashed with a segmentation fault on one of the nodes, and the whole cluster went into an inconsistent state: one peer rejects another, while that other peer still shows the first as connected, and so on.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0-1.0.el6rhs.x86_64

How reproducible:
Hit once in many tries. Have not tried to reproduce again.

Steps to Reproduce:
1. Start running dbench on the mountpoint.
2. Start creating 256 snapshots in a loop.

Actual results:
glusterd crashed on one of the nodes with the following backtrace:

Loaded symbols for /lib64/libgcc_s.so.1
Core was generated by `/usr/sbin/glusterd --pid-file=/var/run/glusterd.pid'.
Program terminated with signal 11, Segmentation fault.
#0  __gf_free (free_ptr=0x7f6c046010c0) at mem-pool.c:252
252             if (!xl->mem_acct.rec) {
Missing separate debuginfos, use: debuginfo-install
device-mapper-event-libs-1.02.79-8.el6.x86_64
device-mapper-libs-1.02.79-8.el6.x86_64 glibc-2.12-1.132.el6.x86_64
keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.6.x86_64
libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64
libselinux-2.0.94-5.3.el6_4.1.x86_64 libsepol-2.0.41-4.el6.x86_64
libudev-147-2.51.el6.x86_64 libxml2-2.7.6-14.el6.x86_64
lvm2-libs-2.02.100-8.el6.x86_64 openssl-1.0.1e-15.el6.x86_64
zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  __gf_free (free_ptr=0x7f6c046010c0) at mem-pool.c:252
#1  0x00007f6c11380825 in glusterd_auth_cleanup (volinfo=0x7f6c04607080)
    at glusterd-utils.c:779
#2  0x00007f6c11389235 in glusterd_volinfo_delete (volinfo=0x7f6c04607080)
    at glusterd-utils.c:889
#3  0x00007f6c113f6e09 in glusterd_snap_volume_remove (rsp_dict=0x7f6c19ecb420,
    snap_vol=0x7f6c04607080, remove_lvm=_gf_true, force=_gf_true)
    at glusterd-snapshot.c:1955
#4  0x00007f6c113f97fc in glusterd_do_snap_vol (origin_vol=0x7f6c04001ce0,
    snap=0x7f6c0460c2c0, dict=<value optimized out>,
    rsp_dict=<value optimized out>, volcount=<value optimized out>)
    at glusterd-snapshot.c:3983
#5  0x00007f6c113fde35 in glusterd_snapshot_create_commit (dict=0x7f6c19ecbd6c,
    op_errstr=<value optimized out>, rsp_dict=0x7f6c19ecb420)
    at glusterd-snapshot.c:4704
#6  0x00007f6c113fe2de in glusterd_snapshot (dict=0x7f6c19ecbd6c,
    op_errstr=0x28bb020, rsp_dict=0x7f6c19ecb420) at glusterd-snapshot.c:5631
#7  0x00007f6c11403d8e in gd_mgmt_v3_commit_fn (op=GD_OP_SNAP,
    dict=0x7f6c19ecbd6c, op_errstr=0x28bb020, rsp_dict=0x7f6c19ecb420)
    at glusterd-mgmt.c:209
#8  0x00007f6c11400843 in glusterd_handle_commit_fn (req=0x7f6c10e08b58)
    at glusterd-mgmt-handler.c:548
#9  0x00007f6c113534ff in glusterd_big_locked_handler (req=0x7f6c10e08b58,
    actor_fn=0x7f6c114005f0 <glusterd_handle_commit_fn>)
    at glusterd-handler.c:80
#10 0x000000359285b5d2 in synctask_wrap (old_task=<value optimized out>)
    at syncop.c:333
#11 0x00000035c1443bf0 in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()
(gdb)

And the whole cluster state was inconsistent:

[root@rhsauto029 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.36.253
Uuid: 3b6a9c1a-f0d1-42cc-9948-e423c7be7922
State: Accepted peer request (Connected)

Hostname: rhsauto030.lab.eng.blr.redhat.com
Uuid: 6453c7d2-6575-4c21-a7c5-d102ae27f643
State: Peer Rejected (Disconnected)

Hostname: rhsauto038.lab.eng.blr.redhat.com
Uuid: 1eef5fa5-1f93-405b-9915-218200d4ef13
State: Peer in Cluster (Disconnected)

And on the other machines, I am not able to execute any gluster command:

[root@rhsauto023 ~]# /etc/init.d/glusterd status
glusterd (pid 10747) is running...
[root@rhsauto023 ~]# gluster peer status
Connection failed. Please check if gluster daemon is operational.

I get a "glusterd not running" message even though glusterd is running. The machines have entered a weird state.

Expected results:
There should be no crashes, and the cluster state should remain consistent.
Additional info:
I have lost access to the machines with the following messages:

[root@rhsauto029 ~]#
Message from syslogd@rhsauto029 at May 13 05:04:17 ...
 kernel:BUG: soft lockup - CPU#1 stuck for 100s! [glusterfsd:20508]

[root@rhsauto023 ~]#
Message from syslogd@rhsauto023 at May 13 05:20:04 ...
 kernel:BUG: soft lockup - CPU#0 stuck for 62s! [glusterfsd:13111]

I have lost access to all the nodes. The logs and the core are on those machines.

--- Additional comment from M S Vishwanath Bhat on 2014-05-13 08:03:51 EDT ---

I am moving the priority to urgent because this makes the whole cluster go down, with the above messages from the kernel.

--- Additional comment from M S Vishwanath Bhat on 2014-05-13 08:40:08 EDT ---

I forced the node off and got access back. I have uploaded the glusterd log.

--- Additional comment from M S Vishwanath Bhat on 2014-05-13 08:46:42 EDT ---

Not sure if this would help, but I see this call trace in /var/log/messages:

May 13 03:36:28 rhsauto038 kernel: Hardware name: KVM
May 13 03:36:28 rhsauto038 kernel: NETDEV WATCHDOG: eth0 (8139cp): transmit queue 0 timed out
May 13 03:36:28 rhsauto038 kernel: Modules linked in: dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c xfs exportfs ipv6 microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ip_tables]
May 13 03:36:28 rhsauto038 kernel: Pid: 26435, comm: glusterfsd Not tainted 2.6.32-431.el6.x86_64 #1
May 13 03:36:28 rhsauto038 kernel: Call Trace:
May 13 03:36:28 rhsauto038 kernel: <IRQ>  [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8147b74b>] ? dev_watchdog+0x26b/0x280
May 13 03:36:28 rhsauto038 kernel: [<ffffffff814b2515>] ? tcp_write_err+0x35/0x70
May 13 03:36:28 rhsauto038 kernel: [<ffffffff81083e75>] ? internal_add_timer+0xb5/0x110
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8147b4e0>] ? dev_watchdog+0x0/0x280
May 13 03:36:28 rhsauto038 kernel: [<ffffffff81084b07>] ? run_timer_softirq+0x197/0x340
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8103ea6c>] ? kvm_clock_read+0x1c/0x20
May 13 03:36:28 rhsauto038 kernel: [<ffffffff810ac8f5>] ? tick_dev_program_event+0x65/0xc0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8107a8e1>] ? __do_softirq+0xc1/0x1e0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff810ac9ca>] ? tick_program_event+0x2a/0x30
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
May 13 03:36:28 rhsauto038 kernel: [<ffffffff815310aa>] ? smp_apic_timer_interrupt+0x4a/0x60
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
May 13 03:36:28 rhsauto038 kernel: <EOI>  [<ffffffff8128e6c2>] ? memset+0x32/0xc0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8144c2d1>] ? sk_prot_alloc+0x161/0x1c0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8144c36c>] ? sk_alloc+0x3c/0x140
May 13 03:36:28 rhsauto038 kernel: [<ffffffff814c709e>] ? inet_create+0x19e/0x390
May 13 03:36:28 rhsauto038 kernel: [<ffffffff811a6bf2>] ? new_inode+0x72/0xb0
May 13 03:36:28 rhsauto038 kernel: [<ffffffff81447347>] ? __sock_create+0x117/0x240
May 13 03:36:28 rhsauto038 kernel: [<ffffffff814474d0>] ? sock_create+0x30/0x40
May 13 03:36:28 rhsauto038 kernel: [<ffffffff814476ec>] ? sys_socket+0x3c/0x80
May 13 03:36:28 rhsauto038 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
May 13 03:36:28 rhsauto038 kernel: ---[ end trace d5cf8559acde74ca ]---
May 13 03:36:28 rhsauto038 kernel: eth0: Transmit timeout, status d 3b 0 80ff

--- Additional comment from Joseph Elwin Fernandes on 2014-06-02 06:33:25 EDT ---

1) Was not able to reproduce with glusterfs-3.6.0-1.0.el6rhs.x86_64.
2) Was not able to reproduce with glusterfs-3.6.0-4.0.el6rhs.x86_64.
3) By looking at the logs and the backtrace:
   1) There is a failure while starting the snap volume brick. Could not pinpoint the cause of the failure.
   2) Since starting the snap volume brick failed, snapshot creation (glusterd_do_snap_vol() to be precise) tries to clear up the snap volume as part of its failure exit path.
   3) While clearing the snap_vol, it tries to clean up the auth attribute of snap_vol. The auth attribute points to garbage at this point, because glusterd_volinfo_dup (called earlier in glusterd_do_snap_vol()) did not initialize the auth variables in the copy. Hence, when glusterd_auth_cleanup is called, glusterd crashes with the backtrace above.
   4) The argument here is that even if the snap volume brick fails to start, or any other failure occurs in the creation path, the cleanup should be graceful, without glusterd crashing.
   5) Will provide a fix for the proper initialization of the auth attribute and proper cleanup.
REVIEW: http://review.gluster.org/7954 ([glusterd/snaphot]: Fix for setting auth attribute NULL if not opted in copy) posted (#1) for review on master by Joseph Fernandes (josferna)
*** This bug has been marked as a duplicate of bug 1091926 ***