Description of problem:
*********************************
When the network interface on one of the nodes in a CTDB cluster is brought down and then brought back up, the ctdb node goes into the banned state and there are multiple smbd crashes. With similar steps another crash was seen, but that was with multi-channel enabled (https://bugzilla.redhat.com/show_bug.cgi?id=1322681).
*********************************************
May 5 11:15:50 dhcp47-10 smbd[27601]: BACKTRACE: 11 stack frames:
May 5 11:15:50 dhcp47-10 smbd[27601]: #0 /lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7fe18d923eaa]
May 5 11:15:50 dhcp47-10 smbd[27601]: #1 /lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7fe18d923f80]
May 5 11:15:50 dhcp47-10 smbd[27601]: #2 /lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7fe18fe1557f]
May 5 11:15:50 dhcp47-10 smbd[27601]: #3 /usr/sbin/smbd(+0xc192) [0x7fe1904b3192]
May 5 11:15:50 dhcp47-10 smbd[27601]: #4 /lib64/libsmbconf.so.0(run_events_poll+0x16c) [0x7fe18d93941c]
May 5 11:15:50 dhcp47-10 smbd[27601]: #5 /lib64/libsmbconf.so.0(+0x35670) [0x7fe18d939670]
May 5 11:15:50 dhcp47-10 smbd[27601]: #6 /lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7fe18c36040d]
May 5 11:15:50 dhcp47-10 smbd[27601]: #7 /lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x7fe18c3605ab]
May 5 11:15:50 dhcp47-10 smbd[27601]: #8 /usr/sbin/smbd(main+0x1783) [0x7fe1904aec33]
May 5 11:15:50 dhcp47-10 smbd[27601]: #9 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fe18bfbcb15]
May 5 11:15:50 dhcp47-10 smbd[27601]: #10 /usr/sbin/smbd(+0x84a1) [0x7fe1904af4a1]
May 5 11:15:50 dhcp47-10 smbd[27601]: [2016/05/05 11:15:50.018799, 0, pid=27601, effective(0, 0), real(0, 0)] ../source3/lib/dumpcore.c:313(dump_core)
May 5 11:15:50 dhcp47-10 smbd[27601]: unable to change to /var/log/core
May 5 11:15:50 dhcp47-10 smbd[27601]: refusing to dump core
May 5 11:15:50 dhcp47-10 smbd[27602]: [2016/05/05 11:15:50.035474, 0, pid=27602, effective(0, 0), real(0, 0)] ../source3/lib/util.c:478(reinit_after_fork)
May 5 11:15:50 dhcp47-10 smbd[27602]: messaging_reinit() failed: NT_STATUS_IO_DEVICE_ERROR
May 5 11:15:50 dhcp47-10 smbd[27602]: [2016/05/05 11:15:50.035627, 0, pid=27602, effective(0, 0), real(0, 0)] ../source3/smbd/server.c:743(smbd_accept_connection)
May 5 11:15:50 dhcp47-10 smbd[27602]: reinit_after_fork() failed
May 5 11:15:50 dhcp47-10 smbd[27602]: [2016/05/05 11:15:50.035690, 0, pid=27602, effective(0, 0), real(0, 0)] ../source3/lib/util.c:791(smb_panic_s3)
May 5 11:15:50 dhcp47-10 smbd[27602]: PANIC (pid 27602): reinit_after_fork() failed

Version-Release number of selected component (if applicable):
samba-4.4.2-1.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up ctdb
2. Mount the volume using a VIP on one client
3. Start I/O
4. Bring down the network interface using ifdown eth0 on the node whose VIP is used to mount the volume
5. Observe the IP failover
6. Bring up the network interface

Actual results:
*****************
Multiple smbd crashes occur, and the ctdb node goes to the banned state and doesn't come back to the OK state until ctdb is restarted.

Expected results:
******************
smbd should not crash and the ctdb node should not go to the banned state. Even if it does go to the banned state, it should come back to OK once the banning period is over.

Additional info:
Also seeing an issue with the core dump not being written to the correct location. Will update the setup with all the core file settings and then upload the sosreports and core file.
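The "unable to change to /var/log/core" and "refusing to dump core" messages above suggest the core directory could not be entered at crash time. As a rough pre-check before re-running the reproducer, something like the following sketch could be used. The helper name core_dir_problem is hypothetical, and the check only approximates what smbd does: it runs as whoever invokes the script, not as the smbd process.

```python
import os
import tempfile


def core_dir_problem(path):
    """Return a short reason why `path` cannot hold core files, or None if it looks usable."""
    if not os.path.isdir(path):
        return "missing or not a directory"
    # Entering the directory needs execute permission; creating a file in it needs write.
    if not os.access(path, os.W_OK | os.X_OK):
        return "not writable by this user"
    return None


if __name__ == "__main__":
    # Demo on a throwaway directory, not on a real Samba core path.
    with tempfile.TemporaryDirectory() as scratch:
        print(core_dir_problem(scratch))
        print(core_dir_problem(os.path.join(scratch, "absent")))
```

On the affected node the same function could be pointed at the directory named in the log (here /var/log/core), run as root.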
The core file has been copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1333360/
In the first run the samba and glusterfs log levels were set to 10. I changed them to 3 (this may not matter for this issue, but the developers recommended it) and repeated the steps described in this bug. smbd crashed again, but this time there was only one crash rather than several:
May 6 10:17:58 dhcp47-10 smbd[23301]: BACKTRACE: 17 stack frames:
May 6 10:17:58 dhcp47-10 smbd[23301]: #0 /lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7f8be5c4deaa]
May 6 10:17:58 dhcp47-10 smbd[23301]: #1 /lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7f8be5c4df80]
May 6 10:17:58 dhcp47-10 smbd[23301]: #2 /lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7f8be813f57f]
May 6 10:17:58 dhcp47-10 smbd[23301]: #3 /lib64/libsamba-util.so.0(+0x24796) [0x7f8be813f796]
May 6 10:17:58 dhcp47-10 smbd[23301]: #4 /lib64/libpthread.so.0(+0xf100) [0x7f8be83a0100]
May 6 10:17:58 dhcp47-10 smbd[23301]: #5 /usr/lib64/samba/libdbwrap-samba4.so(dbwrap_traverse_read+0x7) [0x7f8be236d237]
May 6 10:17:58 dhcp47-10 smbd[23301]: #6 /usr/lib64/samba/libsmbd-base-samba4.so(+0x83bf0) [0x7f8be7c3fbf0]
May 6 10:17:58 dhcp47-10 smbd[23301]: #7 /lib64/libtalloc.so.2(_talloc_free+0x440) [0x7f8be4898e80]
May 6 10:17:58 dhcp47-10 smbd[23301]: #8 /usr/lib64/samba/libsmbd-base-samba4.so(+0x84cb8) [0x7f8be7c40cb8]
May 6 10:17:58 dhcp47-10 smbd[23301]: #9 /lib64/libtevent.so.0(tevent_common_loop_timer_delay+0xcf) [0x7f8be468eb4f]
May 6 10:17:58 dhcp47-10 smbd[23301]: #10 /lib64/libsmbconf.so.0(run_events_poll+0x1c9) [0x7f8be5c63479]
May 6 10:17:58 dhcp47-10 smbd[23301]: #11 /lib64/libsmbconf.so.0(+0x35670) [0x7f8be5c63670]
May 6 10:17:58 dhcp47-10 smbd[23301]: #12 /lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7f8be468a40d]
May 6 10:17:58 dhcp47-10 smbd[23301]: #13 /lib64/libtevent.so.0(tevent_req_poll+0x1f) [0x7f8be468b6df]
May 6 10:17:58 dhcp47-10 smbd[23301]: #14 /usr/sbin/smbd(main+0xa53) [0x7f8be87d7f03]
May 6 10:17:58 dhcp47-10 smbd[23301]: #15 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f8be42e6b15]
May 6 10:17:58 dhcp47-10 smbd[23301]: #16 /usr/sbin/smbd(+0x84a1) [0x7f8be87d94a1]
May 6 10:17:58 dhcp47-10 smbd[23301]: [2016/05/06 10:17:58.961271, 0, pid=23301, effective(0, 0), real(0, 0)] ../source3/lib/dumpcore.c:318(dump_core)
May 6 10:17:58 dhcp47-10 smbd[23301]: dumping core in /var/log/samba/cores/smbd
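When comparing several of these syslog backtraces, it can help to reduce them to the list of named symbols. The following is a minimal sketch that assumes the frame layout seen in the log above ("#N module(symbol+offset) [address]"); the names FRAME_RE, parse_frames and named_symbols are made up for this illustration and are not part of Samba.

```python
import re

# One frame of the smbd BACKTRACE output as it appears in syslog, for example:
#   #5 /usr/lib64/samba/libdbwrap-samba4.so(dbwrap_traverse_read+0x7) [0x7f8be236d237]
# The symbol part is empty when the library has no symbol for that address.
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+"
    r"(?P<module>\S+?)"
    r"\((?P<symbol>[^()+]*)\+(?P<offset>0x[0-9a-fA-F]+)\)\s+"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Extract every stack frame found in log_text, in order of appearance."""
    frames = []
    for m in FRAME_RE.finditer(log_text):
        frames.append({
            "index": int(m.group("index")),
            "module": m.group("module"),
            "symbol": m.group("symbol") or None,  # None for unresolved frames
            "offset": int(m.group("offset"), 16),
            "address": m.group("address"),
        })
    return frames


def named_symbols(frames):
    """Return only the resolved symbol names, top of stack first."""
    return [f["symbol"] for f in frames if f["symbol"] is not None]
```

Because the pattern is applied with finditer over the whole text, it works both on one-frame-per-line logs and on logs that were pasted as a single long line.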
The configuration is as follows:
1. There are two CTDB nodes with the following network configuration:
Node1 : eth0, eth1, eth2, eth3
Node2 : eth0, eth1
eth0 on both nodes is on the public network.
The eth1, eth2 and eth3 NICs are on a private network (with static IPs configured) used for internal communication between the nodes.

The steps to reproduce:
1. Create a distributed-replicate (dis-rep) volume and mount it on a Windows client (which also has a NIC configured on the private network) using the VIP (corresponding to eth1) of node1.
2. Start copying a large file from a local Windows share to the samba share.
3. Bring down the interface eth1 with the command "ifdown eth1".
4. Observe the IP failover.
5. Once the failover has happened, bring up eth1 with the command "ifup eth1".
6. Observe ctdb status and check /var/log/messages and log.smbd for any cores.

Result:
The ctdb node goes to the banned state and there is an smbd crash with the following BT:

(gdb) bt
#0 0x00007f2c471765f7 in raise () from /lib64/libc.so.6
#1 0x00007f2c47177ce8 in abort () from /lib64/libc.so.6
#2 0x00007f2c48ad6beb in dump_core () at ../source3/lib/dumpcore.c:322
#3 0x00007f2c48ac9fe7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:814
#4 0x00007f2c4afbb57f in smb_panic (why=why@entry=0x7f2c4b00254a "internal error") at ../lib/util/fault.c:166
#5 0x00007f2c4afbb796 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:83
#6 sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
#7 <signal handler called>
#8 dbwrap_traverse_read (db=0x0, f=f@entry=0x7f2c4aabe210 <notifyd_db_del_syswatches>, private_data=private_data@entry=0x0, count=count@entry=0x0) at ../lib/dbwrap/dbwrap.c:361
#9 0x00007f2c4aabbc40 in notifyd_peer_destructor (p=p@entry=0x7f2c4c9a8e60) at ../source3/smbd/notifyd/notifyd.c:1249
#10 0x00007f2c47714e80 in _talloc_free_internal (location=<optimized out>, ptr=<optimized out>) at ../talloc.c:1046
#11 _talloc_free (ptr=0x7f2c4c9a8e60, location=0x7f2c4ac73ac0 "../source3/smbd/notifyd/notifyd.c:1154") at ../talloc.c:1647
#12 0x00007f2c4aabcd08 in notifyd_clean_peers_next (subreq=<optimized out>) at ../source3/smbd/notifyd/notifyd.c:1154
#13 0x00007f2c4750ab4f in tevent_common_loop_timer_delay (ev=ev@entry=0x7f2c4c998df0) at ../tevent_timed.c:341
#14 0x00007f2c48adf3f9 in run_events_poll (ev=0x7f2c4c998df0, pollrtn=0, pfds=0x7f2c4c9a7f50, num_pfds=4) at ../source3/lib/events.c:199
#15 0x00007f2c48adf5f0 in s3_event_loop_once (ev=0x7f2c4c998df0, location=<optimized out>) at ../source3/lib/events.c:326
#16 0x00007f2c4750640d in _tevent_loop_once (ev=ev@entry=0x7f2c4c998df0, location=location@entry=0x7f2c4750c5c5 "../tevent_req.c:256") at ../tevent.c:533
#17 0x00007f2c475076df in tevent_req_poll (req=req@entry=0x7f2c4c9a5440, ev=ev@entry=0x7f2c4c998df0) at ../tevent_req.c:256
#18 0x00007f2c4b653f03 in smbd_notifyd_init (interactive=false, msg=0x7f2c4c998ee0) at ../source3/smbd/server.c:411
#19 main (argc=<optimized out>, argv=<optimized out>) at ../source3/smbd/server.c:1597

The sosreports are uploaded at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1333360/

The nodes file has static IP entries for eth1 on both nodes.
cat /etc/ctdb/nodes
192.168.XXX.X
192.168.XXX.X

The public_addresses file has the VIP entries for eth1.
192.168.XXX.XX/24 eth1
192.168.XXX.XX/24 eth1

Let me know if any other information is needed.
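Step 6 above ("Observe ctdb status") can be partly automated by scanning the text printed by ctdb status for nodes flagged BANNED. The sketch below assumes node lines of the form "pnn:<n> <address> <FLAG|FLAG...>" with an optional "(THIS NODE)" suffix; that layout is an assumption about the ctdb version in use and should be checked against real output. The function names and the sample addresses are invented for this illustration.

```python
import re

# Assumed shape of a node line in `ctdb status` output, for example:
#   pnn:0 192.0.2.1 BANNED|INACTIVE (THIS NODE)
NODE_RE = re.compile(
    r"^pnn:(?P<pnn>\d+)\s+(?P<address>\S+)\s+(?P<flags>\S+)(?P<this>\s+\(THIS NODE\))?\s*$"
)


def parse_ctdb_status(text):
    """Return one dict per node line; all other lines are ignored."""
    nodes = []
    for line in text.splitlines():
        m = NODE_RE.match(line.strip())
        if m:
            nodes.append({
                "pnn": int(m.group("pnn")),
                "address": m.group("address"),
                "flags": m.group("flags").split("|"),
                "this_node": m.group("this") is not None,
            })
    return nodes


def banned_nodes(nodes):
    """Return the pnn of every node whose flags include BANNED."""
    return [n["pnn"] for n in nodes if "BANNED" in n["flags"]]


if __name__ == "__main__":
    # Illustrative text only, not captured from the cluster in this report.
    sample = (
        "Number of nodes:2\n"
        "pnn:0 192.0.2.1 BANNED|INACTIVE (THIS NODE)\n"
        "pnn:1 192.0.2.2 OK\n"
    )
    print(banned_nodes(parse_ctdb_status(sample)))
```

In practice the text would come from running ctdb status on a cluster node after the ifup in step 5, repeated until the banning period has passed.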
Verified the BZ with the following steps:
1. There are two CTDB nodes with the following network configuration:
Node1 : eth0, eth1, eth2, eth3
Node2 : eth0, eth1
eth0 on both nodes is on the public network.
The eth1, eth2 and eth3 NICs are on a private network (with static IPs configured) used for internal communication between the nodes.

The steps to reproduce:
1. Create a distributed-replicate (dis-rep) volume and mount it on a Windows client (which also has a NIC configured on the private network) using the VIP (corresponding to eth1) of node1.
2. Start copying a large file from a local Windows share to the samba share.
3. Bring down the interface eth1 with the command "ifdown eth1".
4. Observe the IP failover.
5. Once the failover has happened, bring up eth1 with the command "ifup eth1".
6. Observe ctdb status and check /var/log/messages and log.smbd for any cores.

No crashes are seen. The ctdb node remains in the banned state, which is being discussed in another BZ. Marking this BZ as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1245