Bug 1333360

Summary: Samba: Multiple smbd crashes (notifyd) after a ctdb-internal network interface is brought down in a ctdb cluster.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: surabhi <sbhaloth>
Component: samba
Assignee: Michael Adam <madam>
Status: CLOSED ERRATA
QA Contact: surabhi <sbhaloth>
Severity: urgent
Priority: unspecified
Version: rhgs-3.1
CC: gdeschner, madam, nlevinki, rcyriac, rhinduja
Target Milestone: ---
Keywords: Regression, ZStream
Target Release: RHGS 3.1.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: samba-4.4.3-5.el7rhgs
Doc Type: Bug Fix
Last Closed: 2016-06-23 05:37:03 UTC
Type: Bug
Bug Blocks: 1311817

Description surabhi 2016-05-05 11:34:03 UTC
Description of problem:
*********************************
When a network interface on one of the nodes in the CTDB cluster is brought down and then brought back up, the ctdb node goes to the banned state and there are multiple smbd crashes.

With similar steps another crash was seen, but that was with multi-channel enabled (https://bugzilla.redhat.com/show_bug.cgi?id=1322681).

*********************************************


May  5 11:15:50 dhcp47-10 smbd[27601]:  BACKTRACE: 11 stack frames:
May  5 11:15:50 dhcp47-10 smbd[27601]:   #0 /lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7fe18d923eaa]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #1 /lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7fe18d923f80]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #2 /lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7fe18fe1557f]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #3 /usr/sbin/smbd(+0xc192) [0x7fe1904b3192]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #4 /lib64/libsmbconf.so.0(run_events_poll+0x16c) [0x7fe18d93941c]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #5 /lib64/libsmbconf.so.0(+0x35670) [0x7fe18d939670]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #6 /lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7fe18c36040d]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #7 /lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x7fe18c3605ab]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #8 /usr/sbin/smbd(main+0x1783) [0x7fe1904aec33]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #9 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fe18bfbcb15]
May  5 11:15:50 dhcp47-10 smbd[27601]:   #10 /usr/sbin/smbd(+0x84a1) [0x7fe1904af4a1]
May  5 11:15:50 dhcp47-10 smbd[27601]: [2016/05/05 11:15:50.018799,  0, pid=27601, effective(0, 0), real(0, 0)] ../source3/lib/dumpcore.c:313(dump_core)
May  5 11:15:50 dhcp47-10 smbd[27601]:  unable to change to /var/log/core
May  5 11:15:50 dhcp47-10 smbd[27601]:  refusing to dump core
May  5 11:15:50 dhcp47-10 smbd[27602]: [2016/05/05 11:15:50.035474,  0, pid=27602, effective(0, 0), real(0, 0)] ../source3/lib/util.c:478(reinit_after_fork)
May  5 11:15:50 dhcp47-10 smbd[27602]:  messaging_reinit() failed: NT_STATUS_IO_DEVICE_ERROR
May  5 11:15:50 dhcp47-10 smbd[27602]: [2016/05/05 11:15:50.035627,  0, pid=27602, effective(0, 0), real(0, 0)] ../source3/smbd/server.c:743(smbd_accept_connection)
May  5 11:15:50 dhcp47-10 smbd[27602]:  reinit_after_fork() failed
May  5 11:15:50 dhcp47-10 smbd[27602]: [2016/05/05 11:15:50.035690,  0, pid=27602, effective(0, 0), real(0, 0)] ../source3/lib/util.c:791(smb_panic_s3)
May  5 11:15:50 dhcp47-10 smbd[27602]:  PANIC (pid 27602): reinit_after_fork() failed


Version-Release number of selected component (if applicable):
samba-4.4.2-1.el7rhgs.x86_64

How reproducible:
Always 

Steps to Reproduce:
1. Set up ctdb.
2. Mount the volume using the VIP on one client.
3. Start I/O.
4. Bring down the network interface with "ifdown eth0" on the node whose VIP was used to mount the volume.
5. Observe the IP failover.
6. Bring up the network interface (command sketch below).
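
A rough command sketch of steps 4-6 as run on the ctdb node; the interface name is taken from step 4, and the monitoring commands are an added suggestion rather than output captured from the actual run:

ifdown eth0                           # trigger the IP failover
ctdb status                           # watch the node state and the VIP moving to the peer node
ifup eth0                             # bring the interface back up
ctdb status                           # expected to return to OK; in this bug the node goes to BANNED
grep -i "PANIC" /var/log/messages     # the smbd panics reported above show up here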


Actual results:
*****************
Multiple smbd crashes occur, and the ctdb node goes to the banned state and does not come back to the OK state until ctdb is restarted.

Expected results:
******************
smbd should not crash and the ctdb node should not go to the banned state; even if it does get banned, it should come back to OK after the ban period is over.


Additional info:
Also seeing an issue where the core dump is not written to the correct location.
Will update the setup with all the core file settings, and then upload the sosreports and the core file.
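
A minimal sketch of the kind of core-file settings mentioned above, assuming the directories named in the log messages in this bug; the exact settings applied on this setup are not recorded here:

mkdir -p /var/log/core /var/log/samba/cores/smbd   # directories smbd tried to dump core into (per the logs above)
ulimit -c unlimited                                # remove the core size limit
sysctl kernel.core_pattern                         # confirm cores are not piped elsewhere (e.g. to abrt)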

Comment 2 surabhi 2016-05-06 11:16:17 UTC
Core file is copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1333360/ 

Also, in the first case the samba and glusterfs log level was set to 10; it has now been changed to 3 (this may not be relevant to this issue, but it was done as recommended by development). After following the steps mentioned in this bug, an smbd crash was hit again, but this time only a single crash instead of multiple crashes.

Comment 3 surabhi 2016-05-06 11:17:05 UTC
May  6 10:17:58 dhcp47-10 smbd[23301]:  BACKTRACE: 17 stack frames:
May  6 10:17:58 dhcp47-10 smbd[23301]:   #0 /lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7f8be5c4deaa]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #1 /lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7f8be5c4df80]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #2 /lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7f8be813f57f]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #3 /lib64/libsamba-util.so.0(+0x24796) [0x7f8be813f796]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #4 /lib64/libpthread.so.0(+0xf100) [0x7f8be83a0100]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #5 /usr/lib64/samba/libdbwrap-samba4.so(dbwrap_traverse_read+0x7) [0x7f8be236d237]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #6 /usr/lib64/samba/libsmbd-base-samba4.so(+0x83bf0) [0x7f8be7c3fbf0]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #7 /lib64/libtalloc.so.2(_talloc_free+0x440) [0x7f8be4898e80]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #8 /usr/lib64/samba/libsmbd-base-samba4.so(+0x84cb8) [0x7f8be7c40cb8]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #9 /lib64/libtevent.so.0(tevent_common_loop_timer_delay+0xcf) [0x7f8be468eb4f]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #10 /lib64/libsmbconf.so.0(run_events_poll+0x1c9) [0x7f8be5c63479]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #11 /lib64/libsmbconf.so.0(+0x35670) [0x7f8be5c63670]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #12 /lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7f8be468a40d]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #13 /lib64/libtevent.so.0(tevent_req_poll+0x1f) [0x7f8be468b6df]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #14 /usr/sbin/smbd(main+0xa53) [0x7f8be87d7f03]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #15 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f8be42e6b15]
May  6 10:17:58 dhcp47-10 smbd[23301]:   #16 /usr/sbin/smbd(+0x84a1) [0x7f8be87d94a1]
May  6 10:17:58 dhcp47-10 smbd[23301]: [2016/05/06 10:17:58.961271,  0, pid=23301, effective(0, 0), real(0, 0)] ../source3/lib/dumpcore.c:318(dump_core)
May  6 10:17:58 dhcp47-10 smbd[23301]:  dumping core in /var/log/samba/cores/smbd

Comment 5 surabhi 2016-05-19 13:37:48 UTC
The configuration is as follows:

1. There are two CTDB nodes with the following network config:
Node1: eth0, eth1, eth2, eth3
Node2: eth0, eth1

eth0 on both nodes: on the public network.
eth1, eth2, eth3 are NICs on a private network (with static IPs configured) used for internal communication between the nodes.

The steps to reproduce:

1. Create a distributed-replicate volume and mount it on a Windows client (which also has a NIC configured on the private network) using the VIP of node1 corresponding to eth1.
2. Start copying a large file from a local Windows share to the samba share.
3. Bring down the interface eth1 with the command "ifdown eth1".
4. Observe the IP failover.
5. Once the failover has happened, bring up eth1 with the command "ifup eth1".
6. Observe the ctdb status and check /var/log/messages and log.smbd for any cores (check sketch below).
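
A rough sketch of the checks in step 6; the paths are taken from the log excerpts in this bug, except the log.smbd path, which is assumed:

ctdb status                                   # BANNED here indicates the problem described below
grep -i "PANIC" /var/log/messages             # smbd panic messages
grep -i "panic" /var/log/samba/log.smbd       # path assumed; the bug text only names "log.smbd"
ls -l /var/log/samba/cores/smbd               # core files, if smbd managed to dump one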

Result:
The ctdb node goes to the banned state and there is an smbd crash with the following backtrace:

(gdb) bt
#0  0x00007f2c471765f7 in raise () from /lib64/libc.so.6
#1  0x00007f2c47177ce8 in abort () from /lib64/libc.so.6
#2  0x00007f2c48ad6beb in dump_core () at ../source3/lib/dumpcore.c:322
#3  0x00007f2c48ac9fe7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:814
#4  0x00007f2c4afbb57f in smb_panic (why=why@entry=0x7f2c4b00254a "internal error") at ../lib/util/fault.c:166
#5  0x00007f2c4afbb796 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:83
#6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
#7  <signal handler called>
#8  dbwrap_traverse_read (db=0x0, f=f@entry=0x7f2c4aabe210 <notifyd_db_del_syswatches>, private_data=private_data@entry=0x0, count=count@entry=0x0)
    at ../lib/dbwrap/dbwrap.c:361
#9  0x00007f2c4aabbc40 in notifyd_peer_destructor (p=p@entry=0x7f2c4c9a8e60) at ../source3/smbd/notifyd/notifyd.c:1249
#10 0x00007f2c47714e80 in _talloc_free_internal (location=<optimized out>, ptr=<optimized out>) at ../talloc.c:1046
#11 _talloc_free (ptr=0x7f2c4c9a8e60, location=0x7f2c4ac73ac0 "../source3/smbd/notifyd/notifyd.c:1154") at ../talloc.c:1647
#12 0x00007f2c4aabcd08 in notifyd_clean_peers_next (subreq=<optimized out>) at ../source3/smbd/notifyd/notifyd.c:1154
#13 0x00007f2c4750ab4f in tevent_common_loop_timer_delay (ev=ev@entry=0x7f2c4c998df0) at ../tevent_timed.c:341
#14 0x00007f2c48adf3f9 in run_events_poll (ev=0x7f2c4c998df0, pollrtn=0, pfds=0x7f2c4c9a7f50, num_pfds=4) at ../source3/lib/events.c:199
#15 0x00007f2c48adf5f0 in s3_event_loop_once (ev=0x7f2c4c998df0, location=<optimized out>) at ../source3/lib/events.c:326
#16 0x00007f2c4750640d in _tevent_loop_once (ev=ev@entry=0x7f2c4c998df0, location=location@entry=0x7f2c4750c5c5 "../tevent_req.c:256") at ../tevent.c:533
#17 0x00007f2c475076df in tevent_req_poll (req=req@entry=0x7f2c4c9a5440, ev=ev@entry=0x7f2c4c998df0) at ../tevent_req.c:256
#18 0x00007f2c4b653f03 in smbd_notifyd_init (interactive=false, msg=0x7f2c4c998ee0) at ../source3/smbd/server.c:411
#19 main (argc=<optimized out>, argv=<optimized out>) at ../source3/smbd/server.c:1597

Uploaded the sosreports to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1333360/

The nodes file has static IP entries for eth1 on both nodes:
cat /etc/ctdb/nodes
192.168.XXX.X
192.168.XXX.X

The public_addresses file has the VIP entries for eth1:
192.168.XXX.XX/24 eth1
192.168.XXX.XX/24 eth1
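
For illustration only, the expected format of these two files, with purely hypothetical addresses (the real addresses are masked above):

# /etc/ctdb/nodes -- one private (eth1) address per node, one per line (hypothetical values)
192.168.170.21
192.168.170.22

# /etc/ctdb/public_addresses -- VIP/prefix and the interface it floats on (hypothetical values)
192.168.170.201/24 eth1
192.168.170.202/24 eth1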


Let me know if any other information is needed.

Comment 12 surabhi 2016-05-27 10:05:03 UTC
Verified the BZ with the following steps:

1. There are two CTDB nodes with the following network config:
Node1: eth0, eth1, eth2, eth3
Node2: eth0, eth1

eth0 on both nodes: on the public network.
eth1, eth2, eth3 are NICs on a private network (with static IPs configured) used for internal communication between the nodes.

The steps to reproduce:

1. Create a distributed-replicate volume and mount it on a Windows client (which also has a NIC configured on the private network) using the VIP of node1 corresponding to eth1.
2. Start copying a large file from a local Windows share to the samba share.
3. Bring down the interface eth1 with the command "ifdown eth1".
4. Observe the IP failover.
5. Once the failover has happened, bring up eth1 with the command "ifup eth1".
6. Observe ctdb status and check /var/log/messages and log.smbd for any cores.

There are no crashes seen.

The ctdb node remains in the banned state, which is being tracked in another BZ. Marking this BZ as verified.

Comment 14 errata-xmlrpc 2016-06-23 05:37:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1245