Description of problem:
On an already established 4-node Gluster cluster with a Samba/CTDB setup, mount the Samba share in a Windows client using the public IP and start creating a huge number of zero KB files (say 10000). On the server side, stop ctdb one by one on 3 of the 4 nodes (command: ctdb stop). After the IP failover, restarting ctdb one by one on those 3 nodes produces a huge number of smbd cores (around 600) in /var/log/samba/cores/smbd.

Version-Release number of selected component (if applicable):
samba-client-4.4.6-2.el7rhgs.x86_64
glusterfs-3.8.4-6.el7rhgs.x86_64
Windows 10 client

How reproducible:
Always

Steps to Reproduce:
1. Start from an available four-node CTDB/Samba setup.
2. Mount the Samba share in a Windows client using the VIP.
3. Start a script that creates around 10000 zero KB files.
4. While the script is in progress, run "ctdb stop" on 3 of the 4 nodes, one by one: stop ctdb on one node, wait for the IP failover, then move on to the next node (a sketch of this sequence is given after the backtrace below).
5. Check that the I/O is still running.
6. Restart ctdb (service ctdb restart) on those 3 nodes one by one, waiting for each node's state to return to OK.
7. Check for cores in /var/log/samba/cores/smbd.

Actual results:
A huge number of cores is generated.

Expected results:
No cores should be generated.

Additional info:
Backtrace from one of the cores:

#0  0x00007fd1274951d7 in raise () from /lib64/libc.so.6
#1  0x00007fd1274968c8 in abort () from /lib64/libc.so.6
#2  0x00007fd128df5b9b in dump_core () at ../source3/lib/dumpcore.c:322
#3  0x00007fd128de8f97 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:814
#4  0x00007fd12b2db57f in smb_panic (why=why@entry=0x7fd12b977fa0 "reinit_after_fork() failed") at ../lib/util/fault.c:166
#5  0x00007fd12b97737c in smbd_accept_connection (ev=0x7fd12d19ad10, fde=<optimized out>, flags=<optimized out>, private_data=<optimized out>) at ../source3/smbd/server.c:759
#6  0x00007fd128dfe34c in run_events_poll (ev=0x7fd12d19ad10, pollrtn=<optimized out>, pfds=0x7fd12d1c7a90, num_pfds=7) at ../source3/lib/events.c:257
#7  0x00007fd128dfe5a0 in s3_event_loop_once (ev=0x7fd12d19ad10, location=<optimized out>) at ../source3/lib/events.c:326
#8  0x00007fd12782540d in _tevent_loop_once () from /lib64/libtevent.so.0
#9  0x00007fd1278255ab in tevent_common_loop_wait () from /lib64/libtevent.so.0
#10 0x00007fd12b972ad4 in smbd_parent_loop (parent=<optimized out>, ev_ctx=0x7fd12d19ad10) at ../source3/smbd/server.c:1127
#11 main (argc=<optimized out>, argv=<optimized out>) at ../source3/smbd/server.c:1780
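For illustration, a minimal sketch of the server-side sequence used in steps 4 and 6 (the exact wait logic and the optional "ctdb ip" check are assumptions, not taken verbatim from the test run):

    # On each of the 3 nodes, one at a time:
    ctdb stop                 # put this node into the STOPPED state; its public IPs fail over
    ctdb status               # repeat until this node shows STOPPED and the cluster has recovered
    ctdb ip                   # optionally confirm which node now hosts the moved public IPs

    # Later, again one node at a time:
    service ctdb restart      # restart ctdb on the node
    ctdb status               # repeat until all nodes report OK before touching the next node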
Some findings:

/var/log/messages repeatedly displayed the following logs + backtrace:

Dec 2 10:02:18 dhcp47-12 smbd[31742]: [2016/12/02 10:02:18.084385, 0] ../source3/lib/util.c:478(reinit_after_fork)
Dec 2 10:02:18 dhcp47-12 smbd[31742]:   messaging_reinit() failed: NT_STATUS_IO_DEVICE_ERROR
Dec 2 10:02:18 dhcp47-12 smbd[31742]: [2016/12/02 10:02:18.084532, 0] ../source3/smbd/server.c:758(smbd_accept_connection)
Dec 2 10:02:18 dhcp47-12 smbd[31742]:   reinit_after_fork() failed
Dec 2 10:02:18 dhcp47-12 smbd[31742]: [2016/12/02 10:02:18.084680, 0] ../source3/lib/util.c:791(smb_panic_s3)
Dec 2 10:02:18 dhcp47-12 smbd[31742]:   PANIC (pid 31742): reinit_after_fork() failed
Dec 2 10:02:18 dhcp47-12 smbd[31742]: [2016/12/02 10:02:18.085999, 0] ../source3/lib/util.c:902(log_stack_trace)
Dec 2 10:02:18 dhcp47-12 smbd[31742]:   BACKTRACE: 11 stack frames:
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #0 /lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7fd128de8e5a]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #1 /lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7fd128de8f30]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #2 /lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7fd12b2db57f]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #3 /usr/sbin/smbd(+0xc37c) [0x7fd12b97737c]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #4 /lib64/libsmbconf.so.0(run_events_poll+0x16c) [0x7fd128dfe34c]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #5 /lib64/libsmbconf.so.0(+0x355a0) [0x7fd128dfe5a0]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #6 /lib64/libtevent.so.0(_tevent_loop_once+0x8d) [0x7fd12782540d]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #7 /lib64/libtevent.so.0(tevent_common_loop_wait+0x1b) [0x7fd1278255ab]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #8 /usr/sbin/smbd(main+0x15d4) [0x7fd12b972ad4]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #9 /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fd127481b35]
Dec 2 10:02:18 dhcp47-12 smbd[31742]:    #10 /usr/sbin/smbd(+0x7ea9) [0x7fd12b972ea9]
Dec 2 10:02:18 dhcp47-12 smbd[31742]: [2016/12/02 10:02:18.088052, 0] ../source3/lib/dumpcore.c:303(dump_core)
Dec 2 10:02:18 dhcp47-12 smbd[31742]:   dumping core in /var/log/samba/cores/smbd
Dec 2 10:02:18 dhcp47-12 smbd[31742]:
Dec 2 10:02:18 dhcp47-12 abrt-hook-ccpp: Process 31742 (smbd) of user 0 killed by SIGABRT - dumping core

Even so, the Samba logs (at least those provided at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1400957) do not contain any information about these crashes.

NT_STATUS_IO_DEVICE_ERROR is mapped from the Unix errno EIO by map_nt_error_from_unix() within messaging_reinit() [1]. If that is the case, then with log level set to 5 in smb.conf we should have seen at least one of the following log entries:

  messaging_dgm_ref failed: Input/output error      <-- debug level 2
      OR
  messaging_ctdbd_init failed: Input/output error   <-- debug level 1

But these messages are also missing from the Samba logs. Needs more investigation...

[1] https://github.com/samba-team/samba/blob/samba-4.4.6/source3/lib/messages.c#L393
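For reference, a quick way to check whether either of those messages made it into any of the collected Samba logs (the log file paths are an assumption and depend on the "log file" setting in smb.conf):

    # search the smbd logs, including rotated *.old files, for the two candidate debug messages
    grep -E 'messaging_(dgm_ref|ctdbd_init) failed' /var/log/samba/log.* 2>/dev/null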
Hi Vivek,

Gunther and I looked over the issue today. As mentioned in my previous comment, we couldn't find enough log entries in what has been provided so far to pinpoint the exact reason for the crash. This is because "max log size" is set to 50, which renames the log file to log.smbd.old once it exceeds 50 KB, and this keeps repeating. Although we hope/suspect that some fixes already present in Samba upstream master resolve this issue (found by going through the code path), we can only confirm that with better logs, which would let us provide an RCA for the crash.

So, can you please try reproducing the crash after making the following changes in smb.conf?

log level = 10
max log size = 0
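A sketch of how the change could be verified and picked up on each node (the smbcontrol step is a suggestion; restarting smbd/ctdb per node works just as well):

    testparm -sv 2>/dev/null | grep -Ei 'log level|max log size'   # confirm the parsed values
    smbcontrol smbd reload-config                                  # ask running smbd processes to re-read smb.conf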
The misbehavior was triggered by a connection to an internal (non-public) interface. This was in fact a crashed/half-unmounted cifs mount from node #1 to node #0. The mount was no longer visible, but cifs.ko still tried to connect to the node-internal address of node #1 periodically. When ctdb is stopped (ctdb stop), each such SMB connection to an internal address triggers a fork, and reinit_after_fork() then panics while trying to connect to ctdb (because ctdb rejects the connection). This is by current design.

Getting rid of the cifs mount (by rebooting) solved the problem for us.
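For anyone hitting the same symptom, a sketch of how such a leftover mount/connection might be spotted (the mount point is hypothetical; in our case only a reboot fully cleared it):

    grep cifs /proc/mounts                 # look for cifs mounts pointing at a cluster-internal address
    ss -tnp | grep ':445'                  # look for SMB connections to node-internal addresses
    umount -f -l /mnt/internal-share       # hypothetical mount point; may not help for a half-dead mount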