Description of problem:
=======================
While testing geo-replication with Samba over a CIFS mount, with the crefi tool (for IO), the checksums between master and slave did not match and smbd crashed, producing cores.

Version-Release number of selected component (if applicable):
==============================================================
samba-4.6.3-5.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-38.el7rhgs.x86_64

How reproducible:
=================
1/1

Steps to Reproduce:
====================
1. Over a samba-ctdb cluster
2. Create a master volume (6x2) and a slave volume (2x2)
3. Mount the master volume via CIFS and run IO using crefi with fop-create (1500 files)
4. After the IO completes, compare the arequal-checksum of master and slave
5. Check for any cores

Actual results:
===============
Checksum mismatch; smbd crashed, creating cores.

Expected results:
=================
Checksums of master and slave should match.
There should be no crash.

Additional info:
================
(gdb) bt
#0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
#1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
#2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
#3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>) at ../source3/lib/util.c:814
#4  0x00007f1d2a79c95f in smb_panic (why=why@entry=0x7f1d2a7e482a "internal error") at ../lib/util/fault.c:166
#5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>) at ../lib/util/fault.c:83
#6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
#7  <signal handler called>
#8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800, mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0) at ../source3/lib/messages_ctdbd.c:278
#9  0x00007f1d286ccd40 in messaging_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800) at ../source3/lib/messages.c:415
#10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800, ev_ctx=<optimized out>, parent_longlived=parent_longlived@entry=true, comment=comment@entry=0x0) at ../source3/lib/util.c:475
#11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0f9900) at ../source3/lib/background.c:179
#12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay () from /lib64/libtevent.so.0
#13 0x00007f1d270e2f49 in epoll_event_loop_once () from /lib64/libtevent.so.0
#14 0x00007f1d270e12a7 in std_event_loop_once () from /lib64/libtevent.so.0
#15 0x00007f1d270dd0cd in _tevent_loop_once () from /lib64/libtevent.so.0
#16 0x00007f1d270dd2fb in tevent_common_loop_wait () from /lib64/libtevent.so.0
#17 0x00007f1d270e1247 in std_event_loop_wait () from /lib64/libtevent.so.0
#18 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>, ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
#19 main (argc=<optimized out>, argv=<optimized out>) at ../source3/smbd/server.c:2038
(gdb)
Adding some initial thoughts:

When smbd is brought up, it registers a background task to run every 15 minutes to clean up orphaned sockets. This is done using a tevent timer. The epoll event loop calculates the delay to the next timed event, i.e. the expiration of this background timer, and waits for it; the task runs as soon as the event loop times out, after which the same background task is registered for the next run.

The crash is seen when the background task is set to run, after the timer expiration, by forking from the main smbd. As long as the cluster is in a HEALTHY state, I would not expect the remote endpoint, i.e. the ctdb messaging end, to be NULL, which is what leads to this crash. We need the ctdb logs (which we do not have right now) from exactly the time of the crash to find out whether the nodes were in a HEALTHY state or not. Nonetheless, I will continue working toward a root cause after further discussion upstream.

As of now, changing the summary, as this issue has nothing to do specifically with a geo-replication setup.