Bug 1480199

Summary: Samba crashed in internal background task execution while running IO
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rochelle <rallan>
Component: samba
Assignee: Anoop C S <anoopcs>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Vivek Das <vdas>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: amukherj, ksandha, rhinduja, rhs-smb
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-09 17:02:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Rochelle 2017-08-10 11:49:41 UTC
Description of problem:
=======================
While testing geo-replication with Samba over a CIFS mount, running IO with the crefi tool, the checksum between master and slave did not match and Samba crashed, producing cores.


Version-Release number of selected component (if applicable):
==============================================================

samba-4.6.3-5.el7rhgs.x86_64

glusterfs-geo-replication-3.8.4-38.el7rhgs.x86_64




How reproducible:
=================

1/1


Steps to Reproduce:
====================
1. Set up a Samba-CTDB cluster.
2. Create a master volume (6x2) and a slave volume (2x2).
3. Mount the master volume via CIFS and run IO using crefi with the create fop (1500 files).
4. After the IO completes, compare the arequal checksum of master and slave.
5. Check for any cores.

Actual results:
===============

The checksums of master and slave did not match, and Samba crashed, creating cores.


Expected results:
=================
The checksums of master and slave should match.

There should be no crash.

Additional info:
================

(gdb) bt
#0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
#1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
#2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
#3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>)
    at ../source3/lib/util.c:814
#4  0x00007f1d2a79c95f in smb_panic (
    why=why@entry=0x7f1d2a7e482a "internal error")
    at ../lib/util/fault.c:166
#5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>)
    at ../lib/util/fault.c:83
#6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
#7  <signal handler called>
#8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800, 
    mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
    at ../source3/lib/messages_ctdbd.c:278
#9  0x00007f1d286ccd40 in messaging_reinit (
    msg_ctx=msg_ctx@entry=0x56508d0e3800)
    at ../source3/lib/messages.c:415
#10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800, 
    ev_ctx=<optimized out>, 
    parent_longlived=parent_longlived@entry=true, 
    comment=comment@entry=0x0) at ../source3/lib/util.c:475
#11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0f9900)
    at ../source3/lib/background.c:179
#12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay ()
   from /lib64/libtevent.so.0
#13 0x00007f1d270e2f49 in epoll_event_loop_once ()
   from /lib64/libtevent.so.0
#14 0x00007f1d270e12a7 in std_event_loop_once ()
   from /lib64/libtevent.so.0
#15 0x00007f1d270dd0cd in _tevent_loop_once () from /lib64/libtevent.so.0
#16 0x00007f1d270dd2fb in tevent_common_loop_wait ()
   from /lib64/libtevent.so.0
#17 0x00007f1d270e1247 in std_event_loop_wait ()
   from /lib64/libtevent.so.0
#18 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>, 
    ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
#19 main (argc=<optimized out>, argv=<optimized out>)
    at ../source3/smbd/server.c:2038
(gdb)
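
The decisive detail in the trace is frame #8: messaging_ctdbd_reinit() is entered with backend=0x0, so the failure is a NULL-pointer dereference in the post-fork reinit path rather than anything in the IO itself; the signal handler (frames #5-#6) then calls smb_panic(), which dumps the core. Purely as an illustration of that failure mode (hypothetical type and function names, not Samba's actual source):

/*
 * Illustrative sketch only: a reinit path that assumes the stored
 * messaging backend pointer is still valid.  With backend == NULL,
 * as in frame #8 above, the first dereference faults and the
 * daemon's fault handler panics and dumps core.
 */
#include <stddef.h>

struct msg_backend {
        int (*reconnect)(struct msg_backend *be);
};

static int messaging_backend_reinit(struct msg_backend *backend)
{
        /* No NULL check: this is the class of access that blows up when
         * the ctdb messaging endpoint was never (re)established. */
        return backend->reconnect(backend);
}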

Comment 7 Anoop C S 2017-08-16 12:40:28 UTC
Adding some initial thoughts:

When smbd is brought up, it registers a background task, driven by a tevent timer, that runs every 15 minutes to clean up orphaned sockets. The epoll event loop calculates the delay until the next timed event, i.e. the expiry of this background timer, and sleeps until then; when the timer expires the task is run and the same background task is registered again for the next run. The crash is seen when, after the timer expires, the background task is launched by forking from the main smbd. As long as the cluster is in the HEALTHY state I would not expect the remote endpoint, i.e. the ctdb messaging end, to be NULL, which is what leads to this crash. We need the ctdb logs (which we do not have right now) from exactly the time of the crash to find out whether the nodes were in the HEALTHY state or not.
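
To make that flow concrete, here is a minimal sketch of the pattern (hypothetical names such as cleanup_orphan_sockets and INTERVAL_SECS; this is not Samba's actual code): a tevent timer drives a periodic background task, and the handler re-arms the timer for the next run. In smbd the handler additionally forks, and the child's reinit_after_fork() -> messaging_reinit() -> messaging_ctdbd_reinit() chain is where the NULL backend is hit.

#include <talloc.h>
#include <tevent.h>

#define INTERVAL_SECS (15 * 60)     /* background task period: 15 minutes */

static void cleanup_orphan_sockets(struct tevent_context *ev,
                                   struct tevent_timer *te,
                                   struct timeval now,
                                   void *private_data)
{
        /* Do the periodic work here; in smbd this is the point where a
         * child is forked and reinit_after_fork() runs in the child. */

        /* Re-register the same task for the next interval. */
        tevent_add_timer(ev, ev,
                         tevent_timeval_current_ofs(INTERVAL_SECS, 0),
                         cleanup_orphan_sockets, private_data);
}

int main(void)
{
        struct tevent_context *ev = tevent_context_init(NULL);
        if (ev == NULL) {
                return 1;
        }

        /* Initial registration when the daemon starts up. */
        tevent_add_timer(ev, ev,
                         tevent_timeval_current_ofs(INTERVAL_SECS, 0),
                         cleanup_orphan_sockets, NULL);

        /* The epoll-backed event loop sleeps until the next timed event expires. */
        return tevent_loop_wait(ev);
}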

Nonetheless, I will continue looking for a root cause after further discussion upstream. For now, I am changing the summary, since this issue has nothing to do specifically with a geo-replication setup.