Bug 1480199 - Samba crashed in internal background task execution while running IO
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: samba
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assigned To: Anoop C S
QA Contact: Vivek Das
Depends On:
Blocks:
Reported: 2017-08-10 07:49 EDT by Rochelle
Modified: 2017-08-16 08:40 EDT
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Rochelle 2017-08-10 07:49:41 EDT
Description of problem:
=======================
While testing geo-replication with Samba over a CIFS mount, using the Crefi tool for I/O, the checksum between master and slave did not match and Samba crashed, producing cores.


Version-Release number of selected component (if applicable):
==============================================================

samba-4.6.3-5.el7rhgs.x86_64

glusterfs-geo-replication-3.8.4-38.el7rhgs.x86_64




How reproducible:
=================

1/1


Steps to Reproduce:
====================
1. Set up a Samba-CTDB cluster.
2. Create a master volume (6x2) and a slave volume (2x2).
3. Mount the master volume via CIFS and run I/O using Crefi with the create fop (1500 files).
4. After the I/O completes, check the arequal-checksum of the master and slave volumes.
5. Check for any cores.

Actual results:
===============

The checksums of master and slave did not match, and Samba crashed, creating cores.


Expected results:
=================
The checksums of master and slave should match.

There should be no crash.

Additional info:
================

(gdb) bt
#0  0x00007f1d26d4a1f7 in raise () from /lib64/libc.so.6
#1  0x00007f1d26d4b8e8 in abort () from /lib64/libc.so.6
#2  0x00007f1d286d04de in dump_core () at ../source3/lib/dumpcore.c:338
#3  0x00007f1d286c16e7 in smb_panic_s3 (why=<optimized out>)
    at ../source3/lib/util.c:814
#4  0x00007f1d2a79c95f in smb_panic (
    why=why@entry=0x7f1d2a7e482a "internal error")
    at ../lib/util/fault.c:166
#5  0x00007f1d2a79cb76 in fault_report (sig=<optimized out>)
    at ../lib/util/fault.c:83
#6  sig_fault (sig=<optimized out>) at ../lib/util/fault.c:94
#7  <signal handler called>
#8  messaging_ctdbd_reinit (msg_ctx=msg_ctx@entry=0x56508d0e3800, 
    mem_ctx=mem_ctx@entry=0x56508d0e3800, backend=0x0)
    at ../source3/lib/messages_ctdbd.c:278
#9  0x00007f1d286ccd40 in messaging_reinit (
    msg_ctx=msg_ctx@entry=0x56508d0e3800)
    at ../source3/lib/messages.c:415
#10 0x00007f1d286c0ec9 in reinit_after_fork (msg_ctx=0x56508d0e3800, 
    ev_ctx=<optimized out>, 
    parent_longlived=parent_longlived@entry=true, 
    comment=comment@entry=0x0) at ../source3/lib/util.c:475
#11 0x00007f1d286dbafa in background_job_waited (subreq=0x56508d0f9900)
    at ../source3/lib/background.c:179
#12 0x00007f1d270e1c97 in tevent_common_loop_timer_delay ()
   from /lib64/libtevent.so.0
#13 0x00007f1d270e2f49 in epoll_event_loop_once ()
   from /lib64/libtevent.so.0
#14 0x00007f1d270e12a7 in std_event_loop_once ()
   from /lib64/libtevent.so.0
#15 0x00007f1d270dd0cd in _tevent_loop_once () from /lib64/libtevent.so.0
#16 0x00007f1d270dd2fb in tevent_common_loop_wait ()
   from /lib64/libtevent.so.0
#17 0x00007f1d270e1247 in std_event_loop_wait ()
   from /lib64/libtevent.so.0
#18 0x000056508bddfa95 in smbd_parent_loop (parent=<optimized out>, 
    ev_ctx=0x56508d0e2d10) at ../source3/smbd/server.c:1384
#19 main (argc=<optimized out>, argv=<optimized out>)
    at ../source3/smbd/server.c:2038
(gdb)
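
Note that in frame #8, messaging_ctdbd_reinit() is entered with backend=0x0. Purely as an illustration of that failure mode (hypothetical names, not the Samba source), dereferencing a NULL argument like that is enough to raise the kind of fault signal that the handler in frames #5/#6 then converts into the smb_panic("internal error") seen above:

#include <stddef.h>

/* Hypothetical stand-in for a messaging backend object. */
struct backend {
        int (*reinit)(void *ctx);
};

static int do_reinit(struct backend *be, void *ctx)
{
        /* If the caller hands in be == NULL (compare backend=0x0 in
         * frame #8), this dereference faults and the process panics. */
        return be->reinit(ctx);
}

int main(void)
{
        return do_reinit(NULL, NULL);   /* crashes, analogous to the backtrace */
}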
Comment 7 Anoop C S 2017-08-16 08:40:28 EDT
Adding some initial thoughts:

When smbd is brought up, it registers a background task, to be run every 15 minutes, that cleans up orphaned sockets. This is done with a tevent timer: the epoll event loop computes the delay until the next timed event (this background timer) and waits for it; once the timer expires the task is run, and the same background task is then re-registered for its next run. The crash is seen when the background task is launched, after the timer expires, by forking from the main smbd. As long as the cluster is in HEALTHY state I don't expect the remote endpoint, i.e. the ctdb messaging end, to be NULL, which is what leads to this crash. We need the ctdb logs (which we do not have right now) from exactly the time of the crash to find out whether the nodes were in HEALTHY state or not.
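
For illustration, here is a minimal sketch of that pattern using the public tevent API. It is not the smbd code; the names (background_task_handler, INTERVAL_SECS) are made up here, and the real handler additionally forks and goes through reinit_after_fork() -> messaging_reinit() -> messaging_ctdbd_reinit(), which is where the NULL ctdb backend is hit (frame #8).

#include <stdio.h>
#include <talloc.h>
#include <tevent.h>

#define INTERVAL_SECS (15 * 60)   /* the cleanup task runs every 15 minutes */

static void background_task_handler(struct tevent_context *ev,
                                    struct tevent_timer *te,
                                    struct timeval now,
                                    void *private_data)
{
        (void)te; (void)now;

        /* In smbd this is where the child is forked and messaging is
         * re-initialised before the actual cleanup job runs. */
        printf("background cleanup task fired\n");

        /* tevent timers are one-shot, so the task re-registers itself
         * for the next run. */
        tevent_add_timer(ev, ev,
                         tevent_timeval_current_ofs(INTERVAL_SECS, 0),
                         background_task_handler, private_data);
}

int main(void)
{
        TALLOC_CTX *mem_ctx = talloc_new(NULL);
        struct tevent_context *ev = tevent_context_init(mem_ctx);

        /* Register the first run of the periodic background task. */
        tevent_add_timer(ev, ev,
                         tevent_timeval_current_ofs(INTERVAL_SECS, 0),
                         background_task_handler, NULL);

        /* The (epoll-backed) event loop sleeps until the timer expires. */
        tevent_loop_wait(ev);

        talloc_free(mem_ctx);
        return 0;
}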

Nonetheless, I will continue to look for the root cause after further discussions upstream. For now, I am changing the summary, as this issue has nothing to do specifically with a geo-replication setup.
