Bug 1299759 - SMB: OOM kill invoked by smbd while running I/O on a CIFS mount and repeating the test cases multiple times.
Status: NEW
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: samba
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: rhs-smb@redhat.com
storage-qa-internal@redhat.com
: ZStream
Depends On:
Blocks:
 
Reported: 2016-01-19 03:59 EST by surabhi
Modified: 2017-03-25 12:26 EDT (History)
6 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description surabhi 2016-01-19 03:59:46 EST
Description of problem:
*************************************

While running the automated test suite for Samba, which covers cases such as creating directories and files on the mount point, nested directories/files, renames, and changing the graph from the server, and repeating the whole suite multiple times, there is an instance of smbd being OOM killed, after which all tests start failing:

******************************************************

Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.526463,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  testvol: Initialized volume from server localhost
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.532072,  0] ../source3/smbd/service.c:798(make_connection_snum)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  canonicalize_connect_path failed for service gluster-testvol, path /
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.543085,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  testvol: Initialized volume from server localhost
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.543541,  0] ../source3/smbd/service.c:798(make_connection_snum)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  canonicalize_connect_path failed for service gluster-testvol, path /
Jan 18 17:14:54 rhsauto003 rpc.statd[15253]: Version 1.3.0 starting
Jan 18 17:14:54 rhsauto003 sm-notify[15254]: Version 1.3.0 starting

********************************************************************
Jan 18 17:15:22 rhsauto003 kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
Jan 18 17:15:22 rhsauto003 kernel: smbd cpuset=/ mems_allowed=0
Jan 18 17:15:22 rhsauto003 kernel: CPU: 2 PID: 15309 Comm: smbd Not tainted 3.10.0-327.el7.x86_64 #1
Jan 18 17:15:22 rhsauto003 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Jan 18 17:15:22 rhsauto003 kernel: ffff880210a9dc00 00000000fc407cb3 ffff88021386ba70 ffffffff816351f1
Jan 18 17:15:22 rhsauto003 kernel: ffff88021386bb00 ffffffff81630191 ffff8801dd773440 ffff8801dd773458
Jan 18 17:15:22 rhsauto003 kernel: ffffffff00000202 fffeefff00000000 000000000000000f ffffffff81128803
Jan 18 17:15:22 rhsauto003 kernel: Call Trace:
Jan 18 17:15:22 rhsauto003 kernel: [<ffffffff816351f1>] dump_stack+0x19/0x1b
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81630191>] dump_header+0x8e/0x214
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81128803>] ? delayacct_end+0x63/0xb0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8116d616>] out_of_memory+0x4b6/0x4f0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff811737f5>] __alloc_pages_nodemask+0xa95/0xb90
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff811b78ca>] alloc_pages_vma+0x9a/0x140
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81192deb>] __do_fault+0x33b/0x510
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8119df35>] ? mmap_region+0x1c5/0x620
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81197088>] handle_mm_fault+0x5b8/0xf50
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8119e695>] ? do_mmap_pgoff+0x305/0x3c0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81640e22>] __do_page_fault+0x152/0x420
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81641113>] do_page_fault+0x23/0x80
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8163d408>] page_fault+0x28/0x30

**********************************************************************

Samba-client logs :
*******************************************

[2016-01-18 12:20:19.836159] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 2-testvol-client-3: remote operation failed [Transport endpoint is not connected]
[2016-01-18 12:20:19.836195] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 2-testvol-client-3: disconnected from testvol-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-18 12:20:19.836288] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fd8c221fa66] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fd8c26ea9ce] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fd8c26eaade] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fd8c26ec49c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7fd8c26ecca8] ))))) 2-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-01-18 12:20:19.834165 (xid=0x10)
[2016-01-18 12:20:19.836324] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 2-testvol-client-2: remote operation failed [Transport endpoint is not connected]

******************************************************

client logs:

[2016/01/18 16:43:12.722805,  0] ../source3/smbd/service.c:798(make_connection_snum)
  canonicalize_connect_path failed for service gluster-testvol, path /
[2016/01/18 16:43:12.728402,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
  testvol: Initialized volume from server localhost
[2016/01/18 16:43:12.728839,  0] ../source3/smbd/service.c:798(make_connection_snum)
  canonicalize_connect_path failed for service gluster-testvol, path /
[2016/01/18 16:43:40.662142,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
  testvol: Initialized volume from server localhost

******************************************************


Version-Release number of selected component (if applicable):
**************************************************
samba-4.2.4-12.el7rhgs.x86_64

glusterfs-3.7.5-16.el7rhgs.x86_64


How reproducible:
Hit once; trying to reproduce again.

Steps to Reproduce:
1. Start the test suite, which runs mkdir, dd if=/dev/zero of=file1 bs=1M count=1024, file creation, ls, rm -rf, and renames from the CIFS mount, plus server-side commands (smb server status etc.), in a loop of 25 iterations.
2. Check the results, the logs, and for any crash.
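For reference, the client-side I/O loop in step 1 can be sketched roughly as below. This is an illustrative sketch, not the actual test suite: the mount point defaults to a scratch directory so the sketch can run anywhere (in the real test it would be the CIFS mount), and the file size and iteration count are scaled down from the bs=1M count=1024 / 25-iteration run described above.

```shell
#!/bin/sh
# Rough sketch of the repro I/O loop. MNT would be the CIFS mount point
# in the real test; here it defaults to a temporary scratch directory.
MNT="${MNT:-$(mktemp -d)}"
i=1
while [ "$i" -le 3 ]; do                      # the report loops 25 times
    mkdir -p "$MNT/dir$i/nested"              # dirs and nested dirs
    dd if=/dev/zero of="$MNT/dir$i/file1" bs=1M count=4 2>/dev/null
    ls "$MNT/dir$i" >/dev/null                # listing
    mv "$MNT/dir$i/file1" "$MNT/dir$i/file1.renamed"   # rename
    rm -rf "$MNT/dir$i"                       # cleanup
    i=$((i + 1))
done
echo "done: $((i - 1)) iterations"
```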


Actual results:
There is an OOM kill of the smbd process. All tests fail after that.

Expected results:
The OOM kill should not happen, and I/O on the mount point should not fail.


Additional info:

There is a glusterd crash as well on the same setup; another BZ has been filed for it: https://bugzilla.redhat.com/show_bug.cgi?id=1298524
Sosreports and other details will be updated soon.
Comment 6 Ira Cooper 2016-01-29 07:19:38 EST
This is likely the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1302901.

Not marking them as duplicates yet; just making people aware.
Comment 7 surabhi 2016-01-29 07:26:08 EST
The crash in comment 5 is similar to the BZ mentioned above.
The original issue reported here is smbd getting OOM killed and the client hanging while dd is running alongside graph changes on the server.
Comment 8 Ira Cooper 2016-01-29 07:52:46 EST
The client is going to hang if the server is OOM killed. Do we have any leads on a reproducer for the OOM condition?
Comment 9 surabhi 2016-01-29 08:00:58 EST
No, the OOM kill was seen only once. However, every time this test runs, where we run dd on the CIFS client and do a graph change (stat-prefetch on/off) on the server, the mount point hangs and the following errors are seen on the CIFS client:

Jan 13 16:55:49 localhost kernel: CIFS VFS: Error -32 sending data on socket to server
Jan 13 17:00:16 localhost kernel: CIFS VFS: Server 10.70.47.179 has not responded in 120 seconds. Reconnecting...
Logs are uploaded as mentioned in C1
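The graph change referred to above can be triggered from the server roughly as below. This is a sketch, not the exact commands from the test run: it assumes the volume name testvol from the report, picks an arbitrary toggle count and sleep interval, and becomes a no-op on hosts where the gluster CLI is not installed.

```shell
#!/bin/sh
# Toggle stat-prefetch to force connected clients to reload their graph
# while dd traffic is in flight on the CIFS mount. Guarded so the sketch
# does nothing on hosts without the gluster CLI.
VOL="${VOL:-testvol}"
toggles=0
if command -v gluster >/dev/null 2>&1; then
    for state in off on; do
        gluster volume set "$VOL" performance.stat-prefetch "$state"
        toggles=$((toggles + 1))
        sleep 2   # give clients time to pick up the new graph
    done
else
    echo "gluster CLI not found; skipping graph change"
fi
```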
