Bug 1299759 - SMB:OOM kill invoked by smbd while running the I/O's on cifs mount and repeating the test cases for multiple times.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: samba
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: rhs-smb@redhat.com
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-19 08:59 UTC by surabhi
Modified: 2018-04-05 10:35 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-05 10:35:25 UTC
Embargoed:



Description surabhi 2016-01-19 08:59:46 UTC
Description of problem:
*************************************

While running the automated Samba test suite (which covers cases such as creating directories and files on the mount point, nested directories/files, renames, and graph changes from the server) and repeating the whole suite multiple times, smbd invoked the OOM killer once and all tests started failing:

******************************************************

Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.526463,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  testvol: Initialized volume from server localhost
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.532072,  0] ../source3/smbd/service.c:798(make_connection_snum)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  canonicalize_connect_path failed for service gluster-testvol, path /
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.543085,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  testvol: Initialized volume from server localhost
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.543541,  0] ../source3/smbd/service.c:798(make_connection_snum)
Jan 18 17:14:23 rhsauto003 smbd[13023]:  canonicalize_connect_path failed for service gluster-testvol, path /
Jan 18 17:14:54 rhsauto003 rpc.statd[15253]: Version 1.3.0 starting
Jan 18 17:14:54 rhsauto003 sm-notify[15254]: Version 1.3.0 starting

********************************************************************
Jan 18 17:15:22 rhsauto003 kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
Jan 18 17:15:22 rhsauto003 kernel: smbd cpuset=/ mems_allowed=0
Jan 18 17:15:22 rhsauto003 kernel: CPU: 2 PID: 15309 Comm: smbd Not tainted 3.10.0-327.el7.x86_64 #1
Jan 18 17:15:22 rhsauto003 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Jan 18 17:15:22 rhsauto003 kernel: ffff880210a9dc00 00000000fc407cb3 ffff88021386ba70 ffffffff816351f1
Jan 18 17:15:22 rhsauto003 kernel: ffff88021386bb00 ffffffff81630191 ffff8801dd773440 ffff8801dd773458
Jan 18 17:15:22 rhsauto003 kernel: ffffffff00000202 fffeefff00000000 000000000000000f ffffffff81128803
Jan 18 17:15:22 rhsauto003 kernel: Call Trace:
Jan 18 17:15:22 rhsauto003 kernel: [<ffffffff816351f1>] dump_stack+0x19/0x1b
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81630191>] dump_header+0x8e/0x214
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81128803>] ? delayacct_end+0x63/0xb0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8116d616>] out_of_memory+0x4b6/0x4f0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff811737f5>] __alloc_pages_nodemask+0xa95/0xb90
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff811b78ca>] alloc_pages_vma+0x9a/0x140
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81192deb>] __do_fault+0x33b/0x510
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8119df35>] ? mmap_region+0x1c5/0x620
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81197088>] handle_mm_fault+0x5b8/0xf50
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8119e695>] ? do_mmap_pgoff+0x305/0x3c0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81640e22>] __do_page_fault+0x152/0x420
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81641113>] do_page_fault+0x23/0x80
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8163d408>] page_fault+0x28/0x30

**********************************************************************
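
A minimal sketch for spotting OOM-killer invocations like the one above in a syslog file. It assumes the kernel line format shown in this report ("<comm> invoked oom-killer: gfp_mask=..., order=..., oom_score_adj=..."); the names OomEvent and find_oom_events are illustrative, not part of any tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern modelled on the kernel line quoted in this report:
#   "... kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0"
OOM_RE = re.compile(
    r"kernel: (?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp_mask>0x[0-9a-fA-F]+), "
    r"order=(?P<order>\d+), "
    r"oom_score_adj=(?P<adj>-?\d+)"
)


class OomEvent(NamedTuple):
    """One 'invoked oom-killer' record (hypothetical helper type)."""
    comm: str           # process that triggered the OOM killer, e.g. "smbd"
    gfp_mask: int       # allocation flags, parsed from hex
    order: int          # page allocation order
    oom_score_adj: int  # OOM score adjustment of the triggering process


def find_oom_events(lines: Iterable[str]) -> Iterator[OomEvent]:
    """Yield an OomEvent for every line that matches OOM_RE; skip the rest."""
    for line in lines:
        m = OOM_RE.search(line)
        if m:
            yield OomEvent(
                comm=m["comm"],
                gfp_mask=int(m["gfp_mask"], 16),
                order=int(m["order"]),
                oom_score_adj=int(m["adj"]),
            )
```

To use it, pass any iterable of log lines, for example an open file handle on the system log (its location, such as /var/log/messages, depends on the host and is an assumption here).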

Samba-client logs :
*******************************************

[2016-01-18 12:20:19.836159] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 2-testvol-client-3: remote operation failed [Transport endpoint is not connected]
[2016-01-18 12:20:19.836195] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 2-testvol-client-3: disconnected from testvol-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-18 12:20:19.836288] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fd8c221fa66] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fd8c26ea9ce] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fd8c26eaade] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fd8c26ec49c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7fd8c26ecca8] ))))) 2-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-01-18 12:20:19.834165 (xid=0x10)
[2016-01-18 12:20:19.836324] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 2-testvol-client-2: remote operation failed [Transport endpoint is not connected]

*****************************************************

client logs :

[2016/01/18 16:43:12.722805,  0] ../source3/smbd/service.c:798(make_connection_snum)
  canonicalize_connect_path failed for service gluster-testvol, path /
[2016/01/18 16:43:12.728402,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
  testvol: Initialized volume from server localhost
[2016/01/18 16:43:12.728839,  0] ../source3/smbd/service.c:798(make_connection_snum)
  canonicalize_connect_path failed for service gluster-testvol, path /
[2016/01/18 16:43:40.662142,  0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
  testvol: Initialized volume from server localhost

******************************************************


Version-Release number of selected component (if applicable):
**************************************************
samba-4.2.4-12.el7rhgs.x86_64

glusterfs-3.7.5-16.el7rhgs.x86_64


How reproducible:
Hit once; trying to reproduce again.

Steps to Reproduce:
1. Start the test suite, which runs mkdir, dd if=/dev/zero of=file1 bs=1M count=1024, file creation, ls, rm -rf, and renames from the cifs mount, plus server-side commands (smb server status etc.), in a loop of 25.
2. Check the results, the logs, and any crashes.
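
The client-side part of the steps above can be sketched roughly as follows. This is not the actual test suite: the directory and file names under the mount point, the use of touch and mv for "create files" and "renames", and the function names are all assumptions made for illustration. Only the dd arguments and the loop count of 25 come from this report, and the server-side commands are left out.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def iteration_commands(mount: PurePosixPath, i: int) -> List[List[str]]:
    """Build the commands for one pass over the cifs mount (illustrative names)."""
    top = mount / f"dir{i}"
    nested = top / "a" / "b"   # nested directories
    big = top / "file1"        # 1 GiB file written by dd, as in the report
    return [
        ["mkdir", "-p", str(nested)],
        ["dd", "if=/dev/zero", f"of={big}", "bs=1M", "count=1024"],
        ["touch", str(nested / "small1"), str(nested / "small2")],
        ["ls", "-lR", str(top)],
        ["mv", str(big), str(top / "file1.renamed")],
        ["rm", "-rf", str(top)],
    ]


def run_suite(mount: PurePosixPath, loops: int = 25) -> None:
    """Run every command of every pass; stop at the first failing command."""
    for i in range(loops):
        for cmd in iteration_commands(mount, i):
            subprocess.run(cmd, check=True)
```

For example, run_suite(PurePosixPath("/mnt/cifs")) would run the 25 passes against a share mounted at /mnt/cifs, where that mount path is a placeholder.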


Actual results:
smbd invoked the OOM killer. All tests failed after that.

Expected results:
The OOM kill should not happen and I/O on the mount point should not fail.


Additional info:

There is a glusterd crash as well on the same setup; another BZ has been updated for it: https://bugzilla.redhat.com/show_bug.cgi?id=1298524
Sosreports and other details will be updated soon.

Comment 6 Ira Cooper 2016-01-29 12:19:38 UTC
This is likely the same bug as: https://bugzilla.redhat.com/show_bug.cgi?id=1302901

Not duping them yet.  Just making people aware.

Comment 7 surabhi 2016-01-29 12:26:08 UTC
The crash in comment 5 is similar to the BZ mentioned above.
The issue originally reported is smbd getting OOM killed and the client hanging while dd is running with graph changes on the server.

Comment 8 Ira Cooper 2016-01-29 12:52:46 UTC
The client is going to get hung if the server is OOM killed.  Do we have any leads on a reproducer on the OOM condition?

Comment 9 surabhi 2016-01-29 13:00:58 UTC
No. The OOM kill was seen only once, but every time this test runs, where we run dd on the cifs client and do a graph change (stat-prefetch on/off), the mount point hangs and the following errors are seen on the cifs client:

Jan 13 16:55:49 localhost kernel: CIFS VFS: Error -32 sending data on socket to server
Jan 13 17:00:16 localhost kernel: CIFS VFS: Server 10.70.47.179 has not responded in 120 seconds. Reconnecting...
Logs are uploaded as mentioned in C1
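
For anyone trying to reproduce the hang, here is a rough sketch of the "dd plus graph change" scenario. It assumes the graph change is done with "gluster volume set <volume> performance.stat-prefetch on|off" on a server node, that the volume is testvol as in the logs, and that the caller supplies a command that starts the dd on the cifs client. The function names, the interval, and the toggle limit are illustrative, not taken from the test suite.

```python
import itertools
import subprocess
import time
from typing import Iterator, List, Optional


def stat_prefetch_toggles(volume: str) -> Iterator[List[str]]:
    """Yield gluster CLI commands that alternate stat-prefetch off, on, off, ..."""
    for value in itertools.cycle(["off", "on"]):
        yield ["gluster", "volume", "set", volume,
               "performance.stat-prefetch", value]


def toggle_while_running(io_cmd: List[str], volume: str,
                         interval: float = 10.0,
                         max_toggles: int = 20) -> Optional[int]:
    """Start io_cmd, then flip stat-prefetch every `interval` seconds.

    Stops when io_cmd exits or after max_toggles flips, so a hung mount
    cannot keep the loop running forever. Returns io_cmd's exit code, or
    None if it was still running (possibly hung) when the loop gave up.
    """
    proc = subprocess.Popen(io_cmd)
    toggles = stat_prefetch_toggles(volume)
    for _ in range(max_toggles):
        if proc.poll() is not None:
            break
        subprocess.run(next(toggles), check=False)
        time.sleep(interval)
    return proc.poll()
```

A return value of None is the interesting case here: it means the I/O command did not finish within the allotted toggles, which is consistent with the hung mount point described above.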

Comment 10 Michael Adam 2018-04-05 10:35:25 UTC
That was the old state. Likely fixed by the BZ mentioned in comment 6.

