Description of problem:
*************************************
While running the automated test suite for Samba, which covers cases like creating dirs and files on the mount point, nested dirs/files, renames, changing the graph from the server, etc., and running the whole suite multiple times, there is an instance of an smbd OOM kill, after which all tests start failing:

******************************************************
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.526463, 0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
Jan 18 17:14:23 rhsauto003 smbd[13023]: testvol: Initialized volume from server localhost
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.532072, 0] ../source3/smbd/service.c:798(make_connection_snum)
Jan 18 17:14:23 rhsauto003 smbd[13023]: canonicalize_connect_path failed for service gluster-testvol, path /
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.543085, 0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
Jan 18 17:14:23 rhsauto003 smbd[13023]: testvol: Initialized volume from server localhost
Jan 18 17:14:23 rhsauto003 smbd[13023]: [2016/01/18 17:14:23.543541, 0] ../source3/smbd/service.c:798(make_connection_snum)
Jan 18 17:14:23 rhsauto003 smbd[13023]: canonicalize_connect_path failed for service gluster-testvol, path /
Jan 18 17:14:54 rhsauto003 rpc.statd[15253]: Version 1.3.0 starting
Jan 18 17:14:54 rhsauto003 sm-notify[15254]: Version 1.3.0 starting
********************************************************************
Jan 18 17:15:22 rhsauto003 kernel: smbd invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
Jan 18 17:15:22 rhsauto003 kernel: smbd cpuset=/ mems_allowed=0
Jan 18 17:15:22 rhsauto003 kernel: CPU: 2 PID: 15309 Comm: smbd Not tainted 3.10.0-327.el7.x86_64 #1
Jan 18 17:15:22 rhsauto003 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Jan 18 17:15:22 rhsauto003 kernel: ffff880210a9dc00 00000000fc407cb3 ffff88021386ba70 ffffffff816351f1
Jan 18 17:15:22 rhsauto003 kernel: ffff88021386bb00 ffffffff81630191 ffff8801dd773440 ffff8801dd773458
Jan 18 17:15:22 rhsauto003 kernel: ffffffff00000202 fffeefff00000000 000000000000000f ffffffff81128803
Jan 18 17:15:22 rhsauto003 kernel: Call Trace:
Jan 18 17:15:22 rhsauto003 kernel: [<ffffffff816351f1>] dump_stack+0x19/0x1b
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81630191>] dump_header+0x8e/0x214
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81128803>] ? delayacct_end+0x63/0xb0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8116d616>] out_of_memory+0x4b6/0x4f0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff811737f5>] __alloc_pages_nodemask+0xa95/0xb90
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff811b78ca>] alloc_pages_vma+0x9a/0x140
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81192deb>] __do_fault+0x33b/0x510
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8119df35>] ? mmap_region+0x1c5/0x620
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81197088>] handle_mm_fault+0x5b8/0xf50
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8119e695>] ? do_mmap_pgoff+0x305/0x3c0
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81640e22>] __do_page_fault+0x152/0x420
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff81641113>] do_page_fault+0x23/0x80
Jan 18 17:15:23 rhsauto003 kernel: [<ffffffff8163d408>] page_fault+0x28/0x30
**********************************************************************

Samba-client (gluster) logs:
*******************************************
[2016-01-18 12:20:19.836159] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 2-testvol-client-3: remote operation failed [Transport endpoint is not connected]
[2016-01-18 12:20:19.836195] I [MSGID: 114018] [client.c:2042:client_rpc_notify] 2-testvol-client-3: disconnected from testvol-client-3. Client process will keep trying to connect to glusterd until brick's port is available
[2016-01-18 12:20:19.836288] E [rpc-clnt.c:362:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x196)[0x7fd8c221fa66] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fd8c26ea9ce] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fd8c26eaade] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x9c)[0x7fd8c26ec49c] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x88)[0x7fd8c26ecca8] ))))) 2-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-01-18 12:20:19.834165 (xid=0x10)
[2016-01-18 12:20:19.836324] E [MSGID: 114031] [client-rpc-fops.c:1676:client3_3_finodelk_cbk] 2-testvol-client-2: remote operation failed [Transport endpoint is not connected]
*****************************************************

smbd client logs:
[2016/01/18 16:43:12.722805, 0] ../source3/smbd/service.c:798(make_connection_snum)
  canonicalize_connect_path failed for service gluster-testvol, path /
[2016/01/18 16:43:12.728402, 0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
  testvol: Initialized volume from server localhost
[2016/01/18 16:43:12.728839, 0] ../source3/smbd/service.c:798(make_connection_snum)
  canonicalize_connect_path failed for service gluster-testvol, path /
[2016/01/18 16:43:40.662142, 0] ../source3/modules/vfs_glusterfs.c:257(vfs_gluster_connect)
  testvol: Initialized volume from server localhost
******************************************************

Version-Release number of selected component (if applicable):
**************************************************
samba-4.2.4-12.el7rhgs.x86_64
glusterfs-3.7.5-16.el7rhgs.x86_64

How reproducible:
Hit once; trying to reproduce again.

Steps to Reproduce:
1. Start the test suite, which runs mkdir, dd if=/dev/zero of=file1 bs=1M count=1024, file creation, ls, rm -rf, and renames from the CIFS mount, plus server-side commands (smb server status, etc.) in a loop of 25.
2. Check the results, logs, and any crashes.

Actual results:
There is an OOM kill of the smbd process. All tests fail after that.

Expected results:
The OOM kill should not happen, and I/O on the mount point should not fail.

Additional info:
There is a glusterd crash as well on the same setup; another BZ has been filed for it:
https://bugzilla.redhat.com/show_bug.cgi?id=1298524
Sosreports and other details will be updated soon.
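The reproducer loop from "Steps to Reproduce" can be sketched roughly as below. This is a minimal sketch, not the actual test suite: the mount path /mnt/cifs is an assumption, and RUN defaults to echo so the loop only prints the commands until RUN is cleared on a real CIFS mount of the share.

```shell
#!/bin/sh
# Sketch of the reproducer loop. MOUNT is an assumed CIFS mount of the
# gluster-testvol share; with RUN=echo (the default) the commands are
# printed instead of executed, so the loop can be reviewed safely.
MOUNT=${MOUNT:-/mnt/cifs}
RUN=${RUN:-echo}

i=1
while [ "$i" -le 25 ]; do
    dir=$MOUNT/run$i
    $RUN mkdir -p "$dir/nested"                          # dirs and nested dirs
    $RUN dd if=/dev/zero of="$dir/file1" bs=1M count=1024 # large sequential write
    $RUN mv "$dir/file1" "$dir/file1.renamed"            # rename from the mount
    $RUN ls -lR "$dir"                                   # directory listings
    $RUN smbstatus                                       # server-side status check
    $RUN rm -rf "$dir"                                   # cleanup
    i=$((i + 1))
done
```

Set RUN= (empty) and point MOUNT at the real CIFS mount to actually drive I/O; the smbstatus call would have to run on the server rather than the client.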
This is likely the same bug as https://bugzilla.redhat.com/show_bug.cgi?id=1302901. Not duplicating them yet; just making people aware.
The crash in comment 5 is similar to the BZ mentioned above. The original issue reported here is smbd getting OOM killed and the client hanging while dd is running with graph changes on the server.
The client is going to hang if the server side is OOM killed. Do we have any leads on a reproducer for the OOM condition?
No, the OOM kill was seen only once. However, every time this test runs, where we run dd on the CIFS client and do a graph change (stat-prefetch on/off) on the server, the mount point hangs and the following errors are seen on the CIFS client:

Jan 13 16:55:49 localhost kernel: CIFS VFS: Error -32 sending data on socket to server
Jan 13 17:00:16 localhost kernel: CIFS VFS: Server 10.70.47.179 has not responded in 120 seconds. Reconnecting...

Logs have been uploaded as mentioned in comment 1.
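For reference, the server-side graph-change half of this repro can be sketched as below. The volume name testvol is taken from the logs above; the number of toggles is an assumption, and GLUSTER_CMD defaults to echoing the gluster CLI commands so the sketch can be reviewed without a cluster. While this runs, dd (e.g. dd if=/dev/zero of=<cifs-mount>/bigfile bs=1M count=1024) would be running on the CIFS client.

```shell
#!/bin/sh
# Sketch of the server-side stat-prefetch toggle that triggers client
# graph changes. Each "gluster volume set" regenerates the client
# volfile. GLUSTER_CMD=echo prints the commands instead of running them.
GLUSTER_CMD=${GLUSTER_CMD:-echo gluster}
VOL=testvol

n=0
while [ "$n" -lt 2 ]; do
    $GLUSTER_CMD volume set "$VOL" performance.stat-prefetch off
    sleep 1   # pause between toggles (would be longer in the real test)
    $GLUSTER_CMD volume set "$VOL" performance.stat-prefetch on
    sleep 1
    n=$((n + 1))
done
```

With GLUSTER_CMD unset to the real CLI on the server, each toggle forces the client to switch to a new graph mid-I/O, which is when the hang is observed.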
That was the old state; it is likely fixed by the BZ mentioned in comment 6.