Description of problem:
***************************************
While running I/O (using dd) from multiple cifs mounts on the same client, cancelling dd with Ctrl-C does not stop it, and ls from another mount point hangs for a while. The dd process stops only when it is killed manually. The same test with an xfs-backed samba share behaves correctly: dd stops as soon as we cancel it.

When dd is hung on all the mounts, the kernel cifs client shows the following errors:

Jan 13 16:49:20 localhost systemd-logind: New session 1096 of user root.
Jan 13 16:55:49 localhost kernel: CIFS VFS: Error -32 sending data on socket to server
Jan 13 17:00:16 localhost kernel: CIFS VFS: Server 10.70.47.179 has not responded in 120 seconds. Reconnecting...
Jan 13 21:30:20 localhost kernel: INFO: task kworker/1:2:441 blocked for more than 120 seconds.
Jan 13 21:30:20 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 21:30:20 localhost kernel: kworker/1:2 D ffff880036b77d60 0 441 2 0x00000000
Jan 13 21:30:20 localhost kernel: Workqueue: cifsiod cifs_oplock_break [cifs]
Jan 13 21:30:20 localhost kernel: ffff880036b77cf0 0000000000000046 ffff8800369c8b80 ffff880036b77fd8
Jan 13 21:30:20 localhost kernel: ffff880036b77fd8 ffff880036b77fd8 ffff8800369c8b80 ffff880036b77d70
Jan 13 21:30:20 localhost kernel: ffff88021ff44200 0000000000000002 ffffffffa04b2ca0 ffff880036b77d60
Jan 13 21:30:20 localhost kernel: Call Trace:
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b2ca0>] ? cifs_push_posix_locks+0x320/0x320 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff8163a909>] schedule+0x29/0x70
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b2cae>] cifs_pending_writers_wait+0xe/0x20 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff81638780>] __wait_on_bit+0x60/0x90
Jan 13 21:30:20 localhost kernel: [<ffffffff810c1986>] ? dequeue_entity+0x106/0x510
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b2ca0>] ? cifs_push_posix_locks+0x320/0x320 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff81638837>] out_of_line_wait_on_bit+0x87/0xb0
Jan 13 21:30:20 localhost kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b3496>] cifs_oplock_break+0x66/0x2e0 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
Jan 13 21:30:20 localhost kernel: [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
Jan 13 21:30:20 localhost kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
Jan 13 21:30:20 localhost kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 13 21:30:20 localhost kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 13 21:30:20 localhost kernel: [<ffffffff81645858>] ret_from_fork+0x58/0x90
Jan 13 21:30:20 localhost kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 13 21:30:20 localhost kernel: INFO: task rm:10770 blocked for more than 120 seconds.
Jan 13 21:30:20 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 21:30:20 localhost kernel: rm D ffff880211110148 0 10770 10560 0x00000084
Jan 13 21:30:20 localhost kernel: ffff880211e27db0 0000000000000082 ffff880210aa7300 ffff880211e27fd8
Jan 13 21:30:20 localhost kernel: ffff880211e27fd8 ffff880211e27fd8 ffff880210aa7300 ffff880211110140
Jan 13 21:30:20 localhost kernel: ffff880211110144 ffff880210aa7300 00000000ffffffff ffff880211110148
Jan 13 21:30:20 localhost kernel: Call Trace:
Jan 13 21:30:20 localhost kernel: [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
Jan 13 21:30:20 localhost kernel: [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
Jan 13 21:30:20 localhost kernel: [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
Jan 13 21:30:20 localhost kernel: [<ffffffff811eae8e>] vfs_unlink+0x4e/0x150
Jan 13 21:30:20 localhost kernel: [<ffffffff811ef42e>] do_unlinkat+0x26e/0x2b0
Jan 13 21:30:20 localhost kernel: [<ffffffff81641113>] ? do_page_fault+0x23/0x80
Jan 13 21:30:20 localhost kernel: [<ffffffff811f033b>] SyS_unlinkat+0x1b/0x40
Jan 13 21:30:20 localhost kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1b

Version-Release number of selected component (if applicable):
***********************************
samba-4.2.4-12.el7rhgs.x86_64
glusterfs-3.7.5-16.el7rhgs.x86_64
cifs-utils-6.2-7.el7.x86_64

uname -a
Linux dhcp46-56.lab.eng.blr.redhat.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
2 out of 4 attempts

Steps to Reproduce:
1. Create a distributed-replicate volume and make multiple cifs mounts of it on the same client.
2. Start dd from each mount: dd if=/dev/zero of=file7 bs=1M count=1024
3. Press Ctrl-C to cancel dd on one of the mount points; it does not exit.
4. Run ls from another mount point. Sometimes ls hangs and recovers after a while; in some instances the mount point hangs because the cifs client itself is hung. The dd process keeps running until killed explicitly. With an xfs-backed samba share, Ctrl-C immediately kills dd on all mount points.

Actual results:
The I/O does not stop when cancelled and the process keeps running until killed explicitly. There are also CIFS VFS errors in dmesg on the cifs client.

Expected results:
The dd command should exit when cancelled and there should be no hung cifs client processes.

Additional info:
Will try the same test case with RHEL6 and update the result soon.
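The reproduction steps above can be sketched as a dry-run script. The server IP is taken from the kernel log above; the share name (gv0) and the /mnt/cifsN mount points are placeholder assumptions, and the script only prints the commands instead of executing them:

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps. The share name (gv0) and
# the mount-point prefix are hypothetical placeholders; the server IP
# is the one from the kernel log. Commands are printed, not executed.
SERVER=10.70.47.179
SHARE=gv0
DD_CMD="dd if=/dev/zero of=file7 bs=1M count=1024"

# Step 1: multiple cifs mounts of the same volume on one client
for i in 1 2 3; do
    echo "mount -t cifs //$SERVER/$SHARE /mnt/cifs$i -o user=root"
done

# Step 2: start dd on every mount point
for i in 1 2 3; do
    echo "(cd /mnt/cifs$i && $DD_CMD) &"
done

# Steps 3-4: Ctrl-C one dd, then probe another mount with ls
echo "# press Ctrl-C in one dd, then run: ls /mnt/cifs2"
```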
Continuously seeing the following errors from the Linux cifs client when running I/O from a cifs mount using vers=3:

Feb 8 23:24:02 dhcp46-56 kernel: CIFS VFS: SMB response too long (262224 bytes)
Feb 8 23:24:03 dhcp46-56 kernel: CIFS VFS: SMB response too long (262224 bytes)
Feb 8 23:24:03 dhcp46-56 kernel: CIFS VFS: Send error in read = -11
Feb 8 23:24:04 dhcp46-56 kernel: CIFS VFS: SMB response too long (524368 bytes)
Feb 8 23:24:04 dhcp46-56 kernel: CIFS VFS: Send error in read = -11
Feb 8 23:24:06 dhcp46-56 kernel: CIFS VFS: SMB response too long (524368 bytes)
Feb 8 23:24:06 dhcp46-56 kernel: CIFS VFS: Send error in read = -11
Feb 8 23:24:08 dhcp46-56 kernel: CIFS VFS: SMB response too long (1048656 bytes)
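A side note on the logged lengths: each "SMB response too long" size differs from a power-of-two payload by the same 80 bytes, which is consistent with a fixed SMB2 response header (64-byte header plus 16-byte READ response structure) in front of a 256 KiB, 512 KiB, or 1 MiB read. This is an observation about the numbers, not a confirmed diagnosis:

```shell
# Subtract a fixed 80-byte header from each logged length; what remains
# is an exact power-of-two payload (256 KiB, 512 KiB, 1 MiB).
for sz in 262224 524368 1048656; do
    payload=$((sz - 80))
    echo "$sz bytes = $payload-byte payload + 80-byte header"
done
```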
This is a generic error; alas, it has nothing to do with RHGS.

"There is a known issue in the RHEL SMB client with SMB vers=2, 2.1 or 3. They will not work properly with RHGS."

Your thoughts, Michael?
The original bug description reads:

> Tried the same test with xfs-samba share, dd command stops
> as we try to cancel it.

So is this really a generic bug?

Also, there is no configuration or exact description of the setup. E.g. the only reference to aio is in the subject (added afterwards). The problem is not qualified well enough to be a known issue that I could propose a doc text for...

- I assume it is happening with vfs_glusterfs and aio enabled.
- Does it happen with vfs_glusterfs but without aio?
- Does it happen with a gluster fuse mount and aio enabled/disabled?
- Does it happen with xfs and aio enabled?
- What is the real constellation of the cifs mounts? I.e. how many are running? Is 2 enough? Does a single cifs mount not have the problem?
- ...

It could be that this is a bug specifically triggered by the vfs_glusterfs aio and a problem in the cifs mount. So more data please! :-)
1. aio has been added to the subject because aio is enabled by default for 3.1.2 (just wanted to bring to notice that aio is enabled when the issue is seen).
2. Yes, it happens when aio is enabled, we make multiple cifs mounts, and run dd on all the mount points: dd does not exit and the mount point hangs.
3. If we disable aio, dd exits on all mount points when cancelled.
4. vfs_glusterfs without aio does not see the hang.
5. With a single mount the issue is not seen; it appears only with more than one cifs mount and aio enabled.

Following is the data:

without aio, glusterfs-samba share:
dd exits on cancelling (from a single mount as well as multiple mounts); no hang
*****************************
without aio, xfs-samba share:
dd exits on cancelling (from a single mount as well as multiple mounts); no hang
******************************
with aio, xfs-samba share:
dd exits on cancelling (from a single mount as well as multiple cifs mounts)
*******************************
with aio, glusterfs-samba share:
dd does not exit on cancelling (always, with more than one cifs mount); hang
******************************

The cifs client shows:

[1018560.875127] INFO: task dd:30545 blocked for more than 120 seconds.
[1018560.877405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1018560.879689] dd D ffff8800c3bee9f8 0 30545 30352 0x00000080
[1018560.879697] ffff8801ec80fc70 0000000000000086 ffff88021233c500 ffff8801ec80ffd8
[1018560.879703] ffff8801ec80ffd8 ffff8801ec80ffd8 ffff88021233c500 ffff8800c3bee9f0
[1018560.879707] ffff8800c3bee9f4 ffff88021233c500 00000000ffffffff ffff8800c3bee9f8
[1018560.879712] Call Trace:
[1018560.879727] [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[1018560.879736] [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[1018560.879741] [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[1018560.879747] [<ffffffff811eb9af>] do_last+0x28f/0x1270
[1018560.879754] [<ffffffff811c11ce>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
[1018560.879759] [<ffffffff811ee672>] path_openat+0xc2/0x490
[1018560.879765] [<ffffffff811efe3b>] do_filp_open+0x4b/0xb0
[1018560.879771] [<ffffffff811fc9c7>] ? __alloc_fd+0xa7/0x130
[1018560.879778] [<ffffffff811dd7e3>] do_sys_open+0xf3/0x1f0
[1018560.879784] [<ffffffff811dd8fe>] SyS_open+0x1e/0x20
[1018560.879791] [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[1018726.314764] SELinux: initialized (dev cifs, type cifs), uses genfs_contexts
*****************************************************************

The error mentioned in #C6 happens when we mount with vers=3 and aio enabled; another BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1305657, has been raised for it. Let me know if you need more data.
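Since the matrix above only shows the hang with aio enabled, disabling Samba's asynchronous I/O on the affected shares is the obvious interim workaround. In smb.conf that would look like the following sketch; the share name is a placeholder, and `aio read size` / `aio write size` are the standard Samba parameters (0 disables async I/O):

```ini
[gluster-share]
    ; Setting both thresholds to 0 disables asynchronous reads and
    ; writes for this share; [gluster-share] is a placeholder name.
    aio read size = 0
    aio write size = 0
```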
Ira, could you provide your RCA w.r.t. #C4? It will help everyone.
The issue appears to be that the Linux CIFS client can't handle async operations over SMB2+. This really hurts us: for Windows, the async ops make a big speed difference, but they are incompatible with the Linux client. For now I'm going to recommend the use of SMB1 for Linux, given that we expect Windows to be the dominant use case for CIFS/SMB.

"Due to a bug within the Linux CIFS client, we do not support the use of SMB2+ with RHGS." is my suggestion for the known-issue text, for now. I will work on seeing if we can countermeasure it before the release.
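For the recommendation above, the SMB dialect is pinned on the Linux client with the vers= mount option. The server, share, and mount point below are placeholders, and the command is printed rather than executed:

```shell
# Build the mount command that pins the Linux cifs client to SMB1
# (vers=1.0); //server/share and /mnt/cifs are placeholders.
OPTS="user=root,vers=1.0"
echo "mount -t cifs //server/share /mnt/cifs -o $OPTS"
```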
It is a bug in Linux, not RHGS. It should clearly be documented that way, because we'll run into it with any Linux CIFS client using SMB2+. The wording I chose was intentional: we'll have this as an issue for a while even after a fix lands, because the fix is a Linux kernel fix, and fixes can take a while to reach clients. Once a fix is found we'll clearly direct people to update their clients, but production is production, and they can't always do that. Thankfully, I don't expect much SMB2+ on Linux in the field, so this should largely be a non-issue.
The wording is awkward: is SMB 2.0, 2.1 and 3.0 unusable for ALL clients due to the Linux cifs client, or just for the Linux cifs client itself? It should be usable from Windows :).
(In reply to Ira Cooper from comment #17) > The wording is awkward, is SMB 2.0, 2.1 and 3.0 not usable for ALL clients > due to the linux cifs client, or just for the linux cifs client? > > It should be usable from Windows :). Right, the wording has to be chosen carefully. It has to make clear that: 1) SMB 2 and newer are generally supported on RHGS. 2) SMB 2+ are NOT supported with the linux cifs client (due to a bug in the cifs client). 3) It is up to the user of the cifs client to ensure that it does not mount with SMB version >= 2. (In order to not impact other clients.) Furthermore: 4) If a cifs client triggers the hang with SMB >= 2, are other clients (Windows...) affected as well? If yes, we should document that, too. Cheers - Michael
Other clients should not be impacted. -Ira
Try the text above; I think it is a hair awkward, but it is factually correct.