Bug 1300572 - SMB: while running dd from multiple cifs mounts with aio enabled, cancelling the I/O causes the mount point to hang
Status: ASSIGNED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: samba
Version: 3.1
Hardware: Unspecified Unspecified
Priority: unspecified
Severity: unspecified
Assigned To: rhs-smb@redhat.com
QA Contact: storage-qa-internal@redhat.com
Keywords: ZStream
Depends On:
Blocks: 1268895
 
Reported: 2016-01-21 02:56 EST by surabhi
Modified: 2017-03-25 12:26 EDT

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Due to a bug in the Linux CIFS client, SMB2.0+ connections from Linux to Red Hat Gluster Storage currently will not work properly. SMB1 connections from Linux to Red Hat Gluster Storage, and all connections with supported protocols from Windows continue to work. Workaround: If practical, restrict Linux CIFS mounts to SMB version 1. The simplest way to do this is to not specify the 'vers' mount option, since the default setting is to use only SMB version 1. If restricting Linux CIFS mounts to SMB1 is not practical, disable asynchronous I/O by setting 'aio read size' to 0.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
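
The workaround in the Doc Text can be sketched as a small helper that prints a mount command pinned to SMB1. This is only an illustration: the function name, server, share, mount point and user are hypothetical placeholders, and it assumes the client's mount.cifs accepts the value vers=1.0. The command is printed rather than executed, so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a cifs mount command that pins
# the dialect to SMB1 instead of relying on the client default.
# All four arguments are placeholders supplied by the caller.
smb1_mount_cmd() {
    server="$1"; share="$2"; mountpoint="$3"; user="$4"
    echo "mount -t cifs //$server/$share $mountpoint -o vers=1.0,username=$user"
}
```

For example, `smb1_mount_cmd rhgs-node1 gluster-vol /mnt/cifs1 smbuser` prints the command for one mount. If SMB1 is not practical, the server-side alternative named in the Doc Text is the smb.conf line `aio read size = 0` for the affected share.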


Description surabhi 2016-01-21 02:56:06 EST
Description of problem:

***************************************

While running I/O (using dd) from multiple cifs mounts on the same client, cancelling dd with Ctrl-C does not stop dd, and ls from another mount point hangs for a while.

The dd process stops only when it is killed manually.

Tried the same test with an xfs-samba share; there the dd command stops as soon as we try to cancel it.

When dd is hung on all the mounts, the kernel cifs client shows the following errors:
Jan 13 16:49:20 localhost systemd-logind: New session 1096 of user root.
Jan 13 16:55:49 localhost kernel: CIFS VFS: Error -32 sending data on socket to server
Jan 13 17:00:16 localhost kernel: CIFS VFS: Server 10.70.47.179 has not responded in 120 seconds. Reconnecting...



Jan 13 21:30:20 localhost kernel: INFO: task kworker/1:2:441 blocked for more than 120 seconds.
Jan 13 21:30:20 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 21:30:20 localhost kernel: kworker/1:2     D ffff880036b77d60     0   441      2 0x00000000
Jan 13 21:30:20 localhost kernel: Workqueue: cifsiod cifs_oplock_break [cifs]
Jan 13 21:30:20 localhost kernel: ffff880036b77cf0 0000000000000046 ffff8800369c8b80 ffff880036b77fd8
Jan 13 21:30:20 localhost kernel: ffff880036b77fd8 ffff880036b77fd8 ffff8800369c8b80 ffff880036b77d70
Jan 13 21:30:20 localhost kernel: ffff88021ff44200 0000000000000002 ffffffffa04b2ca0 ffff880036b77d60
Jan 13 21:30:20 localhost kernel: Call Trace:
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b2ca0>] ? cifs_push_posix_locks+0x320/0x320 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff8163a909>] schedule+0x29/0x70
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b2cae>] cifs_pending_writers_wait+0xe/0x20 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff81638780>] __wait_on_bit+0x60/0x90
Jan 13 21:30:20 localhost kernel: [<ffffffff810c1986>] ? dequeue_entity+0x106/0x510
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b2ca0>] ? cifs_push_posix_locks+0x320/0x320 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff81638837>] out_of_line_wait_on_bit+0x87/0xb0
Jan 13 21:30:20 localhost kernel: [<ffffffff810a6b60>] ? wake_atomic_t_function+0x40/0x40
Jan 13 21:30:20 localhost kernel: [<ffffffffa04b3496>] cifs_oplock_break+0x66/0x2e0 [cifs]
Jan 13 21:30:20 localhost kernel: [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
Jan 13 21:30:20 localhost kernel: [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
Jan 13 21:30:20 localhost kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
Jan 13 21:30:20 localhost kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Jan 13 21:30:20 localhost kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 13 21:30:20 localhost kernel: [<ffffffff81645858>] ret_from_fork+0x58/0x90
Jan 13 21:30:20 localhost kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Jan 13 21:30:20 localhost kernel: INFO: task rm:10770 blocked for more than 120 seconds.
Jan 13 21:30:20 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 13 21:30:20 localhost kernel: rm              D ffff880211110148     0 10770  10560 0x00000084

Jan 13 21:30:20 localhost kernel: ffff880211e27db0 0000000000000082 ffff880210aa7300 ffff880211e27fd8
Jan 13 21:30:20 localhost kernel: ffff880211e27fd8 ffff880211e27fd8 ffff880210aa7300 ffff880211110140
Jan 13 21:30:20 localhost kernel: ffff880211110144 ffff880210aa7300 00000000ffffffff ffff880211110148
Jan 13 21:30:20 localhost kernel: Call Trace:
Jan 13 21:30:20 localhost kernel: [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
Jan 13 21:30:20 localhost kernel: [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
Jan 13 21:30:20 localhost kernel: [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
Jan 13 21:30:20 localhost kernel: [<ffffffff811eae8e>] vfs_unlink+0x4e/0x150
Jan 13 21:30:20 localhost kernel: [<ffffffff811ef42e>] do_unlinkat+0x26e/0x2b0
Jan 13 21:30:20 localhost kernel: [<ffffffff81641113>] ? do_page_fault+0x23/0x80
Jan 13 21:30:20 localhost kernel: [<ffffffff811f033b>] SyS_unlinkat+0x1b/0x40
Jan 13 21:30:20 localhost kernel: [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
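
To pull just the blocked tasks out of a log like the one above, a filter along these lines may help. It is a sketch, not part of the original report: the function name is made up, and it assumes the "INFO: task NAME:PID blocked for more than N seconds" wording shown in these messages.

```shell
#!/bin/sh
# Hypothetical filter: read kernel log lines on stdin and print
# "task-name pid seconds" for every hung-task report.
# The task name may itself contain colons (e.g. kworker/1:2), so the
# pattern relies on the pid being the last ":digits" before " blocked".
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

Typical use would be `dmesg | hung_tasks` or `hung_tasks < /var/log/messages` on the cifs client.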



Version-Release number of selected component (if applicable):
***********************************
samba-4.2.4-12.el7rhgs.x86_64
glusterfs-3.7.5-16.el7rhgs.x86_64
cifs-utils-6.2-7.el7.x86_64
uname -a
Linux dhcp46-56.lab.eng.blr.redhat.com 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
2/4 tried

Steps to Reproduce:
1. Create a dis-rep (distributed-replicate) volume and mount it over cifs multiple times on the same client.
2. Start dd from each mount: dd if=/dev/zero of=file7 bs=1M count=1024
3. Press Ctrl-C to cancel dd on one of the mount points; it doesn't exit.
4. Run ls from another mount point. Sometimes ls hangs and recovers after a while; in other instances the mount point hangs because the cifs client itself is hung.

The dd process keeps running until it is killed explicitly.

When tried with an xfs-samba share, Ctrl-C immediately kills dd on all mount points.
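
The steps above could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used for this report: the function names and the DELAY/GRACE defaults are invented, the mount points are passed in by the caller, and timeout(1) from coreutils stands in for the manual Ctrl-C by sending SIGINT after DELAY seconds.

```shell
#!/bin/sh
# Hypothetical reproduction helper. Mount point paths must not contain
# spaces or colons, because they are packed into a simple "pid:dir" list.
DELAY="${DELAY:-10}"     # seconds before SIGINT is sent (assumed value)
GRACE="${GRACE:-30}"     # seconds after SIGINT before SIGKILL (assumed value)
COUNT="${COUNT:-1024}"   # dd block count, as in step 2

# Map the exit status of timeout(1) to a short verdict.
classify() {
    case "$1" in
        0)   echo "completed before interrupt" ;;
        124) echo "exited after SIGINT" ;;
        137) echo "needed SIGKILL" ;;
        *)   echo "failed with status $1" ;;
    esac
}

# Start dd on every given directory at once, then report how each ended.
cancel_test() {
    pids=""
    for dir in "$@"; do
        timeout -s INT -k "$GRACE" "$DELAY" \
            dd if=/dev/zero of="$dir/file7" bs=1M count="$COUNT" 2>/dev/null &
        pids="$pids $!:$dir"
    done
    for entry in $pids; do
        pid="${entry%%:*}"
        dir="${entry#*:}"
        wait "$pid"
        status=$?
        echo "$dir: $(classify "$status")"
    done
}
```

A run such as `cancel_test /mnt/cifs1 /mnt/cifs2` (hypothetical mount points) prints one verdict per mount. If dd is stuck in uninterruptible sleep, as in the traces in this report, even SIGKILL may not take effect until the I/O returns, so the script itself can appear to hang; that outcome is consistent with the reported behaviour.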


Actual results:

The I/O doesn't stop when cancelled, and the process keeps running until it is killed explicitly.
There are also CIFS VFS errors in the dmesg output of the cifs client.

Expected results:
The dd command should exit when cancelled, and no process on the cifs client should hang.

Additional info:

Will try the same test case with RHEL6 and update the result soon.
Comment 6 surabhi 2016-02-09 01:31:49 EST
Continuously seeing the following errors from the Linux cifs client when running I/O from a cifs mount using vers=3.

Feb  8 23:24:02 dhcp46-56 kernel: CIFS VFS: SMB response too long (262224 bytes)
Feb  8 23:24:03 dhcp46-56 kernel: CIFS VFS: SMB response too long (262224 bytes)
Feb  8 23:24:03 dhcp46-56 kernel: CIFS VFS: Send error in read = -11
Feb  8 23:24:04 dhcp46-56 kernel: CIFS VFS: SMB response too long (524368 bytes)
Feb  8 23:24:04 dhcp46-56 kernel: CIFS VFS: Send error in read = -11
Feb  8 23:24:06 dhcp46-56 kernel: CIFS VFS: SMB response too long (524368 bytes)
Feb  8 23:24:06 dhcp46-56 kernel: CIFS VFS: Send error in read = -11
Feb  8 23:24:08 dhcp46-56 kernel: CIFS VFS: SMB response too long (1048656 bytes)
Comment 7 Ira Cooper 2016-02-09 11:04:54 EST
This is a generic error; it has nothing to do with RHGS, alas.

"There is a known issue in the RHEL SMB client with SMB vers=2, 2.1 or 3.  They will not work properly with RHGS."

Your thoughts Michael?
Comment 8 Michael Adam 2016-02-09 18:29:59 EST
The original bug description reads

> Tried the same test with xfs-samba share, dd command stops
> as we try to cancel it.

So is this really a generic bug?

Also, there is no configuration or exact description of the setup.
E.g. the only reference to aio is in the subject (added afterwards).
Somehow the problem is not qualified well enough, to be a known
issue that I could propose a doctext for...

- I assume it is happening with vfs_glusterfs and aio enabled.
- Does it happen with vfs_glusterfs but without aio?
- Does it happen with gluster fuse mount and aio enabled / disabled?
- Does it happen with xfs and aio enabled?

- What is the real constellation of the cifs mounts?
  I.e. how many are running? Is 2 enough? Does a single
  cifs-mount not have the problem? ...

It could be that this is a bug specifically triggered by
the vfs_glusterfs aio and a problem in cifs mount.

So more data please!  :-)
Comment 10 surabhi 2016-02-10 01:23:55 EST
1. aio was added to the subject because aio is enabled by default for 3.1.2. (Just wanted to point out that aio is enabled when the issue is seen.)
2. Yes, it happens when aio is enabled and we create multiple cifs mounts and run dd on all the mount points. dd doesn't exit and the mount point hangs.
3. If we disable aio, dd exits on all mount points when cancelled.
4. vfs_glusterfs without aio doesn't hang.
5. With a single mount the issue is not seen; with more than one cifs mount and aio enabled, the issue is seen.

Following is the data:

without aio:
glusterfs-samba share
dd exits on cancelling (from a single mount as well as multiple mounts)
No hang
*****************************
without aio:
xfs-samba share
dd exits on cancelling (from a single mount as well as multiple mounts)
No hang
******************************
with aio:
xfs-samba share
dd exits on cancelling (from a single mount as well as multiple cifs mounts)
*******************************
with aio:
glusterfs-samba share
dd doesn't exit on cancelling (always, with more than one cifs mount)
Hang
******************************



The cifs client shows:

[1018560.875127] INFO: task dd:30545 blocked for more than 120 seconds.
[1018560.877405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1018560.879689] dd              D ffff8800c3bee9f8     0 30545  30352 0x00000080
[1018560.879697]  ffff8801ec80fc70 0000000000000086 ffff88021233c500 ffff8801ec80ffd8
[1018560.879703]  ffff8801ec80ffd8 ffff8801ec80ffd8 ffff88021233c500 ffff8800c3bee9f0
[1018560.879707]  ffff8800c3bee9f4 ffff88021233c500 00000000ffffffff ffff8800c3bee9f8
[1018560.879712] Call Trace:
[1018560.879727]  [<ffffffff8163b9e9>] schedule_preempt_disabled+0x29/0x70
[1018560.879736]  [<ffffffff816396e5>] __mutex_lock_slowpath+0xc5/0x1c0
[1018560.879741]  [<ffffffff81638b4f>] mutex_lock+0x1f/0x2f
[1018560.879747]  [<ffffffff811eb9af>] do_last+0x28f/0x1270
[1018560.879754]  [<ffffffff811c11ce>] ? kmem_cache_alloc_trace+0x1ce/0x1f0
[1018560.879759]  [<ffffffff811ee672>] path_openat+0xc2/0x490
[1018560.879765]  [<ffffffff811efe3b>] do_filp_open+0x4b/0xb0
[1018560.879771]  [<ffffffff811fc9c7>] ? __alloc_fd+0xa7/0x130
[1018560.879778]  [<ffffffff811dd7e3>] do_sys_open+0xf3/0x1f0
[1018560.879784]  [<ffffffff811dd8fe>] SyS_open+0x1e/0x20
[1018560.879791]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
[1018726.314764] SELinux: initialized (dev cifs, type cifs), uses genfs_contexts

*****************************************************************

The error mentioned in #C6 happens when we mount with vers=3 and aio enabled. Another BZ, https://bugzilla.redhat.com/show_bug.cgi?id=1305657, has been raised for it.

Let me know if you need more data.
Comment 11 surabhi 2016-02-10 01:26:48 EST
Ira,
Could you provide your RCA w.r.t #C4 which will help everyone.
Comment 12 Ira Cooper 2016-02-14 02:47:02 EST
The issue appears to be that SMB2+ on the Linux CIFS client can't handle async operations.

This really hurts us, because for Windows the async ops make a big speed difference, but they are incompatible with the Linux client.

For now I'm going to recommend the use of SMB1 for Linux, given that we expect Windows to be the dominant use-case for CIFS/SMB.

"Due to a bug within the Linux CIFS client, we do not support the use of SMB2+ with RHGS." is my suggestion for the known issue text, for now. Until the release, I will work on seeing whether we can counter it.
Comment 15 Ira Cooper 2016-02-15 01:29:44 EST
It is a bug in Linux, not RHGS.

It should clearly be documented that way, because we'll run into it with any Linux CIFS using SMB2+.

The wording I chose was intentional, because this will remain an issue for a while even after a fix exists, since the fix is purely a Linux kernel fix. Fixes can take a while to be applied to clients. Once a fix is found we'll clearly direct people to update their clients, but well... Production is production, and they can't always do that.

Thankfully, I don't expect much SMB2+ on Linux in the field, so this should largely be a non-issue.
Comment 17 Ira Cooper 2016-02-16 02:01:04 EST
The wording is awkward, is SMB 2.0, 2.1 and 3.0 not usable for ALL clients due to the linux cifs client, or just for the linux cifs client?

It should be usable from Windows :).
Comment 18 Michael Adam 2016-02-16 02:59:46 EST
(In reply to Ira Cooper from comment #17)
> The wording is awkward, is SMB 2.0, 2.1 and 3.0 not usable for ALL clients
> due to the linux cifs client, or just for the linux cifs client?
> 
> It should be usable from Windows :).

Right, the wording has to be chosen carefully.
It has to make clear that:

1) SMB 2 and newer are generally supported on RHGS.

2) SMB 2+ are NOT supported with the linux cifs client
   (due to a bug in the cifs client).

3) It is up to the user of the cifs client to ensure that it does
   not mount with SMB version >= 2. (In order to not impact other
   clients.)

Furthermore:

4) If a cifs client triggers the hang with SMB >= 2,
   are other clients (Windows...) affected as well?
   If yes, we should document that, too.

Cheers - Michael
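
Point 3 above leaves it to the client administrator to avoid SMB 2 and newer. A quick way to audit a client might look like the sketch below. The function name is hypothetical, and it assumes the kernel reports the dialect of each cifs mount as vers=<x.y> in the options field of /proc/mounts, which may not hold on every kernel.

```shell
#!/bin/sh
# Hypothetical audit helper: read /proc/mounts-style lines on stdin
# (device mountpoint fstype options ...) and print the mount point and
# dialect of every cifs mount using SMB 2.0 or newer.
# Non-numeric values such as vers=default are not flagged.
smb2plus_mounts() {
    awk '$3 == "cifs" {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^vers=/) {
                v = substr(opts[i], 6)
                if (v + 0 >= 2) print $2, v
            }
    }'
}
```

Running `smb2plus_mounts < /proc/mounts` on the client prints nothing when all cifs mounts are on SMB1.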
Comment 19 Ira Cooper 2016-02-16 04:00:02 EST
Other clients should not be impacted.

-Ira
Comment 21 Ira Cooper 2016-02-17 04:15:18 EST
Try the text above. I think it is a hair awkward, but it is factually correct.
