| Summary: | Killing a job that writes to mounted Windows directory will result in permission denied on that mount point | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | chruitad |
| Component: | kernel | Assignee: | Sachin Prabhu <sprabhu> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 6.2 | CC: | chorn, chruitad, jlayton, mchristi, nfs-maint, rcrews, smfltc, sprabhu, stepkhcad, steved |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | CIFS | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-10-15 23:07:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 798385 | | |
| Attachments: |
|
||||||
What we need to understand is what happens with the later calls after you kill
that process. An interesting first step may be to get the client into this
state and then turn up cifs debugging. Then attempt to stat() the mountpoint,
disable the debugging, and collect the logs. Here's some info on how to do that:
http://wiki.samba.org/index.php/LinuxCIFS_troubleshooting#Enabling_Debugging
...that may give us an initial idea of what's going on when this occurs.
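The capture sequence described above (enable debugging, stat() the mountpoint, disable debugging) could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, it assumes the debug switch is the `/proc/fs/cifs/cifsFYI` file covered by the linked wiki page, and it assumes that writing `1` enables debugging and `0` disables it. It has to run as root on the affected client.

```python
import os


def capture_stat_with_debug(mountpoint, fyi_path="/proc/fs/cifs/cifsFYI", level="1"):
    """Enable CIFS debugging, stat() the mount point, then disable debugging.

    fyi_path and level are assumptions based on the usual cifsFYI switch;
    adjust them to whatever the troubleshooting page recommends.
    """
    with open(fyi_path, "w") as f:
        f.write(level)  # turn debugging up
    try:
        try:
            os.stat(mountpoint)
            outcome = "ok"
        except OSError as exc:
            # On a broken mount this is where "Permission denied" would show up.
            outcome = f"stat failed: {exc.strerror}"
    finally:
        with open(fyi_path, "w") as f:
            f.write("0")  # always turn debugging back off
    return outcome
```

For example, `capture_stat_with_debug("/mnt/tmp")` would be called once the client is in the failed state; the kernel messages produced in between still have to be collected separately, for instance from `dmesg` or the system log.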
Created attachment 568041 [details]
log file
Please see the attached log file.
Thank you for your help.
Looks like something fell down in the handling of signatures after the signal. Most likely the sequence numbers got out of whack somehow. Some servers disconnect the socket when there's a signing failure, forcing the client to reconnect. This one apparently doesn't. A possible workaround is to mount with crypto signatures disabled until we can track down the problem and come up with a fix.

Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

Targeting this issue for RHEL 6.4.

Fedora 16 also has the same problem. It happens only when using a Windows 2003/2008 Server that is a member of a Windows Domain. With a standalone Windows machine (XP/2003/2008/Win7) the problem does not appear.

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

This looks like the same problem which was fixed by upstream commit 31efee60f489c759c341454d755a9fd13de8c03d. This fix has been backported to the 6.5 devel tree in version 2.6.32-408.el6. Please contact support for a test kernel containing this fix.
Sachin Prabhu

Closing this bz as dup of bz 877010. The patch requested in this bz has been included in the devel tree as part of the solution for bz 877010. The patched kernel will be released for RHEL 6.5. If you require this fix before the release of RHEL 6.5, please contact Red Hat support for kernels containing the patch.
Sachin Prabhu

*** This bug has been marked as a duplicate of bug 877010 ***

I am unable to see the duplicate bug. Can you please grant me permissions so that I can see the progress?
This allows a single user to wreak havoc on the usability of my systems and I am eager to follow and try the fix.

Mark,
BZ 877010 was an internal bugzilla created to track the patches required to reduce kmapping used by async read and write code. Upstream commit
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=31efee60f489c759c341454d755a9fd13de8c03d
which fixes the issue reported in this bz, was backported as part of those fixes.

If you look at the log file from c#3, we see the following message printed repeatedly:
"CIFS VFS: Unexpected SMB signature"

The "Unexpected SMB signature" message is usually seen when the sequence number expected by the client from the server does not match what was sent. The sequence number is incremented for every message sent by the client and the server. The CIFS signature is an MD5 hash built using the session key, the response calculated during the authentication process, and the message itself, which contains the sequence number. The resulting hash is stored in the same location in the header as the sequence number, overwriting it. When the client receives a message from the server, it saves the signature from the header, replaces the signature with the expected sequence number, and calculates its own MD5 hash. It then compares the MD5 hash it calculated with the MD5 hash received in the packet to verify that the message was indeed sent by the server. If they don't match, it prints out the message we see above.

What commit 31efee60f489c759c341454d755a9fd13de8c03d fixes is a case where the sequence number is incorrectly incremented in expectation of a response to an NT_CANCEL call. This call is issued when we want to cancel a request, and it doesn't result in a response from the server. As mentioned in the summary, the problem is seen when a job is killed. When the job is killed, the client sends out an NT_CANCEL request.
Without the patch mentioned above, we incorrectly increment the sequence number, resulting in the "Unexpected SMB signature" error message. The client never recovers from this mismatched sequence number problem.

As part of the patches for bz 877010, I also backported to RHEL 6 the patch required to fix the NT_CANCEL issue, as well as a patch for another case which could result in an invalid sequence number.

* Mon Aug 05 2013 Rafael Aquini <aquini> [2.6.32-408.el6]
..
- [fs] cifs: on send failure, readjust server sequence number downward (Sachin Prabhu) [877010]
..
- [fs] cifs: adjust sequence number downward after signing NT_CANCEL request (Sachin Prabhu) [877010]

Both these patches are available in the RHEL 6.5 kernel. Please install this kernel and test. If you still see this problem, please open a case with Red Hat Support, who can help prioritise this issue and have a fix released for you in time.

Sachin Prabhu |
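The signing check and the NT_CANCEL desynchronisation described in the comment above can be illustrated with a small toy model. This is a sketch under simplifying assumptions, not the kernel's code: the message layout (an 8-byte signature field followed by the payload), the key, and all class and function names are hypothetical. Only the server's responses are signed here, and the authentication response that the real signature also covers is left out. The counting scheme (two sequence numbers per request/response pair, one for an NT_CANCEL) is an assumption inferred from the patch title "adjust sequence number downward after signing NT_CANCEL request".

```python
import hashlib
import struct

SIG_LEN = 8  # toy layout: 8-byte signature field, then the payload


def sign(key: bytes, seq: int, payload: bytes) -> bytes:
    """Hash key + (sequence number in the signature field) + payload, then
    store the truncated digest where the sequence number was."""
    seq_field = struct.pack("<Q", seq)
    digest = hashlib.md5(key + seq_field + payload).digest()[:SIG_LEN]
    return digest + payload


def verify(key: bytes, expected_seq: int, message: bytes) -> bool:
    """Recompute the hash with the expected sequence number and compare."""
    received_sig, payload = message[:SIG_LEN], message[SIG_LEN:]
    return sign(key, expected_seq, payload)[:SIG_LEN] == received_sig


class ToyServer:
    def __init__(self, key):
        self.key, self.seq = key, 0

    def handle(self, replies=True):
        self.seq += 1  # the incoming request consumed one sequence number
        if not replies:
            return None  # an NT_CANCEL gets no response of its own
        response = sign(self.key, self.seq, b"response")
        self.seq += 1  # the response consumed the next one
        return response


class ToyClient:
    def __init__(self, key, adjust_after_cancel):
        self.key, self.seq = key, 0
        self.adjust_after_cancel = adjust_after_cancel

    def request(self, server, is_cancel=False):
        expected_response_seq = self.seq + 1
        self.seq += 2  # account for the request and its response
        if is_cancel and self.adjust_after_cancel:
            self.seq -= 1  # patched behaviour: no response will arrive
        response = server.handle(replies=not is_cancel)
        if response is None:
            return None
        return verify(self.key, expected_response_seq, response)


def run(adjust_after_cancel):
    """One normal request, one NT_CANCEL, then three more normal requests."""
    key = b"hypothetical-session-key"
    server = ToyServer(key)
    client = ToyClient(key, adjust_after_cancel)
    results = [client.request(server)]
    client.request(server, is_cancel=True)
    results += [client.request(server) for _ in range(3)]
    return results
```

In this model `run(adjust_after_cancel=False)` passes the first check and fails every check after the cancel, which mirrors the client never recovering, while `run(adjust_after_cancel=True)` keeps both counters in step.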
Description of problem:
I am using linux kernel 2.6.32-220.4.1.el6.x86_64 (RHEL6) and cifs version 4.8.1. Using a mount point (/mnt/tmp), we are able to read/write to our Windows directories. Occasionally, a user will kill a job that is writing to these directories. When this happens, it corrupts the mount point somehow and we get a permission denied error when we do an "ls". If I do an "lsof" and grep for the path, I get a message:
WARNING: can't stat() cifs file system /mnt/tmp
It seems that if I am able to successfully unmount all of these mount points, I can do a "mount -a" and recover. However, a user should be able to kill a job without ruining mount points.

Version-Release number of selected component (if applicable):
RHEL 6.2 and CIFS 4.8.1

How reproducible:
Configure winbind and samba for Windows users' login. Add a mount point.

Steps to Reproduce:
1. Configure winbind login
2. Add a mount point with cifs to a Windows share
3. Have a Windows user logged into the Linux box kill a job that writes to the Windows share.

Actual results:
Occasionally, killing the job results in a permission denied error at the mount point.

Expected results:
I would expect the user to be able to kill their jobs without locking up the mount point.

Additional info: