Bug 429755
Summary: | Null bytes in files access by 2 or more NFS clients | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Thom O'Connor <thom> |
Component: | kernel | Assignee: | Jeff Layton <jlayton> |
Status: | CLOSED ERRATA | QA Contact: | Martin Jenner <mjenner> |
Severity: | high | Priority: | high |
Version: | 5.1 | CC: | coughlan, dzickus, k.georgiou, lwang, staubach, steved |
Target Milestone: | rc | Keywords: | ZStream |
Hardware: | All | OS: | Linux |
Fixed In Version: | RHBA-2008-0314 | Doc Type: | Bug Fix |
Last Closed: | 2008-05-21 15:07:14 UTC | Bug Blocks: | 432078, 1002830 |
Description
Thom O'Connor
2008-01-22 21:31:23 UTC
Created attachment 292558 [details]
vi session screenshot of mailbox showing null bytes in message file
Please note that the screenshot has been modified to obfuscate potentially
proprietary data. These fields are clearly designated with red boxes.
This screenshot is quite representative of the problem, demonstrating how
entire or partial sections of messages are replaced with null bytes, in what
appear to be 1:1 byte replacements. The following message is properly
delineated in the message "mbox" file with a new special From line:
From <...
A few questions:

1. What did these corrupted files look like from the server end?
2. As stated in the problem statement, "RHEL 4.5" doesn't have this issue, but one thing is not clear: when the platform was moved back to RHEL 4.5, was the server moved too? What OSs are running on the server with and without the problem? What is the server's filesystem (GFS, ext3, etc.)?

Intuitively, if the file size and offset are correct but the file contents are partially filled with "NULL" characters, it normally implies that the file space was allocated but the file contents are not there. We need to isolate whether this is really an NFS client issue as stated, or a server (nfsd and/or filesystem) issue.

The NAS servers reportedly in use here are the following:

1. Customer A: NetApp 3020c running OnTAP v7.2.2
2. Customer B: NetApp 3020c

So, both are using NetApp. We have contacts at NetApp if they should be brought into the discussion.

However, one note that may or may not be in the case so far, but should be: the "null bytes" problem disappeared when Customer A went to a single NFS client (taking one CommuniGate Pro "Backend Server" offline). The problem can occur only with two or more NFS clients (Backend Servers) online.

NetApp uses a proprietary filesystem called "WAFL". I am unsure whether NetApp can provide filesystem/shell-level access to WAFL directly from their NAS device, but it may be possible.

If we can attempt to reproduce these tests in-house, we do have a NetApp device on which to try this, although the model number and NetApp OS version may differ and would need to be researched.

OK, thanks! I was wondering whether GFS clusters were involved. With the above info, I would say this does look like an NFS client issue at this moment. The info will be passed to the Red Hat NFS kernel folks. Thanks for the detailed BZ.
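The remark above, that a correct file size with NUL-filled regions usually means space was allocated but the data never arrived, can be illustrated on a local filesystem. This is only a sketch of the general effect, not a diagnosis of this bug: the helper names `write_with_gap` and `demo` are made up for illustration, and nothing here touches NFS.

```python
import os
import tempfile


def write_with_gap(path, first, gap, second):
    """Write `first`, advance the offset by `gap` bytes without writing, then write `second`."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, first)
        os.lseek(fd, gap, os.SEEK_CUR)  # move the offset forward; no data is written here
        os.write(fd, second)
        os.fsync(fd)
    finally:
        os.close(fd)


def demo():
    """Create a file with an unwritten gap and return its full contents."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "hole.txt")
        write_with_gap(path, b"From A\n", 16, b"From B\n")
        with open(path, "rb") as f:
            return f.read()
```

Reading the file back gives the two written records with the skipped range in between returned as zero bytes, which is the same "right size, wrong contents" appearance described in the report.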
> Duplicating this problem will be challenging, though we believe possible ...
I am making arrangements to make RHEL 5.1 available to you. If you can reproduce
the problem that will be the first step.
Yes, thanks for the detailed bug report. I've looked over this and have a question:

> 2. Have one NFS client modify a non-binary, text file, using C++ operations such as lseek(), write(), and fsync() (all filehandles are properly fsync()'d when closed by an NFS client)
> 3. No less than 6 seconds later, have a second NFS client open the same file, modify it (lseek/write/fsync).

Is there any sort of fcntl locking going on here? You don't mention any, so I assume not...

Would it be possible for you to write a small reproducer program and give us a set of steps to duplicate this? Trying to troubleshoot this in the context of an MTA is going to be tricky. It'll be much easier if we can reduce the reproducer down.

Ideally we would, yes, write such a program. We do not have one currently for this particular problem. For the previous NFS-related "file-caching" bug (fixed in the 2.6.13/2.6.14 kernels by Trond Myklebust), we did produce such a tool. However, that tool does not appear to trigger this new problem.

I hope to be trying to reproduce this issue next week. If we can do so reliably, we can write such an application.

Sincerely,
-t

Excellent. I'll set this to NEEDINFO for now. Go ahead and set it back to ASSIGNED once you have more info to go on.

Created attachment 294295 [details]
Trivial testcase that should demonstrate the problem
Instructions for the trivial testcase script:
Please edit the variables 'filename' and 'remote' depending on your
test environment.
The testcase should be run on NFS client number 1. '$remote' is another
NFS client that shares the same NFS namespace (and has access to the file
${filename})
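The attached script is not reproduced in this report. As a rough illustration only, the per-client access pattern described earlier in this bug (open, lseek(), write(), fsync(), close) might look like the sketch below. The function name `append_record` is hypothetical, and this is not the attached testcase.

```python
import os


def append_record(path, record):
    """Open the file, seek to its end, write the record, fsync, and close.

    Returns the offset at which the record was written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)
        written = 0
        while written < len(record):
            # os.write may write fewer bytes than requested, so loop until done
            written += os.write(fd, record[written:])
        os.fsync(fd)  # flush to stable storage before closing
        return offset
    finally:
        os.close(fd)
```

To approximate the scenario in the report, one would call this on the first NFS client, wait several seconds, call it on a second client that mounts the same export, and then inspect the file from both clients.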
Created attachment 294296 [details]
NFS: Fix a potential file corruption issue when writing
Proposed fix for the bug.
Thanks, Trond. Let me see what we can do about getting this in soon.

Yep, the reproducer here is indeed trivial and consistently fails. The patch seems to fix it and looks sane.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

I've got some test kernels on my people page with this patch. Thom, would you be able to test your product on them and let me know if they correct the issue? Note that these kernels are based on development builds and aren't fully QA'ed, so please only deploy them for testing purposes...

http://people.redhat.com/jlayton

Excellent work all around, thank you folks. Thank you, Trond. I hope to test with the new kernel today.

Excellent. I've just posted a new set of test kernels on my people page (jtltest.20). I'd recommend using those instead of any earlier ones, since those kernels should also have the fixes for the vmsplice() local exploit that was disclosed recently. Let me know how it goes.

in 2.6.18-81.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

I have confirmed that the new kernel 2.6.18-81.el5 looks to have eliminated the null bytes problem when using the testcase that Trond provided. I am also running a "SPECmail" test on this environment today, with the new kernel in place, in order to verify proper behaviour under higher load.

Thanks, sincerely,
-t

More supporting detail - I ran two SPECmail tests this afternoon. Each test used 2 Linux NFS client servers, attached to a shared NFS storage volume.
The kernels used were as follows, along with the results:

Test tool: SPECmail (spec.org) v1.01
Server application: CommuniGate Pro 5.2.0 x86-64, 0+2 Dynamic Cluster
NFS server: NetApp FAS270
OS: RedHat 5.1 x86_64 [RHEL5.1-Server-20071017.0-x86_64-DVD.iso]

Test 1: Kernel vmlinuz-2.6.18-53.el5
Resulted in null bytes in CommuniGate Pro "mailbox" files (I will attach a sample mailbox file demonstrating the null bytes.)

Test 2: Kernel vmlinuz-2.6.18-81.el5
Resulted in no null bytes in mailboxes

Thank you, please let me know if there are any questions.
Sincerely.

Created attachment 295054 [details]
INBOX mailbox with 1 message replaced with null bytes
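For checking mailbox files like this one, a small scanner that reports runs of NUL bytes can help. The following is a minimal sketch; the names `find_nul_runs` and `scan_file` are hypothetical and are not part of any tool mentioned in this report.

```python
import re


def find_nul_runs(data, min_len=1):
    """Return (offset, length) for each run of NUL bytes at least min_len long."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_file(path, min_len=1):
    """Read a file in binary mode and report its NUL-byte runs."""
    with open(path, "rb") as f:
        return find_nul_runs(f.read(), min_len)
```

Running `scan_file` on a mailbox after a test run gives the offset and length of each zero-filled region; an empty list means no NUL bytes were found. Raising `min_len` filters out short runs if a mailbox legitimately contains isolated NUL bytes.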
Verified based on customer's report as well as the testcase.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html