Bug 200974
Summary: | NFS (client) file corruption with 2.6.17-1.2488.fc6xen | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Mike Gahagan <mgahagan> |
Component: | kernel | Assignee: | Steve Dickson <steved> |
Status: | CLOSED NOTABUG | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | davej, dhowells, nhorman, riel, syeghiay, wtogami |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-03-09 20:43:00 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Mike Gahagan
2006-08-01 21:52:44 UTC
Does this only happen with a Xen kernel? This is happening on the non-Xen kernel as well. What's interesting is that the file size is always correct. I did a diff on the files and did not get very much back, so it's not as if the file is getting truncated or filled with zeros.

    -bash-3.1# ll kernel-2.6.17-1.2505.fc6.src.rpm
    -rw-r--r-- 14 root root 49890117 Aug  1 21:17 kernel-2.6.17-1.2505.fc6.src.rpm
    -bash-3.1# rpm -K kernel-2.6.17-1.2505.fc6.src.rpm
    kernel-2.6.17-1.2505.fc6.src.rpm: sha1 MD5 NOT OK
    -bash-3.1# md5sum kernel-2.6.17-1.2505.fc6.src.rpm
    da35d19179b034e487802ed9c2e2f2b2  kernel-2.6.17-1.2505.fc6.src.rpm
    -bash-3.1# uname -a
    Linux et-4.test.redhat.com 2.6.17-1.2488.fc6 #1 SMP Mon Jul 31 21:09:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

Lowering the r/wsize to 8192 did not help; transfers over HTTP or ssh also do not seem to be affected.

No change in behavior with 2.6.17-1.2505.fc6.

Hmm... I'm not seeing this at all. I'm using a 2.6.17-1.2488.fc6 kernel and installing kernel-2.6.17-1.2505.fc6.i686.rpm and both kernel-debuginfo packages from curly with no problem at all. So I guess I'm going to need a bzip2'd binary tethereal trace to try and figure out what's going on here. From the client, please use `tethereal -w /tmp/data.pcap host <server> ; bzip2 /tmp/data.pcap`

This sounds like one of the problems IBM has seen, where the installer kernel sometimes mucks up NFS retrieval somehow, somewhere. See bug 168981.

This is on an AMD x86_64 box, in case that makes any difference. I also just tried to do a yum update from a guest system on the same box, and the first package it tried to download and install was corrupted. This was the first time I've seen it with HTTP.

I have also seen this behavior with FC5 GA using NFS over TCP; using NFS over UDP works around the problem.

From a protocol standpoint, I think this is all running properly.
What I see is the following:

1) Lost segments (frames 323, 333, 337, 343, 420, 4201, 11581). These are all flowing from the server to the client, and most likely represent simple network congestion. I'd be interested to know the results of the dropped-frame counters on the NFS client, to see whether these drops are occurring on the client or elsewhere on the network. My guess would be the latter, but it would be interesting to know just the same.

2) Duplicate ACKs (even frames 324-410, odd frames 4103-4197, even frames 11582-11692). These are expected behavior. After every lost segment from (1), we enter an out-of-order delivery mode until the lost segment is retransmitted. During this time, we respond to each frame that does not satisfy the missing segment bytes with an ACK indicating the last in-order segment we received. These are recorded as "duplicate ACKs" by Wireshark (since they are), but they are specified by RFC 2581 (TCP congestion control) as correct behavior. Three duplicate ACKs received by the other peer should trigger a fast retransmit of the missing data, starting at the sequence number carried in the duplicate ACK. Unfortunately, the NFS server (I assume our Netapp) is ignoring the RFC-prescribed method of fast-retransmit detection: fast retransmit should begin after 3 duplicate ACKs are received, and it is clearly waiting for dozens of duplicate ACKs. I'm not sure whether that is caused by frames being dropped on the network or just being ignored by the Netapp, but we should definitely find out, since this looks to me like the most likely prospect in terms of what might be causing file corruption.

3) Out-of-order packets (frames 414, 416, 418, 4200, 11695, 11697, 11699). These are also expected behavior. The first frame of each of these out-of-order sequences is the start of the Netapp's fast-retransmit algorithm, in which it attempts to fill in the missing segment bytes that the duplicate ACKs of (2) were indicating.
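The duplicate-ACK and fast-retransmit behavior described in (2) can be sketched as a small simulation. This is a hypothetical illustration of the RFC 2581 rule, not kernel or Netapp code; the function names, segment numbers, and the fixed segment length are all made up for the example.

```python
def acks_for(segments, expected_start, seg_len):
    """Receiver side: each arriving segment is answered with a cumulative
    ACK for the last in-order byte.  Out-of-order segments therefore
    produce duplicate ACKs for the missing sequence number."""
    next_expected = expected_start
    acks = []
    for seq in segments:
        if seq == next_expected:
            next_expected += seg_len
        acks.append(next_expected)  # cumulative ACK
    return acks

def fast_retransmit_point(acks, threshold=3):
    """Sender side: per RFC 2581, after `threshold` duplicate ACKs
    (i.e. the original ACK plus 3 repeats), retransmit starting at the
    sequence number the duplicate ACKs are asking for.  Returns None if
    the threshold is never reached."""
    dup, last = 0, None
    for ack in acks:
        if ack == last:
            dup += 1
            if dup >= threshold:
                return ack
        else:
            dup, last = 0, ack
    return None

# Segment starting at seq 1000 is lost; 2000..5000 arrive anyway, and each
# one triggers a duplicate ACK for the missing byte at 1000.
acks = acks_for([0, 2000, 3000, 4000, 5000], expected_start=0, seg_len=1000)
print(acks)                         # → [1000, 1000, 1000, 1000, 1000]
print(fast_retransmit_point(acks))  # → 1000
```

By this rule the server should have retransmitted after the fourth identical ACK; the trace instead shows it waiting through dozens of them.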
Each set of frames seems to fill in the missing sequence correctly, although I would be interested to correlate the corrupt segments of the transmitted rpm file with the retransmitted frames. If there is a correlation, it would suggest that while the Netapp is sending the retransmits properly, it isn't providing the proper data in those frames.

So moving forward, I think we should do two things:

A) Figure out whether the Netapp NFS server is simply never seeing these duplicate ACKs, or whether it is choosing to ignore the recommendation of RFC 2581. If it is doing the latter, let's find out why, and what the impact of that is on our client.

B) Do a binary diff of the corrupted RPM file against the good RPM file (bdiff and vbindiff can do this), and see if we can correlate the corrupted segments with the retransmitted data in the tcpdump.

Is this happening with more recent kernels?

I haven't seen this problem in quite some time on the particular test box I first saw it on. I suspect this might have been hardware all along. I just did some NFS copies from curly along with a few runs of `rpm -K` and can't see any problems. I'm just going to close this, assuming this was a hardware fault.
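The binary-diff step in (B) above amounts to finding the byte ranges where the two copies differ, so their offsets can be compared against the retransmitted TCP sequence ranges in the capture. A minimal sketch, assuming the two files are the same size (as they were in this bug); the function name and the sample byte strings are invented for the example, and `cmp -l good.rpm bad.rpm` gives equivalent output from the shell:

```python
def diff_ranges(good: bytes, bad: bytes):
    """Return (start, end) offsets, end exclusive, of each run of
    differing bytes between two equal-length buffers."""
    assert len(good) == len(bad), "file sizes matched in this bug"
    ranges, start = [], None
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b and start is None:
            start = i                     # a corrupt run begins here
        elif g == b and start is not None:
            ranges.append((start, i))     # the run just ended
            start = None
    if start is not None:                 # corruption runs to end of file
        ranges.append((start, len(good)))
    return ranges

good = b"\x00" * 16
bad = b"\x00\x00\xff\xff\x00" + b"\x00" * 9 + b"\xff\x00"
print(diff_ranges(good, bad))  # → [(2, 4), (14, 15)]
```

If the reported offsets line up with the sequence-number ranges of the retransmitted frames, that would point at the retransmit path supplying wrong data.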