Bug 137270
| Summary: | "eth0: memory shortage" abort truncates very large file xfer |
|---|---|
| Product: | Red Hat Enterprise Linux 3 |
| Component: | kernel |
| Version: | 3.0 |
| Hardware: | athlon |
| OS: | Linux |
| Status: | CLOSED NOTABUG |
| Severity: | medium |
| Priority: | medium |
| Reporter: | Robert G. 'Doc' Savage <dsavage> |
| Assignee: | John W. Linville <linville> |
| CC: | petrides, riel |
| URL: | http://seclists.org/lists/linux-kernel/2002/Jul/0395.html |
| Doc Type: | Bug Fix |
| Last Closed: | 2004-11-22 20:52:42 UTC |

Description
Robert G. 'Doc' Savage
2004-10-27 03:55:46 UTC
In fact, the code in question remains all the way 'til today...

I quickly tried to reproduce this using FC3 Test 3 on an x86_64 w/ a 3c905B card, but it was successful. Can you attach the output of running sysreport on the listener? Do you have any more specific recreation instructions? I'll attach some information from my test in the next comment. I'll also install RHEL3 on this box and retry...

sender:

    Disk /dev/hda: 40.0 GB, 40000000000 bytes
    255 heads, 63 sectors/track, 4863 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

    Device     Boot  Start   End    Blocks  Id  System
    /dev/hda1  *         1    13    104391  83  Linux
    /dev/hda2           14  4609  36917370  83  Linux
    /dev/hda3         4610  4863   2040255  82  Linux swap

    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 2
    model name : Intel(R) Xeon(TM) CPU 2.40GHz
    stepping : 9
    cpu MHz : 2392.376
    cache size : 512 KB
    physical id : 0
    siblings : 2
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 2
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
    bogomips : 4718.59

    processor : 1
    vendor_id : GenuineIntel
    cpu family : 15
    model : 2
    model name : Intel(R) Xeon(TM) CPU 2.40GHz
    stepping : 9
    cpu MHz : 2392.376
    cache size : 512 KB
    physical id : 0
    siblings : 2
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 2
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
    bogomips : 4767.74

    dd if=/dev/hda2 bs=2048 | nc 172.16.59.183 30000 -w 3
    18458685+0 records in
    18458685+0 records out

listener:

    06:03.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
    06:03.0 Class 0200: 10b7:9055 (rev 30)
        Subsystem: 10b7:9055
        Flags: bus master, medium devsel, latency 32, IRQ 193
        I/O ports at bc00 [size=128]
        Memory at feaffc00 (32-bit, non-prefetchable) [size=128]
        Expansion ROM at feac0000 [disabled] [size=128K]
        Capabilities: [dc] Power Management version 1

    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 3
    model name : Genuine Intel(R) CPU 3.20GHz
    stepping : 4
    cpu MHz : 3194.190
    cache size : 1024 KB
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl cid cx16 xtpr
    bogomips : 6307.84
    clflush size : 64
    cache_alignment : 128
    address sizes : 36 bits physical, 48 bits virtual
    power management:

    nc -l -p 30000 > hd.img
    -rw-r--r-- 1 root root 37803386880 Oct 28 16:35 hd.img

OK, I am seeing the "eth0: memory shortage" messages under RHEL3 U4...

There is a patch pending that upgrades the RHEL3 3c59x to be almost identical to the RHEL4 version of the driver. It is associated w/ bug 133843 and will probably be in RHEL3 U5 (it was too late for U4). I will try that and see if the messages disappear...

Sadly, that did not seem to remove the problem...it would seem that the problem is not totally inside the driver... I'll keep thinking... :-)

My listener system has a dual Athlon (32-bit) motherboard with dual on-board 3c920 NICs. For complete hardware details see: ftp://ftp.tyan.com/datasheets/d_s2468_150.pdf

To reproduce the problem exactly... The objective of the following is to take a safety snapshot of my laptop hard drive's partitions prior to upgrading from FC1 to FC3t3. By making bit-wise copies of the partitions and of the MBR, I can do a bare-metal restoration should anything go wrong during the upgrade.

Source system:

    $ cat hda_fdisk.txt
    Disk /dev/hda: 48.0 GB, 48004669440 bytes
    16 heads, 63 sectors/track, 93015 cylinders, total 93759120 sectors
    Units = sectors of 1 * 512 = 512 bytes

    FileSys  Device     Boot     Start       End     Blocks  Id  System
    /boot    /dev/hda1  *           63    211679    105808+  83  Linux
    /        /dev/hda2          211680  20699279  10243800   83  Linux
    swap     /dev/hda3        20699280  22785839   1043280   82  Linux swap
    --       /dev/hda4        22785840  93759119  35486640    f  Win95 Ext'd (LBA)
    /pub     /dev/hda5        22785903  93759119  35486608+  83  Linux

1. Set up listener:

    # nc -l -p 30000 > /pub/images/hda1.img

2. Boot laptop with Helix 1.5 forensic CD and dd the first partition image to the listener. (Using CD so no source partitions have to be mounted.)

    # dd if=/dev/hda1 bs=2048 | nc 192.168.1.2 30000 -w 3

   Result on listener:

    # ls -gG /pub/images/hda1.img
    -rw-rw-r-- 1 108347904 Oct 24 18:20 hda1.img

3. Repeat 1 and 2 for /dev/hda2. Result on listener:

    # ls -gG /pub/images/hda2.img
    -rw-rw-r-- 1 10489651200 Oct 24 18:42 hda2.img

4. Repeat 1 and 2 for /dev/hda3. Result on listener:

    # ls -gG /pub/images/hda3.img
    -rw-rw-r-- 1 1068318720 Oct 24 19:03 hda3.img

5. Repeat 1 and 2 for /dev/hda4. Result on listener:

    # ls -gG /pub/images/hda4.img
    -rw-rw-r-- 1 1024 Oct 24 19:04 hda4.img

6. Repeat 1 and 2 for /dev/hda5. Result on listener:

    # ls -gG /pub/images/hda5.img
    -rw-rw-r-- 1 9882988544 Oct 24 22:03 hda5.img

Note that the size of this image file should be 36338319360 bytes. The dd | nc session consistently fails with "eth0: memory shortage" errors at approximately 9883xxxxxx bytes. Note carefully that the hda2.img, which is larger than the partial hda5.img, is consistently created without incident or errors.

I am still looking at this (although I will be out of town for the rest of this week)...

The "eth0: memory shortage" message comes from the 3c59x driver when it starts to fail at allocating memory for incoming packets.
Presumably your netcat sessions time out when the delays caused by the memory shortage get big enough to kill the session.

I repeated the test w/ output redirected to /dev/null. I did not get any of the "eth0: memory shortage" messages, and the test completed about 10% faster. To me, this suggested that at least some of the congestion is caused by writing to disk.

Are there any netcat options (perhaps -i and/or -w) that can improve netcat's tolerance? Is it possible you could use a receiver system w/ a faster disk subsystem?

Please note that I never saw netcat fail. My sender and receiver are very "close" on the network. Could you improve the network between the sender and the receiver?

I have a preliminary NAPI patch for the 3c59x driver. I didn't see much difference with it, but it may be worth testing? I'll attach it in case you are feeling frisky... :-) It is against RHEL3 U4, BTW...

Created attachment 106349
3c59x-napi.patch
NAPI may improve the driver's performance enough to make a difference...?

Got the patch, but not U4. Will it work against the U3 source?

I've been using netcat with '-w 3'. Maybe adding a few more seconds will keep it from timing out. Haven't tried the '-i' option before.

It'd be rather hard to find a faster disk system. I invested quite a lot of money to make this one very big and very fast :-). It has nine 10K rpm U320 drives in a hardware RAID5 array on a U160 channel. The sender is a measly 5400 rpm EIDE notebook drive.

Like yours, my sender and receiver are also close -- on the same desk with the Fast Etherswitch. Total patch cable length is about 10 feet.

Using the '-i' option slows things down to an impossible crawl. Increasing the wait time interval to '-w 30' changes nothing. The transfer still fails at exactly the same place in two attempts. This is too precise to be explained by ring buffer exhaustion.

Despite the error message we're seeing, I'm now skeptical of the buggy driver hypothesis. I now suspect we're seeing the end result of something else entirely, but haven't a clue what it might be. What's so special about a repeating file size of 9,883,033,600 bytes on a disk with 142,794 logical 8,225,280 byte cylinders??

I re-ran the transfer sequence after first slowing the source NIC down to 10Mbps. This time the transfer took more than 12 hours rather than 22 minutes, but the listener again aborted when the received file size reached 9,883,033,600 bytes. There were no eth0:... errors in the log file. The exact error reported on the source end was:

    # dd if=/dev/hda5 bs=2048 | nc 192.168.1.2 30000 -w 3
    dd: reading `/dev/hda5': Input/output error
    4825700+0 records in
    4825700+0 records out

After getting the same result after slowing the transfer rate by 90%, I think it's unlikely that we're seeing the result of overdriving the listener's NIC or the large drive array. It must be something that's speed-independent, like an overflowing counter. In nc? In dd? Where?

Strange, as I was always able to complete my transfers. Hmmm...
Have you tried this?

    dd if=/dev/hda5 bs=2048 > /dev/null

I'd be curious to know if the results are different. Have you tried the test w/ a different sender?

Sorry to be so tardy getting back to you, John. Your suggestion to pipe to /dev/null ultimately led me to the answer: A consecutive string of 160 bad sectors had formed on the source drive. It was killing every dd transfer with an input/output error (at the same place, of course). A friend turned me on to dd_rescue, which has the ability to survive those errors and keep imaging. (Red Hat should consider adding dd_rescue to the RHEL bag of tools.)

Certainly those "eth0: memory shortage" errors still need to be tracked down and fixed, but in the end they turned out to be a red herring and not the cause of this problem. I'm very sorry to have wasted your time on this wild goose chase.

Thanks for the update, Robert. I'll close this as NOTABUG, since John already explained the eth0 messages in comment #7.
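
The sizes quoted in this thread can be cross-checked with plain shell arithmetic. The following is only a minimal sketch: it assumes a shell whose `$(( ))` arithmetic is 64-bit (bash, or sh on a 64-bit system) and 512-byte sectors as shown in the fdisk listing. The function names are hypothetical, introduced purely for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for checking the figures in this report.

# bytes_copied RECORDS BLOCKSIZE
# Bytes implied by a "RECORDS+0 records out" line from dd run with bs=BLOCKSIZE.
bytes_copied() {
    echo $(( $1 * $2 ))
}

# partition_bytes START END
# Partition size from fdisk's inclusive start/end sector numbers,
# assuming 512-byte sectors.
partition_bytes() {
    echo $(( ($2 - $1 + 1) * 512 ))
}

# first_unread_sector RECORDS BLOCKSIZE
# Offset, in 512-byte sectors from the start of the partition,
# of the first sector dd did not copy.
first_unread_sector() {
    echo $(( $1 * $2 / 512 ))
}

bytes_copied 4825700 2048            # dd on /dev/hda5 before the I/O error
partition_bytes 22785903 93759119    # /dev/hda5 per the fdisk listing
first_unread_sector 4825700 2048     # where the unreadable region begins
```

The first result, 9883033600, equals the 9,883,033,600-byte size at which the listener's file stopped growing, which is consistent with dd ending the stream at the read error and not with a limit on the network or listener side. The second result, 36338287104, is what the fdisk sector range gives for /dev/hda5; the 36338319360 figure quoted earlier corresponds to the sector range of the /dev/hda4 extended partition.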