Bug 571638

Summary: 2.6.32.9-67: big file transfers stall and break network
Product: [Fedora] Fedora
Reporter: C Sand <conradsand.fb>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: medium
Version: 12
CC: anton, dougsland, gansalmon, itamar, jonathan, kernel-maint, lukasz, mark, mcarlson
Hardware: x86_64   
OS: Linux   
Fixed In Version: kernel-2.6.32.10-90.fc12
Doc Type: Bug Fix
Last Closed: 2010-03-30 02:24:35 UTC
Attachments:
  dmesg (from working 2.6.31.12-174.2.22.fc12.x86_64) (flags: none)

Description C Sand 2010-03-09 05:15:15 UTC
Description of problem:

After upgrading to kernel 2.6.32.9-67.fc12.x86_64, transfers of large files over SSH and SMB stall, and network access then breaks entirely.

Version-Release number of selected component (if applicable):
2.6.32.9-67.fc12

How reproducible:
Copy a large file (> 900 MB) using SSH or SMB (mounted either directly via "mount" or indirectly via "gvfs / nautilus").  The copy stalls and the network becomes unusable (e.g. can't ssh to any other host).

Actual results:
Copy stalls after 3 MB transferred. Network dead.


Additional info:
Previous kernel (2.6.31.12-174.2.22.fc12.x86_64) works fine.

Comment 1 C Sand 2010-03-09 05:28:22 UTC
Output of lspci (from working 2.6.31.12-174.2.22.fc12.x86_64)

00:00.0 Host bridge: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub (rev 0c)
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)
00:02.1 Display controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 (rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HEM (ICH8M) LPC Interface Controller (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) IDE Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
03:01.0 FireWire (IEEE 1394): Ricoh Co Ltd R5C832 IEEE 1394 Controller (rev 05)
03:01.1 SD Host controller: Ricoh Co Ltd R5C822 SD/SDIO/MMC/MS/MSPro Host Adapter (rev 22)
03:01.2 System peripheral: Ricoh Co Ltd R5C592 Memory Stick Bus Host Adapter (rev 12)
03:01.3 System peripheral: Ricoh Co Ltd xD-Picture Card Controller (rev 12)
09:00.0 Ethernet controller: Broadcom Corporation NetLink BCM5906M Fast Ethernet PCI Express (rev 02)
0c:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN [Kedron] Network Connection (rev 61)

Comment 2 C Sand 2010-03-09 05:30:27 UTC
Created attachment 398680
dmesg (from working 2.6.31.12-174.2.22.fc12.x86_64)

Comment 3 Mark Borgerding 2010-03-10 20:53:07 UTC
I have a similar problem on my fc12 x86 laptop, which has the same BCM5906M (also at address 09:00.0).

The problem is observed running the 2.6.32 kernel (tg3 driver 3.102).  No problems if I boot into kernel 2.6.31 (tg3 3.99).

When I try to use a VNC server on the laptop through SSH, the laptop's network connection becomes unusable for existing and new TCP traffic. The bug does not occur over the wireless connection or with the older kernel version.

After it fails, the socket has a very full send-Q reported by netstat:
Proto Recv-Q Send-Q Local Address               Foreign Address             State       PID/Program name
...
tcp        0 713168 192.168.1.183:22            192.168.1.100:44203         ESTABLISHED 2560/sshd: markb [p
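A minimal sketch of how the stuck socket could be spotted automatically, assuming netstat output in the column layout shown above. The function names and the 100,000-byte threshold are hypothetical choices for illustration, not part of any tool.

```python
def parse_netstat_line(line):
    """Parse one TCP line of netstat output into a dict.

    Assumed columns: Proto Recv-Q Send-Q Local Foreign State [PID/Program].
    """
    fields = line.split()
    return {
        "proto": fields[0],
        "recv_q": int(fields[1]),
        "send_q": int(fields[2]),
        "local": fields[3],
        "foreign": fields[4],
        "state": fields[5],
    }


def stalled_sockets(lines, threshold=100_000):
    """Return TCP entries whose Send-Q is at or above the threshold (bytes).

    The threshold is an arbitrary illustrative value.
    """
    result = []
    for line in lines:
        if not line.startswith("tcp"):
            continue  # skip the header row and non-TCP lines
        entry = parse_netstat_line(line)
        if entry["send_q"] >= threshold:
            result.append(entry)
    return result


# The line reported above, used as sample input.
sample = ("tcp        0 713168 192.168.1.183:22            "
          "192.168.1.100:44203         ESTABLISHED 2560/sshd: markb [p")
header = "Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name"

for entry in stalled_sockets([header, sample]):
    print(entry["local"], "->", entry["foreign"], "Send-Q:", entry["send_q"])
```

A large and non-draining Send-Q on an ESTABLISHED socket suggests that data is queued locally but not leaving the interface.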

What follows is a diff between the dmesg output of the system booted with 2.6.31 (works) and 2.6.32 (broken).  Note the NETDEV WATCHDOG information at the end.

-Linux version 2.6.31.12-174.2.22.fc12.i686.PAE (mockbuild.fedoraproject.org) (gcc version 4.4.3 20100127 (Red Hat 4.4.3-4) (GCC) ) #1 SMP Fri Feb 19 19:10:04 UTC 2010
+Linux version 2.6.32.9-67.fc12.i686.PAE (mockbuild.fedoraproject.org) (gcc version 4.4.3 20100127 (Red Hat 4.4.3-4) (GCC) ) #1 SMP Sat Feb 27 09:42:55 UTC 2010

@@ -60,7 +60,6 @@
 reg 2, base: 3576MB, range: 8MB, type UC
 reg 3, base: 3584MB, range: 512MB, type UC
 reg 4, base: 4GB, range: 4GB, type WB
-x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
 e820 update range: 00000000df700000 - 0000000100000000 (usable) ==> (reserved)
 initial memory mapped : 0 - 01000000
 init_memory_mapping: 0000000000000000-00000000375fe000
...
 NR_CPUS:32 nr_cpumask_bits:32 nr_cpu_ids:2 nr_node_ids:1
-PERCPU: Embedded 15 pages at c3a9b000, static data 37788 bytes
+PERCPU: Embedded 15 pages/cpu @c3c00000 s37816 r0 d23624 u1048576
+pcpu-alloc: s37816 r0 d23624 u1048576 alloc=1*2097152
+pcpu-alloc: [0] 0 1
 Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 1036811
 Kernel command line: ro root=/dev/mapper/vg_tukey-lv_root  LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us rhgb quiet
-PID hash table entries: 4096 (order: 12, 16384 bytes)
+PID hash table entries: 4096 (order: 2, 16384 bytes)
 Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
 Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
 Enabling fast FPU save and restore... done.
...
 sizeof(page)=32 bytes
 sizeof(inode)=352 bytes
 sizeof(dentry)=132 bytes
-sizeof(ext3inode)=500 bytes
+sizeof(ext3inode)=508 bytes
 sizeof(buffer_head)=56 bytes
-sizeof(skbuff)=192 bytes
+sizeof(skbuff)=184 bytes
 sizeof(task_struct)=3256 bytes
...
 hpet0: 3 comparators, 64-bit 14.318180 MHz counter
+Switching to clocksource tsc
 pnp: PnP ACPI init
 ACPI: bus type pnp registered
 pnp 00:09: io resource (0x1000-0x1005) overlaps 0000:00:1f.0 BAR 13 (0x1000-0x107f), disabling
...
...
 acpiphp: Slot [1] registered
+pci-stub: invalid id string ""
 ACPI: AC Adapter [AC] (on-line)
+Switching to clocksource hpet
 processor LNXCPU:01: registered as cooling_device1
...
 ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 3 ports 3 Gbps 0x5 impl SATA mode
-ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems
+ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ccc ems
 ahci 0000:00:1f.2: setting latency timer to 64
...
 usb 2-6: configuration #1 chosen from 1 choice
-Clocksource tsc unstable (delta = -168072787 ns)
 Synaptics Touchpad, model: 1, fw: 6.3, id: 0x1c0b1, caps: 0xa04751/0xa00000
...
-tg3.c:v3.99 (April 20, 2009)
+ACPI: WMI: Mapper loaded
+tg3.c:v3.102 (September 1, 2009)
 tg3 0000:09:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
 tg3 0000:09:00.0: setting latency timer to 64
+cfg80211: Calling CRDA to update world regulatory domain
 dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.2)
-tg3 0000:09:00.0: PME# disabled
+input: Dell WMI hotkeys as /devices/virtual/input/input8
...
 RPC: Registered udp transport module.
 RPC: Registered tcp transport module.
-tg3 0000:09:00.0: PME# disabled
+RPC: Registered tcp NFSv4.1 backchannel transport module.
 tg3 0000:09:00.0: irq 31 for MSI/MSI-X
 ADDRCONF(NETDEV_UP): eth0: link is not ready
 Bridge firewalling registered
+------------[ cut here ]------------
+WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xc6/0x12d()
+Hardware name: XPS M1330
+NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
+Modules linked in: fuse ipt_MASQUERADE iptable_nat nf_nat bridge stp llc sunrpc cpufreq_ondemand acpi_cpufreq xt_physdev ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm uinput snd_hda_codec_intelhdmi snd_hda_codec_idt arc4 ecb snd_hda_intel snd_hda_codec snd_hwdep iwl3945 snd_seq snd_seq_device iwlcore uvcvideo sdhci_pci snd_pcm sdhci mac80211 videodev dell_laptop snd_timer dell_wmi v4l1_compat dcdbas mmc_core tg3 wmi cfg80211 iTCO_wdt snd soundcore snd_page_alloc joydev rfkill iTCO_vendor_support i2c_i801 aes_i586 aes_generic xts gf128mul dm_crypt dm_multipath firewire_ohci firewire_core crc_itu_t i915 drm_kms_helper drm i2c_algo_bit i2c_core video output [last unloaded: microcode]
+Pid: 0, comm: swapper Not tainted 2.6.32.9-67.fc12.i686.PAE #1
+Call Trace:
+ [<c0441121>] warn_slowpath_common+0x6a/0x81
+ [<c072b8d7>] ? dev_watchdog+0xc6/0x12d
+ [<c0441176>] warn_slowpath_fmt+0x29/0x2c
+ [<c072b8d7>] dev_watchdog+0xc6/0x12d
+ [<c0457bf3>] ? insert_work+0x75/0x7e
+ [<c0457dfd>] ? __queue_work+0x2f/0x34
+ [<c044e174>] run_timer_softirq+0x16d/0x1f0
+ [<c072b811>] ? dev_watchdog+0x0/0x12d
+ [<c044780e>] __do_softirq+0xb1/0x157
+ [<c04478ea>] do_softirq+0x36/0x41
+ [<c04479dd>] irq_exit+0x2e/0x61
+ [<c040a9e9>] do_IRQ+0x86/0x9a
+ [<c0409750>] common_interrupt+0x30/0x38
+ [<c045007b>] ? rm_from_queue_full+0x1/0x6b
+ [<c062108b>] ? acpi_idle_enter_bm+0x24d/0x27e
+ [<c06f6be7>] cpuidle_idle_call+0x72/0xc3
+ [<c04081f2>] cpu_idle+0x96/0xb0
+ [<c07a0631>] start_secondary+0x1f5/0x233
+---[ end trace 198efbfe2e0a7910 ]---
+tg3: eth0: transmit timed out, resetting
+tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
+tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
+tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
+tg3: tg3_stop_block timed out, ofs=1400 enable_bit=2
+tg3: tg3_stop_block timed out, ofs=c00 enable_bit=2
+tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
 tg3: eth0: Link is down.
 tg3: eth0: Link is up at 100 Mbps, full duplex.
 tg3: eth0: Flow control is on for TX and on for RX.
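For anyone comparing logs across machines, the following is a small sketch that pulls the NETDEV WATCHDOG lines out of dmesg text. It assumes the message format seen in the diff above; the function name is hypothetical.

```python
import re

# Matches lines such as:
#   NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_timeouts(dmesg_text):
    """Return a list of (interface, driver, queue) tuples found in the text."""
    hits = []
    for line in dmesg_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits


# Two lines copied from the diff above, including the leading "+" marker.
sample = """+WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xc6/0x12d()
+NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
"""

print(find_watchdog_timeouts(sample))
```

The driver name in parentheses is worth checking first, since a watchdog timeout from a different driver is likely a different problem.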

Comment 4 Mark Borgerding 2010-03-10 21:09:15 UTC
This may be related to the scatter-gather code (possibly Bug 527209). Running
  ethtool -K eth0 sg off
makes the system stable.
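To confirm that the workaround took effect, the offload state can be read back with "ethtool -k eth0" (lowercase k). The sketch below parses that kind of output. The sample text is an illustrative assumption about the output format, not a capture from the affected machine, and the function name is hypothetical.

```python
def offload_settings(ethtool_output):
    """Map each offload feature name to True (on) or False (off)."""
    settings = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value:
            # Use only the first word, so a suffix such as "[fixed]" is ignored.
            settings[name.strip()] = value.split()[0] == "on"
    return settings


# Assumed sample of `ethtool -k eth0` output after turning scatter-gather off.
sample = """Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp-segmentation-offload: off
"""

state = offload_settings(sample)
print("scatter-gather enabled:", state["scatter-gather"])
```

The setting made with "ethtool -K" does not persist across reboots, so it is a stopgap until a fixed kernel is installed.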

Comment 5 Matt Carlson 2010-03-11 22:22:07 UTC
I think an upstream kernel change allowed scatter/gather fragments to be sized less than or equal to 8 bytes.  This exposed a 5906 chip bug.  Commit 92c6b8d16a36df3f28b2537bed2a56491fb08f11 fixes the problem.  This commit was integrated into the 3.103 version of the tg3 driver.  I'm pretty sure this fix was integrated into Red Hat's 2.6.31 kernel, but it might have been missed in the 2.6.32 migration.
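As a rough illustration of the condition described above, the sketch below flags the fragments of a packet that are 8 bytes or smaller. This is not the tg3 driver's code and does not reproduce the referenced commit; the names and the example fragment lengths are hypothetical.

```python
# Size limit taken from the comment above: fragments "less than or equal to 8 bytes".
SHORT_FRAGMENT_LIMIT = 8


def short_fragments(fragment_lengths, limit=SHORT_FRAGMENT_LIMIT):
    """Return the indexes of fragments whose length is at or below the limit."""
    return [i for i, length in enumerate(fragment_lengths) if length <= limit]


def needs_workaround(fragment_lengths):
    """True if any fragment is short enough to hit the described chip bug."""
    return bool(short_fragments(fragment_lengths))


# Hypothetical fragment lengths (in bytes) for one scatter/gather packet.
packet = [1448, 8, 66, 4]
print(short_fragments(packet), needs_workaround(packet))
```

A packet with no such fragment would take the normal transmit path, which is consistent with the hang only appearing under certain traffic patterns.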

Comment 6 Chuck Ebbert 2010-03-14 06:36:25 UTC
Fixed in 2.6.32.10-74.rc1

Comment 7 Fedora Update System 2010-03-23 14:56:18 UTC
kernel-2.6.32.10-90.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/kernel-2.6.32.10-90.fc12

Comment 8 Fedora Update System 2010-03-24 23:40:27 UTC
kernel-2.6.32.10-90.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
 su -c 'yum --enablerepo=updates-testing update kernel'
You can provide feedback for this update here: http://admin.fedoraproject.org/updates/kernel-2.6.32.10-90.fc12

Comment 9 C Sand 2010-03-25 03:21:36 UTC
(In reply to comment #8)

kernel-2.6.32.10-90.fc12.x86_64 seems to be working.  I've done a few large file transfers and the network appears to be functioning without problems on my machine.

As per Bug 527209, I presume the fixes in -90 will be part of the next upstream release (2.6.32.11)?

Comment 10 Matt Carlson 2010-03-29 18:34:54 UTC
Yes.  The fix should already be submitted upstream.

Comment 11 Fedora Update System 2010-03-30 02:23:57 UTC
kernel-2.6.32.10-90.fc12 has been pushed to the Fedora 12 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Łukasz Trąbiński 2010-04-07 12:15:40 UTC
Hello,

I'm using 2.6.32.10-90.fc12.x86_64 on an x86_64 machine. While copying 8 TB of data
via NFS, we get an oops like the one below.
We tried "ethtool -K eth0 sg off", but it did not resolve the problem.


------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164()
Hardware name: GA-MA790FX-DQ6
NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
Modules linked in: nfs lockd fscache nfs_acl auth_rpcgss sit tunnel4 sunrpc cpufreq_ondemand powernow_k8 freq_table ipv6 microcode uinput snd_hda_codec_atihdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec edac_core ppdev snd_hwdep r8169 snd_seq edac_mce_amd parport_pc mii i2c_piix4 snd_seq_device parport snd_pcm snd_timer snd soundcore snd_page_alloc raid10 firewire_ohci ata_generic firewire_core pata_acpi dm_multipath crc_itu_t pata_atiixp pata_jmicron radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32.10-90.fc12.x86_64 #1
Call Trace:
<IRQ>  [<ffffffff81056350>] warn_slowpath_common+0x7c/0x94
[<ffffffff810563bf>] warn_slowpath_fmt+0x41/0x43
[<ffffffff813c610b>] ? netif_tx_lock+0x44/0x6d
[<ffffffff813c6275>] dev_watchdog+0xf3/0x164
[<ffffffff8107932d>] ? sched_clock_local+0x1c/0x82
[<ffffffff81079459>] ? sched_clock_cpu+0xc6/0xd1
[<ffffffff810650bc>] run_timer_softirq+0x1c4/0x268
[<ffffffff81026b9e>] ? apic_write+0x16/0x18
[<ffffffff8105d96c>] __do_softirq+0xe5/0x1a9
[<ffffffff810804d6>] ? tick_program_event+0x2a/0x2c
[<ffffffff81012e6c>] call_softirq+0x1c/0x30
[<ffffffff810143ea>] do_softirq+0x46/0x86
[<ffffffff8105d7aa>] irq_exit+0x3b/0x7d
[<ffffffff81459f4a>] smp_apic_timer_interrupt+0x86/0x94
[<ffffffff81012833>] apic_timer_interrupt+0x13/0x20
<EOI>  [<ffffffff8103020d>] ? native_safe_halt+0xb/0xd
[<ffffffff81018f37>] ? default_idle+0x36/0x53
[<ffffffff8101904f>] ? c1e_idle+0xfb/0x102
[<ffffffff81010cc8>] ? cpu_idle+0xaa/0xe4
[<ffffffff8143ef07>] ? rest_init+0x6b/0x6d
[<ffffffff81817de2>] ? start_kernel+0x3f4/0x3ff
[<ffffffff818172c1>] ? x86_64_start_reservations+0xac/0xb0
[<ffffffff818173bd>] ? x86_64_start_kernel+0xf8/0x107

[root@beabourg ~]# lspci 
00:00.0 Host bridge: ATI Technologies Inc RD790 Northbridge only dual slot PCI-e_GFX and HT3 K8 part
00:02.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (external gfx0 port A)
00:06.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port C)
00:07.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port D)
00:09.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port E)
00:0a.0 PCI bridge: ATI Technologies Inc RD790 PCI to PCI bridge (PCI express gpp port F)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA)
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:00.0 VGA compatible controller: ATI Technologies Inc RV610 [Radeon HD 2400 XT]
01:00.1 Audio device: ATI Technologies Inc RV610 audio device [Radeon HD 2400 PRO]
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 01)
04:00.0 SATA controller: JMicron Technologies, Inc. 20360/20363 Serial ATA Controller (rev 02)
04:00.1 IDE interface: JMicron Technologies, Inc. 20360/20363 Serial ATA Controller (rev 02)
05:00.0 SATA controller: JMicron Technologies, Inc. 20360/20363 Serial ATA Controller (rev 02)
05:00.1 IDE interface: JMicron Technologies, Inc. 20360/20363 Serial ATA Controller (rev 02)
06:0e.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)

Comment 13 Matt Carlson 2010-04-07 16:30:01 UTC
(In reply to comment #12)
> Hello
> 
> I'm using 2.6.32.10-90.fc12.x86_64 on x86_64 machine, during copy 8TB data
> via NFS, we have oops like that:
> We tried use  "ethtool -K eth0 sg off" but it didn't resolve problem

You should probably file a separate bug for this issue.  It doesn't appear to be related.