405521 – forcedeth hangs on loading

Bug 405521 - forcedeth hangs on loading

Summary: forcedeth hangs on loading

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.1
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Andy Gospodarek
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2007-11-30 07:26 UTC by Vasily Averin
Modified:	2014-06-29 22:59 UTC (History)
CC List:	2 users (show)
Fixed In Version:	RHBA-2008-0314
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 15:02:36 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0314	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5.2	2008-05-20 18:43:34 UTC

Description Vasily Averin 2007-11-30 07:26:56 UTC

While testing OpenVZ kernel based on RHEL5.1 kernel 2.6.18-53.el5 we have
observed that node loads forcedeth driver too long:

ov 28 14:04:15 ts49 ata4: SATA max UDMA/133 cmd 0x0000000000010960 ctl
0x0000000000010b62 bmdma 0x00000000000134a8 irq 225
Nov 28 14:04:16 ts49 ata3: SATA link down (SStatus 0 SControl 300)
Nov 28 14:04:16 ts49 ata4: SATA link down (SStatus 0 SControl 300)
Nov 28 14:04:16 ts49 ACPI: PCI Interrupt Link [LSI2] enabled at IRQ 20
Nov 28 14:04:16 ts49 GSI 20 sharing vector 0xE9 and IRQ 20
Nov 28 14:04:16 ts49 ACPI: PCI Interrupt 0000:00:05.2[C] -> Link [LSI2] -> GSI
20 (level, high) -> IRQ 233
Nov 28 14:04:16 ts49 PCI: Setting latency timer of device 0000:00:05.2 to 64
Nov 28 14:04:16 ts49 scsi4 : sata_nv
Nov 28 14:04:16 ts49 scsi5 : sata_nv
Nov 28 14:04:16 ts49 ata5: SATA max UDMA/133 cmd 0x0000000000013800 ctl
0x00000000000134f6 bmdma 0x00000000000134b0 irq 233
Nov 28 14:04:16 ts49 ata6: SATA max UDMA/133 cmd 0x00000000000134f8 ctl
0x00000000000134f2 bmdma 0x00000000000134b8 irq 233
Nov 28 14:04:16 ts49 ata5: SATA link down (SStatus 0 SControl 300)
Nov 28 14:04:17 ts49 ata6: SATA link down (SStatus 0 SControl 300)
Nov 28 14:04:17 ts49 device-mapper: ioctl: 4.11.0-ioctl (2006-09-14)
initialised: dm-devel
Nov 28 14:04:19 ts49 kjournald starting.  Commit interval 5 seconds
Nov 28 14:04:19 ts49 EXT3-fs: mounted filesystem with ordered data mode.
...
Nov 28 19:34:47 ts49 modprobe      D ffff8101ff772120     0  4101   4098       
             (NOTLB)
Nov 28 19:34:47 ts49  ffff8101fe7e1c28 0000000000000086 0000000000000246
00000010000200d0
Nov 28 19:34:47 ts49  ffff8101ff772120 ffff8103ff93c930 000012229b7b3bd7
00002bcc55b0dda7
Nov 28 19:34:47 ts49  ffff8101ff772328 ffff8101ff93a000 ffffffff802de500
ffff8103ff93c930
Nov 28 19:34:47 ts49 Call Trace:
Nov 28 19:34:47 ts49  [<ffffffff80062cc2>] schedule_timeout+0x8a/0xad
Nov 28 19:34:47 ts49  [<ffffffff80091bf9>] process_timeout+0x0/0x5
Nov 28 19:34:47 ts49  [<ffffffff800925c0>] msleep+0x21/0x2c
Nov 28 19:34:47 ts49  [<ffffffff8819fc8f>] :forcedeth:nv_probe+0x873/0xd4d
Nov 28 19:34:47 ts49  [<ffffffff801427f8>] pci_device_probe+0x100/0x180
Nov 28 19:34:47 ts49  [<ffffffff8019ff2c>] driver_probe_device+0x52/0xaa
Nov 28 19:34:47 ts49  [<ffffffff801a005b>] __driver_attach+0x65/0xb6
Nov 28 19:34:47 ts49  [<ffffffff8019fff6>] __driver_attach+0x0/0xb6
Nov 28 19:34:47 ts49  [<ffffffff8019f96d>] bus_for_each_dev+0x43/0x6e
Nov 28 19:34:47 ts49  [<ffffffff8019f5b3>] bus_add_driver+0x7e/0x130
Nov 28 19:34:47 ts49  [<ffffffff801429dc>] __pci_register_driver+0x57/0x7e
Nov 28 19:34:47 ts49  [<ffffffff800aa815>] sys_init_module+0x16aa/0x185f
Nov 28 19:34:47 ts49  [<ffffffff8005e166>] system_call+0x7e/0x83
...
Nov 28 21:02:02 ts49 eth0: forcedeth.c: subsystem: 010f1:2912 bound to 0000:00:08.0
Nov 28 21:02:02 ts49 ACPI: PCI Interrupt Link [LMA2] enabled at IRQ 18
Nov 28 21:02:02 ts49 GSI 22 sharing vector 0x3A and IRQ 22
Nov 28 21:02:02 ts49 ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LMA2] -> GSI
18 (level, high) -> IRQ 58

nv_probe() was called under semaphore and blocks lots of other modprobe's from
loading other drivers.

linux-2.6-net-forcedeth-update-to-driver-version-0-60.patch patch from RHEL5.1
added the following piece of code to nv_probe():

+       if (id->driver_data & DEV_HAS_MGMT_UNIT) {
+               /* management unit running on the mac? */
+               if (readl(base + NvRegTransmitterControl) &
NVREG_XMITCTL_SYNC_PHY_INIT) {
+                       np->mac_in_use = readl(base + NvRegTransmitterControl) &
NVREG_XMITCTL_MGMT_ST;
+                       dprintk(KERN_INFO "%s: mgmt unit is running. mac in use
%x.\n", pci_name(pci_dev), np->mac_in_use);
+                       for (i = 0; i < 5000; i++) {
+                               msleep(1);
+                               if (nv_mgmt_acquire_sema(dev)) {
+                                       /* management unit setup the phy already? */
+                                       if ((readl(base +
NvRegTransmitterControl) & NVREG_XMITCTL_SYNC_MASK) ==
+                                           NVREG_XMITCTL_SYNC_PHY_INIT) {
+                                               /* phy is inited by mgmt unit */
+                                               phyinitialized = 1;
+                                               dprintk(KERN_INFO "%s: Phy
already initialized by mgmt unit.\n", pci_name(pci_dev));
+                                       } else {
+                                               /* we need to init the phy */
+                                       }
+                                       break;
+                               }
+                       }
+               }
+       }

obviously, this loops 5000 times and calls nv_mgmt_acquire_sema() inside, which
in the worst case does msleep(500) 10 times.
So this loop can last 5000*10*0.5sec = 25000sec = 6.94 hours
This is exactly what we face in the bug: boot hanged at 14:04:19, continued at
21:02:02, i.e. it took ~25063 seconds.

As a workaround we have changed number of loops from 5000 to 5.

Comment 1 Vasily Averin 2007-12-03 05:09:56 UTC

Patch from vendor can be found here:
http://bugzilla.kernel.org/show_bug.cgi?id=9308

Comment 2 Andy Gospodarek 2007-12-06 15:33:34 UTC

Yeah, I saw that float by and found it interesting.  We can pull the driver
workaround.

Were you able to resolve your boot delay with newer firmware?

Comment 4 RHEL Program Management 2007-12-06 15:45:26 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 5 Vasily Averin 2007-12-07 08:11:42 UTC

(In reply to comment #2)
> Were you able to resolve your boot delay with newer firmware?

Sorry, I don't understand You. Could You please explain wnat do You mean under
"new firmware"? motherboard BIOS update?

Comment 6 Andy Gospodarek 2007-12-07 13:48:49 UTC

Yes, I 'new firmware' was referring to a BIOS update.  It seems others
[upstream] have reported that an updated BIOS resolved the issue that made this
driver bug apparent.

Comment 8 Andy Gospodarek 2007-12-18 19:54:54 UTC

My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.

Comment 9 Vasily Averin 2007-12-20 06:32:18 UTC

Andy, sorry

to work-around original issue we have changed some BIOS settings. I've resetored
it back, however unfortunately now I cannot reprouce this issue even on original
kernel :(
Your testkernel works fine too.

Comment 11 Don Zickus 2008-01-08 20:46:49 UTC

in 2.6.18-64.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 errata-xmlrpc 2008-05-21 15:02:36 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html

Note You need to log in before you can comment on or make changes to this bug.