While testing OpenVZ kernel based on RHEL5.1 kernel 2.6.18-53.el5 we have observed that node loads forcedeth driver too long: ov 28 14:04:15 ts49 ata4: SATA max UDMA/133 cmd 0x0000000000010960 ctl 0x0000000000010b62 bmdma 0x00000000000134a8 irq 225 Nov 28 14:04:16 ts49 ata3: SATA link down (SStatus 0 SControl 300) Nov 28 14:04:16 ts49 ata4: SATA link down (SStatus 0 SControl 300) Nov 28 14:04:16 ts49 ACPI: PCI Interrupt Link [LSI2] enabled at IRQ 20 Nov 28 14:04:16 ts49 GSI 20 sharing vector 0xE9 and IRQ 20 Nov 28 14:04:16 ts49 ACPI: PCI Interrupt 0000:00:05.2[C] -> Link [LSI2] -> GSI 20 (level, high) -> IRQ 233 Nov 28 14:04:16 ts49 PCI: Setting latency timer of device 0000:00:05.2 to 64 Nov 28 14:04:16 ts49 scsi4 : sata_nv Nov 28 14:04:16 ts49 scsi5 : sata_nv Nov 28 14:04:16 ts49 ata5: SATA max UDMA/133 cmd 0x0000000000013800 ctl 0x00000000000134f6 bmdma 0x00000000000134b0 irq 233 Nov 28 14:04:16 ts49 ata6: SATA max UDMA/133 cmd 0x00000000000134f8 ctl 0x00000000000134f2 bmdma 0x00000000000134b8 irq 233 Nov 28 14:04:16 ts49 ata5: SATA link down (SStatus 0 SControl 300) Nov 28 14:04:17 ts49 ata6: SATA link down (SStatus 0 SControl 300) Nov 28 14:04:17 ts49 device-mapper: ioctl: 4.11.0-ioctl (2006-09-14) initialised: dm-devel Nov 28 14:04:19 ts49 kjournald starting. Commit interval 5 seconds Nov 28 14:04:19 ts49 EXT3-fs: mounted filesystem with ordered data mode. ... Nov 28 19:34:47 ts49 modprobe D ffff8101ff772120 0 4101 4098 (NOTLB) Nov 28 19:34:47 ts49 ffff8101fe7e1c28 0000000000000086 0000000000000246 00000010000200d0 Nov 28 19:34:47 ts49 ffff8101ff772120 ffff8103ff93c930 000012229b7b3bd7 00002bcc55b0dda7 Nov 28 19:34:47 ts49 ffff8101ff772328 ffff8101ff93a000 ffffffff802de500 ffff8103ff93c930 Nov 28 19:34:47 ts49 Call Trace: Nov 28 19:34:47 ts49 [<ffffffff80062cc2>] schedule_timeout+0x8a/0xad Nov 28 19:34:47 ts49 [<ffffffff80091bf9>] process_timeout+0x0/0x5 Nov 28 19:34:47 ts49 [<ffffffff800925c0>] msleep+0x21/0x2c Nov 28 19:34:47 ts49 [<ffffffff8819fc8f>] :forcedeth:nv_probe+0x873/0xd4d Nov 28 19:34:47 ts49 [<ffffffff801427f8>] pci_device_probe+0x100/0x180 Nov 28 19:34:47 ts49 [<ffffffff8019ff2c>] driver_probe_device+0x52/0xaa Nov 28 19:34:47 ts49 [<ffffffff801a005b>] __driver_attach+0x65/0xb6 Nov 28 19:34:47 ts49 [<ffffffff8019fff6>] __driver_attach+0x0/0xb6 Nov 28 19:34:47 ts49 [<ffffffff8019f96d>] bus_for_each_dev+0x43/0x6e Nov 28 19:34:47 ts49 [<ffffffff8019f5b3>] bus_add_driver+0x7e/0x130 Nov 28 19:34:47 ts49 [<ffffffff801429dc>] __pci_register_driver+0x57/0x7e Nov 28 19:34:47 ts49 [<ffffffff800aa815>] sys_init_module+0x16aa/0x185f Nov 28 19:34:47 ts49 [<ffffffff8005e166>] system_call+0x7e/0x83 ... Nov 28 21:02:02 ts49 eth0: forcedeth.c: subsystem: 010f1:2912 bound to 0000:00:08.0 Nov 28 21:02:02 ts49 ACPI: PCI Interrupt Link [LMA2] enabled at IRQ 18 Nov 28 21:02:02 ts49 GSI 22 sharing vector 0x3A and IRQ 22 Nov 28 21:02:02 ts49 ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LMA2] -> GSI 18 (level, high) -> IRQ 58 nv_probe() was called under semaphore and blocks lots of other modprobe's from loading other drivers. linux-2.6-net-forcedeth-update-to-driver-version-0-60.patch patch from RHEL5.1 added the following piece of code to nv_probe(): + if (id->driver_data & DEV_HAS_MGMT_UNIT) { + /* management unit running on the mac? */ + if (readl(base + NvRegTransmitterControl) & NVREG_XMITCTL_SYNC_PHY_INIT) { + np->mac_in_use = readl(base + NvRegTransmitterControl) & NVREG_XMITCTL_MGMT_ST; + dprintk(KERN_INFO "%s: mgmt unit is running. mac in use %x.\n", pci_name(pci_dev), np->mac_in_use); + for (i = 0; i < 5000; i++) { + msleep(1); + if (nv_mgmt_acquire_sema(dev)) { + /* management unit setup the phy already? */ + if ((readl(base + NvRegTransmitterControl) & NVREG_XMITCTL_SYNC_MASK) == + NVREG_XMITCTL_SYNC_PHY_INIT) { + /* phy is inited by mgmt unit */ + phyinitialized = 1; + dprintk(KERN_INFO "%s: Phy already initialized by mgmt unit.\n", pci_name(pci_dev)); + } else { + /* we need to init the phy */ + } + break; + } + } + } + } obviously, this loops 5000 times and calls nv_mgmt_acquire_sema() inside, which in the worst case does msleep(500) 10 times. So this loop can last 5000*10*0.5sec = 25000sec = 6.94 hours This is exactly what we face in the bug: boot hanged at 14:04:19, continued at 21:02:02, i.e. it took ~25063 seconds. As a workaround we have changed number of loops from 5000 to 5.
Patch from vendor can be found here: http://bugzilla.kernel.org/show_bug.cgi?id=9308
Yeah, I saw that float by and found it interesting. We can pull the driver workaround. Were you able to resolve your boot delay with newer firmware?
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #2) > Were you able to resolve your boot delay with newer firmware? Sorry, I don't understand You. Could You please explain wnat do You mean under "new firmware"? motherboard BIOS update?
Yes, I 'new firmware' was referring to a BIOS update. It seems others [upstream] have reported that an updated BIOS resolved the issue that made this driver bug apparent.
My test kernels have been updated to include a patch for this bugzilla. http://people.redhat.com/agospoda/#rhel5 Please test them and report back your results.
Andy, sorry to work-around original issue we have changed some BIOS settings. I've resetored it back, however unfortunately now I cannot reprouce this issue even on original kernel :( Your testkernel works fine too.
in 2.6.18-64.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html