Bug 498024
Summary: | "CPU#foo stuck for 10s!" when using bonding with use_carrier=0 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Veaceslav Falico <vfalico> |
Component: | kernel | Assignee: | Andy Gospodarek <agospoda> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 5.2 | CC: | anton, fbijlsma, jpirko, nhorman, peterm, tao, vincent.leriche, xiaobao623 |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-08-07 19:31:49 UTC | Type: | --- |
Description
Veaceslav Falico
2009-04-28 14:28:03 UTC
As a first guess, I think commit f0c76d61779b153dbfb955db3f144c62d02173c2 from upstream (http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f0c76d61779b153dbfb955db3f144c62d02173c2) should fix the issue; however, I haven't managed to test it yet.

Hi Veaceslav. Can you please check whether this issue still appears with Andy's upgrade patch posted here: http://post-office.corp.redhat.com/archives/rhkernel-list/2009-April/msg01231.html Thanks

Veaceslav, use_carrier=1 works around this issue because bond_check_dev_link checks whether use_carrier is set and, if so, just calls netif_carrier_ok rather than actually calling down to the driver and checking the hardware. What other applications or commands are running when this happens? Do you also notice anything in the logs indicating that the interface goes down right before this happens? __bond_mii_monitor is called once without rtnl held and, if a change is needed, is called again with it locked. I'm not currently aware of any possible contention over the bnx2 driver's phy_lock and rtnl, but that was the first thing that came to my mind. Did you say you have a longer message with more information? If so, please attach it to the bug. You should also consider testing 5.3 as well as my test kernels here: http://people.redhat.com/agospoda/#rhel5 Thanks!

*** Bug 499109 has been marked as a duplicate of this bug. ***

Veaceslav, were you able to test with the latest RHEL5 kernels from here: http://people.redhat.com/agospoda/#rhel5 or http://people.redhat.com/dzickus/el5 Thanks!

Veaceslav, any feedback on your testing?

Nope :(, the customer with the reproducer is unreachable. From my point of view, this BZ can be closed because of insufficient data...

Thank you!

I also have the same error log.
OS: RHEL 5.2 x86_64
kernel: Linux DZGZ001.shbank.net 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
Hardware: HP DL580G5 (6 core Intel 7450 CPU), 32GB

05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
07:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
08:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
0d:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0d:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)

log messages:

Jun 30 13:18:58 DZGZ001 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [bond0:4167]
Jun 30 13:18:58 DZGZ001 kernel: CPU 15:
Jun 30 13:18:58 DZGZ001 kernel: Modules linked in: lock_dlm gfs2 dlm configfs mptctl mptbase sg ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) autofs4 hidp l2cap bluetooth sunrpc bonding ip_conntrack_ftp ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables dm_mirror dm_log dm_multipath scsi_dh dm_mod video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport e1000e bnx2(U) serio_raw shpchp hpilo pcspkr qla2xxx(FU) qla2xxx_conf(FU) intermodule(U) ata_piix libata cciss(U) sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Jun 30 13:18:58 DZGZ001 kernel: Pid: 4167, comm: bond0 Tainted: GF 2.6.18-128.el5 #1
Jun 30 13:18:58 DZGZ001 kernel: RIP: 0010:[<ffffffff80064cd8>] [<ffffffff80064cd8>] .text.lock.spinlock+0x26/0x30
Jun 30 13:18:58 DZGZ001 kernel: RSP: 0018:ffff810822b27d20 EFLAGS: 00000286
Jun 30 13:18:58 DZGZ001 kernel: RAX: ffff810822b27fd8 RBX: ffff81082df60000 RCX: ffff810822b27d80
Jun 30 13:18:58 DZGZ001 kernel: RDX: 0000000000008948 RSI: ffff810822b27d70 RDI: ffff81082df637c0
Jun 30 13:18:58 DZGZ001 kernel: RBP: ffff810807155480 R08: ffff810822b27d50 R09: ffff81080404eac0
Jun 30 13:18:58 DZGZ001 kernel: R10: ffff8107f4290b40 R11: 0000000000000246 R12: ffff81081d360bc0
Jun 30 13:18:58 DZGZ001 kernel: R13: 0000000000000246 R14: 0000000000000001 R15: 0000000000000000
Jun 30 13:18:58 DZGZ001 kernel: FS: 0000000000000000(0000) GS:ffff81082fcccc40(0000) knlGS:0000000000000000
Jun 30 13:18:58 DZGZ001 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun 30 13:18:58 DZGZ001 kernel: CR2: 00002aaaaacd5000 CR3: 0000000000201000 CR4: 00000000000006e0
Jun 30 13:18:58 DZGZ001 kernel:
Jun 30 13:18:58 DZGZ001 kernel: Call Trace:
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff882c84ee>] :bnx2:bnx2_ioctl+0x69/0xff
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff88500f1f>] :bonding:bond_check_dev_link+0xd3/0x1b9
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff885025c9>] :bonding:__bond_mii_monitor+0x88/0x444
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8850399a>] :bonding:bond_mii_monitor+0x0/0x8c
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff885039c7>] :bonding:bond_mii_monitor+0x2d/0x8c
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8004d139>] run_workqueue+0x94/0xe4
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff800499ba>] worker_thread+0x0/0x122
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff80049aaa>] worker_thread+0xf0/0x122
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8008a461>] default_wake_function+0x0/0xe
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff80032360>] kthread+0xfe/0x132
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff80032262>] kthread+0x0/0x132
Jun 30 13:18:58 DZGZ001 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jun 30 13:18:58 DZGZ001 kernel:

I have two bonds, bnx2 + Intel. I previously had RHEL 5.1 installed, but the system often stopped responding or rebooted, with no error messages in syslog. So I have now updated to RHEL 5.3. But I saw the BUG soft lockup message, and the system still sometimes stops responding or reboots.

York, were you able to test with the latest RHEL5 kernels from here: http://people.redhat.com/agospoda/#rhel5 or http://people.redhat.com/dzickus/el5 Thanks!

I tried without the parameter "use_carrier=1", and the error messages did not appear. So I will add it again to confirm the effect of this parameter, and monitor syslog.

Sorry, I used "use_carrier=0", not 1.
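
For readers following the use_carrier discussion above, here is a small Python toy model of the decision described for bond_check_dev_link. It is only a sketch of the behaviour stated in the comments, not the kernel code: the names `Slave`, `query_driver` and `check_dev_link` are made up for illustration, `carrier_ok` stands in for the result of netif_carrier_ok, and `query_driver` stands in for the ioctl call into the NIC driver (bnx2_ioctl in the trace above).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Slave:
    """Hypothetical stand-in for a bonding slave device."""
    name: str
    carrier_ok: bool                  # models what netif_carrier_ok() would report
    query_driver: Callable[[], bool]  # models the ioctl into the driver


def check_dev_link(slave: Slave, use_carrier: bool) -> bool:
    """Return the link state the bond would see for this slave.

    With use_carrier set, only the cached carrier flag is read and the
    driver is never entered. Without it, the driver is queried, which is
    the path where the reported soft lockup was spinning on a lock.
    """
    if use_carrier:
        return slave.carrier_ok
    return slave.query_driver()


if __name__ == "__main__":
    driver_calls = []

    def fake_driver_query() -> bool:
        # Record that the driver path was taken, then report link up.
        driver_calls.append("ioctl")
        return True

    eth0 = Slave("eth0", carrier_ok=True, query_driver=fake_driver_query)
    print("use_carrier=1:", check_dev_link(eth0, use_carrier=True),
          "driver calls:", len(driver_calls))
    print("use_carrier=0:", check_dev_link(eth0, use_carrier=False),
          "driver calls:", len(driver_calls))
```

In this model the driver callback runs only when use_carrier is 0, which matches the observation in this bug that the lockup is reported with use_carrier=0 and that use_carrier=1 was suggested as a workaround.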