Bug 734815

Summary: kernel: NETDEV WATCHDOG: eth2 (bnx2): transmit queue 5 timed out
Product: Red Hat Enterprise Linux 6
Reporter: Robert Stroetgen <stroetgen>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED NOTABUG
QA Contact: Network QE <network-qe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.1
CC: chorn, jeder, jwest, kzhang, nhorman, rdassen
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-10-14 13:06:47 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: patch to disable carrier early in bnx2_netif_stop
  Flags: none

Description Robert Stroetgen 2011-08-31 14:32:54 UTC
Description of problem:

Once or twice a week the server loses its network connection on a bnx2 interface:

Aug 29 10:37:55 vmhost3 kernel: ------------[ cut here ]------------
Aug 29 10:37:55 vmhost3 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G           ---------------- T)
Aug 29 10:37:55 vmhost3 kernel: Hardware name: System x3550 M3 -[7944K1G]-
Aug 29 10:37:55 vmhost3 kernel: NETDEV WATCHDOG: eth2 (bnx2): transmit queue 5 timed out
Aug 29 10:37:55 vmhost3 kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle mpt2sas scsi_transport_sas raid_class mptctl mptbase autofs4 coretemp hwmon ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss dlm configfs sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_round_robin scsi_dh_emc vhost_net macvtap macvlan tun kvm_intel kvm microcode bnx2 serio_raw i2c_i801 i2c_core sg cdc_ether usbnet mii iTCO_wdt iTCO_vendor_support ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom megaraid_s
Aug 29 10:37:55 vmhost3 kernel: as ata_generic pata_acpi ata_piix bfa(T) scsi_transport_fc scsi_tgt dm_multipath dm_mod [last unloaded: scsi_wait_scan]
Aug 29 10:37:55 vmhost3 kernel: Pid: 0, comm: swapper Tainted: G           ---------------- T 2.6.32-131.12.1.el6.x86_64 #1
Aug 29 10:37:55 vmhost3 kernel: Call Trace:
Aug 29 10:37:55 vmhost3 kernel: <IRQ>  [<ffffffff810670f7>] ? warn_slowpath_common+0x87/0xc0
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff810671e6>] ? warn_slowpath_fmt+0x46/0x50
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8143a39d>] ? dev_watchdog+0x26d/0x280
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81391e00>] ? rh_timer_func+0x0/0x10
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81391621>] ? usb_hcd_poll_rh_status+0x141/0x180
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8143a130>] ? dev_watchdog+0x0/0x280
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81079ef7>] ? run_timer_softirq+0x197/0x340
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff810d6930>] ? handle_IRQ_event+0x60/0x170
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff814e2f45>] ? do_IRQ+0x75/0xf0
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff8100bad3>] ? ret_from_intr+0x0/0x11
Aug 29 10:37:55 vmhost3 kernel: <EOI>  [<ffffffff812bb7ce>] ? intel_idle+0xde/0x170
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff812bb7b1>] ? intel_idle+0xc1/0x170
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff813ec987>] ? cpuidle_idle_call+0xa7/0x140
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81009e86>] ? cpu_idle+0xb6/0x110
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff814c318a>] ? rest_init+0x7a/0x80
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81c1df28>] ? start_kernel+0x41d/0x429
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81c1d33a>] ? x86_64_start_reservations+0x125/0x129
Aug 29 10:37:55 vmhost3 kernel: [<ffffffff81c1d438>] ? x86_64_start_kernel+0xfa/0x109
Aug 29 10:37:55 vmhost3 kernel: ---[ end trace bd1730928f1d1c4d ]---
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: intr_sem[0] PCI_CMD[00100446]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: RPM_MGMT_PKT_CTRL[40000088]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: MCP_STATE_P0[0007610e] MCP_STATE_P1[0003600e]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: HC_STATS_INTERRUPT_STATUS[01df0020]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: DEBUG: PBA[00000000]
Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: NIC Copper Link is Down
Aug 29 10:37:58 vmhost3 kernel: bnx2 0000:10:00.0: eth2: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON

Speed and duplex mode are fixed on the switch and for the interface ('ETHTOOL_OPTS="speed 1000 duplex full"').

Losing the connection for 3 seconds causes some problems for the cluster management.
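
The length of each outage can be read off the "Link is Down" / "Link is Up" pair in the log. The following is a minimal Python sketch for doing that across a whole syslog file. It assumes classic syslog timestamps without a year (the caller supplies one), English month names, and the bnx2 message wording shown above; the function name `link_outages` is hypothetical and not part of any tool mentioned in this report.

```python
import re
from datetime import datetime

# Matches both message formats seen in this report, e.g.
#   "Aug 29 10:37:55 vmhost3 kernel: bnx2 0000:10:00.0: eth2: NIC Copper Link is Down"
#   "Sep 23 21:50:09 vmhost4 kernel: bnx2: eth0 NIC Copper Link is Down"
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel:\s+bnx2.*?"
    r"(?P<dev>eth\d+):?\s+NIC Copper Link is (?P<state>Down|Up)"
)


def link_outages(lines, year=2011):
    """Yield (host, dev, down_time, seconds_down) for each Down/Up pair."""
    down = {}  # (host, dev) -> time the link went down
    for line in lines:
        m = LINK_RE.match(line)
        if not m:
            continue
        # Syslog lines carry no year, so it is passed in (an assumption).
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["host"], m["dev"])
        if m["state"] == "Down":
            down[key] = ts
        elif key in down:
            start = down.pop(key)
            yield key[0], key[1], start, (ts - start).total_seconds()
```

Feeding it the two link messages from the excerpt above (10:37:55 down, 10:37:58 up) yields a single 3-second outage for eth2 on vmhost3. "Link is Up" messages without a preceding "Down", such as those logged at boot, are ignored.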


Version-Release number of selected component (if applicable):

kernel-2.6.32-131.12.1.el6.x86_64

How reproducible:

Recurrent, not reproducible.

Happens on different machines in the same way.



Comment 2 Neil Horman 2011-09-01 13:36:02 UTC
Can you try the 6.2 kernel please.  I fixed a few bugs there that affected how and when we got tx timeouts.

Comment 3 Robert Stroetgen 2011-09-01 14:41:37 UTC
(In reply to comment #2)
> Can you try the 6.2 kernel please.  I fixed a few bugs there that affected how
> and when we got tx timeouts.

Sorry, stupid question, where do I find the 6.2 kernel?

Comment 4 Neil Horman 2011-09-01 15:03:53 UTC
RHN, it should be in the latest RHEL6 beta channel.  If it's not there I can get you a build.

Comment 5 Robert Stroetgen 2011-09-01 15:08:15 UTC
(In reply to comment #4)
> RHN, it should be in the latest RHEL6 beta channel.  If it's not there I can get
> you a build.

I did not find it in the RHEL6 beta channel, maybe my fault. Could you please give me a build?

Comment 8 Neil Horman 2011-09-02 23:45:13 UTC
Could be related. Jeremy, can you give this customer a copy of the latest build with the bnx2 updates in it?

Comment 11 Neil Horman 2011-09-04 12:46:07 UTC
Just out of curiosity, are you using the iscsi cna features of the bnx2 card?  It looks like you might be.  Does this happen if you just use the device as a NIC?

Comment 12 Robert Stroetgen 2011-09-05 07:34:17 UTC
Not intentionally.

We use Brocade FibreChannel adapters and we use iscsi - but without enabling any extra features.

The error happens with the eth interfaces in general, not only with the interface used for iscsi.

(In reply to comment #11)
> Just out of curiosity, are you using the iscsi cna features of the bnx2 card? 
> It looks like you might be.  Does this happen if you just use the device as a
> NIC?

Comment 18 Neil Horman 2011-09-07 13:35:36 UTC
Created attachment 521884
patch to disable carrier early in bnx2_netif_stop

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=3612684

This is a build including the attached patch, which should disable carrier early enough to prevent timeouts during device changes.  If this is the root cause of the problem, this patch should fix it.  Please test and let me know the results.
Thanks

Comment 19 Robert Stroetgen 2011-09-07 13:43:57 UTC
Sorry, I cannot reach the host brewweb.devel.redhat.com (DNS: not found).

Comment 20 Neil Horman 2011-09-07 15:53:03 UTC
Yes, you won't be able to, as it's an internal build system.  I was expecting Christian would provide you with a copy of the appropriate resultant rpms when the build completed.

Comment 21 Christian Horn 2011-09-07 19:27:29 UTC
Robert: you should be able to access 
https://access.redhat.com/support/cases/00527421 and get the -195 kernel from there.  You can contact me directly via email if that does not work for you.

Comment 22 Robert Stroetgen 2011-09-08 09:33:30 UTC
I downloaded and installed the test kernel:

Linux vmhost4.gei.de 2.6.32-195.el6.test.x86_64 #1 SMP Wed Sep 7 10:32:23 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Sep 8 11:25:15 vmhost4 kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.7.0.3 (Jun 15, 2010)
Sep 8 11:25:15 vmhost4 kernel: iscsi: registered transport (bnx2i)
Sep 8 11:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: using MSIX
Sep 8 11:25:15 vmhost4 kernel: bnx2 0000:0b:00.1: eth1: using MSIX
Sep 8 11:25:15 vmhost4 kernel: bnx2 0000:0b:00.1: eth1: NIC Copper Link is Up, 1000 Mbps full duplex
Sep 8 11:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex
Sep 8 11:25:15 vmhost4 kernel: bnx2 0000:10:00.0: eth2: using MSIX
Sep 8 11:25:15 vmhost4 kernel: bnx2 0000:10:00.0: eth2: NIC Copper Link is Up, 1000 Mbps full duplex

The error is not reproducible on demand, but usually happens once or twice a week. I will let you know what happens.

Thanks and best regards
Robert

Comment 23 Neil Horman 2011-09-08 11:04:10 UTC
Ok, thank you.

Comment 24 Robert Stroetgen 2011-09-11 07:27:20 UTC
The error happened again:

Sep  8 20:25:15 vmhost4 kernel: NETDEV WATCHDOG: eth0 (bnx2): transmit queue 4 timed out
Sep  8 20:25:15 vmhost4 kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle autofs4 coretemp ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_round_robin vhost_net macvtap macvlan tun kvm_intel kvm microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii ch osst st sg bnx2 ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix bfa(U) scsi_transport_fc scsi_tgt megaraid_sas dm
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: intr_sem[0] PCI_CMD[00100446]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: RPM_MGMT_PKT_CTRL[40000088]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: HC_STATS_INTERRUPT_STATUS[01ef0010]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: PBA[00000000]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Down
Sep  8 20:25:18 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex

Comment 25 Neil Horman 2011-09-11 12:35:20 UTC
Ok, I'm out of ideas.  Christian, can you give me access to the local reproducer please?  I'll start poking around to see what else I can find.  Robert, can you please post the complete error? You seem to have cut out the backtrace for some reason.

Comment 27 Christian Horn 2011-09-11 17:02:59 UTC
Would it help to see if the problem appears with -195 and the driver from the Broadcom website?  That could at least help us distinguish "bnx2 only" from "all other areas could be affected".

Comment 28 Robert Stroetgen 2011-09-12 08:52:50 UTC
(In reply to comment #25)
> Robert, can you please post the complete error, you seem to have cut out
> the backtrace for some reason.

Sorry, I grepped "bnx". The complete log:

Sep  8 20:25:15 vmhost4 kernel: ------------[ cut here ]------------
Sep  8 20:25:15 vmhost4 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Sep  8 20:25:15 vmhost4 kernel: Hardware name: System x3550 M3 -[7944K1G]-
Sep  8 20:25:15 vmhost4 kernel: NETDEV WATCHDOG: eth0 (bnx2): transmit queue 4 timed out
Sep  8 20:25:15 vmhost4 kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle autofs4 coretemp ipmi_si ipmi_msghandler nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_round_robin vhost_net macvtap macvlan tun kvm_intel kvm microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii ch osst st sg bnx2 ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix bfa(U) scsi_transport_fc scsi_tgt megaraid_sas dm
Sep  8 20:25:15 vmhost4 kernel: _multipath dm_mod scsi_dh_emc [last unloaded: scsi_wait_scan]
Sep  8 20:25:15 vmhost4 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-195.el6.test.x86_64 #1
Sep  8 20:25:15 vmhost4 kernel: Call Trace:
Sep  8 20:25:15 vmhost4 kernel: <IRQ>  [<ffffffff81069b17>] ? warn_slowpath_common+0x87/0xc0
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81069c06>] ? warn_slowpath_fmt+0x46/0x50
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81449acd>] ? dev_watchdog+0x26d/0x280
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff8107d0f4>] ? mod_timer+0x144/0x220
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81449860>] ? dev_watchdog+0x0/0x280
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff8107c8f7>] ? run_timer_softirq+0x197/0x340
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81072101>] ? __do_softirq+0xc1/0x1d0
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff810d9410>] ? handle_IRQ_event+0x60/0x170
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81071ee5>] ? irq_exit+0x85/0x90
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff814f40f5>] ? do_IRQ+0x75/0xf0
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff8100ba13>] ? ret_from_intr+0x0/0x11
Sep  8 20:25:15 vmhost4 kernel: <EOI>  [<ffffffff812c3f2e>] ? intel_idle+0xde/0x170
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff812c3f11>] ? intel_idle+0xc1/0x170
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff813f9767>] ? cpuidle_idle_call+0xa7/0x140
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff814d367a>] ? rest_init+0x7a/0x80
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81c1ff76>] ? start_kernel+0x424/0x430
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81c1f33a>] ? x86_64_start_reservations+0x125/0x129
Sep  8 20:25:15 vmhost4 kernel: [<ffffffff81c1f438>] ? x86_64_start_kernel+0xfa/0x109
Sep  8 20:25:15 vmhost4 kernel: ---[ end trace 45f28c736a30ea38 ]---
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: intr_sem[0] PCI_CMD[00100446]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: RPM_MGMT_PKT_CTRL[40000088]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: HC_STATS_INTERRUPT_STATUS[01ef0010]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: DEBUG: PBA[00000000]
Sep  8 20:25:15 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Down
Sep  8 20:25:15 vmhost4 kernel: br0: port 1(eth0) entering disabled state
Sep  8 20:25:18 vmhost4 kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex
Sep  8 20:25:18 vmhost4 kernel: br0: port 1(eth0) entering forwarding state
Sep  8 20:25:31 vmhost4 auditd[9026]: Audit daemon rotating log files
Sep  8 20:25:36 vmhost4 abrt: Kerneloops: Reported 1 kernel oopses to Abrt
Sep  8 20:25:36 vmhost4 abrtd: Directory 'kerneloops-1315506336-8583-1' creation detected
Sep  8 20:25:36 vmhost4 abrtd: Crash is in database already (dup of /var/spool/abrt/kerneloops-1308049562-7262-1)
Sep  8 20:25:36 vmhost4 abrtd: Deleting crash kerneloops-1315506336-8583-1 (dup of kerneloops-1308049562-7262-1), sending dbus signal
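
Filtering the log by driver name drops the backtrace, because the call-trace lines do not contain the string "bnx". A sketch of an alternative, in Python: collect every line between the "cut here" marker and the matching "end trace" marker, so each warning is kept whole. It assumes the markers appear exactly as in the log above; `extract_warnings` is a hypothetical helper name, not an existing tool.

```python
import re

START_MARKER = "------------[ cut here ]------------"
END_RE = re.compile(r"---\[ end trace [0-9a-f]+ \]---")


def extract_warnings(lines):
    """Return a list of warning blocks, each a list of raw log lines."""
    blocks = []
    current = None  # lines of the block being collected, or None
    for line in lines:
        if START_MARKER in line:
            current = [line]          # a new warning starts here
        elif current is not None:
            current.append(line)
            if END_RE.search(line):   # warning is complete
                blocks.append(current)
                current = None
    return blocks
```

The driver's own DEBUG lines and the link up/down messages follow the "end trace" marker, so they are not part of the returned blocks and still need to be collected separately, for example by host and timestamp.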

Comment 29 Robert Stroetgen 2011-09-12 10:00:20 UTC
(In reply to comment #27)
> Would it help to see if the problem appears with -195 and the driver from the
> broadcom website?  Could at least help us distinguish "bnx2 only" and "all other
> areas could be affected"?

I just downloaded Broadcom NetXtreme II Driver iSCSI version 2.6.2.4c (Feb 01, 2011) and Broadcom bnx2 Linux Driver bnx2 v2.0.23b (Feb 01, 2011) cnic v2.2.13b (Feb 01, 2011).

To test the original broadcom driver I would need the -195 kernel sources.

Comment 30 Christian Horn 2011-09-12 11:03:13 UTC
Since the problems already occurred with -131, using that kernel for testing should work, I think (and if we see that the Broadcom driver works with -131, we also have fewer diffs to the working -71 than from our -193).

Comment 31 Christian Horn 2011-09-12 12:01:48 UTC
Both affected environments are running quite similar hardware:

environment a)
        Vendor: IBM Corp.
        Version: -[D6E149AUS-1.09]-
        Release Date: 09/21/2010
       System Information
        Manufacturer: IBM
        Product Name: System x3550 M3 -[7944K1G]-
environment b)
        Vendor: IBM Corp.
        Version: -[D6E153AUS-1.12]-
        Release Date: 06/30/2011
       System Information
        Manufacturer: IBM
        Product Name: System x3650 M2 -[7947PCV]-

environment a)
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 13)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 13)
environment b)
00:00.0 Host bridge: Intel Corporation 5520 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)

The affected bnx2 devices are identical in both cases:
0b:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
0b:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
10:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
10:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

I think it is quite interesting that both environments showing this problem run such similar hardware.  Maybe we should also think about chipset/BIOS versions (ensure we run the latest released versions, check outstanding bugs)?

Comment 32 Robert Stroetgen 2011-09-12 12:52:55 UTC
(In reply to comment #30)
> Since the problems already occurred with -131 using this for testing should work
> I think (and in case we see that the broadcom driver works with -131 then we
> also have fewer diffs to the working -71 than from our -193).

I have just built and installed the original Broadcom driver for the 131 kernel, replacing v2.7.0.3 (Jun 15, 2010):

Sep 12 14:44:07 vmhost4 kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.6.2.4c (Feb 01, 2011)
Sep 12 14:44:07 vmhost4 kernel: iscsi: registered transport (bnx2i)
Sep 12 14:44:07 vmhost4 kernel: bnx2: eth0: using MSIX
Sep 12 14:44:07 vmhost4 kernel: bnx2i: dev eth0 does not support iSCSI
Sep 12 14:44:07 vmhost4 kernel: bnx2i: eth0 free_hba done after 0 retries
Sep 12 14:44:07 vmhost4 kernel: bnx2: eth1: using MSIX
Sep 12 14:44:07 vmhost4 kernel: bnx2i: dev eth1 does not support iSCSI
Sep 12 14:44:07 vmhost4 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Sep 12 14:44:07 vmhost4 kernel: bnx2i: eth1 free_hba done after 0 retries
Sep 12 14:44:07 vmhost4 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
Sep 12 14:44:07 vmhost4 kernel: bnx2: eth2: using MSIX
Sep 12 14:44:07 vmhost4 kernel: bnx2i: dev eth2 does not support iSCSI
Sep 12 14:44:07 vmhost4 kernel: bnx2i: eth2 free_hba done after 0 retries
Sep 12 14:44:07 vmhost4 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex

Let's wait and see whether the error happens again within the next few days.

Comment 33 Neil Horman 2011-09-12 15:23:42 UTC
Ok, please let us know.

Comment 34 Neil Horman 2011-09-19 15:28:31 UTC
Any update here?  I've run the latest 197 kernel here all weekend with heavy traffic, and did not encounter a hang.

Comment 35 Robert Stroetgen 2011-09-20 07:01:20 UTC
Still watching, no incident yet with the 131 kernel and the original Broadcom driver:

[root@vmhost4 ~]# uname -a
Linux vmhost4.gei.de 2.6.32-131.12.1.el6.x86_64 #1 SMP Sun Jul 31 16:44:56 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Sep 12 14:44:07 vmhost4 kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.6.2.4c (Feb 01, 2011)

Another system with the 195 kernel has had one error in 12 days:

[root@vmhost-pbx ~]# uname -a
Linux vmhost-pbx.gei.de 2.6.32-195.el6.test.x86_64 #1 SMP Wed Sep 7 10:32:23 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Sep  8 14:01:30 vmhost-pbx kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.7.0.3 (Jun 15, 2010)

Sep 18 13:02:48 vmhost-pbx kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Down
Sep 18 13:02:51 vmhost-pbx kernel: bnx2 0000:0b:00.0: eth0: NIC Copper Link is Up, 1000 Mbps full duplex

Comment 36 Neil Horman 2011-09-20 10:41:07 UTC
Just out of curiosity, what BIOS level is the system in question running, and what version of the kernel-firmware package do you have installed?

Comment 37 Robert Stroetgen 2011-09-20 11:15:21 UTC
Both machines have the same kernel-firmware and BIOS (UEFI):

[root@vmhost4 ~]# rpmquery kernel-firmware
kernel-firmware-2.6.32-195.el6.test.noarch
[root@vmhost4 ~]# dmidecode | grep UEFI
        String 1: $MV Min UEFI Version -[D6E123AUS-1.00]-


[root@vmhost-pbx ~]# rpmquery kernel-firmware
kernel-firmware-2.6.32-195.el6.test.noarch
[root@vmhost-pbx ~]# dmidecode | grep UEFI
        String 1: $MV Min UEFI Version -[D6E123AUS-1.00]-

We have prepared a BIOS/UEFI upgrade, but first we wanted to watch the kernel/driver tests.

Comment 38 Neil Horman 2011-09-21 00:59:01 UTC
Ok, thanks.  Let us know what the driver tests produce.

Comment 39 Robert Stroetgen 2011-09-24 23:25:43 UTC
Error with the original Broadcom driver and the 131 kernel:

Sep 23 21:50:08 vmhost4 kernel: ------------[ cut here ]------------
Sep 23 21:50:08 vmhost4 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Tainted: G           ---------------- T)
Sep 23 21:50:08 vmhost4 kernel: Hardware name: System x3550 M3 -[7944K1G]-
Sep 23 21:50:08 vmhost4 kernel: NETDEV WATCHDOG: eth0 (bnx2): transmit queue 1 timed out
Sep 23 21:50:08 vmhost4 kernel: Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle autofs4 coretemp hwmon ipmi_si ipmi_msghandler nfs lockd fscache(T) nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i(U) cnic(U) uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_round_robin vhost_net macvtap macvlan tun kvm_intel kvm microcode serio_raw i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support osst st ch cdc_ether usbnet mii sg bnx2(U) ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif megaraid_sas pata_acpi ata_generic ata_piix bfa(U) scsi_transport_fc
Sep 23 21:50:08 vmhost4 kernel: scsi_tgt dm_multipath dm_mod scsi_dh_emc [last unloaded: scsi_wait_scan]
Sep 23 21:50:08 vmhost4 kernel: Pid: 0, comm: swapper Tainted: G           ---------------- T 2.6.32-131.12.1.el6.x86_64 #1
Sep 23 21:50:08 vmhost4 kernel: Call Trace:
Sep 23 21:50:08 vmhost4 kernel: <IRQ>  [<ffffffff810670f7>] ? warn_slowpath_common+0x87/0xc0
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff810671e6>] ? warn_slowpath_fmt+0x46/0x50
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8143a39d>] ? dev_watchdog+0x26d/0x280
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff81088aed>] ? insert_work+0x6d/0xb0
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8143a130>] ? dev_watchdog+0x0/0x280
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff81079ef7>] ? run_timer_softirq+0x197/0x340
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff81268c5d>] ? rb_insert_color+0x9d/0x160
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8102a00d>] ? lapic_next_event+0x1d/0x30
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8106f6e1>] ? __do_softirq+0xc1/0x1d0
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff81092cc0>] ? hrtimer_interrupt+0x140/0x250
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8100c2cc>] ? call_softirq+0x1c/0x30
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8100df05>] ? do_softirq+0x65/0xa0
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8106f4c5>] ? irq_exit+0x85/0x90
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff814e3030>] ? smp_apic_timer_interrupt+0x70/0x9b
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff8100bc93>] ? apic_timer_interrupt+0x13/0x20
Sep 23 21:50:08 vmhost4 kernel: <EOI>  [<ffffffff812bb7ce>] ? intel_idle+0xde/0x170
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff812bb7b1>] ? intel_idle+0xc1/0x170
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff813ec987>] ? cpuidle_idle_call+0xa7/0x140
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff81009e86>] ? cpu_idle+0xb6/0x110
Sep 23 21:50:08 vmhost4 kernel: [<ffffffff814d438a>] ? start_secondary+0x202/0x245
Sep 23 21:50:08 vmhost4 kernel: ---[ end trace 2ce7b3b8d8f26d8b ]---
Sep 23 21:50:08 vmhost4 kernel: bnx2: <--- start FTQ dump on eth0 --->
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_RV2P_PFTQ_CTL 10002
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_RV2P_TFTQ_CTL 20000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_RV2P_MFTQ_CTL 4000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_TBDR_FTQ_CTL 4002
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_TDMA_FTQ_CTL 10000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_TXP_FTQ_CTL 10000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_TPAT_FTQ_CTL 10000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_RXP_CFTQ_CTL 8000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_RXP_FTQ_CTL 100000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_COM_COMXQ_FTQ_CTL 10000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_COM_COMTQ_FTQ_CTL 20000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_COM_COMQ_FTQ_CTL 10000
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: BNX2_CP_CPQ_FTQ_CTL 4002
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: TXP mode b84c state 80001000 evt_mask 500 pc 8001284 pc 800128c instr 38640001
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a4c pc 8000a5c instr 10400016
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004c1c pc 8004c1c instr 10a0fffd
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: COM mode b8cc state 80000000 evt_mask 500 pc 8000a98 pc 8000aa4 instr 3c020800
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c50 pc 8000c48 instr 3e00008
Sep 23 21:50:08 vmhost4 kernel: bnx2: <--- end FTQ dump on eth0 --->
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: intr_sem[0]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: intr_sem[0] PCI_CMD[00100446]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 RPM_MGMT_PKT_CTRL[40000088]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01fc0003]
Sep 23 21:50:08 vmhost4 kernel: bnx2: eth0 DEBUG: PBA[00000000]
Sep 23 21:50:09 vmhost4 kernel: cnic: cnic_stop_bnx2_ooo_hw: hw rx_cons=0 != sw rx_cons=0 rx_prod=511
Sep 23 21:50:09 vmhost4 kernel: bnx2: eth0 NIC Copper Link is Down
Sep 23 21:50:09 vmhost4 kernel: br0: port 1(eth0) entering disabled state
Sep 23 21:50:12 vmhost4 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Sep 23 21:50:12 vmhost4 kernel: br0: port 1(eth0) entering forwarding state
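
The bnx2 register dumps logged after each timeout are mostly identical from one incident to the next, which makes the few differing values easy to miss by eye. A minimal Python sketch for comparing two dumps follows. It only parses the `NAME[hexvalue]` pairs printed by the driver and does not attempt to interpret them; `parse_registers` and `diff_registers` are hypothetical helper names.

```python
import re

# A register entry looks like "PCI_CMD[00100446]" or "intr_sem[0]".
REG_RE = re.compile(r"([A-Za-z0-9_]+)\[([0-9a-fA-F]+)\]")


def parse_registers(lines):
    """Collect NAME[hex] pairs from bnx2 log lines into {name: int}."""
    regs = {}
    for line in lines:
        if "bnx2" not in line:
            continue  # skip lines from other subsystems
        for name, value in REG_RE.findall(line):
            regs[name] = int(value, 16)
    return regs


def diff_registers(a, b):
    """Return {name: (value_in_a, value_in_b)} for registers that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

Applied to the Aug 29 dump from the description and the Sep 8 dump from comment 28, this reports differences only in MCP_STATE_P0, MCP_STATE_P1 and HC_STATS_INTERRUPT_STATUS; the PCI, EMAC, RPM and PBA values are the same in both.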

Comment 40 Neil Horman 2011-09-26 11:10:39 UTC
ok, looks like we're back to the firmware/UEFI upgrade question...

Comment 41 Robert Stroetgen 2011-09-26 12:54:41 UTC
I just updated the server firmware:

Old:

        Type                Version                  Release Date
        ----                -------                  ------------
        IMM                 YUOO84C                  09/28/2010
        UEFI                D6E149A                  09/21/2010
        DSA                 DSYT75X                  09/17/2010


New:

        Type                Version                  Release Date
        ----                -------                  ------------
        IMM                 YUOOB7C                  06/11/2011
        UEFI                D6E153A                  06/30/2011
        DSA                 DSYT89G                  06/21/2011

I updated the Broadcom firmware, too:

Old:

    ADAPTER MAC         BOOT    IPMI    ASF     PXE     UMP     NCSI    iSCSI   EFI
-------------------     ----    ----    ---     ---     ---     ---     ---     ---
E41F136D0B64 (5709)     5.2.2   NA      NA      NA      NA      2.0.10  NA      NA
E41F136D0B66 (5709)     5.2.2   NA      NA      NA      NA      2.0.10  NA      NA
E41F13D60EDC (5709)     4.6.4   NA      NA      NA      NA      1.0.3   NA      NA
E41F13D60EDE (5709)     NA      NA      NA      NA      NA      NA      NA      NA

New:

    ADAPTER MAC         BOOT    IPMI    ASF     PXE     UMP     NCSI    iSCSI   EFI
-------------------     ----    ----    ---     ---     ---     ---     ---     ---
E41F136D0B64 (5709)     6.2.0   NA      NA      NA      NA      2.0.11  NA      NA
E41F136D0B66 (5709)     6.2.0   NA      NA      NA      NA      2.0.11  NA      NA
E41F13D60EDC (5709)     6.2.0   NA      NA      NA      NA      2.0.11  NA      NA
E41F13D60EDE (5709)     NA      NA      NA      NA      NA      NA      NA      NA

I keep watching ...
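
For reference, the per-adapter changes can be listed mechanically from the two firmware tables. This is a small Python sketch that assumes the whitespace-separated layout shown above (MAC, chip in parentheses, then one value per column); `parse_fw_table` and `fw_changes` are hypothetical names.

```python
COLUMNS = ["BOOT", "IPMI", "ASF", "PXE", "UMP", "NCSI", "iSCSI", "EFI"]


def parse_fw_table(text):
    """Map adapter MAC -> {column: version} for each data row of the table."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have a MAC, a "(chip)" field, then one value per column;
        # the header and separator rows do not match this shape.
        if len(fields) == 2 + len(COLUMNS) and fields[1].startswith("("):
            table[fields[0]] = dict(zip(COLUMNS, fields[2:]))
    return table


def fw_changes(old, new):
    """Return (mac, column, old_version, new_version) for every changed cell."""
    changes = []
    for mac in sorted(old.keys() & new.keys()):
        for col in COLUMNS:
            if old[mac][col] != new[mac][col]:
                changes.append((mac, col, old[mac][col], new[mac][col]))
    return changes
```

With the tables above, this lists BOOT and NCSI changes for the first three adapters (for example E41F13D60EDC going from BOOT 4.6.4 / NCSI 1.0.3 to 6.2.0 / 2.0.11) and no change for E41F13D60EDE, which reports NA in every column both before and after.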

Comment 42 Neil Horman 2011-09-26 18:43:32 UTC
Ok, thank you

Comment 46 Robert Stroetgen 2011-10-12 07:05:44 UTC
Just to keep you up to date: No incident after firmware update yet.

Sep 26 14:40:57 vmhost4 kernel: Broadcom NetXtreme II iSCSI Driver bnx2i v2.6.2.4c (Feb 01, 2011)
Sep 26 14:40:57 vmhost4 kernel: iscsi: registered transport (bnx2i)
Sep 26 14:40:57 vmhost4 kernel: bnx2: eth0: using MSIX
Sep 26 14:40:57 vmhost4 kernel: bnx2i: dev eth0 does not support iSCSI
Sep 26 14:40:57 vmhost4 kernel: bnx2i: eth0 free_hba done after 0 retries
Sep 26 14:40:57 vmhost4 kernel: bnx2: eth1: using MSIX
Sep 26 14:40:57 vmhost4 kernel: bnx2i: dev eth1 does not support iSCSI
Sep 26 14:40:57 vmhost4 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Sep 26 14:40:57 vmhost4 kernel: bnx2i: eth1 free_hba done after 0 retries
Sep 26 14:40:57 vmhost4 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
Sep 26 14:40:57 vmhost4 kernel: bnx2: eth2: using MSIX
Sep 26 14:40:57 vmhost4 kernel: bnx2i: dev eth2 does not support iSCSI
Sep 26 14:40:57 vmhost4 kernel: bnx2i: eth2 free_hba done after 0 retries
Sep 26 14:40:57 vmhost4 kernel: bnx2: eth2 NIC Copper Link is Up, 1000 Mbps full duplex

Kernel 2.6.32-131.12.1.el6.x86_64

Comment 47 Neil Horman 2011-10-12 11:11:30 UTC
Copy that, thank you for the update.  Are the tests continuing to run, or have you concluded that this is the cause of the problem?

Comment 48 Robert Stroetgen 2011-10-12 11:44:28 UTC
We are keeping the tests running, but more than two weeks without an error gives us hope.

Comment 49 Neil Horman 2011-10-14 13:06:47 UTC
ok, at 2 weeks, I'll say this is fixed.  Please re-open if the problem resurfaces.  Thanks!