Bug 1658700 - RHEL 7.5+: On reboot of redundant network switch the bnx2x driver interfaces do not re-establish connection to the node consistently
Summary: RHEL 7.5+: On reboot of redundant network switch the bnx2x driver interfaces ...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.6
Hardware: ppc64le
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.7
Assignee: Manish Chopra (Marvell)
QA Contact: Ma Yuying
URL:
Whiteboard:
Depends On:
Blocks: 1689420 1707052 1739630 1598750
 
Reported: 2018-12-12 17:12 UTC by Brett Hull
Modified: 2019-11-14 14:51 UTC
CC: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-14 14:46:05 UTC
Target Upstream Version:


Attachments
register dumps (3.75 MB, application/gzip)
2019-02-25 23:52 UTC, Jamie Bainbridge
Driver source package (1.76 MB, application/gzip)
2019-03-05 06:54 UTC, sudarsana.kalluru
Edebug - Marvell tool for collecting the detailed debug information (6.16 MB, application/gzip)
2019-03-05 07:01 UTC, sudarsana.kalluru


Links
System ID Priority Status Summary Last Updated
IBM Linux Technology Center 176029 None None None 2019-07-26 19:33:08 UTC

Description Brett Hull 2018-12-12 17:12:41 UTC
Description of problem:
  We are seeing an issue where, during testing, the customer reboots one of two redundant switches and sometimes the link(s) do not re-establish. We have several forced vmcores which show the physical link up, but the logical link is never brought back up. At the same time we see bnx2x_panic_dump output in the messages files.

Version-Release number of selected component (if applicable):
3.10.0-957.el7.ppc64le - bnx2x behavior and panic_dumps

How reproducible:
The customer has a test suite; they have 7 nodes in an Appliance cluster. In each test run at least one node hits this issue, and usually multiple nodes do. This is delaying their release date.

Steps to Reproduce:
1. Bring application up, start to generate traffic over network bond
2. Restart one of the network switches, they are in a redundant configuration
3. The connections are lost as expected; they should then recover once the switch is available again

Actual results:
One or both of the fab devices (part of a 4-NIC bond) connected to the restarted switch will not recover and remain down.

The last message in the logs indicates a timeout:

[bnx2x_sp_rtnl_task:10288(fab0)]Indicating link is down due to Tx-timeout

Expected results:
As long as there are no physical issues, we expect both fab devices to recover and rejoin the bond.

Additional info:
I do not know how to decode these bnx2x_panic_dump registers, but they are associated with the interfaces (fabX) which do not return to service.

We have multiple forced vmcores and sosreports for the nodes. I can easily make these available.

This is not 100% consistent failure per node, but each test has at least one of these failures.

Dec  8 20:25:12 node0105 kernel: device-mapper: multipath: Failing path 66:1504.
Dec  8 20:25:12 node0105 kernel: device-mapper: multipath: Failing path 133:1360.
Dec  8 20:25:12 node0105 kernel: NETDEV WATCHDOG: fab0 (bnx2x): transmit queue 3 timed out
Dec  8 20:25:12 node0105 kernel: ------------[ cut here ]------------
Dec  8 20:25:12 node0105 kernel: WARNING: CPU: 122 PID: 0 at net/sched/sch_generic.c:356 dev_watchdog+0x340/0x360
Dec  8 20:25:12 node0105 kernel: Modules linked in: tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nfnetlink_queue nfnetlink_log bluetooth rfkill mmfs26(OE) mmfslinux(OE) tracedev(OE) xt_addrtype ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables br_netfilter xfs bonding ip_set nfnetlink bridge stp llc dm_service_time dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_multipath i2c_opal leds_powernv ibmpowernv
Dec  8 20:25:12 node0105 kernel: ipmi_powernv i2c_core powernv_rng ses enclosure scsi_transport_sas dm_mod sg ipmi_devintf ipmi_msghandler binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 mbcache jbd2 sd_mod sr_mod cdrom lpfc bnx2x ipr nvmet_fc nvmet libata crc_t10dif crct10dif_generic nvme_fc nvme_fabrics nvme_core mdio scsi_transport_fc ptp pps_core libcrc32c scsi_tgt crct10dif_common [last unloaded: ip_tables]
Dec  8 20:25:12 node0105 kernel: CPU: 122 PID: 0 Comm: swapper/122 Kdump: loaded Tainted: G        W  OEL ------------   3.10.0-957.el7.ppc64le #1
Dec  8 20:25:12 node0105 kernel: task: c000005fd245b300 ti: c000007fffba4000 task.ti: c000005fd2494000
Dec  8 20:25:12 node0105 kernel: NIP: c000000000916560 LR: c00000000091655c CTR: 0000000000000000
Dec  8 20:25:12 node0105 kernel: REGS: c000007fffba7a30 TRAP: 0700   Tainted: G        W  OEL ------------    (3.10.0-957.el7.ppc64le)
Dec  8 20:25:12 node0105 kernel: MSR: 9000000100029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42004024  XER: 00000000
Dec  8 20:25:12 node0105 kernel: CFAR: c000000000a986b8 SOFTE: 1
                GPR00: c00000000091655c c000007fffba7cb0 c0000000013e4d00 0000000000000039
                GPR04: 0000000000000001 c0000000015b4d00 c0000000016658c8 c0000000015b4d00
                GPR08: 000000212072605a 0000000000000000 0000000000000000 c0000000015b4d00
                GPR12: 0000000000004400 c000000007b64a00 c000005fd2497f90 0000000010200040
                GPR16: c00000010fe8bd28 c00000010fe8c128 c00000010fe8c528 0000000000000000
                GPR20: c00000010fe8b928 c000000001422280 0000000000000000 0000000000000000
                GPR24: 0000000000000000 ffffffffffffffff 0000000000000000 000000000000007a
                GPR28: 0000000000000004 c000000001422280 c000005fcb680000 0000000000000003
Dec  8 20:25:12 node0105 kernel: NIP [c000000000916560] dev_watchdog+0x340/0x360
Dec  8 20:25:12 node0105 kernel: LR [c00000000091655c] dev_watchdog+0x33c/0x360
Dec  8 20:25:12 node0105 kernel: Call Trace:
Dec  8 20:25:12 node0105 kernel: [c000007fffba7cb0] [c00000000091655c] dev_watchdog+0x33c/0x360 (unreliable)
Dec  8 20:25:12 node0105 kernel: [c000007fffba7d50] [c0000000001058b8] call_timer_fn+0x68/0x170
Dec  8 20:25:12 node0105 kernel: [c000007fffba7df0] [c0000000001080bc] run_timer_softirq+0x2dc/0x3a0
Dec  8 20:25:12 node0105 kernel: [c000007fffba7ea0] [c0000000000f7364] __do_softirq+0x154/0x380
Dec  8 20:25:12 node0105 kernel: [c000007fffba7f90] [c00000000002b1fc] call_do_softirq+0x14/0x24
Dec  8 20:25:12 node0105 kernel: [c000005fd2497a40] [c0000000000161d0] do_softirq+0x130/0x180
Dec  8 20:25:12 node0105 kernel: [c000005fd2497a80] [c0000000000f78f4] irq_exit+0x1f4/0x200
Dec  8 20:25:12 node0105 kernel: [c000005fd2497ac0] [c000000000024f94] timer_interrupt+0xa4/0xe0
Dec  8 20:25:12 node0105 kernel: [c000005fd2497af0] [c000000000002c14] decrementer_common+0x114/0x118
Dec  8 20:25:12 node0105 kernel: --- Exception: 901 at arch_local_irq_restore+0xf0/0x140
                    LR = arch_local_irq_restore+0xf0/0x140
Dec  8 20:25:12 node0105 kernel: [c000005fd2497de0] [c00000010fe9f970] 0xc00000010fe9f970 (unreliable)
Dec  8 20:25:12 node0105 kernel: [c000005fd2497e00] [c000000000848870] cpuidle_idle_call+0x140/0x410
Dec  8 20:25:12 node0105 kernel: [c000005fd2497e70] [c000000000089710] powernv_idle+0x20/0x50
Dec  8 20:25:12 node0105 kernel: [c000005fd2497e90] [c00000000001d380] arch_cpu_idle+0x70/0x160
Dec  8 20:25:12 node0105 kernel: [c000005fd2497ec0] [c000000000180a00] cpu_startup_entry+0x190/0x210
Dec  8 20:25:12 node0105 kernel: [c000005fd2497f20] [c000000000054c30] start_secondary+0x310/0x340
Dec  8 20:25:12 node0105 kernel: [c000005fd2497f90] [c000000000009b6c] start_secondary_prolog+0x10/0x14
Dec  8 20:25:12 node0105 kernel: Instruction dump:
Dec  8 20:25:12 node0105 kernel: 994d02a4 4bffff14 7fc3f378 4bfc717d 60000000 7fc4f378 7fe6fb78 7c651b78
Dec  8 20:25:12 node0105 kernel: 3c62ff93 386379f8 48182101 60000000 <0fe00000> 39200001 3d02fff7 9928c5f1
Dec  8 20:25:12 node0105 kernel: ---[ end trace 6d44f50a074840aa ]---
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:923(fab0)]begin crash dump -----------------
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:933(fab0)]def_idx(0x3065)  def_att_idx(0xec)  attn_state(0x0)  spq_prod_idx(0x7d) next_stats_cnt(0x3053)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:938(fab0)]DSB: attn bits(0x0)  ack(0x4)  id(0x0)  idx(0xec)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:939(fab0)]     def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x3196 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0)  igu_sb_id(0x0)  igu_seg_id(0x1) pf_id(0x0)  vnic_id(0x0)  vf_id(0xff)  vf_valid (0x0) state(0x1)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp0: rx_bd_prod(0xba8e)  rx_bd_cons(0x8c7)  rx_comp_prod(0xf9bd)  rx_comp_cons(0xf7f1)  *rx_cons_sb(0xf7f1)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xfe55)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp0: tx_pkt_prod(0x7de3)  tx_pkt_cons(0x7de3)  tx_bd_prod(0x4b6c)  tx_bd_cons(0x4b6b)  *tx_cons_sb(0x7de3)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp0: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp0: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0xfe55 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0xf7f1 0x0 0x0 0x0 0x7de3 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x2)  igu_seg_id(0x0) time_to_expire (0x2aa006a4) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x2)  igu_seg_id(0x0) time_to_expire (0x2a9f03c8) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp1: rx_bd_prod(0xc744)  rx_bd_cons(0x57d)  rx_comp_prod(0xf75e)  rx_comp_cons(0xf592)  *rx_cons_sb(0xf592)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x6235)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp1: tx_pkt_prod(0xdf4b)  tx_pkt_cons(0xdf4b)  tx_bd_prod(0xde69)  tx_bd_cons(0xde68)  *tx_cons_sb(0xdf4b)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp1: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp1: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0x6235 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0xf592 0x0 0x0 0x0 0xdf4b 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x3)  igu_seg_id(0x0) time_to_expire (0x2a872089) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x3)  igu_seg_id(0x0) time_to_expire (0x2a9e48be) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp2: rx_bd_prod(0x1b8a)  rx_bd_cons(0x9c3)  rx_comp_prod(0x4381)  rx_comp_cons(0x41b4)  *rx_cons_sb(0x41b4)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xd426)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp2: tx_pkt_prod(0xfdb6)  tx_pkt_cons(0xfdb6)  tx_bd_prod(0x29de)  tx_bd_cons(0x29dd)  *tx_cons_sb(0xfdb6)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp2: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp2: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0xd426 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0x41b4 0x0 0x0 0x0 0xfdb6 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x4)  igu_seg_id(0x0) time_to_expire (0x2a9f8d8c) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x4)  igu_seg_id(0x0) time_to_expire (0x2aa01b97) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp3: rx_bd_prod(0xb9c9)  rx_bd_cons(0x804)  rx_comp_prod(0xce4d)  rx_comp_cons(0xcc81)  *rx_cons_sb(0xccd8)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xf3f5)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp3: tx_pkt_prod(0x9c2c)  tx_pkt_cons(0x9807)  tx_bd_prod(0x83ad)  tx_bd_cons(0x7b5a)  *tx_cons_sb(0x9c2c)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp3: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp3: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0xf867 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0xccd8 0x0 0x0 0x0 0x9c2c 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x5)  igu_seg_id(0x0) time_to_expire (0x2a9d9253) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x5)  igu_seg_id(0x0) time_to_expire (0x2a50b198) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp4: rx_bd_prod(0x79ef)  rx_bd_cons(0x82a)  rx_comp_prod(0x9ced)  rx_comp_cons(0x9b21)  *rx_cons_sb(0x9b21)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x4eae)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp4: tx_pkt_prod(0x2239)  tx_pkt_cons(0x2239)  tx_bd_prod(0x5cc8)  tx_bd_cons(0x5cc7)  *tx_cons_sb(0x2239)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp4: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0x4eae 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0x9b21 0x0 0x0 0x0 0x2239 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x6)  igu_seg_id(0x0) time_to_expire (0x2a9f8da6) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x6)  igu_seg_id(0x0) time_to_expire (0x2a9fd121) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp5: rx_bd_prod(0xea8d)  rx_bd_cons(0x8c6)  rx_comp_prod(0x2743)  rx_comp_cons(0x2576)  *rx_cons_sb(0x2576)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x4695)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp5: tx_pkt_prod(0x9a19)  tx_pkt_cons(0x9a19)  tx_bd_prod(0x546c)  tx_bd_cons(0x546b)  *tx_cons_sb(0x9a19)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp5: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp5: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0x4695 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0x2576 0x0 0x0 0x0 0x9a19 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x7)  igu_seg_id(0x0) time_to_expire (0x2aa07692) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x7)  igu_seg_id(0x0) time_to_expire (0x2aa01b77) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp6: rx_bd_prod(0x8c3d)  rx_bd_cons(0xa76)  rx_comp_prod(0x9404)  rx_comp_cons(0x9237)  *rx_cons_sb(0x9237)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x6756)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp6: tx_pkt_prod(0x1b8f)  tx_pkt_cons(0x1b8f)  tx_bd_prod(0x7020)  tx_bd_cons(0x701f)  *tx_cons_sb(0x1b8f)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp6: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp6: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0x6756 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0x9237 0x0 0x0 0x0 0x1b8f 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x8)  igu_seg_id(0x0) time_to_expire (0x2aa01b3c) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x8)  igu_seg_id(0x0) time_to_expire (0x2aa077c6) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:990(fab0)]fp7: rx_bd_prod(0x46bc)  rx_bd_cons(0x4f5)  rx_comp_prod(0x785a)  rx_comp_cons(0x768e)  *rx_cons_sb(0x768e)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:993(fab0)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x844d)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp7: tx_pkt_prod(0x36f6)  tx_pkt_cons(0x36f6)  tx_bd_prod(0x95f4)  tx_bd_cons(0x95f3)  *tx_cons_sb(0x36f6)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp7: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1010(fab0)]fp7: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1021(fab0)]     run indexes (0x844d 0x0)
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1027(fab0)]     indexes (0x0 0x768e 0x0 0x0 0x0 0x36f6 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Dec  8 20:25:12 node0105 kernel: SM[0] __flags (0x0) igu_sb_id (0x9)  igu_seg_id(0x0) time_to_expire (0x2a9fd0e8) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: SM[1] __flags (0x0) igu_sb_id (0x9)  igu_seg_id(0x0) time_to_expire (0x2a9d1d54) timer_value(0xff)
Dec  8 20:25:12 node0105 kernel: INDEX[0] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[1] flags (0x2) timeout (0x6)
Dec  8 20:25:12 node0105 kernel: INDEX[2] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[3] flags (0x0) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[4] flags (0x1) timeout (0x0)
Dec  8 20:25:12 node0105 kernel: INDEX[5] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[6] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: INDEX[7] flags (0x3) timeout (0xc)
Dec  8 20:25:12 node0105 kernel: bnx2x 0004:01:00.0 fab0: bc 7.10.4
Dec  8 20:25:12 node0105 kernel: begin fw dump (mark 0x3c6b18)
Dec  8 20:25:12 node0105 kernel: attn 0x4->0x0
Dec  8 20:25:12 node0105 kernel: attn 0x0->0x4
    [... the fw dump repeats the "attn 0x4->0x0" / "attn 0x0->0x4" transition pair for the remainder of the dump, with one stretch of binary garbage mid-stream ...]
Dec  8 20:25:12 node0105 kernel: f0 t.o.
Dec  8 20:25:12 node0105 kernel: end of fw dump
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_mc_assert:750(fab0)]Chip Revision: everest3, FW Version: 7_13_1
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_panic_dump:1186(fab0)]end crash dump -----------------
Dec  8 20:25:12 node0105 kernel: bnx2x: [bnx2x_sp_rtnl_task:10288(fab0)]Indicating link is down due to Tx-timeout
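
For context, the WARNING and panic dump above come from dev_watchdog() in net/sched/sch_generic.c, which fires the driver's ndo_tx_timeout handler when a stopped transmit queue has made no progress within watchdog_timeo. For bnx2x that handler is what emits bnx2x_panic_dump and the "Indicating link is down due to Tx-timeout" message. A simplified sketch of the upstream logic (not the exact RHEL 7 source):

/* Simplified sketch of dev_watchdog() from net/sched/sch_generic.c.
 * Note the watchdog only fires while netif_carrier_ok(), i.e. the
 * link was still considered up when the queue stalled. */
static void dev_watchdog_sketch(struct net_device *dev)
{
        unsigned int i;

        if (!netif_device_present(dev) || !netif_running(dev) ||
            !netif_carrier_ok(dev))
                return;

        for (i = 0; i < dev->num_tx_queues; i++) {
                struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

                /* A queue that is stopped (ring full) and has not
                 * transmitted within watchdog_timeo jiffies is stuck. */
                if (netif_xmit_stopped(txq) &&
                    time_after(jiffies, txq->trans_start + dev->watchdog_timeo)) {
                        WARN_ONCE(1, "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
                                  dev->name, netdev_drivername(dev), i);
                        /* bnx2x: bnx2x_tx_timeout() -> bnx2x_panic_dump(),
                         * then schedules the sp_rtnl task that downs the
                         * logical link. */
                        dev->netdev_ops->ndo_tx_timeout(dev);
                        break;
                }
        }
}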

----
node0103 - retrace-server-interact 972546956 crash

node0105 - retrace-server-interact 612634249 crash

----------------------------------------------------------------------------------------------------------------

node0105:  duplicating Jamie's work from #324 and #326

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
c000001fcd0c8000  lo     127.0.0.1
c000003fd0e78000  fab1   
c000003fd0e70000  fab3   
c000005fd1d88000  mgt1   
c000005fd1d90000  eth1   
c000005fcb680000  fab0   
c000005fcb688000  fab2   
c000005fcb690000  mgt0   
c000005fcb698000  eth0   
c0000079445d3000  bond0  
c00000793c704000  fbond  9.0.226.20
c00000793c707000  mbond  9.0.224.27, 9.0.231.27
c000005fc4a82000  docker0 172.17.0.1

crash> net_device.operstate 0xc000003fd0e78000    <<< fab1
  operstate = 2 '\002'                            <<< IF_OPER_DOWN
crash> net_device.operstate c000003fd0e70000      <<< fab3
  operstate = 6 '\006'                            <<< IF_OPER_UP
crash> net_device.operstate c000005fcb680000      <<< fab0
  operstate = 2 '\002'                            <<< IF_OPER_DOWN
crash> net_device.operstate c000005fcb688000      <<< fab2
  operstate = 6 '\006'                            <<< IF_OPER_UP
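
For reference, net_device.operstate decodes per the RFC 2863 IF_OPER_* enum from include/uapi/linux/if.h; a small standalone decoder for the values read above:

/* Decode the operstate values seen in the crash sessions.
 * Enum values match include/uapi/linux/if.h. */
#include <stdio.h>

static const char *oper_name[] = {
        "IF_OPER_UNKNOWN",        /* 0 */
        "IF_OPER_NOTPRESENT",     /* 1 */
        "IF_OPER_DOWN",           /* 2  <- fab0/fab1 */
        "IF_OPER_LOWERLAYERDOWN", /* 3 */
        "IF_OPER_TESTING",        /* 4 */
        "IF_OPER_DORMANT",        /* 5 */
        "IF_OPER_UP",             /* 6  <- fab2/fab3 */
};

int main(void)
{
        unsigned int vals[] = { 2, 6 };

        for (unsigned int i = 0; i < sizeof(vals) / sizeof(vals[0]); i++)
                printf("operstate %u = %s\n", vals[i], oper_name[vals[i]]);
        return 0;
}
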
crash> struct -ox net_device | grep SIZE
SIZE: 0x900
crash> px 0xc00000793c704000 +0x900
$4 = 0xc00000793c704900
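
The +0x900 arithmetic works because the network core allocates the driver's private area (struct bonding here, struct bnx2x below) immediately after the aligned net_device, and SIZE of net_device is exactly 0x900 on this kernel. Sketch of the upstream helper from include/linux/netdevice.h:

/* netdev_priv(): private data follows the NETDEV_ALIGN-aligned
 * net_device, hence base address + 0x900 in these crash sessions. */
#define NETDEV_ALIGN 32

static inline void *netdev_priv_sketch(const struct net_device *dev)
{
        return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
}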

crash> bonding 0xc00000793c704900 |grep slave_arr
  slave_arr = 0xc00000790da3bd40, 
  slave_arr_work = {
      func = 0xd000000047b135a0 <bond_slave_arr_handler>
crash> bond_up_slave 0xc00000790da3bd40
struct bond_up_slave {
  count = 2,                                      <<< says only two in bond (functioning at least)
  rcu = {
    next = 0x0, 
    func = 0x0
  }, 
  arr = 0xc00000790da3bd58
}
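
count = 2 matters because for 802.3ad/xor modes the transmit path picks a slave straight out of this array; slaves that are down are left out when the array is rebuilt, which is consistent with only two of the four slaves functioning. Roughly, from upstream drivers/net/bonding/bond_main.c (simplified sketch, not the exact RHEL 7 source):

/* Simplified sketch of bond_3ad_xor_xmit(): hash the flow onto one of
 * the 'count' usable slaves in slave_arr. */
static int bond_3ad_xor_xmit_sketch(struct sk_buff *skb,
                                    struct net_device *bond_dev)
{
        struct bonding *bond = netdev_priv(bond_dev);
        struct bond_up_slave *slaves;
        unsigned int count;

        slaves = rcu_dereference(bond->slave_arr);
        count = slaves ? ACCESS_ONCE(slaves->count) : 0;
        if (likely(count)) {
                struct slave *slave =
                        slaves->arr[bond_xmit_hash(bond, skb) % count];
                bond_dev_queue_xmit(bond, skb, slave->dev);
        } else {
                /* No usable slaves at all: drop. */
                dev_kfree_skb_any(skb);
        }
        return NETDEV_TX_OK;
}
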
-------------
#define BNX2X_STATE_OPEN                0x3000   /* 12288 decimal */
#define BNX2X_STATE_ERROR               0xf000   /* 61440 decimal */

fab1 not good:
crash> px 0xc000003fd0e78000+0x900
$5 = 0xc000003fd0e78900
crash> bnx2x.state 0xc000003fd0e78900
  state = 61440                        <<< 0xf000
crash>
crash> bnx2x.link_vars 0xc000003fd0e78900
  link_vars = {
    phy_flags = 0 '\000',             <<<< seems odd no flags set
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',           <<<< says physical link is available
    link_up = 0 '\000',               <<<< says logical link is down
    line_speed = 10000, 
    duplex = 1, 
-------------------------------------------------

fab3 good:
crash> px 0xc000003fd0e70000 +0x900
$6 = 0xc000003fd0e70900
crash> bnx2x.state 0xc000003fd0e70900
  state = 12288                        <<< 0x3000
crash> bnx2x.link_vars 0xc000003fd0e70900
  link_vars = {
    phy_flags = 5 '\005',              <<<< BNX2_PHY_FLAG_SERDES / BNX2_PHY_FLAG_PARALLEL_DETECT
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',            <<<< says physical link is available
    link_up = 1 '\001',                <<<< says logical link is up
    line_speed = 10000, 
    duplex = 1,
-------------------------------------------------

fab0 not good:
crash> px 0xc000005fcb680000 +0x900
$7 = 0xc000005fcb680900
crash> bnx2x.state 0xc000005fcb680900
  state = 61440                        <<< 0xf000
crash> bnx2x.link_vars 0xc000005fcb680900
  link_vars = {
    phy_flags = 0 '\000',              <<<< seems odd no flags set
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',            <<<< says physical link is available
    link_up = 0 '\000',                <<<< says logical link is down
    line_speed = 10000, 
    duplex = 1,
-------------------------------------------------

fab2 good:

crash> px 0xc000005fcb688000+0x900
$9 = 0xc000005fcb688900
crash> bnx2x.state 0xc000005fcb688900
  state = 12288                        <<< 0x3000
crash> bnx2x.link_vars 0xc000005fcb688900
  link_vars = {
    phy_flags = 5 '\005',               <<<< BNX2_PHY_FLAG_SERDES / BNX2_PHY_FLAG_PARALLEL_DETECT
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',            <<<< says physical link is available
    link_up = 1 '\001',                <<<< says logical link is up
    line_speed = 10000, 
    duplex = 1,
------------------------------------------------
        u32                     phy_flags;
#define BNX2_PHY_FLAG_SERDES                    0x00000001
#define BNX2_PHY_FLAG_CRC_FIX                   0x00000002
#define BNX2_PHY_FLAG_PARALLEL_DETECT           0x00000004
#define BNX2_PHY_FLAG_2_5G_CAPABLE              0x00000008
#define BNX2_PHY_FLAG_INT_MODE_MASK             0x00000300
#define BNX2_PHY_FLAG_INT_MODE_AUTO_POLLING     0x00000100
#define BNX2_PHY_FLAG_INT_MODE_LINK_READY       0x00000200
#define BNX2_PHY_FLAG_DIS_EARLY_DAC             0x00000400
#define BNX2_PHY_FLAG_REMOTE_PHY_CAP            0x00000800
#define BNX2_PHY_FLAG_FORCED_DOWN               0x00001000
#define BNX2_PHY_FLAG_NO_PARALLEL               0x00002000
#define BNX2_PHY_FLAG_MDIX                      0x00004000
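
Putting the two decodes together, a small standalone helper using the define values quoted above (the BNX2_PHY_FLAG_* table is the one this analysis applied; it is used here purely as a decode table):

/* Decode the bnx2x.state and link_vars.phy_flags values read from the
 * vmcores, using the defines quoted in this comment. */
#include <stdio.h>

#define BNX2X_STATE_OPEN              0x3000  /* 12288 decimal */
#define BNX2X_STATE_ERROR             0xf000  /* 61440 decimal */
#define BNX2_PHY_FLAG_SERDES          0x00000001
#define BNX2_PHY_FLAG_PARALLEL_DETECT 0x00000004

static void decode(unsigned int state, unsigned int phy_flags)
{
        printf("state 0x%04x = %s, phy_flags 0x%x =%s%s%s\n", state,
               state == BNX2X_STATE_ERROR ? "BNX2X_STATE_ERROR" :
               state == BNX2X_STATE_OPEN  ? "BNX2X_STATE_OPEN"  : "other",
               phy_flags,
               phy_flags & BNX2_PHY_FLAG_SERDES ? " SERDES" : "",
               phy_flags & BNX2_PHY_FLAG_PARALLEL_DETECT ? " PARALLEL_DETECT" : "",
               phy_flags ? "" : " (none)");
}

int main(void)
{
        decode(61440, 0);  /* fab0/fab1: ERROR, no PHY flags */
        decode(12288, 5);  /* fab2/fab3: OPEN, SERDES|PARALLEL_DETECT */
        return 0;
}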

So the apparent issue is with fab0 and fab1 (dmesg had rolled over, so the earlier messages are gone).

----------------------------------------------------------------------------------------------------------------
node0103:  duplicating Jamie's work from #324 and #326

crash> net
   NET_DEVICE     NAME   IP ADDRESS(ES)
c000003fd0188000  lo     127.0.0.1
c000003fd1e04000  fab1   
c000003fd1e0c000  fab3   
c000003fd1e14000  mgt1   
c000003fd1e1c000  eth1   
c000005fd0e0c000  fab0   
c000005fd0e14000  fab2   
c000005fd0e1c000  mgt0   
c000005fd0e24000  eth0   
c000003fca391000  bond0  
c000003fc8f91000  fbond  9.0.226.18
c000003fc6769000  mbond  9.0.224.25, 9.0.231.25
c000007937302000  docker0 172.17.0.1

crash> net_device.operstate 0xc000003fd1e04000    <<< fab1
  operstate = 6 '\006'                            <<< IF_OPER_UP
crash> net_device.operstate c000003fd1e0c000      <<< fab3
  operstate = 6 '\006'                            <<< IF_OPER_UP
crash> net_device.operstate c000005fd0e0c000      <<< fab0
  operstate = 2 '\002'                            <<< IF_OPER_DOWN
crash> net_device.operstate c000005fd0e14000      <<< fab2
  operstate = 6 '\006'                            <<< IF_OPER_UP

crash> struct -ox net_device | grep SIZE
SIZE: 0x900
crash> px 0xc000003fc8f91000 +0x900
$2 = 0xc000003fc8f91900

crash> bonding 0xc000003fc8f91900 |grep slave_arr
  slave_arr = 0xc0000037807d2c80, 
  slave_arr_work = {
      func = 0xd0000000470f35a0 <bond_slave_arr_handler>

crash> bond_up_slave 0xc0000037807d2c80
struct bond_up_slave {
  count = 3,                            <<< 3 interfaces are available.
  rcu = {
    next = 0x0, 
    func = 0x0
  }, 
  arr = 0xc0000037807d2c98
}
-------------
#define BNX2X_STATE_OPEN                0x3000   /* 12288 decimal */
#define BNX2X_STATE_ERROR               0xf000   /* 61440 decimal */

fab1 good:
crash> px 0xc000003fd1e04000 +0x900
$3 = 0xc000003fd1e04900
crash> bnx2x.state 0xc000003fd1e04900
  state = 12288                            <<< BNX2X_STATE_OPEN
crash> bnx2x.link_vars 0xc000003fd1e04900
  link_vars = {
    phy_flags = 5 '\005',               <<<< BNX2_PHY_FLAG_SERDES / BNX2_PHY_FLAG_PARALLEL_DETECT
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',             <<<< says physical link is available
    link_up = 1 '\001',                 <<<< says logical link is up
    line_speed = 10000, 
    duplex = 1,
-------------------------------------------------

fab3 good:
crash> px 0xc000003fd1e0c000 +0x900
$4 = 0xc000003fd1e0c900
crash> bnx2x.state 0xc000003fd1e0c900
  state = 12288                           <<< BNX2X_STATE_OPEN
crash> bnx2x.link_vars 0xc000003fd1e0c900
  link_vars = {
    phy_flags = 5 '\005',                <<<< BNX2_PHY_FLAG_SERDES / BNX2_PHY_FLAG_PARALLEL_DETECT
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',              <<<< says physical link is available
    link_up = 1 '\001',                  <<<< says logical link is up
    line_speed = 10000, 
    duplex = 1,
-------------------------------------------------

fab0 not good:
crash> px 0xc000005fd0e0c000 +0x900
$5 = 0xc000005fd0e0c900
crash> bnx2x.state 0xc000005fd0e0c900
  state = 61440                          <<< BNX2X_STATE_ERROR
crash> bnx2x.link_vars 0xc000005fd0e0c900
  link_vars = {
    phy_flags = 0 '\000',              <<<< seems odd no flags set
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',            <<<< says physical link is available
    link_up = 0 '\000',                <<<< says logical link is down
    line_speed = 10000, 
    duplex = 1,
-------------------------------------------------

fab2 good:
crash> px 0xc000005fd0e14000 +0x900
$6 = 0xc000005fd0e14900
crash> bnx2x.state 0xc000005fd0e14900
  state = 12288                           <<< BNX2X_STATE_OPEN
crash> bnx2x.link_vars 0xc000005fd0e14900
  link_vars = {
    phy_flags = 5 '\005',                 <<<< BNX2_PHY_FLAG_SERDES / BNX2_PHY_FLAG_PARALLEL_DETECT
    mac_type = 4 '\004', 
    phy_link_up = 1 '\001',               <<<< says physical link is available
    link_up = 1 '\001',                   <<<< says logical link is up
    line_speed = 10000, 
    duplex = 1,
-------------------------------------------------

        u32                     phy_flags;
#define BNX2_PHY_FLAG_SERDES                    0x00000001
#define BNX2_PHY_FLAG_CRC_FIX                   0x00000002
#define BNX2_PHY_FLAG_PARALLEL_DETECT           0x00000004
#define BNX2_PHY_FLAG_2_5G_CAPABLE              0x00000008
#define BNX2_PHY_FLAG_INT_MODE_MASK             0x00000300
#define BNX2_PHY_FLAG_INT_MODE_AUTO_POLLING     0x00000100
#define BNX2_PHY_FLAG_INT_MODE_LINK_READY       0x00000200
#define BNX2_PHY_FLAG_DIS_EARLY_DAC             0x00000400
#define BNX2_PHY_FLAG_REMOTE_PHY_CAP            0x00000800
#define BNX2_PHY_FLAG_FORCED_DOWN               0x00001000
#define BNX2_PHY_FLAG_NO_PARALLEL               0x00002000
#define BNX2_PHY_FLAG_MDIX                      0x00004000

So the apparent issue is with fab0 (dmesg had rolled over, so the earlier messages are gone).

We need vendor assistance on this BZ: why are we getting the NETDEV WATCHDOG timeout and the bnx2x_panic_dump, and why do the interfaces not recover logically every time? A simple reboot recovers all interfaces.

We are currently waiting on the customer to test and report back whether they are able to recover the interface using the following command (supplied by Jamie):

ip link set dev fabX down; sleep 1; ip link set dev fabX up

The inability to automatically recover is preventing the customer from shipping their Appliance. Switch reboots are part of normal redundancy testing.

Comment 2 Brett Hull 2018-12-18 17:30:36 UTC
Workaround attempted, but it did not resolve the issue:

	[root@node0107 ~]# ip link set dev fab0 down
	[root@node0107 ~]#
  	  (paused 10 sec before next cmd)
	[root@node0107 ~]# ip link set dev fab0 up
	RTNETLINK answers: Device or resource busy
	[root@node0107 ~]#

	  (repeated 'up' cmd again after additional 10 sec pause)
	[root@node0107 ~]# ip link set dev fab0 up
	RTNETLINK answers: Device or resource busy
	[root@node0107 ~]#

	fab0 remained link state down and was no longer listed in 'ifconfig' output.
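
For what it's worth, "RTNETLINK answers: Device or resource busy" is just iproute2 printing strerror() of the errno the kernel returned: the netlink up request lands in __dev_open(), and whatever the driver's ndo_open handler returns (-EBUSY here, presumably because the device is still in its error/recovery state) propagates straight back to userspace. A simplified sketch of the generic path, not bnx2x-specific code:

/* Sketch of __dev_open() from net/core/dev.c: the driver's open
 * return value (-EBUSY for fab0 above) is handed back to iproute2,
 * which prints it as "Device or resource busy". */
static int __dev_open_sketch(struct net_device *dev)
{
        const struct net_device_ops *ops = dev->netdev_ops;
        int ret = 0;

        if (!netif_device_present(dev))
                return -ENODEV;

        if (ops->ndo_open)
                ret = ops->ndo_open(dev);   /* bnx2x_open() for fab0 */

        if (!ret)
                dev->flags |= IFF_UP;       /* only set on success */

        return ret;                         /* errno -> userspace */
}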

Brett

Comment 4 loberman 2018-12-18 18:42:18 UTC
Hello Jonathan

I added Cavium to the Bugzilla, as you may have seen.
Can we get Cavium to review?

Please review this bug and let us know what else is needed to get this resolved.
This inability to recover a network interface after a switch reboot is a show stopper
for the customer. They are unable to ship their product.

We have multiple vmcores we can make available.

Vmcores are on
optimus-ppc64le.gsslab.rdu2.redhat.com

retrace-server-interact 783705935 crash - node0103
retrace-server-interact 791663392 crash - node0107

The vmcores show the watchdog firing for the network device, so we will pass this on to engineering.

[201459.621866] NETDEV WATCHDOG: fab3 (bnx2x): transmit queue 5 timed out
[201459.621906] ------------[ cut here ]------------
[201459.621916] WARNING: CPU: 34 PID: 0 at net/sched/sch_generic.c:356 dev_watchdog+0x340/0x360
[201459.621994] Modules linked in: nfnetlink_queue nfnetlink_log bluetooth rfkill mmfs26(OE) mmfslinux(OE) tracedev(OE) tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nf_conntrack_netlink xt_addrtype br_netfilter xfs bonding ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_service_time dm_multipath leds_powernv ibmpowernv
[201459.622044]  ipmi_powernv powernv_rng i2c_opal i2c_core ses enclosure scsi_transport_sas dm_mod sg ipmi_devintf ipmi_msghandler binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sr_mod sd_mod cdrom lpfc bnx2x ipr nvmet_fc nvmet crc_t10dif crct10dif_generic nvme_fc libata nvme_fabrics nvme_core scsi_transport_fc mdio libcrc32c ptp pps_core scsi_tgt crct10dif_common [last unloaded: tracedev]
[201459.622051] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Tainted: G        W  OEL ------------   3.10.0-957.el7.ppc64le #1
[201459.622053] task: c000001fd170ae00 ti: c000007fffe64000 task.ti: c000001fd18ac000
[201459.622055] NIP: c000000000916560 LR: c00000000091655c CTR: 0000000000000000
[201459.622057] REGS: c000007fffe67a30 TRAP: 0700   Tainted: G        W  OEL ------------    (3.10.0-957.el7.ppc64le)
[201459.622066] MSR: 9000000100029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 42004024  XER: 00000000
[201459.622096] CFAR: c000000000a986b8 SOFTE: 1 
                GPR00: c00000000091655c c000007fffe67cb0 c0000000013e4d00 0000000000000039 
                GPR04: 0000000000000001 c0000000015b4d00 c00000011314d790 c0000000015b4d00 
                GPR08: 000000222514412a 0000000000000000 0000000000000000 c0000000015b4d00 
                GPR12: 0000000000004400 c000000007b33200 c000001fd18aff90 0000000010200040 
                GPR16: c00000010e88bd28 c00000010e88c128 c00000010e88c528 0000000000000000 
                GPR20: c00000010e88b928 c000000001422280 0000000000000000 0000000000000000 
                GPR24: 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000022 
                GPR28: 0000000000000004 c000000001422280 c000003fd1e90000 0000000000000005 
[201459.622099] NIP [c000000000916560] dev_watchdog+0x340/0x360
[201459.622101] LR [c00000000091655c] dev_watchdog+0x33c/0x360
[201459.622102] Call Trace:
[201459.622107] [c000007fffe67cb0] [c00000000091655c] dev_watchdog+0x33c/0x360 (unreliable)
[201459.622115] [c000007fffe67d50] [c0000000001058b8] call_timer_fn+0x68/0x170
[201459.622121] [c000007fffe67df0] [c0000000001080bc] run_timer_softirq+0x2dc/0x3a0
[201459.622125] [c000007fffe67ea0] [c0000000000f7364] __do_softirq+0x154/0x380
[201459.622131] [c000007fffe67f90] [c00000000002b1fc] call_do_softirq+0x14/0x24
[201459.622136] [c000001fd18afa40] [c0000000000161d0] do_softirq+0x130/0x180
[201459.622140] [c000001fd18afa80] [c0000000000f78f4] irq_exit+0x1f4/0x200
[201459.622146] [c000001fd18afac0] [c000000000024f94] timer_interrupt+0xa4/0xe0
[201459.622151] [c000001fd18afaf0] [c000000000002c14] decrementer_common+0x114/0x118
[201459.622156] --- Exception: 901 at arch_local_irq_restore+0xf0/0x140
                    LR = arch_local_irq_restore+0xf0/0x140
[201459.622174] [c000001fd18afde0] [c00000010e89f970] 0xc00000010e89f970 (unreliable)
[201459.622180] [c000001fd18afe00] [c000000000848870] cpuidle_idle_call+0x140/0x410
[201459.622187] [c000001fd18afe70] [c000000000089710] powernv_idle+0x20/0x50
[201459.622192] [c000001fd18afe90] [c00000000001d380] arch_cpu_idle+0x70/0x160
[201459.622197] [c000001fd18afec0] [c000000000180a00] cpu_startup_entry+0x190/0x210
[201459.622202] [c000001fd18aff20] [c000000000054c30] start_secondary+0x310/0x340
[201459.622206] [c000001fd18aff90] [c000000000009b6c] start_secondary_prolog+0x10/0x14
[201459.622207] Instruction dump:
[201459.622213] 994d02a4 4bffff14 7fc3f378 4bfc717d 60000000 7fc4f378 7fe6fb78 7c651b78 
[201459.622219] 3c62ff93 386379f8 48182101 60000000 <0fe00000> 39200001 3d02fff7 9928c5f1 
[201459.622220] ---[ end trace 23982464f60761ea ]---
[201459.622224] bnx2x: [bnx2x_panic_dump:923(fab3)]begin crash dump -----------------
[201459.622227] bnx2x: [bnx2x_panic_dump:933(fab3)]def_idx(0xf69)  def_att_idx(0x494)  attn_state(0x0)  spq_prod_idx(0x81) next_stats_cnt(0xf58)
[201459.622230] bnx2x: [bnx2x_panic_dump:938(fab3)]DSB: attn bits(0x0)  ack(0x1)  id(0x0)  idx(0x494)
[201459.622248] bnx2x: [bnx2x_panic_dump:939(fab3)]     def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x127b 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0)  igu_sb_id(0x0)  igu_seg_id(0x1) pf_id(0x0)  vnic_id(0x0)  vf_id(0xff)  vf_valid (0x0) state(0x1)
[201459.622252] bnx2x: [bnx2x_panic_dump:990(fab3)]fp0: rx_bd_prod(0xf1e4)  rx_bd_cons(0x1f)  rx_comp_prod(0xf7d6)  rx_comp_cons(0xf60a)  *rx_cons_sb(0xf60a)
[201459.622254] bnx2x: [bnx2x_panic_dump:993(fab3)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x4e0d)
[201459.622258] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp0: tx_pkt_prod(0x6c22)  tx_pkt_cons(0x6c22)  tx_bd_prod(0xf10a)  tx_bd_cons(0xf109)  *tx_cons_sb(0x6c22)
[201459.622262] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp0: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
[201459.622265] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp0: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
[201459.622269] bnx2x: [bnx2x_panic_dump:1021(fab3)]     run indexes (0x4e0d 0x0)
[201459.622291] bnx2x: [bnx2x_panic_dump:1027(fab3)]     indexes (0x0 0xf60a 0x0 0x0 0x0 0x6c22 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
[201459.622294] SM[0] __flags (0x0) igu_sb_id (0x2)  igu_seg_id(0x0) time_to_expire (0xa62f494a) timer_value(0xff)
[201459.622296] SM[1] __flags (0x0) igu_sb_id (0x2)  igu_seg_id(0x0) time_to_expire (0xa62e3f65) timer_value(0xff)
[201459.622297] INDEX[0] flags (0x0) timeout (0x0)
[201459.622298] INDEX[1] flags (0x2) timeout (0x6)
[201459.622300] INDEX[2] flags (0x0) timeout (0x0)
[201459.622301] INDEX[3] flags (0x0) timeout (0x0)
[201459.622302] INDEX[4] flags (0x1) timeout (0x0)
[201459.622303] INDEX[5] flags (0x3) timeout (0xc)
[201459.622305] INDEX[6] flags (0x3) timeout (0xc)
[201459.622306] INDEX[7] flags (0x3) timeout (0xc)
[201459.622310] bnx2x: [bnx2x_panic_dump:990(fab3)]fp1: rx_bd_prod(0x2419)  rx_bd_cons(0x252)  rx_comp_prod(0x4304)  rx_comp_cons(0x4137)  *rx_cons_sb(0x4137)
[201459.622312] bnx2x: [bnx2x_panic_dump:993(fab3)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x547d)
[201459.622315] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp1: tx_pkt_prod(0x8954)  tx_pkt_cons(0x8954)  tx_bd_prod(0x4bb4)  tx_bd_cons(0x4bb3)  *tx_cons_sb(0x8954)
[201459.622318] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp1: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
[201459.622321] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp1: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
[201459.622325] bnx2x: [bnx2x_panic_dump:1021(fab3)]     run indexes (0x547d 0x0)
[201459.622346] bnx2x: [bnx2x_panic_dump:1027(fab3)]     indexes (0x0 0x4137 0x0 0x0 0x0 0x8954 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
[201459.622349] SM[0] __flags (0x0) igu_sb_id (0x3)  igu_seg_id(0x0) time_to_expire (0xa62efa12) timer_value(0xff)
[201459.622351] SM[1] __flags (0x0) igu_sb_id (0x3)  igu_seg_id(0x0) time_to_expire (0xa60b9fbf) timer_value(0xff)
[201459.622352] INDEX[0] flags (0x0) timeout (0x0)
[201459.622354] INDEX[1] flags (0x2) timeout (0x6)
[201459.622355] INDEX[2] flags (0x0) timeout (0x0)
[201459.622356] INDEX[3] flags (0x0) timeout (0x0)
[201459.622357] INDEX[4] flags (0x1) timeout (0x0)
[201459.622358] INDEX[5] flags (0x3) timeout (0xc)
[201459.622360] INDEX[6] flags (0x3) timeout (0xc)
[201459.622361] INDEX[7] flags (0x3) timeout (0xc)
[201459.622365] bnx2x: [bnx2x_panic_dump:990(fab3)]fp2: rx_bd_prod(0x51a1)  rx_bd_cons(0xfda)  rx_comp_prod(0x64e7)  rx_comp_cons(0x631b)  *rx_cons_sb(0x631b)
[201459.622367] bnx2x: [bnx2x_panic_dump:993(fab3)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x1a6d)
[201459.622370] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp2: tx_pkt_prod(0xe10f)  tx_pkt_cons(0xe10f)  tx_bd_prod(0x393d)  tx_bd_cons(0x393c)  *tx_cons_sb(0xe10f)
[201459.622373] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp2: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
[201459.622376] bnx2x: [bnx2x_panic_dump:1010(fab3)]fp2: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
[201459.622379] bnx2x: [bnx2x_panic_dump:1021(fab3)]     run indexes (0x1a6d 0x0)
[201459.622400] bnx2x: [bnx2x_panic_dump:1027(fab3)]     indexes (0x0 0x631b 0x0 0x0 0x0 0xe10f 0x0 0x0)pf_id(0x0)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
[201459.622402] SM[0] __flags (0x0) igu_sb_id (0x4)  igu_seg_id(0x0) time_to_expire (0xa6322e60) timer_value(0xff)
[201459.622405] SM[1] __flags (0x0) igu_sb_id (0x4)  igu_seg_id(0x0) time_to_expire (0xa6069e0e) timer_value(0xff)
[201459.622406] INDEX[0] flags (0x0) timeout (0x0)
[201459.622407] INDEX[1] flags (0x2) timeout (0x6)
 

Regards
Laurence Oberman

Comment 5 Jonathan Toppins 2018-12-18 20:14:04 UTC
If you have a recreate you can try this kernel:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19549933

It contains the patches from upstream for bz1643534, a general driver update request. I have no reason to think this will solve their problem, since link issues on Broadcom-type hardware (bnx2x was originally a Broadcom device) usually mean some sort of firmware issue.

I added the standard Cavium contacts and engineering partner manager to the bug.

Comment 8 Ameen Rahman 2018-12-18 21:11:46 UTC
Sudarsana will help debug this from Cavium/Marvell end.

Comment 9 loberman 2019-01-08 15:56:09 UTC
Hello Jonathan

Latest testing with the test kernel, as expected, did not help.

Netezza has just escalated this issue, so we will have to get the vendor (I guess Cavium) to make progress on this bnx2x issue.

Jan  4 16:13:13 node0101 kernel: NETDEV WATCHDOG: mgt1 (bnx2x): transmit queue 7 timed out
Jan  4 16:13:13 node0101 kernel: ibmpowernv ipmi_powernv powernv_rng i2c_opal i2c_core ses enclosure scsi_transport_sas dm_mod sg binfmt_misc ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 mbcache jbd2 sd_mod sr_mod cdrom lpfc bnx2x ipr nvmet_fc nvmet crc_t10dif crct10dif_generic libata nvme_fc nvme_fabrics nvme_core mdio libcrc32c scsi_transport_fc ptp pps_core scsi_tgt crct10dif_common [last unloaded: ip_tables]
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:923(mgt1)]begin crash dump -----------------
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:933(mgt1)]def_idx(0x1caa)  def_att_idx(0x1a)  attn_state(0x0)  spq_prod_idx(0xc2) next_stats_cnt(0x1c98)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:938(mgt1)]DSB: attn bits(0x0)  ack(0x4)  id(0x0)  idx(0x1a)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:939(mgt1)]     def (0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x1cc6 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0)  igu_sb_id(0x44)  igu_seg_id(0x1) pf_id(0x1)  vnic_id(0x0)  vf_id(0xff)  vf_valid (0x0) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp0: rx_bd_prod(0xad49)  rx_bd_cons(0xcc9)  rx_comp_prod(0xaf5d)  rx_comp_cons(0xaedb)  *rx_cons_sb(0xaedb)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xa36)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp0: tx_pkt_prod(0x71b1)  tx_pkt_cons(0x71b1)  tx_bd_prod(0xfe60)  tx_bd_cons(0xfe5f)  *tx_cons_sb(0x71b1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp0: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp0: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0xa36 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0xaedb 0x0 0x0 0x0 0x71b1 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp1: rx_bd_prod(0xad71)  rx_bd_cons(0xcf1)  rx_comp_prod(0xaf86)  rx_comp_cons(0xaf04)  *rx_cons_sb(0xaf04)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x9cb9)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp1: tx_pkt_prod(0xf230)  tx_pkt_cons(0xf230)  tx_bd_prod(0xe646)  tx_bd_cons(0xe645)  *tx_cons_sb(0xf230)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp1: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp1: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0x9cb9 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0xaf04 0x0 0x0 0x0 0xf230 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp2: rx_bd_prod(0x222b)  rx_bd_cons(0x1a9)  rx_comp_prod(0x25a2)  rx_comp_cons(0x2520)  *rx_cons_sb(0x2520)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x5065)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp2: tx_pkt_prod(0x3272)  tx_pkt_cons(0x3272)  tx_bd_prod(0x677d)  tx_bd_cons(0x677c)  *tx_cons_sb(0x3272)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp2: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp2: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0x5065 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0x2520 0x0 0x0 0x0 0x3272 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp3: rx_bd_prod(0x376)  rx_bd_cons(0x2f6)  rx_comp_prod(0x691)  rx_comp_cons(0x60f)  *rx_cons_sb(0x60f)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xe68)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp3: tx_pkt_prod(0x3933)  tx_pkt_cons(0x3933)  tx_bd_prod(0x74da)  tx_bd_cons(0x74d9)  *tx_cons_sb(0x3933)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp3: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp3: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0xe68 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0x60f 0x0 0x0 0x0 0x3933 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp4: rx_bd_prod(0x58d7)  rx_bd_cons(0x857)  rx_comp_prod(0x5cf5)  rx_comp_cons(0x5c73)  *rx_cons_sb(0x5c73)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0x61bd)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp4: tx_pkt_prod(0xc680)  tx_pkt_cons(0xc680)  tx_bd_prod(0xa4d6)  tx_bd_cons(0xa4d5)  *tx_cons_sb(0xc680)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp4: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp4: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0x61bd 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0x5c73 0x0 0x0 0x0 0xc680 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp5: rx_bd_prod(0xf2ff)  rx_bd_cons(0x27f)  rx_comp_prod(0xf5e7)  rx_comp_cons(0xf565)  *rx_cons_sb(0xf565)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xbe04)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp5: tx_pkt_prod(0xcd38)  tx_pkt_cons(0xcd38)  tx_bd_prod(0x9c0c)  tx_bd_cons(0x9c0b)  *tx_cons_sb(0xcd38)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp5: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp5: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0xbe04 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0xf565 0x0 0x0 0x0 0xcd38 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp6: rx_bd_prod(0xcfd9)  rx_bd_cons(0xf59)  rx_comp_prod(0xd257)  rx_comp_cons(0xd1d5)  *rx_cons_sb(0xd1d5)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xbda5)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp6: tx_pkt_prod(0xf128)  tx_pkt_cons(0xf128)  tx_bd_prod(0xe466)  tx_bd_cons(0xe465)  *tx_cons_sb(0xf128)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp6: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp6: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0xbda5 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0xd1d5 0x0 0x0 0x0 0xf128 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:990(mgt1)]fp7: rx_bd_prod(0x4c73)  rx_bd_cons(0xbf1)  rx_comp_prod(0x506b)  rx_comp_cons(0x4fe9)  *rx_cons_sb(0x4fe9)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:993(mgt1)]     rx_sge_prod(0x0)  last_max_sge(0x0)  fp_hc_idx(0xf706)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp7: tx_pkt_prod(0xed20)  tx_pkt_cons(0xed02)  tx_bd_prod(0xdc1c)  tx_bd_cons(0xdbde)  *tx_cons_sb(0xed20)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp7: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1010(mgt1)]fp7: tx_pkt_prod(0x0)  tx_pkt_cons(0x0)  tx_bd_prod(0x0)  tx_bd_cons(0x0)  *tx_cons_sb(0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1021(mgt1)]     run indexes (0xf724 0x0)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1027(mgt1)]     indexes (0x0 0x4fe9 0x0 0x0 0x0 0xed20 0x0 0x0)pf_id(0x1)  vf_id(0xff)  vf_valid(0x0) vnic_id(0x0)  same_igu_sb_1b(0x1) state(0x1)
Jan  4 16:13:13 node0101 kernel: bnx2x 0002:01:00.2 mgt1: bc 7.10.4
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_mc_assert:750(mgt1)]Chip Revision: everest3, FW Version: 7_13_1
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_panic_dump:1186(mgt1)]end crash dump -----------------
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_sp_rtnl_task:10298(mgt1)]Indicating link is down due to Tx-timeout
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_clean_tx_queue:1206(mgt1)]timeout waiting for queue[7]: txdata->tx_pkt_prod(60704) != txdata->tx_pkt_cons(60674)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_clean_tx_queue:1206(mgt1)]timeout waiting for queue[7]: txdata->tx_pkt_prod(60704) != txdata->tx_pkt_cons(60674)
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_state_wait:310(mgt1)]timeout waiting for state 9
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_flr_clnup_poll_hw_counter:1296(mgt1)]CFC PF usage counter timed out usage count=2
Jan  4 16:13:13 node0101 kernel: bnx2x 0002:01:00.2 mgt1: bc 7.10.4
Jan  4 16:13:13 node0101 kernel: bnx2x: [bnx2x_nic_load:2728(mgt1)]HW init failed, aborting
Jan  4 16:13:13 node0101 kernel: bnx2x 0002:01:00.2 mgt1: speed changed to 0 for port mgt1
Jan  4 16:16:56 node0101 multipathd: 3600507682187458820000000110000b2: sdbnx - tur checker reports path is up
Jan  4 16:16:56 node0101 multipathd: 3600507682187458820000000110000b2: sdbnx - tur checker reports path is up
Jan  4 16:26:37 node0101 kernel: bnx2x: [bnx2x_get_regs:1006(fab1)]Generating register dump. Might trigger harmless GRC timeouts

Comment 11 loberman 2019-01-08 16:27:47 UTC
The customer has this additional message to share; some of it has already been shared above.

[root@node0101 ~]# ip link set dev mgt1 down; sleep 1; ip link set dev mgt1 up
RTNETLINK answers: Device or resource busy

Jan  8 10:57:36 node0101 systemd-logind: New session 57200 of user root.
Jan  8 10:57:36 node0101 systemd: Started Session 57200 of user root.
Jan  8 10:57:36 node0101 systemd: Started Session 57200 of user root.
Jan  8 10:57:36 node0101 systemd: Starting Session 57200 of user root.
Jan  8 10:57:36 node0101 systemd: Starting Session 57200 of user root.
Jan  8 10:57:36 node0101 nslcd[111442]: [8f3ef3] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:36 node0101 nslcd[111442]: [8f3ef3] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:36 node0101 nslcd[111442]: [8f3ef3] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:36 node0101 nslcd[111442]: [8f3ef3] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:36 node0101 systemd-logind: Removed session 57200.
Jan  8 10:57:51 node0101 kernel: bnx2x: [bnx2x_flr_clnup_poll_hw_counter:1296(mgt1)]CFC PF usage counter timed out usage count=2
Jan  8 10:57:51 node0101 kernel: bnx2x 0002:01:00.2 mgt1: bc 7.10.4
Jan  8 10:57:51 node0101 kernel: begin fw dump (mark 0x3c6908)
Jan  8 10:57:51 node0101 kernel: 0x4#012attn 0x4->0x0#012attn 0x0->0x4
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#01
Jan  8 10:57:51 node0101 kernel: attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012attn 0x0->0x4#012attn 0x4->0x0#012f2: LOAD_REQ 0x1
Jan  8 10:57:51 node0101 kernel: end of fw dump
Jan  8 10:57:51 node0101 kernel: bnx2x: [bnx2x_nic_load:2728(mgt1)]HW init failed, aborting
Jan  8 10:57:54 node0101 nslcd[111442]: [964162] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 nslcd[111442]: [964162] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 nslcd[111442]: [964162] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 nslcd[111442]: [964162] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 systemd-logind: New session 57202 of user root.
Jan  8 10:57:54 node0101 systemd: Started Session 57202 of user root.
Jan  8 10:57:54 node0101 systemd: Started Session 57202 of user root.
Jan  8 10:57:54 node0101 systemd: Starting Session 57202 of user root.
Jan  8 10:57:54 node0101 systemd: Starting Session 57202 of user root.
Jan  8 10:57:54 node0101 nslcd[111442]: [aeb49e] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 nslcd[111442]: [aeb49e] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 nslcd[111442]: [aeb49e] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required
Jan  8 10:57:54 node0101 nslcd[111442]: [aeb49e] <group/member="root"> ldap_result() failed: Server is unwilling to perform: authentication required

Comment 13 Jonathan Toppins 2019-01-08 17:43:29 UTC
Thoughts on the firmware dump in comments #11 and #9?

Comment 14 Jamie Bainbridge 2019-02-25 23:51:18 UTC
The hardware vendor's support has provided a method to reproduce, and a set of text firmware register dumps. I am including their latest case comment below and attaching the register dumps for Cavium review.

--------------------------------------------------------------------------------
Let me tell the issue's story the way I know it, since I don't have access to the Red Hat BZ.

So at the beginning the thought was that we were experiencing a couple of bugs here:

  - the NETDEV WATCHDOG queue timeout for the bnx2x driver;
  - the bonding driver complaining about the iface without link and removing it from its slave pool;
  - the iface becoming unresponsive and unrecoverable.

In the end it's all just one and the same issue: the timeout happens, the driver resets the adapter in order to recover from it, the adapter fails to recover, the driver sets the speed to zero, and bonding removes the slave.
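
For reference, a minimal way to watch that chain live (a sketch of ours, not part of the original test; the fab0/bond0 names are assumptions) is to poll the kernel's own view of the slave:

  while true; do
      cat /sys/class/net/fab0/speed                                # drops to 0 (or the read fails) once the reset fails
      grep -A 3 'Slave Interface: fab0' /proc/net/bonding/bond0    # bonding's MII status and speed for the slave
      sleep 5
  done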

Going through each step of the issue: 
  - NETDEV WATCHDOG timeout
 It happens *consistently* when the test systems are put under stress; specifically, the test consists of 12 rounds of turning off a fibre channel switch behind which there are around 2000 multipath connections. The issue reproduces either while multipath is timing out and marking the paths down, or at the moment the paths come back up and the system starts to recover everything. I tried to reproduce this situation with other stressors such as HTX, using multiple exercisers (cpu, memory, disk and network) simultaneously, with no success so far.
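
A rough way to script the path-flap part of that test without touching the switch (our approximation only, built on multipathd's interactive commands; the sdX names are placeholders for the paths behind the switch under test) would be:

  for round in $(seq 1 12); do
      for p in sdaa sdab sdac; do
          multipathd -k"fail path $p"         # mark the path failed, as the switch outage would
      done
      sleep 120                               # window in which multipath times the paths out
      for p in sdaa sdab sdac; do
          multipathd -k"reinstate path $p"    # bring the paths back, as the switch reboot would
      done
      sleep 120                               # recovery window, where the Tx-timeout was observed
  done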

  - The adapter fails to recover:
 Once the adapter needs to reset you'll see a lot of register and panic dumping going on. No alarm on that, unless the dumped registers point to something I couldn't yet figure out due to the lack of a register map for this adapter. Once the dump is done you see:

...
 Jan 28 16:11:39 node0103 kernel: bnx2x: [bnx2x_panic_dump:1186(fab0)]end crash dump -----------------
 Jan 28 16:11:39 node0103 kernel: bnx2x: [bnx2x_sp_rtnl_task:10288(fab0)]Indicating link is down due to Tx-timeout
 Jan 28 16:11:39 node0103 kernel: bnx2x: [bnx2x_clean_tx_queue:1205(fab0)]timeout waiting for queue[3]: txdata->tx_pkt_prod(50754) != txdata->tx_pkt_cons(50687)
 Jan 28 16:11:39 node0103 kernel: bnx2x: [bnx2x_clean_tx_queue:1205(fab0)]timeout waiting for queue[3]: txdata->tx_pkt_prod(50754) != txdata->tx_pkt_cons(50687)
 Jan 28 16:11:39 node0103 kernel: bnx2x: [bnx2x_state_wait:310(fab0)]timeout waiting for state 9
 ...
 Jan 28 16:11:41 node0103 kernel: bnx2x: [bnx2x_flr_clnup_poll_hw_counter:1296(fab0)]CFC PF usage counter timed out usage count=5
 Jan 28 16:11:41 node0103 kernel: bnx2x 0004:01:00.0 fab0: bc 7.10.4              
 Jan 28 16:11:41 node0103 kernel: bnx2x: [bnx2x_nic_load:2728(fab0)]HW init failed, aborting                                                                                                                        
 Jan 28 16:11:41 node0103 kernel: bnx2x 0004:01:00.0 fab0: speed changed to 0 for port fab0

As far as I can tell the firmware (a RAMROD?) timed out transitioning to the state required to perform the re-initialization, which is later surfaced as -EBUSY. *After that the driver disables the iface, including deallocating its IRQs* (we can't find the IRQ for this iface in /proc/interrupts, for instance), and there was nothing I could do to bring the iface up again; even removing and re-probing the module didn't help. Other ifaces on the same device, and thus on the same firmware, are not affected. It only recovers once you reboot the system.
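
Confirming that state amounts to checks like these (fab0 is the assumed failed iface; the module reload mirrors what was tried above):

  grep fab0 /proc/interrupts || echo "no IRQs registered for fab0"    # the failed iface's vectors are gone
  ip link set dev fab0 up                                             # fails with "RTNETLINK answers: Device or resource busy"
  modprobe -r bnx2x && modprobe bnx2x                                 # per the above, even a module reload does not bring it back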

 FWIW, I tried to "increase" the recovery level by changing the UNLOAD_NORMAL parameter in the calls to bnx2x_nic_unload to UNLOAD_RECOVERY (as happens in EEH situations); it didn't help.
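
For anyone wanting to repeat that experiment, the call sites are easy to locate in the driver tree (path per the usual upstream layout; adjust for the source package in use):

  grep -rn "UNLOAD_NORMAL" drivers/net/ethernet/broadcom/bnx2x/    # every spot where the unload mode is passed to bnx2x_nic_unload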

  - The bond removes the failed slave.
 At least so far, the bonding seems to me to be acting just as expected: miitool detects the speed, and thus the link, of the iface going down to zero and just reports that, and the other iface in the bond carries the traffic alone.

Note: it's an LACP bond (802.3ad).
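
The bond's view is visible directly from the host (bond0 is an assumed name for the customer's bond):

  cat /proc/net/bonding/bond0    # Bonding Mode should read: IEEE 802.3ad Dynamic link aggregation
  ethtool fab0                   # the reported speed should mirror the "speed changed to 0" message above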

Some other information worth noting:
  - We tested this running the following firmware versions:
    * e4148a1614109304.30100150
 and this updated version, which I installed using the instructions at https://delivery04.dhe.ibm.com/sar/CMA/IOA/0519t/1/Shiner-S_EN0S_EN0U_EN0T_EN0V_30100150_readme_V5-AIXandLinux.html
    * e4148a1614109304.30100310 (please note this is internal to IBM, and not yet available on Fix Central)

Both firmwares fail equally.
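
The firmware level actually running can be read off the interface (how it maps to the IBM package names above is for the customer to confirm):

  ethtool -i fab0    # the firmware-version field shows the level the adapter is currently running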

 It would be great if Red Hat could mirror the bugzilla back to ltc-bz so we can communicate through it and I can upload the sosreport from our latest run, which is too big to attach; also, I think it's time to let Cavium know about the issue, if they don't know about it yet, and get some guidance here.

 Four register dumps are attached:
  * mgt0_bad_interface: for this run mgt0 is the interface that failed; this is its register dump.
  * fab2_good_interface_different_bond_same_nic: reg dump for an interface in a good state, from a different bond, same NIC.
  * fab1_good_interface_different_bond_different_nic: reg dump for an interface in a good state, from a different bond and a different NIC (we have 2 NICs in the system).
  * mgt1_good_interface_same_bond_different_nic: reg dump for an interface in a good state, same bond, but a different NIC.
--------------------------------------------------------------------------------

Comment 15 Jamie Bainbridge 2019-02-25 23:52:34 UTC
Created attachment 1538627 [details]
register dumps

Comment 16 Jamie Bainbridge 2019-02-25 23:53:56 UTC
Please comment on the supplied reproduction steps and register dumps.

Comment 17 sudarsana.kalluru 2019-03-01 16:15:12 UTC
The Marvell FW team analyzed the grcDump; it doesn't reveal the cause of the tx-timeout or the recovery failure. Is this issue reproducible with the out-of-box driver?
The FW team requires the following data for further analysis:
  - a grcDump collected before the recovery process triggers. This needs some instrumentation of the driver code, hence the need for a repro with the OOB driver.
  - a recording of the device's internal debug data (also known as Recorded Handlers). This requires running the Marvell debug tool on the SUT until the timeout issue is hit. The debug data will be formatted into Ethernet packets and placed on the Tx of the interface; a packet sniffer needs to run on the peer interface to collect these packets.
Please let us know if the customer is willing to perform these tests.

Comment 18 Jamie Bainbridge 2019-03-03 23:10:25 UTC
Thank you very much, I will ask the customer.

Specifically which OOB driver would you like used?

Please provide an RPM, source, or some other installation method. Keep in mind the customer is using the ppc64le architecture.

Comment 19 Jamie Bainbridge 2019-03-04 01:38:25 UTC
The customer is willing to test. Please provide the OOB driver you'd like used and the debug tool.

Comment 20 sudarsana.kalluru 2019-03-05 06:54:37 UTC
Created attachment 1540856 [details]
Driver source package

Comment 21 sudarsana.kalluru 2019-03-05 07:01:01 UTC
Created attachment 1540857 [details]
Edebug - Marvell tool for collecting the detailed debug information

Comment 22 sudarsana.kalluru 2019-03-05 07:01:53 UTC
Thanks for your help.
Please find attached the driver source and edebug (a Marvell internal tool) packages.

The steps for building/installing the driver are straightforward:
  # tar xvf bnx2x-1.714.27-1.tar.gz
  # cd bnx2x-1.714.27-1/src/
  # make
  # insmod bnx2x.ko
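
If the inbox bnx2x module is already loaded, it presumably needs to be swapped out first; a hedged sketch of the full sequence, assuming no iface is still holding the module:

  # modprobe -r bnx2x                       (unload the inbox driver first)
  # insmod bnx2x.ko                         (load the freshly built OOB module)
  # modinfo bnx2x.ko | grep -i "^version"   (confirm the 1.714.27 OOB build is the one installed)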

The edebug tool makes the adapter send debug information/packets out of the interface's Tx; these packets need to be captured on the peer interface with a sniffer (e.g., tcpdump) into a pcap file.
The steps for collecting the debug data follow.
1. Extract/build the tool. Please refer to the README file for more info.
   tar xvf edebug_linux_ver_1.0.26.tar.gz
   make (or make DONT_USE_MACHINE_TCL=1)
2. Start the edebug tool. A new CLI interface will be spawned.
   #./load.sh -b10eng
   1 :  57810:B0    00:07:00:00 PCIE-8 5.0     00:10:18:A7:18:30  7.12.83.0  Mp,10G     XX p3p1    D0  1.714.12-b bnx2x
   2 :  57810:B0    00:07:00:00 PCIE-8 5.0     00:10:18:A7:18:30  7.12.83.0  Mp,10G     XY p3p2    D0  1.714.12-b bnx2x
   1:> 
3. Execute the following commands at the debug tool prompt:
   1:> source scripts/dbgTools.tcl

   1:> setp RH_ALL_STORMS -network h1 -port0

   << Start tcpdump/packet sniffer on the peer interface to collect the debug packets >>

   1:> dbgOn

   << Wait for the issue to occur >>

   1:> dbgEnd

    << Stop the tcpdump >>

4. Collect "ethtool -d <interface>" output on the host interface (SUT).

Please share the pcap file and the ethtool output collected in steps (3) and (4) respectively.
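
Concretely, the capture in step 3 and the dump in step 4 might look like this (peer0 and fab0 are placeholder interface names):

  # tcpdump -i peer0 -w bnx2x_debug.pcap    (on the peer box, left running between dbgOn and dbgEnd)
  # ethtool -d fab0 > fab0_regdump.txt      (on the SUT, after the timeout issue hits)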

Thanks,
Sudarsana

Comment 23 Hanns-Joachim Uhl 2019-03-05 11:28:21 UTC
Hello Red Hat / Brett,
... I got the request to reverse mirror this Red Hat bugzilla to IBM ...
... i.e. can this Red Hat bugzilla be reverse mirrored to IBM ...? 
... Please confirm or advise ...
Thanks in advance for your support.

Comment 24 Brett Hull 2019-03-06 01:35:39 UTC
Hello Sudarsana,

  Thank you very much for the driver/tool and instructions. I have sent them to the customer. 

IBM has requested that this BZ be reverse mirrored to IBM. I feel this is fine, but wanted to verify with you.

Best regards,
Brett

Comment 27 Ameen Rahman 2019-03-08 15:39:14 UTC
Not sure what reverse mirror means. Are they asking for access to the BZ? We are fine with that.

Comment 28 Hanns-Joachim Uhl 2019-03-08 16:12:29 UTC
(In reply to Ameen Rahman from comment #27)
> Not sure what reverse mirror means. Are they asking for access to the BZ?
> We are fine with that.
.
... thanks for the notice ... IBM now has access to this Red Hat bugzilla ...

Comment 29 IBM Bug Proxy 2019-10-29 22:20:37 UTC
------- Comment From seg@us.ibm.com 2019-10-29 18:19 EDT-------
Nobody is working on it and nobody from the submitter side is driving it, AFAICT. Time to close.

Comment 30 Hanns-Joachim Uhl 2019-11-14 14:46:05 UTC
ok ... per the previous comments, and with RHEL 7.7 now GA,
I am closing this Red Hat bugzilla ...
... please correct me if I am wrong ...

Comment 31 loberman 2019-11-14 14:51:34 UTC
Hello,
The last I heard was that this was tested and validated with changes to the driver, so yes, we can close it.

Regards
Laurence

