Description of problem:
I was running smb_torture with ctdb on a four-node RHEL6 cluster. After running it successfully 7 or 8 times, one of the nodes failed with the errors shown in "Actual Results". I was able to recreate the problem a second time.

Version-Release number of selected component (if applicable):
2.6.32-167.el6.bz676626.x86_64

How reproducible:
Unknown, but it recreated twice for me already, so I think it's likely to recreate again.

Steps to Reproduce:
1. On all nodes, start the cluster:
   service cman start
   service clvmd start
2. From one node, create the two gfs2 file systems:
   mkfs.gfs2 -j4 -p lock_dlm -t intec_cluster:ctdb /dev/intec/ctdb -O &&
   mkfs.gfs2 -j4 -p lock_dlm -t intec_cluster:gfs2 /dev/intec/intec1 -O
3. From all nodes, mount the GFS2 file systems:
   mount -t gfs2 /dev/intec/ctdb /mnt/ctdb/ &&
   mount -t gfs2 /dev/intec/intec1 /mnt/gfs2a/
4. From all nodes, start ctdb:
   service ctdb start
5. From the driver system, gfs-i24c-01, start smb_torture:
   cd /home/bob/samba/bin/default/source4/torture
   ./smbtorture //localhost/data -U testmonkey%password bench.nbench --unclist=/root/unclist.txt --num-progs=32 -t600
6. Keep doing #5 repeatedly.

Actual results:
It may run successfully up to ten times in a row, but then:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 4 timed out
Modules linked in: gfs2(U) iptable_filter ip_tables dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff8107ac54>] ? mod_timer+0x144/0x220
 [<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
 [<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
 [<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
 [<ffffffff8106fa25>] ? irq_exit+0x85/0x90
 [<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
 [<ffffffff8101433d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c59fa>] ? rest_init+0x7a/0x80
 [<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
 [<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace fbbba1dfce3a091d ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct endian mode
etc.

Expected results:
All tests should be successful.

Additional info:
This test was performed in order to test the patch for bug #676626, so you need to have that GFS2 patch in order to run it; otherwise GFS2 may have serious issues (see that bug for details). We might be able to recreate it by flooding the card with other traffic or other dlm traffic.
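Step 6 of the reproduction ("keep doing #5 repeatedly") could be scripted rather than done by hand. The following is only a minimal sketch, assuming a POSIX shell; the helper name run_until_failure and the iteration cap are hypothetical and were not part of the original test setup.

```shell
# Hypothetical helper: run a command up to MAX times and stop at the
# first non-zero exit status.
run_until_failure() {
    max=$1
    shift
    i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        if ! "$@"; then
            echo "failed on iteration $i"
            return 1
        fi
    done
    echo "completed $max iterations"
}
```

For example, `run_until_failure 20 ./smbtorture //localhost/data -U testmonkey%password bench.nbench --unclist=/root/unclist.txt --num-progs=32 -t600` would repeat step 5 up to 20 times. Whether smbtorture's exit status reflects the NIC failure is an assumption; the watchdog message itself shows up in the kernel log of the affected node.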
Created attachment 512502 [details]
patch to swap endianness if the device soft resets to opposite endianness

Bob, can you try this patch out please? I've returned your system to you. Thanks!
Good news and bad news:

The good news is that I was able to recreate the problem with only three of the four nodes, and without GFS2. To recreate the problem, I used a DLM test tool given to me by Dave Teigland called dlm_load. The original source is in:
http://fedorapeople.org/gitweb?p=teigland/public_git/dct-stuff.git;a=tree;f=dlm;hb=HEAD

I modified the tool so I could control the sleep parameters to push the ethernet driver harder. The new "-S" parameter makes it sleep every iteration rather than every ten iterations. The new "-s" parameter specifies the usleep time (in microseconds). The default is 200000 (1/5 of a second). I tightened it to 2.

So to recreate the problem, I'm running this command on three nodes simultaneously, with cluster-ssh (cssh):
/home/bob/dlm_load -i 1000000 -q -S -s 2

I had to run this command several times before the failure recreated for me. The failure has always occurred on gfs-a16c-04, so I'm still not ruling out hardware problems.

The bad news is that Neil's patch from comment #2 did not seem to help at all. I still got:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26d/0x280() (Not tainted)
Hardware name: PowerEdge R815
NETDEV WATCHDOG: em1 (bnx2): transmit queue 1 timed out
Modules linked in: dlm configfs autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log power_meter dcdbas microcode serio_raw ]
Pid: 0, comm: swapper Not tainted 2.6.32-167.el6.bz676626.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff81067657>] ? warn_slowpath_common+0x87/0xc0
 [<ffffffff81067746>] ? warn_slowpath_fmt+0x46/0x50
 [<ffffffff8143cc4d>] ? dev_watchdog+0x26d/0x280
 [<ffffffff81079825>] ? internal_add_timer+0xb5/0x110
 [<ffffffff8143c9e0>] ? dev_watchdog+0x0/0x280
 [<ffffffff8107a457>] ? run_timer_softirq+0x197/0x340
 [<ffffffff8109e550>] ? tick_sched_timer+0x0/0xc0
 [<ffffffff8102a2fd>] ? lapic_next_event+0x1d/0x30
 [<ffffffff8106fc41>] ? __do_softirq+0xc1/0x1d0
 [<ffffffff81093200>] ? hrtimer_interrupt+0x140/0x250
 [<ffffffff8100c20c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de45>] ? do_softirq+0x65/0xa0
 [<ffffffff8106fa25>] ? irq_exit+0x85/0x90
 [<ffffffff814e5f80>] ? smp_apic_timer_interrupt+0x70/0x9b
 [<ffffffff8100bbd3>] ? apic_timer_interrupt+0x13/0x20
 <EOI>  [<ffffffff8103656b>] ? native_safe_halt+0xb/0x10
 [<ffffffff8101433d>] ? default_idle+0x4d/0xb0
 [<ffffffff81009de6>] ? cpu_idle+0xb6/0x110
 [<ffffffff814c59fa>] ? rest_init+0x7a/0x80
 [<ffffffff81b9df48>] ? start_kernel+0x41d/0x429
 [<ffffffff81b9d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81b9d443>] ? x86_64_start_kernel+0x105/0x114
---[ end trace f821242cfe8f7341 ]---
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
bnx2 0000:01:00.0: em1: DEBUG: intr_sem[0] PCI_CMD[00100000]
bnx2 0000:01:00.0: em1: DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000000]
bnx2 0000:01:00.0: em1: DEBUG: EMAC_TX_STATUS[ffffffff] EMAC_RX_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: RPM_MGMT_PKT_CTRL[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: MCP_STATE_P0[ffffffff] MCP_STATE_P1[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: HC_STATS_INTERRUPT_STATUS[ffffffff]
bnx2 0000:01:00.0: em1: DEBUG: PBA[ffffffff]
bnx2: Chip not in correct Endian mode
(Repeated).

Note that "Endian" in the above message is capitalized. I changed that to ensure the patched version was, in fact, running rather than the stock version. It was.
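When repeating the test many times, it may help to check the kernel log for the failure signature automatically. This is a rough sketch, assuming a POSIX shell and grep; the helper name has_bnx2_timeout is hypothetical, and the two patterns are simply the messages quoted in this report. The second match is case-insensitive so that it catches both the stock "endian" and the patched "Endian" spelling.

```shell
# Hypothetical helper: read kernel log text on stdin and succeed only if
# it contains both messages seen in this report.
has_bnx2_timeout() {
    log=$(cat)
    # Watchdog line; the interface name and queue number vary.
    printf '%s\n' "$log" |
        grep -q 'NETDEV WATCHDOG: .*(bnx2): transmit queue [0-9]* timed out' || return 1
    # -i matches the stock and the patched capitalization.
    printf '%s\n' "$log" |
        grep -qi 'bnx2: Chip not in correct endian mode' || return 1
    return 0
}
```

A possible use is `dmesg | has_bnx2_timeout && echo "bnx2 transmit timeout seen"` on each node after a run.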
Created attachment 512720 [details] The modified dlm_load I used to recreate the problem
Bob, could you please try something for me? When this problem occurs, can you rmmod the bnx2 module and modprobe it again? I'd like to see if the device can actually be recovered at all without a cold boot. If it can, then I did something wrong with my patch. If it can't, then I'm starting to think this might be a hardware issue that we need to involve Broadcom with.
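The unload/reload request above could be wrapped in a small function. This is only a sketch, assuming a POSIX shell; the helper name reload_module and its dry-run default are hypothetical. By default it just prints the two commands, so it can be tried safely on a machine where the NIC is the only way in.

```shell
# Hypothetical helper: unload and reload a kernel module.
# The optional second argument is a command prefix; it defaults to "echo",
# so nothing is actually executed unless a real prefix is given.
reload_module() {
    mod=$1
    runner=${2:-echo}
    $runner rmmod "$mod" && $runner modprobe "$mod"
}
```

`reload_module bnx2` prints the commands; `reload_module bnx2 sudo` (or `reload_module bnx2 env` when already root) would run them. Running it from the serial console rather than over the affected interface is assumed.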
I'll try to get this going, but for some reason the serial ports on my consoles are all showing good output, but taking bad input. I've been trying to backtrack it and have eliminated the easy things (such as other processes having ttyS1 open). If I can't log in to the serial console, I can't rmmod or insmod the bnx2 driver because my only way in would be the console. I'll keep you posted.
Probably. I've seen that happen before, although I can never recall what the solution is. Please let me know when the system is back in working order.
Robert, any luck getting the console on this system fixed?
Have you tried 197? I ran the 197 kernel all weekend for bz 734815 with heavy netperf traffic and encountered no errors at all. /me is starting to wonder if perhaps a BIOS update is needed on some of the systems in question.
I tried to recreate the problem with a newer kernel but ran into other, unrelated issues. I'll try it again when I get a few spare cycles.
ok, copy that.
With the latest RHEL6 kernel bits, I ran the test successfully a bunch of times in a row without failure, so I'm going to close this bug record. I'm not sure which microcode patch fixed the problem, but it seems fixed. Sorry for the noise and sorry it took so long to resolve. Incidentally, I did my testing on the 2.6.32-203.el6.x86_64 kernel.