Bug 234918 - NMI Watchdog detected LOCKUP while running process kmirrord
Summary: NMI Watchdog detected LOCKUP while running process kmirrord
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-04-02 20:31 UTC by Corey Marthaler
Modified: 2010-01-12 02:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-08-05 21:42:07 UTC
Embargoed:


Attachments
Here are the kernel panic messages (10.55 KB, text/plain)
2007-04-05 18:02 UTC, Corey Marthaler

Description Corey Marthaler 2007-04-02 20:31:32 UTC
Description of problem:
I had been running revolver (for quite a while, in fact) before hitting this issue
on my x86_64 cluster. There were 3 cmirror gfs filesystems, all running I/O, and
for this iteration 3 of the 4 nodes in the cluster were killed and then brought
back in. This issue happened when the gfs filesystems were being remounted after
the recovery.

[...]
Verifying that recovery properly took place on the node(s) which stayed in the
cluster
checking Fence recovery...
checking DLM recovery...
checking GFS recovery...
Verifying that clvmd was started properly on the dueler(s)
mounting /dev/mapper/taft-cmirror1 on /mnt/taft1 on taft-04
mounting /dev/mapper/taft-cmirror2 on /mnt/taft2 on taft-04
mounting /dev/mapper/taft-cmirror3 on /mnt/taft3 on taft-04
mounting /dev/mapper/taft-cmirror1 on /mnt/taft1 on taft-01
mounting /dev/mapper/taft-cmirror2 on /mnt/taft2 on taft-01
mounting /dev/mapper/taft-cmirror3 on /mnt/taft3 on taft-01
mounting /dev/mapper/taft-cmirror1 on /mnt/taft1 on taft-02
mounting /dev/mapper/taft-cmirror2 on /mnt/taft2 on taft-02
Didn't receive heartbeat for 120 seconds




[...]
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Trying to acquire journal lock...
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Looking at journal...
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Acquiring the transaction lock...
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Acquiring the transaction lock...
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Replaying journal...
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Replaying journal...
GFS: fsid=TAFT_CLUSTER:gfs3.0: jid=1: Journal replayed in 9s
GFS: fsid=TAFT_CLUSTER:gfs3.0: jid=1: Done
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Replayed 1377 of 2255 blocks
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: replays = 1377, skips = 611, sames = 267
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Replayed 10711 of 10824 blocks
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: replays = 10711, skips = 7, sames = 106
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Journal replayed in 4s
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Done
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Journal replayed in 5s
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Done
NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) nfsd exportfs lockd nfs_ad
Pid: 3722, comm: kmirrord Not tainted 2.6.9-50.ELlargesmp
RIP: 0010:[<ffffffff8030cddf>] <ffffffff8030cddf>{.text.lock.spinlock+2}
RSP: 0018:00000102120ddc88  EFLAGS: 00000086
RAX: 0000000000000000 RBX: ffffffffa029cd28 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffffa029cd28
RBP: 000000000000026f R08: 00000102120d8048 R09: 0000000000000000
R10: 0000000300000000 R11: 0000000000000000 R12: 00000100dfd56c00
R13: 00000101ffd2a618 R14: 00000101ffd2a600 R15: ffffffffa0094623
FS:  0000000000000000(0000) GS:ffffffff80501480(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000572f20 CR3: 0000000000101000 CR4: 00000000000006e0
Process kmirrord (pid: 3722, threadinfo 00000102120dc000, task 0000010216021030)
Stack: 0000010214d4c820 ffffffffa029349e 0000010213b4e180 00000102120dde68
       00000102120dddf8 ffffffffa009476f 0000000000000000 0000000000000000
       00000102120ddcb8 0000000200000000
Call Trace:<ffffffffa029349e>{:dm_cmirror:cluster_clear_region+56}
       <ffffffffa009476f>{:dm_mirror:do_work+332} <ffffffff8030bb19>{thread_ret
       <ffffffff8030bb71>{thread_return+88} <ffffffffa0094623>{:dm_mirror:do_wo
       <ffffffff80147e1a>{worker_thread+419} <ffffffff801343a8>{default_wake_fu
       <ffffffff801343a8>{default_wake_function+0} <ffffffff8014bb67>{keventd_c
       <ffffffff80147c77>{worker_thread+0} <ffffffff8014bb67>{keventd_create_kt
       <ffffffff8014bb3e>{kthread+199} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014bb67>{keventd_create_kthread+0} <ffffffff8014ba77>{kthread+
       <ffffffff80110f3f>{child_rip+0}

Code: 83 3b 00 7e f9 e9 60 fc ff ff f3 90 83 3b 00 7e f9 e9 ce fc
Kernel panic - not syncing: nmi watchdog
 Badness in panic at kernel/panic.c:118

Call Trace:<ffffffff801380c6>{panic+527} <ffffffff80111860>{show_stack+241}
<ffffffff8011198a>{show_registers+277} <ffffffff80111c91>{die_nmi+130}
<ffffffff8011d12d>{nmi_watchdog_tick+210} <ffffffff80144bc3>{notifier_call_chai
<ffffffff8011255e>{default_do_nmi+112} <ffffffff8011d1e3>{do_nmi+115}
<ffffffff80111173>{paranoid_exit+0} <ffffffffa0094623>{:dm_mirror:do_work+0}
<ffffffff8011c624>{smp_send_stop+76}  <EOE> <ffffffff801380c6>{panic+527}
       <ffffffff80111860>{show_stack+241} <ffffffff8011198a>{show_registers+277
       <ffffffff80111c91>{die_nmi+130} <ffffffff8011d12d>{nmi_watchdog_tick+210
       <ffffffff80144bc3>{notifier_call_chain+31} <ffffffff8011255e>{default_do
       <ffffffff8011d1e3>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0}
       <ffffffffa0094623>{:dm_mirror:do_work+0} <ffffffff8011c624>{smp_send_sto

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:987

Call Trace:<ffffffff8024334b>{i8042_panic_blink+238} <ffffffff80138074>{panic+4
<ffffffff80111860>{show_stack+241} <ffffffff8011198a>{show_registers+277}
<ffffffff80111c91>{die_nmi+130} <ffffffff8011d12d>{nmi_watchdog_tick+210}
<ffffffff80144bc3>{notifier_call_chain+31} <ffffffff8011255e>{default_do_nmi+11
<ffffffff8011d1e3>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0}
<ffffffffa0094623>{:dm_mirror:do_work+0} <ffffffff8011c624>{smp_send_stop+76}
 <EOE> <ffffffff80138074>{panic+445} <ffffffff80111860>{show_stack+241}
       <ffffffff8011198a>{show_registers+277} <ffffffff80111c91>{die_nmi+130}
       <ffffffff8011d12d>{nmi_watchdog_tick+210} <ffffffff80144bc3>{notifier_ca
       <ffffffff8011255e>{default_do_nmi+112} <ffffffff8011d1e3>{do_nmi+115}
       <ffffffff80111173>{paranoid_exit+0} <ffffffffa0094623>{:dm_mirror:do_wor
       <ffffffff8011c624>{smp_send_stop+76}
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:990

Call Trace:<ffffffff802433dd>{i8042_panic_blink+384} <ffffffff80138074>{panic+4
<ffffffff80111860>{show_stack+241} <ffffffff8011198a>{show_registers+277}
<ffffffff80111c91>{die_nmi+130} <ffffffff8011d12d>{nmi_watchdog_tick+210}
<ffffffff80144bc3>{notifier_call_chain+31} <ffffffff8011255e>{default_do_nmi+11
<ffffffff8011d1e3>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0}
<ffffffffa0094623>{:dm_mirror:do_work+0} <ffffffff8011c624>{smp_send_stop+76}
 <EOE> <ffffffff80138074>{panic+445} <ffffffff80111860>{show_stack+241}
       <ffffffff8011198a>{show_registers+277} <ffffffff80111c91>{die_nmi+130}
       <ffffffff8011d12d>{nmi_watchdog_tick+210} <ffffffff80144bc3>{notifier_ca
       <ffffffff8011255e>{default_do_nmi+112} <ffffffff8011d1e3>{do_nmi+115}
       <ffffffff80111173>{paranoid_exit+0} <ffffffffa0094623>{:dm_mirror:do_wor
       <ffffffff8011c624>{smp_send_stop+76}
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:992

Call Trace:<ffffffff80243442>{i8042_panic_blink+485} <ffffffff80138074>{panic+4
<ffffffff80111860>{show_stack+241} <ffffffff8011198a>{show_registers+277}
<ffffffff80111c91>{die_nmi+130} <ffffffff8011d12d>{nmi_watchdog_tick+210}
<ffffffff80144bc3>{notifier_call_chain+31} <ffffffff8011255e>{default_do_nmi+11
<ffffffff8011d1e3>{do_nmi+115} <ffffffff80111173>{paranoid_exit+0}
<ffffffffa0094623>{:dm_mirror:do_work+0} <ffffffff8011c624>{smp_send_stop+76}
 <EOE> <ffffffff80138074>{panic+445} <ffffffff80111860>{show_stack+241}
       <ffffffff8011198a>{show_registers+277} <ffffffff80111c91>{die_nmi+130}
       <ffffffff8011d12d>{nmi_watchdog_tick+210} <ffffffff80144bc3>{notifier_ca
       <ffffffff8011255e>{default_do_nmi+112} <ffffffff8011d1e3>{do_nmi+115}
       <ffffffff80111173>{paranoid_exit+0} <ffffffffa0094623>{:dm_mirror:do_wor
       <ffffffff8011c624>{smp_send_stop+76}

 

Version-Release number of selected component (if applicable):
2.6.9-50.ELlargesmp
cmirror-kernel-2.6.9-26.0

Comment 1 Corey Marthaler 2007-04-05 18:02:22 UTC
Created attachment 151792
Here are the kernel panic messages

Reproduced this again today on the tafts running revolver. This was while
running the kernel module with a change to postpone the recovery/write collision
test to the flush function.

Comment 2 Jonathan Earl Brassow 2007-04-05 19:42:06 UTC
Here's the important message from the panic output:

Double clear on region (314)

And here's the code that caused it:

	/* Scan the pending clear requests for a duplicate of this region. */
	list_for_each_entry_safe(rs, tmp_rs, &lc->clear_waiting, rs_list) {
		if (region == rs->rs_region) {
			DMERR("%d) Double clear on region ("
			      SECTOR_FORMAT ")", __LINE__, region);
			BUG();
		}
	}

The BUG() forces the machine to panic.  I'll see if it's valid to get into this
situation.  If so, the fix is to simply return instead of calling BUG().
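For illustration, a minimal sketch of that change, keeping the diagnostic but
returning to the caller instead of panicking (this assumes the enclosing
function can simply return at this point; the actual patch is not shown here):

	list_for_each_entry_safe(rs, tmp_rs, &lc->clear_waiting, rs_list) {
		if (region == rs->rs_region) {
			/* Log the duplicate clear, but don't take the box down. */
			DMERR("%d) Double clear on region ("
			      SECTOR_FORMAT ")", __LINE__, region);
			return;		/* was BUG() */
		}
	}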


Comment 3 Jonathan Earl Brassow 2007-04-05 20:05:35 UTC
Yes, it is possible.

	/*
	 * It is possible, but unlikely to get to this case. It requires
	 * the following to happen:
	 * 1) mark the region for writing
	 * 2) clear the region
	 * 3) clear doesn't get flushed because of bug 235039/235040
	 * 4) suspend due to server relocation
	 * 5) on-disk log says we need to recover (because it hasn't been cleared)
	 * 6) we recover the region
	 * 7) clearing the region after recovery causes us to get here
	 *
	 * Once 235040 is cleared, any entries found in this list should
	 * cause a bug.
	 */

Have a work-around.
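The patch itself isn't attached to this bug, but given the sequence above, one
plausible shape for the work-around is to treat a duplicate found in
clear_waiting as a stale leftover from the missed flush and drop it rather
than panic. A sketch under that assumption (the region_state_pool name is
hypothetical; the real code may manage these entries differently):

	list_for_each_entry_safe(rs, tmp_rs, &lc->clear_waiting, rs_list) {
		if (region == rs->rs_region) {
			/*
			 * Stale clear request left behind by the missed
			 * flush (bugs 235039/235040); the region has since
			 * been recovered.  Drop it instead of calling BUG().
			 */
			list_del(&rs->rs_list);
			mempool_free(rs, region_state_pool);	/* hypothetical pool */
			break;
		}
	}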


Comment 4 Jonathan Earl Brassow 2007-04-05 21:34:51 UTC
assigned -> post


Comment 5 Jonathan Earl Brassow 2007-04-09 16:00:00 UTC
post -> modified

built 4/5/07 (cmirror-kernel-2.6.9-28.0)


Comment 6 Corey Marthaler 2007-04-12 17:46:37 UTC
fix verified in cmirror-kernel-2.6.9-30.0.

Comment 7 Chris Feist 2008-08-05 21:42:07 UTC
Fixed in current release (4.7).

