Description of problem:
I had been running revolver (for quite a while, in fact) before hitting this issue on my x86_64 cluster. There were 3 cmirror GFS filesystems, all running I/O, and for this iteration 3 of the 4 nodes in the cluster were killed and then brought back in. This issue happened while the GFS filesystems were being remounted after the recovery.

[...]

Verifying that recovery properly took place on the node(s) which stayed in the cluster
checking Fence recovery...
checking DLM recovery...
checking GFS recovery...
Verifying that clvmd was started properly on the dueler(s)
mounting /dev/mapper/taft-cmirror1 on /mnt/taft1 on taft-04
mounting /dev/mapper/taft-cmirror2 on /mnt/taft2 on taft-04
mounting /dev/mapper/taft-cmirror3 on /mnt/taft3 on taft-04
mounting /dev/mapper/taft-cmirror1 on /mnt/taft1 on taft-01
mounting /dev/mapper/taft-cmirror2 on /mnt/taft2 on taft-01
mounting /dev/mapper/taft-cmirror3 on /mnt/taft3 on taft-01
mounting /dev/mapper/taft-cmirror1 on /mnt/taft1 on taft-02
mounting /dev/mapper/taft-cmirror2 on /mnt/taft2 on taft-02
Didn't receive heartbeat for 120 seconds

[...]

GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Trying to acquire journal lock...
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Looking at journal...
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Acquiring the transaction lock...
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Acquiring the transaction lock...
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Replaying journal...
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Replaying journal...
GFS: fsid=TAFT_CLUSTER:gfs3.0: jid=1: Journal replayed in 9s
GFS: fsid=TAFT_CLUSTER:gfs3.0: jid=1: Done
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Replayed 1377 of 2255 blocks
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: replays = 1377, skips = 611, sames = 267
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Replayed 10711 of 10824 blocks
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: replays = 10711, skips = 7, sames = 106
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Journal replayed in 4s
GFS: fsid=TAFT_CLUSTER:gfs2.0: jid=1: Done
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Journal replayed in 5s
GFS: fsid=TAFT_CLUSTER:gfs1.0: jid=1: Done

NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) nfsd exportfs lockd nfs_ad
Pid: 3722, comm: kmirrord Not tainted 2.6.9-50.ELlargesmp
RIP: 0010:[<ffffffff8030cddf>] <ffffffff8030cddf>{.text.lock.spinlock+2}
RSP: 0018:00000102120ddc88  EFLAGS: 00000086
RAX: 0000000000000000 RBX: ffffffffa029cd28 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffffa029cd28
RBP: 000000000000026f R08: 00000102120d8048 R09: 0000000000000000
R10: 0000000300000000 R11: 0000000000000000 R12: 00000100dfd56c00
R13: 00000101ffd2a618 R14: 00000101ffd2a600 R15: ffffffffa0094623
FS:  0000000000000000(0000) GS:ffffffff80501480(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000572f20 CR3: 0000000000101000 CR4: 00000000000006e0
Process kmirrord (pid: 3722, threadinfo 00000102120dc000, task 0000010216021030)
Stack: 0000010214d4c820 ffffffffa029349e 0000010213b4e180 00000102120dde68
       00000102120dddf8 ffffffffa009476f 0000000000000000 0000000000000000
       00000102120ddcb8 0000000200000000
Call Trace:
 <ffffffffa029349e>{:dm_cmirror:cluster_clear_region+56}
 <ffffffffa009476f>{:dm_mirror:do_work+332}
 <ffffffff8030bb19>{thread_ret
 <ffffffff8030bb71>{thread_return+88}
 <ffffffffa0094623>{:dm_mirror:do_wo
 <ffffffff80147e1a>{worker_thread+419}
 <ffffffff801343a8>{default_wake_fu
 <ffffffff801343a8>{default_wake_function+0}
 <ffffffff8014bb67>{keventd_c
 <ffffffff80147c77>{worker_thread+0}
 <ffffffff8014bb67>{keventd_create_kt
 <ffffffff8014bb3e>{kthread+199}
 <ffffffff80110f47>{child_rip+8}
 <ffffffff8014bb67>{keventd_create_kthread+0}
 <ffffffff8014ba77>{kthread+
 <ffffffff80110f3f>{child_rip+0}

Code: 83 3b 00 7e f9 e9 60 fc ff ff f3 90 83 3b 00 7e f9 e9 ce fc
Kernel panic - not syncing: nmi watchdog

Badness in panic at kernel/panic.c:118
Call Trace:
 <ffffffff801380c6>{panic+527}
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_call_chai
 <ffffffff8011255e>{default_do_nmi+112}
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_work+0}
 <ffffffff8011c624>{smp_send_stop+76}
 <EOE>
 <ffffffff801380c6>{panic+527}
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210
 <ffffffff80144bc3>{notifier_call_chain+31}
 <ffffffff8011255e>{default_do
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_work+0}
 <ffffffff8011c624>{smp_send_sto

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:987
Call Trace:
 <ffffffff8024334b>{i8042_panic_blink+238}
 <ffffffff80138074>{panic+4
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_call_chain+31}
 <ffffffff8011255e>{default_do_nmi+11
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_work+0}
 <ffffffff8011c624>{smp_send_stop+76}
 <EOE>
 <ffffffff80138074>{panic+445}
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_ca
 <ffffffff8011255e>{default_do_nmi+112}
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_wor
 <ffffffff8011c624>{smp_send_stop+76}

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:990
Call Trace:
 <ffffffff802433dd>{i8042_panic_blink+384}
 <ffffffff80138074>{panic+4
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_call_chain+31}
 <ffffffff8011255e>{default_do_nmi+11
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_work+0}
 <ffffffff8011c624>{smp_send_stop+76}
 <EOE>
 <ffffffff80138074>{panic+445}
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_ca
 <ffffffff8011255e>{default_do_nmi+112}
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_wor
 <ffffffff8011c624>{smp_send_stop+76}

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:992
Call Trace:
 <ffffffff80243442>{i8042_panic_blink+485}
 <ffffffff80138074>{panic+4
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_call_chain+31}
 <ffffffff8011255e>{default_do_nmi+11
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_work+0}
 <ffffffff8011c624>{smp_send_stop+76}
 <EOE>
 <ffffffff80138074>{panic+445}
 <ffffffff80111860>{show_stack+241}
 <ffffffff8011198a>{show_registers+277}
 <ffffffff80111c91>{die_nmi+130}
 <ffffffff8011d12d>{nmi_watchdog_tick+210}
 <ffffffff80144bc3>{notifier_ca
 <ffffffff8011255e>{default_do_nmi+112}
 <ffffffff8011d1e3>{do_nmi+115}
 <ffffffff80111173>{paranoid_exit+0}
 <ffffffffa0094623>{:dm_mirror:do_wor
 <ffffffff8011c624>{smp_send_stop+76}

Version-Release number of selected component (if applicable):
2.6.9-50.ELlargesmp
cmirror-kernel-2.6.9-26.0
Created attachment 151792 [details]
Here are the kernel panic messages

Reproduced this again today on the tafts running revolver. This was while running the kernel module with a change to postpone the recovery/write collision test to the flush function.
Here's the important message from the panic output:

  Double clear on region (314)

And here's the code that caused it:

  list_for_each_entry_safe(rs, tmp_rs, &lc->clear_waiting, rs_list) {
          if (region == rs->rs_region) {
                  DMERR("%d) Double clear on region (" SECTOR_FORMAT ")",
                        __LINE__, region);
                  BUG();
          }
  }

The BUG() forces the machine to panic. I'll see whether it's valid to get into this situation. If so, the fix is simply to return instead of calling BUG().
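For illustration only, a minimal sketch of what that "return instead of BUG()" change could look like. It reuses the names from the snippet above (lc->clear_waiting, rs->rs_region, SECTOR_FORMAT); the early return and the message wording are assumptions about how the fix might look, not the actual patch:

  /*
   * Sketch, not the shipped fix: if this region already has a clear
   * request queued on lc->clear_waiting, log it and drop the
   * duplicate instead of panicking the box.
   */
  list_for_each_entry_safe(rs, tmp_rs, &lc->clear_waiting, rs_list) {
          if (region == rs->rs_region) {
                  DMERR("%d) Double clear on region (" SECTOR_FORMAT ")",
                        __LINE__, region);
                  return;  /* ignore the duplicate clear */
          }
  }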
Yes, it is possible.

/*
 * It is possible, but unlikely to get to this case. It requires
 * the following to happen:
 * 1) mark the region for writing
 * 2) clear the region
 * 3) clear doesn't get flushed because of bug 235039/235040
 * 4) suspend due to server relocation
 * 5) on-disk log says we need to recover (because it hasn't been cleared)
 * 6) we recover the region
 * 7) clearing the region after recovery causes us to get here
 *
 * Once 235040 is cleared, any entries found in this list should
 * cause a bug.
 */

Have a work-around.
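Since the entry found on clear_waiting in this sequence is stale (the recovery in step 6 already made the queued clear obsolete), one plausible shape for the work-around is to discard that entry and carry on rather than BUG(). A sketch under those assumptions; the list_del_init()/mempool_free() handling and the pool name are hypothetical, not the actual change:

  list_for_each_entry_safe(rs, tmp_rs, &lc->clear_waiting, rs_list) {
          if (region == rs->rs_region) {
                  /* Steps 1-7 above: a stale pending clear survived
                     recovery.  Drop it and continue rather than BUG(). */
                  DMWARN("Double clear on region (" SECTOR_FORMAT ")", region);
                  list_del_init(&rs->rs_list);
                  mempool_free(rs, region_state_pool);  /* pool name hypothetical */
                  break;
          }
  }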
assigned -> post
post -> modified
Built 4/5/07 (cmirror-kernel-2.6.9-28.0)
Fix verified in cmirror-kernel-2.6.9-30.0.
Fixed in current release (4.7).