Bug 228070 - DLM assertion when running GFS I/O during cmirror leg failure
Summary: DLM assertion when running GFS I/O during cmirror leg failure
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-02-09 20:29 UTC by Corey Marthaler
Modified: 2010-01-12 02:02 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-08-05 21:43:02 UTC
Embargoed:



Description Corey Marthaler 2007-02-09 20:29:21 UTC
Description of problem:
This happened at the same time as bz 228067, but on link-02. 

I had 4 (3-legged) cluster mirrors (created out of 2 separate VGs), each with GFS
on top and all running filesystem I/O. I then failed /dev/hda, which was set up
to be one of the three legs in all 4 mirrors. Just after the leg failed I hit
the following panic on link-02.



Feb  9 08:34:30 link-02 ccsd[4570]: Unable to connect to cluster infrastructure
after 270 seconds.
Feb  9 08:34:30 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
768 of 453358 (1%)
Feb  9 08:34:38 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
800 of 453390 (1%)
Feb  9 08:34:46 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
832 of 453422 (1%)
Feb  9 08:34:54 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
864 of 453454 (1%)
Feb  9 08:35:00 link-02 ccsd[4570]: Unable to connect to cluster infrastructure
after 300 seconds.
Feb  9 08:35:02 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
896 of 453486 (1%)
Feb  9 08:35:10 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
928 of 453518 (1%)
Feb  9 08:35:18 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
960 of 453550 (1%)
Feb  9 08:35:26 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
992 of 453582 (1%)
Feb  9 08:35:30 link-02 ccsd[4570]: Unable to connect to cluster infrastructure
after 330 seconds.
Feb  9 08:35:34 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
1024 of 453614 (1%)
Feb  9 08:35:42 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
1056 of 453646 (1%)
Feb  9 08:35:50 link-02 kernel: dm-cmirror: Clustered mirror retried requests ::
1088 of 453678 (1%)
dlm: dlm_unlock: lkid 20196 lockspace not found
fs2 recover event 29 finished
gfs2 move flags 1,0,0 ids 29,29,29
gfs2 move flags 0,1,0 ids 29,31,29
gfs2 move use event 31
gfs2 recover event 31
gfs2 add node 3
gfs2 total nodes 3
gfs2 rebuild resource directory
gfs2 rebuilt 5 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 31 done
gfs2 move flags 0,0,1 ids 29,31,31
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 31 finished
gfs2 move flags 1,0,0 ids 31,31,31
gfs2 move flags 0,1,0 ids 31,33,31
gfs2 move use event 33
gfs2 recover event 33
gfs2 add node 4
gfs2 total nodes 4
gfs2 rebuild resource directory
gfs2 rebuilt 6 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 33 done
gfs2 move flags 0,0,1 ids 31,33,33
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 33 finished
5069 pr_start last_stop 0 last_start 12 last_finish 0
5069 pr_start count 1 type 2 event 12 flags 250
5069 claim_jid 0
5069 pr_start 12 done 1
5069 pr_finish flags 5b
5054 recovery_done jid 0 msg 309 b
5054 recovery_done nodeid 1 flg 18
5069 pr_start last_stop 12 last_start 14 last_finish 12
5069 pr_start count 2 type 2 event 14 flags 21b
5054 recovery_done jid 1 msg 309 21b
5054 recovery_done jid 2 msg 309 21b
5054 recovery_done jid 3 msg 309 21b
5054 others_may_mount 21b
5069 pr_start delay done before omm 21b
5069 pr_start 14 done 0
5054 others_may_mount start_done 14 181b
5069 pr_finish flags 181b
5069 pr_start last_stop 14 last_start 16 last_finish 14
5069 pr_start count 3 type 2 event 16 flags 1a1b
5069 pr_start 16 done 1
5068 pr_finish flags 181b
5069 pr_start last_stop 16 last_start 18 last_finish 16
5069 pr_start count 4 type 2 event 18 flags 1a1b
5069 pr_start 18 done 1
5068 pr_finish flags 181b
5114 pr_start last_stop 0 last_start 20 last_finish 0
5114 pr_start count 1 type 2 event 20 flags 250
5114 claim_jid 0
5114 pr_start 20 done 1
5114 pr_finish flags 5b
5109 recovery_done jid 0 msg 309 b
5109 recovery_done nodeid 1 flg 18
5109 recovery_done jid 1 msg 309 b
5109 recovery_done jid 2 msg 309 b
5109 recovery_done jid 3 msg 309 b
5109 others_may_mount b
5113 pr_start last_stop 20 last_start 22 last_finish 20
5113 pr_start count 2 type 2 event 22 flags a1b
5113 pr_start 22 done 1
5113 pr_finish flags 81b
5114 pr_start last_stop 22 last_start 24 last_finish 22
5114 pr_start count 3 type 2 event 24 flags a1b
5114 pr_start 24 done 1
5114 pr_finish flags 81b
5113 pr_start last_stop 24 last_start 26 last_finish 24
5113 pr_start count 4 type 2 event 26 flags a1b
5113 pr_start 26 done 1
5113 pr_finish flags 81b
5136 pr_start last_stop 0 last_start 28 last_finish 0
5136 pr_start count 1 type 2 event 28 flags 250
5136 claim_jid 0
5136 pr_start 28 done 1
5136 pr_finish flags 5b
5131 recovery_done jid 0 msg 309 b
5131 recovery_done nodeid 1 flg 18
5131 recovery_done jid 1 msg 309 b
5131 recovery_done jid 2 msg 309 b
5131 recovery_done jid 3 msg 309 b
5131 others_may_mount b
5135 pr_start last_stop 28 last_start 30 last_finish 28
5135 pr_start count 2 type 2 event 30 flags a1b
5135 pr_start 30 done 1
5135 pr_finish flags 81b
5136 pr_start last_stop 30 last_start 32 last_finish 30
5136 pr_start count 3 type 2 event 32 flags a1b
5136 pr_start 32 done 1
5136 pr_finish flags 81b
5135 pr_start last_stop 32 last_start 34 last_finish 32
5135 pr_start count 4 type 2 event 34 flags a1b
5135 pr_start 34 done 1
5135 pr_finish flags 81b

lock_dlm:  Assertion failed on line 361 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error || (plock && error == -EINPROGRESS)"
lock_dlm:  time = 4306154010
gfs1: error=-22 num=2,1a lkf=10000 flags=84

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:361
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) dm_cmirror(U) dlm(U)
cman(U) md5 ipv6 parport_pc lp parport autofs4 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random k8_edac edac_mc tg3 floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc
sd_mod scsi_mod
Pid: 5071, comm: gfs_glockd Not tainted 2.6.9-46.ELsmp
RIP: 0010:[<ffffffffa029183d>] <ffffffffa029183d>{:lock_dlm:do_dlm_unlock+203}
RSP: 0000:0000010032ab9dd8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 000001003beded80 RCX: 0000000000000246
RDX: 00000000001033e8 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 00000000ffffffea R08: 00000000000927c0 R09: 000001003beded80
R10: ffffffff80318d00 R11: 0000ffff80400460 R12: 0000000000000000
R13: ffffff00001a2000 R14: ffffffffa028d260 R15: 0000010032b65520
FS:  0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a95628000 CR3: 0000000000101000 CR4: 00000000000006e0
Process gfs_glockd (pid: 5071, threadinfo 0000010032ab8000, task 00000100391d67f0)
Stack: 0000000000000084 000001003beded80 ffffff00001a2000 0000010032b65520
       0000010032b6554c ffffffffa0291bba 0000000000000001 ffffffffa0256fd4
       0000000000000001 ffffffffa024d7f7
Call Trace:<ffffffffa0291bba>{:lock_dlm:lm_dlm_unlock+21}
<ffffffffa0256fd4>{:gfs:gfs_lm_unlock+41}
       <ffffffffa024d7f7>{:gfs:gfs_glock_drop_th+290}
<ffffffffa024bf49>{:gfs:run_queue+314}
       <ffffffffa024c19d>{:gfs:unlock_on_glock+37}
<ffffffffa024c293>{:gfs:gfs_reclaim_glock+234}
       <ffffffffa024075a>{:gfs:gfs_glockd+61}
<ffffffff80134660>{default_wake_function+0}
       <ffffffff80134660>{default_wake_function+0} <ffffffff80110f47>{child_rip+8}
       <ffffffffa024071d>{:gfs:gfs_glockd+0} <ffffffff80110f3f>{child_rip+0}


Code: 0f 0b 08 62 29 a0 ff ff ff ff 69 01 48 c7 c7 0d 62 29 a0 31
RIP <ffffffffa029183d>{:lock_dlm:do_dlm_unlock+203} RSP <0000010032ab9dd8>
 <0>Kernel panic - not syncing: Oops



Version-Release number of selected component (if applicable):
2.6.9-46.ELsmp
dlm-kernel-smp-2.6.9-46.10
dlm-1.0.1-2
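
The assertion that fired is the error check on the dlm_unlock() return value in
lock_dlm's lock.c: the only results it tolerates are success, or -EINPROGRESS
for a posix lock. Here dlm_unlock() returned -22 (-EINVAL), apparently because
the lockspace had already been torn down ("lkid 20196 lockspace not found").
The following is a minimal userspace sketch of that check, with illustrative
names rather than the actual gfs-kernel source:

#include <errno.h>
#include <stdio.h>
#include <stdbool.h>

/*
 * Illustrative stand-in for the check behind:
 *   lock_dlm:  assertion:  "!error || (plock && error == -EINPROGRESS)"
 *   gfs1: error=-22 num=2,1a lkf=10000 flags=84
 * 'error' is the return value of dlm_unlock(); 'plock' marks a posix-lock
 * request, the only case allowed to complete with -EINPROGRESS.  The names
 * below are hypothetical; in the kernel the failed check hits a BUG() and
 * panics the node, as seen in the trace above.
 */
static void check_unlock_result(int error, bool plock)
{
    if (!error || (plock && error == -EINPROGRESS)) {
        printf("unlock result %d accepted\n", error);
        return;
    }
    printf("assertion would fail: error=%d plock=%d\n", error, (int)plock);
}

int main(void)
{
    check_unlock_result(0, false);           /* normal completion */
    check_unlock_result(-EINPROGRESS, true); /* async posix lock, tolerated */
    check_unlock_result(-EINVAL, false);     /* the -22 seen here: lockspace gone */
    return 0;
}

With error = -EINVAL and plock = false the expression is false, which is exactly
the combination reported above (gfs1: error=-22) and what sends the node into
the BUG at lock.c:361.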

Comment 1 David Teigland 2007-02-09 20:36:42 UTC
Looks like the node was probably kicked out of the cluster:
  dlm: dlm_unlock: lkid 20196 lockspace not found
We have the "emergency shutdown" messages to make it more
obvious when that happens.  Are there none of those, and no
cman messages about removing the node from the cluster?


Comment 2 Corey Marthaler 2007-02-09 21:15:33 UTC
Feb  9 08:58:46 link-04 kernel: CMAN: removing node link-02 from the cluster :
Missed too many heartbeats


Comment 3 Corey Marthaler 2007-02-09 21:16:12 UTC
Feb  9 08:31:36 link-02 kernel: CMAN: Being told to leave the cluster by node 3
Feb  9 08:31:36 link-02 kernel: CMAN: we are leaving the cluster.
Feb  9 08:31:36 link-02 kernel: WARNING: dlm_emergency_shutdown
Feb  9 08:31:36 link-02 kernel: WARNING: dlm_emergency_shutdown
Feb  9 08:31:36 link-02 kernel: SM: 00000005 sm_stop: SG still joined
Feb  9 08:31:36 link-02 kernel: SM: 01000006 sm_stop: SG still joined
Feb  9 08:31:36 link-02 kernel: SM: 02000009 sm_stop: SG still joined
Feb  9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb  9 08:31:36 link-02 kernel:
Feb  9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb  9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Feb  9 08:31:36 link-02 kernel: device-mapper: recovery failed on region 10714
Feb  9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb  9 08:31:36 link-02 kernel:
Feb  9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb  9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed
Feb  9 08:31:36 link-02 kernel: dm-cmirror: No address list available for 1
Feb  9 08:31:36 link-02 kernel:
Feb  9 08:31:36 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Feb  9 08:31:36 link-02 kernel: dm-cmirror: process_log_request:: failed


Comment 4 Corey Marthaler 2007-02-09 21:53:00 UTC
Reproduced with a slightly different assert, but in the same scenario: nodes get
told to leave the cluster as a result of a cmirror device failing and slowing
down the machines.

lock_dlm:  Assertion failed on line 432 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4300566187
gfs3: num=2,18 err=-22 cur=0 req=5 lkf=4

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:432
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: lock_dlm(U) gfs(U) lock_harness(U) qla2300 qla2xxx
scsi_transport_fc dm_cmirror(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport
autofs4 sunrpc ds yenta_socket pcmcia_core button battery ac ohci_hcd hw_random
k8_edac edac_mc tg3 floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod sd_mod
scsi_mod
Pid: 6201, comm: lock_dlm1 Tainted: G   M  2.6.9-46.ELsmp
RIP: 0010:[<ffffffffa0064a0c>] <ffffffffa0064a0c>{:lock_dlm:do_dlm_lock+366}
RSP: 0000:00000100300adde8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 00000000ffffffea RCX: 0000000000020000
RDX: 00000000001b2865 RSI: 0000000000000246 RDI: ffffffff803e5600
RBP: 000001003aae3680 R08: 00000000fffffffb R09: 00000000ffffffea
R10: 0000000000000000 R11: 0000000000000000 R12: 0000010037e15e00
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
FS:  0000002a95562b00(0000) GS:ffffffff804ed500(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000000059ba08 CR3: 0000000000101000 CR4: 00000000000006e0
Process lock_dlm1 (pid: 6201, threadinfo 00000100300ac000, task 0000010036a35030)
Stack: 0000000000000005 0000000000000004 3220202020202020 2020202020202020
       3831202020202020 ffffffff80130018 000001003aae3680 0000010037e15e00
       0000000000000000 0000000000000000
Call Trace:<ffffffff80130018>{ia32_setup_arg_pages+422}
<ffffffffa0064f93>{:lock_dlm:process_submit+43}
       <ffffffffa0068ab8>{:lock_dlm:dlm_async+2020}
<ffffffff801346b1>{__wake_up_common+67}
       <ffffffff80134660>{default_wake_function+0}
<ffffffff8014be24>{keventd_create_kthread+0}
       <ffffffffa00682d4>{:lock_dlm:dlm_async+0}
<ffffffff8014be24>{keventd_create_kthread+0}
       <ffffffff8014bdfb>{kthread+200} <ffffffff80110f47>{child_rip+8}
       <ffffffff8014be24>{keventd_create_kthread+0} <ffffffff8014bd33>{kthread+0}
       <ffffffff80110f3f>{child_rip+0}

Code: 0f 0b 08 92 06 a0 ff ff ff ff b0 01 48 c7 c7 0d 92 06 a0 31
RIP <ffffffffa0064a0c>{:lock_dlm:do_dlm_lock+366} RSP <00000100300adde8>
 <3>device-mapper: Unable to read from primary mirror during recovery
 <0>Kernel panic - not syncing: Oops

Comment 5 Jonathan Earl Brassow 2007-02-13 18:51:11 UTC
I propose setting a restriction that mirrors are limited to 2 sides for 4.5.
This would defuse this bug.  Once we agree on that, I'll open an RFE for 4.6 and
make this bug dependent on it.


Comment 6 Jonathan Earl Brassow 2007-02-13 19:45:54 UTC
bug 217895 ?


Comment 7 Jonathan Earl Brassow 2007-02-16 19:17:50 UTC
I found a simpler way to reproduce this issue.

Do I/O on a cmirror from a node other than the server, then 'cman_tool leave
force' on the log server node.


Comment 8 Jonathan Earl Brassow 2007-02-19 16:29:56 UTC
When a log server drops out of the cluster, it used to ignore any requests,
forcing the clients to retry.  Unfortunately, the clients never ran another
election, so operations stalled.  The server now replies that it cannot handle
the requests, which causes the clients to properly initiate an election.
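
A rough sketch of that behavioral change, purely illustrative (the enum and
function names below are hypothetical, not the cmirror source): the difference
is between a request that is silently dropped, which the client can only retry
against the same dead server, and an explicit "can't handle this" reply, which
tells the client to elect a new log server.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical illustration of comment #8; not the cmirror source. */
enum server_reply {
    REPLY_OK,          /* request was handled */
    REPLY_NONE,        /* old behavior: departing server ignores the request */
    REPLY_CANT_HANDLE  /* new behavior: departing server refuses explicitly */
};

/* What a client does with the outcome of a log request. */
static void client_handle(enum server_reply r, bool *need_election)
{
    switch (r) {
    case REPLY_OK:
        printf("request completed\n");
        break;
    case REPLY_NONE:
        /* Pre-fix: keep retrying the same (gone) server; mirror I/O stalls,
         * nodes miss heartbeats and get kicked out, and the GFS/DLM
         * assertions above follow. */
        printf("no reply, retrying\n");
        break;
    case REPLY_CANT_HANDLE:
        /* Post-fix: a definitive refusal triggers a new election so a
         * surviving node can take over as log server. */
        printf("server refused, initiating log-server election\n");
        *need_election = true;
        break;
    }
}

int main(void)
{
    bool need_election = false;

    client_handle(REPLY_NONE, &need_election);        /* old stall path */
    client_handle(REPLY_CANT_HANDLE, &need_election); /* fixed path */
    printf("election needed: %s\n", need_election ? "yes" : "no");
    return 0;
}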


Comment 9 Jonathan Earl Brassow 2007-03-13 15:45:22 UTC
assigned ->needinfo

Comment 10 Jonathan Earl Brassow 2007-03-16 19:22:21 UTC
needinfo -> modified

this should be in MODIFIED

Comment 11 Corey Marthaler 2007-04-12 16:41:27 UTC
Fix verified in latest packages. 

Comment 12 Chris Feist 2008-08-05 21:43:02 UTC
Fixed in current release (4.7).

