Bug 241248
| Summary: | cmirror leg failure + delayed server failure can cause new server to be removed from cluster |
|---|---|
| Product: | [Retired] Red Hat Cluster Suite |
| Component: | cmirror |
| Version: | 4 |
| Hardware: | All |
| OS: | Linux |
| Status: | CLOSED DEFERRED |
| Severity: | medium |
| Priority: | medium |
| Target Milestone: | --- |
| Target Release: | --- |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | Jonathan Earl Brassow <jbrassow> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | agk, dwysocha, jbrassow, prockai, pvrabec |
| Doc Type: | Bug Fix |
| Last Closed: | 2013-09-23 15:31:38 UTC |
Created attachment 155370 [details]: full log from link-02
Created attachment 155371 [details]: full log from link-04
Created attachment 155372 [details]: full log from link-07
Created attachment 155373 [details]: full log from link-08
I hit something like this while running mirror failure scenarios with
helter_skelter. One node panicked.
Scenario: Kill secondary leg of non synced 2 leg mirror(s)
...
Disabling device sda on tank-01
Disabling device sda on tank-02
Disabling device sda on tank-03
Disabling device sda on tank-04
Attempting I/O to cause mirror down conversion(s) on tank-03
10+0 records in
10+0 records out
Didn't receive heartbeat for 120 seconds (machine panicked here)
Output from console log:
scsi0 (0:1): rejecting I/O to offline device
dm-cmirror: Unable to convert nodeid_to_ipaddr in run_election
dm-cmirror: Election processing failed.
dlm: dlm_lock: no lockspace
hed
gfs1 move flags 1,0,0 ids 433,433,433
gfs1 move flags 0,1,0 ids 433,437,433
gfs1 move use event 437
gfs1 recover event 437
gfs1 add node 4
gfs1 total nodes 4
gfs1 rebuild resource directory
gfs1 rebuilt 5 resources
gfs1 purge requests
gfs1 purged 0 requests
gfs1 mark waiting requests
gfs1 marked 0 requests
gfs1 recover event 437 done
gfs1 move flags 0,0,1 ids 433,437,437
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resent 0 requests
gfs1 recover event 437 finished
gfs2 move flags 1,0,0 ids 435,435,435
gfs2 move flags 0,1,0 ids 435,439,435
gfs2 move use event 439
gfs2 recover event 439
gfs2 add node 4
gfs2 total nodes 4
gfs2 rebuild resource directory
gfs2 rebuilt 5 resources
gfs2 purge requests
gfs2 purged 0 requests
gfs2 mark waiting requests
gfs2 marked 0 requests
gfs2 recover event 439 done
gfs2 move flags 0,0,1 ids 435,439,439
gfs2 process held requests
gfs2 processed 0 requests
gfs2 resend marked requests
gfs2 resent 0 requests
gfs2 recover event 439 finished
ne 1
18719 pr_finish flags 1a
18951 unmount flags a
18951 release_mountgroup flags a
18966 unmount flags a
18966 release_mountgroup flags a
19239 pr_start last_stop 0 last_start 371 last_finish 0
19239 pr_start count 3 type 2 event 371 flags 250
19239 claim_jid 2
19239 pr_start 371 done 1
19239 pr_finish flags 5a
19234 recovery_done jid 2 msg 309 a
19234 recovery_done nodeid 3 flg 18
19254 pr_start last_stop 0 last_start 373 last_finish 0
19254 pr_start count 3 type 2 event 373 flags 250
19254 claim_jid 2
19254 pr_start 373 done 1
19254 pr_finish flags 5a
19249 recovery_done jid 2 msg 309 a
19249 recovery_done nodeid 3 flg 18
19239 pr_start last_stop 371 last_start 375 last_finish 371
19239 pr_start count 4 type 2 event 375 flags 21a
19239 pr_start 375 done 1
19239 pr_finish flags 1a
19254 pr_start last_stop 373 last_start 377 last_finish 373
19254 pr_start count 4 type 2 event 377 flags 21a
19254 pr_start 377 done 1
19254 pr_finish flags 1a
19239 pr_start last_stop 375 last_start 384 last_finish 375
19239 pr_start count 3 type 3 event 384 flags 21a
19239 pr_start 384 done 1
19238 pr_finish flags 1a
19254 pr_start last_stop 377 last_start 386 last_finish 377
19254 pr_start count 3 type 3 event 386 flags 21a
19254 pr_start 386 done 1
19253 pr_finish flags 1a
19239 pr_start last_stop 384 last_start 388 last_finish 384
19239 pr_start count 2 type 3 event 388 flags 21a
19239 pr_start 388 done 1
19238 pr_finish flags 1a
19254 pr_start last_stop 386 last_start 390 last_finish 386
19254 pr_start count 2 type 3 event 390 flags 21a
19254 pr_start 390 done 1
19253 pr_finish flags 1a
19536 unmount flags a
19536 release_mountgroup flags a
19551 unmount flags a
19551 release_mountgroup flags a
19758 pr_start last_stop 0 last_start 403 last_finish 0
19758 pr_start count 3 type 2 event 403 flags 250
19758 claim_jid 2
19758 pr_start 403 done 1
19758 pr_finish flags 5a
19753 recovery_done jid 2 msg 309 a
19753 recovery_done nodeid 3 flg 18
19773 pr_start last_stop 0 last_start 405 last_finish 0
19773 pr_start count 3 type 2 event 405 flags 250
19773 claim_jid 2
19773 pr_start 405 done 1
19773 pr_finish flags 5a
19768 recovery_done jid 2 msg 309 a
19768 recovery_done nodeid 3 flg 18
19758 pr_start last_stop 403 last_start 407 last_finish 403
19758 pr_start count 4 type 2 event 407 flags 21a
19758 pr_start 407 done 1
19758 pr_finish flags 1a
19772 pr_start last_stop 405 last_start 409 last_finish 405
19772 pr_start count 4 type 2 event 409 flags 21a
19772 pr_start 409 done 1
19772 pr_finish flags 1a
19758 pr_start last_stop 407 last_start 415 last_finish 407
19758 pr_start count 3 type 3 event 415 flags 21a
19758 pr_start 415 done 1
19757 pr_finish flags 1a
19773 pr_start last_stop 409 last_start 417 last_finish 409
19773 pr_start count 3 type 3 event 417 flags 21a
19773 pr_start 417 done 1
19772 pr_finish flags 1a
19758 pr_start last_stop 415 last_start 419 last_finish 415
19758 pr_start count 2 type 3 event 419 flags 21a
19758 pr_start 419 done 1
19757 pr_finish flags 1a
19773 pr_start last_stop 417 last_start 421 last_finish 417
19773 pr_start count 2 type 3 event 421 flags 21a
19773 pr_start 421 done 1
19772 pr_finish flags 1a
19918 unmount flags a
19918 release_mountgroup flags a
19933 unmount flags a
19933 release_mountgroup flags a
20184 pr_start last_stop 0 last_start 434 last_finish 0
20184 pr_start count 3 type 2 event 434 flags 250
20184 claim_jid 2
20184 pr_start 434 done 1
20184 pr_finish flags 5a
20179 recovery_done jid 2 msg 309 a
20179 recovery_done nodeid 3 flg 18
20208 pr_start last_stop 0 last_start 436 last_finish 0
20208 pr_start count 3 type 2 event 436 flags 250
20208 claim_jid 2
20208 pr_start 436 done 1
20208 pr_finish flags 5a
20194 recovery_done jid 2 msg 309 a
20194 recovery_done nodeid 3 flg 18
20183 pr_start last_stop 434 last_start 438 last_finish 434
20183 pr_start count 4 type 2 event 438 flags 21a
20183 pr_start 438 done 1
20183 pr_finish flags 1a
20207 pr_start last_stop 436 last_start 440 last_finish 436
20207 pr_start count 4 type 2 event 440 flags 21a
20207 pr_start 440 done 1
20207 pr_finish flags 1a
lock_dlm: Assertion failed on line 432 of file
/builddir/build/BUILD/gfs-kernel-2.6.9-75/hugemem/src/dlm/lock.c
lock_dlm: assertion: "!error"
lock_dlm: time = 12916708
gfs2: num=8,0 err=-22 cur=-1 req=5 lkf=8
------------[ cut here ]------------
kernel BUG at /builddir/build/BUILD/gfs-kernel-2.6.9-75/hugemem/src/dlm/lock.c:432!
invalid operand: 0000 [#1]
SMP
Modules linked in: lock_dlm(U) dm_cmirror(U) gnbd(U) lock_nolock(U) gfs(U)
lock_harness(U) dlm(U) cman(U) md5 ipv6 parport_pc lp parport autofs4 i2c_dev
i2c_core sunrpc button battery ac uhci_hcd hw_random e1000 floppy dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod qla2300 ata_piix libata qla2xxx
scsi_transport_fc sd_mod scsi_mod
CPU: 1
EIP: 0060:[<82be4798>] Not tainted VLI
EFLAGS: 00010212 (2.6.9-67.ELhugemem)
EIP is at do_dlm_lock+0x134/0x14e [lock_dlm]
eax: 00000001 ebx: ffffffea ecx: 7c530da4 edx: 82be9287
esi: 82be47b7 edi: 8100f200 ebp: 39e93a00 esp: 7c530da0
ds: 007b es: 007b ss: 0068
Process gfs_quotad (pid: 20213, threadinfo=7c530000 task=75584230)
Stack: 82be9287 20202020 38202020 20202020 20202020 20202020 30202020 00000018
00000000 39e93a00 00000001 00000000 39e93a00 82be4847 00000005 82becc20
82c42000 82c11aea 00000000 00000001 79f637c8 79f637ac 82c42000 82c07922
Call Trace:
[<82be4847>] lm_dlm_lock+0x49/0x52 [lock_dlm]
[<82c11aea>] gfs_lm_lock+0x35/0x4d [gfs]
[<82c07922>] gfs_glock_xmote_th+0x130/0x172 [gfs]
[<82c06fe1>] rq_promote+0xc8/0x147 [gfs]
[<82c071cd>] run_queue+0x91/0xc1 [gfs]
[<82c081dd>] gfs_glock_nq+0xcf/0x116 [gfs]
[<82c087b3>] gfs_glock_nq_init+0x13/0x26 [gfs]
[<82c23319>] do_quota_sync+0xb0/0x579 [gfs]
[<0211df75>] find_busiest_group+0xdd/0x295
[<022ce3a0>] schedule+0x850/0x8ee
[<022ce3d0>] schedule+0x880/0x8ee
[<022cea89>] __cond_resched+0x14/0x39
[<82c22bfe>] quota_find+0xe3/0x102 [gfs]
[<82c2416c>] gfs_quota_sync+0x9c/0xfb [gfs]
[<82bfc18c>] gfs_quotad+0xf5/0x1c7 [gfs]
[<82bfc097>] gfs_quotad+0x0/0x1c7 [gfs]
[<021041f5>] kernel_thread_helper+0x5/0xb
Code: 26 50 0f bf 45 24 50 53 ff 75 08 ff 75 04 ff 75 0c ff 77 18 68 dd 93 be 82
e8 86 e1 53 7f 83 c4 38 68 87 92 be 82 e8 79 e1 53 7f <0f> 0b b0 01 cb 91 be 82
68 89 92 be 82 e8 d1 d8 53 7f 83 c4 20
<0>Fatal exception: panic in 5 seconds
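For reference, the failure-injection sequence the harness performs above ("Disabling device sda" on each tank node, then I/O to force the down-conversion) can be approximated with a small shell sketch. This is not part of the original report: the node names and device come from the test output, but the sysfs "offline" write and the target LV path are assumptions, and DRY_RUN guards it so it only prints by default.

```shell
#!/bin/sh
# Hedged sketch of the failure injection above. Node names (tank-0N) and the
# device (sda) come from the test output; the sysfs offline mechanism and the
# dd target volume are assumptions, not taken from the bug report.
# DRY_RUN=1 (the default) only prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# Step 1: take the mirror leg's device offline on every node.
for node in tank-01 tank-02 tank-03 tank-04; do
    run ssh "$node" "echo offline > /sys/block/sda/device/state"
done

# Step 2: write to the mirror (hypothetical LV path) so device-mapper notices
# the failed leg and starts the down-conversion.
run dd if=/dev/zero of=/dev/mirror_vg/mirror_lv bs=1M count=10
```

Only with DRY_RUN=0, on disposable test nodes, would this actually execute the destructive steps.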
The Red Hat Cluster Suite product is past end-of-life; closing.
Description of problem:

I had been running 'cmirror_lock_stress', which does cmirror locking operations (creates/deletes, activates/deactivates, up/down conversions). While doing so, I failed one of the legs in the cmirror (/dev/sdh1), wrote to that mirror to start the down conversion, and then followed that by failing the current cmirror server (link-08). Link-04 appeared to become the next server but was quickly removed from the remaining cluster due to a loss of heartbeating. The result was a deadlocked cluster.

Here's the current status of the mirror:

[root@link-02 ~]# dmsetup ls
lock_stress-link--07.26410_mlog      (253, 2)
lock_stress-link--07.26410           (253, 7)
lock_stress-link--07.26410_mimage_3  (253, 6)
lock_stress-link--07.26410_mimage_2  (253, 5)
lock_stress-link--07.26410_mimage_1  (253, 4)
lock_stress-link--07.26410_mimage_0  (253, 3)
VolGroup00-LogVol01                  (253, 1)
VolGroup00-LogVol00                  (253, 0)

[root@link-02 ~]# dmsetup info
Name:              lock_stress-link--07.26410_mlog
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 2
Number of targets: 1
UUID: LVM-rRf2kwFdXA2tYF0WVAkkwbHFhTtzweLFeY5Ts0F6GCLLqB0SmwfeRi4K2dHqGlmO

Name:              lock_stress-link--07.26410
State:             ACTIVE
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: LVM-rRf2kwFdXA2tYF0WVAkkwbHFhTtzweLFGmcwO7hZa3OhO8PI9JmWDucOSEWgH3v1

Name:              lock_stress-link--07.26410_mimage_3
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 6
Number of targets: 1
UUID: LVM-rRf2kwFdXA2tYF0WVAkkwbHFhTtzweLFzdQl2zgcZWecWXU8pacQbhB5j0MhMhsn

Name:              lock_stress-link--07.26410_mimage_2
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 1
UUID: LVM-rRf2kwFdXA2tYF0WVAkkwbHFhTtzweLFeX0e4psYETVq5PTmfq1gigVVNtmD1Gc2

Name:              lock_stress-link--07.26410_mimage_1
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 4
Number of targets: 1
UUID: LVM-rRf2kwFdXA2tYF0WVAkkwbHFhTtzweLFtzT7pq4WlZjk0dly6eY2uluW6xUyPPh1

Name:              lock_stress-link--07.26410_mimage_0
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 3
Number of targets: 1
UUID: LVM-rRf2kwFdXA2tYF0WVAkkwbHFhTtzweLF19HiQwRawcwxeECRbmdiJkPqnssfvfix

Name:              VolGroup00-LogVol01
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 1
Number of targets: 1
UUID: LVM-dq1liKVsB8CzZiuNRtOF1tYTkXqgKO85b28r2eyrPqurvpuPPS83R8fcaG7QULHG

Name:              VolGroup00-LogVol00
State:             ACTIVE
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-dq1liKVsB8CzZiuNRtOF1tYTkXqgKO85W9BzgAV2x8i4cT3pKw0TArv1HJm54Tu6

It appears that there were issues just before failing link-08, after the initial device failure:

LINK-08:
May 24 10:39:25 link-08 qarshd[9832]: Running cmdline: vgremove lock_stress
May 24 10:39:28 link-08 qarshd[9832]: Nothing to do
May 24 10:39:40 link-08 last message repeated 4 times
May 24 10:39:40 link-08 dhclient: receive_packet failed on eth0: Network is down
May 24 10:39:40 link-08 kernel: md: stopping all md devices.
May 24 10:39:40 link-08 kernel: md: md0 switched to read-only mode.
May 24 10:39:41 link-08 kernel: Synchronizing SCSI cache for disk sdh:
May 24 10:39:43 link-08 qarshd[9832]: Nothing to do
May 24 10:39:44 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:39:46 link-08 qarshd[9832]: Nothing to do
May 24 10:39:49 link-08 qarshd[9832]: Nothing to do
May 24 10:39:49 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:39:52 link-08 qarshd[9832]: Nothing to do
May 24 10:39:54 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:39:55 link-08 qarshd[9832]: Nothing to do
May 24 10:39:58 link-08 qarshd[9832]: Nothing to do
May 24 10:39:59 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:39:59 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:39:59 link-08 kernel: CMAN: removing node link-02 from the cluster : Missed too many heartbeats
May 24 10:39:59 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:40:00 link-08 kernel: CMAN: resend failed: -101
May 24 10:40:04 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:40:04 link-08 kernel: CMAN: No functional network interfaces, leaving cluster
May 24 10:40:04 link-08 kernel: CMAN: sendmsg failed: -101
May 24 10:40:04 link-08 kernel: CMAN: we are leaving the cluster.
May 24 10:40:04 link-08 kernel: WARNING: dlm_emergency_shutdown
May 24 10:40:04 link-08 kernel: WARNING: dlm_emergency_shutdown
May 24 10:40:04 link-08 ccsd[5172]: Cluster manager shutdown. Attemping to reconnect...
May 24 10:40:04 link-08 kernel: SM: 00000005 sm_stop: SG still joined
May 24 10:40:04 link-08 kernel: SM: 0100000a sm_stop: SG still joined
May 24 10:40:08 link-08 kernel: SCSI error : <2 0 1 1> return code = 0x10000
May 24 10:40:08 link-08 kernel: end_request: I/O error, dev sdh, sector 330817
[device errors]

Link-08 is killed

LINK-04:
May 24 10:37:11 link-04 kernel: dm-cmirror: Node joining
May 24 10:37:55 link-04 kernel: CMAN: node link-08 has been removed from the cluster : Missed too many heartbeats
May 24 10:38:26 link-04 fenced[5048]: link-08 not a cluster member after 30 sec post_fail_delay
May 24 10:38:26 link-04 fenced[5048]: fencing node "link-08"
May 24 10:38:26 link-04 fenced[5048]: fence "link-08" success
May 24 10:38:36 link-04 kernel: dm-cmirror: A cluster mirror log member has failed.
May 24 10:38:36 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (2dHqGlmO)
May 24 10:38:36 link-04 kernel: dm-cmirror: starter : 2
May 24 10:38:36 link-04 kernel: dm-cmirror: co-ordinator: 2
May 24 10:38:36 link-04 kernel: dm-cmirror: node_count : 2
May 24 10:38:36 link-04 kernel: dm-cmirror: LRT_SELECTION(11): (2dHqGlmO)
May 24 10:38:36 link-04 kernel: dm-cmirror: starter : 2
May 24 10:38:36 link-04 kernel: dm-cmirror: co-ordinator: 1
May 24 10:38:36 link-04 kernel: dm-cmirror: node_count : 2
May 24 10:38:36 link-04 kernel: dm-cmirror: LRT_MASTER_ASSIGN(12): (2dHqGlmO)
May 24 10:38:36 link-04 kernel: dm-cmirror: starter : 2
May 24 10:38:36 link-04 kernel: dm-cmirror: co-ordinator: 1
May 24 10:38:36 link-04 kernel: dm-cmirror: node_count : 1
May 24 10:38:36 link-04 kernel: dm-cmirror: I'm the cluster mirror log server for 2dHqGlmO
May 24 10:38:36 link-04 kernel: dm-cmirror: Disk Resume:: 2dHqGlmO (active)
May 24 10:38:36 link-04 kernel: dm-cmirror: Live nodes :: 3
May 24 10:38:36 link-04 kernel: dm-cmirror: In-Use Regions :: 0
May 24 10:38:36 link-04 kernel: dm-cmirror: Good IUR's :: 0
May 24 10:38:36 link-04 kernel: dm-cmirror: Bad IUR's :: 0
May 24 10:38:36 link-04 kernel: dm-cmirror: Sync count :: 307
May 24 10:38:36 link-04 kernel: dm-cmirror: Disk Region count :: 1000
May 24 10:38:36 link-04 kernel: dm-cmirror: Region count :: 1000
May 24 10:38:36 link-04 kernel: dm-cmirror: Marked regions::
May 24 10:38:36 link-04 kernel: dm-cmirror: 307 - 1000
May 24 10:38:36 link-04 kernel: dm-cmirror: Total = 693
May 24 10:38:36 link-04 kernel: dm-cmirror: Out-of-sync regions::
May 24 10:38:36 link-04 kernel: dm-cmirror: 307 - 1000
May 24 10:38:36 link-04 kernel: dm-cmirror: Total = 693
May 24 10:38:36 link-04 kernel: dm-cmirror: Assigning recovery work to 2/2dHqGlmO: 307
May 24 10:38:37 link-04 kernel: dm-cmirror: Setting recovering region out-of-sync: 307/2dHqGlmO/2
May 24 10:38:37 link-04 kernel: dm-cmirror: Assigning recovery work to 2/2dHqGlmO: 308
May 24 10:38:38 link-04 kernel: dm-cmirror: Setting recovering region out-of-sync: 308/2dHqGlmO/2
May 24 10:38:38 link-04 kernel: dm-cmirror: Assigning recovery work to 2/2dHqGlmO: 309
[...]
May 24 10:38:50 link-04 kernel: dm-cmirror: Assigning recovery work to 2/2dHqGlmO: 998
May 24 10:38:50 link-04 kernel: dm-cmirror: Setting recovering region out-of-sync: 998/2dHqGlmO/2
May 24 10:38:50 link-04 kernel: dm-cmirror: Assigning recovery work to 2/2dHqGlmO: 999
May 24 10:38:50 link-04 kernel: dm-cmirror: Setting recovering region out-of-sync: 999/2dHqGlmO/2
May 24 10:39:34 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (2dHqGlmO)
May 24 10:39:34 link-04 kernel: dm-cmirror: starter : 3
May 24 10:39:34 link-04 kernel: dm-cmirror: co-ordinator: 3
May 24 10:39:34 link-04 kernel: dm-cmirror: node_count : 1
May 24 10:39:34 link-04 kernel: SCSI error : <1 0 1 1> return code = 0x10000
May 24 10:39:34 link-04 kernel: end_request: I/O error, dev sdh, sector 16449
May 24 10:39:34 link-04 kernel: dm-cmirror: server_id=1, server_valid=0, 2dHqGlmO
May 24 10:39:34 link-04 kernel: dm-cmirror: trigger = LRT_GET_RESYNC_WORK
May 24 10:39:34 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (2dHqGlmO)
May 24 10:39:34 link-04 kernel: dm-cmirror: starter : 1
May 24 10:39:34 link-04 kernel: dm-cmirror: co-ordinator: 1
May 24 10:39:34 link-04 kernel: dm-cmirror: node_count : 0
May 24 10:39:34 link-04 kernel: SCSI error : <1 0 1 1> return code = 0x10000
May 24 10:39:34 link-04 kernel: end_request: I/O error, dev sdh, sector 16449
May 24 10:39:34 link-04 kernel: device-mapper: Primary mirror device has failed while mirror is out of sync.
May 24 10:39:34 link-04 kernel: device-mapper: Unable to choose alternative primary device
May 24 10:39:34 link-04 kernel: device-mapper: Read failure on mirror: Failing I/O.
May 24 10:39:34 link-04 kernel: device-mapper: A read failure occurred on a mirror device.
May 24 10:39:34 link-04 kernel: device-mapper: Unable to retry read.
May 24 10:39:34 link-04 kernel: SCSI error : <1 0 1 1> return code = 0x10000
May 24 10:39:34 link-04 kernel: end_request: I/O error, dev sdh, sector 16065
[...]
May 24 10:40:15 link-04 kernel: SCSI error : <1 0 1 1> return code = 0x10000
May 24 10:40:15 link-04 kernel: end_request: I/O error, dev sdh, sector 16553
May 24 10:41:06 link-04 lvm[16520]: Mirror device, 253:3, has failed.
May 24 10:41:06 link-04 lvm[16520]: Device failure in lock_stress-link--07.26410
May 24 10:41:06 link-04 lvm[16520]: Parsing: vgreduce --config devices{ignore_suspended_devices=1} --removemissing lock_stress
May 24 10:41:06 link-04 lvm[16520]: Reloading config files
[...]
May 24 10:41:12 link-04 kernel: CMAN: Being told to leave the cluster by node 2
May 24 10:41:12 link-04 kernel: CMAN: we are leaving the cluster.
May 24 10:41:12 link-04 kernel: WARNING: dlm_emergency_shutdown
May 24 10:41:12 link-04 kernel: WARNING: dlm_emergency_shutdown
May 24 10:41:12 link-04 kernel: SM: 00000005 sm_stop: SG still joined
May 24 10:41:12 link-04 kernel: SM: 0100000a sm_stop: SG still joined
May 24 10:41:12 link-04 kernel: cdrom: open failed.
May 24 10:41:12 link-04 kernel: SCSI error : <1 0 1 1> return code = 0x10000
May 24 10:41:12 link-04 kernel: end_request: I/O error, dev sdh, sector 1040321
May 24 10:41:12 link-04 kernel: SCSI error : <1 0 1 1> return code = 0x10000
May 24 10:41:12 link-04 kernel: end_request: I/O error, dev sdh, sector 16449
May 24 10:41:15 link-04 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 2280603 (1%)
May 24 10:41:15 link-04 kernel: dm-cmirror: Last request:
May 24 10:41:15 link-04 kernel: dm-cmirror: - my_id :: 1
May 24 10:41:15 link-04 kernel: dm-cmirror: - server :: 1
May 24 10:41:15 link-04 kernel: dm-cmirror: - log uuid:: 2dHqGlmO (active)
May 24 10:41:15 link-04 kernel: dm-cmirror: - request :: LRT_GET_RESYNC_WORK
May 24 10:41:15 link-04 kernel: dm-cmirror: - error :: -6
May 24 10:41:15 link-04 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
May 24 10:41:15 link-04 kernel: dm-cmirror: server_id=dead, server_valid=1, 2dHqGlmO
May 24 10:41:15 link-04 kernel: dm-cmirror: trigger = LRT_GET_RESYNC_WORK
May 24 10:41:15 link-04 kernel: dm-cmirror: Unable to convert nodeid_to_ipaddr in run_election
May 24 10:41:15 link-04 kernel: dm-cmirror: LRT_ELECTION(10): (2dHqGlmO)
May 24 10:41:15 link-04 kernel: dm-cmirror: starter : 1
May 24 10:41:15 link-04 kernel: dm-cmirror: co-ordinator: 1
May 24 10:41:15 link-04 kernel: dm-cmirror: node_count : 0
May 24 10:41:15 link-04 kernel: dm-cmirror: Election processing failed.
May 24 10:41:15 link-04 kernel: dm-cmirror: process_log_request:: failed
May 24 10:41:22 link-04 ccsd[4906]: Unable to connect to cluster infrastructure after 30 seconds.
May 24 10:41:35 link-04 kernel: dm-cmirror: Failed to receive election results from server: (2dHqGlmO,-110)

LINK-07:
May 24 10:41:47 link-07 kernel: end_request: I/O error, dev sdh, sector 16065
May 24 10:42:33 link-07 kernel: SCSI error : <2 0 1 1> return code = 0x10000
May 24 10:42:33 link-07 kernel: end_request: I/O error, dev sdh, sector 16449
May 24 10:42:48 link-07 kernel: dm-cmirror: Error listening for server(1) response for 2dHqGlmO: -110
May 24 10:42:50 link-07 kernel: CMAN: removing node link-04 from the cluster : Missed too many heartbeats
May 24 10:42:51 link-07 kernel: CMAN: quorum lost, blocking activity
May 24 10:42:53 link-07 ccsd[5343]: Cluster is not quorate. Refusing connection.
May 24 10:42:53 link-07 ccsd[5343]: Error while processing connect: Connection refused
May 24 10:42:58 link-07 ccsd[5343]: Cluster is not quorate. Refusing connection.
May 24 10:42:58 link-07 ccsd[5343]: Error while processing connect: Connection refused
May 24 10:43:03 link-07 kernel: dm-cmirror: Error listening for server(1) response for 2dHqGlmO: -110
May 24 10:43:03 link-07 ccsd[5343]: Cluster is not quorate. Refusing connection.
May 24 10:43:03 link-07 ccsd[5343]: Error while processing connect: Connection refused
May 24 10:43:08 link-07 ccsd[5343]: Cluster is not quorate. Refusing connection.
May 24 10:43:08 link-07 ccsd[5343]: Error while processing connect: Connection refused
May 24 10:43:14 link-07 ccsd[5343]: Cluster is not quorate. Refusing connection.
May 24 10:43:14 link-07 ccsd[5343]: Error while processing connect: Connection refused
May 24 10:43:18 link-07 kernel: CMAN: node link-02 has been removed from the cluster : Missed too many heartbeats
May 24 10:43:19 link-07 ccsd[5343]: Cluster is not quorate. Refusing connection.

Version-Release number of selected component (if applicable):
2.6.9-55.ELlargesmp
cmirror-kernel-2.6.9-32.0