Description of problem:
I was attempting cmirror failed leg recovery on my 4 node x86_64 cluster and hit this issue.

[root@link-08 ~]# lvs -a -o +devices
  LV                 VG Attr   LSize Origin Snap% Move Log          Copy% Devices
  mirror2            vg mwi-a- 3.00G             mirror2_mlog 11.98 mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg iwi-ao 3.00G                                /dev/sda1(0)
  [mirror2_mimage_1] vg iwi-ao 3.00G                                /dev/sdh1(0)
  [mirror2_mlog]     vg lwi-ao 4.00M                                /dev/sdg1(0)

I then failed /dev/sdh and attempted an lvs command on link-08. It appeared to be stuck, but then I realized that the node was being fenced by link-07.

[root@link-08 ~]# lvs -a -o +devices
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 3221159936: Input/output error
  [...]
  Couldn't find device with uuid 'ogdhvU-2v5N-lxmG-YK1w-bB4R-TY3J-v91XI9'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
Read from remote host link-08: Connection reset by peer
Connection to link-08 closed.

I then attempted an lvs on link-07 and it hung.

[root@link-07 ~]# lvs -a -o +devices
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error

I have not yet tried to write to the mirror device, so conversion hasn't taken place yet. (A condensed sketch of these reproduction steps follows the version info below.)

Here are the console messages from the remaining three "up" nodes:

link-07:
[...]
Jan 29 15:42:04 link-07 kernel: SCSI error : <2 0 1 1> return code = 0x10000
Jan 29 15:42:04 link-07 kernel: end_request: I/O error, dev sdh, sector 5367353
Jan 29 15:42:04 link-07 kernel: device-mapper: Write error during recovery (error = 0x1)
Jan 29 15:42:04 link-07 kernel: device-mapper: incrementing error_count on 253:4
Jan 29 15:42:04 link-07 kernel: device-mapper: recovery failed on region 5225
Jan 29 15:42:04 link-07 kernel: SCSI error : <2 0 1 1> return code = 0x10000
Jan 29 15:42:04 link-07 kernel: end_request: I/O error, dev sdh, sector 5371969
[...]
Jan 29 15:42:05 link-07 kernel: device-mapper: Write error during recovery (error = 0x1)
Jan 29 15:42:05 link-07 kernel: device-mapper: incrementing error_count on 253:4
Jan 29 15:42:05 link-07 kernel: device-mapper: recovery failed on region 5237
Jan 29 15:42:05 link-07 kernel: dm-cmirror: Error while listening for server response: -110
Jan 29 15:42:20 link-07 last message repeated 3 times
Jan 29 15:42:20 link-07 kernel: CMAN: node link-08 has been removed from the cluster : Missed too many heartbeats
Jan 29 15:42:25 link-07 kernel: dm-cmirror: Error while listening for server response: -110
Jan 29 15:42:36 link-07 kernel: CMAN: node link-02 has been removed from the cluster : Missed too many heartbeats
Jan 29 15:42:51 link-07 fenced[4710]: link-08 not a cluster member after 30 sec post_fail_delay
Jan 29 15:42:51 link-07 fenced[4710]: fencing node "link-08"
Jan 29 15:42:56 link-07 fenced[4710]: fence "link-08" success

link-02:
[...]
Jan 29 10:09:59 link-02 kernel: end_request: I/O error, dev sdh, sector 5378609
Jan 29 10:09:59 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jan 29 10:09:59 link-02 kernel: end_request: I/O error, dev sdh, sector 5378617
Jan 29 10:09:59 link-02 kernel: device-mapper: Write error during recovery (error = 0x1)
Jan 29 10:09:59 link-02 kernel: device-mapper: incrementing error_count on 253:4
Jan 29 10:09:59 link-02 kernel: device-mapper: recovery failed on region 5236
Jan 29 10:10:04 link-02 kernel: dm-cmirror: Error while listening for server response: -110
Jan 29 10:10:19 link-02 last message repeated 3 times
Jan 29 10:10:19 link-02 kernel: CMAN: removing node link-08 from the cluster : Missed too many heartbeats
Jan 29 10:10:20 link-02 fenced[4454]: fencing deferred to link-07
Jan 29 10:10:24 link-02 kernel: dm-cmirror: Error while listening for server response: -110
Jan 29 10:10:35 link-02 kernel: CMAN: node link-02 has been removed from the cluster : Missed too many heartbeats
Jan 29 10:10:35 link-02 kernel: CMAN: killed by NODEDOWN message
Jan 29 10:10:35 link-02 kernel: CMAN: we are leaving the cluster. Missed too many heartbeats
Jan 29 10:10:35 link-02 kernel: WARNING: dlm_emergency_shutdown
Jan 29 10:10:35 link-02 ccsd[4392]: Cluster manager shutdown. Attemping to reconnect...
Jan 29 10:10:35 link-02 kernel: WARNING: dlm_emergency_shutdown
Jan 29 10:10:35 link-02 kernel: SM: 00000003 sm_stop: SG still joined
Jan 29 10:10:35 link-02 kernel: SM: 01000005 sm_stop: SG still joined
Jan 29 10:10:56 link-02 ccsd[4392]: Unable to connect to cluster infrastructure after 30 seconds.
Jan 29 10:11:26 link-02 ccsd[4392]: Unable to connect to cluster infrastructure after 60 seconds.
Jan 29 10:11:56 link-02 ccsd[4392]: Unable to connect to cluster infrastructure after 90 seconds.
Jan 29 10:12:26 link-02 ccsd[4392]: Unable to connect to cluster infrastructure after 120 seconds.

link-04:
[...]
Jan 29 10:37:15 link-04 kernel: end_request: I/O error, dev sdh, sector 5381169
Jan 29 10:37:15 link-04 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Jan 29 10:37:15 link-04 kernel: end_request: I/O error, dev sdh, sector 5381177
Jan 29 10:37:15 link-04 kernel: device-mapper: Write error during recovery (error = 0x1)
Jan 29 10:37:15 link-04 kernel: device-mapper: incrementing error_count on 253:4
Jan 29 10:37:15 link-04 kernel: device-mapper: recovery failed on region 5238
Jan 29 10:37:15 link-04 kernel: dm-cmirror: Error while listening for server response: -110
Jan 29 10:37:30 link-04 last message repeated 3 times
Jan 29 10:37:31 link-04 kernel: CMAN: node link-08 has been removed from the cluster : Missed too many heartbeats
Jan 29 10:37:32 link-04 fenced[4270]: fencing deferred to link-07
dm-cmirror: Error while listening for server response: -110
Jan 29 10:37:35 link-04 kernel: dm-cmirror: Error while listening for server response: -110
Jan 29 10:37:46 link-04 kernel: CMAN: node link-02 has been removed from the cluster : Missed too many heartbeats

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.20-1.el4
lvm2-2.02.20-1.el4
device-mapper-1.02.16-1.el4
2.6.9-43.ELsmp
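For anyone trying to reproduce this, here is a condensed sketch of the sequence above. The lvcreate form is modeled on the ones used in the later test runs, and the sysfs offline step is an assumption: the report doesn't say exactly how /dev/sdh was failed, only that it started returning I/O errors.

  # create a 2-way cmirror with a disk log (names/PVs as in this report)
  lvcreate -m 1 -n mirror2 -L 3G vg /dev/sda1 /dev/sdh1 /dev/sdg1

  # fail the secondary leg; offlining it at the SCSI layer is one way
  # (assumption: any method that makes sdh return I/O errors should do)
  echo offline > /sys/block/sdh/device/state

  # then run lvs from another node and watch for the hang / fencing
  lvs -a -o +devices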
The root cause of this may also be behind bug 223865...
FYI - I reproduced this today.
With 45.EL?
2.6.9-43.ELsmp
RHEL4-U5-re20070129.nightly
54.EL was built 30-Jan-2007 19:30
Did you have any GFS I/O happening?
There was no GFS I/O happening in this case.
I have also now reproduced this with GFS I/O running to the cmirrors. This should be a blocker.

[...]
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: fatal: filesystem consistency error
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: function = find_good_lh
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: file = /builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/gfs/recovery.c, line = 179
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: time = 1170858357
Feb 7 08:25:57 link-02 lvm[2692]: WARNING: dev_open(/dev/ram11) called while suspended
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: about to withdraw from the cluster
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: waiting for outstanding I/O
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: telling LM to withdraw
Feb 7 08:25:57 link-02 lvm[2692]: WARNING: dev_open(/dev/ram11) called while suspended
Feb 7 08:25:57 link-02 lvm[2692]: WARNING: dev_open(/dev/ram12) called while suspended
[...]
Feb 7 08:25:58 link-02 lvm[2692]: WARNING: dev_open(/dev/sdh2) called while suspended
Feb 7 08:25:58 link-02 lvm[2692]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Feb 7 08:25:58 link-02 lvm[2692]: Failed to remove faulty devices in vg-mirror
lock_dlm: withdraw abandoned memory
GFS: fsid=LINK_128:gfs1.0: withdrawn
GFS: fsid=LINK_128:gfs1.0: jid=1: Failed
Feb 7 08:26:05 link-02 kernel: lock_dlm: withdraw abandoned memory
Feb 7 08:26:05 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: withdrawn
Feb 7 08:26:05 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: jid=1: Failed
Feb 7 08:26:12 link-02 kernel: CMAN: node link-08 has been removed from the cluster : Shutdown
Feb 7 08:26:44 link-02 fenced[3349]: link-08 not a cluster member after 30 sec post_fail_delay
Feb 7 08:26:44 link-02 fenced[3349]: fencing node "link-08"
Feb 7 08:27:05 link-02 fenced[3349]: fence "link-08" success
Feb 7 08:27:11 link-02 clurgmgrd[3777]: <info> Magma Event: Membership Change
Feb 7 08:27:11 link-02 clurgmgrd[3777]: <info> State change: link-08 DOWN
Feb 7 08:29:20 link-02 kernel: CMAN: node link-08 rejoining
I've fixed cmirror's interactions with cman in a way that should also fix this bug. However, be careful when testing the fix: if you kill the primary device while the mirror is not yet in sync, the mirror will return I/O errors (which it is supposed to do), resulting in a GFS withdrawal.
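Since that caveat depends on sync state, it's worth checking Copy% before deciding which leg to kill. A minimal sketch using the same copy_percent field that lvs shows in comment 0 (the polling loop itself is illustrative, not part of any test harness):

  # wait until the mirror is fully synced before failing the primary leg
  while true; do
      pct=$(lvs --noheadings -o copy_percent vg/mirror2 | tr -d ' ')
      [ "$pct" = "100.00" ] && break
      sleep 5
  done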
Reproduced this bug today while running mirror failure testing... (a sketch of the device-disable step follows the logs below)

================================================================================
Senario: Kill secondary leg of non synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    fail_secondary_non_synced_3_legs_2_mirrors
* sync:    0
* mirrors: 2
* disklog: 1
* failpv:  /dev/sdg1
* legs:    3
* pvs:     /dev/sdf1 /dev/sdg1 /dev/sde1 /dev/sda1
************************************************

Creating mirror(s) on link-02...
qarsh root@link-02 lvcreate -m 2 -n fail_secondary_non_synced_3_legs_2_mirrors_1 -L 800M helter_skelter /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sda1:0-150
qarsh root@link-02 lvcreate -m 2 -n fail_secondary_non_synced_3_legs_2_mirrors_2 -L 800M helter_skelter /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sda1:0-150

mirror(s) currently at 0.00% synced right now

Creating gfs on top of mirror(s) on link-02...
Mounting mirror gfs filesystems on link-02...
Mounting mirror gfs filesystems on link-07...
Mounting mirror gfs filesystems on link-08...

Writing files to mirror(s) on...
link-02:
checkit starting with:
CREATE
Num files: 100
Random Seed: 22014
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_1
Working dir: /mnt/fail_secondary_non_synced_3_legs_2_mirrors_1
checkit starting with:
CREATE
Num files: 100
Random Seed: 22021
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_2
Working dir: /mnt/fail_secondary_non_synced_3_legs_2_mirrors_2
link-07:
checkit starting with:
CREATE
Num files: 100
Random Seed: 20593
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_1
Working dir: /mnt/fail_secondary_non_synced_3_legs_2_mirrors_1
checkit starting with:
CREATE
Num files: 100
Random Seed: 20595
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_2
Working dir: /mnt/fail_secondary_non_synced_3_legs_2_mirrors_2
link-08:
checkit starting with:
CREATE
Num files: 100
Random Seed: 21565
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_1
Working dir: /mnt/fail_secondary_non_synced_3_legs_2_mirrors_1
checkit starting with:
CREATE
Num files: 100
Random Seed: 21567
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_2
Working dir: /mnt/fail_secondary_non_synced_3_legs_2_mirrors_2

Disabling device sdg on link-02
Disabling device sdg on link-07
Disabling device sdg on link-08

Attempting I/O to cause mirror down conversion(s) on link-02
10+0 records in
10+0 records out

Didn't receive heartbeat for 120 seconds

Verifying the down conversion from mirror(s) to linear(s)
Could not connect to remote host
down conversion didn't appear to work as a simple lvs fails

link-02 was the node to get fenced:
Aug 14 18:42:03 link-02 lvm[7190]: Device failure in helter_skelter-fail_secondary_non_synced_3_legs_2_mirrors_2
Aug 14 18:42:03 link-02 lvm[7190]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Aug 14 18:42:03 link-02 kernel: scsi0 (0:7): rejecting I/O to offline device
Aug 14 18:42:03 link-02 last message repeated 1742 times
Aug 14 18:42:03 link-02 kernel: CMAN: node link-02 has been removed from the cluster : No response to messages
Aug 14 18:42:03 link-02 kernel: CMAN: killed by NODEDOWN message
Aug 14 18:42:03 link-02 kernel: CMAN: we are leaving the cluster. No response to messages
Aug 14 18:42:03 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 14 18:42:03 link-02 kernel: scsi0 (0:7): rejecting I/O to offline device
Aug 14 18:42:03 link-02 last message repeated 21 times
Aug 14 18:42:04 link-02 lvm[7190]: Another thread is handling an event. Waiting...
Aug 14 18:42:04 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 14 18:42:04 link-02 kernel: SM: 00000004 sm_stop: SG still joined
Aug 14 18:42:04 link-02 kernel: SM: 01000009 sm_stop: SG still joined
Aug 14 18:42:04 link-02 ccsd[4922]: Cluster manager shutdown. Attemping to reconnect...
Aug 14 18:42:04 link-02 kernel: SM: 020000df sm_stop: SG still joined
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 1426694 (1%)
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Unable to convert nodeid_to_ipaddr in run_election
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Election processing failed.
Aug 14 18:42:10 link-02 kernel: dm-cmirror: process_log_request:: failed
Aug 14 18:42:26 link-02 ccsd[4922]: Unable to connect to cluster infrastructure after 30 seconds.
Aug 14 18:42:28 link-02 kernel: dlm: dlm_lock: no lockspace
Aug 14 18:42:28 link-02 kernel: hed
Aug 14 18:42:28 link-02 kernel: gfs1 move flags 1,0,0 ids 380,380,380
Aug 14 18:42:28 link-02 kernel: gfs1 move flags 0,1,0 ids 380,384,380
Aug 14 18:42:28 link-02 kernel: gfs1 move use event 384
[...]

link-07:
Aug 14 18:55:18 link-07 kernel: CMAN: node link-02 has been removed from the cluster : No response to messages
Aug 14 18:55:19 link-07 fenced[5027]: fencing deferred to link-08
Aug 14 18:56:20 link-07 kernel: dm-cmirror: A cluster mirror log member has failed.

link-08:
Aug 14 18:54:28 link-08 kernel: CMAN: removing node link-02 from the cluster : No response to messages
Aug 14 18:54:59 link-08 fenced[4988]: link-02 not a cluster member after 30 sec post_fail_delay
Aug 14 18:54:59 link-08 fenced[4988]: fencing node "link-02"
Aug 14 18:55:19 link-08 fenced[4988]: fence "link-02" success
Aug 14 18:55:24 link-08 kernel: dm-cmirror: A cluster mirror log member has failed.
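For reference, the "Disabling device sdg" steps in the harness output above appear to offline the device at the SCSI layer, judging by the "rejecting I/O to offline device" messages. A sketch of doing the same by hand on every node (the loop and qarsh usage are assumptions modeled on the lvcreate lines above):

  for node in link-02 link-07 link-08; do
      # offline sdg so that all I/O to it fails immediately
      qarsh root@$node "echo offline > /sys/block/sdg/device/state"
  done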
Appear to have hit this bug again:

Aug 30 15:39:58 link-02 kernel: CMAN: Being told to leave the cluster by node 2
Aug 30 15:39:58 link-02 kernel: CMAN: we are leaving the cluster.
Aug 30 15:39:58 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 30 15:39:58 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 30 15:39:58 link-02 kernel: SM: 00000002 sm_stop: SG still joined
Aug 30 15:39:58 link-02 kernel: SM: 01000005 sm_stop: SG still joined
Aug 30 15:39:58 link-02 kernel: SM: 0200000b sm_stop: SG still joined
Aug 30 15:39:58 link-02 kernel: device-mapper: Write error during recovery (error = 0x1)
Aug 30 15:39:58 link-02 kernel: device-mapper: recovery failed on region 3714
Aug 30 15:39:58 link-02 kernel: dm-cmirror: No address list available for 1
Aug 30 15:39:58 link-02 kernel:
Aug 30 15:39:58 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Aug 30 15:39:58 link-02 kernel: dm-cmirror: LRT_ELECTION(9): (8yr9JHeP)
Aug 30 15:39:58 link-02 kernel: dm-cmirror: starter : 2
Aug 30 15:39:58 link-02 kernel: dm-cmirror: co-ordinator: 2
Aug 30 15:39:58 link-02 kernel: dm-cmirror: node_count : 1
Aug 30 15:39:58 link-02 kernel: dm-cmirror: Election processing failed.
Aug 30 15:39:58 link-02 kernel: dm-cmirror: process_log_request:: failed
[...]
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 1696295054
Aug 30 15:39:58 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 1696295062
Aug 30 15:39:58 link-02 kernel: device-mapper: Write error during recovery (error = 0x1)
Aug 30 15:39:58 link-02 kernel: device-mapper: recovery failed on region 7396
Aug 30 15:39:58 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 25166081
Aug 30 15:39:58 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 385
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 190305 (1%)
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Last request:
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - my_id :: 1
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - server :: 1
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - log uuid:: Hti6Joon (active)
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - request :: LRT_COMPLETE_RESYNC_WORK
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - error :: -6
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Aug 30 15:40:03 link-02 kernel: dm-cmirror: server_id=dead, server_valid=1, Hti6Joon
Aug 30 15:40:03 link-02 kernel: dm-cmirror: trigger = LRT_COMPLETE_RESYNC_WORK
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Unable to convert nodeid_to_ipaddr in run_election

Aug 30 15:55:41 link-07 kernel: dm-cmirror: Failed to receive election results from server: (8yr9JHeP/1, -110)
Aug 30 15:55:41 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, 8yr9JHeP
Aug 30 15:55:41 link-07 kernel: dm-cmirror: trigger = LRT_CLEAR_REGION
Aug 30 15:55:41 link-07 kernel: dm-cmirror: LRT_ELECTION(9): (8yr9JHeP)
Aug 30 15:55:41 link-07 kernel: dm-cmirror: starter : 2
Aug 30 15:55:41 link-07 kernel: dm-cmirror: co-ordinator: 2
Aug 30 15:55:41 link-07 kernel: dm-cmirror: node_count : 0
Aug 30 15:55:53 link-07 fenced[3373]: link-02 not a cluster member after 30 sec post_fail_delay
Aug 30 15:55:53 link-07 fenced[3373]: fencing node "link-02"
Aug 30 15:56:01 link-07 kernel: dm-cmirror: Failed to receive election results from server: (8yr9JHeP/1, -110)
Aug 30 15:56:01 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, 8yr9JHeP
Aug 30 15:56:01 link-07 kernel: dm-cmirror: trigger = LRT_CLEAR_REGION
Aug 30 15:56:01 link-07 kernel: dm-cmirror: LRT_ELECTION(9): (8yr9JHeP)
Aug 30 15:56:01 link-07 kernel: dm-cmirror: starter : 2
Aug 30 15:56:01 link-07 kernel: dm-cmirror: co-ordinator: 2
Aug 30 15:56:01 link-07 kernel: dm-cmirror: node_count : 0
Aug 30 15:56:13 link-07 fenced[3373]: fence "link-02" success
Aug 30 15:56:20 link-07 kernel: cdrom: open failed.

These nodes were running:
2.6.9-56.ELsmp
cmirror-kernel-2.6.9-34.1
lvm2-2.02.27-2.el4
lvm2-cluster-2.02.27-2.el4
device-mapper-1.02.21-1.el4
Appear to have hit this while running cmirror_lock_stress and failing a mirror leg. It was on a three-node cluster; two of the nodes ended up being told to leave, so quorum was lost.
There may be two different issues in this bug. Focusing on

  Couldn't find device with uuid 'ogdhvU-2v5N-lxmG-YK1w-bB4R-TY3J-v91XI9'.
  Couldn't find all physical volumes for volume group vg.

from comment 0 leads me to believe the cluster had a different view of the available devices (similar to issues seen in bug 239482 and others). However, the rest of the comments suggest a different culprit. I'm betting the problems in comments 1-13 have been fixed. I'm putting this bug into modified.

assigned -> modified
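If the different-view theory comes up again, one quick check is to compare what each node can actually see before blaming cmirror. A sketch (qarsh as used elsewhere in this bug; plain ssh would work the same way):

  # differing PV/UUID lists across nodes would support the theory that
  # the cluster had an inconsistent view of the available devices
  for node in link-02 link-04 link-07 link-08; do
      echo "== $node =="
      qarsh root@$node "pvs --noheadings -o pv_name,pv_uuid,vg_name"
  done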
Marking this bz verified. Like Jon said, there may be two bugs here, and the issues in the earlier comments appear to be fixed. The issues in the later comments may be similar to bz 239482; if they are seen again, I'll open a new bug.
Fixed in current release (4.7).