Bug 225312
| Summary: | dead cluster mirror leg appears to cause a node to get stuck and be fenced | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4 | CC: | agk, cfeist, dwysocha, jbrassow, mbroz, prockai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-08-05 21:44:00 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler
2007-01-29 21:40:30 UTC
Root cause of this may also be causing 223865...

FYI - I reproduced this today.

With 45.EL?

2.6.9-43.ELsmp
RHEL4-U5-re20070129.nightly

54.EL was built 30-Jan-2007 19:30.

Did you have any GFS I/O happening?

There was no GFS I/O happening in this case. I have also now reproduced this with GFS I/O running to the cmirrors. This should be a blocker.

[...]
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: fatal: filesystem consistency error
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: function = find_good_lh
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: file = /builddir/build/BUILD/gfs-kernel-2.6.9-67/smp/src/gfs/recovery.c, line = 179
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: time = 1170858357
Feb 7 08:25:57 link-02 lvm[2692]: WARNING: dev_open(/dev/ram11) called while suspended
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: about to withdraw from the cluster
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: waiting for outstanding I/O
Feb 7 08:25:57 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: telling LM to withdraw
Feb 7 08:25:57 link-02 lvm[2692]: WARNING: dev_open(/dev/ram11) called while suspended
Feb 7 08:25:57 link-02 lvm[2692]: WARNING: dev_open(/dev/ram12) called while suspended
[...]
Feb 7 08:25:58 link-02 lvm[2692]: WARNING: dev_open(/dev/sdh2) called while suspended
Feb 7 08:25:58 link-02 lvm[2692]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Feb 7 08:25:58 link-02 lvm[2692]: Failed to remove faulty devices in vg-mirror
lock_dlm: withdraw abandoned memory
GFS: fsid=LINK_128:gfs1.0: withdrawn
GFS: fsid=LINK_128:gfs1.0: jid=1: Failed
Feb 7 08:26:05 link-02 kernel: lock_dlm: withdraw abandoned memory
Feb 7 08:26:05 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: withdrawn
Feb 7 08:26:05 link-02 kernel: GFS: fsid=LINK_128:gfs1.0: jid=1: Failed
Feb 7 08:26:12 link-02 kernel: CMAN: node link-08 has been removed from the cluster : Shutdown
Feb 7 08:26:44 link-02 fenced[3349]: link-08 not a cluster member after 30 sec post_fail_delay
Feb 7 08:26:44 link-02 fenced[3349]: fencing node "link-08"
Feb 7 08:27:05 link-02 fenced[3349]: fence "link-08" success
Feb 7 08:27:11 link-02 clurgmgrd[3777]: <info> Magma Event: Membership Change
Feb 7 08:27:11 link-02 clurgmgrd[3777]: <info> State change: link-08 DOWN
Feb 7 08:29:20 link-02 kernel: CMAN: node link-08 rejoining

I should have fixed cmirror's interactions with cman such that it fixes this bug. However, be careful when testing the fix. If you kill the primary device while the mirror is not in sync, it will return I/O errors (which it is supposed to do), resulting in a GFS withdrawal.

Reproduced this bug today while running mirror failure testing...
================================================================================
Scenario: Kill secondary leg of non synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:    fail_secondary_non_synced_3_legs_2_mirrors
* sync:    0
* mirrors: 2
* disklog: 1
* failpv:  /dev/sdg1
* legs:    3
* pvs:     /dev/sdf1 /dev/sdg1 /dev/sde1 /dev/sda1
************************************************

Creating mirror(s) on link-02...
qarsh root@link-02 lvcreate -m 2 -n fail_secondary_non_synced_3_legs_2_mirrors_1 -L 800M helter_skelter /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sda1:0-150
qarsh root@link-02 lvcreate -m 2 -n fail_secondary_non_synced_3_legs_2_mirrors_2 -L 800M helter_skelter /dev/sdf1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sda1:0-150

mirror(s) currently at 0.00% synced right now

Creating gfs on top of mirror(s) on link-02...
Mounting mirror gfs filesystems on link-02...
Mounting mirror gfs filesystems on link-07...
Mounting mirror gfs filesystems on link-08...

Writing files to mirror(s) on...
link-02:
checkit starting with: CREATE
Num files:          100
Random Seed:        22014
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_1
Working dir:        /mnt/fail_secondary_non_synced_3_legs_2_mirrors_1
checkit starting with: CREATE
Num files:          100
Random Seed:        22021
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_2
Working dir:        /mnt/fail_secondary_non_synced_3_legs_2_mirrors_2
link-07:
checkit starting with: CREATE
Num files:          100
Random Seed:        20593
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_1
Working dir:        /mnt/fail_secondary_non_synced_3_legs_2_mirrors_1
checkit starting with: CREATE
Num files:          100
Random Seed:        20595
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_2
Working dir:        /mnt/fail_secondary_non_synced_3_legs_2_mirrors_2
link-08:
checkit starting with: CREATE
Num files:          100
Random Seed:        21565
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_1
Working dir:        /mnt/fail_secondary_non_synced_3_legs_2_mirrors_1
checkit starting with: CREATE
Num files:          100
Random Seed:        21567
Verify XIOR Stream: /tmp/checkit_fail_secondary_non_synced_3_legs_2_mirrors_2
Working dir:        /mnt/fail_secondary_non_synced_3_legs_2_mirrors_2

Disabling device sdg on link-02
Disabling device sdg on link-07
Disabling device sdg on link-08

Attempting I/O to cause mirror down conversion(s) on link-02
10+0 records in
10+0 records out

Didn't receive heartbeat for 120 seconds

Verifying the down conversion from mirror(s) to linear(s)
Could not connect to remote host
down conversion didn't appear to work as a simple lvs fails

link-02 was the node to get fenced:

Aug 14 18:42:03 link-02 lvm[7190]: Device failure in helter_skelter-fail_secondary_non_synced_3_legs_2_mirrors_2
Aug 14 18:42:03 link-02 lvm[7190]: WARNING: dev_open(/etc/lvm/lvm.conf) called while suspended
Aug 14 18:42:03 link-02 kernel: scsi0 (0:7): rejecting I/O to offline device
Aug 14 18:42:03 link-02 last message repeated 1742 times
Aug 14 18:42:03 link-02 kernel: CMAN: node link-02 has been removed from the cluster : No response to messages
Aug 14 18:42:03 link-02 kernel: CMAN: killed by NODEDOWN message
Aug 14 18:42:03 link-02 kernel: CMAN: we are leaving the cluster. No response to messages
Aug 14 18:42:03 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 14 18:42:03 link-02 kernel: scsi0 (0:7): rejecting I/O to offline device
Aug 14 18:42:03 link-02 last message repeated 21 times
Aug 14 18:42:04 link-02 lvm[7190]: Another thread is handling an event. Waiting...
Aug 14 18:42:04 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 14 18:42:04 link-02 kernel: SM: 00000004 sm_stop: SG still joined
Aug 14 18:42:04 link-02 kernel: SM: 01000009 sm_stop: SG still joined
Aug 14 18:42:04 link-02 ccsd[4922]: Cluster manager shutdown. Attemping to reconnect...
Aug 14 18:42:04 link-02 kernel: SM: 020000df sm_stop: SG still joined
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 1426694 (1%)
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Unable to convert nodeid_to_ipaddr in run_election
Aug 14 18:42:10 link-02 kernel: dm-cmirror: Election processing failed.
Aug 14 18:42:10 link-02 kernel: dm-cmirror: process_log_request:: failed
Aug 14 18:42:26 link-02 ccsd[4922]: Unable to connect to cluster infrastructure after 30 seconds.
Aug 14 18:42:28 link-02 kernel: dlm: dlm_lock: no lockspace
Aug 14 18:42:28 link-02 kernel: hed
Aug 14 18:42:28 link-02 kernel: gfs1 move flags 1,0,0 ids 380,380,380
Aug 14 18:42:28 link-02 kernel: gfs1 move flags 0,1,0 ids 380,384,380
Aug 14 18:42:28 link-02 kernel: gfs1 move use event 384
[...]

link-07:
Aug 14 18:55:18 link-07 kernel: CMAN: node link-02 has been removed from the cluster : No response to messages
Aug 14 18:55:19 link-07 fenced[5027]: fencing deferred to link-08
Aug 14 18:56:20 link-07 kernel: dm-cmirror: A cluster mirror log member has failed.

link-08:
Aug 14 18:54:28 link-08 kernel: CMAN: removing node link-02 from the cluster : No response to messages
Aug 14 18:54:59 link-08 fenced[4988]: link-02 not a cluster member after 30 sec post_fail_delay
Aug 14 18:54:59 link-08 fenced[4988]: fencing node "link-02"
Aug 14 18:55:19 link-08 fenced[4988]: fence "link-02" success
Aug 14 18:55:24 link-08 kernel: dm-cmirror: A cluster mirror log member has failed.

Appear to have hit this bug again:

Aug 30 15:39:58 link-02 kernel: CMAN: Being told to leave the cluster by node 2
Aug 30 15:39:58 link-02 kernel: CMAN: we are leaving the cluster.
Aug 30 15:39:58 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 30 15:39:58 link-02 kernel: WARNING: dlm_emergency_shutdown
Aug 30 15:39:58 link-02 kernel: SM: 00000002 sm_stop: SG still joined
Aug 30 15:39:58 link-02 kernel: SM: 01000005 sm_stop: SG still joined
Aug 30 15:39:58 link-02 kernel: SM: 0200000b sm_stop: SG still joined
Aug 30 15:39:58 link-02 kernel: device-mapper: Write error during recovery (error = 0x1)
Aug 30 15:39:58 link-02 kernel: device-mapper: recovery failed on region 3714
Aug 30 15:39:58 link-02 kernel: dm-cmirror: No address list available for 1
Aug 30 15:39:58 link-02 kernel:
Aug 30 15:39:58 link-02 kernel: dm-cmirror: Failed to convert IP address to nodeid.
Aug 30 15:39:58 link-02 kernel: dm-cmirror: LRT_ELECTION(9): (8yr9JHeP)
Aug 30 15:39:58 link-02 kernel: dm-cmirror: starter     : 2
Aug 30 15:39:58 link-02 kernel: dm-cmirror: co-ordinator: 2
Aug 30 15:39:58 link-02 kernel: dm-cmirror: node_count  : 1
Aug 30 15:39:58 link-02 kernel: dm-cmirror: Election processing failed.
Aug 30 15:39:58 link-02 kernel: dm-cmirror: process_log_request:: failed
[...]
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 1696295054
Aug 30 15:39:58 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 1696295062
Aug 30 15:39:58 link-02 kernel: device-mapper: Write error during recovery (error = 0x1)
Aug 30 15:39:58 link-02 kernel: device-mapper: recovery failed on region 7396
Aug 30 15:39:58 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 25166081
Aug 30 15:39:58 link-02 kernel: SCSI error : <0 0 1 1> return code = 0x10000
Aug 30 15:39:58 link-02 kernel: end_request: I/O error, dev sdh, sector 385
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Clustered mirror retried requests :: 32 of 190305 (1%)
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Last request:
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - my_id   :: 1
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - server  :: 1
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - log uuid:: Hti6Joon (active)
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - request :: LRT_COMPLETE_RESYNC_WORK
Aug 30 15:40:03 link-02 kernel: dm-cmirror: - error   :: -6
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Too many retries, attempting to re-establish server connection.
Aug 30 15:40:03 link-02 kernel: dm-cmirror: server_id=dead, server_valid=1, Hti6Joon
Aug 30 15:40:03 link-02 kernel: dm-cmirror: trigger = LRT_COMPLETE_RESYNC_WORK
Aug 30 15:40:03 link-02 kernel: dm-cmirror: Unable to convert nodeid_to_ipaddr in run_election

Aug 30 15:55:41 link-07 kernel: dm-cmirror: Failed to receive election results from server: (8yr9JHeP/1, -110)
Aug 30 15:55:41 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, 8yr9JHeP
Aug 30 15:55:41 link-07 kernel: dm-cmirror: trigger = LRT_CLEAR_REGION
Aug 30 15:55:41 link-07 kernel: dm-cmirror: LRT_ELECTION(9): (8yr9JHeP)
Aug 30 15:55:41 link-07 kernel: dm-cmirror: starter     : 2
Aug 30 15:55:41 link-07 kernel: dm-cmirror: co-ordinator: 2
Aug 30 15:55:41 link-07 kernel: dm-cmirror: node_count  : 0
Aug 30 15:55:53 link-07 fenced[3373]: link-02 not a cluster member after 30 sec post_fail_delay
Aug 30 15:55:53 link-07 fenced[3373]: fencing node "link-02"
Aug 30 15:56:01 link-07 kernel: dm-cmirror: Failed to receive election results from server: (8yr9JHeP/1, -110)
Aug 30 15:56:01 link-07 kernel: dm-cmirror: server_id=dead, server_valid=1, 8yr9JHeP
Aug 30 15:56:01 link-07 kernel: dm-cmirror: trigger = LRT_CLEAR_REGION
Aug 30 15:56:01 link-07 kernel: dm-cmirror: LRT_ELECTION(9): (8yr9JHeP)
Aug 30 15:56:01 link-07 kernel: dm-cmirror: starter     : 2
Aug 30 15:56:01 link-07 kernel: dm-cmirror: co-ordinator: 2
Aug 30 15:56:01 link-07 kernel: dm-cmirror: node_count  : 0
Aug 30 15:56:13 link-07 fenced[3373]: fence "link-02" success
Aug 30 15:56:20 link-07 kernel: cdrom: open failed.

These nodes were running:
2.6.9-56.ELsmp
cmirror-kernel-2.6.9-34.1
lvm2-2.02.27-2.el4
lvm2-cluster-2.02.27-2.el4
device-mapper-1.02.21-1.el4

Appear to have hit this while running cmirror_lock_stress and failing a mirror leg. It was on a three node cluster, and two of the nodes ended up being told to leave, so quorum was lost. There may be two different issues in this bug.
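The caution earlier in this bug (failing the primary device while the mirror is not in sync returns I/O errors by design, triggering a GFS withdrawal) suggests checking the sync state before choosing which leg to kill. A minimal sketch of such a pre-check; the `mirror_in_sync` helper is illustrative, and the percentage is hard-coded here (on a live node it would come from `lvs --noheadings -o copy_percent vg/lv`) so the snippet stands alone:

```shell
#!/bin/sh
# Sketch: decide whether failing the primary leg is "safe" for the test.
# Assumption: $pct_raw holds a copy_percent value as printed by lvs;
# we hard-code the 0.00% seen in the scenario above instead of running lvs.

mirror_in_sync() {
    # strip the column padding lvs adds around values
    pct=$(printf '%s' "$1" | tr -d ' ')
    [ "$pct" = "100.00" ]
}

pct_raw="  0.00"
if mirror_in_sync "$pct_raw"; then
    echo "mirror in sync: a primary-leg failure should be handled cleanly"
else
    echo "mirror NOT in sync: a primary-leg failure will return I/O errors"
fi
```

With the 0.00% sample above, this reports the "NOT in sync" case, i.e. the situation in which an I/O error from the failed primary is expected behavior rather than this bug.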
Focusing on

Couldn't find device with uuid 'ogdhvU-2v5N-lxmG-YK1w-bB4R-TY3J-v91XI9'.
Couldn't find all physical volumes for volume group vg.

from comment 0 leads me to believe the cluster had a different view of the available devices (similar to issues seen in 239482 and others). However, the rest of the comments suggest a different culprit. I'm betting the problems in comments 1-13 have been fixed. I'm putting this bug into modified.

assigned -> modified

Marking this bz verified. Like Jon said, there may be two bugs here, and the issues in the earlier comments appear to be fixed. The issues in the later comments may be similar to bz 239482, and if seen again, I'll open a new bug.

Fixed in current release (4.7).
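The failed verification step in the test run ("down conversion didn't appear to work as a simple lvs fails") amounts to checking whether the LV is still a mirror or has been down-converted to linear. A hedged sketch of that check, assuming the LVM2 `lv_attr` convention (first character `m`/`M` for a mirrored LV, `-` for a plain linear LV); the sample strings stand in for live `lvs --noheadings -o lv_attr` output so the snippet is self-contained:

```shell
#!/bin/sh
# Sketch: classify an LV from its lv_attr string.
# Assumption: the first attr character distinguishes mirror ('m'/'M')
# from linear ('-'); the hard-coded samples mimic lvs column output.

lv_type() {
    attr=$(printf '%s' "$1" | tr -d ' ')
    case "$attr" in
        [mM]*) echo mirror ;;   # still mirrored: down-conversion did not happen
        -*)    echo linear ;;   # down-converted to linear after the leg failure
        *)     echo other ;;
    esac
}

lv_type "  mwi-ao"   # before the leg failure -> mirror
lv_type "  -wi-ao"   # after a successful down-conversion -> linear
```

In the failure above the check never got this far, because `lvs` itself hung once the node lost its connection to the cluster infrastructure.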