Bug 241422
| Summary: | cmirror/clvmd issues when leg fails on subset of cluster | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cmirror | Assignee: | LVM and device-mapper development team <lvm-team> |
| Status: | CLOSED WONTFIX | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4 | CC: | agk, bstevens, ccaulfie, coughlan, dwysocha, jbrassow, kawasaki, mbroz, prockai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2010-05-07 20:42:38 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Corey Marthaler 2007-05-25 21:16:21 UTC
More info...

link-04 (node that still sees the leg):

    [...]
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 492/SjTXMEG6
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 493/SjTXMEG6
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 494/SjTXMEG6
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 495/SjTXMEG6
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 496/SjTXMEG6
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 825/SjTXMEG6
    May 25 16:13:22 link-04 kernel: dm-cmirror: Recovery blocked by outstanding write on region 826/SjTXMEG6

link-02 (node that doesn't see the leg):

    May 25 15:29:42 link-02 qarshd[6394]: Running cmdline: echo offline > /sys/block/sda/device/state
    May 25 15:29:42 link-02 qarshd[6394]: That's enough
    scsi0 (0:1): rejecting I/O to offline device
    May 25 15:29:51 link-02 kernel: scsi0 (0:1): rejecting I/O to offline device
    May 25 15:29:51 link-02 kernel: dm-cmirror: LOG INFO:
    May 25 15:29:51 link-02 kernel: dm-cmirror:   uuid: LVM-ZcfTPEokTadP8VK8Czcm4aEia6yh6BUpdesI0PhGLu3eiY9jf0xaqHf0SjTXMEG6
    May 25 15:29:51 link-02 kernel: dm-cmirror:   uuid_ref    : 1
    May 25 15:29:51 link-02 kernel: dm-cmirror:  ?region_count: 1600
    May 25 15:29:51 link-02 kernel: dm-cmirror:  ?sync_count  : 0
    May 25 15:29:51 link-02 kernel: dm-cmirror:  ?sync_search : 0
    May 25 15:29:51 link-02 kernel: dm-cmirror:   in_sync     : YES
    May 25 15:29:51 link-02 kernel: dm-cmirror:   suspended   : NO
    May 25 15:29:51 link-02 kernel: dm-cmirror:   server_id   : 2
    May 25 15:29:51 link-02 kernel: dm-cmirror:   server_valid: YES
    May 25 15:29:51 link-02 lvm[5480]: No longer monitoring mirror device helter_skelter-fail_primary_synced_2_legs for events
    May 25 15:29:51 link-02 lvm[5480]: Unlocking memory
    May 25 15:29:51 link-02 lvm[5480]: memlock_count dec to 0
    May 25 15:29:51 link-02 lvm[5480]: Dumping persistent device cache to /etc/lvm/.cache
    May 25 15:29:51 link-02 lvm[5480]: Locking /etc/lvm/.cache (F_WRLCK, 1)
    May 25 15:29:51 link-02 lvm[5480]: Unlocking fd 8
    May 25 15:29:51 link-02 lvm[5480]: Wiping internal VG cache
    May 25 15:29:51 link-02 kernel: dm-cmirror: Performing flush to work around bug 235040
    May 25 15:29:51 link-02 kernel: dm-cmirror: Log flush complete
    May 25 15:30:11 link-02 kernel: dm-cmirror: LRT_MASTER_LEAVING(13): (SjTXMEG6)
    May 25 15:30:11 link-02 kernel: dm-cmirror:   starter     : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror:   co-ordinator: 0
    May 25 15:30:11 link-02 kernel: dm-cmirror:   node_count  : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror: LRT_ELECTION(10): (SjTXMEG6)
    May 25 15:30:11 link-02 kernel: dm-cmirror:   starter     : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror:   co-ordinator: 57005
    May 25 15:30:11 link-02 kernel: dm-cmirror:   node_count  : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror: LRT_SELECTION(11): (SjTXMEG6)
    May 25 15:30:11 link-02 kernel: dm-cmirror:   starter     : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror:   co-ordinator: 1
    May 25 15:30:11 link-02 kernel: dm-cmirror:   node_count  : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror: LRT_MASTER_ASSIGN(12): (SjTXMEG6)
    May 25 15:30:11 link-02 kernel: dm-cmirror:   starter     : 2
    May 25 15:30:11 link-02 kernel: dm-cmirror:   co-ordinator: 1
    May 25 15:30:11 link-02 kernel: dm-cmirror:   node_count  : 1
    May 25 15:30:12 link-02 kernel: dm-cmirror: LRT_ELECTION(10): (SjTXMEG6)
    May 25 15:30:12 link-02 kernel: dm-cmirror:   starter     : 3
    May 25 15:30:12 link-02 kernel: dm-cmirror:   co-ordinator: 3
    May 25 15:30:12 link-02 kernel: dm-cmirror:   node_count  : 1
    scsi0 (0:1): rejecting I/O to offline device
    May 25 16:04:39 link-02 kernel: scsi0 (0:1): rejecting I/O to offline device
    May 25 16:04:39 link-02 kernel: dm-cmirror: server_id=dead, server_valid=1, SjTXMEG6
    May 25 16:04:39 link-02 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
    May 25 16:04:39 link-02 kernel: dm-cmirror: LRT_ELECTION(10): (SjTXMEG6)
    May 25 16:04:39 link-02 kernel: dm-cmirror:   starter     : 4
    May 25 16:04:39 link-02 kernel: dm-cmirror:   co-ordinator: 4
    May 25 16:04:39 link-02 kernel: dm-cmirror:   node_count  : 0
    scsi0 (0:1): rejecting I/O to offline device
    scsi0 (0:1): rejecting I/O to offline device
    May 25 16:12:40 link-02 kernel: scsi0 (0:1): rejecting I/O to offline device

Bug 249092 is related to this bug.

This defect is getting a lot of attention from our customers. This is a fairly typical scenario when bridging storage arrays across two datacenters.

Does fencing one of the nodes in the cluster allow normal operations to resume?

Setting flags to get this into 4.7; we would like a solution much sooner.

Is this the equivalent of split-brain mode from a storage perspective? This can get more complicated when one subset of nodes sees one device fail and another subset sees a different device fail.

I tried this with a 3-node cluster: I failed the primary leg on two nodes (including the mirror master) and failed the secondary leg on another. The down-conversion failed, so I fenced the third node so there would be a consistent storage view. The I/O attempts to that mirror remained deadlocked, however.

    mirror            test Mwi-so 10.00G mirror_mlog  0.00 mirror_mimage_0(0),mirror_mimage_1(0)
    [mirror_mimage_0] test iwi-so 10.00G
    [mirror_mimage_1] test iwi-so 10.00G /dev/sdb1(0)
    [mirror_mlog]     test lwi-so  4.00M /dev/sdc1(0)

    [root@link-02 ~]# dmsetup ls --tree
    test-mirror (253:5)
     ├─test-mirror_mimage_1 (253:4)
     │  └─ (8:17)
     ├─test-mirror_mimage_0 (253:3)
     │  └─ (8:1)
     └─test-mirror_mlog (253:2)
        └─ (8:33)

When I tried the downconvert by hand, it failed because the mirror was already "consistent":

    [root@link-02 ~]# vgreduce --config devices{ignore_suspended_devices=1} --removemissing test
      /dev/sda1: read failed after 0 of 512 at 145661362176: Input/output error
      /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
      Volume group "test" is already consistent
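The reproduction scattered through the logs and comments above (offline one mirror leg on only a subset of nodes, drive I/O, then attempt a manual down-convert) can be sketched as a command sequence. This is a hedged sketch, not the exact QA harness: it assumes a cman/clvmd/cmirror cluster, and the device name (`/dev/sda`), VG (`test`), and LV (`mirror`) are taken from the logs above for illustration only. These commands require a real cluster and root access; do not run them on a machine whose storage you care about.

```shell
# --- Run on ONE node only, so that node loses the leg while the
# --- other cluster members still see it (the split view this bug is about).

# Mark the SCSI device backing the primary mirror leg offline via sysfs.
# This is the same failure injection qarshd logged above.
echo offline > /sys/block/sda/device/state

# Generate writes against the mirror LV to trigger the failure path
# (direct I/O so the page cache does not hide the error).
dd if=/dev/zero of=/dev/test/mirror bs=4k count=100 oflag=direct

# Inspect the mirror's device-mapper stack, as done in the comments above.
dmsetup ls --tree

# Attempted manual repair from the comments above: down-convert by
# removing the missing PV. In this bug it has no effect, reporting
#   Volume group "test" is already consistent
# because this node's view of the failed device disagrees with the
# rest of the cluster.
vgreduce --config 'devices { ignore_suspended_devices = 1 }' --removemissing test
```

To undo the injected failure after a test run, the device can be brought back with `echo running > /sys/block/sda/device/state` followed by a rescan; whether the mirror then resyncs cleanly is exactly what this bug calls into question.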