Bug 1454699
| Summary: | LVM resource agent does not detect multipath with all paths failed | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | michal novacek <mnovacek> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Priority: | unspecified |
| Version: | 7.4 | CC: | agk, cfeist, cluster-maint, cmarthal, fdinitto, rbednar, sbradley, tlavigne |
| Target Milestone: | rc | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Fixed In Version: | resource-agents-3.9.5-101.el7 | Doc Type: | If docs needed, set a value |
| Last Closed: | 2017-08-01 15:00:11 UTC | Type: | Bug |
Description (michal novacek, 2017-05-23 11:09:11 UTC)
This doesn't require multipath. If you fail all non-mpath devices in an LV, you'll get the same result.

```
[root@host-113 ~]# pcs status
Cluster name: STSRHTS8303
Stack: corosync
Current DC: host-114 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Wed May 24 11:27:38 2017
Last change: Tue May 23 17:16:46 2017 by root via cibadmin on host-113

3 nodes configured
5 resources configured

Online: [ host-113 host-114 host-115 ]

Full list of resources:

 fence-host-113 (stonith:fence_xvm):    Started host-113
 fence-host-114 (stonith:fence_xvm):    Started host-114
 fence-host-115 (stonith:fence_xvm):    Started host-115
 Resource Group: HA_LVM
     lvm        (ocf::heartbeat:LVM):           Started host-113
     fs1        (ocf::heartbeat:Filesystem):    Started host-113

[root@host-113 ~]# lvs -a -o +devices
  LV             VG          Attr       LSize Cpy%Sync Devices
  ha1            STSRHTS8303 Rwi-aor--- 8.00g 100.00   ha1_rimage_0(0),ha1_rimage_1(0)
  [ha1_rimage_0] STSRHTS8303 iwi-aor--- 8.00g          /dev/sdh1(1)
  [ha1_rimage_1] STSRHTS8303 iwi-aor--- 8.00g          /dev/sdg1(1)
  [ha1_rmeta_0]  STSRHTS8303 ewi-aor--- 4.00m          /dev/sdh1(0)
  [ha1_rmeta_1]  STSRHTS8303 ewi-aor--- 4.00m          /dev/sdg1(0)

# Fail all HA LVM PV paths
[root@host-113 ~]# echo offline > /sys/block/sdh/device/state
[root@host-113 ~]# echo offline > /sys/block/sdg/device/state
```
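As an aside (not part of the original report): for repeated testing it can be convenient to wrap those sysfs writes in small helpers. A minimal sketch; `fail_path` and `restore_path` are ad-hoc names, not part of any tool:

```bash
#!/bin/bash
# Helpers for simulating path loss on a SCSI device, as done above.

fail_path() {    # take the device offline so all further I/O to it errors out
    echo offline > "/sys/block/$1/device/state"
}

restore_path() { # bring the device back online
    echo running > "/sys/block/$1/device/state"
}

# Example: fail both legs of the mirrored LV, then restore them later.
fail_path sdg
fail_path sdh
# ...observe the cluster's reaction...
restore_path sdg
restore_path sdh
```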
```
[root@host-113 ~]# pvscan --cache
  /dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 0: Input/output error
  /dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 8589869056: Input/output error
  /dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 8589926400: Input/output error
  /dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 4096: Input/output error
  /dev/sdg1: read failed after 0 of 1024 at 22545367040: Input/output error
  [...]
  /dev/sdh1: read failed after 0 of 2048 at 0: Input/output error

[root@host-113 ~]# dd if=/dev/zero of=/dev/STSRHTS8303/ha1 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000399487 s, 1.3 MB/s
```

Nothing is done by the resource agent; LVM attempts to repair the LV locally, but with no paths to the remaining LV storage that is impossible:

```
[root@host-113 ~]# pcs status
Cluster name: STSRHTS8303
Stack: corosync
Current DC: host-114 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Wed May 24 11:28:19 2017
Last change: Tue May 23 17:16:46 2017 by root via cibadmin on host-113

3 nodes configured
5 resources configured

Online: [ host-113 host-114 host-115 ]

Full list of resources:

 fence-host-113 (stonith:fence_xvm):    Started host-113
 fence-host-114 (stonith:fence_xvm):    Started host-114
 fence-host-115 (stonith:fence_xvm):    Started host-115
 Resource Group: HA_LVM
     lvm        (ocf::heartbeat:LVM):           Started host-113
     fs1        (ocf::heartbeat:Filesystem):    Started host-113

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@host-113 ~]# lvs -a -o +devices
  WARNING: Device for PV k3qPvD-QUnA-t6MF-KUHo-Gfld-YFAo-XlibxJ not found or rejected by a filter.
  WARNING: Device for PV XXMdw4-AzCc-5R6g-JvBf-S5zb-7MN7-onHPNF not found or rejected by a filter.
  WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rimage_0 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rmeta_0 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rimage_1 while checking used and assumed devices.
  WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rmeta_1 while checking used and assumed devices.
  LV             VG          Attr       LSize Cpy%Sync Devices
  ha1            STSRHTS8303 Rwi-aor-p- 8.00g 100.00   ha1_rimage_0(0),ha1_rimage_1(0)
  [ha1_rimage_0] STSRHTS8303 iwi-aor-p- 8.00g          [unknown](1)
  [ha1_rimage_1] STSRHTS8303 Iwi-aor-p- 8.00g          [unknown](1)
  [ha1_rmeta_0]  STSRHTS8303 ewi-aor-p- 4.00m          [unknown](0)
  [ha1_rmeta_1]  STSRHTS8303 ewi-aor-p- 4.00m          [unknown](0)
```

The path to the storage still exists on the other machines in the cluster:

```
[root@host-114 ~]# lvs -a -o +devices
  LV             VG          Attr       LSize Cpy%Sync Devices
  ha1            STSRHTS8303 Rwi---r--- 8.00g          ha1_rimage_0(0),ha1_rimage_1(0)
  [ha1_rimage_0] STSRHTS8303 Iwi---r--- 8.00g          /dev/sdg1(1)
  [ha1_rimage_1] STSRHTS8303 Iwi---r--- 8.00g          /dev/sdh1(1)
  [ha1_rmeta_0]  STSRHTS8303 ewi---r--- 4.00m          /dev/sdg1(0)
  [ha1_rmeta_1]  STSRHTS8303 ewi---r--- 4.00m          /dev/sdh1(0)
```

Meanwhile the filesystem is still mounted on host-113, but I/O to it fails:

```
[root@host-113 ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/STSRHTS8303-ha1  7.8G   36M  7.3G   1% /mnt/ha1
[root@host-113 ~]# dd if=/dev/zero of=/mnt/ha1/ddfile count=1
dd: failed to open ‘/mnt/ha1/ddfile’: Input/output error
```

Attempt to manually move the resource:

```
[root@host-113 ~]# pcs resource move HA_LVM host-114
```

Apparently the resource ended up on host-115 instead:

```
[root@host-114 ~]# pcs status
Cluster name: STSRHTS8303
Stack: corosync
Current DC: host-114 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Wed May 24 11:34:00 2017
Last change: Wed May 24 11:33:42 2017 by root via crm_resource on host-113

3 nodes configured
5 resources configured

Online: [ host-113 host-114 host-115 ]

Full list of resources:

 fence-host-113 (stonith:fence_xvm):    Started host-113
 fence-host-114 (stonith:fence_xvm):    Started host-114
 fence-host-115 (stonith:fence_xvm):    Started host-115
 Resource Group: HA_LVM
     lvm        (ocf::heartbeat:LVM):           Started host-115
     fs1        (ocf::heartbeat:Filesystem):    Started host-115

Failed Actions:
* lvm_start_0 on host-113 'unknown error' (1): call=65, status=complete,
    exitreason='Volume group [STSRHTS8303] has devices missing. Consider partial_activation=true to attempt to activate partially',
    last-rc-change='Wed May 24 11:33:43 2017', queued=0ms, exec=285ms
* lvm_start_0 on host-114 'unknown error' (1): call=52, status=complete,
    exitreason='Volume group [STSRHTS8303] does not exist or contains error! WARNING: Missing device /dev/sdg1 reappeared, updating metadata f',
    last-rc-change='Wed May 24 11:33:43 2017', queued=0ms, exec=216ms

[root@host-115 ~]# lvs -a -o +devices
  LV             VG          Attr       LSize Cpy%Sync Devices
  ha1            STSRHTS8303 Rwi-aor--- 8.00g 100.00   ha1_rimage_0(0),ha1_rimage_1(0)
  [ha1_rimage_0] STSRHTS8303 iwi-aor--- 8.00g          /dev/sdh1(1)
  [ha1_rimage_1] STSRHTS8303 iwi-aor--- 8.00g          /dev/sdg1(1)
  [ha1_rmeta_0]  STSRHTS8303 ewi-aor--- 4.00m          /dev/sdh1(0)
  [ha1_rmeta_1]  STSRHTS8303 ewi-aor--- 4.00m          /dev/sdg1(0)
[root@host-115 ~]# df
Filesystem                  1K-blocks  Used Available Use% Mounted on
/dev/mapper/STSRHTS8303-ha1   8125880 36852   7653216   1% /mnt/ha1
```

This gets tricky when only a subset of the storage fails: LVM locally should first be responsible for attempting to repair the volume metadata before any resource relocation happens. Only if all paths to the storage are determined to have failed should the resource agent attempt a relocation to another node.
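For illustration, a minimal sketch of the kind of check being asked for, built on the `vgs` reporting fields `pv_count` and `vg_missing_pv_count`; this is an assumption-laden sketch, not the code that went into the fixed agent:

```bash
#!/bin/bash
# Illustrative monitor-style check, not the actual resource-agent code:
# distinguish "all PVs of the VG are missing" (fail, so pacemaker relocates)
# from "only some PVs are missing" (leave local LVM repair to act first).
VG="$1"

total=$(vgs --noheadings -o pv_count "$VG" 2>/dev/null | tr -d '[:space:]')
missing=$(vgs --noheadings -o vg_missing_pv_count "$VG" 2>/dev/null | tr -d '[:space:]')
: "${missing:=0}"   # older lvm2 without this field: assume nothing missing

if [ -z "$total" ]; then
    echo "volume group $VG is not readable at all" >&2
    exit 1                      # metadata unreadable: treat as failure
elif [ "$missing" = "$total" ]; then
    echo "all $total PVs of $VG are missing" >&2
    exit 1                      # all paths failed: fail so the group moves
elif [ "$missing" != "0" ]; then
    echo "$missing of $total PVs missing; leaving repair to LVM" >&2
fi
exit 0
```

Run as, say, `./check_vg.sh STSRHTS8303`: exit status 0 means leave the resource in place, non-zero means report a failure to pacemaker.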
Marking verified. The LVM resource agent now detects device failures and moves the resource properly.

```
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 13:54:49 2017
Last change: Thu Jun 29 13:44:05 2017 by root via cibadmin on virt-157

2 nodes configured
4 resources configured

Online: [ virt-156 virt-157 ]

Full list of resources:

 fence-virt-156 (stonith:fence_xvm):    Started virt-156
 fence-virt-157 (stonith:fence_xvm):    Started virt-157
 Resource Group: HA_LVM
     my_vg      (ocf::heartbeat:LVM):           Started virt-156
     my_fs      (ocf::heartbeat:Filesystem):    Started virt-156

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@virt-156 ~]# lvs -o +devices
  LV   VG            Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
  root rhel_virt-156 -wi-ao----  <6.70g                                                   /dev/vda2(219)
  swap rhel_virt-156 -wi-ao---- 876.00m                                                   /dev/vda2(0)
  lv   vg            -wi-ao----   1.00g                                                   /dev/sda(0)
  lv   vg            -wi-ao----   1.00g                                                   /dev/sdb(0)

## fail lvm paths
[root@virt-156 ~]# echo offline > /sys/block/sda/device/state
[root@virt-156 ~]# echo offline > /sys/block/sdb/device/state
[root@virt-156 ~]# dd if=/dev/zero of=/dev/vg/lv count=1
dd: writing to ‘/dev/vg/lv’: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00236177 s, 0.0 kB/s

[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 13:55:44 2017
Last change: Thu Jun 29 13:44:05 2017 by root via cibadmin on virt-157

2 nodes configured
4 resources configured

Online: [ virt-156 virt-157 ]

Full list of resources:

 fence-virt-156 (stonith:fence_xvm):    Started virt-156
 fence-virt-157 (stonith:fence_xvm):    Started virt-157
 Resource Group: HA_LVM
     my_vg      (ocf::heartbeat:LVM):           Started virt-157
     my_fs      (ocf::heartbeat:Filesystem):    Started virt-157

Failed Actions:
* my_vg_start_0 on virt-156 'unknown error' (1): call=181, status=complete,
    exitreason='Volume group [vg] does not exist or contains error! /dev/vg/lv: read failed after 0 of 4096 at 0: Input/output error',
    last-rc-change='Thu Jun 29 13:55:06 2017', queued=0ms, exec=149ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

## resource group has been relocated after failure
## resume devices, reset failcount and see if the resource gets back
[root@virt-156 ~]# echo running > /sys/block/sda/device/state
[root@virt-156 ~]# echo running > /sys/block/sdb/device/state
[root@virt-156 ~]# pcs resource cleanup
Waiting for 1 replies from the CRMd. OK

[root@virt-156 ~]# pcs resource failcount show my_vg
No failcounts for my_vg

[root@virt-156 ~]# pcs constraint
Location Constraints:
  Resource: HA_LVM
    Enabled on: virt-156 (score:INFINITY) (role: Started)
Ordering Constraints:
  start my_vg then start my_fs (kind:Mandatory)
Colocation Constraints:
Ticket Constraints:
```
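One detail worth noting in the `pcs constraint` output above: the INFINITY location constraint was presumably left behind by an earlier `pcs resource move`, since moving a resource works by inserting such a constraint, and it persists after the move completes. If the group should again be free to run on any node, the constraint has to be removed explicitly; a hedged example, not part of the original transcript:

```
[root@virt-156 ~]# pcs resource clear HA_LVM
```

The verification transcript continues below.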
```
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 14:30:35 2017
Last change: Thu Jun 29 14:30:10 2017 by root via cibadmin on virt-156

2 nodes configured
4 resources configured

Online: [ virt-156 virt-157 ]

Full list of resources:

 fence-virt-156 (stonith:fence_xvm):    Started virt-156
 fence-virt-157 (stonith:fence_xvm):    Started virt-157
 Resource Group: HA_LVM
     my_vg      (ocf::heartbeat:LVM):           Started virt-156
     my_fs      (ocf::heartbeat:Filesystem):    Started virt-156

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@virt-156 ~]# lvs -o +devices
  LV   VG            Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
  root rhel_virt-156 -wi-ao----  <6.70g                                                   /dev/vda2(219)
  swap rhel_virt-156 -wi-ao---- 876.00m                                                   /dev/vda2(0)
  lv   vg            -wi-ao----   1.00g                                                   /dev/sda(0)
  lv   vg            -wi-ao----   1.00g                                                   /dev/sdb(0)

## try moving to other node
[root@virt-156 ~]# pcs resource move HA_LVM virt-157
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 14:34:11 2017
Last change: Thu Jun 29 14:34:08 2017 by root via crm_resource on virt-156

2 nodes configured
4 resources configured

Online: [ virt-156 virt-157 ]

Full list of resources:

 fence-virt-156 (stonith:fence_xvm):    Started virt-156
 fence-virt-157 (stonith:fence_xvm):    Started virt-157
 Resource Group: HA_LVM
     my_vg      (ocf::heartbeat:LVM):           Started virt-157
     my_fs      (ocf::heartbeat:Filesystem):    Started virt-157

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```

Verified with:

```
resource-agents-3.9.5-105.el7.x86_64
pcs-0.9.158-6.el7.x86_64
pacemaker-1.1.16-12.el7.x86_64
kernel-3.10.0-689.el7.x86_64
corosync-2.4.0-9.el7.x86_64
lvm2-2.02.171-8.el7.x86_64
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1844