Bug 1454699
| Summary: | LVM resource agent does not detect multipath with all paths failed | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | michal novacek <mnovacek> |
| Component: | resource-agents | Assignee: | Oyvind Albrigtsen <oalbrigt> |
| Status: | CLOSED ERRATA | QA Contact: | cluster-qe <cluster-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | agk, cfeist, cluster-maint, cmarthal, fdinitto, rbednar, sbradley, tlavigne |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | resource-agents-3.9.5-101.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 15:00:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
This doesn't require multipath. If you fail all non-mpath devices backing an LV, you'll get the same result.
[root@host-113 ~]# pcs status
Cluster name: STSRHTS8303
Stack: corosync
Current DC: host-114 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Wed May 24 11:27:38 2017
Last change: Tue May 23 17:16:46 2017 by root via cibadmin on host-113
3 nodes configured
5 resources configured
Online: [ host-113 host-114 host-115 ]
Full list of resources:
fence-host-113 (stonith:fence_xvm): Started host-113
fence-host-114 (stonith:fence_xvm): Started host-114
fence-host-115 (stonith:fence_xvm): Started host-115
Resource Group: HA_LVM
lvm (ocf::heartbeat:LVM): Started host-113
fs1 (ocf::heartbeat:Filesystem): Started host-113
[root@host-113 ~]# lvs -a -o +devices
LV VG Attr LSize Cpy%Sync Devices
ha1 STSRHTS8303 Rwi-aor--- 8.00g 100.00 ha1_rimage_0(0),ha1_rimage_1(0)
[ha1_rimage_0] STSRHTS8303 iwi-aor--- 8.00g /dev/sdh1(1)
[ha1_rimage_1] STSRHTS8303 iwi-aor--- 8.00g /dev/sdg1(1)
[ha1_rmeta_0] STSRHTS8303 ewi-aor--- 4.00m /dev/sdh1(0)
[ha1_rmeta_1] STSRHTS8303 ewi-aor--- 4.00m /dev/sdg1(0)
# Fail all HA lvm PV paths
[root@host-113 ~]# echo offline > /sys/block/sdh/device/state
[root@host-113 ~]# echo offline > /sys/block/sdg/device/state
[root@host-113 ~]# pvscan --cache
/dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 0: Input/output error
/dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 8589869056: Input/output error
/dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 8589926400: Input/output error
/dev/STSRHTS8303/ha1: read failed after 0 of 4096 at 4096: Input/output error
/dev/sdg1: read failed after 0 of 1024 at 22545367040: Input/output error
[...]
/dev/sdh1: read failed after 0 of 2048 at 0: Input/output error
[root@host-113 ~]# dd if=/dev/zero of=/dev/STSRHTS8303/ha1 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.000399487 s, 1.3 MB/s
# the resource agent does nothing; LVM attempts to repair the LV locally, but with no paths to the remaining LV storage that is impossible
[root@host-113 ~]# pcs status
Cluster name: STSRHTS8303
Stack: corosync
Current DC: host-114 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Wed May 24 11:28:19 2017
Last change: Tue May 23 17:16:46 2017 by root via cibadmin on host-113
3 nodes configured
5 resources configured
Online: [ host-113 host-114 host-115 ]
Full list of resources:
fence-host-113 (stonith:fence_xvm): Started host-113
fence-host-114 (stonith:fence_xvm): Started host-114
fence-host-115 (stonith:fence_xvm): Started host-115
Resource Group: HA_LVM
lvm (ocf::heartbeat:LVM): Started host-113
fs1 (ocf::heartbeat:Filesystem): Started host-113
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
[root@host-113 ~]# lvs -a -o +devices
WARNING: Device for PV k3qPvD-QUnA-t6MF-KUHo-Gfld-YFAo-XlibxJ not found or rejected by a filter.
WARNING: Device for PV XXMdw4-AzCc-5R6g-JvBf-S5zb-7MN7-onHPNF not found or rejected by a filter.
WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rimage_0 while checking used and assumed devices.
WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rmeta_0 while checking used and assumed devices.
WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rimage_1 while checking used and assumed devices.
WARNING: Couldn't find all devices for LV STSRHTS8303/ha1_rmeta_1 while checking used and assumed devices.
LV VG Attr LSize Cpy%Sync Devices
ha1 STSRHTS8303 Rwi-aor-p- 8.00g 100.00 ha1_rimage_0(0),ha1_rimage_1(0)
[ha1_rimage_0] STSRHTS8303 iwi-aor-p- 8.00g [unknown](1)
[ha1_rimage_1] STSRHTS8303 Iwi-aor-p- 8.00g [unknown](1)
[ha1_rmeta_0] STSRHTS8303 ewi-aor-p- 4.00m [unknown](0)
[ha1_rmeta_1] STSRHTS8303 ewi-aor-p- 4.00m [unknown](0)
# path to storage exists on other machines in cluster
[root@host-114 ~]# lvs -a -o +devices
LV VG Attr LSize Cpy%Sync Devices
ha1 STSRHTS8303 Rwi---r--- 8.00g ha1_rimage_0(0),ha1_rimage_1(0)
[ha1_rimage_0] STSRHTS8303 Iwi---r--- 8.00g /dev/sdg1(1)
[ha1_rimage_1] STSRHTS8303 Iwi---r--- 8.00g /dev/sdh1(1)
[ha1_rmeta_0] STSRHTS8303 ewi---r--- 4.00m /dev/sdg1(0)
[ha1_rmeta_1] STSRHTS8303 ewi---r--- 4.00m /dev/sdh1(0)
[root@host-113 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/STSRHTS8303-ha1 7.8G 36M 7.3G 1% /mnt/ha1
[root@host-113 ~]# dd if=/dev/zero of=/mnt/ha1/ddfile count=1
dd: failed to open ‘/mnt/ha1/ddfile’: Input/output error
# attempt to manually move
[root@host-113 ~]# pcs resource move HA_LVM host-114
# The move targeted host-114, but the resource ended up on host-115 instead; the start failed on both host-113 and host-114 (see Failed Actions below)
[root@host-114 ~]# pcs status
Cluster name: STSRHTS8303
Stack: corosync
Current DC: host-114 (version 1.1.16-9.el7-94ff4df) - partition with quorum
Last updated: Wed May 24 11:34:00 2017
Last change: Wed May 24 11:33:42 2017 by root via crm_resource on host-113
3 nodes configured
5 resources configured
Online: [ host-113 host-114 host-115 ]
Full list of resources:
fence-host-113 (stonith:fence_xvm): Started host-113
fence-host-114 (stonith:fence_xvm): Started host-114
fence-host-115 (stonith:fence_xvm): Started host-115
Resource Group: HA_LVM
lvm (ocf::heartbeat:LVM): Started host-115
fs1 (ocf::heartbeat:Filesystem): Started host-115
Failed Actions:
* lvm_start_0 on host-113 'unknown error' (1): call=65, status=complete, exitreason='Volume group [STSRHTS8303] has devices missing. Consider partial_activation=true to attempt to activate partially',
last-rc-change='Wed May 24 11:33:43 2017', queued=0ms, exec=285ms
* lvm_start_0 on host-114 'unknown error' (1): call=52, status=complete, exitreason='Volume group [STSRHTS8303] does not exist or contains error! WARNING: Missing device /dev/sdg1 reappeared, updating metadata f',
last-rc-change='Wed May 24 11:33:43 2017', queued=0ms, exec=216ms
[root@host-115 ~]# lvs -a -o +devices
LV VG Attr LSize Cpy%Sync Devices
ha1 STSRHTS8303 Rwi-aor--- 8.00g 100.00 ha1_rimage_0(0),ha1_rimage_1(0)
[ha1_rimage_0] STSRHTS8303 iwi-aor--- 8.00g /dev/sdh1(1)
[ha1_rimage_1] STSRHTS8303 iwi-aor--- 8.00g /dev/sdg1(1)
[ha1_rmeta_0] STSRHTS8303 ewi-aor--- 4.00m /dev/sdh1(0)
[ha1_rmeta_1] STSRHTS8303 ewi-aor--- 4.00m /dev/sdg1(0)
[root@host-115 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/STSRHTS8303-ha1 8125880 36852 7653216 1% /mnt/ha1
This gets tricky when only a subset of the storage fails, since LVM on the local node should first be responsible for attempting to repair the volume metadata before any resource relocation happens. Only if all paths to the storage are determined to have failed should the resource agent attempt a relocation to another node.
Marking verified. The LVM resource agent now detects device failures and moves the resource properly.
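For illustration only, here is a minimal sketch of the kind of all-paths-dead check described above; it is an assumption-laden sketch, not the code shipped in resource-agents-3.9.5-101.el7. It probes each active LV of the monitored VG with a small O_DIRECT read, which fails with an I/O error once every underlying path is gone (exactly what the dd test earlier in this report showed). The OCF return codes are hard-coded only to keep the snippet self-contained.
# Sketch only: not the shipped LVM agent code.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

LVM_paths_alive() {
    local vg="$1" lv lvs_out

    lvs_out=$(lvs --noheadings -o lv_name,lv_attr "$vg" 2>/dev/null)
    # VG metadata not readable at all: treat as failed
    [ -n "$lvs_out" ] || return $OCF_ERR_GENERIC

    # 5th lv_attr character is 'a' for an active LV
    for lv in $(echo "$lvs_out" | awk '$2 ~ /^....a/ {print $1}'); do
        # a direct read of one block bypasses the page cache, so it fails
        # immediately when no path to the underlying storage remains
        if ! dd if="/dev/$vg/$lv" of=/dev/null bs=4096 count=1 \
                iflag=direct >/dev/null 2>&1; then
            return $OCF_ERR_GENERIC
        fi
    done
    return $OCF_SUCCESS
}

# example, using the VG name "vg" from the verification run below
LVM_paths_alive vg || echo "no usable path to VG 'vg'; monitor should fail"
The verification transcript below was produced with resource-agents-3.9.5-105.el7 (see the package list at its end):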
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 13:54:49 2017
Last change: Thu Jun 29 13:44:05 2017 by root via cibadmin on virt-157
2 nodes configured
4 resources configured
Online: [ virt-156 virt-157 ]
Full list of resources:
fence-virt-156 (stonith:fence_xvm): Started virt-156
fence-virt-157 (stonith:fence_xvm): Started virt-157
Resource Group: HA_LVM
my_vg (ocf::heartbeat:LVM): Started virt-156
my_fs (ocf::heartbeat:Filesystem): Started virt-156
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@virt-156 ~]# lvs -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
root rhel_virt-156 -wi-ao---- <6.70g /dev/vda2(219)
swap rhel_virt-156 -wi-ao---- 876.00m /dev/vda2(0)
lv vg -wi-ao---- 1.00g /dev/sda(0)
lv vg -wi-ao---- 1.00g /dev/sdb(0)
## fail lvm paths
[root@virt-156 ~]# echo offline > /sys/block/sda/device/state
[root@virt-156 ~]# echo offline > /sys/block/sdb/device/state
[root@virt-156 ~]# dd if=/dev/zero of=/dev/vg/lv count=1
dd: writing to ‘/dev/vg/lv’: Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00236177 s, 0.0 kB/s
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 13:55:44 2017
Last change: Thu Jun 29 13:44:05 2017 by root via cibadmin on virt-157
2 nodes configured
4 resources configured
Online: [ virt-156 virt-157 ]
Full list of resources:
fence-virt-156 (stonith:fence_xvm): Started virt-156
fence-virt-157 (stonith:fence_xvm): Started virt-157
Resource Group: HA_LVM
my_vg (ocf::heartbeat:LVM): Started virt-157
my_fs (ocf::heartbeat:Filesystem): Started virt-157
Failed Actions:
* my_vg_start_0 on virt-156 'unknown error' (1): call=181, status=complete, exitreason='Volume group [vg] does not exist or contains error! /dev/vg/lv: read failed after 0 of 4096 at 0: Input/output error',
last-rc-change='Thu Jun 29 13:55:06 2017', queued=0ms, exec=149ms
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
## resource group has been relocated after failure
## resume the devices, reset the failcount, and check whether the resource moves back
[root@virt-156 ~]# echo running > /sys/block/sda/device/state
[root@virt-156 ~]# echo running > /sys/block/sdb/device/state
[root@virt-156 ~]# pcs resource cleanup
Waiting for 1 replies from the CRMd. OK
[root@virt-156 ~]# pcs resource failcount show my_vg
No failcounts for my_vg
[root@virt-156 ~]# pcs constraint
Location Constraints:
Resource: HA_LVM
Enabled on: virt-156 (score:INFINITY) (role: Started)
Ordering Constraints:
start my_vg then start my_fs (kind:Mandatory)
Colocation Constraints:
Ticket Constraints:
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 14:30:35 2017
Last change: Thu Jun 29 14:30:10 2017 by root via cibadmin on virt-156
2 nodes configured
4 resources configured
Online: [ virt-156 virt-157 ]
Full list of resources:
fence-virt-156 (stonith:fence_xvm): Started virt-156
fence-virt-157 (stonith:fence_xvm): Started virt-157
Resource Group: HA_LVM
my_vg (ocf::heartbeat:LVM): Started virt-156
my_fs (ocf::heartbeat:Filesystem): Started virt-156
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root@virt-156 ~]# lvs -o +devices
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
root rhel_virt-156 -wi-ao---- <6.70g /dev/vda2(219)
swap rhel_virt-156 -wi-ao---- 876.00m /dev/vda2(0)
lv vg -wi-ao---- 1.00g /dev/sda(0)
lv vg -wi-ao---- 1.00g /dev/sdb(0)
## try moving to other node
[root@virt-156 ~]# pcs resource move HA_LVM virt-157
[root@virt-156 ~]# pcs status
Cluster name: STSRHTS25405
Stack: corosync
Current DC: virt-156 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Thu Jun 29 14:34:11 2017
Last change: Thu Jun 29 14:34:08 2017 by root via crm_resource on virt-156
2 nodes configured
4 resources configured
Online: [ virt-156 virt-157 ]
Full list of resources:
fence-virt-156 (stonith:fence_xvm): Started virt-156
fence-virt-157 (stonith:fence_xvm): Started virt-157
Resource Group: HA_LVM
my_vg (ocf::heartbeat:LVM): Started virt-157
my_fs (ocf::heartbeat:Filesystem): Started virt-157
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
===========================================
resource-agents-3.9.5-105.el7.x86_64
pcs-0.9.158-6.el7.x86_64
pacemaker-1.1.16-12.el7.x86_64
kernel-3.10.0-689.el7.x86_64
corosync-2.4.0-9.el7.x86_64
lvm2-2.02.171-8.el7.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1844
Description of problem:
When a volume group sits on a multipath PV, the fact that this PV has all of its paths dead (I/O errors on the device) is not correctly detected.

Version-Release number of selected component (if applicable):
resource-agents-3.9.5-99

How reproducible: always

Steps to Reproduce:
1/ configure an LVM resource with a VG on a multipath device
2/ fail all disks of the multipath device

Actual results: the resource agent does not detect the problem

Expected results: the resource agent fails and moves the resource

Additional info:
[root@virt-196 ~]# multipath -ll
mpatha (1beaker-disk-61358-2) dm-2 IET ,VIRTUAL-DISK
size=5.0G features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| `- 5:0:0:1 sdc 8:32 failed faulty offline
`-+- policy='service-time 0' prio=0 status=enabled
  `- 4:0:0:1 sdd 8:48 failed faulty offline
[root@virt-196 ~]# pvs
/dev/mapper/mpatha: read failed after 0 of 4096 at 0: Input/output error
/dev/mapper/mpatha: read failed after 0 of 4096 at 5368643584: Input/output error
/dev/mapper/mpatha: read failed after 0 of 4096 at 5368700928: Input/output error
/dev/mapper/mpatha: read failed after 0 of 4096 at 4096: Input/output error
/dev/shared/shared0: read failed after 0 of 4096 at 0: Input/output error
/dev/shared/shared0: read failed after 0 of 4096 at 5364449280: Input/output error
/dev/shared/shared0: read failed after 0 of 4096 at 5364506624: Input/output error
/dev/shared/shared0: read failed after 0 of 4096 at 4096: Input/output error
PV VG Fmt Attr PSize PFree
/dev/mapper/mpatha shared lvm2 a-- <5.00g 0
/dev/vda2 rhel_virt-196 lvm2 a-- <7.56g 4.00m
[root@virt-196 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
root rhel_virt-196 -wi-ao---- <6.70g
swap rhel_virt-196 -wi-ao---- 876.00m
shared0 shared -wi-a----- <5.00g
[root@virt-196 ~]# pcs resource show --full
Resource: havg (class=ocf provider=heartbeat type=LVM)
Attributes: exclusive=true partial_activation=false volgrpname=shared
Operations: monitor interval=10 timeout=30 (havg-monitor-interval-10)
            start interval=0s timeout=30 (havg-start-interval-0s)
            stop interval=0s timeout=30 (havg-stop-interval-0s)
[root@virt-196 ~]# pcs resource
havg (ocf::heartbeat:LVM): Started virt-196
[root@virt-196 ~]# pcs resource debug-monitor havg
Operation monitor for havg (ocf:heartbeat:LVM) returned 0
> stdout: volume_list="rhel_virt-196"
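To make the reproduction steps concrete, here is a hedged sketch using the names from this report (resource havg, VG shared on /dev/mapper/mpatha, multipath slave disks sdc and sdd); device and resource names will differ on other setups, and the exact pcs invocation is an assumption reconstructed from the resource attributes shown above.
# 1/ configure the LVM resource with the VG that lives on the multipath device
pcs resource create havg ocf:heartbeat:LVM volgrpname=shared exclusive=true \
    partial_activation=false op monitor interval=10 timeout=30

# 2/ fail every path of the multipath device (sdc and sdd in this report)
for dev in sdc sdd; do
    echo offline > /sys/block/$dev/device/state
done

# before the fix the monitor still returned 0 at this point; with the fixed
# agent it should report a failure so Pacemaker can relocate the resource
multipath -ll mpatha
pcs resource debug-monitor havg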