Bug 692186
| Field | Value |
|---|---|
| Summary | HA LVM mirror repair can print worrisome messages on device failure + service relocation |
| Product | Red Hat Enterprise Linux 6 |
| Reporter | Corey Marthaler <cmarthal> |
| Component | lvm2 |
| Assignee | Jonathan Earl Brassow <jbrassow> |
| lvm2 sub component | Default / Unclassified (RHEL6) |
| QA Contact | Corey Marthaler <cmarthal> |
| Status | CLOSED ERRATA |
| Severity | low |
| Priority | low |
| CC | agk, cfeist, cluster-maint, coughlan, dwysocha, fdinitto, heinzm, jbrassow, jharriga, lhh, msnitzer, nperic, prajnoha, prockai, slevine, thornber, zkabelac |
| Version | 6.1 |
| Target Milestone | rc |
| Hardware | x86_64 |
| OS | Linux |
| Fixed In Version | lvm2-2.02.107-1.el6 |
| Doc Type | Bug Fix |
| Doc Text | No Documentation needed - message clean-up |
| Clones | 730141, 773423 (view as bug list) |
| Last Closed | 2014-10-14 08:22:45 UTC |
| Bug Blocks | 1075263 |
Description (Corey Marthaler, 2011-03-30 16:16:11 UTC)
> taft-02 rgmanager[2450]: Service service:halvm is stopped
> taft-02 lvm[3417]: ABORTING: Mirror percentage check failed.
> taft-02 lvm[3417]: Repair of mirrored LV TAFT/ha1 failed.
> taft-02 lvm[3417]: Failed to remove faulty devices in TAFT-ha1.
> taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address
Hm. What's going on here? halvm stopped, so it should not have LVs active now?
Or the relocation happens while repair is in progress?
I am not sure whether this is a halvm bug or an lvm2 bug; assigning to Jon and component resource-agents. Please switch it back if it is a core lvm problem.
The test causes the service relocation to happen just after the device failure occurs, so there isn't always time for the repair to finish before the relocation takes place. This is reproducible.

[...]

Disabling device sdb on taft-03
Attempting I/O to cause mirror down conversion(s) on taft-03
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.9235e-05 s, 10.4 MB/s

RELOCATING halvm FROM taft-03 TO taft-01

Mar 30 16:42:00 taft-03 lvm[1203]: Trying to up-convert to 2 images, 1 logs.
Mar 30 16:42:02 taft-03 lvm[1203]: Monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:02 taft-03 lvm[1203]: Another thread is handling an event. Waiting...
Mar 30 16:42:02 taft-03 lvm[1203]: 1 missing and now unallocated Physical Volumes removed from VG.
Mar 30 16:42:02 taft-03 lvm[1203]: TAFT/ha1: Converted: 0.0%
Mar 30 16:42:04 taft-03 rgmanager[2303]: Stopping service service:halvm
Mar 30 16:42:04 taft-03 rgmanager[8853]: unmounting /mnt/fs1
Mar 30 16:42:07 taft-03 rgmanager[8910]: Deactivating TAFT/ha1
Mar 30 16:42:07 taft-03 rgmanager[8932]: Making resilient : lvchange -an TAFT/ha1
Mar 30 16:42:08 taft-03 rgmanager[8957]: Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdc1|","a|/dev/sdd1|","a|/dev/sde1|",
Mar 30 16:42:08 taft-03 rgmanager[8990]: Removing ownership tag (taft-03) from TAFT/ha1
Mar 30 16:42:09 taft-03 rgmanager[2303]: Service service:halvm is stopped
Mar 30 16:42:17 taft-03 lvm[1203]: ABORTING: Mirror percentage check failed.
Mar 30 16:42:17 taft-03 lvm[1203]: Repair of mirrored LV TAFT/ha1 failed.
Mar 30 16:42:17 taft-03 lvm[1203]: Failed to remove faulty devices in TAFT-ha1.
Mar 30 16:42:17 taft-03 lvm[1203]: dm_task_run failed, errno = 6, No such device or address
Mar 30 16:42:17 taft-03 lvm[1203]: dm_task_run failed, errno = 6, No such device or address
Mar 30 16:42:17 taft-03 lvm[1203]: TAFT-ha1 disappeared, detaching
Mar 30 16:42:17 taft-03 lvm[1203]: TAFT-ha1 disappeared, detaching
Mar 30 16:42:17 taft-03 lvm[1203]: No longer monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:17 taft-03 lvm[1203]: No longer monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:12 taft-01 rgmanager[2263]: Starting stopped service service:halvm
Mar 30 16:42:15 taft-01 rgmanager[3268]: Activating TAFT/ha1
Mar 30 16:42:15 taft-01 rgmanager[3331]: Making resilient : lvchange -ay TAFT/ha1
Mar 30 16:42:16 taft-01 rgmanager[3356]: Resilient command: lvchange -ay TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
Mar 30 16:42:16 taft-01 dmeventd[3404]: dmeventd ready for processing.
Mar 30 16:42:16 taft-01 lvm[3404]: Monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:17 taft-01 rgmanager[3494]: mounting /dev/dm-6 on /mnt/fs1
Mar 30 16:42:17 taft-01 rgmanager[3516]: mount -t ext3 /dev/dm-6 /mnt/fs1
Mar 30 16:42:17 taft-01 rgmanager[2263]: Service service:halvm started
Mar 30 16:53:01 taft-01 lvm[3404]: TAFT-ha1 is now in-sync.

There is no failure here, just scary messages. Here's what's happening:

1) Test kills one of the mirror legs
2) Mirror is down-converted to linear to remove the faulty device
3) Mirror is up-converted to replace the faulty device (this causes a new log device too)
4) Poll daemon is launched to track the progress of the up-convert
5) Test forces relocation of the service
6) Poll daemon fails to track sync progress because the mirror has moved
7) Mirror starts safely and cleanly on the alternate machine
*) Scary messages remain on the original node

This analysis is verifiably correct.
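The race in steps 4 through 6 can be sketched in C. This is a simplified model rather than the actual lvm2 source: `lv_state` and `poll_mirror_percent_old` are made-up stand-ins, and the daemon's real state comes from device-mapper queries, not a struct.

```c
#include <stdio.h>
#include <stdbool.h>

/* Simplified model of the poll daemon's view of the mirror LV. */
typedef struct {
    bool active;    /* is the LV still active on this node? */
    double percent; /* sync progress, meaningful only while active */
} lv_state;

/* Pre-fix polling step (sketch): the daemon only sees that the
 * percentage query failed, so a relocated (deactivated) LV is
 * reported as a hard failure, producing the scary ABORTING line. */
static int poll_mirror_percent_old(const lv_state *lv, const char *name)
{
    if (!lv->active) {
        fprintf(stderr, "ABORTING: Mirror percentage check failed.\n");
        return -1;  /* mistaken for a repair failure */
    }
    printf("%s: Converted: %.1f%%\n", name, lv->percent);
    return 0;
}
```

Once the service relocates, the next poll tick hits the `!lv->active` branch, which matches the message sequence in the taft-03 log above.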
Simply try the following:

terminal1~> lvcreate -L 1G -n lv vg
terminal1~> lvconvert -m1 vg/lv
terminal2~> lvchange -an vg/lv

[root@bp-01 ~]# lvcreate -L 1G -n lv vg
  Logical volume "lv" created
[root@bp-01 ~]# lvconvert -m1 vg/lv
  vg/lv: Converted: 0.0%
  ABORTING: Mirror percentage check failed.

I'm pushing this out to rhel6.3. We may have to refile against LVM or find a different bug to describe this problem in.

(In reply to comment #6)
> There is no failure here, just scary messages.

So should there be a 6.2 technical note, or an update to the manual, that explains these messages and how to determine they are not serious?

I'll try to make sure the poll daemon either shuts down when an LV is deactivated or checks whether the LV is still active before doing the percentage check... This is a fantastically low priority bug though. I'll leave in Cond NACK - design.

Fix committed upstream:

commit a20de8af208725ff2054209aa7ff7f41cc770db6
Author: Jonathan Brassow <jbrassow>
Date: Mon Jun 16 18:56:32 2014 -0500

    poll_daemon: Cleanly exit polling if the LV is no longer active

    If we are polling an LV due to some sort of conversion and it becomes
    inactive, a rather worrisome message is produced, e.g.:
    "ABORTING: Mirror percentage check failed."
    We can cleanly exit if we do a simple check to see if the LV is active
    before performing the check. This eliminates the scary message.

When a convert gets interrupted by deactivation, you will see:

[root@bp-01 lvm2]# lvcreate -L 50G -n lv vg
  Logical volume "lv" created
[root@bp-01 lvm2]# lvconvert --type mirror -m 1 vg/lv
  vg/lv: Converted: 0.0%
  vg/lv: Interrupted: No longer active.

When a pvmove gets interrupted by deactivation, you will see:

[root@bp-01 lvm2]# pvmove /dev/sdb1 /dev/sdc1
  /dev/sdb1: Moved: 0.0%
  /dev/sdb1: Interrupted: No longer active.
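A minimal sketch of the behavior the commit describes, again with made-up names (`lv_state`, `poll_mirror_percent_fixed`) rather than the real lvm2 code: check that the LV is still active before the percentage check, and end polling quietly if it is not.

```c
#include <stdio.h>
#include <stdbool.h>

/* Simplified model of the poll daemon's view of the LV. */
typedef struct {
    bool active;    /* is the LV still active on this node? */
    double percent; /* sync progress, meaningful only while active */
} lv_state;

/* Post-fix polling step (sketch): an inactive LV ends polling
 * cleanly with an informational "Interrupted" message instead of
 * the misleading ABORTING error. */
static int poll_mirror_percent_fixed(const lv_state *lv, const char *name)
{
    if (!lv->active) {
        printf("%s: Interrupted: No longer active.\n", name);
        return 1;   /* stop polling; this is not an error */
    }
    printf("%s: Converted: %.1f%%\n", name, lv->percent);
    return 0;
}
```

The design point is that deactivation during a conversion or pvmove is a normal, expected event (e.g. a service relocation), so it is distinguished from a genuine query failure before any error path runs.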
To test, do:

terminal1~> lvcreate -L 1G -n lv vg
terminal1~> lvconvert -m1 vg/lv
terminal2~> lvchange -an vg/lv

This should NOT produce the old "scary" message: "ABORTING: Mirror percentage check failed."

The fix is upstream and already in the LVM package: commit a20de8af208725ff2054209aa7ff7f41cc770db6 ("poll_daemon: Cleanly exit polling if the LV is no longer active", quoted in full in the comment above).

[root@virt-064 ~]# lvcreate -L1G -n lv1 cluster
  Logical volume "lv1" created
.
.
[root@virt-065 ~]# lvconvert -m1 cluster/lv1
  cluster/lv1: Converted: 0.8%
.
.
[root@virt-064 ~]# lvchange -an cluster/lv1
[root@virt-064 ~]# lvs
  LV       VG         Attr       LSize   Pool Origin Data% Meta% Move Log         Cpy%Sync Convert
  linear   cluster    -wi-a----- 100.00m
  lv1      cluster    mwi---m---   1.00g                             lv1_mlog
  mirror   cluster    mwi-a-m--- 500.00m                             mirror_mlog 100.00
  lv_root  vg_virt064 -wi-ao----   6.71g
  lv_swap  vg_virt064 -wi-ao---- 816.00m
[root@virt-065 ~]# lvconvert -m1 cluster/lv1
  cluster/lv1: Converted: 0.8%
  cluster/lv1: Interrupted: No longer active.
[root@virt-065 ~]#

Marking VERIFIED with:

lvm2-2.02.107-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
lvm2-libs-2.02.107-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
lvm2-cluster-2.02.107-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
udev-147-2.56.el6 BUILT: Fri Jul 11 16:53:07 CEST 2014
device-mapper-1.02.86-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-libs-1.02.86-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-event-1.02.86-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-event-libs-1.02.86-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6 BUILT: Fri Apr 4 15:43:06 CEST 2014
cmirror-2.02.107-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html