Red Hat Bugzilla – Bug 692186
HA LVM mirror repair can print worrisome messages on device failure + service relocation
Last modified: 2014-10-14 04:22:45 EDT
Description of problem:

Scenario: Kill primary leg

********* Mirror info for this scenario *********
* mirrors:            ha1
* leg devices:        /dev/sde1 /dev/sdf1
* log devices:        /dev/sdc1
* failpv(s):          /dev/sde1
* failnode(s):        taft-02
* leg fault policy:   allocate
* log fault policy:   remove
* HA mirror:          YES
* HA service name:    halvm
* HA current owner:   taft-02
*************************************************

PV=/dev/sde1
        ha1_mimage_0: 5
PV=/dev/sde1
        ha1_mimage_0: 5

Disabling device sde on taft-02

Attempting I/O to cause mirror down conversion(s) on taft-02
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.7413e-05 s, 10.8 MB/s

RELOCATING halvm FROM taft-02 TO taft-01

Verifying current sanity of lvm after the failure
Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
ORIGINAL LOGS: /dev/sdc1
ORIGINAL LOGS LEFT: /dev/sdc1
olog: 1
Verifying LOG device(s) /dev/sdc1 *ARE* in the mirror(s)
log device /dev/sdc1 should still be present on taft-01

** ARE THERE ENOUGH SPARE PVS TO ALLOCATE? **

# Old service owner: taft-02
lvm[3417]: Trying to up-convert to 2 images, 1 logs.
taft-02 lvm[3417]: Monitoring mirror device TAFT-ha1 for events.
taft-02 lvm[3417]: Another thread is handling an event. Waiting...
taft-02 lvm[3417]: 1 missing and now unallocated Physical Volumes removed from VG.
taft-02 lvm[3417]: TAFT/ha1: Converted: 0.0%
taft-02 rgmanager[2450]: Stopping service service:halvm
taft-02 rgmanager[8546]: unmounting /mnt/fs1
taft-02 rgmanager[8602]: Deactivating TAFT/ha1
taft-02 rgmanager[8624]: Making resilient : lvchange -an TAFT/ha1
taft-02 rgmanager[8649]: Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
taft-02 rgmanager[8681]: Removing ownership tag (taft-02) from TAFT/ha1
taft-02 rgmanager[2450]: Service service:halvm is stopped
taft-02 lvm[3417]: ABORTING: Mirror percentage check failed.
taft-02 lvm[3417]: Repair of mirrored LV TAFT/ha1 failed.
taft-02 lvm[3417]: Failed to remove faulty devices in TAFT-ha1.
taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address
taft-02 lvm[3417]: TAFT-ha1 disappeared, detaching
taft-02 lvm[3417]: No longer monitoring mirror device TAFT-ha1 for events.
taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address
taft-02 lvm[3417]: TAFT-ha1 disappeared, detaching
taft-02 lvm[3417]: No longer monitoring mirror device TAFT-ha1 for events.

# New service owner: taft-01
rgmanager[2462]: Starting stopped service service:halvm
taft-01 rgmanager[3595]: Activating TAFT/ha1
taft-01 rgmanager[3660]: Making resilient : lvchange -ay TAFT/ha1
taft-01 rgmanager[3685]: Resilient command: lvchange -ay TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
taft-01 lvm[1220]: Monitoring mirror device TAFT-ha1 for events.
taft-01 rgmanager[3812]: mounting /dev/dm-5 on /mnt/fs1
taft-01 rgmanager[3834]: mount -t ext3 /dev/dm-5 /mnt/fs1
taft-01 kernel: kjournald starting.  Commit interval 5 seconds
taft-01 kernel: EXT3 FS on dm-5, internal journal
taft-01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
taft-01 kernel: SELinux: initialized (dev dm-5, type ext3), uses xattr
taft-01 rgmanager[2462]: Service service:halvm started
taft-01 lvm[1220]: TAFT-ha1 is now in-sync.
taft-01 kernel: dlm: rgmanager: receive_request_reply 3ea0001 1 master diff 3 -53

Although the repair failed, everything appears fine on the new service owner (although the log device isn't the one that's expected).
So this may not be that big of an issue.

[root@taft-01 ~]# lvs -a -o +devices
  LV              VG    Attr   LSize  Log      Copy%  Devices
  ha1             TAFT  mwi-ao 25.00g ha1_mlog 100.00 ha1_mimage_0(0),ha1_mimage_1(0)
  [ha1_mimage_0]  TAFT  iwi-ao 25.00g                 /dev/sdf1(0)
  [ha1_mimage_1]  TAFT  iwi-ao 25.00g                 /dev/sdg1(0)
  [ha1_mlog]      TAFT  lwi-ao  4.00m                 /dev/sdb1(0)

[root@taft-01 ~]# clustat
Cluster Status for TAFT @ Wed Mar 30 11:12:38 2011
Member Status: Quorate

 Member Name                        ID   Status
 ------ ----                        ---- ------
 taft-01                            1    Online, Local, rgmanager
 taft-02                            2    Online, rgmanager
 taft-03                            3    Online, rgmanager
 taft-04                            4    Online, rgmanager

 Service Name                       Owner (Last)            State
 ------- ----                       ----- ------            -----
 service:halvm                      taft-01                 started

Version-Release number of selected component (if applicable):
2.6.32-94.el6.x86_64
lvm2-2.02.83-3.el6                          BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6                     BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6                  BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.31.el6                           BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-3.el6                 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6            BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6           BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6      BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6                       BUILT: Fri Mar 18 09:31:10 CDT 2011
> taft-02 rgmanager[2450]: Service service:halvm is stopped
> taft-02 lvm[3417]: ABORTING: Mirror percentage check failed.
> taft-02 lvm[3417]: Repair of mirrored LV TAFT/ha1 failed.
> taft-02 lvm[3417]: Failed to remove faulty devices in TAFT-ha1.
> taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address

Hm. What's going on here? halvm stopped, so it should not have any LVs active at this point? Or does the relocation happen while the repair is still in progress?

I am not sure whether this is an halvm or an lvm2 bug; assigning to Jon and the resource-agents component. Please switch it back if it is a core lvm problem.
The test causes the service relocation to happen just after the device failure occurs, so there isn't always time for the repair to finish before the relocation takes place. This is reproducible.

[...]
Disabling device sdb on taft-03

Attempting I/O to cause mirror down conversion(s) on taft-03
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.9235e-05 s, 10.4 MB/s

RELOCATING halvm FROM taft-03 TO taft-01

Mar 30 16:42:00 taft-03 lvm[1203]: Trying to up-convert to 2 images, 1 logs.
Mar 30 16:42:02 taft-03 lvm[1203]: Monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:02 taft-03 lvm[1203]: Another thread is handling an event. Waiting...
Mar 30 16:42:02 taft-03 lvm[1203]: 1 missing and now unallocated Physical Volumes removed from VG.
Mar 30 16:42:02 taft-03 lvm[1203]: TAFT/ha1: Converted: 0.0%
Mar 30 16:42:04 taft-03 rgmanager[2303]: Stopping service service:halvm
Mar 30 16:42:04 taft-03 rgmanager[8853]: unmounting /mnt/fs1
Mar 30 16:42:07 taft-03 rgmanager[8910]: Deactivating TAFT/ha1
Mar 30 16:42:07 taft-03 rgmanager[8932]: Making resilient : lvchange -an TAFT/ha1
Mar 30 16:42:08 taft-03 rgmanager[8957]: Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdc1|","a|/dev/sdd1|","a|/dev/sde1|",
Mar 30 16:42:08 taft-03 rgmanager[8990]: Removing ownership tag (taft-03) from TAFT/ha1
Mar 30 16:42:09 taft-03 rgmanager[2303]: Service service:halvm is stopped
Mar 30 16:42:17 taft-03 lvm[1203]: ABORTING: Mirror percentage check failed.
Mar 30 16:42:17 taft-03 lvm[1203]: Repair of mirrored LV TAFT/ha1 failed.
Mar 30 16:42:17 taft-03 lvm[1203]: Failed to remove faulty devices in TAFT-ha1.
Mar 30 16:42:17 taft-03 lvm[1203]: dm_task_run failed, errno = 6, No such device or address
Mar 30 16:42:17 taft-03 lvm[1203]: dm_task_run failed, errno = 6, No such device or address
Mar 30 16:42:17 taft-03 lvm[1203]: TAFT-ha1 disappeared, detaching
Mar 30 16:42:17 taft-03 lvm[1203]: TAFT-ha1 disappeared, detaching
Mar 30 16:42:17 taft-03 lvm[1203]: No longer monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:17 taft-03 lvm[1203]: No longer monitoring mirror device TAFT-ha1 for events.

Mar 30 16:42:12 taft-01 rgmanager[2263]: Starting stopped service service:halvm
Mar 30 16:42:15 taft-01 rgmanager[3268]: Activating TAFT/ha1
Mar 30 16:42:15 taft-01 rgmanager[3331]: Making resilient : lvchange -ay TAFT/ha1
Mar 30 16:42:16 taft-01 rgmanager[3356]: Resilient command: lvchange -ay TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
Mar 30 16:42:16 taft-01 dmeventd[3404]: dmeventd ready for processing.
Mar 30 16:42:16 taft-01 lvm[3404]: Monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:17 taft-01 rgmanager[3494]: mounting /dev/dm-6 on /mnt/fs1
Mar 30 16:42:17 taft-01 rgmanager[3516]: mount -t ext3 /dev/dm-6 /mnt/fs1
Mar 30 16:42:17 taft-01 rgmanager[2263]: Service service:halvm started
Mar 30 16:53:01 taft-01 lvm[3404]: TAFT-ha1 is now in-sync.
There is no failure here, just scary messages. Here's what's happening:
1) Test kills one of the mirror legs
2) Mirror is down-converted to linear to remove the faulty device
3) Mirror is up-converted to replace the faulty device (which causes a new log device too)
4) Poll daemon is launched to track the progress of the up-convert
5) Test forces relocation of the service
6) Poll daemon fails to track sync progress because the mirror has moved
7) Mirror starts safely and cleanly on the alternate machine
*) Scary messages remain on the original node

This analysis is verifiably correct. Simply try the following (a one-shot scripted version is sketched after this comment):

terminal1~> lvcreate -L 1G -n lv vg
terminal1~> lvconvert -m1 vg/lv
terminal2~> lvchange -an vg/lv

[root@bp-01 ~]# lvcreate -L 1G -n lv vg
  Logical volume "lv" created
[root@bp-01 ~]# lvconvert -m1 vg/lv
  vg/lv: Converted: 0.0%
  ABORTING: Mirror percentage check failed.

I'm pushing this out to rhel6.3. We may have to refile against LVM or find a different bug to describe this problem in.
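For convenience, the two-terminal repro above can be collapsed into a single script. This is only an illustrative sketch: it assumes a volume group named "vg" with at least two free PVs and room for a 1G LV, and on an unfixed lvm2 build it should end with the scary message.

#!/bin/bash
# One-shot reproducer (sketch) for the scary message on pre-fix lvm2.
# Assumption: a VG named "vg" with >= 2 free PVs and space for a 1G LV.
set -x

lvcreate -L 1G -n lv vg

# Start the up-convert; the foreground lvconvert polls sync progress.
lvconvert -m1 vg/lv &
convert_pid=$!

# Deactivate the LV while the conversion is still in flight, mimicking
# rgmanager's "lvchange -an" during service relocation.
sleep 2
lvchange -an vg/lv

# On an unfixed lvm2 the polling lvconvert now prints:
#   ABORTING: Mirror percentage check failed.
wait $convert_pid

# Clean up: reactivate and remove the test LV.
lvchange -ay vg/lv
lvremove -f vg/lv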
(In reply to comment #6)
> There is no failure here, just scary messages.

So should there be a 6.2 technical note, or an update to the manual, that explains these messages and how to determine they are not serious?
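Until such a note exists, one quick way to confirm the messages are harmless is to check the mirror on the node that now owns the service (a sketch; the VG name TAFT is taken from the logs above):

# Run on the new service owner. A benign outcome shows the mirror fully
# synced (copy_percent = 100.00) and the failed PV absent from Devices.
lvs -a -o lv_name,lv_attr,copy_percent,devices TAFT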
I'll try to make sure the poll daemon either shuts down when an LV is deactivated, or checks whether the LV is still active before performing the percentage check... This is a fantastically low-priority bug, though. I'll leave it in Cond NACK - design.
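For illustration only, here is the second option sketched in shell (the real fix would live in the lvm2 polling code; the VG/LV names below are placeholders):

#!/bin/bash
# Sketch of the proposed guard: only poll conversion progress while the
# LV is still active; otherwise stop polling quietly.
VG=vg   # placeholder
LV=lv   # placeholder

while :; do
    attr=$(lvs --noheadings -o lv_attr "$VG/$LV" 2>/dev/null | tr -d ' ')
    # The 5th character of lv_attr is 'a' while the LV is active.
    if [ "${attr:4:1}" != "a" ]; then
        echo "$VG/$LV: Interrupted: No longer active."
        break
    fi
    pct=$(lvs --noheadings -o copy_percent "$VG/$LV" | tr -d ' ')
    echo "$VG/$LV: Converted: ${pct}%"
    [ "$pct" = "100.00" ] && break
    sleep 5
done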
Fix committed upstream:

commit a20de8af208725ff2054209aa7ff7f41cc770db6
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Mon Jun 16 18:56:32 2014 -0500

    poll_daemon: Cleanly exit polling if the LV is no longer active

    If we are polling an LV due to some sort of conversion and it becomes
    inactive, a rather worrisome message is produced, e.g.:
            "  ABORTING: Mirror percentage check failed."

    We can cleanly exit if we do a simple check to see if the LV is active
    before performing the check.  This eliminates the scary message.

When a convert gets interrupted by deactivation, you will see:
[root@bp-01 lvm2]# lvcreate -L 50G -n lv vg
  Logical volume "lv" created
[root@bp-01 lvm2]# lvconvert --type mirror -m 1 vg/lv
  vg/lv: Converted: 0.0%
  vg/lv: Interrupted: No longer active.

When a pvmove gets interrupted by deactivation, you will see:
[root@bp-01 lvm2]# pvmove /dev/sdb1 /dev/sdc1
  /dev/sdb1: Moved: 0.0%
  /dev/sdb1: Interrupted: No longer active.
To test, do:

terminal1~> lvcreate -L 1G -n lv vg
terminal1~> lvconvert -m1 vg/lv
terminal2~> lvchange -an vg/lv

This should NOT produce the old "scary" message:
        "  ABORTING: Mirror percentage check failed."
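For regression testing, the steps above can be scripted with a pass/fail check on the lvconvert output. A minimal sketch, again assuming a VG named "vg" with two free PVs and room for a 1G LV:

#!/bin/bash
# Verify the fix: deactivating an LV mid-conversion must not produce
# "ABORTING: Mirror percentage check failed."
out=$(mktemp)

lvcreate -L 1G -n lv vg
lvconvert -m1 vg/lv >"$out" 2>&1 &
sleep 2
lvchange -an vg/lv
wait

if grep -q "ABORTING: Mirror percentage check failed" "$out"; then
    echo "FAIL: old scary message still produced"
else
    echo "PASS: expect 'vg/lv: Interrupted: No longer active.' instead"
fi

# Clean up.
lvchange -ay vg/lv
lvremove -f vg/lv
rm -f "$out"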
Fix is upstream and already in the LVM package:

commit a20de8af208725ff2054209aa7ff7f41cc770db6
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Mon Jun 16 18:56:32 2014 -0500

    poll_daemon: Cleanly exit polling if the LV is no longer active

    If we are polling an LV due to some sort of conversion and it becomes
    inactive, a rather worrisome message is produced, e.g.:
            "  ABORTING: Mirror percentage check failed."

    We can cleanly exit if we do a simple check to see if the LV is active
    before performing the check.  This eliminates the scary message.
[root@virt-064 ~]# lvcreate -L1G -n lv1 cluster
  Logical volume "lv1" created
.
.
[root@virt-065 ~]# lvconvert -m1 cluster/lv1
  cluster/lv1: Converted: 0.8%
.
.
[root@virt-064 ~]# lvchange -an cluster/lv1
[root@virt-064 ~]# lvs
  LV      VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log         Cpy%Sync Convert
  linear  cluster    -wi-a----- 100.00m
  lv1     cluster    mwi---m---   1.00g                          lv1_mlog
  mirror  cluster    mwi-a-m--- 500.00m                          mirror_mlog 100.00
  lv_root vg_virt064 -wi-ao----   6.71g
  lv_swap vg_virt064 -wi-ao---- 816.00m

[root@virt-065 ~]# lvconvert -m1 cluster/lv1
  cluster/lv1: Converted: 0.8%
  cluster/lv1: Interrupted: No longer active.
[root@virt-065 ~]#

Marking VERIFIED with:
lvm2-2.02.107-2.el6                         BUILT: Fri Jul 11 15:47:33 CEST 2014
lvm2-libs-2.02.107-2.el6                    BUILT: Fri Jul 11 15:47:33 CEST 2014
lvm2-cluster-2.02.107-2.el6                 BUILT: Fri Jul 11 15:47:33 CEST 2014
udev-147-2.56.el6                           BUILT: Fri Jul 11 16:53:07 CEST 2014
device-mapper-1.02.86-2.el6                 BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-libs-1.02.86-2.el6            BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-event-1.02.86-2.el6           BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-event-libs-1.02.86-2.el6      BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6   BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.107-2.el6                      BUILT: Fri Jul 11 15:47:33 CEST 2014
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-1387.html