Bug 692186 - HA LVM mirror repair can print worrisome messages on device failure + service relocation
HA LVM mirror repair can print worrisome messages on device failure + service...
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.1
x86_64 Linux
low Severity low
: rc
: ---
Assigned To: Jonathan Earl Brassow
Corey Marthaler
:
Depends On:
Blocks: BrassowRHEL6Bugs
  Show dependency treegraph
 
Reported: 2011-03-30 12:16 EDT by Corey Marthaler
Modified: 2014-10-14 04:22 EDT (History)
17 users (show)

See Also:
Fixed In Version: lvm2-2.02.107-1.el6
Doc Type: Bug Fix
Doc Text:
No Documentation needed - message clean-up
Story Points: ---
Clone Of:
: 730141 773423 (view as bug list)
Environment:
Last Closed: 2014-10-14 04:22:45 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)

  None (edit)
Description Corey Marthaler 2011-03-30 12:16:11 EDT
Description of problem:


Scenario: Kill primary leg

********* Mirror info for this scenario *********
* mirrors:            ha1
* leg devices:        /dev/sde1 /dev/sdf1
* log devices:        /dev/sdc1
* failpv(s):          /dev/sde1
* failnode(s):        taft-02
* leg fault policy:   allocate
* log fault policy:   remove

* HA mirror:          YES
* HA service name:    halvm
* HA current owner:   taft-02
*************************************************

PV=/dev/sde1
        ha1_mimage_0: 5
PV=/dev/sde1
        ha1_mimage_0: 5

Disabling device sde on taft-02

Attempting I/O to cause mirror down conversion(s) on taft-02
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.7413e-05 s, 10.8 MB/s

RELOCATING halvm FROM taft-02 TO taft-01

Verifying current sanity of lvm after the failure
Verifying FAILED device /dev/sde1 is *NOT* in the volume(s)
ORIGINAL LOGS: /dev/sdc1
ORIGINAL LOGS LEFT: /dev/sdc1
olog: 1
Verifying LOG device(s) /dev/sdc1 *ARE* in the mirror(s)
log device /dev/sdc1 should still be present on taft-01
** ARE THERE ENOUGH SPARE PVS TO ALLOCATE? **

# Old service owner:
taft-02 lvm[3417]: Trying to up-convert to 2 images, 1 logs.
taft-02 lvm[3417]: Monitoring mirror device TAFT-ha1 for events.
taft-02 lvm[3417]: Another thread is handling an event. Waiting...
taft-02 lvm[3417]: 1 missing and now unallocated Physical Volumes removed from VG.
taft-02 lvm[3417]: TAFT/ha1: Converted: 0.0%
taft-02 rgmanager[2450]: Stopping service service:halvm
taft-02 rgmanager[8546]: unmounting /mnt/fs1
taft-02 rgmanager[8602]: Deactivating TAFT/ha1
taft-02 rgmanager[8624]: Making resilient : lvchange -an TAFT/ha1
taft-02 rgmanager[8649]: Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
taft-02 rgmanager[8681]: Removing ownership tag (taft-02) from TAFT/ha1
taft-02 rgmanager[2450]: Service service:halvm is stopped
taft-02 lvm[3417]: ABORTING: Mirror percentage check failed.
taft-02 lvm[3417]: Repair of mirrored LV TAFT/ha1 failed.
taft-02 lvm[3417]: Failed to remove faulty devices in TAFT-ha1.
taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address
taft-02 lvm[3417]: TAFT-ha1 disappeared, detaching
taft-02 lvm[3417]: No longer monitoring mirror device TAFT-ha1 for events.
taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address
taft-02 lvm[3417]: TAFT-ha1 disappeared, detaching
taft-02 lvm[3417]: No longer monitoring mirror device TAFT-ha1 for events.

# New service owner:
taft-01 rgmanager[2462]: Starting stopped service service:halvm
taft-01 rgmanager[3595]: Activating TAFT/ha1
taft-01 rgmanager[3660]: Making resilient : lvchange -ay TAFT/ha1
taft-01 rgmanager[3685]: Resilient command: lvchange -ay TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
taft-01 lvm[1220]: Monitoring mirror device TAFT-ha1 for events.
taft-01 rgmanager[3812]: mounting /dev/dm-5 on /mnt/fs1
taft-01 rgmanager[3834]: mount -t ext3  /dev/dm-5 /mnt/fs1
taft-01 kernel: kjournald starting.  Commit interval 5 seconds
taft-01 kernel: EXT3 FS on dm-5, internal journal
taft-01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
taft-01 kernel: SELinux: initialized (dev dm-5, type ext3), uses xattr
taft-01 rgmanager[2462]: Service service:halvm started
taft-01 lvm[1220]: TAFT-ha1 is now in-sync.
taft-01 kernel: dlm: rgmanager: receive_request_reply 3ea0001 1 master diff 3 -53



Although the repair failed, everything appears fine on the new service owner (although the log device isn't the one that's expected). So this may not be that big of an issue.

[root@taft-01 ~]# lvs -a -o +devices
 LV             VG   Attr   LSize  Log      Copy%  Devices
 ha1            TAFT mwi-ao 25.00g ha1_mlog 100.00 ha1_mimage_0(0),ha1_mimage_1(0)
 [ha1_mimage_0] TAFT iwi-ao 25.00g                 /dev/sdf1(0)
 [ha1_mimage_1] TAFT iwi-ao 25.00g                 /dev/sdg1(0)
 [ha1_mlog]     TAFT lwi-ao  4.00m                 /dev/sdb1(0)

[root@taft-01 ~]# clustat
Cluster Status for TAFT @ Wed Mar 30 11:12:38 2011
Member Status: Quorate

 Member Name        ID   Status
 ------ ----        ---- ------
 taft-01              1  Online, Local, rgmanager
 taft-02              2  Online, rgmanager
 taft-03              3  Online, rgmanager
 taft-04              4  Online, rgmanager

 Service Name     Owner (Last)    State
 ------- ----     ----- ------    -----
 service:halvm    taft-01         started







Version-Release number of selected component (if applicable):
2.6.32-94.el6.x86_64

lvm2-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
Comment 1 Milan Broz 2011-03-30 12:40:40 EDT
> taft-02 rgmanager[2450]: Service service:halvm is stopped
> taft-02 lvm[3417]: ABORTING: Mirror percentage check failed.
> taft-02 lvm[3417]: Repair of mirrored LV TAFT/ha1 failed.
> taft-02 lvm[3417]: Failed to remove faulty devices in TAFT-ha1.
> taft-02 lvm[3417]: dm_task_run failed, errno = 6, No such device or address

Hm. What's going on here?  halvm stopped, so it should not have LVs active now?
Or the relocation happens while repair is in progress?

I am not sure if this is halvm or lvm2 bug, assigning to Jon & componentn resource-agetns, please switch it back if it is core lvm problem.
Comment 2 Corey Marthaler 2011-03-30 18:25:13 EDT
The test causes the service relocation to happen just after the device failure occurs. So there isn't always time for the repair to finish before the relocation takes place. 

This is reproducible.


[...]
Disabling device sdb on taft-03

Attempting I/O to cause mirror down conversion(s) on taft-03
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.9235e-05 s, 10.4 MB/s

RELOCATING halvm FROM taft-03 TO taft-01


Mar 30 16:42:00 taft-03 lvm[1203]: Trying to up-convert to 2 images, 1 logs.
Mar 30 16:42:02 taft-03 lvm[1203]: Monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:02 taft-03 lvm[1203]: Another thread is handling an event. Waiting...
Mar 30 16:42:02 taft-03 lvm[1203]: 1 missing and now unallocated Physical Volumes removed from VG.
Mar 30 16:42:02 taft-03 lvm[1203]: TAFT/ha1: Converted: 0.0%
Mar 30 16:42:04 taft-03 rgmanager[2303]: Stopping service service:halvm
Mar 30 16:42:04 taft-03 rgmanager[8853]: unmounting /mnt/fs1
Mar 30 16:42:07 taft-03 rgmanager[8910]: Deactivating TAFT/ha1
Mar 30 16:42:07 taft-03 rgmanager[8932]: Making resilient : lvchange -an TAFT/ha1
Mar 30 16:42:08 taft-03 rgmanager[8957]: Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdc1|","a|/dev/sdd1|","a|/dev/sde1|",
Mar 30 16:42:08 taft-03 rgmanager[8990]: Removing ownership tag (taft-03) from TAFT/ha1
Mar 30 16:42:09 taft-03 rgmanager[2303]: Service service:halvm is stopped
Mar 30 16:42:17 taft-03 lvm[1203]: ABORTING: Mirror percentage check failed.
Mar 30 16:42:17 taft-03 lvm[1203]: Repair of mirrored LV TAFT/ha1 failed.
Mar 30 16:42:17 taft-03 lvm[1203]: Failed to remove faulty devices in TAFT-ha1.
Mar 30 16:42:17 taft-03 lvm[1203]: dm_task_run failed, errno = 6, No such device or address
Mar 30 16:42:17 taft-03 lvm[1203]: dm_task_run failed, errno = 6, No such device or address
Mar 30 16:42:17 taft-03 lvm[1203]: TAFT-ha1 disappeared, detaching
Mar 30 16:42:17 taft-03 lvm[1203]: TAFT-ha1 disappeared, detaching
Mar 30 16:42:17 taft-03 lvm[1203]: No longer monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:17 taft-03 lvm[1203]: No longer monitoring mirror device TAFT-ha1 for events.


Mar 30 16:42:12 taft-01 rgmanager[2263]: Starting stopped service service:halvm
Mar 30 16:42:15 taft-01 rgmanager[3268]: Activating TAFT/ha1
Mar 30 16:42:15 taft-01 rgmanager[3331]: Making resilient : lvchange -ay TAFT/ha1
Mar 30 16:42:16 taft-01 rgmanager[3356]: Resilient command: lvchange -ay TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdb1|","a|/dev/sdc1|","a|/dev/sdd1|",
Mar 30 16:42:16 taft-01 dmeventd[3404]: dmeventd ready for processing.
Mar 30 16:42:16 taft-01 lvm[3404]: Monitoring mirror device TAFT-ha1 for events.
Mar 30 16:42:17 taft-01 rgmanager[3494]: mounting /dev/dm-6 on /mnt/fs1
Mar 30 16:42:17 taft-01 rgmanager[3516]: mount -t ext3  /dev/dm-6 /mnt/fs1
Mar 30 16:42:17 taft-01 rgmanager[2263]: Service service:halvm started
Mar 30 16:53:01 taft-01 lvm[3404]: TAFT-ha1 is now in-sync.
Comment 6 Jonathan Earl Brassow 2011-08-11 17:27:41 EDT
There is no failure here, just scary messages.  Here's what's happening:

1) Test kills one of mirror legs
2) Mirror is down-converted to linear to remove faulty device
3) Mirror is up-converted to replace faulty device (causes new log device too)
4) Poll daemon is launched to track the progress of the up-convert
5) Test forces relocation of service
6) Poll daemon fails to track sync progress b/c mirror has moved.
7) Mirror starts safely and cleanly on alternate machine
*) Scary messages remain on original node.

This analysis is verifiably correct.  Simply try the following:
terminal1~> lvcreate -L 1G -n lv vg
terminal1~> lvconvert -m1 vg/lv
terminal2~> lvchange -an vg/lv

[root@bp-01 ~]# lvcreate -L 1G -n lv vg
  Logical volume "lv" created
[root@bp-01 ~]# lvconvert -m1 vg/lv
  vg/lv: Converted: 0.0%
  ABORTING: Mirror percentage check failed.

I'm pushing this out to rhel6.3.  We may have to refile against LVM or find a different bug to describe this problem in.
Comment 7 Tom Coughlan 2011-10-21 15:38:52 EDT
(In reply to comment #6)
> There is no failure here, just scary messages. 

So should there be a 6.2 technical note, or an update to the manual that explains these messages, and how to determine they are not serious?
Comment 11 Jonathan Earl Brassow 2012-07-30 18:35:22 EDT
I'll try to make sure the poll daemon either shuts down when an LV is deactivated or that it checks to see if the LV is still active when doing the check...

This is a fantastically low priority bug though.  I'll leave in Cond NACK - design.
Comment 16 Jonathan Earl Brassow 2014-06-16 19:57:59 EDT
Fix committed upstream:
commit a20de8af208725ff2054209aa7ff7f41cc770db6
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Mon Jun 16 18:56:32 2014 -0500

    poll_daemon:  Cleanly exit polling if the LV is no longer active
    
    If the we are polling an LV due to some sort of conversion and it
    becomes inactive, a rather worrisome message is produced, e.g.:
    "  ABORTING: Mirror percentage check failed."
    
    We can cleanly exit if we do a simple check to see if the LV is
    active before performing the check.  This eliminates the scary
    message.


When a convert gets interrupted by deactivation, you will see:
[root@bp-01 lvm2]# lvcreate -L 50G -n lv vg
  Logical volume "lv" created
[root@bp-01 lvm2]# lvconvert --type mirror -m 1 vg/lv
  vg/lv: Converted: 0.0%
  vg/lv: Interrupted: No longer active.

When a pvmove gets interrupted by deactivation, you will see:
[root@bp-01 lvm2]# pvmove /dev/sdb1 /dev/sdc1
  /dev/sdb1: Moved: 0.0%
  /dev/sdb1: Interrupted: No longer active.
Comment 17 Jonathan Earl Brassow 2014-06-16 20:00:51 EDT
To test, do:
terminal1~> lvcreate -L 1G -n lv vg
terminal1~> lvconvert -m1 vg/lv
terminal2~> lvchange -an vg/lv

This should NOT produce the old "scary" message:
"  ABORTING: Mirror percentage check failed."
Comment 18 Jonathan Earl Brassow 2014-06-24 10:38:26 EDT
Fix is upstream and already in LVM package:

commit a20de8af208725ff2054209aa7ff7f41cc770db6
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Mon Jun 16 18:56:32 2014 -0500

    poll_daemon:  Cleanly exit polling if the LV is no longer active
    
    If the we are polling an LV due to some sort of conversion and it
    becomes inactive, a rather worrisome message is produced, e.g.:
    "  ABORTING: Mirror percentage check failed."
    
    We can cleanly exit if we do a simple check to see if the LV is
    active before performing the check.  This eliminates the scary
    message.
Comment 20 Nenad Peric 2014-07-17 08:38:39 EDT
[root@virt-064 ~]# lvcreate -L1G -n lv1 cluster
  Logical volume "lv1" created
.
.

[root@virt-065 ~]# lvconvert -m1 cluster/lv1
  cluster/lv1: Converted: 0.8%
.
.
[root@virt-064 ~]# lvchange -an cluster/lv1
[root@virt-064 ~]# lvs
  LV      VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log         Cpy%Sync Convert
  linear  cluster    -wi-a----- 100.00m                                                            
  lv1     cluster    mwi---m---   1.00g                                lv1_mlog                    
  mirror  cluster    mwi-a-m--- 500.00m                                mirror_mlog 100.00          
  lv_root vg_virt064 -wi-ao----   6.71g                                                            
  lv_swap vg_virt064 -wi-ao---- 816.00m                   



[root@virt-065 ~]# lvconvert -m1 cluster/lv1
  cluster/lv1: Converted: 0.8%
  cluster/lv1: Interrupted: No longer active.
[root@virt-065 ~]# 



Marking VERIFIED with:

lvm2-2.02.107-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
lvm2-libs-2.02.107-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
lvm2-cluster-2.02.107-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
udev-147-2.56.el6    BUILT: Fri Jul 11 16:53:07 CEST 2014
device-mapper-1.02.86-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-libs-1.02.86-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-event-1.02.86-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-event-libs-1.02.86-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.107-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
Comment 21 errata-xmlrpc 2014-10-14 04:22:45 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html

Note You need to log in before you can comment on or make changes to this bug.