Bug 855398

Summary: CLVM: Stacking volume groups on cluster mirrors does not work
Product: Red Hat Enterprise Linux 6
Reporter: Jonathan Earl Brassow <jbrassow>
Component: lvm2
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: unspecified
Docs Contact:
Priority: high
Version: 6.3
CC: agk, cmarthal, coughlan, dwysocha, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: lvm2-2.02.98-2.el6
Doc Type: Bug Fix
Doc Text:
A regression introduced after RHEL 6.0 made it impossible to use volume groups created on top of clustered mirror logical volumes (that is, recursively stacked cluster volume groups): logical volumes could not be created in the stacked volume group. This was caused by an improper restriction, applied only to mirror logical volumes, that caused them to be ignored during activation. The restriction has been refined to pass over only mirrors that could cause LVM commands to block indefinitely. It is now possible to layer clustered volume groups on cluster mirror logical volumes again.
Last Closed: 2013-02-21 08:13:29 UTC
Type: Bug
Attachments:
  Description: Fix for problem - awaiting review
  Flags: none

Description Jonathan Earl Brassow 2012-09-07 15:31:46 UTC
Customer reports that this configuration was working for RHEL6.0 and is not working now.  I have confirmed that it is no longer possible to create a logical volume from a volume group that was stacked on cluster mirrors.

This happens because clvmd skips over mirror devices.  This must happen in some cases because if the mirror should fail and block I/O, the current LVM command attempting to read the mirror will block the repair of the mirror that must take place - a circular dependency.  However, this case should be immune since the underlying mirrors are in a different volume group.

Also, this stacking works perfectly fine in single machine instances.

Comment 2 Jonathan Earl Brassow 2012-10-11 20:30:31 UTC
Trying this on local volume groups fails when clvmd is used...

'local_top' is a local volume group built on top of two single machine mirrors.  The config file has 'locking_type = 3' set.

## First, try without going through clvmd
[root@hayes-02 ~]# lvcreate -i 2 -L 500M -n stripe local_top --config 'global{locking_type=1}'
  Using default stripesize 64.00 KiB
  Rounding size (125 extents) up to stripe boundary size (126 extents)
  Logical volume "stripe" created
[root@hayes-02 ~]# lvremove -ff local_top --config 'global{locking_type=1}'
  Logical volume "stripe" successfully removed
*SUCCESS*


## Next try going through clvmd (remember, the VGs here are still non-clustered)
[root@hayes-02 ~]# lvcreate -i 2 -L 500M -n stripe local_top
  Using default stripesize 64.00 KiB
  Rounding size (125 extents) up to stripe boundary size (126 extents)
  Error locking on node hayes-02: Volume group for uuid not found: E5kLy2Qw9Rp6nww0xm1H7BMHCwweLz4EqPF1PKMMueJs84r6Xk5gOBzTDM0WayOI
  Failed to activate new LV.
  Error locking on node hayes-02: Volume group for uuid not found: E5kLy2Qw9Rp6nww0xm1H7BMHCwweLz4EqPF1PKMMueJs84r6Xk5gOBzTDM0WayOI
  Unable to deactivate failed new LV. Manual intervention required.
*FAILURE*

Comment 3 Jonathan Earl Brassow 2012-10-11 20:41:06 UTC
'_ignore_suspended_devices' is being set for clvmd, but not for local lvcreate/lvremove.

This causes the following check in lvm2/lib/activate/dev_manager.c:device_is_usable() to trigger:
'if (target_type && !strcmp(target_type, "mirror") && ignore_suspended_devices())'
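
The effect of that condition can be illustrated with a small model.  This is a hypothetical Python sketch, not the lvm2 C code: the function name 'old_device_is_usable' and its parameters are made up for illustration, and only the mirror / ignore_suspended_devices condition is taken from the line quoted above.

```python
def old_device_is_usable(target_type, ignore_suspended_devices):
    """Hypothetical model of the pre-fix check quoted above.

    Returns False (device skipped) for every mirror target whenever
    ignore_suspended_devices is set, regardless of the mirror's health.
    """
    if target_type == "mirror" and ignore_suspended_devices:
        return False
    return True


# clvmd path (locking_type = 3): the flag is set, so mirrors are skipped.
print(old_device_is_usable("mirror", True))    # False
# Local path (locking_type = 1): the flag is not set, so mirrors are read.
print(old_device_is_usable("mirror", False))   # True
# Non-mirror targets are unaffected by this particular check.
print(old_device_is_usable("linear", True))    # True
```

In this model a healthy mirror is skipped purely because of how the flag is set, which matches the difference between the two lvcreate runs in comment 2.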

Comment 4 Jonathan Earl Brassow 2012-10-11 21:40:27 UTC
The test in comment 3 could be ignored when the mirror device is not in the same volume group as the logical volume being created/activated.

That information is not available to device_is_usable() though.

Comment 5 Jonathan Earl Brassow 2012-10-11 21:49:32 UTC
lvm2/daemons/clvmd/lvm-functions.c:do_refresh_cache() calls init_ignore_suspended_devices(1).  Later, lib/commands/toolcontext.c:refresh_filters() calls it again with the value saved from do_refresh_cache().

Comment 6 Jonathan Earl Brassow 2012-10-22 22:28:26 UTC
Created attachment 631760 [details]
Fix for problem - awaiting review

From patch header:
cluster mirror:  Allow VGs to be built on cluster mirrors

While it is possible to create VGs on top of cluster mirrors,
it is currently worthless to do so because no LVs can be created.
This is not a limitation of 'locking_type = 1' LVM.  IOW, you can
happily stack a VG on top of a single machine LV of 'mirror'
segment type.

The disconnect comes because of the way 'ignore_suspended_devices'
is set.  That is, it is not set during lvcreate/lvremove when
running 'locking_type = 1' (i.e. single machine).  However, it is set
- every time - when 'locking_type = 3' and the activation is sent
through clvmd.

'ignore_suspended_devices' is meant to avoid reading any DM device
that is suspended.  However, a mirror device can block I/O for a couple of
reasons.  The first is because it is suspended.  The second is because
it has an unaddressed device failure.  The first case would be already
addressed by the generic rejection of all DM devices that are suspended.
The second is not addressed at all by also rejecting mirror devices if
'ignore_suspended_devices' is set.  Therefore, this chunk of code is
pointless.  It also is the cause of not being able to use mirrors as
a source for VG stacking.

Comment 7 Jonathan Earl Brassow 2012-10-24 04:14:04 UTC
Fix committed upstream:

commit 9fd7ac7d035f0b2f8dcc3cb19935eb181816bd76
Author: Jonathan Brassow <jbrassow>
Date:   Tue Oct 23 23:10:33 2012 -0500

    mirror:  Avoid reading from mirrors that have failed devices
    
    Addresses: rhbz855398 (Allow VGs to be built on cluster mirrors),
               and other issues.
    
    The LVM code attempts to avoid reading labels from devices that are
    suspended to try to avoid situations that may cause the commands to
    block indefinitely.  When scanning devices, 'ignore_suspended_devices'
    can be set so the code (lib/activate/dev_manager.c:device_is_usable())
    checks any DM devices it finds and avoids them if they are suspended.
    
    The mirror target has an additional mechanism that can cause I/O to
    be blocked.  If a device in a mirror fails, all I/O will be blocked
    by the kernel until a new table (a linear target or a mirror with
    replacement devices) is loaded.  The mirror indicates that this condition
    has happened by marking a 'D' for the faulty device in its status
    output.  This condition must also be checked by 'device_is_usable()' to
    avoid the possibility of blocking LVM commands indefinitely due to an
    attempt to read the blocked mirror for labels.
    
    Until now, mirrors were avoided if the 'ignore_suspended_devices'
    condition was set.  This check seemed to suggest, "if we are concerned
    about suspended devices, then let's ignore mirrors altogether just
    in case".  This is insufficient and doesn't solve any problems.  All
    devices that are suspended are already avoided if
    'ignore_suspended_devices' is set; and if a mirror is blocking because
    of an error condition, it will block the LVM command regardless of the
    setting of that variable.
    
    Rather than avoiding mirrors whenever 'ignore_suspended_devices' is
    set, this patch causes mirrors to be avoided whenever they are blocking
    due to an error.  (As mentioned above, the case where a DM device is
    suspended is already covered.)  This solves a number of issues that weren'
    handled before.  For example, pvcreate (or any command that does a
    pv_read or vg_read, which eventually call device_is_usable()) will be
    protected from blocked mirrors regardless of how
    'ignore_suspended_devices' is set.  Additionally, a mirror that is
    neither suspended nor blocking is /allowed/ to be read regardless
    of how 'ignore_suspended_devices' is set.  (The latter point being the
    source of the fix for rhbz855398.)
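
The refined rule described in this commit message can be sketched as follows.  This is a simplified Python model under stated assumptions, not the code from the commit: the function names are hypothetical, the field layout of the mirror status line is assumed from the 'dmsetup status' output shown elsewhere in this report, and any image health character other than 'A' is treated as "blocking".

```python
def mirror_image_health(status_line):
    """Return the per-image health string (e.g. 'AA' or 'AD') from a
    dm-mirror status line as printed by 'dmsetup status'.

    Assumed layout after the optional 'name:' prefix:
      start length mirror N dev1..devN sync/total 1 HEALTH ...
    """
    tokens = status_line.split()
    i = tokens.index("mirror")
    nr_images = int(tokens[i + 1])
    # Skip the count, the N device names, the sync ratio and the literal '1'.
    return tokens[i + nr_images + 4]


def new_mirror_is_usable(status_line):
    """Hypothetical model of the refined rule: avoid a mirror only when
    one of its images is marked as failed, independent of
    ignore_suspended_devices."""
    return all(c == "A" for c in mirror_image_health(status_line))


# Sample lines in the format of the dmsetup output quoted in this report.
healthy = "vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A"
failed = "vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core"

print(new_mirror_is_usable(healthy))  # True
print(new_mirror_is_usable(failed))   # False
```

The point of the model is that the decision depends only on the mirror's reported health, so a healthy mirror underneath a stacked VG is read even when requests go through clvmd.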

Comment 8 Jonathan Earl Brassow 2012-10-24 04:15:26 UTC
Comment on attachment 631760 [details]
Fix for problem - awaiting review

patch made obsolete by better patch committed upstream.

Comment 9 Jonathan Earl Brassow 2012-10-24 04:20:44 UTC
QA test requirements:

1) Create cluster VG with two cluster mirror LVs
2) pvcreate then vgcreate a new VG on top of the cmirror LVs
3) attempt to create a striped LV in the top-level VG

#3 fails w/o the fix, and succeeds with the fix.

[If lvm.conf has 'locking_type=3', this bug is triggered whether the VGs are single machine or not.  In other words, it does not matter whether you are testing with cmirror; what matters is that the activation requests go through clvmd (i.e. locking_type=3).  So testing with non-clustered VGs is also acceptable if the locking_type is set to '3'.]
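
The three steps above can be written down as a command plan.  The following Python sketch only builds the command lines and does not run them; the function name is hypothetical, the VG/LV names and sizes are borrowed from the unit test in comment 11 and are otherwise arbitrary, and the base VG is assumed to exist already.

```python
def stacking_test_commands(base_vg="vg", top_vg="top", mirrors=("m1", "m2")):
    """Build the argv lists for the three QA steps; nothing is executed."""
    cmds = []
    # 1) Two mirror LVs in the base VG (assumed to exist already).
    for name in mirrors:
        cmds.append(["lvcreate", "-m1", "-L", "5G", "-n", name, base_vg])
    # 2) pvcreate, then vgcreate a new VG on top of the mirror LVs.
    pvs = ["/dev/%s/%s" % (base_vg, name) for name in mirrors]
    cmds.append(["pvcreate"] + pvs)
    cmds.append(["vgcreate", top_vg] + pvs)
    # 3) Striped LV in the top-level VG (the step that fails without the fix).
    cmds.append(["lvcreate", "-i", str(len(mirrors)), "-L", "1G",
                 "-n", "stripe", top_vg])
    return cmds


for argv in stacking_test_commands():
    print(" ".join(argv))
```

Keeping the plan as data makes it easy to review the exact commands before running them by hand on a test cluster with 'locking_type=3' set.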

Comment 11 Jonathan Earl Brassow 2012-10-24 16:57:23 UTC
Unit test:

[root@bp-01 lvm2]# lvcreate -m1 -L 5G -n m1 vg
  Logical volume "m1" created
[root@bp-01 lvm2]# lvcreate -m1 -L 5G -n m2 vg
  Logical volume "m2" created
[root@bp-01 lvm2]# pvcreate /dev/vg/m*
  Physical volume "/dev/vg/m1" successfully created
  Physical volume "/dev/vg/m2" successfully created
[root@bp-01 lvm2]# vgcreate top /dev/vg/m*
  Clustered volume group "top" successfully created
[root@bp-01 lvm2]# lvcreate -i 2 -L 1G -n stripe top
  Using default stripesize 64.00 KiB
  Logical volume "stripe" created
[root@bp-01 lvm2]# lvs
  LV      VG      Attr      LSize   Pool Origin Data%  Move Log     Cpy%Sync Convert
  stripe  top     -wi-a----   1.00g                                                 
  m1      vg      mwi-aom--   5.00g                         m1_mlog   100.00        
  m2      vg      mwi-aom--   5.00g                         m2_mlog   100.00        
  lv_home vg_bp01 -wi-ao--- 407.43g                                                 
  lv_root vg_bp01 -wi-ao---  50.00g                                                 
  lv_swap vg_bp01 -wi-ao---   7.84g                                                 
[root@bp-01 lvm2]# pvs
  PV         VG      Fmt  Attr PSize   PFree
  /dev/sda2  vg_bp01 lvm2 a--  465.27g    0 
  /dev/sdb1  vg      lvm2 a--    1.09t 1.08t
  /dev/sdc1  vg      lvm2 a--    1.09t 1.08t
  /dev/sdd1  vg      lvm2 a--    1.09t 1.09t
  /dev/sde1  vg      lvm2 a--    1.09t 1.09t
  /dev/sdf1  vg      lvm2 a--    1.09t 1.09t
  /dev/sdg1  vg      lvm2 a--    1.09t 1.09t
  /dev/sdh1  vg      lvm2 a--    1.09t 1.09t
  /dev/sdi1  vg      lvm2 a--    1.09t 1.09t
  /dev/vg/m1 top     lvm2 a--    5.00g 4.50g
  /dev/vg/m2 top     lvm2 a--    5.00g 4.50g

Comment 12 Jonathan Earl Brassow 2012-10-25 03:33:26 UTC
POST -> ASSIGNED.

While running QA's sts test suite to look for another bug, I stumbled on a complication for the patches for this bug.

Corey's tests often poll with 'pvs' while performing the tests and killing devices.  It isn't /that/ hard to get his tests to hang on a 'pvs' when a mirrored-log device goes bad.  This means that LVM hangs indefinitely, because the mirrors cannot get their chance to repair.  It is a very tough situation to get out of.

This was a known issue going in and is documented in the code comments of the upstream patch associated with this bug:
+ * _mirrored_transient_status().  FIXME: It is unable to handle mirrors
+ * with mirrored logs because it does not have a way to get the status of
+ * the mirror that forms the log, which could be blocked.
I now consider it essential to be able to recurse into a mirrored log and determine its status as well.

Comment 13 Jonathan Earl Brassow 2012-10-25 04:02:23 UTC
Here is an example that illustrates comment 12:

[root@bp-01 lvm2]# !lvcre
lvcreate -m 1 --mirrorlog mirrored -L 200M -n lv vg
  Logical volume "lv" created
[root@bp-01 lvm2]# devices vg
  LV                 Cpy%Sync Devices                                
  lv                   100.00 lv_mimage_0(0),lv_mimage_1(0)          
  [lv_mimage_0]               /dev/sdb1(0)                           
  [lv_mimage_1]               /dev/sdc1(0)                           
  [lv_mlog]            100.00 lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
  [lv_mlog_mimage_0]          /dev/sdh1(0)                           
  [lv_mlog_mimage_1]          /dev/sdi1(0)                           
[root@bp-01 lvm2]# killall dmeventd
[root@bp-01 lvm2]# off.sh sdi
Turning off sdi
[root@bp-01 lvm2]# !dd
dd if=/dev/zero of=/dev/vg/lv bs=4M count=10 &
[1] 8725
[root@bp-01 lvm2]# dmsetup status vg-lv vg-lv_mlog
vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A
vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core
[root@bp-01 lvm2]# pvs
hang
hang
hang

The 'pvs' cannot proceed because it is trying to read the mirror that contains a failed mirror log.  You can see that this is tricky, because the log is blocking but doesn't register as failed in the 'vg-lv' status.

Comment 14 Jonathan Earl Brassow 2012-10-25 05:44:16 UTC
Additional fix checked-in upstream:

commit b248ba0a396d7fc9a459eea02cfdc70b33ce3441
Author: Jonathan Brassow <jbrassow>
Date:   Thu Oct 25 00:42:45 2012 -0500

    mirror:  Avoid reading mirrors with failed devices in mirrored log
    
    Commit 9fd7ac7d035f0b2f8dcc3cb19935eb181816bd76 did not handle mirrors
    that contained mirrored logs.  This is because the status line of the
    mirror does not give an indication of the health of the mirrored log,
    as you can see here:
            [root@bp-01 lvm2]# dmsetup status vg-lv vg-lv_mlog
            vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A
            vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core
    Thus, the possibility for LVM commands to hang still persists when mirrors
    have mirrored logs.  I discovered this while performing some testing that
    does polling with 'pvs' while doing I/O and killing devices.  The 'pvs'
    managed to get between the mirrored log device failure and the attempt
    by dmeventd to repair it.  The result was a very nasty block in LVM
    commands that is very difficult to remove - even for someone who knows
    what is going on.  Thus, it is absolutely essential that the log of a
    mirror be recursively checked for mirror devices which may be failed
    as well.
    
    Despite what the code comment says in the aforementioned commit...
    + * _mirrored_transient_status().  FIXME: It is unable to handle mirrors
    + * with mirrored logs because it does not have a way to get the status of
    + * the mirror that forms the log, which could be blocked.
    ... it is possible to get the status of the log because the log device
    major/minor is given to us by the status output of the top-level mirror.
    We can use that to query the log device for any DM status and see if it
    is a mirror that needs to be bypassed.  This patch does just that and is
    now able to avoid reading from mirrors that have failed devices in a
    mirrored log.
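
The recursive check described in this commit message can be sketched as follows.  This is a hypothetical Python model, not the committed C code: the function names are made up, the status line layout is assumed from the dmsetup output above, the lookup table stands in for querying DM status by major:minor, and the key '253:8' for vg-lv is an assumption because the report does not give that device's own major:minor.

```python
def parse_mirror_status(line):
    """Parse one dm-mirror line from 'dmsetup status' into a dict.

    Assumed layout after the optional 'name:' prefix:
      start length mirror N dev1..devN sync/total 1 HEALTH
      LOGARGS LOGTYPE [LOGDEV LOGHEALTH]
    """
    tokens = line.split()
    i = tokens.index("mirror")
    n = int(tokens[i + 1])
    log_type = tokens[i + n + 6]
    # A 'disk' log is followed by its device (major:minor) and a health
    # character; a 'core' log has neither.
    on_disk = log_type == "disk"
    return {
        "health": tokens[i + n + 4],
        "log_dev": tokens[i + n + 7] if on_disk else None,
        "log_health": tokens[i + n + 8] if on_disk else None,
    }


def mirror_is_blocked(dev, status_by_dev):
    """Return True if the mirror at 'dev', or a mirrored log beneath it,
    reports a failed device.  Devices missing from status_by_dev are
    treated as non-mirror logs and are not inspected further."""
    line = status_by_dev.get(dev)
    if line is None or " mirror " not in line:
        return False
    st = parse_mirror_status(line)
    if any(c != "A" for c in st["health"]):
        return True
    if st["log_dev"] is None:
        return False
    if st["log_health"] != "A":
        return True
    # The log looks alive from here; recurse in case it is itself a mirror.
    return mirror_is_blocked(st["log_dev"], status_by_dev)


status = {
    "253:8": "vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A",
    "253:5": "vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core",
}
print(mirror_is_blocked("253:8", status))  # True
```

The top-level line alone reports 'AA' and a log health of 'A'; only following the log device '253:5' reveals the 'AD' state, which is why the recursion is needed.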

Comment 15 Jonathan Earl Brassow 2012-10-25 05:48:00 UTC
Unit test showing that the commit in comment 14 clears the objection raised in comment 12:

[root@bp-01 lvm2]# lvcreate -m 1 --mirrorlog mirrored -L 200M -n lv vg
  Logical volume "lv" created
[root@bp-01 lvm2]# devices vg
  LV                 Cpy%Sync Devices                                
  lv                   100.00 lv_mimage_0(0),lv_mimage_1(0)          
  [lv_mimage_0]               /dev/sdb1(0)                           
  [lv_mimage_1]               /dev/sdc1(0)                           
  [lv_mlog]            100.00 lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
  [lv_mlog_mimage_0]          /dev/sdh1(0)                           
  [lv_mlog_mimage_1]          /dev/sdi1(0)                           
[root@bp-01 lvm2]# killall -9 dmeventd
[root@bp-01 lvm2]# off.sh sdi
Turning off sdi
[root@bp-01 lvm2]# dd if=/dev/zero of=/dev/vg/lv bs=4M count=10 &
[1] 4878
[root@bp-01 lvm2]# dmsetup status vg-lv vg-lv_mlog
vg-lv: 0 409600 mirror 2 253:6 253:7 400/400 1 AA 3 disk 253:5 A
vg-lv_mlog: 0 8192 mirror 2 253:3 253:4 7/8 1 AD 1 core
[root@bp-01 lvm2]# pvs -vvvv >& out
[root@bp-01 lvm2]# grep Mirror out
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
#activate/dev_manager.c:239         /dev/mapper/vg-lv_mlog: Mirror image 1 marked as failed
#activate/dev_manager.c:358         /dev/mapper/vg-lv_mlog: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:239         253:5: Mirror image 1 marked as failed
#activate/dev_manager.c:358         253:5: Mirror device vg-lv_mlog not usable.
#activate/dev_manager.c:358         /dev/vg/lv: Mirror device vg-lv not usable.
[root@bp-01 lvm2]# lvconvert --repair vg/lv
  /dev/sdi1: read failed after 0 of 512 at 1197851148288: Input/output error
  /dev/sdi1: read failed after 0 of 512 at 1197851234304: Input/output error
  /dev/sdi1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdi1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sdi1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid edQd0w-MMjR-xA2h-c1NF-ozv3-03YG-UoW3Xn.
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 170.554 s, 246 kB/s
  Mirror log status: 1 of 2 images failed.
Attempt to replace failed mirror log? [y/n]: y
  Trying to up-convert to 2 images, 2 logs.
[1]+  Done                    dd if=/dev/zero of=/dev/vg/lv bs=4M count=10
[root@bp-01 lvm2]# devices vg
  Couldn't find device with uuid edQd0w-MMjR-xA2h-c1NF-ozv3-03YG-UoW3Xn.
  LV                 Cpy%Sync Devices                                
  lv                   100.00 lv_mimage_0(0),lv_mimage_1(0)          
  [lv_mimage_0]               /dev/sdb1(0)                           
  [lv_mimage_1]               /dev/sdc1(0)                           
  [lv_mlog]            100.00 lv_mlog_mimage_0(0),lv_mlog_mimage_1(0)
  [lv_mlog_mimage_0]          /dev/sdh1(0)                           
  [lv_mlog_mimage_1]          /dev/sdg1(0)

Comment 19 Nenad Peric 2012-12-18 14:07:12 UTC
Marking verified based on Comment 9 and Comment 11

Additionally, the test for Comment 15 was done on a single machine, since the 'mirrored' log type is unavailable to cluster mirrors.

Verified with:

lvm2-2.02.98-6.el6.x86_64
lvm2-cluster-2.02.98-6.el6.x86_64
cmirror-2.02.98-6.el6.x86_64
device-mapper-1.02.77-6.el6.x86_64

Comment 20 errata-xmlrpc 2013-02-21 08:13:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html