Bug 960396 - pvmove on clustered VG should check the modules and the service on all the nodes before proceeding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: pre-dev-freeze
Target Release: 6.6
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks: BrassowRHEL6Bugs
 
Reported: 2013-05-07 07:24 UTC by Pierguido Lambri
Modified: 2018-12-09 17:00 UTC (History)

Fixed In Version: lvm2-2.02.107-1.el6
Doc Type: Bug Fix
Doc Text:
No documentation needed.
Clone Of:
Environment:
Last Closed: 2014-10-14 08:24:25 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:1387 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2014-10-14 01:39:47 UTC
Red Hat Knowledge Base (Solution) 20527 None None None Never

Comment 8 Jonathan Earl Brassow 2014-04-29 20:13:29 UTC
I have a cluster of three nodes with shared storage.  The last node (bp-03) does not have 'cmirrord' running.

Attempting 'pvmove' from node3 gives:
---------------------------
[root@bp-03 ~]# devices vg
  LV   Attr       Cpy%Sync Devices     
  lv   -wi-a-----          /dev/sdb1(0)
[root@bp-03 ~]# pvmove /dev/sdb1 /dev/sdc1
  Cannot move in clustered VG vg, clustered mirror (cmirror) not detected and LVs are activated non-exclusively.
---------------------------
The error message could be cleaned up a bit, but no action is performed and the system continues to operate just fine.

Attempting 'pvmove' from node1 gives:
---------------------------
[root@bp-01 ~]# pvmove /dev/sdb1 /dev/sdc1
  Error locking on node bp-03: device-mapper: reload ioctl on  failed: Invalid argument
  Failed to suspend lv
  ABORTING: Temporary pvmove mirror activation failed.
---------------------------
This is a bit more cryptic, but we can see the problem was on bp-03.  Going there and looking at the logs, we find:
---------------------------
device-mapper: dm-log-userspace: Unable to send3
device-mapper: dm-log-userspace: Userspace log server not found                 
device-mapper: table: 253:4: mirror: Error creating mirror dirty log            
device-mapper: ioctl: error adding target to table
---------------------------

So far, I don't think this is too much for the user to handle; but the real problem is the state the machines are left in.  Nodes that do not have 'cmirrord' running on them are left with a residual DM device (vg-pvmove0 in this case) that LVM is unaware of.  You can only discover this if you use 'dmsetup ls' and check all the nodes.  What's worse, cleaning up that device, starting 'cmirrord' on bp-03, and attempting pvmove again causes cluster mirror requests to retry indefinitely with no way to kill pvmove.

The error path in this case must be cleaned up - even though it is unlikely that cmirrord would be running on only a subset of nodes.
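A minimal sketch of how the per-node 'dmsetup ls' check could be scripted.  It assumes the leftover device is named "<vg>-pvmove<N>" (as with vg-pvmove0 here) and that the text printed by 'dmsetup ls' on each node has already been collected somehow; the function name and the sample output are hypothetical.

```python
import re

# Assumption: temporary pvmove devices are named "<vg>-pvmove<N>",
# as with "vg-pvmove0" in this report.
PVMOVE_NAME = re.compile(r"-pvmove\d+$")

def residual_pvmove_devices(dmsetup_ls_output):
    """Return DM device names in `dmsetup ls` output that look like pvmove leftovers."""
    names = []
    for line in dmsetup_ls_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # The device name is the first whitespace-separated field.
        if PVMOVE_NAME.search(fields[0]):
            names.append(fields[0])
    return names

# Hypothetical output from a node without cmirrord after a failed pvmove.
sample = "vg-lv\t(253:3)\nvg-pvmove0\t(253:4)\n"
print(residual_pvmove_devices(sample))
```

Running this against the output of every node would show which ones still carry the device, since LVM itself does not report it.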

Comment 9 Jonathan Earl Brassow 2014-04-30 17:12:02 UTC
(In reply to Jonathan Earl Brassow from comment #8)
> What's worse, cleaning up that
> device, starting 'cmirrord' on bp-03, and attempting pvmove again causes
> cluster mirror requests to retry indefinitely with no way to kill pvmove.

This is not true.  After cleaning up the residual DM device and starting 'cmirrord' on bp-03, the pvmove executes just fine.  I had conflicting config files due to debugging a different problem with this cluster earlier.

It still holds that we should clean up the error path so that the residual DM device is not there, if possible.

Comment 10 Jonathan Earl Brassow 2014-05-01 15:36:20 UTC
pvmove does try to revert the VG after the error occurs and is successful on all the nodes except for the one where the problem originates.  This is because the node not running cmirrord fails to load the table for the pvmove mirror.  This means that the LV that would contain the pvmove segment does not get a new (inactive) table.

The revert attempts to clear any inactive tables.  The nodes that have 'cmirrord' running have inactive tables that point to the pvmove segment, so they properly clear the inactive tables and remove the pvmove segment.  The remaining nodes have no inactive tables to clear and thus no pointers to the pvmove segment (which has been created but has no table).  So, the pvmove DM device stays around.
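The behaviour described above can be illustrated with a toy model.  This is not LVM code: the function, the data layout and the node name bp-02 are assumptions made only to show why the revert misses the node without cmirrord.

```python
def failed_pvmove_then_revert(cmirrord_running):
    """Toy model of one failed clustered pvmove followed by the revert.

    cmirrord_running maps node name -> bool.  Returns node -> set of DM
    devices left on that node afterwards.  Illustrative only.
    """
    leftover = {}
    for node, has_cmirrord in cmirrord_running.items():
        # The pvmove device is created on every node.
        devices = {"vg-lv", "vg-pvmove0"}
        # The mirror table loads only where cmirrord runs; only there does
        # vg-lv get an inactive table that points at vg-pvmove0.
        inactive_refs = {"vg-pvmove0"} if has_cmirrord else set()
        # Revert: clear inactive tables and remove what they pointed to.
        devices -= inactive_refs
        leftover[node] = devices
    return leftover

state = failed_pvmove_then_revert({"bp-01": True, "bp-02": True, "bp-03": False})
residual = sorted(n for n, devs in state.items() if "vg-pvmove0" in devs)
print(residual)  # ['bp-03']
```

In this model the node without cmirrord has nothing for the revert to follow, so vg-pvmove0 survives there.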

I've tried to clean this up by tweaking the code in _lv_resume when laopts->revert is set.  However, this didn't work because there are no links to the pvmove segment from any inactive tables.  I'm going to move on and see if there is something that can be done to remove a DM device if a table fails to load - that will hopefully get rid of the residual device.

Comment 12 Jonathan Earl Brassow 2014-05-02 20:41:49 UTC
In dm_tree_preload_children, when a _create_node succeeds but a _load_node fails, I've added (essentially) a call to '_deactivate_node' to remove the tableless device.  dm_tree_preload_children still returns error in this case as it always has.
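The shape of that change can be sketched as follows.  This only illustrates the create / load / remove-on-failure pattern; FakeDM, LoadError and preload are hypothetical stand-ins and do not reflect the libdevmapper API or the actual patch.

```python
class LoadError(Exception):
    pass

class FakeDM:
    """Stand-in for device-mapper on one node; not the libdevmapper API."""
    def __init__(self, log_server_running):
        self.log_server_running = log_server_running
        self.devices = {}  # name -> table (None means no table loaded)

    def create(self, name):
        self.devices[name] = None

    def load(self, name, table):
        # A mirror table cannot load without the userspace log server.
        if table.startswith("mirror") and not self.log_server_running:
            raise LoadError("Error creating mirror dirty log")
        self.devices[name] = table

    def remove(self, name):
        del self.devices[name]

def preload(dm, name, table):
    """Create a device and load its table; drop the device if the load fails."""
    dm.create(name)
    try:
        dm.load(name, table)
    except LoadError:
        dm.remove(name)  # do not leave a tableless device behind
        return False     # the caller still sees the failure
    return True

node = FakeDM(log_server_running=False)
print(preload(node, "vg-pvmove0", "mirror"), node.devices)  # False {}
```

The failure is still reported to the caller; the only difference is that no tableless device is left on the node.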

When "update_metadata -> _suspend_lvs -> suspend_lvs -> suspend_lv" is called to load the pvmove segment, the above change ensures that the node that fails to load the table (due to cmirrord not running) will remove the associated device.  For a reason I can't explain, this causes the subsequent operation (vg_revert + revert_lv) to have no effect in removing the pvmove segment on the nodes that /are/ running cmirrord - something it did just fine before the above changes were made.

I can't see anything that would cause this change in behavior.  The initial 'suspend_lv' gets an error from the node not running cmirrord just as it did before.  The state of the DM devices is exactly the same before the 'revert_lv' is called, with the exception that the pvmove device with the empty table on the node(s) not running cmirrord is not present anymore.

Comment 13 Jonathan Earl Brassow 2014-05-02 21:35:35 UTC
I found one difference.  It shows up when revert_lv is called and clvmd gets down to calling dm_task_run(dmt->type == DM_DEVICE_REMOVE) -> _do_dm_ioctl:

This one works (unpatched code):
1753            if (ioctl(_control_fd, command, dmi) < 0 &&
(gdb) p *dmi
$1 = {version = {4, 0, 0}, data_size = 16384, data_start = 312, 
  target_count = 0, open_count = 0, flags = 524, event_nr = 6327708, 
  padding = 0, dev = 64772, name = '\000' <repeats 127 times>, 
  uuid = '\000' <repeats 128 times>, data = "\000\000\000\000\000\000"}

And this one fails with errno=6 (No such device or address):
1753            if (ioctl(_control_fd, command, dmi) < 0 &&
(gdb) p *dmi
$3 = {version = {4, 0, 0}, data_size = 16384, data_start = 312, 
  target_count = 0, open_count = 0, flags = 524, event_nr = 6322248, 
  padding = 0, dev = 64772, name = "vg-pvmove0", '\000' <repeats 117 times>, 
  uuid = '\000' <repeats 128 times>, data = "\000\000\000\000\000\000"}

The difference seems to be that one is passing in a name.
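As a reading aid for the two dumps: both carry dev = 64772.  The sketch below decodes that number, assuming the kernel's "huge" device-number encoding is what the dev field holds; the function name is hypothetical.

```python
def decode_dm_dev(dev):
    """Split a device number into (major, minor).

    Assumes the Linux "huge" encoding: minor bits 0-7 in the low byte,
    major in bits 8-19, remaining minor bits from bit 20 upwards.
    """
    major = (dev >> 8) & 0xFFF
    minor = (dev & 0xFF) | ((dev >> 12) & 0xFFF00)
    return major, minor

print(decode_dm_dev(64772))  # (253, 4)
```

Under that assumption both calls address device 253:4, which appears to be the same device named in the "table: 253:4" kernel message in comment 8, so the request that fails differs only in also carrying the name.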

Comment 14 Jonathan Earl Brassow 2014-05-02 21:59:26 UTC
Ok, I've got a working patch.  I'll clean it up before posting.

Comment 15 Jonathan Earl Brassow 2014-05-28 17:54:35 UTC
Fix committed upstream:
commit 442820aae3648e1846417d1248fa36030eba4bd8
Author: Jonathan Brassow <jbrassow@redhat.com>
Date:   Wed May 28 10:17:15 2014 -0500

Comment 16 Jonathan Earl Brassow 2014-05-28 20:33:34 UTC
The solution for this bug is to clean up any residual device-mapper devices left over from the failed pvmove attempt.  The user is then able to check the logs for the reason for the failure and discover that cmirrord is not running on a subset of the nodes.  They can then start cmirrord on those nodes and try again.  They should /not/ need to reboot or clean up residual device-mapper devices before trying again.

Previous steps that would reproduce the problem:
1) Set up a cluster, but do not start cmirrord on a subset of nodes
2) Attempt pvmove on an LV from a node that /is/ running cmirrord
*) Previously, this would leave residual pvmove DM devices on the non-cmirrord nodes that would interfere with subsequent LVM operations.

Comment 19 Nenad Peric 2014-07-17 12:23:45 UTC
I executed a pvmove on a cluster node which had cmirrord running, but it reports that it is skipping mirror, and yet it moves the PV. 

This can be seen below. What does the "Skipping mirror" message mean?
The PV belonging to a mirror gets removed, so I do not understand this message. 


[root@virt-064 ~]# lvs -a -o+devices
  LV                VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log         Cpy%Sync Convert Devices                                                                    
  linear            cluster    -wi-a----- 100.00m                                                             /dev/sda(0)                                                                
  mirror            cluster    mwi-a-m--- 500.00m                                mirror_mlog 100.00           mirror_mimage_0(0),mirror_mimage_1(0),mirror_mimage_2(0),mirror_mimage_3(0)
  [mirror_mimage_0] cluster    iwi-aom--- 500.00m                                                             /dev/sda(25)                                                               
  [mirror_mimage_1] cluster    iwi-aom--- 500.00m                                                             /dev/sdd(0)                                                                
  [mirror_mimage_2] cluster    iwi-aom--- 500.00m                                                             /dev/sdi(0)                                                                
  [mirror_mimage_3] cluster    iwi-aom--- 500.00m                                                             /dev/sdh(0)                                                                
  [mirror_mlog]     cluster    lwi-aom---   4.00m                                                             /dev/sdh(125)                                                              
  lv_root           vg_virt064 -wi-ao----   6.71g                                                             /dev/vda2(0)                                                               
  lv_swap           vg_virt064 -wi-ao---- 816.00m                       



[root@virt-064 ~]# vgextend cluster /dev/sdb
  Physical volume "/dev/sdb" successfully created
  Volume group "cluster" successfully extended
[root@virt-064 ~]# pvmove /dev/sdh /dev/sdb
  Skipping mirror LV mirror
  /dev/sdh: Moved: 0.8%
  /dev/sdh: Moved: 99.2%
  /dev/sdh: Moved: 100.0%


[root@virt-064 ~]# lvs -a -o+devices
  LV                VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log         Cpy%Sync Convert Devices                                                                    
  linear            cluster    -wi-a----- 100.00m                                                             /dev/sda(0)                                                                
  mirror            cluster    mwi-a-m--- 500.00m                                mirror_mlog 100.00           mirror_mimage_0(0),mirror_mimage_1(0),mirror_mimage_2(0),mirror_mimage_3(0)
  [mirror_mimage_0] cluster    iwi-aom--- 500.00m                                                             /dev/sda(25)                                                               
  [mirror_mimage_1] cluster    iwi-aom--- 500.00m                                                             /dev/sdd(0)                                                                
  [mirror_mimage_2] cluster    iwi-aom--- 500.00m                                                             /dev/sdi(0)                                                                
  [mirror_mimage_3] cluster    iwi-aom--- 500.00m                                                             /dev/sdb(0)                                                                
  [mirror_mlog]     cluster    lwi-aom---   4.00m                                                             /dev/sdb(125)                                                              
  lv_root           vg_virt064 -wi-ao----   6.71g                                                             /dev/vda2(0)                                                               
  lv_swap           vg_virt064 -wi-ao---- 816.00m                                                             /dev/vda2(1718)

Comment 21 Nenad Peric 2014-08-06 11:35:48 UTC
Opening a new bug for the message. 

Marking this one as VERIFIED with:

lvm2-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
lvm2-libs-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
lvm2-cluster-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
udev-147-2.57.el6    BUILT: Thu Jul 24 15:48:47 CEST 2014
device-mapper-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-libs-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-event-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-event-libs-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014

Comment 22 Jonathan Earl Brassow 2014-08-26 14:51:46 UTC
Please update this bug with the pointer to the new bug.

That strange message is a consequence of the structure of a mirror.  In the case of comment 19, it is not the LV "mirror" that gets moved, but rather its sub-LV "mirror_mimage_3".  The code just needs to do a better job with the messaging, as you say.
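To make the mirror structure concrete, here is a small sketch that lists which LVs have extents directly on a given PV.  It works on (LV, Devices) pairs copied from the first 'lvs -a -o+devices' listing in comment 19; the function name is hypothetical and this is not how pvmove itself selects LVs.

```python
import re

def lvs_on_pv(rows, pv):
    """Return names of LVs whose Devices column references the given PV.

    rows is a list of (lv_name, devices) pairs.  Square brackets around
    hidden sub-LV names are dropped.
    """
    hits = []
    for lv_name, devices in rows:
        # Each entry looks like "/dev/sdh(0)"; strip the "(extent)" suffix.
        used = {re.sub(r"\(\d+\)$", "", d) for d in devices.split(",")}
        if pv in used:
            hits.append(lv_name.strip("[]"))
    return hits

rows = [
    ("linear", "/dev/sda(0)"),
    ("mirror", "mirror_mimage_0(0),mirror_mimage_1(0),"
               "mirror_mimage_2(0),mirror_mimage_3(0)"),
    ("[mirror_mimage_3]", "/dev/sdh(0)"),
    ("[mirror_mlog]", "/dev/sdh(125)"),
]
print(lvs_on_pv(rows, "/dev/sdh"))  # ['mirror_mimage_3', 'mirror_mlog']
```

The top-level LV "mirror" maps only to its sub-LVs, never to /dev/sdh itself, which is consistent with it being skipped while the sub-LVs are moved.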

Comment 23 errata-xmlrpc 2014-10-14 08:24:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html

