Description of problem:
I was tracking the copy percentage of a pvmove finishing (during the mirror_sanity segmented_pvmove test case) and noticed that the block totals change as it finishes, making the finished percent > 100%.

As it finished, it went from:
mirror_sanity-pvmove0: 0 999424 mirror 2 8:33 8:35 976/976 1 AA 1 clustered_core
to:
mirror_sanity-pvmove0: 999424 40960 mirror 2 8:33 8:17 41/40 1 AA 1 clustered_core

Maybe this isn't even a bug, but it begs for an explanation.

Version-Release number of selected component (if applicable):
2.6.18-128.el5
lvm2-2.02.40-6.el5           BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-7.el5   BUILT: Wed Nov 26 07:19:19 CST 2008
device-mapper-1.02.28-2.el5  BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.36-1.el5         BUILT: Tue Dec  9 16:38:13 CST 2008
kmod-cmirror-0.1.21-10.el5   BUILT: Wed Dec 17 15:18:59 CST 2008

How reproducible:
Easily
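For reference, the N/M field near the end of each mirror status line is the sync_count/region_count pair (the same lc->sync_count and lc->region_count that clogd reports), and the percent printed by the tests is just that ratio times 100, which is how 41/40 shows up as more than 100%. A minimal sketch of pulling the ratio out of a status line (my own illustration, not LVM or device-mapper source):

/* Parse sync_count/region_count out of a dmsetup status mirror line
 * and compute the copy percent the tests report. */
#include <stdio.h>

int main(void)
{
    const char *status =
        "999424 40960 mirror 2 8:33 8:17 41/40 1 AA 1 clustered_core";
    unsigned long long start, len, sync_count, region_count;
    unsigned int nr_mirrors;

    /* Skip the two device fields that follow the mirror count. */
    if (sscanf(status, "%llu %llu mirror %u %*s %*s %llu/%llu",
               &start, &len, &nr_mirrors, &sync_count, &region_count) != 5)
        return 1;

    printf("copy percent = %.2f%%\n",
           100.0 * (double)sync_count / (double)region_count);
    /* Prints 102.50% for the line above -- the symptom reported here. */
    return 0;
}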
I was unable to reproduce this bug with single-machine lvm2.
It is not a bug to see the number of regions change as pvmove goes from one segment of a 'segmented_pvmove' to another. The interesting part is the 41/40... the numerator should never exceed the denominator. I believe the problem there is related to bug 474174, and so should be fixed with this check-in:

commit 337f12b719cb5dfb279317d2d111954c5c5c0416
Author: Jonathan Brassow <jbrassow>
Date:   Tue Feb 10 15:27:59 2009 -0600

    clogd: Fix bug 474174 - copy percentage of corelog mirror can get stuck

    Upon resuming, it was possible for the first resuming cluster node to
    send out checkpoints with state from a previous session before getting
    around to resetting its log state for the current session. This would
    cause the nodes in the cluster to disagree on what recovery work needed
    to be done. It could also result in some nodes reporting erroneous sync
    percentage ratios.

    The fix is to not send out checkpoints until the log has been properly
    initialized (now done through an explicit check).
I have not seen this bug in many months, though it's possible that's because it's hidden behind bug 491340. Marking verified.
I hit this last night while running two pvmoves in a cluster at the same time.

Package info:
2.6.18-152.el5
lvm2-2.02.46-5.el5           BUILT: Sat Jun  6 16:28:28 CDT 2009
lvm2-cluster-2.02.46-5.el5   BUILT: Sat Jun  6 16:27:58 CDT 2009
device-mapper-1.02.32-1.el5  BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5         BUILT: Tue May  5 11:41:18 CDT 2009
kmod-cmirror-0.1.21-14.el5   BUILT: Thu May 21 08:28:25 CDT 2009

Test log:
=== Iteration 2 of 10 started on morph-04 at Thu Jun 11 02:41:25 CDT 2009 ===
FREE PVS: /dev/sdf1 /dev/sdg2 /dev/sdi2 /dev/sdj1
INUSE PVS: pv_shuffle_A=/dev/sdg1 /dev/sdk1 /dev/sdk2 /dev/sdl2 pv_shuffle_B=/dev/sdf2 /dev/sdh1 /dev/sdh2
NOT INUSE PVS: pv_shuffle_A=/dev/sdl1 pv_shuffle_B=/dev/sdi1 /dev/sdj2
Adding /dev/sdf1 to pv_shuffle_A
Adding /dev/sdg2 to pv_shuffle_B
Moving data from /dev/sdg1 to /dev/sdf1 on morph-04
Moving data from /dev/sdh1 to /dev/sdg2 on morph-04
morph-04: pvmove -v /dev/sdg1 /dev/sdf1
morph-04: pvmove -v /dev/sdh1 /dev/sdg2
Finding volume group "pv_shuffle_B"
Executing: /sbin/modprobe dm-log-clustered
Archiving volume group "pv_shuffle_B" metadata (seqno 9).
Creating logical volume pvmove0
Moving 896 extents of logical volume pv_shuffle_B/linear
Moving 299 extents of logical volume pv_shuffle_B/stripe
Finding volume group "pv_shuffle_A"
Executing: /sbin/modprobe dm-log-clustered
Archiving volume group "pv_shuffle_A" metadata (seqno 9).
Creating logical volume pvmove0
Moving 896 extents of logical volume pv_shuffle_A/linear
Updating volume group metadata
Updating volume group metadata
Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 10).
Checking progress every 15 seconds
Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 10).
Checking progress every 15 seconds
Removing temporary pvmove LV
Writing out final volume group after pvmove
Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 12).
/dev/sdg1: Moved: 5.4%
...
/dev/sdg1: Moved: 100.0%
Command failed
Unable to get copy percent, pvmove most likely finished. Ignore 'Command failed' messages ;)
Command failed
Command failed
Command failed
Command failed
Command failed
Updating volume group metadata
Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 11).
copy percent is greater than 100% (100.04), appear to have hit bz 479749
Removing temporary pvmove LV
Writing out final volume group after pvmove
Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 13).
/dev/sdh1: Moved: 1.7%
...
/dev/sdh1: Moved: 100.0%
I was able to reproduce this again this morning. Here's the raw output from dmsetup status:

0 7340032 linear
7340032 2449408 mirror 2 8:113 8:98 2393/2392 1 AA 1 clustered_core
Created attachment 351607 [details]
pv_shuffle test log with extra debug output

I'm still hitting this during our regression runs of 5.4 on my six-node i386 cluster, tankmorph. I'm running the test case:

pv_shuffle -R tankmorph.xml -i 10

I added some debug code to the test case to show when I'm getting the copy percent and the `dmsetup status` output when I hit the failure. The output is attached. The failure seems to change after the volume group backup is taken.
Created attachment 353888 [details]
clogd SIGUSR1 output from all nodes

I issued a SIGUSR1 to clogd on each node when I detected the bug. Here are the debug logs from /var/log/messages on each node.
The log seems to think that the region_count is greater than what the in-kernel mirror thinks. From the attached debug output, we see:

Jul 15 12:30:38 morph-01 clogd[3484]: Validating dOzNp1Y0::
Jul 15 12:30:38 morph-01 clogd[3484]: lc->region_count = 7168
Jul 15 12:30:38 morph-01 clogd[3484]: lc->sync_count = 2393
Jul 15 12:30:38 morph-01 clogd[3484]: next zero bit = 2393

Assuming that the 'dmsetup status' line reports 2393/2392, it would mean that the mirror thinks there are 2392 regions... Looking at the status lines in comment 7, a region count of 7168 would correspond to the number of regions needed for:

'0 7340032 linear'

while the following would correspond to a region count of 2392:

'7340032 2449408 mirror 2 8:113...'

So, it appears that the log is not properly cleared, or 'ti->len' is not properly updated when the call to make the log is issued.
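To make the arithmetic above concrete: assuming a region size of 1024 sectors (an assumption on my part, but it is consistent with the counts quoted here), the two segments work out to exactly the 7168 and 2392 regions seen in the logs. A quick sketch, not clogd code:

/* Region-count arithmetic behind the analysis above. */
#include <stdio.h>

static unsigned long long region_count(unsigned long long len_sectors,
                                       unsigned long long region_size)
{
    /* Round up: a partial region at the end still needs a log bit. */
    return (len_sectors + region_size - 1) / region_size;
}

int main(void)
{
    /* Old segment: "0 7340032 linear" -> what the stale log still covers. */
    printf("old segment: %llu regions\n", region_count(7340032ULL, 1024ULL));
    /* New segment: "7340032 2449408 mirror ..." -> what the kernel mirror covers. */
    printf("new segment: %llu regions\n", region_count(2449408ULL, 1024ULL));
    /* 7168 vs 2392: a sync_count carried over from the old log (2393 here)
     * can then exceed the new mirror's 2392 regions. */
    return 0;
}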
Here is a simple sequence of commands to reproduce the problem (assuming sdb1 is in the volume group 'vg'). The lvcreate/lvextend/lvremove dance leaves vg/lv as two discontiguous segments on sdb1, so the final pvmove has to move a segmented LV:

lvremove -ff vg
lvcreate -L 1G -n lv vg /dev/sdb1
lvcreate -L 250M -n remove vg /dev/sdb1
lvextend -L +500M vg/lv /dev/sdb1
lvremove -f vg/remove
pvmove /dev/sdb1
Created attachment 353912 [details]
Patch that fixes problem

Comments from my pending check-in:

clogd: Fix for bug 479749 - erroneous sync count

This bug affects pvmove when moving a segmented LV (i.e. a logical volume comprised of several parts). pvmove reuses the same UUID when moving different segments in the same LV. This can cause the cluster log daemon to become confused as to what is being referred to. The log daemon was keeping the wrong log when asked to simultaneously delete and resume logs with the same name.

Perhaps the fix should not be in the log daemon at all, but in 'pvmove'. Because the scope of the bug is limited to pvmove and segmented logical volumes, and because the "fix" may cause unforeseen issues if done the way the patch does it, I will defer to 5.5. I suspect a better solution would be to have 'pvmove' use different UUIDs when moving between segments.
*** Bug 507400 has been marked as a duplicate of this bug. ***
Changes committed to rhel5 branch, which will be used for 5.5. (Also trying to get into 5.4 under bug 506843.)
Fix for this bug went into 506843...

commit fa15cb16f0de9ddf06ddb50f2fcf26dd9c257931
Author: Jonathan Brassow <jbrassow>
Date:   Mon Jul 27 13:07:36 2009 -0500

    clogd/dm-log-clustered.ko: Fix for bugs 506843, 479749, 507400

    Device-mapper userspace logs (like the clustered log) are identified by
    a universally unique identifier (UUID). This identifier is used to
    associate requests from the kernel to a specific log in userspace. The
    UUID must be unique everywhere, since multiple machines may use this
    identifier when sharing a log, as is the case for cluster logs.

    Sometimes, device-mapper/LVM may re-use a UUID. This is the case during
    'pvmove's, when moving from one segment of an LV to another, or when
    resizing a mirror, etc. In these cases, a new log is created with the
    same UUID and loaded in the "inactive" slot. When a device-mapper
    "resume" is issued, the "live" table is deactivated and the new
    "inactive" table becomes "live". (The "inactive" table can also be
    removed via a device-mapper 'clear' command.)

    The above two issues were colliding in the 3 bugs (possibly more)
    mentioned in the title. More than one log was being created with the
    same UUID, and there was no way to distinguish between them. So,
    sometimes the wrong log would be swapped out during the exchange.

    The solution is to create a 'uuid_instance', or perhaps a 'luid': a
    local unique identifier to go along with the UUID. This new identifier
    is used to determine exactly which log is being referenced by the
    kernel when the log exchange is made. The identifier is not universally
    safe, but it does not need to be, since create/destroy/suspend/resume
    operations are bound to a specific machine; and these are the
    operations that make up the exchange.
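For illustration only (this is not the actual dm-log-userspace or clogd source, and the struct and field names are my own), the shape of the fix is that userspace keeps its logs keyed by the (uuid, luid) pair rather than by UUID alone, so two pvmove segment logs that share a UUID can no longer be confused:

/* Sketch of looking up a log by (uuid, luid) instead of UUID alone. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UUID_LEN 129  /* assumed length for this sketch */

struct log_entry {
    char     uuid[UUID_LEN];   /* universally unique, shared cluster-wide */
    uint64_t luid;             /* local instance id, unique per machine */
    struct log_entry *next;
    /* ... log state (region count, sync bits, ...) would follow ... */
};

/* Match on both fields to find the log the kernel actually means. */
static struct log_entry *find_log(struct log_entry *head,
                                  const char *uuid, uint64_t luid)
{
    for (struct log_entry *e = head; e; e = e->next)
        if (!strcmp(e->uuid, uuid) && e->luid == luid)
            return e;
    return NULL;
}

int main(void)
{
    /* Two logs with the same UUID, as pvmove creates for successive
     * segments; only the luid tells them apart. */
    struct log_entry old_seg = { "LVM-pvmove0", 1, NULL };
    struct log_entry new_seg = { "LVM-pvmove0", 2, &old_seg };

    printf("found luid %llu\n",
           (unsigned long long)find_log(&new_seg, "LVM-pvmove0", 2)->luid);
    return 0;
}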
I ran pv_shuffle all night and didn't see this problem. I also ran the test case in comment #15 without issue. Marking verified in cmirror-1.1.39-2.el5.

[root@taft-01 ~]# lvcreate -L 1G -n lv corey /dev/sdd1
Logical volume "lv" created
[root@taft-01 ~]# lvcreate -L 250M -n remove corey /dev/sdd1
Rounding up size to full physical extent 252.00 MB
Logical volume "remove" created
[root@taft-01 ~]# lvextend -L +500M corey/lv /dev/sdd1
Extending logical volume lv to 1.49 GB
Logical volume lv successfully resized
[root@taft-01 ~]# lvremove -f corey/remove
Logical volume "remove" successfully removed
[root@taft-01 ~]# pvmove /dev/sdd1
/dev/sdd1: Moved: 21.3%
/dev/sdd1: Moved: 36.2%
/dev/sdd1: Moved: 58.8%
/dev/sdd1: Moved: 67.2%
/dev/sdd1: Moved: 90.0%
/dev/sdd1: Moved: 100.0%
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1340.html