Description of problem:
I was tracking the copy percentage of a pvmove finishing (during the mirror_sanity segmented_pvmove test case) and noticed that the block totals change as it finishes, making the finished percent > 100%.

As it finished, it went from:
mirror_sanity-pvmove0: 0 999424 mirror 2 8:33 8:35 976/976 1 AA 1 clustered_core
to:
mirror_sanity-pvmove0: 999424 40960 mirror 2 8:33 8:17 41/40 1 AA 1 clustered_core

Maybe this isn't even a bug, but it begs for an explanation.

Version-Release number of selected component (if applicable):
2.6.18-128.el5
lvm2-2.02.40-6.el5           BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-7.el5   BUILT: Wed Nov 26 07:19:19 CST 2008
device-mapper-1.02.28-2.el5  BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.36-1.el5         BUILT: Tue Dec  9 16:38:13 CST 2008
kmod-cmirror-0.1.21-10.el5   BUILT: Wed Dec 17 15:18:59 CST 2008

How reproducible:
Easily
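For reference, the N/M field near the end of each mirror status line is the sync_count/region_count pair (the same lc->sync_count and lc->region_count that clogd reports), and the percent printed by the tests is just that ratio times 100, which is how 41/40 shows up as more than 100%. A minimal sketch of pulling the ratio out of a status line (my own illustration, not LVM or device-mapper source):

/* Parse sync_count/region_count out of a dmsetup status mirror line
 * and compute the copy percent the tests report. */
#include <stdio.h>

int main(void)
{
    const char *status =
        "999424 40960 mirror 2 8:33 8:17 41/40 1 AA 1 clustered_core";
    unsigned long long start, len, sync_count, region_count;
    unsigned int nr_mirrors;

    /* Skip the two device fields that follow the mirror count. */
    if (sscanf(status, "%llu %llu mirror %u %*s %*s %llu/%llu",
               &start, &len, &nr_mirrors, &sync_count, &region_count) != 5)
        return 1;

    printf("copy percent = %.2f%%\n",
           100.0 * (double)sync_count / (double)region_count);
    /* Prints 102.50% for the line above -- the symptom reported here. */
    return 0;
}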
I was unable to reproduce this bug with single-machine lvm2.
It is not a bug to see the number of regions change as pvmove goes from one segment of a 'segmented_pvmove' to another. The interesting part is the 41/40... the numerator should never exceed the denominator. I believe the problem there is related to bug 474174, and so should be fixed with this check-in:

commit 337f12b719cb5dfb279317d2d111954c5c5c0416
Author: Jonathan Brassow <jbrassow>
Date:   Tue Feb 10 15:27:59 2009 -0600

    clogd: Fix bug 474174 - copy percentage of corelog mirror can get stuck

    Upon resuming, it was possible for the first resuming cluster node to
    send out checkpoints with state from a previous session before getting
    around to resetting its log state for the current session. This would
    cause the nodes in the cluster to disagree on what recovery work needed
    to be done. It could also result in some nodes reporting erroneous sync
    percentage ratios.

    The fix is to not send out checkpoints until the log has been properly
    initialized (now done through an explicit check).
I have not seen this bug in many months, though it's possible that's because it's hidden behind bug 491340. Marking verified.
I hit this last night while running two pvmoves in a cluster at the same time.

Package info:
2.6.18-152.el5
lvm2-2.02.46-5.el5           BUILT: Sat Jun  6 16:28:28 CDT 2009
lvm2-cluster-2.02.46-5.el5   BUILT: Sat Jun  6 16:27:58 CDT 2009
device-mapper-1.02.32-1.el5  BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5         BUILT: Tue May  5 11:41:18 CDT 2009
kmod-cmirror-0.1.21-14.el5   BUILT: Thu May 21 08:28:25 CDT 2009

Test log:
=== Iteration 2 of 10 started on morph-04 at Thu Jun 11 02:41:25 CDT 2009 ===
FREE PVS: /dev/sdf1 /dev/sdg2 /dev/sdi2 /dev/sdj1
INUSE PVS: pv_shuffle_A=/dev/sdg1 /dev/sdk1 /dev/sdk2 /dev/sdl2 pv_shuffle_B=/dev/sdf2 /dev/sdh1 /dev/sdh2
NOT INUSE PVS: pv_shuffle_A=/dev/sdl1 pv_shuffle_B=/dev/sdi1 /dev/sdj2
Adding /dev/sdf1 to pv_shuffle_A
Adding /dev/sdg2 to pv_shuffle_B
Moving data from /dev/sdg1 to /dev/sdf1 on morph-04
Moving data from /dev/sdh1 to /dev/sdg2 on morph-04
morph-04: pvmove -v /dev/sdg1 /dev/sdf1
morph-04: pvmove -v /dev/sdh1 /dev/sdg2
Finding volume group "pv_shuffle_B"
Executing: /sbin/modprobe dm-log-clustered
Archiving volume group "pv_shuffle_B" metadata (seqno 9).
Creating logical volume pvmove0
Moving 896 extents of logical volume pv_shuffle_B/linear
Moving 299 extents of logical volume pv_shuffle_B/stripe
Finding volume group "pv_shuffle_A"
Executing: /sbin/modprobe dm-log-clustered
Archiving volume group "pv_shuffle_A" metadata (seqno 9).
Creating logical volume pvmove0
Moving 896 extents of logical volume pv_shuffle_A/linear
Updating volume group metadata
Updating volume group metadata
Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 10).
Checking progress every 15 seconds
Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 10).
Checking progress every 15 seconds
Removing temporary pvmove LV
Writing out final volume group after pvmove
Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 12).
/dev/sdg1: Moved: 5.4%
...
/dev/sdg1: Moved: 100.0%
Command failed
Unable to get copy percent, pvmove most likely finished. Ignore 'Command failed' messages ;)
Command failed
Command failed
Command failed
Command failed
Command failed
Updating volume group metadata
Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 11).
copy percent is greater than 100% (100.04), appear to have hit bz 479749
Removing temporary pvmove LV
Writing out final volume group after pvmove
Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 13).
/dev/sdh1: Moved: 1.7%
...
/dev/sdh1: Moved: 100.0%
I was able to reproduce this again this morning. Here's the raw output from dmsetup status:

0 7340032 linear
7340032 2449408 mirror 2 8:113 8:98 2393/2392 1 AA 1 clustered_core
Created attachment 351607 [details]
pv_shuffle test log with extra debug output

I'm still hitting this during our regression runs of 5.4 on my six-node i386 cluster, tankmorph. I'm running the test case:

pv_shuffle -R tankmorph.xml -i 10

I added some debug code to the test case to show when I'm getting the copy percent and the `dmsetup status` output when I hit the failure. The output is attached. The failure seems to change after the volume group backup is taken.
Created attachment 353888 [details]
clogd SIGUSR1 output from all nodes

I issued a SIGUSR1 to clogd on each node when I detected the bug. Here are the debug logs from /var/log/messages on each node.
The log seems to think that the region_count is greater than what the in-kernel mirror thinks. From the attached debug output, we see:

Jul 15 12:30:38 morph-01 clogd[3484]: Validating dOzNp1Y0::
Jul 15 12:30:38 morph-01 clogd[3484]: lc->region_count = 7168
Jul 15 12:30:38 morph-01 clogd[3484]: lc->sync_count = 2393
Jul 15 12:30:38 morph-01 clogd[3484]: next zero bit = 2393

Assuming that the 'dmsetup status' line reports 2393/2392, it would mean that the mirror thinks there are 2392 regions... Looking at the status lines in comment 7, a region count of 7168 would correspond to the number of regions needed for:

'0 7340032 linear'

while the following would correspond to a region count of 2392:

'7340032 2449408 mirror 2 8:113...'

So, it appears that the log is not properly cleared, or 'ti->len' is not properly updated when the call to make the log is issued.
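To make the arithmetic above concrete: assuming a region size of 1024 sectors (an assumption on my part, but it is consistent with the counts quoted here), the two segments work out to exactly the 7168 and 2392 regions seen in the logs. A quick sketch, not clogd code:

/* Region-count arithmetic behind the analysis above. */
#include <stdio.h>

static unsigned long long region_count(unsigned long long len_sectors,
                                       unsigned long long region_size)
{
    /* Round up: a partial region at the end still needs a log bit. */
    return (len_sectors + region_size - 1) / region_size;
}

int main(void)
{
    /* Old segment: "0 7340032 linear" -> what the stale log still covers. */
    printf("old segment: %llu regions\n", region_count(7340032ULL, 1024ULL));
    /* New segment: "7340032 2449408 mirror ..." -> what the kernel mirror covers. */
    printf("new segment: %llu regions\n", region_count(2449408ULL, 1024ULL));
    /* 7168 vs 2392: a sync_count carried over from the old log (2393 here)
     * can then exceed the new mirror's 2392 regions. */
    return 0;
}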
Here is a simple sequence of commands to reproduce the problem (assuming sdb1 is in the volume group 'vg'). The lvcreate/lvextend/lvremove dance leaves vg/lv as two discontiguous segments on sdb1, so the final pvmove has to move a segmented LV:

lvremove -ff vg
lvcreate -L 1G -n lv vg /dev/sdb1
lvcreate -L 250M -n remove vg /dev/sdb1
lvextend -L +500M vg/lv /dev/sdb1
lvremove -f vg/remove
pvmove /dev/sdb1
Created attachment 353912 [details]
Patch that fixes problem

Comments from my pending check-in:

clogd: Fix for bug 479749 - erroneous sync count

This bug affects pvmove when moving a segmented LV (i.e. a logical volume comprised of several parts). pvmove reuses the same UUID when moving different segments in the same LV. This can cause the cluster log daemon to become confused as to what is being referred to. The log daemon was keeping the wrong log when asked to simultaneously delete and resume logs with the same name.

Perhaps the fix should not be in the log daemon at all, but in 'pvmove'. Because the scope of the bug is limited to pvmove and segmented logical volumes, and because the "fix" may cause unforeseen issues if done the way the patch does it, I will defer to 5.5. I suspect a better solution would be to have 'pvmove' use different UUIDs when moving between segments.
*** Bug 507400 has been marked as a duplicate of this bug. ***
Changes committed to rhel5 branch, which will be used for 5.5. (Also trying to get into 5.4 under bug 506843.)
Fix for this bug went into 506843...

commit fa15cb16f0de9ddf06ddb50f2fcf26dd9c257931
Author: Jonathan Brassow <jbrassow>
Date:   Mon Jul 27 13:07:36 2009 -0500

    clogd/dm-log-clustered.ko: Fix for bugs 506843, 479749, 507400

    Device-mapper userspace logs (like the clustered log) are identified by
    a universally unique identifier (UUID). This identifier is used to
    associate requests from the kernel to a specific log in userspace. The
    UUID must be unique everywhere, since multiple machines may use this
    identifier when sharing a log, as is the case for cluster logs.

    Sometimes, device-mapper/LVM may re-use a UUID. This is the case during
    'pvmove's, when moving from one segment of an LV to another, or when
    resizing a mirror, etc. In these cases, a new log is created with the
    same UUID and loaded in the "inactive" slot. When a device-mapper
    "resume" is issued, the "live" table is deactivated and the new
    "inactive" table becomes "live". (The "inactive" table can also be
    removed via a device-mapper 'clear' command.)

    The above two issues were colliding in the 3 bugs (possibly more)
    mentioned in the title. More than one log was being created with the
    same UUID, and there was no way to distinguish between them. So,
    sometimes the wrong log would be swapped out during the exchange.

    The solution is to create a 'uuid_instance', or perhaps a 'luid': a
    local unique identifier to go along with the UUID. This new identifier
    is used to determine exactly which log is being referenced by the
    kernel when the log exchange is made. The identifier is not universally
    safe, but it does not need to be, since create/destroy/suspend/resume
    operations are bound to a specific machine; and these are the
    operations that make up the exchange.
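For illustration only (this is not the actual dm-log-userspace or clogd source, and the struct and field names are my own), the shape of the fix is that userspace keeps its logs keyed by the (uuid, luid) pair rather than by UUID alone, so two pvmove segment logs that share a UUID can no longer be confused:

/* Sketch of looking up a log by (uuid, luid) instead of UUID alone. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UUID_LEN 129  /* assumed length for this sketch */

struct log_entry {
    char     uuid[UUID_LEN];   /* universally unique, shared cluster-wide */
    uint64_t luid;             /* local instance id, unique per machine */
    struct log_entry *next;
    /* ... log state (region count, sync bits, ...) would follow ... */
};

/* Match on both fields to find the log the kernel actually means. */
static struct log_entry *find_log(struct log_entry *head,
                                  const char *uuid, uint64_t luid)
{
    for (struct log_entry *e = head; e; e = e->next)
        if (!strcmp(e->uuid, uuid) && e->luid == luid)
            return e;
    return NULL;
}

int main(void)
{
    /* Two logs with the same UUID, as pvmove creates for successive
     * segments; only the luid tells them apart. */
    struct log_entry old_seg = { "LVM-pvmove0", 1, NULL };
    struct log_entry new_seg = { "LVM-pvmove0", 2, &old_seg };

    printf("found luid %llu\n",
           (unsigned long long)find_log(&new_seg, "LVM-pvmove0", 2)->luid);
    return 0;
}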
I ran pv_shuffle all night and didn't see this problem. I also ran the test case in comment #15 without issue. Marking verified in cmirror-1.1.39-2.el5.

[root@taft-01 ~]# lvcreate -L 1G -n lv corey /dev/sdd1
Logical volume "lv" created
[root@taft-01 ~]# lvcreate -L 250M -n remove corey /dev/sdd1
Rounding up size to full physical extent 252.00 MB
Logical volume "remove" created
[root@taft-01 ~]# lvextend -L +500M corey/lv /dev/sdd1
Extending logical volume lv to 1.49 GB
Logical volume lv successfully resized
[root@taft-01 ~]# lvremove -f corey/remove
Logical volume "remove" successfully removed
[root@taft-01 ~]# pvmove /dev/sdd1
/dev/sdd1: Moved: 21.3%
/dev/sdd1: Moved: 36.2%
/dev/sdd1: Moved: 58.8%
/dev/sdd1: Moved: 67.2%
/dev/sdd1: Moved: 90.0%
/dev/sdd1: Moved: 100.0%
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1340.html