Bug 445456

Summary: RHEL5 cmirror tracker: multiple mirrors can cause copy percent to get stuck
Product: Red Hat Enterprise Linux 5
Component: cmirror
Version: 5.3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Target Milestone: rc
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, ccaulfie, dwysocha, edamato, heinzm, mbroz
Doc Type: Bug Fix
Last Closed: 2009-01-20 21:26:08 UTC

Attachments: strace output

Description Corey Marthaler 2008-05-06 21:53:11 UTC
Description of problem:
I noticed that my test 'cmirror_lock_stress' had timed out and that the copy
percents were stuck. The last cmds that I noticed being run by that test were
disk -> core log converts. I didn't notice anything odd in the syslog. Even
though there are other open BZs regarding stuck sync percents, none are
currently in a NEW or ASSIGNED state and all include some type of error which I
didn't see here. The only issue I noticed this time was the sync percents.

[root@hayes-03 ~]# date
Tue May  6 16:05:55 CDT 2008
[root@hayes-03 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  LogVol00      VolGroup00  -wi-ao  72.44G
  LogVol01      VolGroup00  -wi-ao   1.94G
  hayes-01.9776 lock_stress mwi-a- 500.00M                                        14.40
  hayes-01.9778 lock_stress mwi-a- 500.00M                                        95.20
  hayes-01.9784 lock_stress mwi-a- 500.00M                                        93.60
  hayes-02.9782 lock_stress mwi-a- 500.00M                                        18.40
  hayes-02.9788 lock_stress mwi-a- 500.00M                    hayes-02.9788_mlog 100.00
  hayes-02.9792 lock_stress mwi-a- 500.00M                                        94.40
  hayes-03.9780 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9786 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9790 lock_stress mwi-a- 500.00M                                        96.00


[root@hayes-03 ~]# date
Tue May  6 16:40:43 CDT 2008
[root@hayes-03 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  LogVol00      VolGroup00  -wi-ao  72.44G
  LogVol01      VolGroup00  -wi-ao   1.94G
  hayes-01.9776 lock_stress mwi-a- 500.00M                                        14.40
  hayes-01.9778 lock_stress mwi-a- 500.00M                                        95.20
  hayes-01.9784 lock_stress mwi-a- 500.00M                                        93.60
  hayes-02.9782 lock_stress mwi-a- 500.00M                                        18.40
  hayes-02.9788 lock_stress mwi-a- 500.00M                    hayes-02.9788_mlog 100.00
  hayes-02.9792 lock_stress mwi-a- 500.00M                                        94.40
  hayes-03.9780 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9786 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9790 lock_stress mwi-a- 500.00M                                        96.00

Here are the last cmds run on all 3 nodes in the cluster:
May  6 15:23:33 hayes-01 qarshd[2409]: Running cmdline: lvs lock_stress/hayes-01.9784 --noheadings -o copy_percent | awk {'print $1'}
May  6 15:23:36 hayes-01 lvm[3537]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:36 hayes-01 lvm[3537]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:46 hayes-01 lvm[3537]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:47 hayes-01 lvm[3537]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:55 hayes-01 lvm[3537]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:23:56 hayes-01 lvm[3537]: Monitoring mirror device lock_stress-hayes--02.9782 for events
[...]

May  6 15:23:48 hayes-02 qarshd[5435]: Running cmdline: lvs lock_stress/hayes-02.9792 --noheadings -o copy_percent | awk {'print $1'}
May  6 15:23:51 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:23:54 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:23:58 hayes-02 qarshd[5110]: Sending child 5113 signal 2
May  6 15:23:58 hayes-02 qarshd[3852]: Sending child 3854 signal 2
May  6 15:23:58 hayes-02 qarshd[5435]: Sending child 5436 signal 2
May  6 15:24:08 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:24:11 hayes-02 lvm[3375]: lock_stress-hayes--02.9792 is now in-sync
May  6 15:24:11 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:24:13 hayes-02 xinetd[2735]: EXIT: qarsh status=0 pid=5110 duration=70(sec)
May  6 15:24:13 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--03.9786 for events
May  6 15:24:16 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--03.9786 for events
May  6 15:24:23 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--03.9786 for events
May  6 15:24:25 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--03.9786 for events
May  6 15:24:26 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--03.9790 for events
May  6 15:24:27 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--03.9790 for events
[...]

May  6 15:23:09 hayes-03 qarshd[31127]: Running cmdline: lvconvert -m 1 --corelog lock_stress/hayes-03.9780
May  6 15:23:10 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9778 for events
May  6 15:23:11 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9778 for events
May  6 15:23:12 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9776 for events
May  6 15:23:12 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9776 for events
May  6 15:23:21 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9776 for events
May  6 15:23:22 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9776 for events
May  6 15:23:23 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9784 for events
May  6 15:23:24 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9784 for events
May  6 15:23:34 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9784 for events
May  6 15:23:36 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9784 for events
May  6 15:23:38 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:40 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:48 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:51 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May  6 15:23:55 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:23:59 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--02.9782 for events
May  6 15:24:04 hayes-03 qarshd[31127]: Sending child 31130 signal 2
May  6 15:24:04 hayes-03 qarshd[31125]: Sending child 31128 signal 2
May  6 15:24:04 hayes-03 qarshd[31126]: Sending child 31129 signal 2
May  6 15:24:04 hayes-03 qarshd[31127]: Sending child 31130 signal 2
May  6 15:24:04 hayes-03 qarshd[31125]: Sending child 31128 signal 2
May  6 15:24:04 hayes-03 qarshd[31126]: Sending child 31129 signal 2
[...]

What this test does is create mirrors on different machines in the cluster (in
this case 3 mirrors per node) and then run the following operations on each:
* down convert
* up convert
* core log convert
* disk log convert
* deactivate
* reactivate
* deactivate
* delete
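The cycle above can be sketched as a small shell script. This is an illustrative sketch only, not the actual lock_stress harness: the VG/LV names, poll interval, and retry count are assumptions; the lvs/awk query is the same one the test logs show. A mirror hit by this bug never reports 100.00, so the wait function times out:

```shell
#!/bin/sh
# Sketch only -- not the real lock_stress test. VG/LV names are assumed.
VG=lock_stress
LV=hayes-01.9776
POLL=${POLL:-5}      # seconds between copy_percent polls

copy_percent() {
    # same query the test logs show: first field of the noheadings output
    lvs "$VG/$LV" --noheadings -o copy_percent | awk '{print $1}'
}

wait_in_sync() {
    # poll up to $1 times; a stuck mirror never reaches 100.00
    tries=$1
    while [ "$tries" -gt 0 ]; do
        [ "$(copy_percent)" = "100.00" ] && return 0
        tries=$((tries - 1))
        sleep "$POLL"
    done
    return 1
}

# One iteration of the cycle, on a real cluster (not run here):
#   lvconvert -m 0 "$VG/$LV"                    # down convert
#   lvconvert -m 1 "$VG/$LV"                    # up convert
#   lvconvert -m 1 --corelog "$VG/$LV"          # core log convert
#   lvconvert -m 1 --mirrorlog disk "$VG/$LV"   # disk log convert
#   wait_in_sync 60 || echo "copy percent stuck" >&2
#   lvchange -an "$VG/$LV" && lvchange -ay "$VG/$LV"    # deactivate/reactivate
#   lvchange -an "$VG/$LV" && lvremove -f "$VG/$LV"     # deactivate/delete
```

The polling logic is the part that detects this bug: a healthy 500M mirror syncs well inside the timeout, so a timeout means a stuck copy percent rather than a slow one.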


Version-Release number of selected component (if applicable):
2.6.18-92.el5
[root@hayes-03 ~]# rpm -qi lvm2-cluster
Name        : lvm2-cluster                 Relocations: (not relocatable)
Version     : 2.02.32                           Vendor: Red Hat, Inc.
Release     : 4.el5                         Build Date: Wed 02 Apr 2008 03:56:50 AM CDT
Install Date: Mon 05 May 2008 11:27:54 AM CDT      Build Host: hs20-bc2-3.build.redhat.com
[root@hayes-03 ~]# rpm -qi kmod-cmirror
Name        : kmod-cmirror                 Relocations: (not relocatable)
Version     : 0.1.8                             Vendor: Red Hat, Inc.
Release     : 1.el5                         Build Date: Thu 28 Feb 2008 01:01:13 PM CST
Install Date: Mon 05 May 2008 12:01:21 PM CDT      Build Host: ls20-bc1-14.build.redhat.com

Comment 1 Corey Marthaler 2008-05-07 15:39:38 UTC
Reproduced this, and found that not only can the sync counts be stuck, they
can also differ from node to node.

[root@hayes-03 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  hayes-01.4588 lock_stress mwi-a- 500.00M                         96.80
  hayes-01.4592 lock_stress mwi-a- 500.00M                         11.20
  hayes-01.4594 lock_stress mwi-a- 500.00M                        100.00
  hayes-02.4582 lock_stress mwi-a- 500.00M                         94.40
  hayes-02.4586 lock_stress mwi-a- 500.00M                         96.80
  hayes-02.4590 lock_stress mwi-a- 500.00M                         17.60
  hayes-03.4584 lock_stress mwi-a- 500.00M                        100.00
  hayes-03.4596 lock_stress mwi-a- 500.00M                          8.80
  hayes-03.4598 lock_stress mwi-a- 500.00M                         24.80

[root@hayes-02 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  hayes-01.4588 lock_stress mwi-a- 500.00M                         93.60
  hayes-01.4592 lock_stress mwi-a- 500.00M                         16.00
  hayes-01.4594 lock_stress mwi-a- 500.00M                         94.40
  hayes-02.4582 lock_stress mwi-a- 500.00M                        100.00
  hayes-02.4586 lock_stress mwi-a- 500.00M                         95.20
  hayes-02.4590 lock_stress mwi-a- 500.00M                         16.00
  hayes-03.4584 lock_stress mwi-a- 500.00M                         96.80
  hayes-03.4596 lock_stress mwi-a- 500.00M                         13.60
  hayes-03.4598 lock_stress mwi-a- 500.00M                         18.40


Comment 2 Jonathan Earl Brassow 2008-05-07 19:08:35 UTC
What is the build date for the cmirror rpm you are using? You mention
kmod-cmirror, but not cmirror.


Comment 3 Corey Marthaler 2008-05-07 20:00:11 UTC
[root@hayes-02 ~]# rpm -qi cmirror
Name        : cmirror                      Relocations: (not relocatable)
Version     : 1.1.15                            Vendor: Red Hat, Inc.
Release     : 1.el5                         Build Date: Thu 28 Feb 2008 01:04:29 PM CST
Install Date: Mon 05 May 2008 12:01:16 PM CDT      Build Host: hs20-bc1-7.build.redhat.com


Comment 4 Corey Marthaler 2008-07-15 21:39:21 UTC
Jon and I have narrowed this down to just needing to create 8 mirrors. Doing so
usually leaves one of the mirrors' copy percent stuck at something less than
100%.
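The check being made here can be sketched as a small filter over lvs output. This is a hedged sketch (the function name is mine, and the VG name and sizes come from comment 5's listing); only the lvs column names are real:

```shell
#!/bin/sh
# Sketch only: flag mirrors whose copy percent is short of 100.00.
stuck_mirrors() {
    # stdin: "lv_name copy_percent" pairs, e.g. from
    #   lvs --noheadings -o lv_name,copy_percent <vg>
    awk '$2 != "100.00" { print $1, "stuck at", $2 }'
}

# On a real cluster (not run here):
#   for i in 1 2 3 4 5 6 7 8; do
#       lvcreate -m 1 -n "sync_check_$i" -L 1G mirror_sanity
#   done
#   ...allow time for the initial syncs to finish...
#   lvs --noheadings -o lv_name,copy_percent mirror_sanity | stuck_mirrors
```

Against the comment-5 listing below, this would print only `sync_check_8 stuck at 40.23`.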

2.6.18-92.1.5.el5

lvm2-2.02.39-2.el5    BUILT: Wed Jul  9 07:26:29 CDT 2008
lvm2-cluster-2.02.39-1.el5    BUILT: Thu Jul  3 09:31:57 CDT 2008
device-mapper-1.02.27-1.el5    BUILT: Thu Jul  3 03:22:29 CDT 2008
cmirror-1.1.19-2.el5    BUILT: Tue Jul  8 11:15:54 CDT 2008
kmod-cmirror-0.1.10-1.el5    BUILT: Tue May 20 14:55:48 CDT 2008

Comment 5 Corey Marthaler 2008-07-15 22:08:01 UTC
  sync_check_1 mirror_sanity mwi-a-  1.00G       sync_check_1_mlog 100.00
  sync_check_2 mirror_sanity mwi-a-  1.00G       sync_check_2_mlog 100.00
  sync_check_3 mirror_sanity mwi-a-  1.00G       sync_check_3_mlog 100.00
  sync_check_4 mirror_sanity mwi-a-  1.00G       sync_check_4_mlog 100.00
  sync_check_5 mirror_sanity mwi-a-  1.00G       sync_check_5_mlog 100.00
  sync_check_6 mirror_sanity mwi-a-  1.00G       sync_check_6_mlog 100.00
  sync_check_7 mirror_sanity mwi-a-  1.00G       sync_check_7_mlog 100.00
  sync_check_8 mirror_sanity mwi-a-  1.00G       sync_check_8_mlog  40.23


Comment 6 Corey Marthaler 2008-07-16 16:14:57 UTC
I appear to have hit this on hayes-01 by just doing a simple upconvert:

Create a mirror and then attempt to up convert it
lvcreate -m 1 -n mirror_up_converts -L 100M --corelog mirror_sanity
Upconvert to 2 redundant legs
lvconvert -m 2 --corelog /dev/mirror_sanity/mirror_up_converts

mirror_up_converts mirror_sanity mwi-a- 100.00M           4.00

The copy percent is now stuck at 4%, and the process is looping; I'll attach
the strace output.

Comment 7 Corey Marthaler 2008-07-16 16:16:24 UTC
Created attachment 311963 [details]
strace output

The strace prints the following over and over...

Comment 8 Jonathan Earl Brassow 2008-07-16 21:14:32 UTC
I think upconverts are broken on cluster mirrors.

It seems that lvm is not properly suspending the first mirror before activating
the new one... leaving two different mirrors with overlapping devices active. 
(That's bad.)
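One way to confirm that diagnosis is to look for a block device that appears in more than one active mirror table. This is a hedged sketch: it assumes the standard `dmsetup table` line format (`name: start length target args...`), where mirror legs and disk logs show up as maj:min tokens:

```shell
#!/bin/sh
# Sketch only: flag block devices shared by two active dm "mirror" tables.
mirror_devs() {
    # stdin: "dmsetup table" output; print every maj:min token on each
    # line whose target type (field 4, after the "name:" prefix) is mirror
    awk '$4 == "mirror" { for (i = 5; i <= NF; i++)
                              if ($i ~ /^[0-9]+:[0-9]+$/)
                                  print $i }'
}

# On a live system (not run here), any device printed twice is mapped by
# two different mirrors at once -- the overlap described above:
#   dmsetup table | mirror_devs | sort | uniq -d
```

An empty result would mean no device is claimed by two mirror tables; any output would confirm the suspected overlap.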



Comment 9 Jonathan Earl Brassow 2008-07-23 20:15:18 UTC
Pretty sure this should fix the bug... at the very least, it will fix several
problems:

commit 613d97438673200c87e4b07e3c4ee659c01acf65
Author: Jonathan Brassow <jbrassow>
Date:   Wed Jul 23 15:12:24 2008 -0500

    dm-log-clustered: Fix bug 445456

    I was resetting some common memory outside of the
    protection of a lock.


Comment 10 Corey Marthaler 2008-08-25 20:38:48 UTC
Fix verified with the latest code.

2.6.18-98.el5

lvm2-2.02.32-4.el5    BUILT: Fri Apr  4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5    BUILT: Wed Apr  2 03:56:50 CDT 2008
device-mapper-1.02.24-1.el5    BUILT: Thu Jan 17 16:46:05 CST 2008
cmirror-1.1.22-1.el5    BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5    BUILT: Thu 24 Jul 2008 04:00:48 PM CDT

Comment 12 errata-xmlrpc 2009-01-20 21:26:08 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html