Description of problem:
I noticed that my test 'cmirror_lock_stress' had timed out and that the copy
percents were stuck. The last commands I noticed being run by that test were
disk -> core log converts. I didn't notice anything odd in the syslog. Even
though there are other open BZs regarding stuck sync percents, none are
currently in a NEW or ASSIGNED state, and all include some type of error which
I didn't see here. The only issue I noticed this time was the stuck sync
percents.

[root@hayes-03 ~]# date
Tue May 6 16:05:55 CDT 2008
[root@hayes-03 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  LogVol00      VolGroup00  -wi-ao  72.44G
  LogVol01      VolGroup00  -wi-ao   1.94G
  hayes-01.9776 lock_stress mwi-a- 500.00M                                        14.40
  hayes-01.9778 lock_stress mwi-a- 500.00M                                        95.20
  hayes-01.9784 lock_stress mwi-a- 500.00M                                        93.60
  hayes-02.9782 lock_stress mwi-a- 500.00M                                        18.40
  hayes-02.9788 lock_stress mwi-a- 500.00M             hayes-02.9788_mlog        100.00
  hayes-02.9792 lock_stress mwi-a- 500.00M                                        94.40
  hayes-03.9780 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9786 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9790 lock_stress mwi-a- 500.00M                                        96.00

[root@hayes-03 ~]# date
Tue May 6 16:40:43 CDT 2008
[root@hayes-03 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log                Copy%  Convert
  LogVol00      VolGroup00  -wi-ao  72.44G
  LogVol01      VolGroup00  -wi-ao   1.94G
  hayes-01.9776 lock_stress mwi-a- 500.00M                                        14.40
  hayes-01.9778 lock_stress mwi-a- 500.00M                                        95.20
  hayes-01.9784 lock_stress mwi-a- 500.00M                                        93.60
  hayes-02.9782 lock_stress mwi-a- 500.00M                                        18.40
  hayes-02.9788 lock_stress mwi-a- 500.00M             hayes-02.9788_mlog        100.00
  hayes-02.9792 lock_stress mwi-a- 500.00M                                        94.40
  hayes-03.9780 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9786 lock_stress mwi-a- 500.00M                                       100.00
  hayes-03.9790 lock_stress mwi-a- 500.00M                                        96.00

Here are the last cmds run on all 3 nodes in the cluster:

May 6 15:23:33 hayes-01 qarshd[2409]: Running cmdline: lvs lock_stress/hayes-01.9784 --noheadings -o copy_percent | awk {'print $1'}
May 6 15:23:36 hayes-01 lvm[3537]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:36 hayes-01 lvm[3537]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:46 hayes-01 lvm[3537]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:47 hayes-01 lvm[3537]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:55 hayes-01 lvm[3537]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:23:56 hayes-01 lvm[3537]: Monitoring mirror device lock_stress-hayes--02.9782 for events
[...]

May 6 15:23:48 hayes-02 qarshd[5435]: Running cmdline: lvs lock_stress/hayes-02.9792 --noheadings -o copy_percent | awk {'print $1'}
May 6 15:23:51 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:23:54 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:23:58 hayes-02 qarshd[5110]: Sending child 5113 signal 2
May 6 15:23:58 hayes-02 qarshd[3852]: Sending child 3854 signal 2
May 6 15:23:58 hayes-02 qarshd[5435]: Sending child 5436 signal 2
May 6 15:24:08 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:24:11 hayes-02 lvm[3375]: lock_stress-hayes--02.9792 is now in-sync
May 6 15:24:11 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:24:13 hayes-02 xinetd[2735]: EXIT: qarsh status=0 pid=5110 duration=70(sec)
May 6 15:24:13 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--03.9786 for events
May 6 15:24:16 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--03.9786 for events
May 6 15:24:23 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--03.9786 for events
May 6 15:24:25 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--03.9786 for events
May 6 15:24:26 hayes-02 lvm[3375]: No longer monitoring mirror device lock_stress-hayes--03.9790 for events
May 6 15:24:27 hayes-02 lvm[3375]: Monitoring mirror device lock_stress-hayes--03.9790 for events
[...]

May 6 15:23:09 hayes-03 qarshd[31127]: Running cmdline: lvconvert -m 1 --corelog lock_stress/hayes-03.9780
May 6 15:23:10 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9778 for events
May 6 15:23:11 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9778 for events
May 6 15:23:12 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9776 for events
May 6 15:23:12 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9776 for events
May 6 15:23:21 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9776 for events
May 6 15:23:22 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9776 for events
May 6 15:23:23 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9784 for events
May 6 15:23:24 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9784 for events
May 6 15:23:34 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--01.9784 for events
May 6 15:23:36 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--01.9784 for events
May 6 15:23:38 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:40 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:48 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:51 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--02.9792 for events
May 6 15:23:55 hayes-03 lvm[3480]: No longer monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:23:59 hayes-03 lvm[3480]: Monitoring mirror device lock_stress-hayes--02.9782 for events
May 6 15:24:04 hayes-03 qarshd[31127]: Sending child 31130 signal 2
May 6 15:24:04 hayes-03 qarshd[31125]: Sending child 31128 signal 2
May 6 15:24:04 hayes-03 qarshd[31126]: Sending child 31129 signal 2
[...]

What this test does is create different mirrors on different machines in the
cluster (in this case 3 mirrors per node) and then do the following:
* down convert
* up convert
* core log convert
* disk log convert
* deactivate
* reactivate
* deactivate
* delete

Version-Release number of selected component (if applicable):
2.6.18-92.el5

[root@hayes-03 ~]# rpm -qi lvm2-cluster
Name        : lvm2-cluster                     Relocations: (not relocatable)
Version     : 2.02.32                          Vendor: Red Hat, Inc.
Release     : 4.el5                            Build Date: Wed 02 Apr 2008 03:56:50 AM CDT
Install Date: Mon 05 May 2008 11:27:54 AM CDT  Build Host: hs20-bc2-3.build.redhat.com
[root@hayes-03 ~]# rpm -qi kmod-cmirror
Name        : kmod-cmirror                     Relocations: (not relocatable)
Version     : 0.1.8                            Vendor: Red Hat, Inc.
Release     : 1.el5                            Build Date: Thu 28 Feb 2008 01:01:13 PM CST
Install Date: Mon 05 May 2008 12:01:21 PM CDT  Build Host: ls20-bc1-14.build.redhat.com
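The per-mirror operation cycle listed above can be sketched as a shell loop. This is a dry-run sketch, not the actual test harness: the VG/LV names are taken from the lvs output for illustration, and RUN echoes commands instead of executing them (drop the echo to run for real).

```shell
#!/bin/sh
# Dry-run sketch of the cmirror_lock_stress per-mirror cycle.
# VG/LV names are illustrative; RUN echoes rather than executes.
VG=lock_stress
LV=hayes-01.9776
RUN() { echo "$@"; }   # replace body with "$@" to actually execute

RUN lvconvert -m 0 $VG/$LV             # down convert to linear
RUN lvconvert -m 1 $VG/$LV             # up convert back to a mirror
RUN lvconvert -m 1 --corelog $VG/$LV   # switch to a core (in-memory) log
RUN lvconvert -m 1 $VG/$LV             # switch back to a disk log
RUN lvchange -an $VG/$LV               # deactivate
RUN lvchange -ay $VG/$LV               # reactivate
RUN lvchange -an $VG/$LV               # deactivate
RUN lvremove -f $VG/$LV                # delete
```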
Reproduced this, and found that not only can the sync percents be stuck, they
can also differ from node to node.

[root@hayes-03 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  hayes-01.4588 lock_stress mwi-a- 500.00M                         96.80
  hayes-01.4592 lock_stress mwi-a- 500.00M                         11.20
  hayes-01.4594 lock_stress mwi-a- 500.00M                        100.00
  hayes-02.4582 lock_stress mwi-a- 500.00M                         94.40
  hayes-02.4586 lock_stress mwi-a- 500.00M                         96.80
  hayes-02.4590 lock_stress mwi-a- 500.00M                         17.60
  hayes-03.4584 lock_stress mwi-a- 500.00M                        100.00
  hayes-03.4596 lock_stress mwi-a- 500.00M                          8.80
  hayes-03.4598 lock_stress mwi-a- 500.00M                         24.80

[root@hayes-02 ~]# lvs
  LV            VG          Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  hayes-01.4588 lock_stress mwi-a- 500.00M                         93.60
  hayes-01.4592 lock_stress mwi-a- 500.00M                         16.00
  hayes-01.4594 lock_stress mwi-a- 500.00M                         94.40
  hayes-02.4582 lock_stress mwi-a- 500.00M                        100.00
  hayes-02.4586 lock_stress mwi-a- 500.00M                         95.20
  hayes-02.4590 lock_stress mwi-a- 500.00M                         16.00
  hayes-03.4584 lock_stress mwi-a- 500.00M                         96.80
  hayes-03.4596 lock_stress mwi-a- 500.00M                         13.60
  hayes-03.4598 lock_stress mwi-a- 500.00M                         18.40
What is the build date for the cmirror rpm you are using? You mention
kmod-cmirror, but not cmirror.
[root@hayes-02 ~]# rpm -qi cmirror
Name        : cmirror                          Relocations: (not relocatable)
Version     : 1.1.15                           Vendor: Red Hat, Inc.
Release     : 1.el5                            Build Date: Thu 28 Feb 2008 01:04:29 PM CST
Install Date: Mon 05 May 2008 12:01:16 PM CDT  Build Host: hs20-bc1-7.build.redhat.com
Jon and I have narrowed this down to just needing to create 8 mirrors. Doing so
usually leaves one of the mirrors' copy percent stuck at something less than
100%.

2.6.18-92.1.5.el5
lvm2-2.02.39-2.el5           BUILT: Wed Jul 9 07:26:29 CDT 2008
lvm2-cluster-2.02.39-1.el5   BUILT: Thu Jul 3 09:31:57 CDT 2008
device-mapper-1.02.27-1.el5  BUILT: Thu Jul 3 03:22:29 CDT 2008
cmirror-1.1.19-2.el5         BUILT: Tue Jul 8 11:15:54 CDT 2008
kmod-cmirror-0.1.10-1.el5    BUILT: Tue May 20 14:55:48 CDT 2008
  sync_check_1 mirror_sanity mwi-a- 1.00G sync_check_1_mlog 100.00
  sync_check_2 mirror_sanity mwi-a- 1.00G sync_check_2_mlog 100.00
  sync_check_3 mirror_sanity mwi-a- 1.00G sync_check_3_mlog 100.00
  sync_check_4 mirror_sanity mwi-a- 1.00G sync_check_4_mlog 100.00
  sync_check_5 mirror_sanity mwi-a- 1.00G sync_check_5_mlog 100.00
  sync_check_6 mirror_sanity mwi-a- 1.00G sync_check_6_mlog 100.00
  sync_check_7 mirror_sanity mwi-a- 1.00G sync_check_7_mlog 100.00
  sync_check_8 mirror_sanity mwi-a- 1.00G sync_check_8_mlog  40.23
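The reduced reproducer amounts to a loop like the following. This is a dry-run sketch under assumptions: the VG name and sync_check_N names come from the lvs output above, and RUN echoes instead of executing (the final line shows the copy-percent check the harness ran, as seen in the qarshd syslog entries earlier).

```shell
#!/bin/sh
# Dry-run sketch of the reduced reproducer: create 8 mirrors, then
# poll Copy% until each reaches 100 (one usually gets stuck below).
VG=mirror_sanity
RUN() { echo "$@"; }   # replace body with "$@" to actually execute

i=1
while [ $i -le 8 ]; do
    RUN lvcreate -m 1 -n sync_check_$i -L 1G $VG
    i=$((i + 1))
done

# Copy-percent check per mirror (the command the harness ran):
RUN "lvs $VG/sync_check_1 --noheadings -o copy_percent | awk '{print \$1}'"
```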
I appear to have hit this on hayes-01 by just doing a simple upconvert:

Create a mirror and then attempt to up convert it:
lvcreate -m 1 -n mirror_up_converts -L 100M --corelog mirror_sanity

Upconvert to 2 redundant legs:
lvconvert -m 2 --corelog /dev/mirror_sanity/mirror_up_converts

  mirror_up_converts mirror_sanity mwi-a- 100.00M 4.00

Copy percent is stuck at 4% now, and the process is looping with the following
output I'll attach...
Created attachment 311963 [details]
strace output

The strace prints the following over and over...
I think upconverts are broken on cluster mirrors. It seems that lvm is not properly suspending the first mirror before activating the new one... leaving two different mirrors with overlapping devices active. (That's bad.)
Pretty sure this should fix the bug... at the very least, it will fix several
problems:

commit 613d97438673200c87e4b07e3c4ee659c01acf65
Author: Jonathan Brassow <jbrassow>
Date:   Wed Jul 23 15:12:24 2008 -0500

    dm-log-clustered: Fix bug 445456

    I was resetting some common memory outside of the protection of a lock.
Fix verified with the latest code.

2.6.18-98.el5
lvm2-2.02.32-4.el5           BUILT: Fri Apr 4 06:15:19 CDT 2008
lvm2-cluster-2.02.32-4.el5   BUILT: Wed Apr 2 03:56:50 CDT 2008
device-mapper-1.02.24-1.el5  BUILT: Thu Jan 17 16:46:05 CST 2008
cmirror-1.1.22-1.el5         BUILT: Thu Jul 24 15:59:03 CDT 2008
kmod-cmirror-0.1.13-2.el5    BUILT: Thu 24 Jul 2008 04:00:48 PM CD
An advisory has been issued which should help the problem described in this bug
report. This report is therefore being closed with a resolution of ERRATA. For
more information on the solution and/or where to find the updated files, please
follow the link below. You may reopen this bug report if the solution does not
work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html