Bug 464550
Summary: RHEL5 cmirror tracker: copy percentage can get stuck during sync attempt (creation/conversion)

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | cmirror |
| Version | 5.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Target Milestone | rc |
| Target Release | --- |
| Reporter | Corey Marthaler <cmarthal> |
| Assignee | Jonathan Earl Brassow <jbrassow> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | agk, ccaulfie, dwysocha, edamato, heinzm, mbroz, pep, syeghiay |
| Doc Type | Bug Fix |
| Last Closed | 2009-01-20 21:25:12 UTC |
Description (Corey Marthaler, 2008-09-29 15:13:36 UTC)
FYI - a reactivation got the sync moving again.

There's an issue, which appears to be this bz, where the sync can get stuck for 10+ minutes (maybe even hours) when it's almost finished.

```
Creating mirror(s) on taft-04...
taft-04: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000
taft-04: lvcreate --corelog -m 1 -n syncd_secondary_core_2legs_2 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdf1:0-1000

Waiting until all mirrors become fully syncd...
0/2 mirror(s) are fully synced: ( 1=8.00% 2=1.58% )
0/2 mirror(s) are fully synced: ( 1=42.17% 2=38.25% )
0/2 mirror(s) are fully synced: ( 1=75.50% 2=74.08% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
[...]
1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
1/2 mirror(s) are fully synced: ( 1=100.00% 2=99.67% )
After 10 minutes the mirror(s) are still not in sync
```

This happened last night, but by the time I checked on the mirrors this morning they were actually fully synced. I'll have to bump up the sync wait time in my test, but this bz may end up blocking me from being able to verify the corruption bugs. This doesn't appear reproducible with single-machine mirrors.

```
commit 7d0f35186a7b2209f77424e2e1117cb5fc3803f3
Author: Jonathan Brassow <jbrassow>
Date:   Thu Oct 2 17:19:20 2008 -0500

    clogd: Fix for bug 464550 - sync can get stuck after creation

    Nodes that get their sync/clean bits from checkpoints were being
    allowed to alter them in a resume - something only the master node
    should be allowed to do.
```

*** Bug 464933 has been marked as a duplicate of this bug. ***

I have been unable to reproduce this sync hang after creation during the small amount of testing that I've done with the latest fix (cmirror-1.1.29-1.el5). However, there are still issues with up-convert syncs getting stuck, as well as pvmoves (bz 464933) still getting stuck, though no longer every single time. So it's clear that there is still a bug lurking around. I'll attempt to reproduce this and gather more info.
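For context, the "Waiting until all mirrors become fully syncd" lines above come from the sts test harness. A minimal stand-alone equivalent of that wait loop might look like the following sketch (not the harness code; the VG/LV names, 15-second poll interval, and 10-minute timeout are illustrative):

```bash
#!/bin/bash
# Sketch: poll each mirror's copy percent via lvs until all reach 100.00
# or a deadline passes (loosely mimics the harness output above).
vg=helter_skelter
lvs_to_check="syncd_secondary_core_2legs_1 syncd_secondary_core_2legs_2"
deadline=$((SECONDS + 600))

while [ "$SECONDS" -lt "$deadline" ]; do
    synced=0; total=0
    for lv in $lvs_to_check; do
        total=$((total + 1))
        pct=$(lvs --noheadings -o copy_percent "$vg/$lv" | tr -d ' ')
        [ "$pct" = "100.00" ] && synced=$((synced + 1))
    done
    echo "$synced/$total mirror(s) are fully synced"
    [ "$synced" -eq "$total" ] && exit 0
    sleep 15
done
echo "mirror(s) still not in sync after timeout" >&2
exit 1
```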
Here are the cmds that caused the sync to get stuck:

```
Oct  6 15:36:32 taft-03 qarshd[14462]: Running cmdline: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_1 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdh1:0-150
Oct  6 15:36:35 taft-03 qarshd[14527]: Running cmdline: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_2 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdh1:0-150

syncd_secondary_2legs_1            helter_skelter mwi-ao 600.00M syncd_secondary_2legs_1_mlog   0.00 syncd_secondary_2legs_1_mimage_0(0),syncd_secondary_2legs_1_mimage_1(0)
[syncd_secondary_2legs_1_mimage_0] helter_skelter Iwi-ao 600.00M /dev/sde1(0)
[syncd_secondary_2legs_1_mimage_1] helter_skelter Iwi-ao 600.00M /dev/sdg1(0)
[syncd_secondary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M /dev/sdh1(0)
syncd_secondary_2legs_2            helter_skelter mwi-ao 600.00M syncd_secondary_2legs_2_mlog 100.00 syncd_secondary_2legs_2_mimage_0(0),syncd_secondary_2legs_2_mimage_1(0)
[syncd_secondary_2legs_2_mimage_0] helter_skelter iwi-ao 600.00M /dev/sde1(150)
[syncd_secondary_2legs_2_mimage_1] helter_skelter iwi-ao 600.00M /dev/sdg1(150)
[syncd_secondary_2legs_2_mlog]     helter_skelter lwi-ao   4.00M

helter_skelter-syncd_secondary_2legs_2: 0 1228800 mirror 2 253:7 253:8 1200/1200 1 AA 3 clustered_disk 253:6 A
helter_skelter-syncd_secondary_2legs_1: 0 1228800 mirror 2 253:3 253:4 4/1200 1 AA 3 clustered_disk 253:2 A
```

Here's info from the stuck pvmove:

```
Oct  6 15:33:38 grant-03 kernel: device-mapper: dm-log-clustered: [NMtewrfC] Request timed out: [DM_CLOG_GET_RESYNC_WORK/88695] - retrying
Oct  6 16:55:35 grant-03 clogd[9684]: cluster_queue: 1
Oct  6 16:55:35 grant-03 clogd[9684]:   1) DM_CLOG_GET_RESYNC_WORK, originator = 0
Oct  6 16:55:35 grant-03 clogd[9684]: free_queue : 99
Oct  6 16:55:35 grant-03 clogd[9684]: Official log list:
Oct  6 16:55:35 grant-03 clogd[9684]: LVM-ShXf5vAJ7evNcZERSDEWOBNfTDb48asFw5i1zsZr4UAjTg588qFEbxtFNMtewrfC
Oct  6 16:55:35 grant-03 clogd[9684]:   sync_bits:
Oct  6 16:55:35 grant-03 clogd[9684]:   [  0 -  15] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [ 16 -  31] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [ 32 -  47] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [ 48 -  63] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [ 64 -  79] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [ 80 -  95] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [ 96 - 111] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF
Oct  6 16:55:35 grant-03 clogd[9684]:   [112 - 127] FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF

[root@grant-03 ~]# lvs -a -o +devices
  LV        VG         Attr   LSize  Origin Snap% Move      Log Copy% Convert Devices
  LogVol00  VolGroup00 -wi-ao 72.44G                                          /dev/sda2(0)
  LogVol01  VolGroup00 -wi-ao  1.94G                                          /dev/sda2(2318)
  linear    pv_shuffle -wi---  5.00G                                          /dev/sdb3(0)
  [pvmove0] pv_shuffle p-C-ao  1.67G              /dev/sdc2     34.66         /dev/sdc2(0),/dev/sdc1(0)
  stripe    pv_shuffle -wI-a-  5.00G                                          pvmove0(0),/dev/sdb1(0),/dev/sdc4(0)
```

A fairly easy way to reproduce the issue of the stuck pvmove is to run pv_shuffle (a pvmove test located in sts). I've eventually hit this every time I've run that test.
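Reading the dmsetup status lines above against the dm mirror target's status format (this breakdown is my annotation, not bz text): the `N/1200` field is regions-in-sync over total regions, the `AA` string carries one health character per mirror leg ('A' = alive), and `clustered_disk 253:X A` names the log type, log device, and log health. The stuck mirror is therefore visible at a glance:

```
helter_skelter-syncd_secondary_2legs_2: ... 1200/1200 1 AA 3 clustered_disk 253:6 A   <-- fully synced
helter_skelter-syncd_secondary_2legs_1: ...    4/1200 1 AA 3 clustered_disk 253:2 A   <-- stuck: only 4 of 1200 regions in sync
```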
```
commit 027ab643fa43ab1b97724b5702f55bcbe619b12e
Author: Jonathan Brassow <jbrassow>
Date:   Wed Oct 8 16:11:46 2008 -0500

    clogd: Fix one cause of bug 464550 - copy percentage can get stuck during

    If the cluster log gets a request from the kernel to shutdown
    immediately after it was created, then it must perform some clean-up.
    This covers the case where it is told to shutdown after it has
    requested a checkpoint, but before it has read/unlinked it.
```

This does appear to be fixed in cmirror-1.1.30-1.el5; however, I'm going to let the tests run overnight before marking this VERIFIED.

There is still an issue with copy percents getting stuck. Although it's no longer as easily reproducible, I was still able to hit this while running both the helter_skelter and pv_shuffle tests over the weekend. In the helter_skelter case it was just doing a mirror creation of 3 mirrors, and the 2nd one didn't get fully synced.

```
================================================================================
Iteration 26.1 started at Fri Oct 10 06:50:59 CDT 2008
================================================================================
Scenario: Kill primary leg of synced 2 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:         syncd_primary_2legs
* sync:         1
* num mirrors:  3
* disklog:      /dev/sdg1
* failpv:       /dev/sde1
* leg devices:  /dev/sde1 /dev/sdh1
************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 1 -n syncd_primary_2legs_1 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150
taft-04: lvcreate -m 1 -n syncd_primary_2legs_2 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150
taft-04: lvcreate -m 1 -n syncd_primary_2legs_3 -L 600M helter_skelter /dev/sde1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-150

Waiting until all mirrors become fully syncd...
0/3 mirror(s) are fully synced: ( 1=31.75% 2=0.08% 3=1.75% )
0/3 mirror(s) are fully synced: ( 1=65.17% 2=0.08% 3=36.25% )
0/3 mirror(s) are fully synced: ( 1=98.58% 2=0.08% 3=70.33% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
[...]
2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=0.08% 3=100.00% )
After 60 minutes the mirror(s) are still not in sync
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.

helter_skelter-syncd_primary_2legs_2: 0 1228800 mirror 2 253:7 253:8 1/1200 1 AA 3 clustered_disk 253:6 A

[root@taft-03 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms
2.6.18-117.el5
lvm2-2.02.40-4.el5            BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.30-1.el5          BUILT: Wed Oct  8 18:18:36 CDT 2008
kmod-cmirror-0.1.18-1.el5     BUILT: Mon Sep 29 16:20:21 CDT 2008
```

The pv_shuffle test case got stuck as well, although it looks like all of lvm may be stuck in that case.

```
Adding /dev/sdc1 to the volume group on grant-03
Moving data from /dev/sdb1 to /dev/sdc1 on grant-03
  Finding volume group "pv_shuffle"
  Executing: /sbin/modprobe dm-log-clustered
  Archiving volume group "pv_shuffle" metadata (seqno 54).
  Creating logical volume pvmove0
  Moving 427 extents of logical volume pv_shuffle/stripe
  Updating volume group metadata
  Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 55).
[STUCK]

[root@grant-01 ~]# lvs -a -o +devices
  LV        VG         Attr   LSize  Origin Snap% Move      Log Copy% Convert Devices
  LogVol00  VolGroup00 -wi-ao 72.44G                                          /dev/sda2(0)
  LogVol01  VolGroup00 -wi-ao  1.94G                                          /dev/sda2(2318)
  linear    pv_shuffle -wi---  5.00G                                          /dev/sdb2(0)
  [pvmove0] pv_shuffle p-C-ao  1.67G              /dev/sdc3     84.78         /dev/sdc3(0),/dev/sdb3(0)
  stripe    pv_shuffle -wI-a-  5.00G                                          /dev/sdb1(0),pvmove0(0),/dev/sdc4(0)
[root@grant-01 ~]# lvs -a -o +devices
[HUNG]

device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84685] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84686] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84687] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84688] - retrying
device-mapper: dm-log-clustered: [y2kIeQfB] Request timed out: [DM_CLOG_RESUME/84689] - retrying

2.6.18-117.el5
lvm2-2.02.40-4.el5            BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.30-1.el5          BUILT: Wed Oct  8 18:18:36 CDT 2008
kmod-cmirror-0.1.18-1.el5     BUILT: Mon Sep 29 16:20:21 CDT 2008
```

Both helter_skelter and pv_shuffle failed with sync issues with the latest rpms.

```
2.6.18-117.el5
lvm2-2.02.40-4.el5            BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.31-1.el5          BUILT: Tue Oct 14 14:14:55 CDT 2008
kmod-cmirror-0.1.19-1.el5     BUILT: Tue Oct 14 14:17:45 CDT 2008
```

The blocker flag needs to get set for this BZ.

Stumbled upon this with just a simple create (lvcreate -m 1 -n mirror -L 1G grant); it ended up stuck at 12.50%.

```
Oct 15 10:00:28 grant-03 kernel: device-mapper: dm-log-clustered: [HlWcfNdu] Request timed out: [DM_CLOG_GET_RESYNC_WORK/1519] - retrying

mirror            grant mwi-a- 1.00G mirror_mlog 12.50 mirror_mimage_0(0),mirror_mimage_1(0)
[mirror_mimage_0] grant Iwi-ao 1.00G /dev/sdb1(0)
[mirror_mimage_1] grant Iwi-ao 1.00G /dev/sdb2(0)
[mirror_mlog]     grant lwi-ao 4.00M /dev/sdc4(0)
```

```
commit e07369b28d7a569e742d80152ef10c9d42bc2650
Author: Jonathan Brassow <jbrassow>
Date:   Mon Oct 20 12:27:08 2008 -0500

    clogd: Additional fixes for bug 460156 and 464550

    - Fixed a checkpoint ordering issue, where a checkpoint could be
      populated with data from the wrong moment in time.
    - Fixed problem where resending requests when a server leaves would
      include requests that should have been received after the server had
      gone (so no resend should have been necessary).
```

Appeared to have hit this again after 48 iterations of pv_shuffle.

```
Iteration 49 of 10000 started on hayes-03 at Tue Oct 21 07:21:10 CDT 2008
FREE PVS: /dev/etherd/e1.1p1 /dev/etherd/e1.1p6
INUSE PVS: /dev/etherd/e1.1p2 /dev/etherd/e1.1p4 /dev/etherd/e1.1p5 /dev/etherd/e1.1p7
NOT INUSE PVS: /dev/etherd/e1.1p3
Adding /dev/etherd/e1.1p6 to the volume group on hayes-03
Moving data from /dev/etherd/e1.1p2 to /dev/etherd/e1.1p6 on hayes-03
  Finding volume group "pv_shuffle"
  Executing: /sbin/modprobe dm-log-clustered
  Archiving volume group "pv_shuffle" metadata (seqno 244).
  Creating logical volume pvmove0
  Moving 1280 extents of logical volume pv_shuffle/linear
  Updating volume group metadata
  Creating volume group backup "/etc/lvm/backup/pv_shuffle" (seqno 245).
Checking progress every 15 seconds
[STUCK]
```

lvs shows it's stuck at 66%:

```
[pvmove0] pv_shuffle p-C-ao 5.00G /dev/etherd/e1.1p2 66.09 /dev/etherd/e1.1p2(0),/dev/etherd/e1.1p6(0)
stripe    pv_shuffle -wi--- 5.00G /dev/etherd/e1.1p4(0),/dev/etherd/e1.1p7(0),/dev/etherd/e1.1p5(0)
```

There were a bunch of time out messages as well.

```
device-mapper: dm-log-clustered: [4l2zWeXM] Request timed out: [DM_CLOG_FLUSH/780360] - retrying
device-mapper: dm-log-clustered: [4l2zWeXM] Request timed out: [DM_CLOG_GET_RESYNC_WORK/803642] - retrying
```

I'll try this again with the newly built lvm2 rpm, but for now marking fails_qa since these are the latest cmirror[-kern] rpms.

```
2.6.18-117.el5
lvm2-2.02.40-4.el5            BUILT: Mon Oct  6 05:21:53 CDT 2008
lvm2-cluster-2.02.40-4.el5    BUILT: Mon Oct  6 05:23:25 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.32-1.el5          BUILT: Mon Oct 20 14:15:12 CDT 2008
kmod-cmirror-0.1.20-1.el5     BUILT: Mon Oct 20 14:13:37 CDT 2008
```

Two things to check:

1) Is clogd running on all machines ('ps -C clogd')? If not, then there is probably a core dump (I'll need that).
2) If clogd is running, it could be stuck in an infinite loop - 'gdb clogd <clogd pid>' will tell you pretty quickly if it is in a loop.

(A scripted version of these two checks is sketched after the comments below.)

I'm pretty certain that this bug is a result of the list corruption issue that I am seeing (bug 468438). I'm gonna mark this bug MODIFIED (rather than NEEDINFO), because I think this bug has been fixed by:

```
commit cc5877e65fad20dd8657881a7ca7361e6e4c08bf
Author: Jonathan Brassow <jbrassow>
Date:   Fri Oct 24 13:42:06 2008 -0500

    clogd: Fix for bug 468438 - list corruption

    'commit e07369b28d7a569e742d80152ef10c9d42bc2650' introduced the
    concept of a delay queue to hold requests while membership changes
    occurred. Sometimes, a request would be added to the delay_queue /and/
    the cluster_queue, resulting in list corruption. Depending on how the
    list was corrupted, infinite loops could occur, or requests could
    simply be lost.
```

I'm now seeing an issue where the copy percent goes over 100%. I'll file a new BZ for that since it may be a new/different issue than this one, though it does still appear that there are sync issues like this BZ implies.

```
syncd_log_2legs_1 helter_skelter mwi-ao 600.00M syncd_log_2legs_1_mlog 100.00
syncd_log_2legs_2 helter_skelter mwi-ao 600.00M syncd_log_2legs_2_mlog 100.00
syncd_log_2legs_3 helter_skelter mwi-ao 600.00M syncd_log_2legs_3_mlog 101.33
```

The bz filed for the issue in comment #25 is 468911.

Hit this again last night.

```
================================================================================
Iteration 5.4 started at Wed Oct 29 15:52:03 CDT 2008
================================================================================
Scenario: Kill secondary leg of synced 3 leg mirror(s)

****** Mirror hash info for this scenario ******
* name:         syncd_secondary_3legs
* sync:         1
* num mirrors:  3
* disklog:      /dev/sde1
* failpv:       /dev/sdh1
* leg devices:  /dev/sdf1 /dev/sdh1 /dev/sdg1
************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 2 -n syncd_secondary_3legs_1 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150
taft-04: lvcreate -m 2 -n syncd_secondary_3legs_2 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150
taft-04: lvcreate -m 2 -n syncd_secondary_3legs_3 -L 600M helter_skelter /dev/sdf1:0-1000 /dev/sdh1:0-1000 /dev/sdg1:0-1000 /dev/sde1:0-150

Waiting until all mirrors become fully syncd...
0/3 mirror(s) are fully synced: ( 1=14.75% 2=5.92% 3=0.08% )
0/3 mirror(s) are fully synced: ( 1=43.92% 2=35.58% 3=0.08% )
0/3 mirror(s) are fully synced: ( 1=73.75% 2=65.08% 3=0.08% )
1/3 mirror(s) are fully synced: ( 1=100.00% 2=96.67% 3=0.08% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
2/3 mirror(s) are fully synced: ( 1=100.00% 2=100.00% 3=0.08% )
After 60 minutes the mirror(s) are still not in sync
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.

[root@taft-01 ~]# lvs
  LV                      VG             Attr   LSize   Origin Snap% Move Log                          Copy%  Convert
  LogVol00                VolGroup00     -wi-ao  66.19G
  LogVol01                VolGroup00     -wi-ao   1.94G
  syncd_secondary_3legs_1 helter_skelter mwi-a- 600.00M             syncd_secondary_3legs_1_mlog 100.00
  syncd_secondary_3legs_2 helter_skelter mwi-a- 600.00M             syncd_secondary_3legs_2_mlog 100.00
  syncd_secondary_3legs_3 helter_skelter mwi-a- 600.00M             syncd_secondary_3legs_3_mlog   0.00

[root@taft-01 ~]# dmsetup status
helter_skelter-syncd_secondary_3legs_2: 0 1228800 mirror 3 253:8 253:9 253:10 1200/1200 1 AAA 3 clustered_disk 253:7 A
helter_skelter-syncd_secondary_3legs_3_mimage_1: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_2_mimage_2: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_1: 0 1228800 mirror 3 253:3 253:4 253:5 1200/1200 1 AAA 3 clustered_disk 253:2 A
helter_skelter-syncd_secondary_3legs_3_mimage_0: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_2_mimage_1: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_1_mimage_2: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_2_mimage_0: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_1_mimage_1: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_1_mimage_0: 0 1228800 linear
helter_skelter-syncd_secondary_3legs_1_mlog: 0 8192 linear
helter_skelter-syncd_secondary_3legs_2_mlog: 0 8192 linear
helter_skelter-syncd_secondary_3legs_3_mlog: 0 8192 linear
VolGroup00-LogVol01: 0 4063232 linear
VolGroup00-LogVol00: 0 138805248 linear
helter_skelter-syncd_secondary_3legs_3: 0 1228800 mirror 3 253:13 253:14 253:15 1/1200 1 AAA 3 clustered_disk 253:12 A
helter_skelter-syncd_secondary_3legs_3_mimage_2: 0 1228800 linear
```

clogd is still running on all nodes in the cluster.

```
2.6.18-117.el5
lvm2-2.02.40-6.el5            BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.33-1.el5          BUILT: Mon Oct 27 14:10:34 CDT 2008
kmod-cmirror-0.1.20-1.el5     BUILT: Mon Oct 20 14:13:37 CDT 2008
```

Just a note in addition to comment #27... the pv_shuffle tests pass now without issue. I ran it over the weekend (almost 600 iterations) without any problems. So although there is still a reproducible issue here with mirror syncing, it is much better than before the fix that went in in comment #22.

```
commit 4dc9f187336bbee91872e613a6b9614a358228de
Author: Jonathan Brassow <jbrassow>
Date:   Thu Oct 30 13:45:42 2008 -0500

    clogd: Additional fixes for bug 464550 + improved debugging

    Needed to reset the 'recoverer' when the mirror log server was
    suspending. Otherwise, sync work could stop.

    Also added signal triggered debugging output. This gives the testing
    folks an easy way to get useful information when a bug is hit.
```

I'm still seeing this with the latest builds, both while running helter_skelter and mirror_sanity.
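As an aside, the two clogd checks suggested a few comments up (is clogd running on every node; if so, is it spinning in a loop) are easy to script. A minimal sketch, not sts harness code; the batch-mode gdb invocation is my own choice:

```bash
#!/bin/bash
# Sketch of the suggested clogd checks (illustrative, not from the bz).
# 1) Is clogd running? If not, there is probably a core dump to collect.
if ! ps -C clogd -o pid,etime,cmd; then
    echo "clogd is not running - look for a core file" >&2
    exit 1
fi

# 2) If it is running, grab a backtrace; a clogd stuck in an infinite
#    loop shows up quickly as the same stack on repeated attaches.
gdb -p "$(pidof clogd)" -batch -ex 'bt'
```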
Judging by the following test case taking longer to complete, it also appears that mirror syncing is slower now.

MIRROR_SANITY:

```
SCENARIO - [verify_sync_completions]
Create 8 mirrors and verify that their copy percents complete
hayes-01: lvcreate -m 1 -n sync_check_1 -L 1G mirror_sanity
hayes-03: lvcreate -m 1 -n sync_check_2 -L 1G mirror_sanity
hayes-01: lvcreate -m 1 -n sync_check_3 -L 1G mirror_sanity
hayes-02: lvcreate -m 1 -n sync_check_4 -L 1G mirror_sanity
hayes-02: lvcreate -m 1 -n sync_check_5 -L 1G mirror_sanity
hayes-03: lvcreate -m 1 -n sync_check_6 -L 1G mirror_sanity
hayes-02: lvcreate -m 1 -n sync_check_7 -L 1G mirror_sanity
hayes-02: lvcreate -m 1 -n sync_check_8 -L 1G mirror_sanity

Waiting until all mirrors become fully syncd...
1/8 mirror(s) are fully synced: ( 1=100.00% 2=6.59% 3=5.52% 4=0.05% 5=2.34% 6=39.94% 7=15.33% 8=0.05% )
1/8 mirror(s) are fully synced: ( 1=100.00% 2=7.37% 3=6.64% 4=0.05% 5=3.17% 6=50.78% 7=26.12% 8=1.03% )
[...]
3/8 mirror(s) are fully synced: ( 1=100.00% 2=60.60% 3=66.11% 4=0.05% 5=57.62% 6=100.00% 7=100.00% 8=55.62% )
3/8 mirror(s) are fully synced: ( 1=100.00% 2=61.52% 3=67.14% 4=0.05% 5=58.59% 6=100.00% 7=100.00% 8=56.69% )
After 10 minutes the mirror(s) are still not in sync

[root@hayes-01 ~]# lvs
  LV           VG            Attr   LSize  Origin Snap% Move Log               Copy%  Convert
  LogVol00     VolGroup00    -wi-ao 72.44G
  LogVol01     VolGroup00    -wi-ao  1.94G
  sync_check_1 mirror_sanity mwi-a-  1.00G             sync_check_1_mlog 100.00
  sync_check_2 mirror_sanity mwi-a-  1.00G             sync_check_2_mlog 100.00
  sync_check_3 mirror_sanity mwi-a-  1.00G             sync_check_3_mlog 100.00
  sync_check_4 mirror_sanity mwi-a-  1.00G             sync_check_4_mlog   0.00
  sync_check_5 mirror_sanity mwi-a-  1.00G             sync_check_5_mlog 100.00
  sync_check_6 mirror_sanity mwi-a-  1.00G             sync_check_6_mlog 100.00
  sync_check_7 mirror_sanity mwi-a-  1.00G             sync_check_7_mlog 100.00
  sync_check_8 mirror_sanity mwi-a-  1.00G             sync_check_8_mlog 100.00

[root@hayes-01 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms
2.6.18-117.el5
lvm2-2.02.40-6.el5            BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.34-5.el5          BUILT: Thu Nov  6 15:10:44 CST 2008
kmod-cmirror-0.1.21-2.el5     BUILT: Thu Nov  6 14:12:07 CST 2008
```

HELTER_SKELTER:

```
Enabling device sde on taft-01
Enabling device sde on taft-02
Enabling device sde on taft-03
Enabling device sde on taft-04

Recreating PV /dev/sde1
  WARNING: Volume group helter_skelter is not consistent
  WARNING: Volume Group helter_skelter is not consistent
  WARNING: Volume group helter_skelter is not consistent
Extending the recreated PV back into VG helter_skelter
Up converting linear(s) back to mirror(s) on taft-04...
taft-04: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_1 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150
taft-04: lvconvert -m 1 -b helter_skelter/syncd_secondary_2legs_2 /dev/sdg1:0-1000 /dev/sde1:0-1000 /dev/sdf1:0-150
Verifying the up conversion from linear(s) to mirror(s)
  Verifying device /dev/sdg1 *IS* one of the legs in the mirror(s)
  Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)

Waiting until all mirrors become fully syncd...
0/2 mirror(s) are fully synced: ( 1=0.33% 2=10.00% )
0/2 mirror(s) are fully synced: ( 1=0.33% 2=54.75% )
0/2 mirror(s) are fully synced: ( 1=0.33% 2=98.83% )
1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
1/2 mirror(s) are fully synced: ( 1=0.33% 2=100.00% )
After 10 minutes the mirror(s) are still not in sync

[root@taft-04 ~]# lvs -a -o +devices
  LV                                 VG             Attr   LSize   Origin Snap% Move Log                          Copy%  Convert Devices
  LogVol00                           VolGroup00     -wi-ao  66.19G                                                       /dev/sda2(0)
  LogVol01                           VolGroup00     -wi-ao   1.94G                                                       /dev/sda2(2118)
  syncd_secondary_2legs_1            helter_skelter mwi-ao 600.00M             syncd_secondary_2legs_1_mlog   0.00      syncd_secondary_2legs_1_mimage_0(0),syn)
  [syncd_secondary_2legs_1_mimage_0] helter_skelter Iwi-ao 600.00M                                                      /dev/sdg1(0)
  [syncd_secondary_2legs_1_mimage_1] helter_skelter Iwi-ao 600.00M                                                      /dev/sde1(0)
  [syncd_secondary_2legs_1_mlog]     helter_skelter lwi-ao   4.00M                                                      /dev/sdf1(0)
  syncd_secondary_2legs_2            helter_skelter mwi-ao 600.00M             syncd_secondary_2legs_2_mlog 100.00      syncd_secondary_2legs_2_mimage_0(0),syn)
  [syncd_secondary_2legs_2_mimage_0] helter_skelter iwi-ao 600.00M                                                      /dev/sdg1(150)
  [syncd_secondary_2legs_2_mimage_1] helter_skelter iwi-ao 600.00M                                                      /dev/sde1(150)
  [syncd_secondary_2legs_2_mlog]     helter_skelter lwi-ao   4.00M                                                      /dev/sdf1(1)

ccs/ dlm/ gfs2/ lvm2/ nfs/ var/
[root@taft-04 ~]# /usr/tests/sts-rhel5.3/lvm2/bin/lvm_rpms
2.6.18-117.el5
lvm2-2.02.40-6.el5            BUILT: Fri Oct 24 07:37:33 CDT 2008
lvm2-cluster-2.02.40-6.el5    BUILT: Fri Oct 24 07:38:44 CDT 2008
device-mapper-1.02.28-2.el5   BUILT: Fri Sep 19 02:50:32 CDT 2008
cmirror-1.1.34-5.el5          BUILT: Thu Nov  6 15:10:44 CST 2008
kmod-cmirror-0.1.21-2.el5     BUILT: Thu Nov  6 14:12:07 CST 2008
```

Marking this as a 'Testblocker' since it prevents 2 different tests from completing.

Created attachment 323120 [details]: current state of clogd

Attaching the dump from clogd during the latest reproduction.
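That dump presumably comes from the signal-triggered debugging added in commit 4dc9f18733. A hypothetical way to trigger it on a hung node (the exact signal clogd watches for is an assumption here, not something stated in this bz; check the clogd source, and syslog for the output):

```bash
# Hypothetical: ask clogd to dump its internal state. SIGUSR1 is an
# assumed choice of signal - verify against the clogd source first.
kill -USR1 "$(pidof clogd)"
```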
FYI - Jon is correct (at least in the mirror_sanity [verify_sync_completions] test case): when the mirror gets stuck, a write to that device will start the sync up again. QA has been able to hack around this bug, since there is a workaround (sketched at the end of this report). Removing the TestBlocker flag and proposing this bug for rhel5.4.

This issue should remain open for 5.4 and should not have been closed.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0158.html
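As a postscript, the write-to-the-device workaround mentioned above can be exercised with a single small direct write. A minimal sketch, assuming a scratch mirror whose contents are disposable (the LV path is illustrative):

```bash
# Kick a stuck mirror's resync by issuing one write to the mirror LV.
# WARNING: this overwrites the first 4 KiB of the volume - only use it on
# a test mirror that holds no real data (as in the QA runs above).
dd if=/dev/zero of=/dev/mirror_sanity/sync_check_4 bs=4k count=1 oflag=direct
```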