Bug 491340
| Summary: | RHEL5 cmirror tracker: copy percentage can get stuck during sync attempt (creation/conversion) | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.3 | CC: | agk, ccaulfie, cfeist, coughlan, dwysocha, edamato, heinzm, mbroz |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-09-02 11:05:41 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Corey Marthaler, 2009-03-20 14:21:47 UTC)
Just hit this again in fact. Note how the 4th mirror was stuck at 0.05% until the hack was attempted.

Waiting until all mirrors become fully syncd...
0/8 mirror(s) are fully synced: ( 1=18.75% 2=10.55% 3=9.62% 4=0.05% 5=4.88% 6=6.79% 7=5.08% 8=2.78% )
0/8 mirror(s) are fully synced: ( 1=20.56% 2=11.96% 3=11.47% 4=0.05% 5=6.64% 6=9.08% 7=7.32% 8=4.69% )
0/8 mirror(s) are fully synced: ( 1=22.80% 2=14.06% 3=13.92% 4=0.05% 5=8.98% 6=12.01% 7=10.21% 8=7.13% )
0/8 mirror(s) are fully synced: ( 1=25.00% 2=15.92% 3=16.16% 4=0.05% 5=11.23% 6=15.14% 7=13.28% 8=9.62% )
0/8 mirror(s) are fully synced: ( 1=26.86% 2=17.77% 3=18.36% 4=0.05% 5=13.13% 6=17.68% 7=15.77% 8=11.67% )
0/8 mirror(s) are fully synced: ( 1=28.61% 2=19.19% 3=20.61% 4=0.05% 5=14.75% 6=20.17% 7=18.46% 8=13.62% )
sync percent for sync_check_4 hasn't changed in the past minute
HACKING AROUND BZ 464550 by dd'ing to this stale mirror in order to restart the sync
0/8 mirror(s) are fully synced: ( 1=31.64% 2=22.36% 3=23.73% 4=0.05% 5=17.53% 6=24.37% 7=22.71% 8=16.94% )
0/8 mirror(s) are fully synced: ( 1=33.30% 2=24.22% 3=25.73% 4=1.71% 5=19.19% 6=26.51% 7=24.76% 8=18.60% )

Hit this again after 14 iterations of helter_skelter (instead of mirror_sanity) with 4-way mirrors.

Waiting until all mirrors become fully syncd...
0/3 mirror(s) are fully synced: ( 1=0.25% 2=7.17% 3=4.17% )
0/3 mirror(s) are fully synced: ( 1=0.25% 2=23.92% 3=20.42% )
0/3 mirror(s) are fully synced: ( 1=0.25% 2=40.92% 3=37.50% )
0/3 mirror(s) are fully synced: ( 1=0.25% 2=56.08% 3=53.00% )
0/3 mirror(s) are fully synced: ( 1=0.25% 2=69.50% 3=66.25% )
0/3 mirror(s) are fully synced: ( 1=0.25% 2=84.25% 3=80.92% )
0/3 mirror(s) are fully synced: ( 1=0.25% 2=98.33% 3=95.33% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
After 10 minutes the mirror(s) are still not in sync
HACK AROUND BZ 464550 by dd'ing to the stuck mirrors in order to restart the sync process
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
2/3 mirror(s) are fully synced: ( 1=0.25% 2=100.00% 3=100.00% )
After 15 minutes the mirror(s) are still not in sync
ADDITIONAL HACK AROUND BZ 474174 by suspending/resuming the dm devices in order to restart the sync process
dmsetup suspend helter_skelter-syncd_log_4legs_1

Taft-01:
Apr 2 05:52:19 taft-01 clogd[8553]: 192:22) SEQ#=12885614, UUID=htC68eIU, TYPE=DM_CLOG_CLEAR_REGION, ORIG=4, RESP=
Apr 2 05:52:19 taft-01 clogd[8553]: 193:23) SEQ#=12885615, UUID=htC68eIU, TYPE=DM_CLOG_FLUSH, ORIG=4, RESP=NO
Apr 2 05:52:19 taft-01 clogd[8553]: 194:24) SEQ#=12885616, UUID=htC68eIU, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=4, RE
Apr 2 05:52:19 taft-01 clogd[8553]: 195:25) SEQ#=12885619, UUID=0ZZDcuEW, TYPE=DM_CLOG_SET_REGION_SYNC, ORIG=4, RE
Apr 2 05:52:19 taft-01 clogd[8553]: 196:26) SEQ#=12885620, UUID=0ZZDcuEW, TYPE=DM_CLOG_CLEAR_REGION, ORIG=4, RESP=
Apr 2 05:52:19 taft-01 clogd[8553]: 197:27) SEQ#=12885621, UUID=0ZZDcuEW, TYPE=DM_CLOG_FLUSH, ORIG=4, RESP=NO
Apr 2 05:52:19 taft-01 clogd[8553]: 198:28) SEQ#=12885622, UUID=0ZZDcuEW, TYPE=DM_CLOG_GET_RESYNC_WORK, ORIG=4, RE
Apr 2 05:52:19 taft-01 clogd[8553]: 199:29) SEQ#=12885625, UUID=htC68eIU, TYPE=DM_CLOG_SET_REGION_SYNC, ORIG=4, RE
Apr 2 05:52:38 taft-01 openais[8258]: [TOTEM] Retransmit List: 3af62
Apr 2 05:52:58 taft-01 lvm[8645]: helter_skelter-syncd_log_4legs_2 is now in-sync
Apr 2 05:52:58 taft-01 lvm[8645]: helter_skelter-syncd_log_4legs_3 is now in-sync
Apr 2 05:55:07 taft-01 openais[8258]: [TOTEM] Retransmit List: 3ce3a
Apr 2 06:02:18 taft-01 openais[8258]: [TOTEM] Retransmit List: 3f584
Apr 2 06:03:44 taft-01 openais[8258]: [TOTEM] Retransmit List: 3fd53
Apr 2 06:10:37 taft-01 openais[8258]: [TOTEM] Retransmit List: 423b9

[root@taft-01 ~]# lvs -a -o +devices
  LV  VG  Attr  LSize  Origin  Snap%  Move  Log  Copy%  Convert  Devices
  LogVol00  VolGroup00  -wi-ao  58.38G  /dev/sda2(0)
  LogVol01  VolGroup00  -wi-ao  9.75G  /dev/sda2(1868)
  syncd_log_4legs_1  helter_skelter  mwi-ao  600.00M  syncd_log_4legs_1_mlog  0.00  syncd_log_4legs_1_mimage_0(0),syncd_log_4legs_1_mimage_1(0),syncd_log_4legs_1_mimage_2(0),syncd_log_4legs_1_mimage_3(0)
  [syncd_log_4legs_1_mimage_0]  helter_skelter  Iwi-ao  600.00M  /dev/sde1(0)
  [syncd_log_4legs_1_mimage_1]  helter_skelter  Iwi-ao  600.00M  /dev/sdh1(0)
  [syncd_log_4legs_1_mimage_2]  helter_skelter  Iwi-ao  600.00M  /dev/sdf1(0)
  [syncd_log_4legs_1_mimage_3]  helter_skelter  Iwi-ao  600.00M  /dev/sdg1(0)
  [syncd_log_4legs_1_mlog]  helter_skelter  lwi-ao  4.00M  /dev/sdc1(0)
  syncd_log_4legs_2  helter_skelter  mwi-ao  600.00M  syncd_log_4legs_2_mlog  100.00  syncd_log_4legs_2_mimage_0(0),syncd_log_4legs_2_mimage_1(0),syncd_log_4legs_2_mimage_2(0),syncd_log_4legs_2_mimage_3(0)
  [syncd_log_4legs_2_mimage_0]  helter_skelter  iwi-ao  600.00M  /dev/sde1(150)
  [syncd_log_4legs_2_mimage_1]  helter_skelter  iwi-ao  600.00M  /dev/sdh1(150)
  [syncd_log_4legs_2_mimage_2]  helter_skelter  iwi-ao  600.00M  /dev/sdf1(150)
  [syncd_log_4legs_2_mimage_3]  helter_skelter  iwi-ao  600.00M  /dev/sdg1(150)
  [syncd_log_4legs_2_mlog]  helter_skelter  lwi-ao  4.00M  /dev/sdc1(1)
  syncd_log_4legs_3  helter_skelter  mwi-ao  600.00M  syncd_log_4legs_3_mlog  100.00  syncd_log_4legs_3_mimage_0(0),syncd_log_4legs_3_mimage_1(0),syncd_log_4legs_3_mimage_2(0),syncd_log_4legs_3_mimage_3(0)
  [syncd_log_4legs_3_mimage_0]  helter_skelter  iwi-ao  600.00M  /dev/sde1(300)
  [syncd_log_4legs_3_mimage_1]  helter_skelter  iwi-ao  600.00M  /dev/sdh1(300)
  [syncd_log_4legs_3_mimage_2]  helter_skelter  iwi-ao  600.00M  /dev/sdf1(300)
  [syncd_log_4legs_3_mimage_3]  helter_skelter  iwi-ao  600.00M  /dev/sdg1(300)
  [syncd_log_4legs_3_mlog]  helter_skelter  lwi-ao  4.00M  /dev/sdc1(2

Ended up hitting this with 5-way mirrors as well.

I believe I have a fix pending for this in my queue. I need to check it in.

commit 9d2e36adb7a14378ee5934ea273094428a278089
Author: Jonathan Brassow <jbrassow>
Date:   Mon Apr 20 11:02:50 2009 -0500

    clogd: Fix for bug 491340 - copy percentage can get stuck during sync attempt

    May also help fix 480000.

    Introduce a delay between suspend/resume cycles to prevent the receipt
    of stale checkpoint data.

This bug still exists; either that, or there's another lvconvert bug open affecting this test.

Recreating PVs /dev/sde1
WARNING: Volume group helter_skelter is not consistent
WARNING: Volume Group helter_skelter is not consistent
WARNING: Volume group helter_skelter is not consistent
Extending the recreated PVs back into VG helter_skelter
Up converting linear(s) back to mirror(s) on taft-02...
taft-02: lvconvert --corelog -m 3 -b helter_skelter/syncd_secondary_core_4legs_1 /dev/sdf1:0-1000 /dev/sde1:0-1000 /dev/sdg1:0-1000 /dev/sdd1:0-1000
Verifying the up conversion from linear(s) to mirror(s)
Verifying device /dev/sdf1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sde1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdg1 *IS* one of the legs in the mirror(s)
Verifying device /dev/sdd1 *IS* one of the legs in the mirror(s)
Verifying LOG device 0 *IS* in the mirror(s)
Waiting until all mirrors become fully syncd...
0/1 mirror(s) are fully synced: ( 1=0.17% )
0/1 mirror(s) are fully synced: ( 1=0.17% )
0/1 mirror(s) are fully synced: ( 1=0.17% )
0/1 mirror(s) are fully synced: ( 1=0.17% )
[...]

[root@taft-01 ~]# lvs -a -o +devices
  LV  VG  Attr  LSize  Origin  Snap%  Move  Log  Copy%  Convert  Devices
  LogVol00  VolGroup00  -wi-ao  58.38G  /dev/sda2(0)
  LogVol01  VolGroup00  -wi-ao  9.75G  /dev/sda2(1868)
  syncd_secondary_core_4legs_1  helter_skelter  mwi-ao  600.00M  0.00  syncd_secondary_core_4legs_1_mimage_0(0),syncd_secondary_core_4legs_1_mimage_2(0),syncd_secondary_core_4legs_1_mimage_3(0),syncd_secondary_core_4legs_1_mimage_4(0)
  [syncd_secondary_core_4legs_1_mimage_0]  helter_skelter  Iwi-ao  600.00M  /dev/sdf1(0)
  [syncd_secondary_core_4legs_1_mimage_2]  helter_skelter  Iwi-ao  600.00M  /dev/sdg1(0)
  [syncd_secondary_core_4legs_1_mimage_3]  helter_skelter  Iwi-ao  600.00M  /dev/sdd1(0)
  [syncd_secondary_core_4legs_1_mimage_4]  helter_skelter  Iwi-ao  600.00M  /dev/sde1(0)

[root@taft-01 ~]# dmsetup status
helter_skelter-syncd_secondary_core_4legs_1_mimage_2: 0 1228800 linear
helter_skelter-syncd_secondary_core_4legs_1_mimage_0: 0 1228800 linear
helter_skelter-syncd_secondary_core_4legs_1: 0 1228800 mirror 4 253:2 253:4 253:5 253:3 2/1200 1 AAAA 1 clustered_core
VolGroup00-LogVol01: 0 20447232 linear
VolGroup00-LogVol00: 0 122421248 linear
helter_skelter-syncd_secondary_core_4legs_1_mimage_4: 0 1228800 linear
helter_skelter-syncd_secondary_core_4legs_1_mimage_3: 0 1228800 linear

Name:              helter_skelter-syncd_secondary_core_4legs_1
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      3
Major, minor:      253, 6
Number of targets: 1

2.6.18-150.el5
lvm2-2.02.46-3.el5  BUILT: Mon Jun 1 14:02:52 CDT 2009
lvm2-cluster-2.02.46-3.el5  BUILT: Mon Jun 1 14:06:25 CDT 2009
device-mapper-1.02.32-1.el5  BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5  BUILT: Tue May 5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5  BUILT: Thu May 21 08:28:17 CDT 2009

Appear to have hit this running cmirror_lock_stress:

Creating a 3 redundant legged cmirror named taft-04.18446
Logical volume "taft-04.18446" created
Down converting cmirror from 3 legs to 2 on taft-04
Error locking on node taft-04-bond: Command timed out
Failed to lock taft-04.18446
couldn't down convert cmirror on taft-04

# The syncs are stuck at 2 different percents on different nodes in the cluster.

[root@taft-03 ~]# lvs
  LV  VG  Attr  LSize  Origin  Snap%  Move  Log  Copy%  Convert
  LogVol00  VolGroup00  -wi-ao  58.38G
  LogVol01  VolGroup00  -wi-ao  9.75G
  taft-01.18455  lock_stress  -wi-a-  500.00M
  taft-02.18451  lock_stress  -wi-a-  500.00M
  taft-03.18448  lock_stress  -wi-a-  500.00M
  taft-04.18446  lock_stress  mwi-s-  500.00M  taft-04.18446_mlog  30.40
  taft-04.18446_mimage_3  lock_stress  -wi---  500.00M

[root@taft-04 ~]# lvs
  LV  VG  Attr  LSize  Origin  Snap%  Move  Log  Copy%  Convert
  LogVol00  VolGroup00  -wi-ao  58.38G
  LogVol01  VolGroup00  -wi-ao  9.75G
  taft-01.18455  lock_stress  -wi-a-  500.00M
  taft-02.18451  lock_stress  -wi-a-  500.00M
  taft-03.18448  lock_stress  -wi-a-  500.00M
  taft-04.18446  lock_stress  mwi-s-  500.00M  taft-04.18446_mlog  31.20
  taft-04.18446_mimage_3  lock_stress  -wi---  500.00M

2.6.18-150.el5
lvm2-2.02.46-5.el5  BUILT: Sat Jun 6 16:29:23 CDT 2009
lvm2-cluster-2.02.46-5.el5  BUILT: Sat Jun 6 16:28:13 CDT 2009
device-mapper-1.02.32-1.el5  BUILT: Thu May 21 02:18:23 CDT 2009
cmirror-1.1.37-1.el5  BUILT: Tue May 5 11:46:05 CDT 2009
kmod-cmirror-0.1.21-14.el5  BUILT: Thu May 21 08:28:17 CDT 2009

Bumping the priority since the workaround mentioned in comment #0 doesn't work.

Note for Jon... this is fairly easy to reproduce by running helter_skelter.

I'm attaching current logs containing clogd debug output from all 4 taft nodes when this issue is present.

Created attachment 350339 [details]
log from taft-01
Created attachment 350340 [details]
log from taft-02
Created attachment 350341 [details]
log from taft-03
Created attachment 350342 [details]
log from taft-04
Posting new logs from new binary...

Created attachment 350985 [details]
log from taft-01
Created attachment 350986 [details]
log from taft-02
Created attachment 350987 [details]
log from taft-03
Created attachment 350988 [details]
log from taft-04
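
As an aside on the workaround noted earlier in the test output ("HACK AROUND BZ 464550 by dd'ing to the stuck mirrors in order to restart the sync"): the hack simply forces a fresh write onto the stalled mirror LV so the mirror log sees new activity. The sketch below is a hypothetical C equivalent that rewrites the first sector of the LV in place; the device path is an assumption and the test harness itself used dd, so treat this purely as an illustration of the workaround, not as anything shipped with cmirror.

```c
/* Hypothetical stand-in for the dd-based BZ 464550 workaround: rewrite the
 * first sector of the stuck mirror LV in place so a write passes through
 * the mirror and its log sees fresh activity. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* hypothetical device path; pass the stuck LV as the first argument */
    const char *dev = (argc > 1) ? argv[1]
                                 : "/dev/helter_skelter/syncd_log_4legs_1";
    char sector[512];
    int fd = open(dev, O_RDWR);

    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Read the first sector and write the same bytes straight back: the
     * data is unchanged, but the mirror still has to process a write. */
    if (read(fd, sector, sizeof(sector)) != (ssize_t)sizeof(sector) ||
        lseek(fd, 0, SEEK_SET) != 0 ||
        write(fd, sector, sizeof(sector)) != (ssize_t)sizeof(sector)) {
        perror("rewrite");
        close(fd);
        return EXIT_FAILURE;
    }

    fsync(fd); /* flush so the write actually reaches the mirror legs */
    close(fd);
    return EXIT_SUCCESS;
}
```

As the log excerpts above show, this only sometimes nudged the copy percentage forward, which is why the priority was later bumped.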
commit 6cf7dbc393d206e56a14261af81b1aeeb5bb160a
Author: Jonathan Brassow <jbrassow>
Date:   Wed Jul 8 17:16:19 2009 -0500

    clogd: Fix for bugs 474174 and 491340

    Whether to continue mirror recovery or not was being queued off of a
    variable that was set by doing (x - 1). However, when 'x' was 0 the
    result was a very large number (unsigned). When this large number was
    compared against the number of total regions in the mirror, it was
    determined that recovery was finished.

    The fix is simply to do:
        y = x ? (x - 1) : 0;
    instead of:
        y = (x - 1);

Have not seen this bug with the latest build (though we've seen plenty of other bugs that it could still potentially be hiding behind). Marking verified in cmirror-1.1.39-2.el5.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-1340.html
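
The failure described in commit 6cf7dbc above is a plain unsigned wrap-around: subtracting 1 from an unsigned value of 0 wraps to the largest representable value, which then compares as at least the mirror's total region count and makes recovery look complete. Below is a minimal, self-contained sketch of that arithmetic and of the guarded form quoted in the commit; the variable names and the region count are illustrative only, not the actual clogd code.

```c
#include <stdio.h>

int main(void)
{
    unsigned int region_count = 1200; /* total regions in the mirror (example) */
    unsigned int x = 0;               /* the zero case that triggered the bug */

    /* Broken form: (x - 1) on an unsigned zero wraps to UINT_MAX, which is
     * >= region_count, so recovery was wrongly treated as finished. */
    unsigned int y_broken = x - 1;

    /* Fixed form quoted in the commit: guard the zero case. */
    unsigned int y_fixed = x ? (x - 1) : 0;

    printf("broken: %u -> looks finished? %s\n", y_broken,
           y_broken >= region_count ? "yes (wrongly)" : "no");
    printf("fixed:  %u -> looks finished? %s\n", y_fixed,
           y_fixed >= region_count ? "yes" : "no");
    return 0;
}
```

With a typical 32-bit unsigned int, the broken form prints 4294967295 while the guarded form stays at 0, matching the "recovery finished" misjudgment the commit describes.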