Description of problem: While running the segmented_pvmove test in mirror_sanity on 8 nodes I consistently get into a state where one node is spewing the following messages: Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Reinitializing checkpoint library handle Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Feb 18 23:25:19 west-01 clogd[11502]: [ISeytYih] Reinitializing checkpoint library handle Further LVM commands hang and the pvmove hangs. INFO: task lvs:29922 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. lvs D ffff81000100caa0 0 29922 29921 (NOTLB) ffff81020dbb5cb8 0000000000000086 0000000000000400 ffffffff8001c211 ffff8100da6eced0 0000000000000009 ffff81021fa94860 ffff8101239620c0 0000135c36174b5d 00000000000156e4 ffff81021fa94a48 0000000200000000 Call Trace: [<ffffffff8001c211>] generic_make_request+0x211/0x228 [<ffffffff8006f1f5>] do_gettimeofday+0x40/0x90 [<ffffffff800647ea>] io_schedule+0x3f/0x67 [<ffffffff800f5668>] __blockdev_direct_IO+0x8da/0xa80 [<ffffffff800e669d>] blkdev_direct_IO+0x32/0x37 [<ffffffff800e65d5>] blkdev_get_blocks+0x0/0x96 [<ffffffff8000c514>] __generic_file_aio_read+0xb8/0x198 [<ffffffff8012b392>] inode_has_perm+0x56/0x63 [<ffffffff800c773d>] generic_file_read+0xac/0xc5 [<ffffffff8012b392>] inode_has_perm+0x56/0x63 [<ffffffff800a19e6>] autoremove_wake_function+0x0/0x2e [<ffffffff8004248c>] do_ioctl+0x21/0x6b [<ffffffff8012de9b>] selinux_file_permission+0x9f/0xb6 [<ffffffff8000b6b0>] vfs_read+0xcb/0x171 [<ffffffff80011c01>] sys_read+0x45/0x6e [<ffffffff8005e28d>] tracesys+0xd5/0xe0 Version-Release number of selected component (if applicable): cmirror-1.1.39-5.el5 kmod-cmirror-0.1.22-3.el5 lvm2-2.02.56-7.el5 lvm2-cluster-2.02.56-7.el5 kernel-2.6.18-187.el5 kernel-2.6.18-189.el5 openais-0.80.6-14.el5 cman-2.0.115-30.el5 How reproducible: Fairly easily Steps to Reproduce: 1. 2. 3. Actual results: mirror_sanity output: SCENARIO - [segmented_pvmove] Create a couple small segmented lvs and then pvmove them Running lv_seg on west-07 to create the segmented volumes west-07: /usr/tests/sts-rhel5.5/lvm2/bin/lv_seg -v mirror_sanity -n segment Deactivating segment0 mirror Moving data from pv /dev/sdb1 west-07: pvmove -v /dev/sdb1 Finding volume group "mirror_sanity" Executing: /sbin/modprobe dm-log-clustered Archiving volume group "mirror_sanity" metadata (seqno 195). Creating logical volume pvmove0 Moving 781 extents of logical volume mirror_sanity/segment0 Updating volume group metadata Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 196). Checking progress every 15 seconds Updating volume group metadata Error locking on node west-06: Command timed out Failed to suspend segment0 Quick verification that pvmove is finished on other nodes as well Error locking on node west-06: Command timed out Attempt to drop cached metadata failed after reverted update for VG mirror_sanity. Error locking on node west-06: Command timed out Error locking on node west-06: Command timed out Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 197). ABORTING: Segment progression failed. Error locking on node west-06: Command timed out Unable to drop cached metadata for VG mirror_sanity. ABORTING: Failed to write new data locations to disk. /dev/sdb1: Moved: 16.0% /dev/sdb1: Moved: 30.6% /dev/sdb1: Moved: 49.8% /dev/sdb1: Moved: 63.9% /dev/sdb1: Moved: 80.0% Expected results: Additional info:
Since it is too late to address this issue in RHEL 5.5, it has been proposed for RHEL 5.6. Contact your support representative if you need to escalate this issue.
*** Bug 639114 has been marked as a duplicate of this bug. ***
Yes, I'm still hitting this in 5.6. [root@dash-03 ~]# /usr/tests/sts-rhel5.6/lvm2/bin/lvm_rpms kernel-2.6.18-229.el5 lvm2-2.02.74-1.el5 BUILT: Fri Oct 15 10:26:21 CDT 2010 lvm2-cluster-2.02.74-2.el5 BUILT: Fri Oct 29 07:48:11 CDT 2010 device-mapper-1.02.55-1.el5 BUILT: Fri Oct 15 06:15:55 CDT 2010 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Reinitializing checkpoint library handle Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Nov 12 13:17:19 dash-03 clogd[11372]: [hKkn2Kc7] Reinitializing checkpoint library handle
based on comment #6, bumping priority and marking test blocker.
Quality Engineering Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Please, can you reproduce this? I think Nate has the easiest time reproducing this. Corey was not able to (comment #7). I am not able to. I'd like it if Corey could reproduce (because he is at my location), but I'll take any reproduction as long as I can access the machines.
Now that we're testing 5.7 again, I hit this issue. === Iteration 8 of 10 started on taft-04 at Thu Apr 28 05:02:09 CDT 2011 === FREE PVS: /dev/sdc2 /dev/sde2 /dev/sdg1 /dev/sdh2 INUSE PVS: pv_shuffle_A=/dev/sdb2 /dev/sdc1 /dev/sde1 pv_shuffle_B=/dev/sdd1 /dev/sdd2 /dev/sdg2 NOT INUSE PVS: pv_shuffle_A=/dev/sdb1 /dev/sdh1 pv_shuffle_B=/dev/sdf1 /dev/sdf2 Adding /dev/sdh2 to pv_shuffle_A Adding /dev/sdc2 to pv_shuffle_B Moving data from /dev/sdc1 to /dev/sdh2 on taft-04 Moving data from /dev/sdd1 to /dev/sdc2 on taft-04 taft-04: pvmove -v /dev/sdc1 /dev/sdh2 taft-04: pvmove -v /dev/sdd1 /dev/sdc2 Finding volume group "pv_shuffle_A" Executing: /sbin/modprobe dm-log-clustered Archiving volume group "pv_shuffle_A" metadata (seqno 41). Finding volume group "pv_shuffle_B" Executing: /sbin/modprobe dm-log-clustered Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Reinitializing checkpoint library handle Apr 28 05:02:19 taft-03 clogd[11198]: import_checkpoint: 1 checkpoint sections found Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Reinitializing checkpoint library handle Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Apr 28 05:02:19 taft-03 clogd[11198]: [TBl5MhMf] Reinitializing checkpoint library handle
Reproduced this again on the latest rpms. 2.6.18-261.el5 lvm2-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:24 CDT 2011 lvm2-cluster-2.02.84-3.el5 BUILT: Wed Apr 27 03:42:43 CDT 2011 device-mapper-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011 device-mapper-event-1.02.63-3.el5 BUILT: Thu May 19 08:09:22 CDT 2011 cmirror-1.1.39-10.el5 BUILT: Wed Sep 8 16:32:05 CDT 2010 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 === Iteration 6 of 10 started on taft-03 at Thu May 19 10:27:46 CDT 2011 === FREE PVS: /dev/sde1 /dev/sdf2 /dev/sdg1 /dev/sdg2 INUSE PVS: pv_shuffle_A=/dev/sdc1 /dev/sdh1 /dev/sdh2 pv_shuffle_B=/dev/sdb1 /dev/sdd2 /dev/sde2 NOT INUSE PVS: pv_shuffle_A=/dev/sdb2 /dev/sdd1 pv_shuffle_B=/dev/sdc2 /dev/sdf1 Adding /dev/sdf2 to pv_shuffle_A Adding /dev/sdg2 to pv_shuffle_B Moving data from /dev/sdc1 to /dev/sdf2 on taft-03 Moving data from /dev/sdd2 to /dev/sdg2 on taft-03 taft-03: pvmove -v /dev/sdc1 /dev/sdf2 taft-03: pvmove -v /dev/sdd2 /dev/sdg2 Finding volume group "pv_shuffle_A" Executing: /sbin/modprobe dm-log-clustered Archiving volume group "pv_shuffle_A" metadata (seqno 29). Creating logical volume pvmove0 Moving 299 extents of logical volume pv_shuffle_A/stripe Updating volume group metadata Finding volume group "pv_shuffle_B" Executing: /sbin/modprobe dm-log-clustered Archiving volume group "pv_shuffle_B" metadata (seqno 33). Creating logical volume pvmove0 Moving 896 extents of logical volume pv_shuffle_B/linear Moving 299 extents of logical volume pv_shuffle_B/stripe Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 30). Updating volume group metadata Checking progress before waiting every 15 seconds Creating volume group backup "/etc/lvm/backup/pv_shuffle_B" (seqno 34). Checking progress before waiting every 15 seconds May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Reinitializing checkpoint library handle May 19 10:28:00 taft-03 clogd[8898]: import_checkpoint: 2 checkpoint sections found May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Reinitializing checkpoint library handle May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Reinitializing checkpoint library handle May 19 10:28:00 taft-03 clogd[8898]: [yttZ9ake] Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST
fix did not solve the problem (made it worse)... deferring to 5.8
Current code can get into an infinite loop of retries if it is unable to connect with openAIS. It is better in this case to retry a couple times and fail if necessary.
Fix verified in the latest rpms. 2.6.18-300.el5 lvm2-2.02.88-5.el5 BUILT: Fri Dec 2 12:25:45 CST 2011 lvm2-cluster-2.02.88-5.el5 BUILT: Fri Dec 2 12:48:37 CST 2011 device-mapper-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011 device-mapper-event-1.02.67-2.el5 BUILT: Mon Oct 17 08:31:56 CDT 2011 cmirror-1.1.39-14.el5 BUILT: Wed Nov 2 17:25:33 CDT 2011 kmod-cmirror-0.1.22-3.el5 BUILT: Tue Dec 22 13:39:47 CST 2009 ============================================================ Iteration 20 of 20 started at Wed Dec 14 17:11:07 CST 2011 ============================================================ SCENARIO - [segmented_pvmove] Create a couple small segmented lvs and then pvmove them Running lv_seg on taft-03 to create the segmented volumes taft-03: /usr/tests/sts-rhel5.8/lvm2/bin/lv_seg -v mirror_sanity -n segment Deactivating segment0 mirror Moving data from pv /dev/sdh1 on taft-03 Finding volume group "mirror_sanity" Executing: /sbin/modprobe dm-log-clustered Archiving volume group "mirror_sanity" metadata (seqno 635). Creating logical volume pvmove0 Moving 182 extents of logical volume mirror_sanity/segment0 Moving 182 extents of logical volume mirror_sanity/segment1 Updating volume group metadata Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 636). Checking progress before waiting every 15 seconds /dev/sdh1: Moved: 0.0% /dev/sdh1: Moved: 15.9% /dev/sdh1: Moved: 31.3% /dev/sdh1: Moved: 46.7% /dev/sdh1: Moved: 50.0% Updating volume group metadata Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 637). /dev/sdh1: Moved: 65.7% /dev/sdh1: Moved: 81.0% /dev/sdh1: Moved: 96.4% /dev/sdh1: Moved: 100.0% Removing temporary pvmove LV Writing out final volume group after pvmove Creating volume group backup "/etc/lvm/backup/mirror_sanity" (seqno 639). Quick verification that pvmove is finished on other nodes as well Deactivating mirror segment0... and removing Deactivating mirror segment1... and removing
Hello, It seems we are facing the exact same problem. pvmove hang the cluster and some nodes start logging infinite loop of ... Reinitializing checkpoint library handle Failed to open checkpoint: SA_AIS_ERR_NOT_EXIST Reinitializing checkpoint library handle ... The patch does not seem available on the fedora git or anywhere else. http://git.fedorahosted.org/git/?p=cluster.git;a=history;f=cmirror;h=5f0d6fe0c90c8c2dfc52e174dd7c5d8e73344791;hb=refs/heads/RHEL58 And the rpm or rpms for cmirror-1.1.39-14.el5 neither. Where can we get the patch. We could only find the -13 Also do you know any "clean" method to get out of this situation as it cause the cluster to hang. Thanks a lot.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0256.html