Bug 207132
Summary: | cmirror deactivation can cause clvmd hang | | |
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Corey Marthaler <cmarthal> |
Component: | cmirror | Assignee: | Jonathan Earl Brassow <jbrassow> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Cluster QE <mspqa-list> |
Severity: | medium | Priority: | high |
Version: | 4 | CC: | agk, cfeist, dwysocha, mbroz |
Hardware: | All | OS: | Linux |
Doc Type: | Bug Fix | Last Closed: | 2008-08-05 21:36:25 UTC |
Description
Corey Marthaler
2006-09-19 16:34:47 UTC
All the pieces must not be in place yet.

This is a known issue. The kernel is improperly returning a status line with a trailing space. This is causing device-mapper/LVM2 to think it is OK to load a mirror that is already loaded (hence the "two matches" error messages). Bug 205831 addresses this issue. There is a workaround available for user-space too. I'll check to make sure it is there.

The patch I made to fix the trailing space issue had an off-by-one error. This issue is fixed, and in the repository, but the package may not be built yet. Moving to POST.

Fix verified in lvm2-2.02.15-3/lvm2-cluster-2.02.15-3/cmirror-1.0.1-0.

Looks like I may have spoken too soon... Just like in the original report, I ran looping creation/deactivation/deletion commands over the long weekend on link-08 (part of a 4-node cluster with link-02, link-04, and link-07) and eventually hit what appears to be this exact same issue while attempting an 'lvchange -an'.

[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-42.17.ELsmp #1 SMP Mon Oct 9 18:42:57 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.15-3
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.15-3
[root@link-08 ~]# rpm -q device-mapper
device-mapper-1.02.12-3
[root@link-08 ~]# rpm -q cmirror
cmirror-1.0.1-0
[root@link-08 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-13.0

link-08:
[...]
Logical volume "mirror1" created
Logical volume "mirror1" successfully removed
Logical volume "mirror1" created
Logical volume "mirror1" successfully removed
Logical volume "mirror1" created
Error locking on node link-08: Command timed out

link-08 console:
[...]
dm-cmirror: HEY!!! There are two matches for xwEuivsF
dm-cmirror: HEY!!! There are two matches for xwEuivsF
dm-cmirror: HEY!!! There are two matches for xwEuivsF
dm-cmirror: HEY!!! There are two matches for xwEuivsF

[root@link-08 ~]# dmsetup ls
corey-mirror1_mlog (253, 2)
corey-mirror1_mimage_1 (253, 4)
corey-mirror1_mimage_0 (253, 3)
VolGroup00-LogVol01 (253, 1)
VolGroup00-LogVol00 (253, 0)

Here are some of the process stacks on link-08:

Nov 27 04:58:48 link-08 kernel: lvchange S 0000000000000012 0 18785 3981 (NOTLB)
Nov 27 04:58:48 link-08 kernel: 000001000bbffbd8 0000000000000006 00000100010547f0 0000000000000074
Nov 27 04:58:48 link-08 kernel: 00000100010527f0 0000000000000000 000001000100a000 0000000080132866
Nov 27 04:58:48 link-08 kernel: 00000100351f8030 0000000000000527
Nov 27 04:58:48 link-08 kernel: Call Trace:<ffffffff8030b35c>{schedule_timeout+224} <ffffffff801357c8>{prepare_to_wait+21}
Nov 27 04:58:48 link-08 kernel: <ffffffff803064ca>{unix_stream_recvmsg+592} <ffffffff801358cc>{autoremove_wake_function+0}
Nov 27 04:58:48 link-08 kernel: <ffffffff801358cc>{autoremove_wake_function+0} <ffffffff802a7a62>{sock_aio_read+297}
Nov 27 04:58:48 link-08 kernel: <ffffffff802a7ba8>{sock_aio_write+306} <ffffffff801796f8>{do_sync_read+173}
Nov 27 04:58:48 link-08 kernel: <ffffffff801875d0>{__user_walk+93} <ffffffff80181d3a>{vfs_stat+24}
Nov 27 04:58:48 link-08 kernel: <ffffffff8030a8c9>{thread_return+0} <ffffffff8030a921>{thread_return+88}
Nov 27 04:58:48 link-08 kernel: <ffffffff801358cc>{autoremove_wake_function+0} <ffffffff80193610>{dnotify_parent+34}
Nov 27 04:58:48 link-08 kernel: <ffffffff80179806>{vfs_read+226} <ffffffff80179a4a>{sys_read+69}
Nov 27 04:58:48 link-08 kernel: <ffffffff8011026a>{system_call+126}

Nov 27 04:58:44 link-08 kernel: kmirrord S ffffffffa00fada0 0 398 7 1959 (L-TLB)
Nov 27 04:58:44 link-08 kernel: 000001003f707e68 0000000000000046 000001000cbc2030 0000000000000064
Nov 27 04:58:44 link-08 kernel: 000001003f7176c0 00000000000ae000 000001000100a000 0000000000000000
Nov 27 04:58:44 link-08 kernel: 000001001f72f030 000000000000014f
Nov 27 04:58:44 link-08 kernel: Call Trace:<ffffffffa00f4652>{:dm_mirror:do_work+0} <ffffffff8014791d>{worker_thread+226}
Nov 27 04:58:44 link-08 kernel: <ffffffff80133f0c>{default_wake_function+0} <ffffffff80133f5d>{__wake_up_common+67}
Nov 27 04:58:44 link-08 kernel: <ffffffff80133f0c>{default_wake_function+0} <ffffffff8014b67c>{keventd_create_kthread+0}
Nov 27 04:58:44 link-08 kernel: <ffffffff8014783b>{worker_thread+0} <ffffffff8014b67c>{keventd_create_kthread+0}
Nov 27 04:58:44 link-08 kernel: <ffffffff8014b653>{kthread+200} <ffffffff80110f47>{child_rip+8}
Nov 27 04:58:44 link-08 kernel: <ffffffff8014b67c>{keventd_create_kthread+0} <ffffffff8014b58b>{kthread+0}
Nov 27 04:58:44 link-08 kernel: <ffffffff80110f3f>{child_rip+0}

All lvm commands are stuck on link-08, but here's what other nodes in the cluster see:

[root@link-07 tmp]# dmsetup ls
VolGroup00-LogVol01 (253, 1)
VolGroup00-LogVol00 (253, 0)

[root@link-07 tmp]# lvs -a -o +devices
LV                 VG    Attr   LSize  Origin Snap%  Move Log          Copy%  Devices
mirror1            corey mwi--- 10.00G                    mirror1_mlog        mirror1_mimage_0(0),mirror1_mimage_1(0)
[mirror1_mimage_0] corey iwi--- 10.00G                                        /dev/sda1(0)
[mirror1_mimage_1] corey iwi--- 10.00G                                        /dev/sdb1(0)
[mirror1_mlog]     corey lwi--- 4.00M                                         /dev/sdg1(0)

[root@link-02 tmp]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-02
   2    1    4   M   link-07
   3    1    4   M   link-08
   4    1    4   M   link-04

[root@link-02 tmp]# cat /proc/cluster/services
Service          Name       GID LID State Code
Fence Domain:    "default"    1   2 run   -
[1 2 4 3]
DLM Lock Space:  "clvmd"      4   5 run   -
[1 4 2 3]

Interesting that you would still see "HEY!!! There are two matches for xwEuivsF", since the cluster log constructor does not allow duplicates... This may indicate that the log list is corrupted?

Comment #5 seems to indicate that the device has been deactivated on the other nodes, but not on link-08 (the originator of the command). I don't know how the "HEY!!! There are two matches for xwEuivsF" got in there, but either way, we need to wait for a fix to 217626 to test this again with confidence.
I reproduced this while looking for 213754 (an unrelated bug). And this is after I patched for bug 217626...

I've failed to reproduce this for quite some time, but I'm not convinced it's gone. I could use some help reproducing.

Reproduced this issue along with bz 217895 last night doing looping create/convert/remove cmirror operations. Saw this on all the nodes in the cluster except the node which had the messages in bz 217895. So these two bugs are most likely related.

[...]
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
[...]

Should now be fixed with the changes made for bug 228104.

Have not seen this bug after running countless iterations of 'activator' and 'cmirror_lock_stress', both of which test/stress the locking/activation of cmirrors. Marking verified.

Bug fixed in latest release.
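
For anyone trying to reproduce the hang, the looping creation/deactivation/deletion test described in the comments could be sketched roughly as follows. This is not the script that was actually run for this report. The volume group "corey", the LV name "mirror1", and the 10G size are taken from the output above, while the variable names, the `run` helper, the `cmirror_loop` function, and the iteration count are assumptions made for illustration. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical reproduction loop (a sketch, not the original test script).
# VG, LV and SIZE mirror the names seen in this report; ITERATIONS and
# DRY_RUN are assumptions introduced for this example.
VG=${VG:-corey}
LV=${LV:-mirror1}
SIZE=${SIZE:-10G}
ITERATIONS=${ITERATIONS:-1000}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = really run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create a two-way mirror, deactivate it, remove it, and repeat.
# Stops at the first failing command (for example a locking timeout).
cmirror_loop() {
    i=1
    while [ "$i" -le "$ITERATIONS" ]; do
        run lvcreate -m 1 -n "$LV" -L "$SIZE" "$VG" || return 1
        run lvchange -an "$VG/$LV" || return 1
        run lvremove -f "$VG/$LV" || return 1
        i=$((i + 1))
    done
}
```

To run it for real, set `DRY_RUN=0` and call `cmirror_loop` as root on one cluster node. Because the loop returns at the first failure, a timeout such as "Error locking on node" leaves the mirror devices in place for inspection with `dmsetup ls`.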