Bug 207132

Summary: cmirror deactivation can cause clvmd hang
Product: [Retired] Red Hat Cluster Suite Reporter: Corey Marthaler <cmarthal>
Component: cmirrorAssignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED CURRENTRELEASE QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: high    
Version: 4CC: agk, cfeist, dwysocha, mbroz
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-08-05 21:36:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Corey Marthaler 2006-09-19 16:34:47 UTC
Description of problem:
I was running lvm_config on the rpms that jbrassow built for me and after many
iterations, a volume deactivation command (vgchange -an) caused lvm to hang on
taft-02. This was a 4 node x86_64 cluster (taft-01 - 04).

deactivating volume group
  Error locking on node taft-02: Command timed out
  Error locking on node taft-02: Command timed out
  Error locking on node taft-02: Command timed out

Hung lvs command:
[...]
lseek(4, 27136, SEEK_SET)               = 27136
read(4, "mirror_1_1694 {\nid = \"Ijhvqq-GDl"..., 2560) = 2560
close(4)                                = 0
stat("/dev/sdc2", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 34), ...}) = 0
stat("/dev/sdc2", {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 34), ...}) = 0
open("/dev/sdc2", O_RDONLY|O_DIRECT|O_NOATIME) = 4
fstat(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(8, 34), ...}) = 0
ioctl(4, BLKBSZGET, 0x5901c0)           = 0
lseek(4, 0, SEEK_SET)                   = 0
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
lseek(4, 4096, SEEK_SET)                = 4096
read(4, "mSk- LVM2 x[5A%r0N*>\1\0\0\0\0\20\0\0\0\0\0\0"..., 4096) = 4096
lseek(4, 24576, SEEK_SET)               = 24576
read(4, "mirror_1_1694 {\nid = \"Ijhvqq-GDl"..., 4096) = 4096
lseek(4, 24576, SEEK_SET)               = 24576
read(4, "mirror_1_1694 {\nid = \"Ijhvqq-GDl"..., 8192) = 8192
close(4)                                = 0
stat("/proc/lvm/VGs/mirror_1_1694", 0x7fbfffb510) = -1 ENOENT (No such file or
directory)
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
write(3, "3\1\377\277\0\0\0\0\0\0\0\0\22\0\0\0\0\1\4V_mirror_1_1"..., 36) = 36
read(3,               


[root@taft-02 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    4   M   taft-01
   2    1    4   M   taft-02
   3    1    4   M   taft-03
   4    1    4   M   taft-04
[root@taft-02 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 3 1 4]

DLM Lock Space:  "clvmd"                           330 330 run       -
[1 2 3 4]

[root@taft-02 ~]# dmsetup status
mirror_1_1694-mirror_1_16940_mimage_1: 0 101949440 linear
mirror_1_1694-mirror_1_16940_mimage_1: 101949440 136634368 linear
mirror_1_1694-mirror_1_16940_mimage_0: 0 101949440 linear
mirror_1_1694-mirror_1_16940_mimage_0: 101949440 136634368 linear
VolGroup00-LogVol01: 0 4063232 linear
VolGroup00-LogVol00: 0 138674176 linear
[root@taft-02 ~]# dmsetup ls
mirror_1_1694-mirror_1_16940_mimage_1   (253, 3)
mirror_1_1694-mirror_1_16940_mimage_0   (253, 2)
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)


This message is streaming to the console over and over:
dm-cmirror: HEY!!! There are two matches for 2Sjsbsf8
dm-cmirror: HEY!!! There are two matches for 2Sjsbsf8
dm-cmirror: HEY!!! There are two matches for 2Sjsbsf8
dm-cmirror: HEY!!! There are two matches for 2Sjsbsf8
[...]

Version-Release number of selected component (if applicable):
[root@taft-02 ~]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4B
[root@taft-02 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.06-7.0.RHEL4B
[root@taft-02 ~]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4B
[root@taft-02 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-13.0
[root@taft-02 ~]# uname -ar
Linux taft-02 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux

Comment 1 Jonathan Earl Brassow 2006-09-19 17:00:55 UTC
All the pieces must not be in place yet.  This is a known issue.

The kernel is improperly returning a status line with a trailing space.  This is causing device-mapper/
LVM2 to think it is ok to load a mirror that is already loaded (hence the "two matches" error messages.)

bug 205831 addresses this issue.  There is a work-around available for user-space too.  I'll check to 
make sure it is there.


Comment 3 Jonathan Earl Brassow 2006-10-17 14:53:18 UTC
The patch I made to fix the trailing space issue had an off by one error.  This
issue is fixed, and in the repository, but the package may not be built yet. 
Moving to POST.

Comment 4 Corey Marthaler 2006-11-22 18:26:35 UTC
fix verified in lvm2-2.02.15-3/lvm2-cluster-2.02.15-3/cmirror-1.0.1-0

Comment 5 Corey Marthaler 2006-11-27 16:02:07 UTC
Looks like I may have spoke too soon...

Just like in this original report, I ran looping creation/deactivation/deletion
cmds over the long weekend on link-08 (part of 4 node cluster with
link-02,04,07) and eventually hit what appears to be this exact same issue while
attemping an 'lvchange -an'.


[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-42.17.ELsmp #1 SMP Mon Oct 9 18:42:57 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux
[root@link-08 ~]# rpm -q lvm2
lvm2-2.02.15-3
[root@link-08 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.15-3
[root@link-08 ~]# rpm -q device-mapper
device-mapper-1.02.12-3
[root@link-08 ~]# rpm -q cmirror
cmirror-1.0.1-0
[root@link-08 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-13.0


link-08:
  [...]
  Logical volume "mirror1" created
  Logical volume "mirror1" successfully removed
  Logical volume "mirror1" created
  Logical volume "mirror1" successfully removed
  Logical volume "mirror1" created
  Error locking on node link-08: Command timed out

link-08 console:
[...]
dm-cmirror: HEY!!! There are two matches for xwEuivsF
dm-cmirror: HEY!!! There are two matches for xwEuivsF
dm-cmirror: HEY!!! There are two matches for xwEuivsF
dm-cmirror: HEY!!! There are two matches for xwEuivsF

[root@link-08 ~]# dmsetup ls
corey-mirror1_mlog      (253, 2)
corey-mirror1_mimage_1  (253, 4)
corey-mirror1_mimage_0  (253, 3)
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)


Here are some of the process stacks on link-08:
Nov 27 04:58:48 link-08 kernel: lvchange      S 0000000000000012     0 18785  
3981                     (NOTLB)
Nov 27 04:58:48 link-08 kernel: 000001000bbffbd8 0000000000000006
00000100010547f0 0000000000000074
Nov 27 04:58:48 link-08 kernel:        00000100010527f0 0000000000000000
000001000100a000 0000000080132866
Nov 27 04:58:48 link-08 kernel:        00000100351f8030 0000000000000527
Nov 27 04:58:48 link-08 kernel: Call
Trace:<ffffffff8030b35c>{schedule_timeout+224}
<ffffffff801357c8>{prepare_to_wait+21}
Nov 27 04:58:48 link-08 kernel:       
<ffffffff803064ca>{unix_stream_recvmsg+592}
<ffffffff801358cc>{autoremove_wake_function+0}
Nov 27 04:58:48 link-08 kernel:       
<ffffffff801358cc>{autoremove_wake_function+0} <ffffffff802a7a62>{sock_aio_read+297}
Nov 27 04:58:48 link-08 kernel:        <ffffffff802a7ba8>{sock_aio_write+306}
<ffffffff801796f8>{do_sync_read+173}
Nov 27 04:58:48 link-08 kernel:        <ffffffff801875d0>{__user_walk+93}
<ffffffff80181d3a>{vfs_stat+24}
Nov 27 04:58:48 link-08 kernel:        <ffffffff8030a8c9>{thread_return+0}
<ffffffff8030a921>{thread_return+88}
Nov 27 04:58:48 link-08 kernel:       
<ffffffff801358cc>{autoremove_wake_function+0} <ffffffff80193610>{dnotify_parent+34}
Nov 27 04:58:48 link-08 kernel:        <ffffffff80179806>{vfs_read+226}
<ffffffff80179a4a>{sys_read+69}
Nov 27 04:58:48 link-08 kernel:        <ffffffff8011026a>{system_call+126}



Nov 27 04:58:44 link-08 kernel: kmirrord      S ffffffffa00fada0     0   398   
  7          1959       (L-TLB)
Nov 27 04:58:44 link-08 kernel: 000001003f707e68 0000000000000046
000001000cbc2030 0000000000000064
Nov 27 04:58:44 link-08 kernel:        000001003f7176c0 00000000000ae000
000001000100a000 0000000000000000
Nov 27 04:58:44 link-08 kernel:        000001001f72f030 000000000000014f
Nov 27 04:58:44 link-08 kernel: Call
Trace:<ffffffffa00f4652>{:dm_mirror:do_work+0} <ffffffff8014791d>{worker_thread+226}
Nov 27 04:58:44 link-08 kernel:       
<ffffffff80133f0c>{default_wake_function+0} <ffffffff80133f5d>{__wake_up_common+67}
Nov 27 04:58:44 link-08 kernel:       
<ffffffff80133f0c>{default_wake_function+0}
<ffffffff8014b67c>{keventd_create_kthread+0}
Nov 27 04:58:44 link-08 kernel:        <ffffffff8014783b>{worker_thread+0}
<ffffffff8014b67c>{keventd_create_kthread+0}
Nov 27 04:58:44 link-08 kernel:        <ffffffff8014b653>{kthread+200}
<ffffffff80110f47>{child_rip+8}
Nov 27 04:58:44 link-08 kernel:       
<ffffffff8014b67c>{keventd_create_kthread+0} <ffffffff8014b58b>{kthread+0}
Nov 27 04:58:44 link-08 kernel:        <ffffffff80110f3f>{child_rip+0}


All lvm comands are stuck on link-08, but here's what other nodes in the cluster
see:

[root@link-07 tmp]# dmsetup ls
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)
[root@link-07 tmp]# lvs -a -o +devices
  LV                 VG    Attr   LSize  Origin Snap%  Move Log          Copy% 
Devices                         
  mirror1            corey mwi--- 10.00G                    mirror1_mlog       
mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] corey iwi--- 10.00G                                       
/dev/sda1(0)                    
  [mirror1_mimage_1] corey iwi--- 10.00G                                       
/dev/sdb1(0)                    
  [mirror1_mlog]     corey lwi---  4.00M                                       
/dev/sdg1(0)                    

[root@link-02 tmp]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    4   M   link-02
   2    1    4   M   link-07
   3    1    4   M   link-08
   4    1    4   M   link-04
[root@link-02 tmp]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 4 3]

DLM Lock Space:  "clvmd"                             4   5 run       -
[1 4 2 3]



Comment 6 Jonathan Earl Brassow 2006-11-28 21:24:27 UTC
Interesting that you would still see "HEY!!! There are two matches for
xwEuivsF", since the cluster log constructor does not allow duplicates...

May indicate that the log list is corrupted?


Comment 7 Jonathan Earl Brassow 2006-11-28 21:28:16 UTC
Comment #5 seems to indicate that the device has been deactivated on the other
nodes, but not link-08 (the originator of the command).


Comment 8 Jonathan Earl Brassow 2006-11-29 02:45:16 UTC
I don't know how the "HEY!!! There are two matches for
xwEuivsF" got in there, but either way, we need to wait for a fix to 217626 to
test this again with confidence.


Comment 9 Jonathan Earl Brassow 2006-12-01 20:33:02 UTC
I reproduced this while looking for 213754 (an unrelated bug).  And this is
after I patched for bug 217626...



Comment 10 Jonathan Earl Brassow 2007-01-08 23:06:20 UTC
I've failed to reproduce this for quite some time, but I'm not convinced it's
gone.  I could use some help reproducing.


Comment 11 Corey Marthaler 2007-02-14 19:49:11 UTC
Reproduced this issue along with bz 217895 last night doing looping
create/convert/remove cmirror operations.

Saw this on all the nodes in the cluster except the node which had the messages
in bz 217895. So these two bugs are most likely related.

[...]
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
dm-cmirror: HEY!!! There are two matches for r0bZtL8D
[...]

Comment 12 Jonathan Earl Brassow 2007-02-15 20:48:15 UTC
Should now be fixed with the changes made for bug 228104


Comment 13 Corey Marthaler 2007-04-10 18:35:30 UTC
Have not seen this bug after running countless iterations of 'activator' and
'cmirror_lock_stress', both of which test/stress the locking/activation of
cmirrors. Marking verified.

Comment 14 Chris Feist 2008-08-05 21:36:25 UTC
Bug fixed in latest release.