Bug 261441 - clvmd timeout causes locking issues during device recovery scenarios
Status: CLOSED NEXTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: lvm2-cluster
Version: 4
Hardware: All
OS: All
Priority: medium
Severity: medium
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2007-08-28 14:59 EDT by Corey Marthaler
Modified: 2010-01-11 23:06 EST
CC: 6 users
Doc Type: Bug Fix
Last Closed: 2008-04-14 02:43:42 EDT
Attachments: None

Description Corey Marthaler 2007-08-28 14:59:03 EDT
Description of problem:
This appears to be the same as or similar to the closed bz 235092, but
raising the timeout hasn't fixed all clvmd timeout cases.
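
For reference, the timeout in question is clvmd's cluster command
timeout, which is set when the daemon starts; a hedged example (check
clvmd(8) for the exact flag and its default on this release):

  clvmd -t 90    # raise the per-request command timeout to 90 seconds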

I hit this while running cmirror device failure testing:

================================================================================
Iteration 13.8 started at Tue Aug 28 00:10:40 CDT 2007
================================================================================
Scenario: Kill secondary leg of non-synced 3-leg mirror

****** Mirror hash info for this scenario ******
* name:      fail_secondary_non_synced_3_legs
* sync:      0
* disklog:   1
* failpv:    /dev/sde1
* legs:      3
* pvs:       /dev/sda1 /dev/sde1 /dev/sdd1 /dev/sdc1
************************************************

Creating mirror on link-07...
qarsh root@link-07 lvcreate -m 2 -n fail_secondary_non_synced_3_legs -L 800M
helter_skelter /dev/sda1:0-500 /dev/sde1:0-500 /dev/sdd1:0-500 /dev/sdc1:0-50
Creating gfs on top of mirror on link-07...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-02...
Mounting gfs on link-02...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-07...
Mounting gfs on link-07...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-08...
Mounting gfs on link-08...

mirror is only 17.00% synced right now

Disabling device sde on link-02
Disabling device sde on link-07
Disabling device sde on link-08

Attempting I/O to cause mirror down conversion on link-07
10+0 records in
10+0 records out
Verifying the down conversion from mirror to linear
  /dev/sde1: open failed: No such device or address
[DEADLOCK]
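
When it wedges like this, the stuck processes on the deadlocked node
can be inspected with standard tools; a hedged example, not part of the
harness output:

  ps -eo pid,stat,wchan:30,cmd | egrep 'clvmd|lvm|dmeventd'   # D-state processes and their wait channels
  echo t > /proc/sysrq-trigger    # dump all kernel task stacks to syslog (requires sysrq enabled)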


link-07 was the node that had the locking failure:
Aug 28 00:21:07 link-07 kernel: dm-cmirror: I'm the cluster mirror log server
for duUcttFu
Aug 28 00:21:07 link-07 kernel: dm-cmirror: Disk Resume::  duUcttFu (active)
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Live nodes        :: 3
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   In-Use Regions    :: 0
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Good IUR's        :: 0
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Bad IUR's         :: 0
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Sync count        :: 554
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Disk Region count :: 1600
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Region count      :: 1600
Aug 28 00:21:07 link-07 kernel: dm-cmirror: Marked regions::
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   276 - 351
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   433 - 527
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   529 - 703
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   897 - 1071
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   1073 - 1247
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   1249 - 1423
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   1425 - 1600
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Total = 1046
Aug 28 00:21:07 link-07 kernel: dm-cmirror: Out-of-sync regions::
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   276 - 351
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   433 - 527
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   529 - 703
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   897 - 1071
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   1073 - 1247
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   1249 - 1423
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   1425 - 1600
Aug 28 00:21:07 link-07 kernel: dm-cmirror:   Total = 1046
Aug 28 00:21:08 link-07 qarshd[21947]: Nothing to do
Aug 28 00:21:32 link-07 last message repeated 8 times
Aug 28 00:21:33 link-07 kernel: dm-cmirror: server_id=2, server_valid=0, duUcttFu
Aug 28 00:21:33 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Aug 28 00:21:33 link-07 kernel: dm-cmirror: LRT_ELECTION(9): (duUcttFu)
Aug 28 00:21:33 link-07 kernel: dm-cmirror:   starter     : 2
Aug 28 00:21:33 link-07 kernel: dm-cmirror:   co-ordinator: 2
Aug 28 00:21:33 link-07 kernel: dm-cmirror:   node_count  : 0
Aug 28 00:21:35 link-07 qarshd[21947]: Nothing to do
Aug 28 00:22:05 link-07 last message repeated 10 times
Aug 28 00:22:07 link-07 lvm[5763]: Error locking on node link-07: Command timed out
Aug 28 00:22:07 link-07 lvm[5763]: Failed to lock fail_secondary_non_synced_3_legs
Aug 28 00:22:07 link-07 lvm[5763]: Wiping pre-committed helter_skelter metadata
from /dev/sda1 header at 4096


[root@link-02 ~]# dmsetup status
helter_skelter-fail_secondary_non_synced_3_legs_mlog: 0 8192 linear
helter_skelter-fail_secondary_non_synced_3_legs: 0 1638400 mirror 3 253:3 253:4
253:5 554/1600 1 AAA 3 clustered_disk 253:2 A
helter_skelter-fail_secondary_non_synced_3_legs_mimage_2: 0 1638400 linear
helter_skelter-fail_secondary_non_synced_3_legs_mimage_1: 0 1638400 linear
helter_skelter-fail_secondary_non_synced_3_legs_mimage_0: 0 1638400 linear


[root@link-07 ~]# dmsetup status
helter_skelter-fail_secondary_non_synced_3_legs_mlog: 0 8192 linear
helter_skelter-fail_secondary_non_synced_3_legs: 0 1638400 mirror 3 253:3 253:4
253:5 554/1600 1 ADA 3 clustered_disk 253:2 A
helter_skelter-fail_secondary_non_synced_3_legs_mimage_2: 0 1638400 linear
helter_skelter-fail_secondary_non_synced_3_legs_mimage_1: 0 1638400 error
helter_skelter-fail_secondary_non_synced_3_legs_mimage_0: 0 1638400 linear


[root@link-08 ~]# dmsetup status
helter_skelter-fail_secondary_non_synced_3_legs_mlog: 0 8192 linear
helter_skelter-fail_secondary_non_synced_3_legs: 0 1638400 mirror 3 253:3 253:4
253:5 554/1600 1 AAA 3 clustered_disk 253:2 A
helter_skelter-fail_secondary_non_synced_3_legs_mimage_2: 0 1638400 linear
helter_skelter-fail_secondary_non_synced_3_legs_mimage_1: 0 1638400 linear
helter_skelter-fail_secondary_non_synced_3_legs_mimage_0: 0 1638400 linear
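
For reference, the mirror line in this output decodes as follows
(annotation mine, per the device-mapper mirror target's status format):

  mirror 3 253:3 253:4 253:5 554/1600 1 ADA 3 clustered_disk 253:2 A
    3                      -> number of mirror legs
    253:3 253:4 253:5      -> the leg devices (major:minor)
    554/1600               -> regions in sync / total regions
    ADA                    -> per-leg health, A = alive, D = dead
    clustered_disk 253:2 A -> log type, log device, log device health
    (the bare 1 and 3 are argument-count fields)

So only link-07 has registered the failed leg (ADA); link-02 and
link-08 still report all legs alive (AAA).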


Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.27-1.el4
2.6.9-56.ELsmp
Comment 1 Corey Marthaler 2007-08-29 09:47:45 EDT
Reproduced this again at exactly the same place in the test.

================================================================================
Iteration 13.8 started at Tue Aug 28 22:00:48 CDT 2007
================================================================================
Scenario: Kill secondary leg of non-synced 3-leg mirror

****** Mirror hash info for this scenario ******
* name:      fail_secondary_non_synced_3_legs
* sync:      0
* disklog:   1
* failpv:    /dev/sdd1
* legs:      3
* pvs:       /dev/sda1 /dev/sdd1 /dev/sdb1 /dev/sdc1
************************************************

Creating mirror on link-07...
qarsh root@link-07 lvcreate -m 2 -n fail_secondary_non_synced_3_legs -L 800M
helter_skelter /dev/sda1:0-500 /dev/sdd1:0-500 /dev/sdb1:0-500 /dev/sdc1:0-50
Creating gfs on top of mirror on link-07...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-02...
Mounting gfs on link-02...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-07...
Mounting gfs on link-07...
Creating mnt point /mnt/fail_secondary_non_synced_3_legs on link-08...
Mounting gfs on link-08...

mirror is only 15.50% synced right now

Disabling device sdd on link-02
Disabling device sdd on link-07
Disabling device sdd on link-08

Attempting I/O to cause mirror down conversion on link-07
10+0 records in
10+0 records out
Verifying the down conversion from mirror to linear
  /dev/sdd1: open failed: No such device or address

[DEADLOCK]


Aug 28 22:11:20 link-07 kernel: dm-cmirror: I'm the cluster mirror log server
for NYIItZCP
Aug 28 22:11:20 link-07 kernel: dm-cmirror: Disk Resume::  NYIItZCP (active)
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Live nodes        :: 3
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   In-Use Regions    :: 0
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Good IUR's        :: 0
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Bad IUR's         :: 0
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Sync count        :: 530
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Disk Region count :: 1600
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Region count      :: 1600
Aug 28 22:11:20 link-07 kernel: dm-cmirror: Marked regions::
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   252 - 351
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   433 - 527
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   529 - 703
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   897 - 1071
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   1073 - 1247
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   1249 - 1423
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   1425 - 1600
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Total = 1070
Aug 28 22:11:20 link-07 kernel: dm-cmirror: Out-of-sync regions::
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   252 - 351
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   433 - 527
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   529 - 703
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   897 - 1071
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   1073 - 1247
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   1249 - 1423
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   1425 - 1600
Aug 28 22:11:20 link-07 kernel: dm-cmirror:   Total = 1070
Aug 28 22:11:45 link-07 kernel: dm-cmirror: server_id=2, server_valid=0, NYIItZCP
Aug 28 22:11:45 link-07 kernel: dm-cmirror: trigger = LRT_GET_SYNC_COUNT
Aug 28 22:11:45 link-07 kernel: dm-cmirror: LRT_ELECTION(9): (NYIItZCP)
Aug 28 22:11:45 link-07 kernel: dm-cmirror:   starter     : 2
Aug 28 22:11:45 link-07 kernel: dm-cmirror:   co-ordinator: 2
Aug 28 22:11:45 link-07 kernel: dm-cmirror:   node_count  : 0
Aug 28 22:11:47 link-07 qarshd[25353]: Nothing to do
Aug 28 22:12:17 link-07 last message repeated 10 times
Aug 28 22:12:20 link-07 lvm[30999]: Error locking on node link-07: Command timed out
Aug 28 22:12:20 link-07 lvm[30999]: Failed to lock fail_secondary_non_synced_3_legs


[root@link-07 ~]# dmsetup status
helter_skelter-fail_secondary_non_synced_3_legs_mlog: 0 8192 linear
helter_skelter-fail_secondary_non_synced_3_legs: 0 1638400 mirror 3 253:3 253:4
253:5 530/1600 1 ADA 3 clustered_disk 253:2 A
helter_skelter-fail_secondary_non_synced_3_legs_mimage_2: 0 1638400 linear
helter_skelter-fail_secondary_non_synced_3_legs_mimage_1: 0 1638400 error
helter_skelter-fail_secondary_non_synced_3_legs_mimage_0: 0 1638400 linear
Comment 2 Corey Marthaler 2007-08-31 10:27:18 EDT
Most cmirror recovery (device failure) testing will eventually end up hitting
this issue.
Comment 3 Jonathan Earl Brassow 2007-09-24 15:47:27 EDT
I haven't been able to reproduce this yet. Are there other mirrors involved?

What I am doing (see the sketch after this list):
1) create 3-way mirror
2) mkfs
3) mount FS on 3 machines and start I/O
4) fail 2nd leg
5) wait for down-convert
6) fail 3rd leg
7) wait for down-convert
8) up convert to 3-way mirror
*) repeat steps 4-8
I never give the mirror a chance to get 'in-sync'.
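
A rough sketch of that loop as a script (hypothetical names throughout:
the LV and cluster names are placeholders, and fail_dev/restore_dev
stand in for whatever the harness actually uses to kill and restore a
device):

  VG=helter_skelter
  LV=fail_test
  lvcreate -m 2 -n $LV -L 800M $VG                          # 1) create 3-way mirror
  gfs_mkfs -p lock_dlm -t mycluster:$LV -j 3 /dev/$VG/$LV   # 2) mkfs
  # 3) mount the FS on all 3 nodes and start I/O (not shown)
  while true; do
      fail_dev leg2                 # 4) fail 2nd leg
      sleep 60                      # 5) wait for down-convert (the real test polls lvs)
      fail_dev leg3                 # 6) fail 3rd leg
      sleep 60                      # 7) wait for down-convert
      restore_dev leg2; restore_dev leg3
      lvconvert -m 2 $VG/$LV        # 8) up-convert back to 3-way mirror
  done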
Comment 4 Jonathan Earl Brassow 2007-09-27 16:34:02 EDT
Should be fixed along with bz 290821.

assigned -> post
