Bug 227398

Summary: cmirror LRT_CLEAR_REGION request fails and causes cmd to run for many days
Product: [Retired] Red Hat Cluster Suite
Component: cmirror
Version: 4
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cfeist, dwysocha, jbrassow, mbroz, prockai
Doc Type: Bug Fix
Last Closed: 2008-08-05 21:43:44 UTC

Description Corey Marthaler 2007-02-05 19:52:35 UTC
Description of problem:
I had I/O going to 3 cmirrored gfs filesystems over the weekend while also doing
up and down mirror conversions of those volumes. 
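
(The conversions in question are the kind normally driven with lvconvert; a
rough sketch of such a loop is below, assuming a VG named vg and a mirror LV
named mirror1 as in the lvs output further down -- the exact commands and
options used in the test are not recorded here:)

# Repeatedly down-convert the mirror to linear and back up to a 2-way mirror
while true; do
    lvconvert -m 0 vg/mirror1    # drop the mirror leg
    lvconvert -m 1 vg/mirror1    # re-add a leg; a full resync starts
done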

Those conversion cmds started to fail due to locking errors, and the filesystems
now appear hung as well (a df command hangs).

I then noticed on the console of one of the nodes in the cluster that the
following cmirror request had been running for days, slowly going through each
region of the mirror and failing each time.

# This may be similar to bug 225337.

[...]
Feb  2 22:04:21 link-04 kernel: dm-cmirror: Request to remove unrecorded region
user (4/2100)
Feb  2 22:04:21 link-04 lvm[4578]: No longer monitoring mirror device vg-mirror1
for events
Feb  2 22:04:42 link-04 kernel: dm-cmirror: Clean-up required due to new server
Feb  2 22:04:42 link-04 kernel: dm-cmirror:  - Wiping clear region list
Feb  2 22:04:42 link-04 kernel: dm-cmirror:  - 0 clear region requests wiped
Feb  2 22:04:42 link-04 kernel: dm-cmirror:  - Resending all mark region requests
Feb  2 22:04:42 link-04 kernel: dm-cmirror: Clean-up complete
Feb  2 22:04:42 link-04 kernel: dm-cmirror: Continuing request type, 8
(LRT_GET_SYNC_COUNT)
Feb  2 22:04:42 link-04 lvm[4578]: Monitoring mirror device vg-mirror1 for events
Feb  2 22:04:48 link-04 lvm[4578]: No longer monitoring mirror device vg-mirror2
for events
Feb  2 22:04:48 link-04 kernel: dm-cmirror: I'm the cluster mirror log server
for CTwiYWOQ
Feb  2 22:04:48 link-04 kernel: dm-cmirror: Disk Resume::  CTwiYWOQ (active)
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Live nodes        :: 4
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   In-Use Regions    :: 0
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Good IUR's        :: 0
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Bad IUR's         :: 0
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Sync count        :: 2441
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Disk Region count :: 10240
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Region count      :: 10240
Feb  2 22:04:48 link-04 kernel: dm-cmirror: Marked regions::
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   2438
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   2442 - 10240
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Total = 7799
Feb  2 22:04:48 link-04 kernel: dm-cmirror: Out-of-sync regions::
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   2438
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   2442 - 10240
Feb  2 22:04:48 link-04 kernel: dm-cmirror:   Total = 7799
Feb  2 22:04:48 link-04 kernel: dm-cmirror: Write bits failed on mirror log
device, 253:8
Feb  2 22:04:48 link-04 kernel: dm-cmirror: Write bits failed on mirror log
device, 253:8
Feb  2 22:04:48 link-04 kernel: dm-cmirror: Leaving while clear region requests
remain.
Feb  2 22:04:53 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
416 of 7966006 (1%)
Feb  2 22:04:53 link-04 kernel: dm-cmirror: Last request:
Feb  2 22:04:53 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb  2 22:04:53 link-04 kernel: dm-cmirror:  - server  :: 57005
Feb  2 22:04:53 link-04 kernel: dm-cmirror:  - log uuid:: CTwiYWOQ (suspended)
Feb  2 22:04:53 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb  2 22:04:53 link-04 kernel: dm-cmirror:  - error   :: -6
Feb  2 22:05:01 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
448 of 7966088 (1%)
Feb  2 22:05:01 link-04 kernel: dm-cmirror: Last request:
Feb  2 22:05:01 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb  2 22:05:01 link-04 kernel: dm-cmirror:  - server  :: 57005
Feb  2 22:05:01 link-04 kernel: dm-cmirror:  - log uuid:: CTwiYWOQ (suspended)
Feb  2 22:05:01 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb  2 22:05:01 link-04 kernel: dm-cmirror:  - error   :: -6


[...]

Feb  5 08:51:57 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
846464 of 9658132 (9%)
Feb  5 08:51:57 link-04 kernel: dm-cmirror: Last request:
Feb  5 08:51:57 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb  5 08:51:57 link-04 kernel: dm-cmirror:  - server  :: 1049214464
Feb  5 08:51:57 link-04 kernel: dm-cmirror:  - log uuid::  (active)
Feb  5 08:51:57 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb  5 08:51:57 link-04 kernel: dm-cmirror:  - error   :: -6
Feb  5 08:52:05 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
846496 of 9658196 (9%)
Feb  5 08:52:05 link-04 kernel: dm-cmirror: Last request:
Feb  5 08:52:05 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb  5 08:52:05 link-04 kernel: dm-cmirror:  - server  :: 1049214464
Feb  5 08:52:05 link-04 kernel: dm-cmirror:  - log uuid::  (active)
Feb  5 08:52:05 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb  5 08:52:05 link-04 kernel: dm-cmirror:  - error   :: -6
Feb  5 08:52:13 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
846528 of 9658260 (9%)
Feb  5 08:52:13 link-04 kernel: dm-cmirror: Last request:
Feb  5 08:52:13 link-04 kernel: dm-cmirror:  - my_id   :: 4
Feb  5 08:52:13 link-04 kernel: dm-cmirror:  - server  :: 1049214464
Feb  5 08:52:13 link-04 kernel: dm-cmirror:  - log uuid::  (active)
Feb  5 08:52:13 link-04 kernel: dm-cmirror:  - request :: LRT_CLEAR_REGION
Feb  5 08:52:13 link-04 kernel: dm-cmirror:  - error   :: -6


This process has now been running for 3+ days, hitting an error on every
request, and has only made it through 9% of the total mirror.
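
(To quantify how fast the failing requests are piling up, the counter in the
"retried requests" messages can be diffed between consecutive reports; a quick
sketch, assuming the messages land in /var/log/messages -- the path and exact
field positions are assumptions:)

# Print the per-report increase in the retried-request counter from lines like
# "dm-cmirror: Clustered mirror retried requests :: <retried> of <total> (<pct>)"
grep 'Clustered mirror retried requests' /var/log/messages | \
awk '{ n = $(NF-3);
       if (prev != "") print n - prev, "new retries since previous report";
       prev = n }'

In the excerpts above, the retried-request counter climbs by about 32 every
8 seconds, with LRT_CLEAR_REGION as the last failing request each time.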

An lvs cmd on link-04 hangs, but here is how another node in the cluster sees
the mirrors (the sync percentages are all stuck).
[root@link-07 ~]# lvs -a -o +devices
  LV                 VG   Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            vg   Mwi-so 5.00G                    mirror1_mlog  58.52 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg   iwi-so 5.00G                                        /dev/sdh1(1280)
  [mirror1_mimage_1] vg   iwi-so 5.00G                                        /dev/sdd1(0)
  [mirror1_mlog]     vg   lwi-so 4.00M                                        /dev/sda1(2893)
  mirror2            vg   Mwi-so 5.00G                    mirror2_mlog 100.00 mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg   iwi-so 5.00G                                        /dev/sdh1(2560)
  [mirror2_mimage_1] vg   iwi-so 5.00G                                        /dev/sdb1(0)
  [mirror2_mlog]     vg   lwi-so 4.00M                                        /dev/sda1(2894)
  mirror3            vg   Mwi-so 5.00G                    mirror3_mlog  37.27 mirror3_mimage_0(0),mirror3_mimage_1(0)
  [mirror3_mimage_0] vg   iwi-so 5.00G                                        /dev/sdh1(3840)
  [mirror3_mimage_1] vg   iwi-so 5.00G                                        /dev/sdc1(0)
  [mirror3_mlog]     vg   lwi-so 4.00M                                        /dev/sda1(2892)
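
(One way to confirm from another node that the sync percentages really are
stuck is to sample Copy% over time; a sketch, assuming the VG is named vg as
above:)

# Sample the mirrors' copy percentage every 30 seconds; values that never
# change while a resync is expected point at the hang.
while sleep 30; do
    date
    lvs --noheadings -o lv_name,copy_percent vg
done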


Version-Release number of selected component (if applicable):
lvm2-2.02.21-1.el4
lvm2-cluster-2.02.21-1.el4

cmirror-1.0.1-1
cmirror-kernel-smp-2.6.9-18.9

device-mapper-1.02.17-2.el4

2.6.9-45.ELsmp
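
(For completeness, the package versions above can be collected on each node
with something like:)

rpm -q lvm2 lvm2-cluster cmirror cmirror-kernel-smp device-mapper
uname -r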

Comment 1 Corey Marthaler 2007-02-13 19:38:25 UTC
Just a note that this has been reproduced. The following has been going on now
for 20+ hours.
 
Feb 12 13:00:31 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
32 of 437462 (1%)
Feb 12 13:00:39 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
64 of 437526 (1%)
Feb 12 13:00:47 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
96 of 437590 (1%)
[...]
Feb 13 08:52:14 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
285728 of 1008866 (29%)
Feb 13 08:52:22 link-04 kernel: dm-cmirror: Clustered mirror retried requests ::
285760 of 1008930 (29%)

Comment 2 Jonathan Earl Brassow 2007-02-21 20:30:23 UTC
Please help me reproduce this with the latest cmirror-kernel package (>= 2/21/2007).



Comment 3 Jonathan Earl Brassow 2007-02-27 17:01:43 UTC
Marking modified, as I believe this has been fixed in the process of fixing
other bugs.

Comment 4 Corey Marthaler 2007-03-01 16:01:18 UTC
Hit this last night on link-04 with the latest cmirror build on the link cluster
(link-02,4,7,8).

Comment 6 Jonathan Earl Brassow 2007-03-02 16:41:15 UTC
Ok, we've proved the bug is still around.  Now we need to either:
A) find the boundary conditions of the bug
B) find quicker/easier ways to reproduce the bug

Obviously, 'B' is the best.

I need to know how many mirrors you had running on the system at the time, what
was being done to them, whether they were 3-way mirrors or not, etc.

Then I need to know if you can reproduce with just one mirror on the system,
and/or with just 2-way mirrors... simplify.

Again, the best thing you could do is find a simple, straightforward way to
trigger the bug that makes it show up in a reasonable timeframe.
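
(A stripped-down reproducer along the lines requested might look like the
sketch below: a single 2-way mirror with GFS on top, sustained I/O, and a
conversion loop. All names, sizes, and mount points are placeholders, not
values taken from the original test.)

# One 2-way cmirror carrying a GFS filesystem
lvcreate -m 1 -L 5G -n mirror1 vg
gfs_mkfs -p lock_dlm -t <clustername>:gfs1 -j 4 /dev/vg/mirror1
mount /dev/vg/mirror1 /mnt/gfs

# Sustained I/O to the filesystem
while true; do
    dd if=/dev/zero of=/mnt/gfs/io.dat bs=1M count=1024 conv=fsync
done &

# Repeated down/up conversion of the same mirror
while true; do
    lvconvert -m 0 vg/mirror1
    lvconvert -m 1 vg/mirror1
done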

Comment 7 Jonathan Earl Brassow 2007-03-02 22:32:54 UTC
I suspect that the server is getting stuck while trying to read a log device
that has been suspended.  I'm now forcing the resume to establish the log server
before it completes to try to mitigate this problem.


Comment 8 Jonathan Earl Brassow 2007-03-05 20:16:02 UTC
post -> modified

Comment 9 Jonathan Earl Brassow 2007-03-14 04:39:41 UTC
modified -> needinfo

Changes for bug 231230 included a partial reversal of some of the changes
designed for this bug.  I've tried to be careful in my selection of which
changes to revert, but this bug should be reinvestigated.


Comment 10 Jonathan Earl Brassow 2007-03-16 19:17:56 UTC
needinfo -> modified

I guess it was pointless for me to change that...  QA will retest to validate it
anyway.


Comment 11 Corey Marthaler 2007-06-27 19:33:59 UTC
I had I/O going to 3 gfs filesystems on top of cmirrors, all up- and
down-converting, for almost 24 hours, and wasn't able to reproduce this bug.
Marking verified.

Comment 13 Chris Feist 2008-08-05 21:43:44 UTC
Fixed in current release (4.7).