432998 – inability to kill all I/O processes before fs relocation can cause 'recovering' service

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger

Reported:	2008-02-15 17:04 UTC by Corey Marthaler
Modified:	2009-04-16 22:23 UTC
Fixed In Version:	RHBA-2008-0353
Doc Type:	Bug Fix
Last Closed:	2008-05-21 14:31:03 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0353	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2008-05-20 12:46:24 UTC

Description Corey Marthaler 2008-02-15 17:04:47 UTC

Description of problem:
I'm testing HA LVM relocation and am seeing an issue where not all of the I/O
processes using the filesystems get killed (most likely because they fork after
the kill attempts). The forced unmount then fails, which can cause the
relocation to fail and the service to enter the 'recovering' state.
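
For illustration only, here is a minimal Python sketch of the suspected race. It is not the rgmanager code. The names `kill_until_quiet`, `list_pids`, `max_rounds` and `pause` are hypothetical; `list_pids` stands in for whatever reports the processes using the mount point (for example, a wrapper around `fuser -m`). A single kill pass can miss children forked after the listing, so the sketch repeats the pass and reports failure if the mount point never goes quiet.

```python
import os
import signal
import time
from typing import Callable, Iterable


def kill_until_quiet(
    list_pids: Callable[[], Iterable[int]],
    kill: Callable[[int, int], None] = os.kill,
    max_rounds: int = 5,
    pause: float = 0.0,
) -> bool:
    """Kill every PID reported by list_pids() until a round finds none.

    Returns True if the mount point went quiet within max_rounds,
    False if processes kept reappearing (the situation in this report).
    """
    for _ in range(max_rounds):
        pids = list(list_pids())
        if not pids:
            return True  # nothing holds the mount, so umount can proceed
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process exited between listing and kill
        time.sleep(pause)  # give freshly forked children time to show up
    # One last look after the final round of kills.
    return not list(list_pids())
```

A caller that gets `False` back should treat the stop as failed instead of attempting the unmount anyway.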

I'm using xdoio/xiogen to generate the I/O load, with the following resource
manager configuration:

<rm>
    <failoverdomains>
      <failoverdomain name="GRANT_domain" ordered="1" restricted="1">
        <failoverdomainnode name="grant-01" priority="1"/>
        <failoverdomainnode name="grant-02" priority="1"/>
        <failoverdomainnode name="grant-03" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <lvm name="lvm" vg_name="GRANT"/>
      <fs name="fs1" device="/dev/GRANT/ha1" force_fsck="0" force_unmount="1"
          self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
      <fs name="fs2" device="/dev/GRANT/ha2" force_fsck="0" force_unmount="1"
          self_fence="0" fstype="ext3" mountpoint="/mnt/fs2" options=""/>
    </resources>
    <service autostart="1" domain="GRANT_domain" name="halvm" recovery="relocate">
      <lvm ref="lvm"/>
      <fs ref="fs1"/>
      <fs ref="fs2"/>
    </service>
</rm>
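
As a quick cross-check of the configuration above, the following Python sketch lists the filesystems a service references and whether each has `force_unmount` enabled. It uses only the standard library; the function name `service_filesystems` and the trimmed `RM_CONFIG` copy (failover domains omitted) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <rm> section from this report (failover domains omitted).
RM_CONFIG = """
<rm>
  <resources>
    <lvm name="lvm" vg_name="GRANT"/>
    <fs name="fs1" device="/dev/GRANT/ha1" force_fsck="0" force_unmount="1"
        self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
    <fs name="fs2" device="/dev/GRANT/ha2" force_fsck="0" force_unmount="1"
        self_fence="0" fstype="ext3" mountpoint="/mnt/fs2" options=""/>
  </resources>
  <service autostart="1" domain="GRANT_domain" name="halvm" recovery="relocate">
    <lvm ref="lvm"/>
    <fs ref="fs1"/>
    <fs ref="fs2"/>
  </service>
</rm>
"""


def service_filesystems(rm_xml: str, service_name: str) -> list:
    """Resolve the <fs ref="..."/> entries of one service to their definitions."""
    root = ET.fromstring(rm_xml)
    fs_by_name = {fs.get("name"): fs for fs in root.findall("./resources/fs")}
    service = root.find(f"./service[@name='{service_name}']")
    if service is None:
        raise KeyError(service_name)
    result = []
    for ref in service.findall("fs"):
        fs = fs_by_name[ref.get("ref")]
        result.append({
            "name": fs.get("name"),
            "device": fs.get("device"),
            "mountpoint": fs.get("mountpoint"),
            "force_unmount": fs.get("force_unmount") == "1",
        })
    return result


if __name__ == "__main__":
    for fs in service_filesystems(RM_CONFIG, "halvm"):
        print(fs)
```

Both filesystems have `force_unmount="1"`, which is why the stop path tries to kill the processes using each mount point before unmounting.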

Version-Release number of selected component (if applicable):
2.6.18-71.el5
rgmanager-2.0.36-1.el5

How reproducible:
quite often

Comment 1 Corey Marthaler 2008-02-15 17:27:01 UTC

Relocating halvm from grant-01 to grant-02


grant-01:
Feb 15 11:16:30 grant-01 qarshd[7177]: Running cmdline: clusvcadm -r halvm -m grant-02
Feb 15 11:16:30 grant-01 clurgmgrd[30119]: <notice> Stopping service service:halvm
Feb 15 11:16:30 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6095 (root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6096 (root xdoio /mnt/fs2)

[...]

Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6108 (root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6109 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7485 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7486 (root xdoio /mnt/fs2)

[...]

Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7498 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7499 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <err> 'umount /mnt/fs2' failed, error=0
Feb 15 11:16:42 grant-01 clurgmgrd[30119]: <notice> stop on fs "fs2" returned 2 (invalid argument(s))
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs1
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6077 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6078 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6079 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6080 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6081 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 xinetd[2764]: EXIT: qarsh status=0 pid=6072 duration=80(sec)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6082 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6084 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6085 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6086 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6087 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6088 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6089 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6090 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6091 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6093 (root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 qarshd[7481]: Sending child 7482 signal 2
Feb 15 11:16:43 grant-01 xinetd[2764]: EXIT: qarsh signal=13 pid=7481 duration=6(sec)
Feb 15 11:16:53 grant-01 clurgmgrd: [30119]: <err> Logical volume GRANT/ha1 failed to shutdown
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> stop on lvm "lvm" returned 1 (generic error)
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> Service service:halvm is stopped
Feb 15 11:17:04 grant-01 clurgmgrd[30119]: <notice> Service service:halvm is now running on member 3



grant-02:
Feb 15 11:16:52 grant-02 clurgmgrd[9113]: <notice> Starting stopped service service:halvm
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <notice> start on lvm "lvm" returned 1 (generic error)
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <warning> #68: Failed to start service:halvm; return value: 1
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <notice> Stopping service service:halvm
Feb 15 11:17:03 grant-02 clurgmgrd[9113]: <notice> Service service:halvm is recovering
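
To make logs like the ones above easier to read, a small Python sketch can count the kill attempts per mount point and pull out the start/stop return codes. The regular expressions are assumptions based only on the message shapes visible in this report, and `summarize` and `SAMPLE` are hypothetical names. The patterns allow a line break before the parenthesised process details, so wrapped log lines still match.

```python
import re
from collections import Counter

# Matches e.g.: killing process 7499 (root xdoio /mnt/fs2)
KILL_RE = re.compile(r"killing process (\d+)\s+\((\S+) (\S+) (\S+)\)")
# Matches e.g.: stop on fs "fs2" returned 2
RESULT_RE = re.compile(r'(start|stop) on (\w+) "([^"]+)" returned (\d+)')

# A few lines copied from the grant-01 log in this report.
SAMPLE = """\
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7498
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7499 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd[30119]: <notice> stop on fs "fs2" returned 2 (invalid argument(s))
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6093 (root xdoio /mnt/fs1)
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> stop on lvm "lvm" returned 1 (generic error)
"""


def summarize(log_text: str):
    """Return (kill attempts per mount point, list of start/stop results)."""
    kills = Counter(m.group(4) for m in KILL_RE.finditer(log_text))
    results = [
        (m.group(1), m.group(2), m.group(3), int(m.group(4)))
        for m in RESULT_RE.finditer(log_text)
    ]
    return kills, results


if __name__ == "__main__":
    kills, results = summarize(SAMPLE)
    print(dict(kills))
    print(results)
```

A non-zero return code on a `stop` line after a round of kill attempts on the same mount point is the pattern described in this report.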

Comment 3 Lon Hohberger 2008-03-19 21:44:37 UTC

http://sources.redhat.com/git/?p=cluster.git;a=commitdiff;h=4cbc5a146009d8fe648c1817641ad04633478c37;hp=8db6a0907ab1c443e9f3c2799ef98a1c069ec41d

Comment 9 Corey Marthaler 2008-03-26 21:18:17 UTC

I verified that a service with "left over" I/O now goes into the 'failed'
state instead of the 'recovering' state.

rgmanager-2.0.37-1.el5

Comment 11 errata-xmlrpc 2008-05-21 14:31:03 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0353.html