Bug 432998

Summary: Inability to kill all I/O processes before fs relocation can leave the service in the 'recovering' state
Product: Red Hat Enterprise Linux 5 Reporter: Corey Marthaler <cmarthal>
Component: rgmanager Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 5.2 CC: cluster-maint, edamato
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: RHBA-2008-0353 Doc Type: Bug Fix
Last Closed: 2008-05-21 14:31:03 UTC Type: ---

Description Corey Marthaler 2008-02-15 17:04:47 UTC
Description of problem:
I'm testing HA LVM relocation and am seeing an issue where not all of the I/O
processes using the filesystems get killed (most likely because they fork after
the kill attempts). This can cause the relocation to fail and the service to
enter the 'recovering' state.

I'm using xdoio/xiogen as the I/O load.

<rm>
    <failoverdomains>
      <failoverdomain name="GRANT_domain" ordered="1" restricted="1">
        <failoverdomainnode name="grant-01" priority="1"/>
        <failoverdomainnode name="grant-02" priority="1"/>
        <failoverdomainnode name="grant-03" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <lvm name="lvm" vg_name="GRANT"/>
      <fs name="fs1" device="/dev/GRANT/ha1" force_fsck="0" force_unmount="1"
self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
      <fs name="fs2" device="/dev/GRANT/ha2" force_fsck="0" force_unmount="1"
self_fence="0" fstype="ext3" mountpoint="/mnt/fs2" options=""/>
    </resources>
    <service autostart="1" domain="GRANT_domain" name="halvm" recovery="relocate">
      <lvm ref="lvm"/>
      <fs ref="fs1"/>
      <fs ref="fs2"/>
    </service>
</rm>

Version-Release number of selected component (if applicable):
2.6.18-71.el5
rgmanager-2.0.36-1.el5

How reproducible:
quite often
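
Illustration of the suspected race (not rgmanager code): the sketch below
shows why a single kill pass can miss processes that fork while the pass is
running, and how repeating the pass until the mount point is idle, or giving
up after a bounded number of tries, avoids attempting the unmount too early.
The names kill_until_idle, list_pids and kill_pid are hypothetical and exist
only for this example; the actual fs resource agent is a shell script and may
work differently.

```python
import time
from typing import Callable, Iterable, Set


def kill_until_idle(
    list_pids: Callable[[], Iterable[int]],
    kill_pid: Callable[[int], None],
    max_passes: int = 5,
    delay: float = 0.0,
) -> bool:
    """Kill the users of a mount point in repeated passes.

    list_pids and kill_pid are hypothetical callables supplied by the
    caller: one returns the PIDs currently using the mount point, the
    other terminates a single PID.

    Returns True once a pass finds no remaining processes. Returns False
    if processes are still present after max_passes, in which case the
    caller should report the stop as failed instead of unmounting.
    """
    for _ in range(max_passes):
        pids: Set[int] = set(list_pids())
        if not pids:
            return True  # nothing holds the mount point any more
        for pid in pids:
            try:
                kill_pid(pid)
            except ProcessLookupError:
                pass  # process exited between the listing and the kill
        if delay:
            time.sleep(delay)  # give children forked mid-pass time to show up
    # Final check after the last kill pass.
    return not set(list_pids())
```

With max_passes=1 this behaves like a single kill pass: a child forked during
the pass survives and the function returns False. On a real node, list_pids
could be built on a tool such as fuser, and kill_pid on os.kill; that wiring
is left out here and is an assumption, not something taken from this report.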

Comment 1 Corey Marthaler 2008-02-15 17:27:01 UTC
Relocating halvm from grant-01 to grant-02


grant-01:
Feb 15 11:16:30 grant-01 qarshd[7177]: Running cmdline: clusvcadm -r halvm -m
grant-02
Feb 15 11:16:30 grant-01 clurgmgrd[30119]: <notice> Stopping service service:halvm
Feb 15 11:16:30 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6095
(root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6096
(root xdoio /mnt/fs2)

[...]

Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6108
(root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6109
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7485
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7486
(root xdoio /mnt/fs2)

[...]

Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7498
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7499
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <err> 'umount /mnt/fs2' failed, error=0
Feb 15 11:16:42 grant-01 clurgmgrd[30119]: <notice> stop on fs "fs2" returned 2
(invalid argument(s))
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs1
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6077
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6078
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6079
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6080
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6081
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 xinetd[2764]: EXIT: qarsh status=0 pid=6072
duration=80(sec)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6082
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6084
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6085
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6086
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6087
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6088
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6089
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6090
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6091
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6093
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 qarshd[7481]: Sending child 7482 signal 2
Feb 15 11:16:43 grant-01 xinetd[2764]: EXIT: qarsh signal=13 pid=7481
duration=6(sec)
Feb 15 11:16:53 grant-01 clurgmgrd: [30119]: <err> Logical volume GRANT/ha1
failed to shutdown
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> stop on lvm "lvm" returned 1
(generic error)
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> Service service:halvm is stopped
Feb 15 11:17:04 grant-01 clurgmgrd[30119]: <notice> Service service:halvm is now
running on member 3



grant-02:
Feb 15 11:16:52 grant-02 clurgmgrd[9113]: <notice> Starting stopped service
service:halvm
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <notice> start on lvm "lvm" returned 1
(generic error)
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <warning> #68: Failed to start
service:halvm; return value: 1
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <notice> Stopping service service:halvm
Feb 15 11:17:03 grant-02 clurgmgrd[9113]: <notice> Service service:halvm is
recovering


Comment 9 Corey Marthaler 2008-03-26 21:18:17 UTC
I verified that a service with leftover I/O now goes into the 'failed' state
instead of the 'recovering' state.

rgmanager-2.0.37-1.el5

Comment 11 errata-xmlrpc 2008-05-21 14:31:03 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0353.html