Bug 432998 - inability to kill all I/O processes before fs relocation can cause 'recovering' service
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.2
Hardware: All Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Lon Hohberger
Reported: 2008-02-15 12:04 EST by Corey Marthaler
Modified: 2009-04-16 18:23 EDT (History)
Fixed In Version: RHBA-2008-0353
Doc Type: Bug Fix
Last Closed: 2008-05-21 10:31:03 EDT
Description Corey Marthaler 2008-02-15 12:04:47 EST
Description of problem:
I'm testing HA LVM relocation and am seeing an issue where not all the I/O
processes using the filesystems get killed (most likely due to forks after the
kill attempts). This can result in the relocation failing and the service
entering the 'recovering' state.

I'm using xdoio/xiogen as the I/O load.
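
The suspected race (workers forking after the kill pass has already scanned the mount) can be illustrated with a small simulation. This is only a sketch of the general pattern, not rgmanager's actual unmount logic: `kill_until_quiet` and `ForkingLoad` are hypothetical names, and the toy load simply respawns a worker a fixed number of times when one is killed.

```python
from typing import Callable, Iterable, Set


def kill_until_quiet(
    list_users: Callable[[], Iterable[int]],
    kill: Callable[[int], None],
    max_passes: int = 5,
) -> bool:
    """Repeat scan-and-kill passes until a scan finds no users of the mount.

    Returns True if a scan came back empty (an unmount is worth attempting),
    False if users were still present after max_passes.
    """
    for _ in range(max_passes):
        pids: Set[int] = set(list_users())  # snapshot; may already be stale
        if not pids:
            return True
        for pid in pids:
            kill(pid)
    return not set(list_users())


class ForkingLoad:
    """Hypothetical stand-in for an I/O load that respawns killed workers."""

    def __init__(self, workers: int, respawns: int) -> None:
        self.alive = set(range(1000, 1000 + workers))
        self.respawns = respawns
        self._next_pid = 2000

    def list_users(self) -> Set[int]:
        return set(self.alive)

    def kill(self, pid: int) -> None:
        self.alive.discard(pid)
        if self.respawns > 0:  # a new worker appears after the scan
            self.respawns -= 1
            self.alive.add(self._next_pid)
            self._next_pid += 1


if __name__ == "__main__":
    one_pass = ForkingLoad(workers=3, respawns=2)
    print(kill_until_quiet(one_pass.list_users, one_pass.kill, max_passes=1))
    several = ForkingLoad(workers=3, respawns=2)
    print(kill_until_quiet(several.list_users, several.kill, max_passes=5))
```

With a single pass, the workers spawned after the scan survive and the mount stays busy; with a bounded number of repeated passes, the simulated load eventually goes quiet.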

<rm>
    <failoverdomains>
      <failoverdomain name="GRANT_domain" ordered="1" restricted="1">
        <failoverdomainnode name="grant-01" priority="1"/>
        <failoverdomainnode name="grant-02" priority="1"/>
        <failoverdomainnode name="grant-03" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <lvm name="lvm" vg_name="GRANT"/>
      <fs name="fs1" device="/dev/GRANT/ha1" force_fsck="0" force_unmount="1"
          self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
      <fs name="fs2" device="/dev/GRANT/ha2" force_fsck="0" force_unmount="1"
          self_fence="0" fstype="ext3" mountpoint="/mnt/fs2" options=""/>
    </resources>
    <service autostart="1" domain="GRANT_domain" name="halvm" recovery="relocate">
      <lvm ref="lvm"/>
      <fs ref="fs1"/>
      <fs ref="fs2"/>
    </service>
</rm>
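
For reference, the filesystems that the service will try to forcefully unmount on stop can be read straight out of this fragment. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `forced_unmount_mounts` is made up for illustration, and the embedded XML is a trimmed copy of the resources and service sections above.

```python
import xml.etree.ElementTree as ET
from typing import List

# Trimmed copy of the <rm> section from this report (failover domains omitted).
CLUSTER_RM = """<rm>
  <resources>
    <lvm name="lvm" vg_name="GRANT"/>
    <fs name="fs1" device="/dev/GRANT/ha1" force_fsck="0" force_unmount="1"
        self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
    <fs name="fs2" device="/dev/GRANT/ha2" force_fsck="0" force_unmount="1"
        self_fence="0" fstype="ext3" mountpoint="/mnt/fs2" options=""/>
  </resources>
  <service autostart="1" domain="GRANT_domain" name="halvm" recovery="relocate">
    <lvm ref="lvm"/>
    <fs ref="fs1"/>
    <fs ref="fs2"/>
  </service>
</rm>"""


def forced_unmount_mounts(rm_xml: str, service_name: str) -> List[str]:
    """Return mountpoints of fs resources with force_unmount="1" in a service."""
    root = ET.fromstring(rm_xml)
    fs_by_name = {fs.get("name"): fs for fs in root.findall("./resources/fs")}
    service = root.find(f"./service[@name='{service_name}']")
    if service is None:
        return []
    mounts = []
    for ref in service.findall("fs"):  # <fs ref="..."/> entries, in order
        fs = fs_by_name.get(ref.get("ref"))
        if fs is not None and fs.get("force_unmount") == "1":
            mounts.append(fs.get("mountpoint"))
    return mounts


if __name__ == "__main__":
    print(forced_unmount_mounts(CLUSTER_RM, "halvm"))  # ['/mnt/fs1', '/mnt/fs2']
```

Both `/mnt/fs1` and `/mnt/fs2` have `force_unmount="1"` and `self_fence="0"`, so a busy mount is handled by killing its users rather than by fencing the node.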

Version-Release number of selected component (if applicable):
2.6.18-71.el5
rgmanager-2.0.36-1.el5

How reproducible:
quite often
Comment 1 Corey Marthaler 2008-02-15 12:27:01 EST
Relocating halvm from grant-01 to grant-02


grant-01:
Feb 15 11:16:30 grant-01 qarshd[7177]: Running cmdline: clusvcadm -r halvm -m
grant-02
Feb 15 11:16:30 grant-01 clurgmgrd[30119]: <notice> Stopping service service:halvm
Feb 15 11:16:30 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6095
(root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6096
(root xdoio /mnt/fs2)

[...]

Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6108
(root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6109
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7485
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7486
(root xdoio /mnt/fs2)

[...]

Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7498
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7499
(root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <err> 'umount /mnt/fs2' failed, error=0
Feb 15 11:16:42 grant-01 clurgmgrd[30119]: <notice> stop on fs "fs2" returned 2
(invalid argument(s))
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs1
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6077
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6078
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6079
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6080
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6081
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 xinetd[2764]: EXIT: qarsh status=0 pid=6072
duration=80(sec)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6082
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6084
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6085
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6086
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6087
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6088
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6089
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6090
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6091
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 clurgmgrd: [30119]: <warning> killing process 6093
(root xdoio /mnt/fs1)
Feb 15 11:16:43 grant-01 qarshd[7481]: Sending child 7482 signal 2
Feb 15 11:16:43 grant-01 xinetd[2764]: EXIT: qarsh signal=13 pid=7481
duration=6(sec)
Feb 15 11:16:53 grant-01 clurgmgrd: [30119]: <err> Logical volume GRANT/ha1
failed to shutdown
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> stop on lvm "lvm" returned 1
(generic error)
Feb 15 11:16:53 grant-01 clurgmgrd[30119]: <notice> Service service:halvm is stopped
Feb 15 11:17:04 grant-01 clurgmgrd[30119]: <notice> Service service:halvm is now
running on member 3



grant-02:
Feb 15 11:16:52 grant-02 clurgmgrd[9113]: <notice> Starting stopped service
service:halvm
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <notice> start on lvm "lvm" returned 1
(generic error)
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <warning> #68: Failed to start
service:halvm; return value: 1
Feb 15 11:16:53 grant-02 clurgmgrd[9113]: <notice> Stopping service service:halvm
Feb 15 11:17:03 grant-02 clurgmgrd[9113]: <notice> Service service:halvm is
recovering
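
The grant-01 log shows two "Forcefully unmounting /mnt/fs2" passes, and the PIDs killed in the second pass are all new. A rough way to pull that out of the log is sketched below. The regular expressions and the `kill_passes` helper are assumptions written for this excerpt (wrapped lines rejoined, elided lines omitted), not part of any rgmanager tooling.

```python
import re
from typing import Dict, List

UNMOUNT_RE = re.compile(r"Forcefully unmounting (\S+)")
KILL_RE = re.compile(r"killing process (\d+) \(\S+ \S+ (\S+)\)")

# Abbreviated excerpt from the grant-01 log above, with wrapped lines rejoined.
LOG = """\
Feb 15 11:16:30 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6095 (root xdoio /mnt/fs2)
Feb 15 11:16:31 grant-01 clurgmgrd: [30119]: <warning> killing process 6109 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <notice> Forcefully unmounting /mnt/fs2
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7485 (root xdoio /mnt/fs2)
Feb 15 11:16:42 grant-01 clurgmgrd: [30119]: <warning> killing process 7499 (root xdoio /mnt/fs2)
"""


def kill_passes(log_text: str) -> Dict[str, List[List[int]]]:
    """Group killed PIDs by mountpoint and by unmount pass, in log order."""
    passes: Dict[str, List[List[int]]] = {}
    for line in log_text.splitlines():
        m = UNMOUNT_RE.search(line)
        if m:
            # Each "Forcefully unmounting" line starts a new pass for that mount.
            passes.setdefault(m.group(1), []).append([])
            continue
        m = KILL_RE.search(line)
        if m:
            mount_passes = passes.get(m.group(2))
            if mount_passes:  # ignore kills seen before any unmount line
                mount_passes[-1].append(int(m.group(1)))
    return passes


if __name__ == "__main__":
    for mount, rounds in kill_passes(LOG).items():
        for number, pids in enumerate(rounds, start=1):
            print(mount, "pass", number, pids)
```

On this excerpt the two passes for `/mnt/fs2` share no PIDs, which is consistent with new xdoio processes appearing between the first kill pass and the retry.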
Comment 9 Corey Marthaler 2008-03-26 17:18:17 EDT
I verified that the service with the "leftover" I/O now goes into the 'failed'
state rather than the 'recovering' state.

rgmanager-2.0.37-1.el5
Comment 11 errata-xmlrpc 2008-05-21 10:31:03 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0353.html
