Bug 150347 - Service relocation loop failure
Status: CLOSED CURRENTRELEASE
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Reported: 2005-03-04 14:27 EST by Derek Anderson
Modified: 2009-04-16 16:16 EDT
Doc Type: Bug Fix
Last Closed: 2005-03-22 10:31:03 EST


Description Derek Anderson 2005-03-04 14:27:56 EST
Description of problem:
I ran a small bash script that continually relocates a service between my
nodes: link-08, link-10, link-11, link-12.  After many iterations I
get a failure.  At that point clustat lists the service like so:

  Resource Group       Owner (Last)                   State
  -------- -----       ----- ------                   -----
  FOO                  (none                        ) failed

Here's the <rm> section of my config:
<!-- RESOURCE MANAGER SECTION-->
<rm>
  <!-- FAILOVERDOMAINS -->
  <failoverdomains>
    <failoverdomain name="FDOMAIN" ordered="1">
      <failoverdomainnode name="link-08" priority="1"/>
      <failoverdomainnode name="link-10" priority="1"/>
      <failoverdomainnode name="link-11" priority="1"/>
      <failoverdomainnode name="link-12" priority="1"/>
    </failoverdomain>
  </failoverdomains>

  <!-- RESOURCES -->
  <resources>
    <resourcegroup name="FOO" domain="FDOMAIN"/>
    <script name="HTTPee" file="/etc/init.d/httpd"/>
    <ip address="192.168.44.230" monitor_link="yes"/>
    <fs name="MyFS" fstype="ext3" mountpoint="/mnt/gfs1" device="/dev/sde1"/>
  </resources>

  <!-- RESOURCE GROUPS -->
  <resourcegroup ref="FOO">
    <script ref="HTTPee"/>
    <ip ref="192.168.44.230"/>
    <fs ref="MyFS"/>
  </resourcegroup>
</rm>

The failure occurred while relocating from link-08 to link-10.  Here
is the last bit of output from my script before the failure:

...
Trying to relocate FOO to link-08...success
  FOO                  link-08                        started
Move FOO to: link-10
Trying to relocate FOO to link-10...success
  FOO                  link-10                        started
Move FOO to: link-11
Trying to relocate FOO to link-11...success
  FOO                  link-11                        started
Move FOO to: link-12
Trying to relocate FOO to link-12...success
  FOO                  link-12                        started
Move FOO to: link-08
Trying to relocate FOO to link-08...success
  FOO                  none                           recovering
Move FOO to: link-10
Trying to relocate FOO to link-10...failed
Uh ohhhhh
[root@link-12 bin]#
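
The script itself was not attached to this report.  A loop producing output
like the above could look roughly like the sketch below.  It assumes the
`clusvcadm -r <service> -m <member>` form for relocation and filters `clustat`
output with grep; the function names and the RELOCATE, STATUS and MAX_ROUNDS
variables are made up for illustration.

```bash
#!/bin/bash
# Sketch of a relocation loop; NOT the reporter's original script.
# SERVICE and NODES are taken from the report. RELOCATE and STATUS are
# hypothetical overrides so the loop can be dry-run without a cluster.

SERVICE=${SERVICE:-FOO}
NODES=${NODES:-"link-08 link-10 link-11 link-12"}
RELOCATE=${RELOCATE:-clusvcadm}   # assumed usage: clusvcadm -r <service> -m <member>
STATUS=${STATUS:-clustat}

# Relocate SERVICE to each node in turn; stop at the first failure.
relocate_round() {
    local node
    for node in $NODES; do
        echo "Move $SERVICE to: $node"
        if ! "$RELOCATE" -r "$SERVICE" -m "$node"; then
            echo "Uh ohhhhh"
            return 1
        fi
        # Show the service's line from the cluster status output.
        "$STATUS" | grep "$SERVICE"
    done
    return 0
}

# Repeat rounds until a relocation fails, or until MAX_ROUNDS rounds have
# completed when that (hypothetical) limit is set.
relocate_loop() {
    local rounds=0
    while relocate_round; do
        rounds=$((rounds + 1))
        if [ -n "${MAX_ROUNDS:-}" ] && [ "$rounds" -ge "$MAX_ROUNDS" ]; then
            return 0
        fi
    done
    return 1
}
```

Calling `relocate_loop` with no MAX_ROUNDS set keeps cycling through the nodes
until the first failed relocation, which matches the behaviour described here.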

==========================
link-08: clurgmgrd output:
==========================
Sending gratuitous ARP: 192.168.44.230 00:00:1a:18:ea:40 brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[22692] notice: Resource group FOO started
[22737] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[22737] notice: Resource group FOO is stopped
[22737] debug: Sent relocate request to 10
[22774] notice: Starting stopped resource group FOO
mount -t ext3 /dev/sde1 /mnt/gfs1
Checking 192.168.44.230, Level 0
Attempting to add IPv4 address 192.168.44.230 (eth0)
Sending gratuitous ARP: 192.168.44.230 00:00:1a:18:ea:40 brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[22774] notice: Resource group FOO started
[22811] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[22811] notice: Resource group FOO is stopped
[22811] debug: Sent relocate request to 10
[22858] notice: Starting stopped resource group FOO
mount -t ext3 /dev/sde1 /mnt/gfs1
Checking 192.168.44.230, Level 0
Attempting to add IPv4 address 192.168.44.230 (eth0)
Sending gratuitous ARP: 192.168.44.230 00:00:1a:18:ea:40 brd
ff:ff:ff:ff:ff:ff
Starting httpd: (98)Address already in use: make_sock: could not bind
to address [::]:443
no listening sockets available, shutting down
Unable to open logs
                                                           [FAILED]
[22858] notice: start on script "HTTPee" returned 1 (generic error)
[22858] warning: #68: Failed to start FOO; return value: 1
[22858] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[22858] notice: Resource group FOO is recovering

==========================
link-10: clurgmgrd output:
==========================
Sending gratuitous ARP: 192.168.44.230 00:30:48:41:d5:f0 brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[17292] notice: Resource group FOO started
[17337] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[17337] notice: Resource group FOO is stopped
[17337] debug: Sent relocate request to 11
[17373] notice: Starting stopped resource group FOO
mount -t ext3 /dev/sde1 /mnt/gfs1
Checking 192.168.44.230, Level 0
Attempting to add IPv4 address 192.168.44.230 (eth0)
Sending gratuitous ARP: 192.168.44.230 00:30:48:41:d5:f0 brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[17373] notice: Resource group FOO started
[17418] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[17418] notice: Resource group FOO is stopped
[17418] debug: Sent relocate request to 11
[17454] err: #43: Resource group FOO has failed; can not start.
[17454] debug: Unable to stop RG FOO in failed state
[17454] debug: Handling failure request for RG FOO

==========================
link-11: clurgmgrd output:
==========================
Sending gratuitous ARP: 192.168.44.230 00:30:48:41:d8:46 brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[16550] notice: Resource group FOO started
[16595] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[16595] notice: Resource group FOO is stopped
[16595] debug: Sent relocate request to 12
[16631] notice: Starting stopped resource group FOO
mount -t ext3 /dev/sde1 /mnt/gfs1
Checking 192.168.44.230, Level 0
Attempting to add IPv4 address 192.168.44.230 (eth0)
Sending gratuitous ARP: 192.168.44.230 00:30:48:41:d8:46 brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[16631] notice: Resource group FOO started
[16676] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[16676] notice: Resource group FOO is stopped
[16676] debug: Sent relocate request to 12
[16712] debug: Not starting FOO: recovery state
[16713] err: #43: Resource group FOO has failed; can not start.
[16713] debug: Unable to stop RG FOO in failed state
[16713] debug: Handling failure request for RG FOO

==========================
link-12: clurgmgrd output:
==========================
Sending gratuitous ARP: 192.168.44.230 00:30:48:41:d6:6a brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[32729] notice: Resource group FOO started
[25642] debug: Sending resource group states to fd12
[310] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[310] notice: Resource group FOO is stopped
[310] debug: Sent relocate request to 8
[25642] debug: Sending resource group states to fd12
[350] debug: Forwarding req. to link-08.
[25642] debug: Sending resource group states to fd12
[356] debug: Forwarding req. to link-10.
[25642] debug: Sending resource group states to fd12
[362] debug: Forwarding req. to link-11.
[364] notice: Starting stopped resource group FOO
mount -t ext3 /dev/sde1 /mnt/gfs1
Checking 192.168.44.230, Level 0
Attempting to add IPv4 address 192.168.44.230 (eth0)
Sending gratuitous ARP: 192.168.44.230 00:30:48:41:d6:6a brd
ff:ff:ff:ff:ff:ff
Starting httpd:                                            [  OK  ]
[364] notice: Resource group FOO started
[25642] debug: Sending resource group states to fd12
[413] notice: Stopping resource group FOO
unmounting /dev/sde1 (/mnt/gfs1)
Checking 192.168.44.230, Level 0
Attempting to del IPv4 address 192.168.44.230 (eth0)
Checking 192.168.44.230, Level 0
Stopping httpd:                                            [  OK  ]
[413] notice: Resource group FOO is stopped
[413] debug: Sent relocate request to 8
[413] debug: Sent relocate request to 11
[413] notice: Resource group FOO is now running on member 11
[25642] debug: Sending resource group states to fd12
[453] notice: Stopping resource group FOO
/dev/sde1 is not mounted
Checking 192.168.44.230, Level 0
192.168.44.230 is not configured
Stopping httpd:                                            [FAILED]
[453] notice: stop on script "HTTPee" returned 1 (generic error)
[453] crit: #12: RG FOO failed to stop; intervention required
[453] notice: Resource group FOO is failed
[453] debug: Sent relocate request to 10
[453] debug: Sent relocate request to 11
[453] alert: #2: Resource group FOO returned failure code.  Last
Owner: none
[453] alert: #4: Administrator intervention required.
[25642] debug: Sending resource group states to fd12

=================
link-08: messages
=================
Mar  4 13:14:19 link-08 clurgmgrd[15595]: <notice> Resource group FOO
is stopped
Mar  4 13:14:25 link-08 clurgmgrd[15595]: <notice> Starting stopped
resource group FOO
Mar  4 13:14:25 link-08 kernel: kjournald starting.  Commit interval 5
seconds
Mar  4 13:14:25 link-08 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
Mar  4 13:14:25 link-08 kernel: EXT3 FS on sde1, internal journal
Mar  4 13:14:25 link-08 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Mar  4 13:14:26 link-08 httpd: (98)Address already in use: make_sock:
could not bind to address [::]:443
Mar  4 13:14:26 link-08 httpd: no listening sockets available,
shutting down
Mar  4 13:14:26 link-08 httpd: Unable to open logs
Mar  4 13:14:26 link-08 httpd: httpd startup failed
Mar  4 13:14:26 link-08 clurgmgrd[15595]: <notice> start on script
"HTTPee" returned 1 (generic error)
Mar  4 13:14:26 link-08 clurgmgrd[15595]: <warning> #68: Failed to
start FOO; return value: 1
Mar  4 13:14:26 link-08 clurgmgrd[15595]: <notice> Stopping resource
group FOO
Mar  4 13:14:27 link-08 httpd: httpd shutdown succeeded
Mar  4 13:14:27 link-08 clurgmgrd[15595]: <notice> Resource group FOO
is recovering

=================
link-10: messages
=================
Mar  4 13:10:33 link-10 clurgmgrd[11122]: <notice> Resource group FOO
started
Mar  4 13:10:33 link-10 clurgmgrd[11122]: <notice> Stopping resource
group FOO
Mar  4 13:10:34 link-10 httpd: httpd shutdown succeeded
Mar  4 13:10:34 link-10 clurgmgrd[11122]: <notice> Resource group FOO
is stopped
EXT3-fs warning: maximal mount count reached, running e2fsck is
recommended
Mar  4 13:10:44 link-10 clurgmgrd[11122]: <notice> Starting stopped
resource group FOO
Mar  4 13:10:44 link-10 kernel: kjournald starting.  Commit interval 5
seconds
Mar  4 13:10:44 link-10 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
Mar  4 13:10:44 link-10 kernel: EXT3 FS on sde1, internal journal
Mar  4 13:10:44 link-10 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Mar  4 13:10:45 link-10 httpd: httpd startup succeeded
Mar  4 13:10:45 link-10 clurgmgrd[11122]: <notice> Resource group FOO
started
Mar  4 13:10:46 link-10 clurgmgrd[11122]: <notice> Stopping resource
group FOO
Mar  4 13:10:46 link-10 httpd: httpd shutdown succeeded
Mar  4 13:10:46 link-10 clurgmgrd[11122]: <notice> Resource group FOO
is stopped
Mar  4 13:10:55 link-10 clurgmgrd[11122]: <err> #43: Resource group
FOO has failed; can not start.

=================
link-11: messages
=================
Mar  4 13:10:34 link-11 clurgmgrd[10986]: <notice> Resource group FOO
started
Mar  4 13:10:34 link-11 clurgmgrd[10986]: <notice> Stopping resource
group FOO
Mar  4 13:10:35 link-11 httpd: httpd shutdown succeeded
Mar  4 13:10:35 link-11 clurgmgrd[10986]: <notice> Resource group FOO
is stopped
EXT3-fs warning: maximal mount count reached, running e2fsck is
recommended
Mar  4 13:10:42 link-11 clurgmgrd[10986]: <notice> Starting stopped
resource group FOO
Mar  4 13:10:42 link-11 kernel: kjournald starting.  Commit interval 5
seconds
Mar  4 13:10:42 link-11 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
Mar  4 13:10:42 link-11 kernel: EXT3 FS on sde1, internal journal
Mar  4 13:10:42 link-11 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Mar  4 13:10:43 link-11 httpd: httpd startup succeeded
Mar  4 13:10:43 link-11 clurgmgrd[10986]: <notice> Resource group FOO
started
Mar  4 13:10:44 link-11 clurgmgrd[10986]: <notice> Stopping resource
group FOO
Mar  4 13:10:44 link-11 httpd: httpd shutdown succeeded
Mar  4 13:10:44 link-11 clurgmgrd[10986]: <notice> Resource group FOO
is stopped
EXT3-fs warning: maximal mount count reached, running e2fsck is
recommended
Mar  4 13:10:54 link-11 clurgmgrd[10986]: <notice> Starting stopped
resource group FOO
Mar  4 13:10:54 link-11 kernel: kjournald starting.  Commit interval 5
seconds
Mar  4 13:10:54 link-11 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
Mar  4 13:10:54 link-11 kernel: EXT3 FS on sde1, internal journal
Mar  4 13:10:54 link-11 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Mar  4 13:10:56 link-11 httpd: httpd startup succeeded
Mar  4 13:10:56 link-11 clurgmgrd[10986]: <notice> Resource group FOO
started
Mar  4 13:10:56 link-11 clurgmgrd[10986]: <notice> Stopping resource
group FOO
Mar  4 13:10:56 link-11 httpd: httpd shutdown succeeded
Mar  4 13:10:56 link-11 clurgmgrd[10986]: <notice> Resource group FOO
is stopped
Mar  4 13:11:04 link-11 clurgmgrd[10986]: <err> #43: Resource group
FOO has failed; can not start.

=================
link-12: messages
=================
Mar  4 13:10:17 link-12 clurgmgrd[25642]: <notice> Resource group FOO
started
Mar  4 13:10:17 link-12 clurgmgrd[25642]: <notice> Stopping resource
group FOO
Mar  4 13:10:17 link-12 httpd: httpd shutdown succeeded
Mar  4 13:10:17 link-12 clurgmgrd[25642]: <notice> Resource group FOO
is stopped
EXT3-fs warning: maximal mount count reached, running e2fsck is
recommended
Mar  4 13:10:28 link-12 clurgmgrd[25642]: <notice> Starting stopped
resource group FOO
Mar  4 13:10:28 link-12 kernel: kjournald starting.  Commit interval 5
seconds
Mar  4 13:10:28 link-12 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
Mar  4 13:10:28 link-12 kernel: EXT3 FS on sde1, internal journal
Mar  4 13:10:28 link-12 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Mar  4 13:10:29 link-12 httpd: httpd startup succeeded
Mar  4 13:10:29 link-12 clurgmgrd[25642]: <notice> Resource group FOO
started
Mar  4 13:10:29 link-12 clurgmgrd[25642]: <notice> Stopping resource
group FOO
Mar  4 13:10:30 link-12 httpd: httpd shutdown succeeded
Mar  4 13:10:30 link-12 clurgmgrd[25642]: <notice> Resource group FOO
is stopped
Mar  4 13:10:33 link-12 clurgmgrd[25642]: <notice> Resource group FOO
is now running on member 11
Mar  4 13:10:33 link-12 clurgmgrd[25642]: <notice> Stopping resource
group FOO
Mar  4 13:10:35 link-12 httpd: httpd shutdown failed
Mar  4 13:10:35 link-12 clurgmgrd[25642]: <notice> stop on script
"HTTPee" returned 1 (generic error)
Mar  4 13:10:35 link-12 clurgmgrd[25642]: <crit> #12: RG FOO failed to
stop; intervention required
Mar  4 13:10:35 link-12 clurgmgrd[25642]: <notice> Resource group FOO
is failed
Mar  4 13:10:36 link-12 clurgmgrd[25642]: <alert> #2: Resource group
FOO returned failure code.  Last Owner: none
Mar  4 13:10:36 link-12 clurgmgrd[25642]: <alert> #4: Administrator
intervention required.

Version-Release number of selected component (if applicable):
[root@link-12 bin]# clusvcadm -v
1.9.20

How reproducible:
Two for two so far.  The loop has to run for 10-15 minutes or so before it
happens.

Comment 1 Lon Hohberger 2005-03-04 15:14:01 EST
Looks like the httpd script wasn't configured for running in a cluster.

The script's 'stop' operation needs to return 0 if httpd wasn't running,
not 1.  Sadly, many init scripts suffer from this problem.
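
For illustration only (this is not part of the original comment): one way to
get that behaviour without editing the stock init script is a small wrapper
that treats "already stopped" as success.  The sketch below assumes the init
script's 'status' operation exits non-zero when the daemon is not running;
the function names and the INIT_SCRIPT variable are hypothetical.

```bash
#!/bin/bash
# Hypothetical wrapper sketch; not the fix that was applied for this bug.
# INIT_SCRIPT defaults to the init script path used in the report's config.
INIT_SCRIPT=${INIT_SCRIPT:-/etc/init.d/httpd}

# Stop the daemon, reporting success if it was not running in the first place.
cluster_safe_stop() {
    # Assumption: '<init script> status' exits non-zero when the daemon is down.
    if ! "$INIT_SCRIPT" status >/dev/null 2>&1; then
        return 0
    fi
    "$INIT_SCRIPT" stop
}

# Route 'stop' through the safe path; pass every other operation through
# to the real init script unchanged.
cluster_safe_dispatch() {
    case "$1" in
        stop) cluster_safe_stop ;;
        *)    "$INIT_SCRIPT" "$@" ;;
    esac
}
```

A wrapper file built from this would end with `cluster_safe_dispatch "$@"` and
be named in the `file` attribute of the `<script>` resource instead of
/etc/init.d/httpd.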
Comment 2 Lon Hohberger 2005-03-07 12:26:43 EST
Let me know if this still happens with the init script set to return 0
when 'stop' is called and httpd is not running.

Comment 3 Lon Hohberger 2005-03-22 10:31:03 EST
Closing for now.  I haven't been able to reproduce it with proper scripts.
