149024 – clusvcmgrd stopping ip alias on active node

Summary: clusvcmgrd stopping ip alias on active node

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	clumanager
Sub Component:
Version:	2.1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-02-17 23:25 UTC by Jure Pečar
Modified:	2007-11-30 22:06 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-07-27 16:30:55 UTC
Target Upstream Version:
Embargoed:

Description Jure Pečar 2005-02-17 23:25:58 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux ppc; rv:1.7.3) Gecko/20041014 Firefox/0.10.1

Description of problem:
We have a 2.1AS cluster set up for cyrus. NodeA is a dual Xeon machine; NodeB was a dual P3 machine. NodeB was removed for repair last week, so I replaced it with a single-CPU Xeon machine.

The cyrus service was running on NodeA when it crashed today and NodeB took over. But for some strange reason, NodeB keeps dropping the service's shared IP. Here's the log at info level:

Feb 17 22:53:37 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down
Feb 17 22:53:39 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up
Feb 17 22:53:43 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down
Feb 17 22:53:45 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up
Feb 17 22:53:55 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down
Feb 17 22:53:57 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up
Feb 17 23:17:19 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down
Feb 17 23:17:21 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up
Feb 17 23:23:29 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down
Feb 17 23:23:29 nodeb clusvcmgrd: [19449]: <notice> ipalias notice: Starting cluster alias
Feb 17 23:23:29 nodeb clusvcmgrd: [19449]: <info> ipalias info: Starting IP address <shared ip>
Feb 17 23:23:30 nodeb clusvcmgrd: [19449]: <info> ipalias info: Sending Gratuitous arp for <shared ip> (00:02:B3:9D:7F:06)
Feb 17 23:23:31 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up
Feb 17 23:23:31 nodeb clusvcmgrd: [19524]: <notice> ipalias notice: Stopping cluster alias
Feb 17 23:23:31 nodeb clusvcmgrd: [19524]: <info> ipalias info: Stopping IP address <shared ip>
Feb 17 23:29:19 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down
Feb 17 23:29:21 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up

... and so on. Meanwhile, everything is OK on NodeA: the interface with the shared IP (eth0:0) is down and the machine is accessible as normal.

I have no idea why NodeB is now changing its view of NodeA so frequently. Also, I don't see why a state change of NodeA should trigger stopping the shared IP on NodeB.
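
To quantify how often the peer flaps, the state-change lines in the log above can be paired up with a short script. This is only a rough sketch: the regular expression is written against the line format shown in this report, the function name `find_flaps` and the `year` parameter are made up for illustration (syslog timestamps carry no year), and nothing here is part of clumanager.

```python
import re
from datetime import datetime

# Matches the "state change" lines exactly as they appear in the log above.
STATE_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ clusvcmgrd\[\d+\]: "
    r"<info> state change: node (\S+) new state (Up|Down)\s*$"
)


def find_flaps(lines, year=2005):
    """Pair each Down event with the next Up event for the same node.

    Returns a list of (node, down_time, seconds_until_up) tuples.
    The year is an assumption because syslog lines do not include it.
    """
    down_since = {}
    flaps = []
    for line in lines:
        match = STATE_RE.match(line.strip())
        if match is None:
            continue  # ipalias lines and anything else are skipped
        stamp, node, state = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "Down":
            down_since[node] = when
        elif node in down_since:
            started = down_since.pop(node)
            flaps.append((node, started, (when - started).total_seconds()))
    return flaps


sample = [
    "Feb 17 22:53:37 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down",
    "Feb 17 22:53:39 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up",
    "Feb 17 23:23:29 nodeb clusvcmgrd: [19449]: <notice> ipalias notice: Starting cluster alias",
    "Feb 17 23:23:29 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Down",
    "Feb 17 23:23:31 nodeb clusvcmgrd[1789]: <info> state change: node nodea new state Up",
]

for node, started, seconds in find_flaps(sample):
    print(f"{node} went Down at {started:%H:%M:%S} and came back after {seconds:.0f}s")
```

Run against the full log, this would list every Down/Up pair with its duration, which makes it easier to see whether the outages are always the same short length.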

Version-Release number of selected component (if applicable):
clumanager-1.0.19-2

How reproducible:
Didn't try

Steps to Reproduce:
I haven't really tried to reproduce it, as this is the production system.

Additional info:

/etc/cluster.conf: 

# This file is automatically generated.  Do not manually edit!

[cluhbd]
  logLevel = 6

[clupowerd]
  logLevel = 6

[cluquorumd]
  logLevel = 6
  sameTimeNetdown = 20
  sameTimeNetup = 20

[clurmtabd]
  logLevel = 4

[cluster]
  alias_ip = <shared ip>
  name = <name>
  timestamp = 1108659552

[clusvcmgrd]
  logLevel = 7

[database]
  version = 2.0

[members]
start member0
start chan0
  name = private-link1
  type = net
end chan0
  id = 0
  name = <nodea>
  powerSwitchIPaddr = <nodea>
  powerSwitchPortName = unused
  quorumPartitionPrimary = /dev/raw/raw1
  quorumPartitionShadow = /dev/raw/raw2
end member0
start member1
start chan0
  name = private-link2
  type = net
end chan0
  id = 1
  name = <nodeb>
  powerSwitchIPaddr = <nodeb>
  powerSwitchPortName = unused
  quorumPartitionPrimary = /dev/raw/raw1
  quorumPartitionShadow = /dev/raw/raw2
end member1

[powercontrollers]
start powercontroller0
  IPaddr = <nodea>
  login = unused
  passwd = unused
  type = null
end powercontroller0
start powercontroller1
  IPaddr = <nodeb>
  login = unused
  passwd = unused
  type = null
end powercontroller1

[services]
start service0
  checkInterval = 300
  name = cyrus
start network0
  broadcast = <broadcast ip>
  ipAddress = <shared ip>
  netmask = <netmask ip>
end network0
  preferredNode = nodeb
  relocateOnPreferredNodeBoot = no
  userScript = /opt/scripts/cyrus
end service0


clustat output:

Cluster Status Monitor (<name>)                                  00:16:49

Cluster alias: <dns name of shared ip>

=========================  M e m b e r   S t a t u s  ==========================

  Member         Status     Node Id    Power Switch
  -------------- ---------- ---------- ------------
  nodea          Up         0          Good        
  nodeb          Up         1          Good        

=========================  H e a r t b e a t   S t a t u s  ====================

  Name                           Type       Status    
  ------------------------------ ---------- ------------
  private-link <--> private-link network    Unknown             

=========================  S e r v i c e   S t a t u s  ========================

                                         Last             Monitor  Restart
  Service        Status   Owner          Transition       Interval Count   
  -------------- -------- -------------- ---------------- -------- -------
  cyrus          started  nodeb          15:33:32 Feb 17  300      0      


I have "private-link" defined in /etc/hosts and it's normally pingable from both machines. It's a dedicated network interface connected with a crossover cable. Clustat has never shown anything but "Unknown" for it ...


I have already scheduled downtime for next Friday to upgrade to the latest version, 1.0.28. I see there are some "nice to have" fixes in the start/stop logic. I'll report whether it fixes this problem; I hope it does :)

If you have any other ideas I could try, please let me know before next Friday :)

Comment 1 Lon Hohberger 2005-02-18 16:31:27 UTC

Hi, have you filed a request with Red Hat Support?

http://www.redhat.com/apps/support/

Comment 2 Lon Hohberger 2005-07-27 16:30:55 UTC

As it turns out, the IP alias ("cluster alias IP") will always run on node ID 0
if node ID 0 is online.

From the source code:

    /*
     * Cluster alias management. Low node ID wins the alias when both are up.
     */

Cluster alias IP != service IP, which should only move (with the rest of the
service) if "preferred node" and "relocate on preferred node boot" are set.
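
The rule in the quoted comment can be sketched roughly as below. This is an illustration of the "low node ID wins" behaviour only, not the clumanager implementation; the function names `alias_owner` and `alias_action` are hypothetical.

```python
def alias_owner(members_up):
    """Return the node ID that should hold the cluster alias, or None.

    Assumed rule (from the quoted source comment): the lowest node ID
    among the members that are currently up wins the alias.
    """
    return min(members_up) if members_up else None


def alias_action(my_id, members_up, holding_alias):
    """Decide what this node should do with the alias after a state change."""
    owner = alias_owner(members_up)
    if owner == my_id and not holding_alias:
        return "start"
    if owner != my_id and holding_alias:
        return "stop"
    return "none"


# nodeb (ID 1) sees nodea (ID 0) go Down: nodeb starts the alias.
print(alias_action(my_id=1, members_up={1}, holding_alias=False))
# nodeb sees nodea come back Up: nodeb stops the alias again.
print(alias_action(my_id=1, members_up={0, 1}, holding_alias=True))
```

Under this rule, every Down/Up flap of node ID 0 seen by node ID 1 produces one alias start followed by one alias stop, which matches the pattern in the reported log.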
