From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070226 Fedora/1.5.0.10-1.fc6 Firefox/1.5.0.10 pango-text

Description of problem:
3-node GFS cluster sharing 2 virtual IPs as 2 different services. The IPs are listed as services in cluster.conf, and the failover domain is set to ordered/restricted. The IP fails over when the box goes down, but does not return to the correctly prioritized box when that box comes back.

    <failoverdomain name="ip_domain2" ordered="1" restricted="1">
        <failoverdomainnode name="fs102" priority="1"/>
        <failoverdomainnode name="fs101" priority="2"/>
        <failoverdomainnode name="fs02" priority="3"/>
    </failoverdomain>

Version-Release number of selected component (if applicable):
rgmanager-1.9.54-1

How reproducible:
Always

Steps to Reproduce:
1. Configure a failover domain with the ordered/restricted flags and a VIP service
2. Shut down the primary box so the IP fails over
3. Bring up the primary box and watch the logs to see whether the service returns

Actual Results:
Service does not fail back

    Mar 8 11:03:26 fs101 clurgmgrd[5684]: <debug> Relocating group nfs_ip2 to better node fs102
    Mar 8 11:03:26 fs101 clurgmgrd[5684]: <debug> Event (0:2:1) Processed
    Mar 8 11:03:26 fs101 clurgmgrd[5684]: <notice> Stopping service nfs_ip2
    Mar 8 11:03:26 fs101 clurgmgrd[5684]: <err> #52: Failed changing RG status
    Mar 8 11:03:26 fs101 clurgmgrd[5684]: <debug> Handling failure request for RG nfs_ip2
    Mar 8 11:03:26 fs101 clurgmgrd[5684]: <err> #57: Failed changing RG status

Expected Results:
The IP should fail back to the better node

Additional info:
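For reference, the fail-back behaviour expected from an ordered, restricted domain can be modelled roughly as below. This is a simplified Python sketch of the policy only, not rgmanager's actual code: `DomainNode` and `preferred_owner` are hypothetical names, and the assumption is that a lower priority number means a more preferred node, as in the cluster.conf fragment above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class DomainNode:
    """One <failoverdomainnode> entry (hypothetical model, not rgmanager code)."""
    name: str
    priority: int  # assumption: lower number = more preferred


def preferred_owner(domain: Iterable[DomainNode],
                    online: Set[str],
                    ordered: bool = True,
                    restricted: bool = True,
                    current: Optional[str] = None) -> Optional[str]:
    """Return the node that should own the service, or None if it must stay stopped."""
    members = [n for n in domain if n.name in online]
    if members:
        if ordered:
            # Ordered: always prefer the best-priority member that is online,
            # which is what makes the service fail back when that node returns.
            return min(members, key=lambda n: n.priority).name
        # Unordered: stay on the current owner if it is an online member.
        if current in {n.name for n in members}:
            return current
        return members[0].name
    if restricted:
        # Restricted: the service may only run on domain members.
        return None
    # Unrestricted: fall back to any online node.
    if current in online:
        return current
    return sorted(online)[0] if online else None


ip_domain2 = [
    DomainNode("fs102", 1),
    DomainNode("fs101", 2),
    DomainNode("fs02", 3),
]

# fs102 is down: the IP should move to fs101.
print(preferred_owner(ip_domain2, {"fs101", "fs02"}))
# fs102 is back: the IP should fail back to fs102.
print(preferred_owner(ip_domain2, {"fs102", "fs101", "fs02"}, current="fs101"))
```

Under this model the second call is the case from the report: once fs102 rejoins, the ordered flag makes it the preferred owner again, so the relocation attempt seen in the log is the expected decision and only the status change that follows is failing.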
This isn't actually a policy bug; the cause of error #52 is the key here - that shouldn't happen. Could you try with the 1.9.54-3.228823 packages available here: http://people.redhat.com/lhh/packages.html
Tried with the new rgmanager package and I get the same results:

    Mar 20 16:49:03 fs102 clurgmgrd[5659]: <info> State change: fs101 UP
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <debug> Evaluating RG nfs_ip1, state started, owner fs102
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <debug> Relocating group nfs_ip1 to better node fs101
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <debug> Evaluating RG nfs_ip2, state started, owner fs102
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <debug> Event (0:3:1) Processed
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <notice> Stopping service nfs_ip1
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <err> #52: Failed changing RG status
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <debug> Handling failure request for RG nfs_ip1
    Mar 20 16:49:04 fs102 clurgmgrd[5659]: <err> #57: Failed changing RG status
    Mar 20 16:49:19 fs102 clurgmgrd: [5659]: <debug> Checking 172.16.1.224, Level 0
Hi, I tried to reproduce this several times - and haven't been able to. Could you give me some hints about your systems? It must be some sort of a race condition. Last Thursday, I received a patch from a community user of linux-cluster which *may* address this if you're willing to try it (though, I must be clear, I couldn't get it to happen with or without their patch). The reason it *may* address this is because it fixes two bugs in the view-formation (data distribution) code and an error case in the rgmanager message code.
By hints, I mean things like RAM / processor speed / # of cores
Both boxes are identical (Dell 1950s):
- 2 dual-core Intel Xeon 2 GHz processors
- 2 GB RAM
- QLogic QLA2432 fibre card
- Broadcom BCM5708 Gigabit Ethernet
Ok - I'll have to build using the patch from the community users. The patch addresses several things - including bugs in the vft subsystem (the part that's throwing errors :) ).
This *should* be fixed in 4.5; could you retest on the current rgmanager package?
Per comment #3.