Bug 327721 - failed RG status change while relocating to preferred failover node
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Platform: All Linux
Priority: low   Severity: low
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Blocks: 399681
Reported: 2007-10-11 10:24 EDT by Corey Marthaler
Modified: 2010-10-22 15:25 EDT

Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:58:05 EDT

Attachments
  Fix for cman; currently testing (541 bytes, patch) - 2007-11-19 12:42 EST, Lon Hohberger
  Test program (4.31 KB, application/octet-stream) - 2007-11-19 14:57 EST, Lon Hohberger
  Here's my cluster.conf. (4.14 KB, text/plain) - 2007-11-26 10:01 EST, Shih-Hsin Yang

Description Corey Marthaler 2007-10-11 10:24:41 EDT
Description of problem:
Lon and I have been seeing this lately while recovering/relocating NFS/GFS services.

Oct  9 14:44:09 taft-03 clurgmgrd[9330]: <notice> Relocating service:TAFT HA GFS to better node taft-02.lab.msp.redhat.com
Oct  9 14:44:09 taft-03 clurgmgrd[9330]: <notice> Stopping service service:TAFT HA GFS
Oct  9 14:44:13 taft-03 kernel: dlm: connecting to 1
Oct  9 14:44:13 taft-03 kernel: dlm: got connection from 1
Oct  9 14:44:24 taft-03 clurgmgrd[9330]: <err> #52: Failed changing RG status

Is this something that should be retried?

Version-Release number of selected component (if applicable):
2.6.18-52.el5
rgmanager-2.0.31-1.el5
Comment 1 RHEL Product and Program Management 2007-10-15 23:35:04 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 Lon Hohberger 2007-10-18 16:55:44 EDT
I think I've reproduced this!
Comment 3 Lon Hohberger 2007-11-14 14:15:11 EST
AFAIK this happens after a node rejoins the cluster after being fenced.
Comment 4 Shih-Hsin Yang 2007-11-16 16:27:10 EST
My cluster config is shown below.

                        <failoverdomain name="oracle" ordered="1" restricted="1">
                                <failoverdomainnode name="vsvr1.kfb.or.kr" priority="1"/>
                                <failoverdomainnode name="vsvr2.kfb.or.kr" priority="2"/>
                        </failoverdomain>

                <vm autostart="1" domain="oracle" exclusive="0" name="oracle" path="/etc/xen" recovery="restart"/>


[root@vsvr1 ~]# ls -la /etc/xen/oracle
lrwxrwxrwx 1 root root 18 Nov  8 16:53 /etc/xen/oracle -> /xen/config/oracle

[root@vsvr2 ~]# mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
none on /sys/kernel/config type configfs (rw)
/dev/mapper/VG--CX3-dom0.xen on /xen type gfs (rw,noatime,hostdata=jid=1:id=65537:first=0)
[root@vsvr2 ~]# ls -la /xen/config/oracle
-rw------- 1 root root 474 Nov  9 11:46 /xen/config/oracle

When vsvr1.kfb.or.kr failed (system crash), the VM was able to start on
vsvr2.kfb.or.kr, but after vsvr1 recovered, the VM did not fail back, and the
error messages below appeared in /var/log/messages.

Nov 17 05:00:37 vsvr2 clurgmgrd[10076]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
Nov 17 05:00:42 vsvr2 kernel: dlm: connecting to 1
Nov 17 05:00:52 vsvr2 clurgmgrd[10076]: <err> #75: Failed changing service status
Comment 5 Lon Hohberger 2007-11-19 11:18:02 EST
Ok - there's a bug here in cman where the port bits aren't getting zeroed out on
node death.  This causes rgmanager to try to talk to the node after it reboots
but before rgmanager is restarted on that node.
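
To make the failure mode concrete, here is a minimal, hypothetical sketch (not
actual rgmanager source) of the kind of check that goes stale.  The port number
is a placeholder, not rgmanager's real port:

    /* Hypothetical sketch: why stale port bits matter.
     * Build with: gcc -o check check.c -lcman */
    #include <libcman.h>

    #define PEER_PORT 177   /* placeholder port number for illustration */

    /* Ask cman whether 'nodeid' has PEER_PORT open.  With this bug,
     * cman keeps answering "yes" for a fenced node that has rebooted
     * but not yet restarted the listener, so the caller sends a status
     * update that is never acknowledged and times out ("#52"/"#75"). */
    int peer_can_receive(cman_handle_t ch, int nodeid)
    {
            return cman_is_listening(ch, nodeid, PEER_PORT) > 0;
    }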
Comment 6 Lon Hohberger 2007-11-19 12:42:50 EST
Created attachment 263721 [details]
Fix for cman; currently testing
Comment 7 Lon Hohberger 2007-11-19 12:49:59 EST
The patch worked for me.  rgmanager gets a spurious node event, but it's
inconsequential and doesn't seem to cause problems.  I'm filing a separate bug
for that one.
Comment 8 Lon Hohberger 2007-11-19 14:57:50 EST
Created attachment 263901 [details]
Test program

This program illustrates the bug.  Copy this to a two node cluster and run the
following on both nodes:

   tar -xzvf bz327721-test.tar.gz
   cd bz327721-test ; make
   ./cman_port_bug

You will see output like this:

  Node 2 listening on port 192
  Node 1 listening on port 192

Leave cman_port_bug running on both and hard-reboot one of the nodes (i.e.
"reboot -fn").  Ensure it is fenced and removed from the cluster correctly.  On
the surviving node, you will see output like this (the node ID may change):

  Node 1 offline (while listening)

Restart the node you rebooted and start cman.  You will see this line:

  Node 1 online

WITHOUT the patch, you will also see the following error:

  ERROR: cman reports node 1 is listening, but no PORTOPENED event was received!

WITH the patch, the above error will *not* appear (correct behavior).
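
For reference, here is a rough sketch of how such a test can be written against
the libcman API.  This is an illustration only; the actual attached program may
be structured differently, and the audited node-ID range is an assumption:

    /*
     * Sketch of a libcman port-bits test.  Each node opens cman port
     * 192 and records PORTOPENED/PORTCLOSED notifications.  After each
     * event it asks cman_is_listening() about the other nodes and
     * flags any node cman claims is listening without a PORTOPENED
     * event having been received.
     * Build with: gcc -o cman_port_bug cman_port_bug.c -lcman
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <libcman.h>

    #define TEST_PORT 192
    #define MAX_NODES 256

    static cman_handle_t ch;
    static int port_open[MAX_NODES];  /* 1 = PORTOPENED seen for node */

    static void event_cb(cman_handle_t h, void *priv, int reason, int arg)
    {
            if (arg <= 0 || arg >= MAX_NODES)
                    return;
            if (reason == CMAN_REASON_PORTOPENED) {
                    printf("Node %d listening on port %d\n", arg, TEST_PORT);
                    port_open[arg] = 1;
            } else if (reason == CMAN_REASON_PORTCLOSED) {
                    printf("Node %d closed port %d\n", arg, TEST_PORT);
                    port_open[arg] = 0;
            }
    }

    static void data_cb(cman_handle_t h, void *priv, char *buf, int len,
                        uint8_t port, int nodeid)
    {
            /* No payload is exchanged; opening the port is the test. */
    }

    int main(void)
    {
            cman_node_t me;
            int nodeid;

            ch = cman_init(NULL);
            if (!ch) {
                    perror("cman_init");
                    return 1;
            }
            memset(&me, 0, sizeof(me));
            cman_get_node(ch, CMAN_NODEID_US, &me);

            cman_start_notification(ch, event_cb);
            cman_start_recv_data(ch, data_cb, TEST_PORT); /* open port 192 */

            for (;;) {
                    cman_dispatch(ch, CMAN_DISPATCH_BLOCKING);
                    /* Audit an assumed node-ID range after each event. */
                    for (nodeid = 1; nodeid < 16; nodeid++) {
                            if (nodeid == me.cn_nodeid)
                                    continue;
                            if (cman_is_listening(ch, nodeid, TEST_PORT) > 0
                                && !port_open[nodeid])
                                    printf("ERROR: cman reports node %d is "
                                           "listening, but no PORTOPENED "
                                           "event was received!\n", nodeid);
                    }
            }
    }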
Comment 10 Shih-Hsin Yang 2007-11-19 20:35:40 EST
Do you mean that patch is not final?

If you think this patch will work for me, please build me a cman package.
I'm using cman-2.0.73-1.el5_1.1 for x86_64.
Comment 11 Christine Caulfield 2007-11-20 04:23:00 EST
That patch is fine, Lon. Thanks.

Checked into the RHEL5 branch (-rRHEL5):

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.2.14; previous revision: 1.55.2.13
done
Comment 16 Shih-Hsin Yang 2007-11-24 10:05:59 EST
Hi Lon,

I tested your test package, but it didn't work for me.

Failback test:
  - vsvr1 crashed (echo 'c' > /proc/sysrq-trigger)
  - vsvr2 fenced vsvr1 --> VM (oracle) started on vsvr2
  - after vsvr1 rejoined the cluster, failback failed
  - but a manual migration from vsvr2 to vsvr1 succeeded


[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started        

[root@vsvr1 ~]# echo 'c' > /proc/sysrq-trigger

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr2.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         

Nov 24 23:01:04 vsvr2 clurgmgrd[7523]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
Nov 24 23:01:09 vsvr2 kernel: dlm: connecting to 1
Nov 24 23:01:19 vsvr2 clurgmgrd[7523]: <err> #75: Failed changing service status

[root@vsvr2 ~]# clusvcadm -M vm:oracle -m vsvr1.kfb.or.kr
Trying to migrate vm:oracle to vsvr1.kfb.or.kr...Success

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                migrating       
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         
Comment 17 Lon Hohberger 2007-11-26 09:07:41 EST
That's strange - it worked in my testing.  I'll try some more tests today.
Comment 18 Lon Hohberger 2007-11-26 09:16:13 EST
I need your cluster.conf.
Comment 19 Shih-Hsin Yang 2007-11-26 10:01:04 EST
Created attachment 268981 [details]
Here's my cluster.conf.
Comment 20 Lon Hohberger 2007-11-26 11:20:48 EST
It looks like there's another place where we needed to zap the port bits:

diff -u -r1.55.2.14 commands.c
--- commands.c  20 Nov 2007 09:21:51 -0000      1.55.2.14
+++ commands.c  26 Nov 2007 16:20:40 -0000
@@ -2021,6 +2021,7 @@
 
        case NODESTATE_LEAVING:
                node->state = NODESTATE_DEAD;
+               memset(&node->port_bits, 0, sizeof(node->port_bits));
                cluster_members--;
 
                if ((node->leave_reason & 0xF) & CLUSTER_LEAVEFLAG_REMOVED) 
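
For context, a hedged illustration of the invariant this enforces (helper
names and the exact layout are assumed for illustration; only node->port_bits
comes from the diff above): port_bits is a per-node bitmap with one bit per
cman port, and cman_is_listening() reports a peer's bit.  Every transition to
NODESTATE_DEAD must clear the bitmap, whether the node was fenced or left
cleanly, or a rebooted node inherits its old bits:

    /* Assumed helper shapes, for illustration only. */
    static void set_port_bit(struct cluster_node *node, uint8_t port)
    {
            node->port_bits[port / 8] |= 1 << (port % 8);
    }

    static int get_port_bit(struct cluster_node *node, uint8_t port)
    {
            return (node->port_bits[port / 8] >> (port % 8)) & 1;
    }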
Comment 26 Denis Brækhus 2008-03-27 11:23:22 EDT
(In reply to comment #24)
> http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.x86_64.rpm

I have tested this package in a 2 node cluster which did exhibit the problem
behaviour before. The updated cman fixed the issue.
 

Comment 27 Corey Marthaler 2008-03-27 13:41:36 EDT
Marking this verified per comment #26.
Comment 29 errata-xmlrpc 2008-05-21 11:58:05 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html
