Bug 327721 - failed RG status change while relocating to preferred failover node
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.0
Platform: All Linux
Priority: low   Severity: low
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Blocks: 399681
Reported: 2007-10-11 10:24 EDT by Corey Marthaler
Modified: 2010-10-22 15:25 EDT

Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:58:05 EDT

Attachments
  Fix for cman; currently testing (541 bytes, patch) - 2007-11-19 12:42 EST, Lon Hohberger
  Test program (4.31 KB, application/octet-stream) - 2007-11-19 14:57 EST, Lon Hohberger
  Here's my cluster.conf. (4.14 KB, text/plain) - 2007-11-26 10:01 EST, Shih-Hsin Yang

Description Corey Marthaler 2007-10-11 10:24:41 EDT
Description of problem:
Lon and I have been seeing this lately while recovering/relocating NFS/GFS services.

Oct  9 14:44:09 taft-03 clurgmgrd[9330]: <notice> Relocating service:TAFT HA GFS to better node taft-02.lab.msp.redhat.com
Oct  9 14:44:09 taft-03 clurgmgrd[9330]: <notice> Stopping service service:TAFT HA GFS
Oct  9 14:44:13 taft-03 kernel: dlm: connecting to 1
Oct  9 14:44:13 taft-03 kernel: dlm: got connection from 1
Oct  9 14:44:24 taft-03 clurgmgrd[9330]: <err> #52: Failed changing RG status

Is this something that should be retried?

Version-Release number of selected component (if applicable):
2.6.18-52.el5
rgmanager-2.0.31-1.el5
Comment 1 RHEL Product and Program Management 2007-10-15 23:35:04 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 2 Lon Hohberger 2007-10-18 16:55:44 EDT
I think I've reproduced this!
Comment 3 Lon Hohberger 2007-11-14 14:15:11 EST
AFAIK this happens after a node rejoins the cluster after being fenced.
Comment 4 Shih-Hsin Yang 2007-11-16 16:27:10 EST
My cluster config is shown below.

                        <failoverdomain name="oracle" ordered="1" restricted="1">
                                <failoverdomainnode name="vsvr1.kfb.or.kr" priority="1"/>
                                <failoverdomainnode name="vsvr2.kfb.or.kr" priority="2"/>
                        </failoverdomain>

                <vm autostart="1" domain="oracle" exclusive="0" name="oracle" path="/etc/xen" recovery="restart"/>


[root@vsvr1 ~]# ls -la /etc/xen/oracle
lrwxrwxrwx 1 root root 18 Nov  8 16:53 /etc/xen/oracle -> /xen/config/oracle

[root@vsvr2 ~]# mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
none on /sys/kernel/config type configfs (rw)
/dev/mapper/VG--CX3-dom0.xen on /xen type gfs (rw,noatime,hostdata=jid=1:id=65537:first=0)
[root@vsvr2 ~]# ls -la /xen/config/oracle
-rw------- 1 root root 474 Nov  9 11:46 /xen/config/oracle

When vsvr1.kfb.or.kr failed (system crash), the VM was able to start on
vsvr2.kfb.or.kr, but after vsvr1 recovered, the VM did not fail back, and the
error messages below appeared in /var/log/messages.

Nov 17 05:00:37 vsvr2 clurgmgrd[10076]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
Nov 17 05:00:42 vsvr2 kernel: dlm: connecting to 1
Nov 17 05:00:52 vsvr2 clurgmgrd[10076]: <err> #75: Failed changing service status
Comment 5 Lon Hohberger 2007-11-19 11:18:02 EST
Ok - there's a bug here in cman where the port bits aren't getting zeroed out on
node death.  This causes rgmanager to try to talk to the node after it reboots
but before rgmanager is restarted on that node.
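
To make the failure mode concrete, here is a minimal, hypothetical sketch (not
actual rgmanager source) of the kind of check that goes stale.  The port number
is a placeholder, not rgmanager's real port:

    /* Hypothetical sketch: why stale port bits matter.
     * Build with: gcc -o check check.c -lcman */
    #include <libcman.h>

    #define PEER_PORT 177   /* placeholder port number for illustration */

    /* Ask cman whether 'nodeid' has PEER_PORT open.  With this bug,
     * cman keeps answering "yes" for a fenced node that has rebooted
     * but not yet restarted the listener, so the caller sends a status
     * update that is never acknowledged and times out ("#52"/"#75"). */
    int peer_can_receive(cman_handle_t ch, int nodeid)
    {
            return cman_is_listening(ch, nodeid, PEER_PORT) > 0;
    }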
Comment 6 Lon Hohberger 2007-11-19 12:42:50 EST
Created attachment 263721 [details]
Fix for cman; currently testing
Comment 7 Lon Hohberger 2007-11-19 12:49:59 EST
The patch worked for me.  rgmanager gets a spurious node event, but it's
inconsequential and doesn't seem to cause problems.  I'm filing a separate bug
for that one.
Comment 8 Lon Hohberger 2007-11-19 14:57:50 EST
Created attachment 263901 [details]
Test program

This program illustrates the bug.  Copy this to a two node cluster and run the
following on both nodes:

   tar -xzvf bz327721-test.tar.gz
   cd bz327721-test ; make
   ./cman_port_bug

You will see output like this:

  Node 2 listening on port 192
  Node 1 listening on port 192

Leave cman_port_bug running on both and hard-reboot one of the nodes (i.e.
"reboot -fn").  Ensure it is fenced and removed from the cluster correctly.  On
the surviving node, you will see output like this (the node ID may change):

  Node 1 offline (while listening)

Restart the node you rebooted and start cman.  You will see this line:

  Node 1 online

WITHOUT the patch, you will also see the following error:

  ERROR: cman reports node 1 is listening, but no PORTOPENED event was received!

WITH the patch, the above error will *not* appear (correct behavior).
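
For reference, here is a rough sketch of how such a test can be written against
the libcman API.  This is an illustration only; the actual attached program may
be structured differently, and the audited node-ID range is an assumption:

    /*
     * Sketch of a libcman port-bits test.  Each node opens cman port
     * 192 and records PORTOPENED/PORTCLOSED notifications.  After each
     * event it asks cman_is_listening() about the other nodes and
     * flags any node cman claims is listening without a PORTOPENED
     * event having been received.
     * Build with: gcc -o cman_port_bug cman_port_bug.c -lcman
     */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <libcman.h>

    #define TEST_PORT 192
    #define MAX_NODES 256

    static cman_handle_t ch;
    static int port_open[MAX_NODES];  /* 1 = PORTOPENED seen for node */

    static void event_cb(cman_handle_t h, void *priv, int reason, int arg)
    {
            if (arg <= 0 || arg >= MAX_NODES)
                    return;
            if (reason == CMAN_REASON_PORTOPENED) {
                    printf("Node %d listening on port %d\n", arg, TEST_PORT);
                    port_open[arg] = 1;
            } else if (reason == CMAN_REASON_PORTCLOSED) {
                    printf("Node %d closed port %d\n", arg, TEST_PORT);
                    port_open[arg] = 0;
            }
    }

    static void data_cb(cman_handle_t h, void *priv, char *buf, int len,
                        uint8_t port, int nodeid)
    {
            /* No payload is exchanged; opening the port is the test. */
    }

    int main(void)
    {
            cman_node_t me;
            int nodeid;

            ch = cman_init(NULL);
            if (!ch) {
                    perror("cman_init");
                    return 1;
            }
            memset(&me, 0, sizeof(me));
            cman_get_node(ch, CMAN_NODEID_US, &me);

            cman_start_notification(ch, event_cb);
            cman_start_recv_data(ch, data_cb, TEST_PORT); /* open port 192 */

            for (;;) {
                    cman_dispatch(ch, CMAN_DISPATCH_BLOCKING);
                    /* Audit an assumed node-ID range after each event. */
                    for (nodeid = 1; nodeid < 16; nodeid++) {
                            if (nodeid == me.cn_nodeid)
                                    continue;
                            if (cman_is_listening(ch, nodeid, TEST_PORT) > 0
                                && !port_open[nodeid])
                                    printf("ERROR: cman reports node %d is "
                                           "listening, but no PORTOPENED "
                                           "event was received!\n", nodeid);
                    }
            }
    }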
Comment 10 Shih-Hsin Yang 2007-11-19 20:35:40 EST
Do you mean that patch is not final?

If you think this patch will work for me, please build me a cman package.
I'm using cman-2.0.73-1.el5_1.1 for x86_64.
Comment 11 Christine Caulfield 2007-11-20 04:23:00 EST
That patch is fine, Lon. Thanks.

Checked into the RHEL5 branch (-rRHEL5):

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.2.14; previous revision: 1.55.2.13
done
Comment 16 Shih-Hsin Yang 2007-11-24 10:05:59 EST
Hi Lon,

I tested your test package, but it didn't work for me.

Failback test:
  - vsvr1 crashed (echo 'c' > /proc/sysrq-trigger)
  - vsvr2 fenced vsvr1 --> VM (oracle) started on vsvr2
  - after vsvr1 rejoined the cluster, failback failed
  - but a manual migration from vsvr2 to vsvr1 succeeded


[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started        

[root@vsvr1 ~]# echo 'c' > /proc/sysrq-trigger

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr2.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         

Nov 24 23:01:04 vsvr2 clurgmgrd[7523]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
Nov 24 23:01:09 vsvr2 kernel: dlm: connecting to 1
Nov 24 23:01:19 vsvr2 clurgmgrd[7523]: <err> #75: Failed changing service status

[root@vsvr2 ~]# clusvcadm -M vm:oracle -m vsvr1.kfb.or.kr
Trying to migrate vm:oracle to vsvr1.kfb.or.kr...Success

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                migrating       
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         
Comment 17 Lon Hohberger 2007-11-26 09:07:41 EST
That's strange - it worked in my testing.  I'll try some more tests today.
Comment 18 Lon Hohberger 2007-11-26 09:16:13 EST
I need your cluster.conf.
Comment 19 Shih-Hsin Yang 2007-11-26 10:01:04 EST
Created attachment 268981 [details]
Here's my cluster.conf.
Comment 20 Lon Hohberger 2007-11-26 11:20:48 EST
It looks like there's another place where we needed to zap the port bits:

diff -u -r1.55.2.14 commands.c
--- commands.c  20 Nov 2007 09:21:51 -0000      1.55.2.14
+++ commands.c  26 Nov 2007 16:20:40 -0000
@@ -2021,6 +2021,7 @@
 
        case NODESTATE_LEAVING:
                node->state = NODESTATE_DEAD;
+               memset(&node->port_bits, 0, sizeof(node->port_bits));
                cluster_members--;
 
                if ((node->leave_reason & 0xF) & CLUSTER_LEAVEFLAG_REMOVED) 
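
For context, a hedged illustration of the invariant this enforces (helper
names and the exact layout are assumed for illustration; only node->port_bits
comes from the diff above): port_bits is a per-node bitmap with one bit per
cman port, and cman_is_listening() reports a peer's bit.  Every transition to
NODESTATE_DEAD must clear the bitmap, whether the node was fenced or left
cleanly, or a rebooted node inherits its old bits:

    /* Assumed helper shapes, for illustration only. */
    static void set_port_bit(struct cluster_node *node, uint8_t port)
    {
            node->port_bits[port / 8] |= 1 << (port % 8);
    }

    static int get_port_bit(struct cluster_node *node, uint8_t port)
    {
            return (node->port_bits[port / 8] >> (port % 8)) & 1;
    }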
Comment 26 Denis Brækhus 2008-03-27 11:23:22 EDT
(In reply to comment #24)
> http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.x86_64.rpm

I have tested this package in a 2 node cluster which did exhibit the problem
behaviour before. The updated cman fixed the issue.
 

Comment 27 Corey Marthaler 2008-03-27 13:41:36 EDT
Marking this verified per comment #26.
Comment 29 errata-xmlrpc 2008-05-21 11:58:05 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html
