Bug 327721

Summary: failed RG status change while relocating to preferred failover node
Product: Red Hat Enterprise Linux 5
Component: cman
Version: 5.0
Hardware: All
OS: Linux
Reporter: Corey Marthaler <cmarthal>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, huafusheng, jos, syang
Status: CLOSED ERRATA
Severity: low
Priority: low
Doc Type: Bug Fix
Fixed In Version: RHBA-2008-0347
Last Closed: 2008-05-21 15:58:05 UTC
Bug Blocks: 399681
Attachments:
  Fix for cman; currently testing
  Test program
  Here's my cluster.conf.

Description Corey Marthaler 2007-10-11 14:24:41 UTC
Description of problem:
Lon and I have been seeing this lately while recovering/relocating nfs/gfs services.

Oct  9 14:44:09 taft-03 clurgmgrd[9330]: <notice> Relocating service:TAFT HA GFS to better node taft-02.lab.msp.redhat.com
Oct  9 14:44:09 taft-03 clurgmgrd[9330]: <notice> Stopping service service:TAFT HA GFS
Oct  9 14:44:13 taft-03 kernel: dlm: connecting to 1
Oct  9 14:44:13 taft-03 kernel: dlm: got connection from 1
Oct  9 14:44:24 taft-03 clurgmgrd[9330]: <err> #52: Failed changing RG status

Is this something that should be retried?

Version-Release number of selected component (if applicable):
2.6.18-52.el5
rgmanager-2.0.31-1.el5

Comment 1 RHEL Program Management 2007-10-16 03:35:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Lon Hohberger 2007-10-18 20:55:44 UTC
I think I've reproduced this!

Comment 3 Lon Hohberger 2007-11-14 19:15:11 UTC
AFAIK this happens after a node rejoins the cluster after being fenced.

Comment 4 Shih-Hsin Yang 2007-11-16 21:27:10 UTC
My cluster config is shown below.

                        <failoverdomain name="oracle" ordered="1" restricted="1">
                                <failoverdomainnode name="vsvr1.kfb.or.kr" priority="1"/>
                                <failoverdomainnode name="vsvr2.kfb.or.kr" priority="2"/>
                        </failoverdomain>

                <vm autostart="1" domain="oracle" exclusive="0" name="oracle" path="/etc/xen" recovery="restart"/>


[root@vsvr1 ~]# ls -la /etc/xen/oracle
lrwxrwxrwx 1 root root 18 Nov  8 16:53 /etc/xen/oracle -> /xen/config/oracle

[root@vsvr2 ~]# mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
none on /sys/kernel/config type configfs (rw)
/dev/mapper/VG--CX3-dom0.xen on /xen type gfs (rw,noatime,hostdata=jid=1:id=65537:first=0)
[root@vsvr2 ~]# ls -la /xen/config/oracle
-rw------- 1 root root 474 Nov  9 11:46 /xen/config/oracle

When vsvr1.kfb.or.kr failed (system crash), the VM was able to start on
vsvr2.kfb.or.kr. But after vsvr1 recovered, the VM did not fail back; the
error messages below appeared in /var/log/messages.

Nov 17 05:00:37 vsvr2 clurgmgrd[10076]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
Nov 17 05:00:42 vsvr2 kernel: dlm: connecting to 1
Nov 17 05:00:52 vsvr2 clurgmgrd[10076]: <err> #75: Failed changing service status

Comment 5 Lon Hohberger 2007-11-19 16:18:02 UTC
Ok - there's a bug here in cman where the port bits aren't getting zeroed out on
node death.  This causes rgmanager to try to talk to the node after it reboots
but before rgmanager is restarted on that node.
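
For reference, a minimal sketch of the kind of check that trips over this (not
rgmanager's actual code; port 192 here is simply the port the test program in
comment 8 uses, and error handling is trimmed). As I understand the libcman
API, cman_is_listening() answers from cman's per-node port bits, so if those
bits are not cleared when a node dies, the check keeps saying "listening"
after the node is fenced, rebooted, and rejoins, but before any daemon on it
has re-opened the port:

#include <stdio.h>
#include <stdint.h>
#include <libcman.h>

/* Ask cman whether <nodeid> currently has <port> open.  The answer comes
 * from cman's per-node port bits, so if those bits are not cleared when a
 * node dies, this keeps returning "listening" for a node that was fenced
 * and has rejoined, even though nothing there has re-bound the port. */
int peer_is_listening(int nodeid, uint8_t port)
{
        cman_handle_t ch = cman_init(NULL);
        int ret;

        if (!ch)
                return -1;
        ret = cman_is_listening(ch, nodeid, port);  /* 1 = listening, 0 = not */
        cman_finish(ch);
        return ret;
}

int main(void)
{
        printf("node 1, port 192: %d\n", peer_is_listening(1, 192));
        return 0;
}

Build with something like "gcc -o portcheck portcheck.c -lcman" on a cluster
member with cman running.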

Comment 6 Lon Hohberger 2007-11-19 17:42:50 UTC
Created attachment 263721 [details]
Fix for cman; currently testing

Comment 7 Lon Hohberger 2007-11-19 17:49:59 UTC
The patch worked for me.  rgmanager gets a spurious node event, but it's
inconsequential and doesn't seem to cause problems.  I'm filing a separate bug
for that one.

Comment 8 Lon Hohberger 2007-11-19 19:57:50 UTC
Created attachment 263901 [details]
Test program

This program illustrates the bug.  Copy this to a two node cluster and run the
following on both nodes:

   tar -xzvf bz327721-test.tar.gz
   cd bz327721-test ; make
   ./cman_port_bug

You will see output like this:

  Node 2 listening on port 192
  Node 1 listening on port 192

Leave cman_port_bug running on both and hard-reboot one of the nodes (i.e.
"reboot -fn"). Ensure it is fenced and removed from the cluster correctly.  On
the surviving node, you will see output like this (the node ID may change):

  Node 1 offline (while listening)

Restart the node you rebooted and start cman.  You will see this line:

  Node 1 online

WITHOUT the patch, you will also see the following error:

  ERROR: cman reports node 1 is listening, but no PORTOPENED event was
received!

WITH the patch, the above error will *not* appear (correct behavior).
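
For context, here is a sketch of the kind of cross-check cman_port_bug makes.
This is not the attached program's actual source; it assumes libcman's
notification API delivers the remote node ID in the callback's 'arg' parameter
for port events, and it reuses port 192 / node 1 from the output above:

#include <stdio.h>
#include <string.h>
#include <libcman.h>

#define TEST_PORT   192
#define PEER_NODEID 1
#define MAX_NODES   256

static int saw_portopened[MAX_NODES];   /* indexed by node ID */

static void event_cb(cman_handle_t ch, void *priv, int reason, int arg)
{
        if (arg <= 0 || arg >= MAX_NODES)
                return;

        switch (reason) {
        case CMAN_REASON_PORTOPENED:    /* remote node opened our port */
                saw_portopened[arg] = 1;
                printf("Node %d listening on port %d\n", arg, TEST_PORT);
                break;
        case CMAN_REASON_PORTCLOSED:    /* remote node closed it (or died) */
                saw_portopened[arg] = 0;
                break;
        default:
                break;
        }
}

int main(void)
{
        cman_handle_t ch = cman_init(NULL);

        if (!ch) {
                perror("cman_init");
                return 1;
        }

        memset(saw_portopened, 0, sizeof(saw_portopened));
        cman_bind(ch, TEST_PORT);               /* open our side of the port */
        cman_start_notification(ch, event_cb);

        for (;;) {
                /* Wait for the next cluster event, then compare what cman
                 * *claims* about the peer's port with the PORTOPENED events
                 * we actually received.  Without the fix, a fenced node that
                 * rejoins shows up as listening with no PORTOPENED event. */
                cman_dispatch(ch, CMAN_DISPATCH_ONE | CMAN_DISPATCH_BLOCKING);

                if (cman_is_listening(ch, PEER_NODEID, TEST_PORT) == 1 &&
                    !saw_portopened[PEER_NODEID])
                        printf("ERROR: cman reports node %d is listening, but "
                               "no PORTOPENED event was received!\n",
                               PEER_NODEID);
        }

        /* not reached */
        cman_finish(ch);
        return 0;
}

Link with -lcman. With an unpatched cman, the ERROR line prints as soon as the
rebooted node's cman rejoins, because the stale port bits make
cman_is_listening() return 1 before anything on that node has re-bound the
port.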

Comment 10 Shih-Hsin Yang 2007-11-20 01:35:40 UTC
Do you mean that patch is not final?

If you think this patch will work for me, please build me a cman package.
I'm using cman-2.0.73-1.el5_1.1 for x86_64.

Comment 11 Christine Caulfield 2007-11-20 09:23:00 UTC
That patch is fine, Lon. Thanks.

Checked in to -rRHEL5

Checking in commands.c;
/cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
new revision: 1.55.2.14; previous revision: 1.55.2.13
done


Comment 16 Shih-Hsin Yang 2007-11-24 15:05:59 UTC
Hi Lon,

I tested your test package, but it didn't work for me.

Failback test:
  vsvr1 crashed (echo 'c' > /proc/sysrq-trigger)
  vsvr2 fenced vsvr1 --> VM (oracle) started on vsvr2
  after vsvr1 rejoined the cluster, failback failed
  but manual migration from vsvr2 to vsvr1 succeeded


[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started        

[root@vsvr1 ~]# echo 'c' > /proc/sysrq-trigger

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr2.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         

Nov 24 23:01:04 vsvr2 clurgmgrd[7523]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
Nov 24 23:01:09 vsvr2 kernel: dlm: connecting to 1
Nov 24 23:01:19 vsvr2 clurgmgrd[7523]: <err> #75: Failed changing service status

[root@vsvr2 ~]# clusvcadm -M vm:oracle -m vsvr1.kfb.or.kr
Trying to migrate vm:oracle to vsvr1.kfb.or.kr...Success

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                migrating       
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         

[root@vsvr3 ~]# clustat
Member Status: Quorate

  Member Name                        ID   Status
  ------ ----                        ---- ------
  vsvr1.kfb.or.kr                       1 Online, rgmanager
  vsvr2.kfb.or.kr                       2 Online, rgmanager
  vsvr3.kfb.or.kr                       3 Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  service:dhcp         vsvr3.kfb.or.kr                started         
  vm:oracle            vsvr1.kfb.or.kr                started         
  vm:smartflow         vsvr2.kfb.or.kr                started         
  vm:dns               vsvr3.kfb.or.kr                started         
  vm:mail              vsvr3.kfb.or.kr                started         


Comment 17 Lon Hohberger 2007-11-26 14:07:41 UTC
That's strange - it worked in my testing.  I'll try some more tests today.

Comment 18 Lon Hohberger 2007-11-26 14:16:13 UTC
I need your cluster.conf.

Comment 19 Shih-Hsin Yang 2007-11-26 15:01:04 UTC
Created attachment 268981 [details]
Here's my cluster.conf.

Comment 20 Lon Hohberger 2007-11-26 16:20:48 UTC
It looks like there's another place where we needed to zap the port bits:

diff -u -r1.55.2.14 commands.c
--- commands.c  20 Nov 2007 09:21:51 -0000      1.55.2.14
+++ commands.c  26 Nov 2007 16:20:40 -0000
@@ -2021,6 +2021,7 @@
 
        case NODESTATE_LEAVING:
                node->state = NODESTATE_DEAD;
+               memset(&node->port_bits, 0, sizeof(node->port_bits));
                cluster_members--;
 
                if ((node->leave_reason & 0xF) & CLUSTER_LEAVEFLAG_REMOVED) 
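
Put another way, every path that marks a node dead also has to wipe its
remembered port bits, or peers keep acting on stale "listening" state after
the node rejoins. A hypothetical consolidation of that pattern, using the
names from the diff above (this is not the committed change, and it assumes
the internal declarations from cman's commands.c):

/* Hypothetical helper, not the committed fix: any transition to
 * NODESTATE_DEAD should also clear the node's port bits so that
 * cman_is_listening() stops reporting ports as open for a dead node. */
static void mark_node_dead(struct cluster_node *node)
{
        node->state = NODESTATE_DEAD;
        memset(&node->port_bits, 0, sizeof(node->port_bits));
        cluster_members--;
}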

Comment 26 Denis Brækhus 2008-03-27 15:23:22 UTC
(In reply to comment #24)
> http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.x86_64.rpm

I have tested this package on a two-node cluster that exhibited the problem
behaviour before. The updated cman fixed the issue.

Comment 27 Corey Marthaler 2008-03-27 17:41:36 UTC
Marking this verified per comment #26.

Comment 29 errata-xmlrpc 2008-05-21 15:58:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html