Bug 327721
Summary:           failed RG status change while relocating to preferred failover node
Product:           Red Hat Enterprise Linux 5
Component:         cman
Version:           5.0
Hardware:          All
OS:                Linux
Status:            CLOSED ERRATA
Severity:          low
Priority:          low
Reporter:          Corey Marthaler <cmarthal>
Assignee:          Lon Hohberger <lhh>
QA Contact:        Cluster QE <mspqa-list>
CC:                ccaulfie, cluster-maint, huafusheng, jos, syang
Target Milestone:  ---
Target Release:    ---
Fixed In Version:  RHBA-2008-0347
Doc Type:          Bug Fix
Bug Blocks:        399681
Last Closed:       2008-05-21 15:58:05 UTC
Description
Corey Marthaler
2007-10-11 14:24:41 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

I think I've reproduced this! AFAIK this happens after a node rejoins the cluster after being fenced. My cluster configuration looks like this:

    <failoverdomain name="oracle" ordered="1" restricted="1">
        <failoverdomainnode name="vsvr1.kfb.or.kr" priority="1"/>
        <failoverdomainnode name="vsvr2.kfb.or.kr" priority="2"/>
    </failoverdomain>

    <vm autostart="1" domain="oracle" exclusive="0" name="oracle" path="/etc/xen" recovery="restart"/>

    [root@vsvr1 ~]# ls -la /etc/xen/oracle
    lrwxrwxrwx 1 root root 18 Nov  8 16:53 /etc/xen/oracle -> /xen/config/oracle

    [root@vsvr2 ~]# mount
    /dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    /dev/sda1 on /boot type ext3 (rw)
    tmpfs on /dev/shm type tmpfs (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
    none on /sys/kernel/config type configfs (rw)
    /dev/mapper/VG--CX3-dom0.xen on /xen type gfs (rw,noatime,hostdata=jid=1:id=65537:first=0)

    [root@vsvr2 ~]# ls -la /xen/config/oracle
    -rw------- 1 root root 474 Nov  9 11:46 /xen/config/oracle

When vsvr1.kfb.or.kr failed (system crash), the VM started on vsvr2.kfb.or.kr. But after vsvr1 recovered, the VM did not fail back, and the following error messages appeared in /var/log/messages:

    Nov 17 05:00:37 vsvr2 clurgmgrd[10076]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
    Nov 17 05:00:42 vsvr2 kernel: dlm: connecting to 1
    Nov 17 05:00:52 vsvr2 clurgmgrd[10076]: <err> #75: Failed changing service status

Ok - there's a bug here in cman where the port bits aren't getting zeroed out on node death. This causes rgmanager to try to talk to the node after it reboots but before rgmanager is restarted on that node.

Created attachment 263721 [details]
Fix for cman; currently testing
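The attachment itself is not reproduced in this report, but based on the diff quoted further down, the fix boils down to clearing a node's port bitmap whenever that node transitions to the dead state. The sketch below is a simplified, assumed reconstruction of that pattern: the struct layout, PORT_BITS_SIZE, and the function name mark_node_dead are illustrative, and this is not the actual cman/daemon/commands.c code.

```c
/*
 * Simplified sketch of the fix pattern, assuming the cluster_node layout
 * implied by the diff quoted later in this bug (state, port_bits).  The
 * struct definition, PORT_BITS_SIZE, and mark_node_dead() are
 * illustrative; this is not the actual cman/daemon/commands.c code.
 */
#include <string.h>

#define PORT_BITS_SIZE 32   /* one bit per port, 256 ports: an assumption */

enum node_state {
    NODESTATE_MEMBER,
    NODESTATE_LEAVING,
    NODESTATE_DEAD,
};

struct cluster_node {
    enum node_state state;
    unsigned char port_bits[PORT_BITS_SIZE]; /* "which ports are open" bitmap */
    /* ... other members elided ... */
};

/* Called when membership decides a node has left or been fenced. */
static void mark_node_dead(struct cluster_node *node)
{
    node->state = NODESTATE_DEAD;

    /*
     * The fix: forget which ports the dead node had open.  Without this,
     * cman keeps answering "yes, that port is open" for the fenced node,
     * so rgmanager on the surviving nodes tries to talk to it as soon as
     * it rejoins, before rgmanager has actually restarted there, and the
     * status change fails with error #75.
     */
    memset(&node->port_bits, 0, sizeof(node->port_bits));
}
```

With the stale bits cleared, cman stops reporting the rebooted node's port as open until the peer actually reopens it and a fresh port-open event is generated.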
The patch worked for me. rgmanager gets a spurious node event, but it's inconsequential and doesn't seem to cause problems. I'm filing a separate bug for that one.

Created attachment 263901 [details]
Test program

This program illustrates the bug. Copy it to a two-node cluster and run the following on both nodes:

    tar -xzvf bz327721-test.tar.gz
    cd bz327721-test ; make
    ./cman_port_bug

You will see output like this:

    Node 2 listening on port 192
    Node 1 listening on port 192

Leave cman_port_bug running on both and hard-reboot one of the nodes (i.e. "reboot -fn"). Ensure it is fenced and removed from the cluster correctly. On the surviving node, you will see output like this (the node ID may change):

    Node 1 offline (while listening)

Restart the node you rebooted and start cman. You will see this line:

    Node 1 online

WITHOUT the patch, you will also see the following error:

    ERROR: cman reports node 1 is listening, but no PORTOPENED event was received!

WITH the patch, the above error will *not* appear (correct behavior).

You mean that patch is not final? If you think this patch will work for me, please build me a cman package. I'm using cman-2.0.73-1.el5_1.1 for x86_64.

That patch is fine.

lon, Thanks.

Checked in to -rRHEL5:

    Checking in commands.c;
    /cvs/cluster/cluster/cman/daemon/commands.c,v  <--  commands.c
    new revision: 1.55.2.14; previous revision: 1.55.2.13
    done
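The attached tarball is not included in this report, so here is a rough, hedged approximation of what a reproducer along those lines can look like. It is built on the cman 2.0 libcman calls (cman_init, cman_start_recv_data, cman_start_notification, cman_dispatch, cman_is_listening, cman_get_node); the port number matches the output quoted above, but the polling loop, the saw_portopened bookkeeping, and the exact callback reason codes are assumptions, not a copy of the attached cman_port_bug program.

```c
/*
 * Hedged approximation of a reproducer for this bug, NOT the attached
 * bz327721-test program.  It assumes the cman 2.0 libcman API and the
 * CMAN_REASON_PORTOPENED / CMAN_REASON_PORTCLOSED / CMAN_REASON_STATECHANGE
 * callback reasons; treat the details as assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <libcman.h>

#define TEST_PORT 192        /* same port as in the output quoted above */

static int watched_node = 1; /* node ID we expect PORTOPENED events from */
static int saw_portopened;   /* did we see PORTOPENED since it last died? */

/* Receive callback: needed so cman marks our port as open; payload ignored. */
static void data_cb(cman_handle_t h, void *priv, char *buf, int len,
                    uint8_t port, int nodeid)
{
}

/* Event callback: track port events and forget state when the node dies. */
static void event_cb(cman_handle_t h, void *priv, int reason, int arg)
{
    cman_node_t node;

    switch (reason) {
    case CMAN_REASON_PORTOPENED:
        if (arg == watched_node)
            saw_portopened = 1;
        break;
    case CMAN_REASON_PORTCLOSED:
        if (arg == watched_node)
            saw_portopened = 0;
        break;
    case CMAN_REASON_STATECHANGE:
        /* Membership changed: if the watched node is gone, expect a
         * fresh PORTOPENED when it comes back. */
        memset(&node, 0, sizeof(node));
        if (cman_get_node(h, watched_node, &node) == 0 && !node.cn_member)
            saw_portopened = 0;
        break;
    }
}

int main(int argc, char **argv)
{
    cman_handle_t ch;

    if (argc > 1)
        watched_node = atoi(argv[1]);

    ch = cman_init(NULL);
    if (!ch) {
        perror("cman_init");
        return 1;
    }

    cman_start_recv_data(ch, data_cb, TEST_PORT);  /* open our own port */
    cman_start_notification(ch, event_cb);

    /* If the peer is already listening when we start, that is fine. */
    saw_portopened = (cman_is_listening(ch, watched_node, TEST_PORT) == 1);

    for (;;) {
        cman_dispatch(ch, CMAN_DISPATCH_ALL);

        /* The inconsistency this bug is about: cman's port bitmap still
         * says the node is listening, but no PORTOPENED event arrived
         * since the node was fenced and rejoined. */
        if (cman_is_listening(ch, watched_node, TEST_PORT) == 1 &&
            !saw_portopened)
            printf("ERROR: cman reports node %d listening on port %d, "
                   "but no PORTOPENED event was received!\n",
                   watched_node, TEST_PORT);

        sleep(1);
    }
}
```

Building it would need the headers and library from cman-devel, e.g. something like `gcc -o cman_port_bug cman_port_bug.c -lcman` (exact include and library paths depend on the package layout).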
Hi lon, I tested your test package, but it didn't work for me.

Failback test: vsvr1 crashes (echo 'c' > /proc/sysrq-trigger) and vsvr2 fences vsvr1, after which the VM (oracle) starts on vsvr2. After vsvr1 rejoins the cluster, failback fails, but a manual migration from vsvr2 to vsvr1 succeeds.

    [root@vsvr3 ~]# clustat
    Member Status: Quorate

    Member Name                  ID   Status
    ------ ----                  ---- ------
    vsvr1.kfb.or.kr              1    Online, rgmanager
    vsvr2.kfb.or.kr              2    Online, rgmanager
    vsvr3.kfb.or.kr              3    Online, Local, rgmanager

    Service Name                 Owner (Last)                 State
    ------- ----                 ----- ------                 -----
    service:dhcp                 vsvr3.kfb.or.kr              started
    vm:oracle                    vsvr1.kfb.or.kr              started
    vm:smartflow                 vsvr2.kfb.or.kr              started
    vm:dns                       vsvr3.kfb.or.kr              started
    vm:mail                      vsvr3.kfb.or.kr              started

    [root@vsvr1 ~]# echo 'c' > /proc/sysrq-trigger

    [root@vsvr3 ~]# clustat
    Member Status: Quorate

    Member Name                  ID   Status
    ------ ----                  ---- ------
    vsvr1.kfb.or.kr              1    Online, rgmanager
    vsvr2.kfb.or.kr              2    Online, rgmanager
    vsvr3.kfb.or.kr              3    Online, Local, rgmanager

    Service Name                 Owner (Last)                 State
    ------- ----                 ----- ------                 -----
    service:dhcp                 vsvr3.kfb.or.kr              started
    vm:oracle                    vsvr2.kfb.or.kr              started
    vm:smartflow                 vsvr2.kfb.or.kr              started
    vm:dns                       vsvr3.kfb.or.kr              started
    vm:mail                      vsvr3.kfb.or.kr              started

    Nov 24 23:01:04 vsvr2 clurgmgrd[7523]: <notice> Migrating vm:oracle to better node vsvr1.kfb.or.kr
    Nov 24 23:01:09 vsvr2 kernel: dlm: connecting to 1
    Nov 24 23:01:19 vsvr2 clurgmgrd[7523]: <err> #75: Failed changing service status

    [root@vsvr2 ~]# clusvcadm -M vm:oracle -m vsvr1.kfb.or.kr
    Trying to migrate vm:oracle to vsvr1.kfb.or.kr...Success

    [root@vsvr3 ~]# clustat
    Member Status: Quorate

    Member Name                  ID   Status
    ------ ----                  ---- ------
    vsvr1.kfb.or.kr              1    Online, rgmanager
    vsvr2.kfb.or.kr              2    Online, rgmanager
    vsvr3.kfb.or.kr              3    Online, Local, rgmanager

    Service Name                 Owner (Last)                 State
    ------- ----                 ----- ------                 -----
    service:dhcp                 vsvr3.kfb.or.kr              started
    vm:oracle                    vsvr1.kfb.or.kr              migrating
    vm:smartflow                 vsvr2.kfb.or.kr              started
    vm:dns                       vsvr3.kfb.or.kr              started
    vm:mail                      vsvr3.kfb.or.kr              started

    [root@vsvr3 ~]# clustat
    Member Status: Quorate

    Member Name                  ID   Status
    ------ ----                  ---- ------
    vsvr1.kfb.or.kr              1    Online, rgmanager
    vsvr2.kfb.or.kr              2    Online, rgmanager
    vsvr3.kfb.or.kr              3    Online, Local, rgmanager

    Service Name                 Owner (Last)                 State
    ------- ----                 ----- ------                 -----
    service:dhcp                 vsvr3.kfb.or.kr              started
    vm:oracle                    vsvr1.kfb.or.kr              started
    vm:smartflow                 vsvr2.kfb.or.kr              started
    vm:dns                       vsvr3.kfb.or.kr              started
    vm:mail                      vsvr3.kfb.or.kr              started

That's strange - it worked in my testing. I'll try some more tests today. I need your cluster.conf.

Created attachment 268981 [details]
Here's my cluster.conf.
It looks like there's another place where we needed to zap the port bits:

    diff -u -r1.55.2.14 commands.c
    --- commands.c  20 Nov 2007 09:21:51 -0000      1.55.2.14
    +++ commands.c  26 Nov 2007 16:20:40 -0000
    @@ -2021,6 +2021,7 @@
         case NODESTATE_LEAVING:
             node->state = NODESTATE_DEAD;
    +        memset(&node->port_bits, 0, sizeof(node->port_bits));
             cluster_members--;

             if ((node->leave_reason & 0xF) & CLUSTER_LEAVEFLAG_REMOVED)

    http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.src.rpm
    http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.i386.rpm
    http://people.redhat.com/lhh/cman-devel-2.0.73-1.6.el5.test.bz327721.i386.rpm
    http://people.redhat.com/lhh/cman-debuginfo-2.0.73-1.6.el5.test.bz327721.i386.rpm
    http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.x86_64.rpm
    http://people.redhat.com/lhh/cman-devel-2.0.73-1.6.el5.test.bz327721.x86_64.rpm
    http://people.redhat.com/lhh/cman-debuginfo-2.0.73-1.6.el5.test.bz327721.x86_64.rpm
    http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.ia64.rpm
    http://people.redhat.com/lhh/cman-devel-2.0.73-1.6.el5.test.bz327721.ia64.rpm
    http://people.redhat.com/lhh/cman-debuginfo-2.0.73-1.6.el5.test.bz327721.ia64.rpm

Fixed packages. The old packages didn't have the correct patch applied.

I tested this using:

(a) an ordered failover domain in a 2-node virtual cluster using a service,
(b) an ordered failover domain in a 5-node cluster using a service, and
(c) an ordered failover domain in a 5-node cluster using a paravirtual machine (vm).

(In reply to comment #24)
> http://people.redhat.com/lhh/cman-2.0.73-1.6.el5.test.bz327721.x86_64.rpm

I have tested this package in a 2-node cluster which did exhibit the problem behaviour before. The updated cman fixed the issue.

Marking this verified per comment #26.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html