Bug 608397

Summary: rgmanager: fail to recover from clurgmgrd crash
Product: Red Hat Enterprise Linux 5
Component: rgmanager
Version: 5.5.z
Reporter: yeylon <yeylon>
Assignee: Lon Hohberger <lhh>
QA Contact: Brandon Perkins <bperkins>
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Hardware: All
OS: Linux
Target Milestone: rc
Keywords: ZStream
CC: bperkins, ccaulfie, cluster-maint, edamato, jkortus, jwest, srevivo, tdunnon, ykaul
Fixed In Version: rgmanager-2.0.52-6.10.el5
Doc Type: Bug Fix
Clones: 608709, 609181
Last Closed: 2011-01-13 23:27:02 UTC
Bug Blocks: 609182
Attachments:
  Make the rgmanager watchdog process kill CMAN if rgmanager (main) crashes (flags: ccaulfie: review+)
  Patch using Mike's proposal (no flags)

Description yeylon@redhat.com 2010-06-27 10:02:49 UTC
Description of problem:

I tried to verify that the cluster can recover from a clurgmgrd crash, so I killed the daemon with kill -9 `pidof -s clurgmgrd`. The result:

1. rgmanager on that host has stopped.
2. The service is still running on that host.
3. The VM appears to be running on the host, but it is not responding.

[root@green-vdsa ~]# ps -aux | grep kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root     24500 31.9 12.6 2316192 2074448 ?     Sl   12:30   6:16 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2048 -smp 1 -name RHEVM-HA -uuid 00000000-0000-0000-0000-000000000002 -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEVM-HA.pid -localtime -boot c -drive file=/dev/rhevm-cluster01/rhev-image,if=ide,index=0,boot=on,cache=none -drive file=/home/iso/windows_server_2008_r2.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=00:1a:4a:23:66:fd,vlan=0 -net tap,fd=18,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0 -k en-us

The VM is not being relocated, and neither is the VM service:

1. The VM does not respond to ping.
2. The host responds to ping, but I cannot connect to it in any way.

Bottom line: the cluster is not responding.

I would expect the cluster to recognize the failure and try to restart the VM service on the second node, and the failed node should be rebooted.
 



Additional info:

Jun 27 12:28:08 green-vdsa rhev-check.sh[24033]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:28:15 green-vdsa rhev-check.sh[24060]: <err> RHEV Status check on 10.35.72.253 failed; last HTTP code: 500
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> status on vm "RHEVM-HA" returned 1 (generic error)
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> Stopping service vm:RHEVM-HA
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Interface vnet0.IPv6 no longer relevant for mDNS.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Leaving mDNS multicast group on interface vnet0.IPv6 with address fe80::1c4a:7dff:fe35:d0c7.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Withdrawing address record for fe80::1c4a:7dff:fe35:d0c7 on vnet0.
Jun 27 12:30:16 green-vdsa kernel: device vnet0 left promiscuous mode
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA is recovering
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Recovering failed service vm:RHEVM-HA
Jun 27 12:30:21 green-vdsa kernel: device vnet0 entered promiscuous mode
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering learning state
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: topology change detected, propagating
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering forwarding state
Jun 27 12:30:21 green-vdsa rhev-check.sh[24528]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: New relevant interface vnet0.IPv6 for mDNS.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::4ced:b1ff:feac:210f.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Registering new address record for fe80::4ced:b1ff:feac:210f on vnet0.
Jun 27 12:30:51 green-vdsa rhev-check.sh[24593]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:30:56 green-vdsa rhev-check.sh[24605]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:15 green-vdsa rhev-check.sh[24653]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:20 green-vdsa rhev-check.sh[24666]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:21 green-vdsa rhev-check.sh[24685]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:26 green-vdsa rhev-check.sh[24695]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:26 green-vdsa rhev-check.sh[24714]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:31 green-vdsa rhev-check.sh[24727]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:10 green-vdsa rhev-check.sh[24807]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:32:15 green-vdsa rhev-check.sh[24818]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:19 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA started

Comment 1 yeylon@redhat.com 2010-06-27 10:37:48 UTC
rgmanager-2.0.52-6.el5_5.7

Comment 2 Lon Hohberger 2010-06-28 14:14:56 UTC
This looks like a kernel bug:

kvm: exiting hardware virtualization
Synchronizing SCSI cache for disk sdu: 
Synchronizing SCSI cache for disk sdt: 
Synchronizing SCSI cache for disk sds: 
Synchronizing SCSI cache for disk sdr: 
Synchronizing SCSI cache for disk sdq: 
Synchronizing SCSI cache for disk sdp: 
Synchronizing SCSI cache for disk sdo: 
Synchronizing SCSI cache for disk sdn: 
Synchronizing SCSI cache for disk sdm: 
Synchronizing SCSI cache for disk sdl: 
Synchronizing SCSI cache for disk sdk: 
Synchronizing SCSI cache for disk sdj: 
Synchronizing SCSI cache for disk sdi: 
Synchronizing SCSI cache for disk sdh: 
Synchronizing SCSI cache for disk sdg: 
Synchronizing SCSI cache for disk sdf: 
Synchronizing SCSI cache for disk sde: 
Synchronizing SCSI cache for disk sdd: 
Synchronizing SCSI cache for disk sdc: 
Synchronizing SCSI cache for disk sdb: 
Restarting system.
.
machine restart

I am still logged in to this machine.

Comment 3 Lon Hohberger 2010-06-28 14:16:02 UTC
This occurred after the syscall 'reboot(RB_AUTOBOOT)', which should never fail.

Comment 4 Lon Hohberger 2010-06-28 14:16:43 UTC
Jun 28 17:13:09 green-vdsa clurgmgrd[5472]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 17:13:09 green-vdsa kernel: md: stopping all md devices.

Comment 5 Lon Hohberger 2010-06-28 14:20:57 UTC
The machine was running one qemu-kvm instance.

As a crash-recovery measure, rgmanager has a watchdog process which reboots the host if the main rgmanager process fails unexpectedly. This causes the node to be fenced and rgmanager to recover the service on the other host.

When we kill the main rgmanager process, the watchdog process calls reboot(RB_AUTOBOOT). At that point the watchdog logged the message shown in comment #4, the kernel reported the messages shown in comment #2, and the machine never rebooted.
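
For illustration, here is a minimal sketch of the watchdog pattern described above. It is not the rgmanager source; the child just pauses in place of the real daemon loop, and all names are invented for the example.

/* Minimal watchdog sketch (illustrative only, not rgmanager code):
 * fork the daemon, wait for it, and force a reboot if it dies abnormally. */
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/reboot.h>
#include <unistd.h>
#include <errno.h>
#include <syslog.h>

int main(void)
{
    pid_t pid;
    int status = 0;

    pid = fork();
    if (pid < 0)
        return 1;

    if (pid == 0) {
        /* Child: stand-in for the real daemon's main loop. */
        pause();
        _exit(0);
    }

    /* Parent: watchdog.  Wait for the daemon to exit. */
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
        ;

    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return 0;                /* clean shutdown; nothing to do */

    openlog("watchdog-sketch", LOG_PID, LOG_DAEMON);
    syslog(LOG_CRIT, "Watchdog: Daemon died, rebooting...");
    sync();                      /* flush buffers before the hard reboot */
    reboot(RB_AUTOBOOT);         /* requires root; this is the call that did not reboot here */
    return 1;
}

The design assumption is that reboot(RB_AUTOBOOT) cannot fail once the caller is privileged, which is exactly the assumption this bug breaks.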

Comment 6 Lon Hohberger 2010-06-28 15:14:51 UTC
I've cloned this for the kernel component, but I believe it is possible to work around this issue in rgmanager by issuing a cman_kill_node() to the local host.

I will test this.

Comment 7 Lon Hohberger 2010-06-28 16:56:09 UTC
cman_kill_node(x, me) works as expected in preliminary testing.  I will test another scenario prior to posting a patch here.

Comment 8 Lon Hohberger 2010-06-28 19:19:41 UTC
 Jun 28 15:16:52 frederick clurgmgrd[2936]: <crit> Watchdog: Daemon died, rebooting... 
Jun 28 15:16:52 frederick openais[2858]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading all openais components 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_confdb v0 (20/10) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cpg v0 (19/8) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cfg v0 (18/7) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_msg v0 (17/6) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_lck v0 (16/5) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evt v0 (15/4) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_ckpt v0 (14/3) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_amf v0 (13/2) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_clm v0 (12/1) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evs v0 (11/0) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cman v0 (10/9) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting). 
Jun 28 15:16:53 frederick dlm_controld[2894]: cluster is down, exiting
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd_dispatch error -1 errno 0
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd connection died
Jun 28 15:16:53 frederick gfs_controld[2900]: cluster is down, exiting
Jun 28 15:16:53 frederick fenced[2888]: cluster is down, exiting
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 2
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 1
Jun 28 15:16:53 frederick qdiskd[1954]: <err> cman_dispatch: Host is down 
Jun 28 15:16:53 frederick qdiskd[1954]: <err> Halting qdisk operations 
Jun 28 15:16:54 frederick kernel: md: stopping all md devices.


Irrespective of whether the reboot occurs at this point, the node has kicked itself out of the cluster and will therefore be fenced by the other node.

Comment 9 Lon Hohberger 2010-06-28 19:23:30 UTC
Created attachment 427487 [details]
Make the rgmanager watchdog process kill CMAN if rgmanager (main) crashes

This patch works around the case where the reboot() system call fails for some reason.  Effectively, it issues a cman_kill_node(c, my_node_id) to cause an unclean eviction of the cluster node, resulting in the host being fenced by the cluster.

Because there are cases where cman_admin_init() may also fail, we must still reboot from the watchdog process.
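
For illustration, a sketch of what such a workaround could look like, assuming the RHEL 5 libcman API (cman_admin_init(), cman_get_node() with CMAN_NODEID_US, cman_kill_node()). This is not the attached patch, and the function name is invented.

/* Illustrative sketch only: evict the local node from the cluster via
 * libcman, then still fall back to reboot().  Build with -lcman. */
#include <string.h>
#include <unistd.h>
#include <sys/reboot.h>
#include <libcman.h>

static void evict_self_then_reboot(void)
{
    cman_handle_t ch;
    cman_node_t me;

    ch = cman_admin_init(NULL);        /* may fail, e.g. if cman itself is gone */
    if (ch) {
        memset(&me, 0, sizeof(me));
        if (cman_get_node(ch, CMAN_NODEID_US, &me) == 0)
            cman_kill_node(ch, me.cn_nodeid);   /* unclean eviction -> other node fences us */
        cman_finish(ch);
    }

    /* cman_admin_init() can fail, so the reboot path is still required. */
    sync();
    reboot(RB_AUTOBOOT);
}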

Comment 11 Lon Hohberger 2010-06-28 20:24:16 UTC
Another possibility, suggested by Mike Snitzer, is to try writing 'b' to /proc/sysrq-trigger; I will attach a patch for this as well.

Comment 13 Lon Hohberger 2010-06-28 20:35:46 UTC
Created attachment 427506 [details]
Patch using Mike's proposal

This makes rgmanager tickle /proc/sysrq-trigger in order to reboot, which takes a less error-prone path than the reboot() system call.
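
For illustration, a sketch of this approach; it is not the attached patch, and sysrq_reboot() is an invented name. Note that SysRq 'b' reboots immediately, without syncing or unmounting filesystems.

/* Illustrative sketch only: ask the kernel to reboot via the SysRq
 * trigger file instead of the reboot() system call. */
#include <fcntl.h>
#include <unistd.h>

static int sysrq_reboot(void)
{
    int fd = open("/proc/sysrq-trigger", O_WRONLY);

    if (fd < 0)
        return -1;               /* no /proc, or not running as root */

    if (write(fd, "b", 1) != 1) {
        close(fd);
        return -1;
    }

    /* Normally not reached: writing 'b' reboots the machine at once. */
    close(fd);
    return 0;
}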

Comment 23 errata-xmlrpc 2011-01-13 23:27:02 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html