Bug 609181 - rgmanager: fail to recover from crash
rgmanager: fail to recover from crash
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: rgmanager (Show other bugs)
6.0
All Linux
urgent Severity high
: rc
: ---
Assigned To: Lon Hohberger
Cluster QE
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2010-06-29 11:16 EDT by Lon Hohberger
Modified: 2010-11-10 17:16 EST (History)
8 users (show)

See Also:
Fixed In Version: rgmanager-3.0.12-3.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 608397
Environment:
Last Closed: 2010-11-10 17:16:26 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description Lon Hohberger 2010-06-29 11:16:55 EDT
+++ This bug was initially created as a clone of Bug #608397 +++

Description of problem:

i've try to verify that the cluster can recover from clurgmgrd crash
so i've killed it using  kill -9 `pidof -s clurgmgrd`

1. the rgmanager on that host has stopped.
2. the service is still running on the host
3. the VM seems to be running on the host but it is not responding

[root@green-vdsa ~]# ps -aux | grep kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root     24500 31.9 12.6 2316192 2074448 ?     Sl   12:30   6:16 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2048 -smp 1 -name RHEVM-HA -uuid 00000000-0000-0000-0000-000000000002 -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEVM-HA.pid -localtime -boot c -drive file=/dev/rhevm-cluster01/rhev-image,if=ide,index=0,boot=on,cache=none -drive file=/home/iso/windows_server_2008_r2.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=00:1a:4a:23:66:fd,vlan=0 -net tap,fd=18,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0 -k en-us

the VM will not be allocated nor the VM service.

1. the VM does not respond to ping 
2. the host does respond to ping but i can not connect to him in any way.

bottom line the cluster system is not responding!!!

i would think that the cluster system should recognize the cluster fail-over and will try and restart the vm service on the second node.

the failed node should be rebooted 
 



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Jun 27 12:28:08 green-vdsa rhev-check.sh[24033]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:28:15 green-vdsa rhev-check.sh[24060]: <err> RHEV Status check on 10.35.72.253 failed; last HTTP code: 500
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> status on vm "RHEVM-HA" returned 1 (generic error)
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> Stopping service vm:RHEVM-HA
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Interface vnet0.IPv6 no longer relevant for mDNS.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Leaving mDNS multicast group on interface vnet0.IPv6 with address fe80::1c4a:7dff:fe35:d0c7.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Withdrawing address record for fe80::1c4a:7dff:fe35:d0c7 on vnet0.
Jun 27 12:30:16 green-vdsa kernel: device vnet0 left promiscuous mode
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA is recovering
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Recovering failed service vm:RHEVM-HA
Jun 27 12:30:21 green-vdsa kernel: device vnet0 entered promiscuous mode
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering learning state
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: topology change detected, propagating
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering forwarding state
Jun 27 12:30:21 green-vdsa rhev-check.sh[24528]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: New relevant interface vnet0.IPv6 for mDNS.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::4ced:b1ff:feac:210f.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Registering new address record for fe80::4ced:b1ff:feac:210f on vnet0.
Jun 27 12:30:51 green-vdsa rhev-check.sh[24593]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:30:56 green-vdsa rhev-check.sh[24605]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:15 green-vdsa rhev-check.sh[24653]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:20 green-vdsa rhev-check.sh[24666]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:21 green-vdsa rhev-check.sh[24685]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:26 green-vdsa rhev-check.sh[24695]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:26 green-vdsa rhev-check.sh[24714]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:31 green-vdsa rhev-check.sh[24727]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:10 green-vdsa rhev-check.sh[24807]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:32:15 green-vdsa rhev-check.sh[24818]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:19 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA started

--- Additional comment from yeylon@redhat.com on 2010-06-27 06:37:48 EDT ---

rgmanager-2.0.52-6.el5_5.7

--- Additional comment from lhh@redhat.com on 2010-06-28 10:14:56 EDT ---

This looks like a kernel bug:

kvm: exiting hardware virtualization
Synchronizing SCSI cache for disk sdu: 
Synchronizing SCSI cache for disk sdt: 
Synchronizing SCSI cache for disk sds: 
Synchronizing SCSI cache for disk sdr: 
Synchronizing SCSI cache for disk sdq: 
Synchronizing SCSI cache for disk sdp: 
Synchronizing SCSI cache for disk sdo: 
Synchronizing SCSI cache for disk sdn: 
Synchronizing SCSI cache for disk sdm: 
Synchronizing SCSI cache for disk sdl: 
Synchronizing SCSI cache for disk sdk: 
Synchronizing SCSI cache for disk sdj: 
Synchronizing SCSI cache for disk sdi: 
Synchronizing SCSI cache for disk sdh: 
Synchronizing SCSI cache for disk sdg: 
Synchronizing SCSI cache for disk sdf: 
Synchronizing SCSI cache for disk sde: 
Synchronizing SCSI cache for disk sdd: 
Synchronizing SCSI cache for disk sdc: 
Synchronizing SCSI cache for disk sdb: 
Restarting system.
.
machine restart

I am still logged in to this machine.

--- Additional comment from lhh@redhat.com on 2010-06-28 10:16:02 EDT ---

This occurred after the syscall 'reboot(RB_AUTOBOOT)', which should never fail.

--- Additional comment from lhh@redhat.com on 2010-06-28 10:16:43 EDT ---

Jun 28 17:13:09 green-vdsa clurgmgrd[5472]: <crit> 
             Watchdog: Daemon died, rebooting... 
Jun 28 17:13:09 green-vdsa kernel: md: stopping all md 
             devices.

--- Additional comment from lhh@redhat.com on 2010-06-28 10:20:57 EDT ---

The machine was running one qemu-kvm instance.

As a crash recovery measure, rgmanager has a watchdog process which reboots the host if the main rgmanager process fails unexpectedly.  This causes the node to get fenced and rgmanager to recover the service on the other host.

When we kill rgmanager proper, the watchdog process calls reboot(RB_AUTOBOOT).  At this point, we log the above messages in comment #2, the kernel reported messages as per comment #2, and the machine never rebooted.

--- Additional comment from lhh@redhat.com on 2010-06-28 11:14:51 EDT ---

I've cloned this for the kernel component, but I believe it is possible to work around this issue in rgmanager by issuing a cman_kill_node() to the local host.

I will test this.

--- Additional comment from lhh@redhat.com on 2010-06-28 12:56:09 EDT ---

The cman_kill_node(x, me) works as expected during preliminary testing.  I will test another scenario prior to posting a patch here.

--- Additional comment from lhh@redhat.com on 2010-06-28 15:19:41 EDT ---

 Jun 28 15:16:52 frederick clurgmgrd[2936]: <crit> Watchdog: Daemon died, rebooting... 
Jun 28 15:16:52 frederick openais[2858]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading all openais components 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_confdb v0 (20/10) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cpg v0 (19/8) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cfg v0 (18/7) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_msg v0 (17/6) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_lck v0 (16/5) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evt v0 (15/4) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_ckpt v0 (14/3) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_amf v0 (13/2) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_clm v0 (12/1) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evs v0 (11/0) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cman v0 (10/9) 
Jun 28 15:16:53 frederick openais[2858]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting). 
Jun 28 15:16:53 frederick dlm_controld[2894]: cluster is down, exiting
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd_dispatch error -1 errno 0
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd connection died
Jun 28 15:16:53 frederick gfs_controld[2900]: cluster is down, exiting
Jun 28 15:16:53 frederick fenced[2888]: cluster is down, exiting
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 2
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 1
Jun 28 15:16:53 frederick qdiskd[1954]: <err> cman_dispatch: Host is down 
Jun 28 15:16:53 frederick qdiskd[1954]: <err> Halting qdisk operations 
Jun 28 15:16:54 frederick kernel: md: stopping all md devices.


Irrespective of whether reboot occurs at this point, the node has been kicked itself out of the cluster and therefore will be fenced by the other node.

--- Additional comment from lhh@redhat.com on 2010-06-28 15:23:30 EDT ---

Created an attachment (id=427487)
Make the rgmanager watchdog process kill CMAN if rgmanager (main) crashes

This patch works around the case that the reboot() system call fails for some reason.  Effectively, it issues a cman_kill_node(c, my_node_id) to cause an unclean eviction of the cluster node, resulting in the host being fenced by the cluster.

Because there are cases where cman_admin_init() may also fail, we must still reboot from the watchdog process.

--- Additional comment from lhh@redhat.com on 2010-06-28 16:24:16 EDT ---

Another possibility which was suggested by Mike Snitzer is to try writing 'b' to /proc/sysrq-trigger; I will attach a patch for this as well.

--- Additional comment from lhh@redhat.com on 2010-06-28 16:35:46 EDT ---

Created an attachment (id=427506)
Patch using Mike's proposal

This makes rgmanager tickle /proc/sysrq-trigger in order to do a reboot, which has a less error-prone path than the reboot() system call.
Comment 2 RHEL Product and Program Management 2010-06-29 11:23:12 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 7 Dean Jansa 2010-08-12 16:10:23 EDT
Verified, sysreq happens after kill -11 $(pidof -s rgmanager)
Comment 8 releng-rhel@redhat.com 2010-11-10 17:16:26 EST
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

Note You need to log in before you can comment on or make changes to this bug.