Description of problem:

I tried to verify that the cluster can recover from a clurgmgrd crash, so I killed it with:

kill -9 `pidof -s clurgmgrd`

After that:
1. rgmanager on that host has stopped.
2. The service is still reported as running on the host.
3. The VM appears to be running on the host, but it is not responding.

[root@green-vdsa ~]# ps -aux | grep kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root 24500 31.9 12.6 2316192 2074448 ? Sl 12:30 6:16 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2048 -smp 1 -name RHEVM-HA -uuid 00000000-0000-0000-0000-000000000002 -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEVM-HA.pid -localtime -boot c -drive file=/dev/rhevm-cluster01/rhev-image,if=ide,index=0,boot=on,cache=none -drive file=/home/iso/windows_server_2008_r2.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=00:1a:4a:23:66:fd,vlan=0 -net tap,fd=18,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0 -k en-us

Neither the VM nor the VM service is available:
1. The VM does not respond to ping.
2. The host does respond to ping, but I cannot connect to it in any way.

Bottom line: the cluster is not responding. I would expect the cluster to recognize the failure, restart the VM service on the second node, and reboot the failed node.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run an HA VM service (vm:RHEVM-HA) under rgmanager.
2. On the node running the service, kill the resource manager: kill -9 `pidof -s clurgmgrd`
3. Check whether the node is rebooted and the service is recovered on the second node.

Actual results:
rgmanager stops, the VM becomes unresponsive, the failed node is not rebooted, and the service is not restarted on the second node.

Expected results:
The failed node is rebooted and the VM service is restarted on the second node.

Additional info:

Jun 27 12:28:08 green-vdsa rhev-check.sh[24033]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:28:15 green-vdsa rhev-check.sh[24060]: <err> RHEV Status check on 10.35.72.253 failed; last HTTP code: 500
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> status on vm "RHEVM-HA" returned 1 (generic error)
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> Stopping service vm:RHEVM-HA
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Interface vnet0.IPv6 no longer relevant for mDNS.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Leaving mDNS multicast group on interface vnet0.IPv6 with address fe80::1c4a:7dff:fe35:d0c7.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Withdrawing address record for fe80::1c4a:7dff:fe35:d0c7 on vnet0.
Jun 27 12:30:16 green-vdsa kernel: device vnet0 left promiscuous mode
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA is recovering
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Recovering failed service vm:RHEVM-HA
Jun 27 12:30:21 green-vdsa kernel: device vnet0 entered promiscuous mode
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering learning state
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: topology change detected, propagating
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering forwarding state
Jun 27 12:30:21 green-vdsa rhev-check.sh[24528]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: New relevant interface vnet0.IPv6 for mDNS.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::4ced:b1ff:feac:210f.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Registering new address record for fe80::4ced:b1ff:feac:210f on vnet0.
Jun 27 12:30:51 green-vdsa rhev-check.sh[24593]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:30:56 green-vdsa rhev-check.sh[24605]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:15 green-vdsa rhev-check.sh[24653]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:20 green-vdsa rhev-check.sh[24666]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:21 green-vdsa rhev-check.sh[24685]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:26 green-vdsa rhev-check.sh[24695]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:26 green-vdsa rhev-check.sh[24714]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:31 green-vdsa rhev-check.sh[24727]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:10 green-vdsa rhev-check.sh[24807]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:32:15 green-vdsa rhev-check.sh[24818]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:19 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA started
rgmanager-2.0.52-6.el5_5.7
This looks like a kernel bug:

kvm: exiting hardware virtualization
Synchronizing SCSI cache for disk sdu:
Synchronizing SCSI cache for disk sdt:
Synchronizing SCSI cache for disk sds:
Synchronizing SCSI cache for disk sdr:
Synchronizing SCSI cache for disk sdq:
Synchronizing SCSI cache for disk sdp:
Synchronizing SCSI cache for disk sdo:
Synchronizing SCSI cache for disk sdn:
Synchronizing SCSI cache for disk sdm:
Synchronizing SCSI cache for disk sdl:
Synchronizing SCSI cache for disk sdk:
Synchronizing SCSI cache for disk sdj:
Synchronizing SCSI cache for disk sdi:
Synchronizing SCSI cache for disk sdh:
Synchronizing SCSI cache for disk sdg:
Synchronizing SCSI cache for disk sdf:
Synchronizing SCSI cache for disk sde:
Synchronizing SCSI cache for disk sdd:
Synchronizing SCSI cache for disk sdc:
Synchronizing SCSI cache for disk sdb:
Restarting system.
. machine restart

Despite the "Restarting system" message, I am still logged in to this machine.
This occurred after the syscall 'reboot(RB_AUTOBOOT)', which should never fail.
Jun 28 17:13:09 green-vdsa clurgmgrd[5472]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 17:13:09 green-vdsa kernel: md: stopping all md devices.
The machine was running one qemu-kvm instance. As a crash recovery measure, rgmanager has a watchdog process which reboots the host if the main rgmanager process dies unexpectedly; this causes the node to be fenced and rgmanager to recover the service on the other host. When we kill the main rgmanager process, the watchdog process calls reboot(RB_AUTOBOOT). At that point the watchdog logs the "Watchdog: Daemon died, rebooting..." message shown above, the kernel reports the messages in comment #2, and the machine never reboots.
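For context, here is a minimal sketch of the watchdog flow described above (fork the daemon, wait on it, and call reboot(RB_AUTOBOOT) if it dies). This is illustrative only, not the actual rgmanager source; run_daemon() is a placeholder.

/*
 * Minimal sketch of the watchdog flow (illustrative only, not the actual
 * rgmanager code). A parent "watchdog" process forks the real daemon and
 * reboots the host if the daemon dies unexpectedly.
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/reboot.h>
#include <unistd.h>
#include <stdlib.h>
#include <syslog.h>

static void run_daemon(void)
{
	/* Placeholder for the real resource manager main loop. */
	pause();
}

int main(void)
{
	pid_t pid = fork();
	int status;

	if (pid < 0)
		exit(1);

	if (pid == 0) {
		run_daemon();		/* child: the actual daemon */
		exit(0);
	}

	/* Parent: watchdog. Wait for the daemon and react if it dies badly. */
	if (waitpid(pid, &status, 0) == pid &&
	    (WIFSIGNALED(status) ||
	     (WIFEXITED(status) && WEXITSTATUS(status) != 0))) {
		syslog(LOG_CRIT, "Watchdog: Daemon died, rebooting...");
		sync();
		reboot(RB_AUTOBOOT);	/* the call that never completed in this bug */
	}

	return 0;
}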
I've cloned this for the kernel component, but I believe it is possible to work around this issue in rgmanager by issuing a cman_kill_node() to the local host. I will test this.
cman_kill_node(x, me) works as expected in preliminary testing. I will test another scenario prior to posting a patch here.
Jun 28 15:16:52 frederick clurgmgrd[2936]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 15:16:52 frederick openais[2858]: [CMAN ] cman killed by node 2 because we were killed by cman_tool or other application
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading all openais components
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_confdb v0 (20/10)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cpg v0 (19/8)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cfg v0 (18/7)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_msg v0 (17/6)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_lck v0 (16/5)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evt v0 (15/4)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_ckpt v0 (14/3)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_amf v0 (13/2)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_clm v0 (12/1)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_evs v0 (11/0)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] Unloading openais component: openais_cman v0 (10/9)
Jun 28 15:16:53 frederick openais[2858]: [SERV ] AIS Executive exiting (reason: CMAN kill requested, exiting).
Jun 28 15:16:53 frederick dlm_controld[2894]: cluster is down, exiting
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd_dispatch error -1 errno 0
Jun 28 15:16:53 frederick gfs_controld[2900]: groupd connection died
Jun 28 15:16:53 frederick gfs_controld[2900]: cluster is down, exiting
Jun 28 15:16:53 frederick fenced[2888]: cluster is down, exiting
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 2
Jun 28 15:16:53 frederick kernel: dlm: closing connection to node 1
Jun 28 15:16:53 frederick qdiskd[1954]: <err> cman_dispatch: Host is down
Jun 28 15:16:53 frederick qdiskd[1954]: <err> Halting qdisk operations
Jun 28 15:16:54 frederick kernel: md: stopping all md devices.

Irrespective of whether the reboot occurs at this point, the node has kicked itself out of the cluster and will therefore be fenced by the other node.
Created attachment 427487 [details]
Make the rgmanager watchdog process kill CMAN if rgmanager (main) crashes

This patch works around the case where the reboot() system call fails for some reason. Effectively, it issues a cman_kill_node(c, my_node_id) to cause an unclean eviction of the cluster node, resulting in the host being fenced by the cluster. Because there are cases where cman_admin_init() may also fail, we must still reboot from the watchdog process.
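A minimal sketch of that eviction-plus-fallback logic, assuming the libcman admin API (cman_admin_init(), cman_kill_node(), and cman_get_node() with CMAN_NODEID_US to discover the local node id). This is not the attached patch, just an illustration of the approach:

/*
 * Sketch of the eviction fallback (illustrative, not the attached patch).
 * Asking CMAN to kill the local node causes an unclean eviction, so the
 * surviving node fences this host even if reboot() itself misbehaves.
 * Using cman_get_node()/CMAN_NODEID_US to find the local node id is an
 * assumption about the libcman API.
 */
#include <string.h>
#include <unistd.h>
#include <sys/reboot.h>
#include <libcman.h>

static void evict_self_then_reboot(void)
{
	cman_handle_t ch;
	cman_node_t node;

	ch = cman_admin_init(NULL);	/* may fail, e.g. if cman is already gone */
	if (ch) {
		memset(&node, 0, sizeof(node));
		if (cman_get_node(ch, CMAN_NODEID_US, &node) == 0)
			cman_kill_node(ch, node.cn_nodeid);	/* unclean eviction -> fencing */
		cman_finish(ch);
	}

	/* cman_admin_init() can fail, so we still reboot from the watchdog. */
	sync();
	reboot(RB_AUTOBOOT);
}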
Another possibility, suggested by Mike Snitzer, is to write 'b' to /proc/sysrq-trigger; I will attach a patch for this as well.
Created attachment 427506 [details]
Patch using Mike's proposal

This makes rgmanager write to /proc/sysrq-trigger in order to perform the reboot, which takes a less error-prone path than the reboot() system call.
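For illustration, a sketch of the /proc/sysrq-trigger approach (again, not the attached patch; the function name is made up). Writing 'b' asks the kernel to reboot immediately, so if the write succeeds nothing after it runs:

/*
 * Sketch of the /proc/sysrq-trigger approach (illustrative only).
 * Writing 'b' requests an immediate emergency reboot via magic SysRq,
 * bypassing most of the ordinary reboot path.
 */
#include <fcntl.h>
#include <unistd.h>
#include <sys/reboot.h>

static void sysrq_reboot(void)
{
	int fd = open("/proc/sysrq-trigger", O_WRONLY);

	if (fd >= 0) {
		write(fd, "b", 1);	/* emergency reboot via magic SysRq */
		close(fd);
	}

	/* Fall back to the ordinary reboot() call if the write did not work. */
	sync();
	reboot(RB_AUTOBOOT);
}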
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html