+++ This bug was initially created as a clone of Bug #608397 +++

Description of problem:

I tried to verify that the cluster can recover from a clurgmgrd crash, so I killed it with:

kill -9 `pidof -s clurgmgrd`

1. rgmanager on that host stopped.
2. The service is still running on the host.
3. The VM appears to be running on the host, but it is not responding:

[root@green-vdsa ~]# ps -aux | grep kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root 24500 31.9 12.6 2316192 2074448 ? Sl 12:30 6:16 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2048 -smp 1 -name RHEVM-HA -uuid 00000000-0000-0000-0000-000000000002 -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEVM-HA.pid -localtime -boot c -drive file=/dev/rhevm-cluster01/rhev-image,if=ide,index=0,boot=on,cache=none -drive file=/home/iso/windows_server_2008_r2.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=00:1a:4a:23:66:fd,vlan=0 -net tap,fd=18,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0 -k en-us

Neither the VM nor the VM service is relocated:
1. The VM does not respond to ping.
2. The host responds to ping, but I cannot connect to it in any way.

Bottom line: the cluster system is not responding. I would expect the cluster to recognize the failure and try to restart the VM service on the second node; the failed node should be rebooted.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.
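The crash test described above can be sketched as a small script. This is a hedged sketch, not part of the original report: clurgmgrd is rgmanager's main daemon, and on a machine where it is not running the script simply reports that instead of killing anything.

```shell
#!/bin/sh
# Sketch of the crash-recovery test: kill rgmanager's main daemon
# (clurgmgrd) and let its watchdog process reboot/fence the node.
# Safe no-op on machines where clurgmgrd is not running.
pid=$(pidof -s clurgmgrd 2>/dev/null || true)
if [ -z "$pid" ]; then
    echo "clurgmgrd not running; nothing to kill"
else
    echo "killing clurgmgrd (pid $pid); watchdog should reboot this node"
    kill -9 "$pid"
fi
```

On a healthy cluster, the expected follow-up is the watchdog log line shown later in this bug ("Watchdog: Daemon died, rebooting...") followed by fencing and service recovery on the other node.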
Actual results:

Expected results:

Additional info:

Jun 27 12:28:08 green-vdsa rhev-check.sh[24033]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:28:15 green-vdsa rhev-check.sh[24060]: <err> RHEV Status check on 10.35.72.253 failed; last HTTP code: 500
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> status on vm "RHEVM-HA" returned 1 (generic error)
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> Stopping service vm:RHEVM-HA
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Interface vnet0.IPv6 no longer relevant for mDNS.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Leaving mDNS multicast group on interface vnet0.IPv6 with address fe80::1c4a:7dff:fe35:d0c7.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Withdrawing address record for fe80::1c4a:7dff:fe35:d0c7 on vnet0.
Jun 27 12:30:16 green-vdsa kernel: device vnet0 left promiscuous mode
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA is recovering
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Recovering failed service vm:RHEVM-HA
Jun 27 12:30:21 green-vdsa kernel: device vnet0 entered promiscuous mode
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering learning state
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: topology change detected, propagating
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering forwarding state
Jun 27 12:30:21 green-vdsa rhev-check.sh[24528]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: New relevant interface vnet0.IPv6 for mDNS.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::4ced:b1ff:feac:210f.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Registering new address record for fe80::4ced:b1ff:feac:210f on vnet0.
Jun 27 12:30:51 green-vdsa rhev-check.sh[24593]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:30:56 green-vdsa rhev-check.sh[24605]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:15 green-vdsa rhev-check.sh[24653]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:20 green-vdsa rhev-check.sh[24666]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:21 green-vdsa rhev-check.sh[24685]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:26 green-vdsa rhev-check.sh[24695]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:26 green-vdsa rhev-check.sh[24714]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:31 green-vdsa rhev-check.sh[24727]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:10 green-vdsa rhev-check.sh[24807]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:32:15 green-vdsa rhev-check.sh[24818]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:19 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA started

--- Additional comment from yeylon on 2010-06-27 06:37:48 EDT ---

rgmanager-2.0.52-6.el5_5.7

--- Additional comment from lhh on 2010-06-28 10:14:56 EDT ---

This looks like a kernel bug:

kvm: exiting hardware virtualization
Synchronizing SCSI cache for disk sdu:
Synchronizing SCSI cache for disk sdt:
Synchronizing SCSI cache for disk sds:
Synchronizing SCSI cache for disk sdr:
Synchronizing SCSI cache for disk sdq:
Synchronizing SCSI cache for disk sdp:
Synchronizing SCSI cache for disk sdo:
Synchronizing SCSI cache for disk sdn:
Synchronizing SCSI cache for disk sdm:
Synchronizing SCSI cache for disk sdl:
Synchronizing SCSI cache for disk sdk:
Synchronizing SCSI cache for disk sdj:
Synchronizing SCSI cache for disk sdi:
Synchronizing SCSI cache for disk sdh:
Synchronizing SCSI cache for disk sdg:
Synchronizing SCSI cache for disk sdf:
Synchronizing SCSI cache for disk sde:
Synchronizing SCSI cache for disk sdd:
Synchronizing SCSI cache for disk sdc:
Synchronizing SCSI cache for disk sdb:
Restarting system.
.
machine restart

I am still logged in to this machine.

--- Additional comment from lhh on 2010-06-28 10:16:02 EDT ---

This occurred after the syscall 'reboot(RB_AUTOBOOT)', which should never fail.

--- Additional comment from lhh on 2010-06-28 10:16:43 EDT ---

Jun 28 17:13:09 green-vdsa clurgmgrd[5472]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 17:13:09 green-vdsa kernel: md: stopping all md devices.

--- Additional comment from lhh on 2010-06-28 10:20:57 EDT ---

The machine was running one qemu-kvm instance.

As a crash-recovery measure, rgmanager has a watchdog process which reboots the host if the main rgmanager process dies unexpectedly. This causes the node to be fenced and rgmanager to recover the service on the other host.

When we kill the rgmanager process proper, the watchdog process calls reboot(RB_AUTOBOOT). At that point, the kernel reported the messages shown in comment #2, but the machine never rebooted.
Reproduced on Fedora 11 outside of the cluster software. I had several KVM machines running and issued 'reboot -fn'; my machine did not actually reboot after 10 minutes.
Is this 100% reproducible, or probabilistic?
Message-Id: <1285499115-9166-1-git-send-email-avi>
Subject: [PATCH RHEL5.6 RHEL5.5.z] KVM: Fix reboot on Intel hosts
Created attachment 457728 [details] serial log
Can reproduce this bug on RHEL5.6 with kvm-83-221.el5, kernel 2.6.18-235.el5.

Steps:
1. Start a KVM guest on the RHEL5.6 host.
2. Run some jobs inside the guest, for example:
   dd if=/dev/vda of=a.out bs=1M count=2048
3. Reboot the host with "reboot -fn".

I tested 5 times and reproduced 3 times.

dhcp-91-65.nay.redhat.com login: Ebtables v2.0 registered
ip6_tables: (C) 2000-2006 Netfilter Core Team
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0x0
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x130076
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffdd0014
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010004 data 0x0
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010000 data 0x130076
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffdd0014
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
kvm: 3888: cpu2 unimplemented perfctr wrmsr: 0xc0010004 data 0x0
kvm: 3888: cpu2 unimplemented perfctr wrmsr: 0xc0010000 data 0x130076
Synchronizing SCSI cache for disk sdb:
Synchronizing SCSI cache for disk sda:
Restarting system.
.
machine restart

But I am still logged in to this machine. I waited about 10 minutes; the host did not boot up until I ran (kill -9 $kvm_pid).
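The reproduction steps above can be wrapped in a small guard script. This is a sketch only: the guest image path, memory size, and VNC display are placeholders (not from the report), and because "reboot -fn" forces an immediate host reboot, the script refuses to do anything unless explicitly confirmed.

```shell
#!/bin/sh
# Destructive reproduction sketch for "reboot -fn hangs while a KVM
# guest is running". GUEST_IMG is a hypothetical placeholder.
# Requires REPRO_CONFIRM=yes; otherwise it only explains itself.
GUEST_IMG=${GUEST_IMG:-/path/to/guest.img}
msg=""
if [ "$REPRO_CONFIRM" != "yes" ]; then
    msg="refusing to run: set REPRO_CONFIRM=yes to force a host reboot"
    echo "$msg"
else
    # 1. start a KVM guest (options are illustrative, not the exact
    #    command line from this bug)
    /usr/libexec/qemu-kvm -m 2048 -smp 2 \
        -drive file="$GUEST_IMG",if=ide -vnc :1 &
    # 2. inside the guest, generate disk load, e.g.:
    #    dd if=/dev/vda of=a.out bs=1M count=2048
    sleep 120   # give the guest time to boot and start the workload
    # 3. force an immediate reboot, no sync/unmount; an affected host
    #    hangs at "Restarting system." instead of rebooting
    reboot -fn
fi
```

Note the reported hit rate was 3 out of 5 attempts, so several runs may be needed.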
Do you have serial console logs for this failure?
Also, please describe the test in detail - qemu command line - guest type - image format so I can try to reproduce it.
(In reply to comment #26)

> - qemu command line
/usr/libexec/qemu-kvm -m 8G -smp 4 -uuid `uuidgen` -monitor stdio -boot c -drive file=/dev/vgtest/r6.0-64.raw,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none -net nic,macaddr=00:22:00:73:b5:26,model=virtio,vlan=0 -net tap,vlan=0,script=/etc/qemu-ifup -usb -vnc :1 -soundhw ac97

> - guest type
r6.0-64.raw

> - image format
raw
(In reply to comment #25)
> Do you have serial console logs for this failure?

Attached the serial log. (There is no error information in the serial log, though.)
Created attachment 468594 [details] serial log (I pressed the hard power key after the bug happened, so the host booted up.)