Bug 608709 - reboot(RB_AUTOBOOT) fails if kvm instance is running
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.5.z
Platform: All Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assigned To: Karen Noel
QA Contact: Virtualization Bugs
Keywords: ZStream
Blocks: Rhel5KvmTier2 637520 638501 661397
 
Reported: 2010-06-28 10:22 EDT by Lon Hohberger
Modified: 2013-01-10 22:05 EST (History)
CC: 12 users

Fixed In Version: kvm-83-221.el5
Doc Type: Bug Fix
Clone Of: 608397
Clones: 637520 638501
Last Closed: 2011-01-19 06:36:36 EST


Attachments
serial log (27.07 KB, text/plain), 2010-11-04 04:22 EDT, Golita Yue
serial log (I pressed the hard power key after the bug happened, so the host booted up) (21.92 KB, text/plain), 2010-12-14 07:14 EST, Golita Yue

Description Lon Hohberger 2010-06-28 10:22:39 EDT
+++ This bug was initially created as a clone of Bug #608397 +++

Description of problem:

I tried to verify that the cluster can recover from a clurgmgrd crash,
so I killed it using: kill -9 `pidof -s clurgmgrd`

1. The rgmanager on that host has stopped.
2. The service is still running on the host.
3. The VM seems to be running on the host, but it is not responding.

[root@green-vdsa ~]# ps -aux | grep kvm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root     24500 31.9 12.6 2316192 2074448 ?     Sl   12:30   6:16 /usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 2048 -smp 1 -name RHEVM-HA -uuid 00000000-0000-0000-0000-000000000002 -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//RHEVM-HA.pid -localtime -boot c -drive file=/dev/rhevm-cluster01/rhev-image,if=ide,index=0,boot=on,cache=none -drive file=/home/iso/windows_server_2008_r2.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=00:1a:4a:23:66:fd,vlan=0 -net tap,fd=18,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -usbdevice tablet -vnc 127.0.0.1:0 -k en-us

Neither the VM nor the VM service is relocated.

1. The VM does not respond to ping.
2. The host does respond to ping, but I cannot connect to it in any way.

Bottom line: the cluster system is not responding!

I would expect the cluster system to recognize the fail-over and try to restart the VM service on the second node.

The failed node should be rebooted.
Additional info:

Jun 27 12:28:08 green-vdsa rhev-check.sh[24033]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:28:15 green-vdsa rhev-check.sh[24060]: <err> RHEV Status check on 10.35.72.253 failed; last HTTP code: 500
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> status on vm "RHEVM-HA" returned 1 (generic error)
Jun 27 12:28:15 green-vdsa clurgmgrd[31398]: <notice> Stopping service vm:RHEVM-HA
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Interface vnet0.IPv6 no longer relevant for mDNS.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Leaving mDNS multicast group on interface vnet0.IPv6 with address fe80::1c4a:7dff:fe35:d0c7.
Jun 27 12:30:16 green-vdsa avahi-daemon[5285]: Withdrawing address record for fe80::1c4a:7dff:fe35:d0c7 on vnet0.
Jun 27 12:30:16 green-vdsa kernel: device vnet0 left promiscuous mode
Jun 27 12:30:16 green-vdsa kernel: vmNetwork: port 2(vnet0) entering disabled state
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA is recovering
Jun 27 12:30:21 green-vdsa clurgmgrd[31398]: <notice> Recovering failed service vm:RHEVM-HA
Jun 27 12:30:21 green-vdsa kernel: device vnet0 entered promiscuous mode
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering learning state
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: topology change detected, propagating
Jun 27 12:30:21 green-vdsa kernel: vmNetwork: port 2(vnet0) entering forwarding state
Jun 27 12:30:21 green-vdsa rhev-check.sh[24528]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: New relevant interface vnet0.IPv6 for mDNS.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::4ced:b1ff:feac:210f.
Jun 27 12:30:23 green-vdsa avahi-daemon[5285]: Registering new address record for fe80::4ced:b1ff:feac:210f on vnet0.
Jun 27 12:30:51 green-vdsa rhev-check.sh[24593]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:30:56 green-vdsa rhev-check.sh[24605]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:15 green-vdsa rhev-check.sh[24653]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:20 green-vdsa rhev-check.sh[24666]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:21 green-vdsa rhev-check.sh[24685]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:26 green-vdsa rhev-check.sh[24695]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:31:26 green-vdsa rhev-check.sh[24714]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:31:31 green-vdsa rhev-check.sh[24727]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:10 green-vdsa rhev-check.sh[24807]: <err> RHEV Status check on 10.35.72.253 failed
Jun 27 12:32:15 green-vdsa rhev-check.sh[24818]: <info> Checking RHEV status on 10.35.72.253
Jun 27 12:32:19 green-vdsa clurgmgrd[31398]: <notice> Service vm:RHEVM-HA started

--- Additional comment from yeylon@redhat.com on 2010-06-27 06:37:48 EDT ---

rgmanager-2.0.52-6.el5_5.7

--- Additional comment from lhh@redhat.com on 2010-06-28 10:14:56 EDT ---

This looks like a kernel bug:

kvm: exiting hardware virtualization
Synchronizing SCSI cache for disk sdu: 
Synchronizing SCSI cache for disk sdt: 
Synchronizing SCSI cache for disk sds: 
Synchronizing SCSI cache for disk sdr: 
Synchronizing SCSI cache for disk sdq: 
Synchronizing SCSI cache for disk sdp: 
Synchronizing SCSI cache for disk sdo: 
Synchronizing SCSI cache for disk sdn: 
Synchronizing SCSI cache for disk sdm: 
Synchronizing SCSI cache for disk sdl: 
Synchronizing SCSI cache for disk sdk: 
Synchronizing SCSI cache for disk sdj: 
Synchronizing SCSI cache for disk sdi: 
Synchronizing SCSI cache for disk sdh: 
Synchronizing SCSI cache for disk sdg: 
Synchronizing SCSI cache for disk sdf: 
Synchronizing SCSI cache for disk sde: 
Synchronizing SCSI cache for disk sdd: 
Synchronizing SCSI cache for disk sdc: 
Synchronizing SCSI cache for disk sdb: 
Restarting system.
.
machine restart

I am still logged in to this machine.

--- Additional comment from lhh@redhat.com on 2010-06-28 10:16:02 EDT ---

This occurred after the syscall 'reboot(RB_AUTOBOOT)', which should never fail.

--- Additional comment from lhh@redhat.com on 2010-06-28 10:16:43 EDT ---

Jun 28 17:13:09 green-vdsa clurgmgrd[5472]: <crit> Watchdog: Daemon died, rebooting...
Jun 28 17:13:09 green-vdsa kernel: md: stopping all md devices.

--- Additional comment from lhh@redhat.com on 2010-06-28 10:20:57 EDT ---

The machine was running one qemu-kvm instance.

As a crash recovery measure, rgmanager has a watchdog process which reboots the host if the main rgmanager process fails unexpectedly.  This causes the node to get fenced and rgmanager to recover the service on the other host.

When we kill rgmanager proper, the watchdog process calls reboot(RB_AUTOBOOT). At that point the watchdog and kernel messages above are logged, but the machine never reboots.
Comment 1 Lon Hohberger 2010-06-28 15:27:37 EDT
Reproduced on Fedora 11 outside of the cluster software.

I had several KVM machines running and issued 'reboot -fn'; the machine still had not rebooted after 10 minutes.
Comment 5 Avi Kivity 2010-07-13 07:32:55 EDT
Is this 100% reproducible, or probabilistic?
Comment 6 Avi Kivity 2010-09-26 10:53:15 EDT
Message-Id: <1285499115-9166-1-git-send-email-avi@redhat.com>
Subject: [PATCH RHEL5.6 RHEL5.5.z] KVM: Fix reboot on Intel hosts
Comment 13 Golita Yue 2010-11-04 04:22:07 EDT
Created attachment 457728 [details]
serial log
Comment 24 Golita Yue 2010-12-10 02:15:53 EST
I can reproduce this bug on RHEL5.6 with kvm-83-221.el5, kernel 2.6.18-235.el5.

Steps:
1. Start a kvm guest on a RHEL5.6 host.
2. Run some jobs inside the guest, for example: dd if=/dev/vda of=a.out bs=1M count=2048
3. Reboot the host with "reboot -fn".

I tested 5 times and reproduced 3 times.

dhcp-91-65.nay.redhat.com login: Ebtables v2.0 registered
ip6_tables: (C) 2000-2006 Netfilter Core Team
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0x0
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x130076
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffdd0014
kvm: 3888: cpu0 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010004 data 0x0
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010000 data 0x130076
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010004 data 0xffffffffffdd0014
kvm: 3888: cpu1 unimplemented perfctr wrmsr: 0xc0010000 data 0x530076
kvm: 3888: cpu2 unimplemented perfctr wrmsr: 0xc0010004 data 0x0
kvm: 3888: cpu2 unimplemented perfctr wrmsr: 0xc0010000 data 0x130076
Synchronizing SCSI cache for disk sdb:
Synchronizing SCSI cache for disk sda:
Restarting system.
.
machine restart

But I am still logged in to this machine.
I waited about 10 minutes; the host did not boot up automatically until I ran kill -9 $kvm_pid.
Comment 25 Avi Kivity 2010-12-14 05:21:31 EST
Do you have serial console logs for this failure?
Comment 26 Avi Kivity 2010-12-14 05:58:31 EST
Also, please describe the test in detail

- qemu command line
- guest type
- image format

so I can try to reproduce it.
Comment 27 Golita Yue 2010-12-14 07:03:06 EST
(In reply to comment #26)
> - qemu command line

/usr/libexec/qemu-kvm -m 8G -smp 4 -uuid `uuidgen` -monitor stdio -boot c -drive file=/dev/vgtest/r6.0-64.raw,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none -net nic,macaddr=00:22:00:73:b5:26,model=virtio,vlan=0 -net tap,vlan=0,script=/etc/qemu-ifup -usb -vnc :1 -soundhw ac97

> - guest type
r6.0-64.raw

> - image format
raw
Comment 28 Golita Yue 2010-12-14 07:09:46 EST
(In reply to comment #25)
> Do you have serial console logs for this failure?

Attached the serial log (but there is no error information in it).
Comment 29 Golita Yue 2010-12-14 07:14:22 EST
Created attachment 468594 [details]
serial log (I pressed the hard power key after the bug happened, so the host booted up)
