Bug 1374918 - rebooting KVM VM occasionally hangs (via libreswan test suite)
Summary: rebooting KVM VM occasionally hangs (via libreswan test suite)
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 27
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-10 11:57 UTC by Cagney
Modified: 2018-11-30 19:04 UTC (History)
CC List: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-30 19:04:13 UTC
Type: Bug
Embargoed:



Description Cagney 2016-09-10 11:57:34 UTC
Description of problem:

I'm running a test suite that repeatedly reboots domains.  The domains eventually:

- hang
- suspend
- disappear
- lose their disk

Version-Release number of selected component (if applicable):

The domain is based on f22.

How reproducible:

See below:

Steps to Reproduce:

1. create a domain

I'm using a test framework that builds lots of domains based on f22, but I don't think that is the cause:

git clone https://github.com/libreswan/libreswan.git
cd libreswan
make install-kvm-test-domains

2. modify the domain to just reboot (I use "east")

cat <<EOF > /etc/rc.d/rc.local
#!/bin/sh
reboot &
EOF
chmod +x /etc/rc.d/rc.local

(the chmod matters: systemd's rc-local.service only runs /etc/rc.d/rc.local if it is executable)

3. set it going ....

Actual results:

Here's output from a simple domain where it spontaneously disappeared:

[  OK  ] Started udev Kernel Device Manager.
[  OK  ] Started Create Volatile Files and Directories.
         Starting Security Auditing Service...
[  OK  ] Started Security Auditing Service.
         Starting Update UTMP about System Boot/Shutdown...
[  OK  ] Started Update UTMP about System Boot/Shutdown.
[    9.573047] piix4_smbus 0000:00:01.3: SMBus Host Controller at 0x700, revision 0

Here's debug output from a domain "hanging": the framework sees the reboot message and then nothing else; the code catches this and recovers :-/

DEBUG ws1.shell nic 6:28:12.00/2:43.00: read <<' 250.760530] re'>>>
DEBUG ws1.shell nic 6:28:12.00/2:43.00: read <<'boot: Power down\r\n'>>>
DEBUG ws1.virsh nic 6:28:12.00/2:43.00: 'reboot: Power down\r\n' matched after 2.2 seconds
INFO ws1.virsh nic 6:28:12.00/2:43.00: domain rebooted after 2.2 seconds
DEBUG ws1.virsh nic 6:28:12.00/2:43.00: waiting 60 seconds for 'login: ' or 'reboot: Power down\r\n' or <class 'pexpect.exceptions.TIMEOUT'>
DEBUG ws1.virsh nic 6:29:12.01/3:43.00: <class 'pexpect.exceptions.TIMEOUT'> matched after 60.1 seconds
ERROR ws1.virsh nic 6:29:12.01/3:43.00: domain appears stuck, no output received after waiting 60 seconds
DEBUG ws1.virsh nic 6:29:12.01/3:43.00: running: ['sudo', 'virsh', 'domstate', 'ws1.nic']
DEBUG ws1.virsh nic 6:29:12.01/3:43.01: output: running
ERROR ws1.virsh nic 6:29:12.01/3:43.01: domain hung, trying to pull the power cord


Expected results:

Infinite reboots.


Additional info:

Comment 1 Cole Robinson 2017-03-14 20:10:04 UTC
Are you still seeing this with latest host packages?

Can you grab an example qemu invocation from 'ps' output?

Comment 2 Cagney 2017-03-20 17:56:56 UTC
Still happens on an up-to-date f24.  'ps' output shows lines like:

cagney   25963 34.3  1.8 1759156 298560 ?      Sl   13:51   1:35 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name dh.east,debug-threads=on -S -machine pc-0.15,accel=kvm,usb=off -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 8f3a6d27-0ed6-42d3-88c4-c14e19dca349 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1765-dh.east/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0xa -drive file=/var/tmp/pool/dh.east.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0xb,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -fsdev local,security_model=none,id=fsdev-fs0,path=/home/libreswan/wip-dh/testing -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=testing,bus=pci.0,addr=0x3 -fsdev local,security_model=none,id=fsdev-fs1,path=/home/libreswan/wip-dh -device virtio-9p-pci,id=fs1,fsdev=fsdev-fs1,mount_tag=swansource,bus=pci.0,addr=0x4 -fsdev local,security_model=none,id=fsdev-fs2,path=/tmp -device virtio-9p-pci,id=fs2,fsdev=fsdev-fs2,mount_tag=tmp,bus=pci.0,addr=0x5 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=12:00:00:dc:bc:ff,bus=pci.0,addr=0x6 -netdev tap,fd=32,id=hostnet1,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=12:00:00:64:64:23,bus=pci.0,addr=0x8 -netdev tap,fd=35,id=hostnet2,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet2,id=net2,mac=12:00:00:32:64:23,bus=pci.0,addr=0x9 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice 
port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xc -object rng-random,id=objrng0,filename=/dev/random -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on

Comment 3 Cole Robinson 2017-03-21 20:13:29 UTC
How is the reboot being triggered? From virsh on the host, or from inside the guest?
Is the guest OS still fedora 22?

reboot can hang for non-virt reasons as well, can you connect to a hung VM and check journalctl or dmesg logs to see if any error is jumping out? There's some more ideas in here: https://freedesktop.org/wiki/Software/systemd/Debugging/

Comment 4 Cagney 2017-03-28 17:06:24 UTC
(In reply to Cole Robinson from comment #3)
> How is the reboot being triggered? From virsh on the host, or from inside
> the guest?

It uses virsh; specifically:

- it waits for the login prompt to appear (using virsh console and "expect") so that it knows the guest OS is running

- it issues: virsh reboot

- it uses virsh console to wait for the guest OS to "reset"; it sees the output:

     250.760530] reboot: Power down\r\n'

- it uses virsh console to wait for the guest OS to finish the fresh boot; it sees no further output - the above is all that is seen

- it starts recovery ....
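The wait/classify step above can be sketched in Python (helper names here are illustrative, not the framework's actual API; the real harness drives 'virsh console' through pexpect, per the DEBUG lines):

```python
import subprocess

# Strings the harness waits for on the serial console (from the
# DEBUG traces above).
POWER_DOWN = "reboot: Power down"
LOGIN_PROMPT = "login: "

def classify_console_output(chunk: str) -> str:
    """Classify a chunk of 'virsh console' output during a reboot cycle.

    "powered-down" once the guest announces its shutdown, "booted" once
    a fresh login prompt appears, and "waiting" otherwise; a timeout
    that only ever sees "waiting" is the hang described above.
    """
    if POWER_DOWN in chunk:
        return "powered-down"
    if LOGIN_PROMPT in chunk:
        return "booted"
    return "waiting"

def domstate(domain: str) -> str:
    """Ask libvirt what state the (possibly wedged) domain is in."""
    out = subprocess.run(["sudo", "virsh", "domstate", domain],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()  # e.g. "running" or "paused"
```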

> Is the guest OS still fedora 22?

Yes.

> reboot can hang for non-virt reasons as well, can you connect to a hung VM
> and check journalctl or dmesg logs to see if any error is jumping out?
> There's some more ideas in here:
> https://freedesktop.org/wiki/Software/systemd/Debugging/

It is KVM/QEMU that has become wedged, not the guest OS.

Here's further logging:

DEBUG virsh east 8:16:43.09/2.04: 'reboot: Power down\r\n' matched after 1.4 seconds
INFO virsh east 8:16:43.09/2.04: domain rebooted after 1.4 seconds

- per above, guest OS seen to "reboot"

- presumably the guest OS issues a "reset" and the fresh boot should start

DEBUG virsh east 8:16:43.09/2.04: waiting 10 seconds for 'login: ' or 'reboot: Power down\r\n' or <class 'pexpect.exceptions.TIMEOUT'>
DEBUG virsh east 8:16:53.09/12.04: <class 'pexpect.exceptions.TIMEOUT'> matched after 10.0 seconds
ERROR virsh east 8:16:53.09/12.04: domain appears stuck, no output received after waiting 10 seconds

- no further output is seen from the guest OS; after waiting 10 seconds, grub output should have appeared


- now let's look at what is needed to recover things; there appear to be two cases:


case #1, domain "paused" instead of resetting as shown by 'virsh domstate':

DEBUG virsh east 8:16:53.09/12.04: running: ['sudo', 'virsh', 'domstate', 'east']
DEBUG virsh east 8:16:53.09/12.04: output: paused

- Try "resuming" the VM using "virsh resume" and see what happens:

ERROR virsh east 8:16:53.09/12.04: domain suspended, trying resume
DEBUG virsh east 8:16:53.09/12.04: running: ['sudo', 'virsh', 'resume', 'east']
DEBUG virsh east 8:16:54.00/12.05: virsh exited with unexpected status code 1

- and here's the error from virsh:

error: Failed to resume domain east
error: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

- so like it says, pull the cord


case #2, domain still "running", reset never happened:

DEBUG virsh nic 11:28:16.00/31.09: running: ['sudo', 'virsh', 'domstate', 'nic']
DEBUG virsh nic 11:28:16.01/31.09: output: running

- give up and "pull the cord"
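The two recovery paths above boil down to a small decision function; a sketch (hypothetical names, not the framework's real code):

```python
def recovery_commands(domain: str, state: str) -> list[list[str]]:
    """Map a wedged domain's 'virsh domstate' output to recovery steps.

    Case #1 ("paused"): try 'virsh resume' first; when QEMU answers
    "Resetting the Virtual Machine is required" that fails, and the
    hard restart below is the fallback.  Case #2 ("running" but
    silent): go straight to pulling the cord.
    """
    pull_the_cord = [["sudo", "virsh", "destroy", domain],
                     ["sudo", "virsh", "start", domain]]
    if state == "paused":
        return [["sudo", "virsh", "resume", domain]] + pull_the_cord
    return pull_the_cord
```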

Comment 5 Richard W.M. Jones 2017-03-28 18:09:59 UTC
Is it possible that, although this is an F24 host, you have
updated glib2 >= 2.50, perhaps by installing the F25 glib2 package?

The guest has an ISA serial console, and we have observed in another
bug (bug 1435432) that the serial console occasionally hangs because
of lost glib2 main loop events.  There are two ways to determine
whether you're suffering from the same bug:

(1) Patch your copy of glib2 (on the host) so that it reverts the
commit shown here:
https://bugzilla.redhat.com/show_bug.cgi?id=1435432#c6

(2) Or, and this may or may not be easy depending on your test framework,
arrange to inject keypresses into the serial console (eg. sending the
Enter key once per second) and see if that cures the problem.
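Option (2) could be approximated from the host by periodically writing a newline to the pty behind the domain's serial console (a sketch; 'virsh ttyconsole' reports that pty device, and whether input actually revives a stalled chardev is exactly what this would test):

```python
import subprocess
import time

def console_pty(domain: str) -> str:
    """Return the pty device (e.g. /dev/pts/5) behind the serial console."""
    out = subprocess.run(["sudo", "virsh", "ttyconsole", domain],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def nudge_console(domain: str, interval: float = 1.0) -> None:
    """Write a newline to the console pty once per interval, forever.

    If lost glib2 main-loop events are stalling the serial console
    (bug 1435432), regular input might kick it back into life.
    """
    pty = console_pty(domain)
    while True:
        with open(pty, "w") as f:
            f.write("\n")
        time.sleep(interval)
```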

Comment 6 Cole Robinson 2017-03-28 18:20:52 UTC
(typed this before I saw rich's comment)

Thanks for the info. My read is that the qemu process hanging in this way is somewhat by design: notice the -no-shutdown option on the command line, which tells qemu "don't exit the process after VM shutdown is complete". Libvirt passes this to qemu so it can interact with the qemu process before it disappears, and then shuts it down itself.

qemu doesn't have a native command to reboot the VM with the semantics of the libvirt API, so 'virsh reboot' has to fake it by doing:

- request shutdown
- wait for event from qemu process indicating the guest OS shutdown
- issue qemu reset, issue qemu start

So something is either going wrong in:

- the guest OS (qemu doesn't see the shutdown event... probably unlikely)
- a qemu issue messing up the notification delivery
- something in libvirt's handling

This stuff will unfortunately be a pain to debug. For starters, if you can distill a reproducer outside of your test suite, that would be helpful, because then I can debug it on my own machine. Maybe use a stock VM image from virt-builder, for example. Otherwise, see whether this reproduces on f25 qemu/libvirt, and if so, check fedora-virt-preview as well, which has the latest versions.

But if this is blocking you in the meantime, you can alter your test suite to do 'shutdown; wait for the VM to shut down; if it doesn't, virsh destroy; virsh start' or something along those lines.

Comment 7 Cagney 2017-03-30 15:17:17 UTC
(In reply to Richard W.M. Jones from comment #5)
> Is it possible that although this is a F24 host, that you have
> updated glib2 >= 2.50, perhaps by installing the F25 glib2 package?

$ rpm -q glib2
glib2-2.48.2-1.fc24.x86_64

(the system is up-to-date)

> The guest has an ISA serial console, and we have observed in another
> bug (bug 1435432) that the serial console occasionally hangs because
> of lost glib2 main loop events.  There are two ways you can use to
> determine if you're suffering from the same bug:
> 
> (1) Patch your copy of glib2 (on the host) so that it reverts the
> commit shown here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1435432#c6
> 
> (2) Or, and this may or may not be easy depending on your test framework,
> arrange to inject keypresses into the serial console (eg. sending the
> Enter key once per second) and see if that cures the problem.

(I wonder if this is why I found f25 worse)

I'll leave this as a todo for now:

- I've not otherwise noticed the console hanging

- I'm using glib2 < 2.50

- some (not all) hangs involve the domain magically switching to a broken "paused" state, i.e.,:

  $ sudo virsh domstate east
  paused

  $ sudo virsh resume east
  error: Failed to resume domain east
  error: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

(unless you suspect I'm fighting two bugs).

Comment 8 Cagney 2017-03-30 15:36:41 UTC
(In reply to Cole Robinson from comment #6)
> (typed this before I say rich's comment)
> 
> - request shutdown
> - wait for event from qemu process indicating the guest OS shutdown
> - issue qemu reset, issue qemu start

Thanks for filling in this detail.

> This stuff will be a pain to debug unfortunately. So for starters if you can
> distill a reproducer outside of your test suite that would be helpful
> because then I can debug on my own machine. Maybe use a stock VM image from
> virt-builder for example. Otherwise seeing if this reproduces on f25
> qemu/libvirt, and if so, check fedora-virt-preview as well which has latest
> versions.

It was a while ago, but I believe I did try a more vanilla VM, with no luck.  If I get a chance I'll poke at it again.

(as for trying f25 again, I'd rather wait until f26)

> But if this is blocking you in the meantime you can alter your test suite to
> do 'shutdown; wait for VM to shutdown; if not, virsh destroy; virsh start;'
> or something along those lines

Yes, I'm already doing this.

Comment 9 Fedora End Of Life 2017-07-25 22:57:27 UTC
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 10 Fedora End Of Life 2017-08-08 17:16:11 UTC
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 11 Paul Wouters 2017-08-08 18:58:29 UTC
Still happening on f26.

Comment 12 Cole Robinson 2017-08-08 19:03:45 UTC
(In reply to Paul Wouters from comment #11)
> still happening on f26

Paul, are you seeing this within the context of the libreswan test suite, like the original comment, or something else?

Comment 13 Paul Wouters 2017-08-08 19:05:02 UTC
Yes I am, but possibly only because that's my only heavy-duty case of firing qemu instances up and down via libvirtd?

Comment 14 Cagney 2017-08-11 01:41:39 UTC
On F26 there's a slightly different behaviour:

- the VM is told to reboot
- the guest OS shuts down and hits the "reset"
- virsh exits (and so, it seems, does the VM)

What should have happened instead is that virsh stayed connected and the VM continued to run.

Comment 15 Cole Robinson 2017-08-16 21:31:46 UTC
(In reply to cagney from comment #14)
> On F26 there's a slightly different behaviour:
> 
> - VM told to reboot
> - Guest OS shuts down hits the "reset"
> - VIRSH exits (so to it seems does the VM)
> 
> what should have instead happened is VIRSH stayed connected and the VM
> continued to run.

This is a reboot initiated from inside the guest?
And virsh in this case is 'virsh console' ?

Any sign of the reboot hang initially reported?

Comment 16 Cagney 2017-08-17 01:21:54 UTC
This isn't a new problem; the comments above still apply.

- virsh is used to initiate reboot
- virsh console exits (presumably because the guest exited instead of rebooting)

Comment 17 Tom Horsley 2017-12-09 00:58:28 UTC
I see something very similar on my Windows 10 virtual machine. When Windows Update says it needs to reboot and I click Reboot Now, the machine cycles back to the console window and the BIOS reports no bootable device found.

This happens 100% of the time when Windows Update is doing the reboot (I've been watching it install the Fall Creators Update all day, and every time it does one of its reboots, I get the BIOS boot failure).

If I force off the virtual machine, then boot it from scratch, it seems to come up OK (hopefully nothing is corrupted). With the Creators Update it starts at the point where it left off, as though it had rebooted itself.

If I'm just running Windows 10 normally and I use the reboot button, I don't see the problem; only Windows Update seems to be unable to reboot.

Comment 18 Cagney 2018-01-04 01:37:55 UTC
Still happens on F27.

Comment 19 Ben Cotton 2018-11-27 17:35:34 UTC
This message is a reminder that Fedora 27 is nearing its end of life.
On 2018-Nov-30  Fedora will stop maintaining and issuing updates for
Fedora 27. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora  'version' of '27'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 27 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 20 Ben Cotton 2018-11-30 19:04:13 UTC
Fedora 27 changed to end-of-life (EOL) status on 2018-11-30. Fedora 27 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

