Bug 623030
Summary: | Failed to resume paused domain with error "Timed out during operation: cannot acquire state change lock". | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | wangyimiao <yimwang> |
Component: | libvirt | Assignee: | Eric Blake <eblake> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 6.0 | CC: | berrange, dallan, dyuan, eblake, jclift, llim, moshiro, xen-maint, yoyzhang |
Target Milestone: | rc | | |
Target Release: | --- | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2010-11-18 18:31:40 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
wangyimiao
2010-08-11 07:14:14 UTC
> Failed to resume paused domain with error "Timed out during operation: cannot
> acquire state change lock".
This indicates that a previous monitor command is still running.
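For anyone hitting this, one way to see which monitor command libvirtd is still waiting on is to turn up libvirtd's debug logging. A rough sketch, assuming the stock config locations on RHEL 6 (the log_level/log_outputs settings come from libvirtd.conf; the exact log message text varies by libvirt version):

# In /etc/libvirt/libvirtd.conf, turn on debug logging to a file:
#   log_level = 1
#   log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
# Then restart the daemon, retry the failing operation, and look for the
# monitor command / job that never completed:
service libvirtd restart
grep -i monitor /var/log/libvirt/libvirtd.log | tail -50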
I'm having problems reliably reproducing it with:

kernel-2.6.32-66.el6.x86_64
qemu-kvm-0.12.1.2-2.112.el6.x86_64
libvirt-0.8.1-27.el6.x86_64

I've tried both raw images and qcow2 images. I did get a hang once, but right now all my save/restore attempts are either working (when I use /var/lib/libvirt/images/ as the place to store the state) or failing without locking up (when I use /root/ as the place to store state; since qemu is running as qemu:qemu, I get:

error: Failed to save domain rawhide to /root/dom.sav
error: operation failed: Migration unexpectedly failed

and the VM still continues to run).

But what I did notice when I did get the hang, compared to what is happening to me right now, is that 'virsh list --all' shows a saved domain as 'shut off', but the hung case showed as 'paused'. Maybe there's some sort of timing race where qemu and libvirtd get out of sync as to what state the VM has when qemu can't save because of permissions issues, and if the race is lost, then all subsequent attempts to get libvirtd to control the domain hang, because libvirtd is still waiting for a monitor command to qemu to complete even though qemu has nothing further to do based on the failed migration.

I have to wonder if this is related to https://bugzilla.redhat.com/show_bug.cgi?id=626544

Ouch, just got bitten by this when migrating a guest instance backwards and forwards between two RHEL 6.0 hosts. After only the 4th migration too. Should we mark this as a RHEL 6.0 blocker?

host1 (destination host, where it's hung)
*****************************************
# rpm -qa | grep libvirt
libvirt-client-0.8.1-27.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
libvirt-python-0.8.1-27.el6.x86_64
libvirt-devel-0.8.1-27.el6.x86_64
libvirt-debuginfo-0.8.1-27.el6.x86_64
# rpm -qa | grep kvm
qemu-kvm-0.12.1.2-2.113.el6.x86_64
# ps -ef|grep qemu
qemu 21899 1 0 21:16 ? 00:00:02 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Nehalem,+x2apic -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name vm1 -uuid 4ce376a6-db4a-9382-4f06-03ea4d2f6d0b -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/home/nfs_images/bz615941-vm1.img,if=none,id=drive-virtio-disk0,boot=on,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=22,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:bd:48:7e,bus=pci.0,addr=0x5 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0,tls,x509verify=/etc/pki/libvirt-vnc -k en-us -vga cirrus -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
# uname -a
Linux host1 2.6.32-70.el6.x86_64 #1 SMP Wed Aug 25 10:17:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

host2 (source host)
*******************
# rpm -qa | grep libvirt
libvirt-python-0.8.1-27.el6.x86_64
libvirt-client-0.8.1-27.el6.x86_64
libvirt-devel-0.8.1-27.el6.x86_64
libvirt-0.8.1-27.el6.x86_64
libvirt-debuginfo-0.8.1-27.el6.x86_64
# rpm -qa | grep kvm
qemu-kvm-0.12.1.2-2.113.el6.x86_64
qemu-kvm-tools-0.12.1.2-2.113.el6.x86_64
# ps -ef|grep qemu
qemu 17136 1 5 21:16 ? 00:00:46 /usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu Nehalem,+x2apic -enable-kvm -m 1024 -smp 2,sockets=2,cores=1,threads=1 -name vm1 -uuid 4ce376a6-db4a-9382-4f06-03ea4d2f6d0b -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/home/nfs_images/bz615941-vm1.img,if=none,id=drive-virtio-disk0,boot=on,format=qcow2 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=22,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:bd:48:7e,bus=pci.0,addr=0x5 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -vga cirrus -incoming tcp:0.0.0.0:49152 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
# uname -a
Linux host2 2.6.32-70.el6.x86_64 #1 SMP Wed Aug 25 10:17:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

As a data point, even after killing the qemu process on both servers, plus the libvirtd process on both servers, the VM would not start up on host1 (same error). No lock file or anything in /var/run/libvirt/*. There was a /var/run/libvirt/network/default.pid file though, with a valid pid, on both hosts.

# cd /var/run/libvirt/
# find .
.
./lxc
./network
./network/default.pid
./qemu
# cat network/default.pid
1720
# ps -ef|grep 1720
nobody 1720 1 0 Aug29 ? 00:00:00 /usr/sbin/dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --listen-address 192.168.122.1 --except-interface lo --dhcp-range 192.168.122.2,192.168.122.254 --dhcp-lease-max=253 --dhcp-no-override

And same kind of thing on the other host. What allowed things to start working again on these hosts was (on each host):

1. Kill the process listed in the pid file
2. Remove the now-stale pid file
3. Flush any iptables rules (just in case) -> iptables -F
4. Start libvirtd again

The guest was able to be started on host1 after that, and then migrated backwards and forwards between hosts. So far, so good. (Using SSH non-tunnelled migrations now, rather than the TLS ones being used when the process wedged.)

This error, "error: Timed out during operation: cannot acquire state change lock", is happening frequently here. Eric, is there any kind of useful diag stuff I can do to help figure out the cause?

If it helps, I've found a way to replicate this problem fairly easily, and to make that cause go away. (No idea of the underlying problem though.) For the two hosts here, if:

a) the libvirt daemon is not told to listen (ie --listen is not enabled in /etc/sysconfig/libvirtd), but
b) the QEMU config _has_ been told to use TLS (plus had the TLS certificates configured),

then migration between the hosts will wedge with the "cannot acquire state change lock" error within 3 migrations from one to the other (sometimes on the first go). In this situation the source migration host is still running the guest, but the destination one has it in "paused" state, and that's as far as it gets. Any operation attempted on the destination host's copy gives that error message.

Additionally, even though TLS is configured, the migrations still wedge if done using SSH. Thinking it might be related to the TLS vnc options on each host or something, which are there regardless of migration type, if TLS is configured in qemu.conf.
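For reference, a minimal sketch of the configuration combination described above; the VNC TLS option names are the standard qemu.conf ones, inferred from the "-vnc 0.0.0.0:0,tls,x509verify=/etc/pki/libvirt-vnc" option visible in the qemu command lines, so treat this as approximate rather than a copy of the actual files:

/etc/sysconfig/libvirtd (daemon NOT listening for remote connections):
#LIBVIRTD_ARGS="--listen"

/etc/libvirt/qemu.conf (but VNC TLS still enabled for every guest):
vnc_tls = 1
vnc_tls_x509_verify = 1
vnc_tls_x509_cert_dir = "/etc/pki/libvirt-vnc"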
The way to avoid the problem... disable TLS in the qemu.conf. Migrations seem to happen pretty flawlessly then.

***************************************************************************
Simple script I use to move a guest between hosts, to see if the process is
working in a stable fashion.
***************************************************************************
#!/bin/bash
# Migrate a guest back and forth between two hosts, printing progress as it goes
HOST1=host1
HOST2=host2
GUEST="vm1"
TRANSPORT="ssh"
#TRANSPORT="tls"
for i in `seq 1 10`; do
    echo "Loop ${i}: Migrating from ${HOST1} to ${HOST2}"
    virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${GUEST} --live qemu+${TRANSPORT}://root@${HOST2}/system
    echo "Loop ${i}: Migrating back from ${HOST2} to ${HOST1}"
    virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${GUEST} --live qemu+${TRANSPORT}://root@${HOST1}/system
done

> Additionally, even though TLS is configured, the migrations still wedge if done
> using SSH. Thinking it might be related to the TLS vnc options on each host or
> something, which are there regardless of migration type, if TLS is configured
> in qemu.conf.
I smell a lack of entropy on /dev/random. I bet QEMU is blocking waiting on /dev/random at startup, and won't respond on the QMP monitor until it has sufficient entropy. This could then block libvirt's startup of QEMU, waiting for the QEMU monitor to wake up.
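If lack of entropy is the culprit, it should show up directly on the destination host around the time of the hang. A quick check (this uses the standard kernel interface, nothing libvirt-specific):

# watch the kernel's available entropy pool while a migration is incoming
watch -n1 cat /proc/sys/kernel/random/entropy_avail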
That may also explain why the system it wedges on still feels a bit weird for a while afterwards. Even though the libvirt and qemu processes have been nuked, it could still be taking a period of time to build up enough useful entropy again.

Uh oh... it occurs to me that if this is the underlying cause, then hosts with large numbers of guests (ie 500+) could be looking at a pretty lengthy time to get all their guests moved off, ie the scenario where hardware has reported imminent failure and needs repair or looking at, etc.

As additional data, this problem also occurs with _guests_ using installation types other than "minimum installation". I've just now installed a guest using the "Software Development Workstation" (or similarly worded) install, and this error occurs immediately upon migration if QEMU is configured for TLS. Both host systems and the guest are running the RHEL6 nightly build from last night (20100901), so this is occurring with the current QEMU code. (Wonder if I should try an older QEMU build and see what happens?)

Migration generally works using an older version of QEMU:

qemu-kvm-0.12.1.2-2.90.el6.x86_64.rpm
qemu-img-0.12.1.2-2.90.el6.x86_64.rpm
qemu-kvm-tools-0.12.1.2-2.90.el6.x86_64.rpm
(built from SRPMs from brew)

It also kernel panics reasonably often during migration though, so not a huge win. :/

Note that the migrations take a lot longer than "normal" to occur. It seems to exhibit the same kind of problem as reported in the bug:

-> guest is started on destination in "paused" state, then waits for ages before <failing>

With this older QEMU though, instead of virsh coming back with "cannot acquire state change lock" after the long period of time, it generally finishes the migration successfully (after a "paused" state of 2+ minutes).

Is there any chance that with the old version of qemu, after some internal timeout occurs it would allow things to continue, but the newer versions have changed to abort instead, thus the error message being reported?
Timings for clarification using the downgraded QEMU:

TLS enabled migration - host1 to host2
**************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --p2p --tunnelled
real 2m54.508s
user 0m0.006s
sys 0m0.009s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
real 2m59.868s
user 0m0.016s
sys 0m0.033s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
error: Timed out during operation: cannot acquire state change lock
real 0m30.405s
user 0m0.014s
sys 0m0.021s

(I manually rebooted the host here with "sudo reboot")

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --persistent
real 2m30.338s
user 0m0.016s
sys 0m0.027s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --persistent
real 2m46.492s
user 0m0.021s
sys 0m0.028s

TLS enabled migration - host2 to host1
**************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live --p2p --tunnelled
real 2m54.731s
user 0m0.005s
sys 0m0.003s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
(kernel panic) :( (no long pause)
--> This looks to be the only occasion where the migration "worked properly"

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m23.507s
user 0m0.015s
sys 0m0.011s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
(kernel panic again)

TLS not enabled version - host1 to host2
****************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --p2p --tunnelled
real 2m25.240s
user 0m0.004s
sys 0m0.010s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live --p2p --tunnelled
real 2m39.303s
user 0m0.007s
sys 0m0.020s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
(kernel panic)

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
real 2m20.949s
user 0m0.018s
sys 0m0.017s

$ time sudo virsh migrate rhel6-full qemu+ssh://host2/system --live
real 2m16.579s
user 0m0.016s
sys 0m0.038s

TLS not enabled version - host2 to host1
****************************************
$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m22.101s
user 0m0.010s
sys 0m0.019s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m22.075s
user 0m0.013s
sys 0m0.037s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m20.878s
user 0m0.017s
sys 0m0.011s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m20.671s
user 0m0.011s
sys 0m0.025s

$ time sudo virsh migrate rhel6-full qemu+ssh://host1/system --live
real 0m25.632s
user 0m0.013s
sys 0m0.010s

Weirdly, the only migrations here consistently "ok" are those from host2 to host1 with TLS not enabled in QEMU. Even the host1 -> host2 version is having troubles.

Hi Moritoshi,

After updating both host servers from yesterday's nightly tree, this problem doesn't seem to be occurring any more. Migration (using --live --p2p --tunnelled) over TLS is working properly and consistently, and hasn't given any indication of problems when run in the simple scripting loop. It's done about 60 migrations so far in the last few minutes, no problems at all. Notice the migrations are also consistently faster, by a few seconds, than the fastest of the times above.
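For context, the per-loop timings below look like output from a variant of the earlier loop script with each virsh call wrapped in time; the exact timed variant isn't included in this report, so the following is only a sketch of what it might look like:

#!/bin/bash
# Same back-and-forth loop as before, but timing each migration
HOST1=host1
HOST2=host2
GUEST="vm1"
TRANSPORT="tls"
date
for i in `seq 1 10`; do
    echo "Loop ${i}: Migrating from ${HOST1} to ${HOST2}"
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${GUEST} --live --p2p --tunnelled qemu+${TRANSPORT}://root@${HOST2}/system
    echo "Loop ${i}: Migrating back from ${HOST2} to ${HOST1}"
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${GUEST} --live --p2p --tunnelled qemu+${TRANSPORT}://root@${HOST1}/system
done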
$ ./migrate_test.sh
Wed Sep 8 01:32:06 EST 2010
Loop 1: Migrating from host1 to host2
real 0m14.065s
user 0m0.010s
sys 0m0.002s
Loop 1: Migrating back from host2 to host1
real 0m14.034s
user 0m0.008s
sys 0m0.003s
Loop 2: Migrating from host1 to host2
real 0m14.112s
user 0m0.006s
sys 0m0.005s
Loop 2: Migrating back from host2 to host1
real 0m14.175s
user 0m0.010s
sys 0m0.000s
Loop 3: Migrating from host1 to host2
real 0m14.073s
user 0m0.009s
sys 0m0.002s
Loop 3: Migrating back from host2 to host1
real 0m14.011s
user 0m0.007s
sys 0m0.004s
Loop 4: Migrating from host1 to host2
real 0m14.091s
user 0m0.008s
sys 0m0.003s
Loop 4: Migrating back from host2 to host1
real 0m14.257s
user 0m0.009s
sys 0m0.001s
Loop 5: Migrating from host1 to host2
real 0m14.117s
user 0m0.008s
sys 0m0.003s
Loop 5: Migrating back from host2 to host1
real 0m14.097s
user 0m0.008s
sys 0m0.002s
Loop 6: Migrating from host1 to host2
real 0m14.018s
user 0m0.008s
sys 0m0.004s
Loop 6: Migrating back from host2 to host1
real 0m14.106s
user 0m0.008s
sys 0m0.002s
Loop 7: Migrating from host1 to host2
real 0m14.174s
user 0m0.008s
sys 0m0.003s
Loop 7: Migrating back from host2 to host1
real 0m14.034s
user 0m0.008s
sys 0m0.002s
Loop 8: Migrating from host1 to host2
real 0m14.065s
user 0m0.010s
sys 0m0.001s
Loop 8: Migrating back from host2 to host1
real 0m14.028s
user 0m0.007s
sys 0m0.005s
<snipped>

Based on comments 17 and 18, I believe this BZ is resolved.