676205 – libvirtd: Timed out during operation: cannot acquire state change lock

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	5.6
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Laine Stump
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-02-09 04:13 UTC by Douglas Schilling Landgraf
Modified:	2018-12-03 17:15 UTC
CC List:	23 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-10-26 19:05:16 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
debug logs (129.87 KB, application/x-gzip) 2011-02-09 04:18 UTC, Douglas Schilling Landgraf
Screenshot of virt-manager running on Debian Squeeze (over X-forwarding on MacOSX) when the problem with libvirtd occurred (88.07 KB, image/png) 2011-02-21 14:56 UTC, John Paul Adrian Glaubitz

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Legacy)	39308	0	None	None	None	Never

Description Douglas Schilling Landgraf 2011-02-09 04:13:47 UTC

Description of problem:

The customer cannot resume a virtual machine. The error is shown below:

# virsh resume v_rhel5_prod 
error: Failed to resume domain v_rhel5_prod 
error: Timed out during operation: cannot acquire state change lock

Version-Release number of selected component (if applicable):
libvirt-0.8.2-15.el5_6.1.x86_64                           
libvirt-python-0.8.2-15.el5_6.1.x86_64                 

Also tried:

# virsh destroy v_rhel5_prod
error: Failed to destroy domain v_rhel5_prod
error: Timed out during operation: cannot acquire state change lock

# virsh start v_rhel5_prod
error: Domain is already active

Additional info:
Even after rebooting the host, the VM remains locked.

Similar issue:
https://bugzilla.redhat.com/show_bug.cgi?id=668438

The debug logs are attached.

Comment 1 Douglas Schilling Landgraf 2011-02-09 04:18:50 UTC

Created attachment 477730
debug logs

Comment 4 Daniel Berrangé 2011-02-09 17:18:24 UTC

Are there any files in /var/lib/libvirt/qemu/save or /var/lib/libvirt/qemu/snapshot? And is the 'libvirt-guests' initscript enabled on boot?

The most likely guess would be that there was a saved guest that failed to restore properly on boot.
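
The check asked for above can be scripted. The following is a minimal sketch, not part of libvirt: the helper name `leftover_state_files` is hypothetical, and the default root simply mirrors the two paths quoted in this comment. Reading those directories on a real host will normally require root.

```python
from pathlib import Path

# Assumption: default location taken from the paths quoted in the comment above.
DEFAULT_ROOT = Path("/var/lib/libvirt/qemu")


def leftover_state_files(root=DEFAULT_ROOT):
    """Return {"save": [...], "snapshot": [...]} listing files found under root.

    A missing directory is reported as an empty list rather than an error,
    so the result can be pasted straight into a bug comment.
    """
    found = {}
    for sub in ("save", "snapshot"):
        directory = Path(root) / sub
        if directory.is_dir():
            # Paths are reported relative to the subdirectory, sorted for stable output.
            found[sub] = sorted(
                str(p.relative_to(directory))
                for p in directory.rglob("*")
                if p.is_file()
            )
        else:
            found[sub] = []
    return found
```

Calling `leftover_state_files()` with no arguments inspects the default paths. Passing another directory makes it easy to try the helper on a copy of the state directory instead of the live host.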

Comment 6 Douglas Schilling Landgraf 2011-02-15 21:34:53 UTC

Hello Daniel,

Sorry for the delay here; the customer decided to move to RHEL 6. They are uploading these files for further analysis. Do you have any suggestions?

Thanks
Douglas

Comment 9 John Paul Adrian Glaubitz 2011-02-21 14:54:19 UTC

Hi,

I am running libvirtd with kvm on a Debian Squeeze host and I am experiencing the same problem from time to time. I'm using virt-manager to control my virtual machines and sometimes libvirtd runs into problems controlling a kvm domain.

I cannot say exactly when the problem occurs, but it usually happens when I start and stop several virtual machines one after another. For example, I have several virtual machines with test installations for development, and since some of them are running unstable versions (Debian, for example), I start all the VMs once a week to update them, usually not more than two at the same time since my kvm host has only 4GB of RAM. It then usually happens that the state of the virtual machines is not updated in virt-manager, and when trying to start a virtual machine which is powered off, I receive the aforementioned error message.

The problem is always fixed by:

killall -9 libvirtd
rm /var/run/libvirtd.pid
/etc/init.d/libvirt-bin restart

The virtual machines are never affected by this problem, they still continue to run without any problems. It simply seems that libvirtd at some point cannot connect to the kvm host anymore due to a race condition. I'm attaching a screenshot of the error message in virt-manager the last time it happened. In this case, I logged into my Debian Squeeze kvm host over ssh and used X-forwarding to display virt-manager on the MacOS X host. virt-manager was not running on the Mac.

Version numbers:

dpkg -l libvirt\* |grep -e '^ii'
ii libvirt-bin 0.8.3-5 the programs for the libvirt library
ii libvirt0

dpkg -l virt\* |grep -e '^ii'
ii virt-manager 0.8.4-8 desktop application for managing virtual machines
ii virt-viewer 0.2.1-1 Displaying the graphical console of a virtual machine
ii virtinst 0.500.3-2

Regards,

Adrian

Comment 10 John Paul Adrian Glaubitz 2011-02-21 14:56:13 UTC

Created attachment 479933
Screenshot of virt-manager running on Debian Squeeze (over X-forwarding on MacOSX) when the problem with libvirtd occurred

Comment 12 John Sopko 2011-04-29 15:32:28 UTC

I had the same problem.

Running RHEL5.6 host machine with latest patches as of today.

# lsb_release -r
Release:        5.6

# uname -a
Linux lark.cs.unc.edu 2.6.18-238.9.1.el5 #1 SMP Fri Mar 18 12:42:39 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

# rpm -qa|grep virt
virt-manager-0.6.1-13.el5
libvirt-0.8.2-15.el5_6.3
libvirt-0.8.2-15.el5_6.3
python-virtinst-0.400.3-11.el5
libvirt-python-0.8.2-15.el5_6.3

I installed a RHEL 6.0 virtual machine, and the install went fine. The installer rebooted at the end, and then both the virtual machine window and the virt-manager window hung. I let them sit for 5 minutes or so and then killed them. I have 2 other, older virtual machines that continued to run. I then tried to start the new machine:

virsh # start lark-virtx
error: Failed to start domain lark-virtx
error: Timed out during operation: cannot acquire state change lock

Stopped libvirtd:

service libvirtd stop

Removed run directory:

rm -rf /var/run/libvirt

Note after the shutdown there was no /var/run/libvirt.pid file

Started libvirtd:

service libvirtd start

Was able to use virt-manager to start the new virtual machine.

Thanks Adrian, that worked!

Comment 13 John Paul Adrian Glaubitz 2011-04-29 17:19:41 UTC

Hi John,

On a side note: I recently upgraded libvirt from 0.8.2 to 0.9.0 and haven't seen the problem since. So, if you are able to upgrade your libvirtd to version 0.9.0 or newer, I strongly suggest you do so and see if that permanently fixes the problem for you as well.

It's certainly also nice for the maintainers/developers to know whether the new version fixes the bug and if several people independently claim it does, they will be able to change this bug report to "fixed" =).

Greetings from Norway,

Adrian

Comment 14 John Sopko 2011-05-02 19:08:02 UTC

Got a chance to play with this again for a few minutes. If I log in and halt a VM, or do a Shut Down from virt-manager, this consistently hangs virt-manager and the VNC client window.

If I do a "service libvirtd restart", virt-manager is able to reconnect. I do not have to remove any /var/run/libvirt* files.

I have been running 2 virtual machines for over a year. I only noticed this problem because I created a new machine and started seeing virt-manager hang with the "Timed out during operation: cannot acquire state change lock" error.

Hope an update comes out soon.

Comment 15 John Paul Adrian Glaubitz 2011-05-02 19:18:59 UTC

John,

As I previously mentioned, the bug has been fixed in version 0.9.0 and later. But since you are using an older version and cannot easily upgrade, the most reasonable solution would be a backport of the fix. That means the lines of code that were changed in 0.9.0 to address this particular problem would also be changed in 0.8.x, without changing anything else, to make sure that no new problems are introduced.

I haven't checked the changelog of libvirt 0.9.0, so I don't know which change actually fixed the problem, but I am pretty sure that it can easily be backported, and that it will be, since many people are using libvirt 0.8.x on RHEL, for which they have paid support.

Adrian

Comment 16 John Sopko 2011-05-02 19:35:26 UTC

Yes, I just thought that having a consistent way to reproduce the problem (shutting down the system) would be helpful information.

Comment 26 Daniel Berrangé 2011-09-28 16:26:37 UTC

Summary of situation wrt "Timed out during operation: cannot acquire state change lock"


There are a few reasons why you might see that error message in RHEL-5:

     1. The QEMU process has hung.

        QEMU won't respond to monitor commands. The API call making the first monitor command will wait forever; any subsequent API calls issuing monitor commands will time out after ~30 seconds with this libvirt error message.

        This is expected behaviour when QEMU has hung.

     2. The QEMU process is working on a very long/slow monitor command

        The API call making the long monitor command will wait until it (eventually) finishes. Any subsequent API calls wanting to issue monitor commands will wait up to ~30 seconds for the first call to finish, after which they return this libvirt error message.

        This is also expected behaviour when one API call is running a very long monitor command.

     3. Migration is aborted in between the 'Prepare' and 'Finish' step.

        Migration is a three-phase process. First we 'Prepare' on the target host, acquiring the lock. Then we run on the source host. Finally we 'Finish' on the target host, releasing the lock. If the libvirt client dies/quits halfway through, the lock may never be released. In this case, further monitor commands will return this libvirt error message.

        This is a bug.

     4. Libvirt has a bug in lock handling

        libvirt might run a monitor command, but forget to release the 'state change lock' once complete. Again, further monitor commands will return this message.

        This is a bug.
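
Scenario 2 above can be illustrated with a toy model. This is only a sketch under stated assumptions and not libvirt's implementation: `ToyDomain`, `StateChangeLockTimeout` and `demo` are hypothetical names, a plain `threading.Lock` stands in for the per-domain state change lock, and the ~30 second limit is shortened so the demonstration finishes quickly.

```python
import threading
import time


class StateChangeLockTimeout(Exception):
    """Stands in for the 'cannot acquire state change lock' error."""


class ToyDomain:
    """Toy model: one lock per domain serialises all monitor commands."""

    def __init__(self, wait_limit=30.0):
        self._job = threading.Lock()
        self.wait_limit = wait_limit  # ~30 s according to the summary above

    def monitor_command(self, run):
        # Later callers wait a bounded time for the current holder to finish.
        if not self._job.acquire(timeout=self.wait_limit):
            raise StateChangeLockTimeout("cannot acquire state change lock")
        try:
            return run()
        finally:
            self._job.release()


def demo(wait_limit=0.2, slow_for=1.0):
    """One slow command holds the lock; a second call gives up after wait_limit."""
    dom = ToyDomain(wait_limit)
    started = threading.Event()

    def slow():
        started.set()          # signal that the lock is now held
        time.sleep(slow_for)   # the long/slow monitor command
        return "slow done"

    first = threading.Thread(target=dom.monitor_command, args=(slow,))
    first.start()
    started.wait()
    try:
        dom.monitor_command(lambda: "quick")
        outcome = "ok"
    except StateChangeLockTimeout as exc:
        outcome = f"error: Timed out during operation: {exc}"
    first.join()
    # Once the slow command has finished, the lock is free again.
    after = dom.monitor_command(lambda: "quick")
    return outcome, after
```

In this model, scenario 1 is the same situation with a slow command that never returns, and scenario 4 corresponds to dropping the `finally` release: every later caller then times out until the process holding the lock is restarted.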

In RHEL-6.2 we have done a number of things to address / mitigate these problems:

 - It is now always possible to destroy a guest, even if the monitor is stuck. This lets you destroy a guest in scenario 1, which is not always possible with RHEL-5 libvirt, without restarting libvirtd.

 - Some pieces of code which held the lock for a long time, have been refactored to hold it for a much shorter period. This is primarily migration/save/restore/snapshot code. This should address some of the common reasons for seeing this error message

 - The migration code has been made more robust, to guarantee that all locks are released, even if the migration client aborts/quits without calling Finish.

So in RHEL-6.2, only scenarios 1 and 2 should remain, and those should occur less frequently, or at least be recoverable by killing the guest in question, without requiring a libvirtd daemon restart.
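
The migration problem (scenario 3) and the kind of guarantee described for RHEL-6.2 can be sketched in the same toy style. Everything here is an assumption made for illustration: `ToyMigrationTarget`, its `client_closed` hook and `lock_left_held` are hypothetical and do not correspond to real libvirt functions.

```python
import threading


class ToyMigrationTarget:
    """Toy model of the target host: Prepare takes the lock, Finish releases it."""

    def __init__(self):
        self._job = threading.Lock()

    def prepare(self):
        # The lock stays held between the Prepare and Finish API calls.
        if not self._job.acquire(timeout=0.1):
            raise TimeoutError("cannot acquire state change lock")

    def finish(self):
        self._job.release()

    def client_closed(self):
        # Hypothetical cleanup hook: if the client went away before Finish,
        # release the lock on its behalf so later commands are not blocked.
        if self._job.locked():
            self._job.release()

    def is_locked(self):
        return self._job.locked()


def lock_left_held(client_dies, cleanup_on_close):
    """Run one toy migration and report whether the lock is still held afterwards."""
    target = ToyMigrationTarget()
    target.prepare()
    if client_dies:
        if cleanup_on_close:
            target.client_closed()
        # Without the cleanup hook, nobody ever calls finish().
    else:
        target.finish()
    return target.is_locked()
```

With these definitions, `lock_left_held(True, False)` models the RHEL-5 bug (the lock stays held after the client dies), while `lock_left_held(True, True)` models a daemon that cleans up when the client connection closes.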

The changes made in RHEL-6.1/6.2 to deal with this error message required a lot of changes across all areas of the code. Backporting them to RHEL-5 would not be practical, because of the risk of introducing regressions in other areas.

Comment 28 RHEL Program Management 2011-10-26 19:05:16 UTC

Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.

Comment 29 nigil 2012-11-21 05:59:08 UTC

I want to add: 
When I tried to shut down the VM, it went into the paused state and hung. I could not resume or shut down the VM from the paused state.
[root@lnx132-75 vol_vm_data_disk_f63]# virsh list
 Id Name                 State
----------------------------------
 12 vm2_rhel6_x86_64     paused
 15 vm6_win2003_x86_64   running
 16 vm7_win2008_x86_64   running
 17 vm8_win7_x86         running
 18 vm9_win7_x86_64      running

[root@lnx132-75]# virsh shutdown vm2_rhel6_x86_64
error: Failed to shutdown domain vm2_rhel6_x86_64
error: Timed out during operation: cannot acquire state change lock

[root@lnx132-75]# virsh resume vm2_rhel6_x86_64
error: Failed to resume domain vm2_rhel6_x86_64
error: Timed out during operation: cannot acquire state change lock

[root@lnx132-75]# virsh start vm2_rhel6_x86_64
error: Domain is already active

[root@lnx132-75]# lsb_release -r
Release:        5.8

[root@lnx132-75]# rpm -qa | grep libvirt
libvirt-cim-0.5.8-3.el5
libvirt-0.8.2-25.el5
libvirt-0.8.2-25.el5
libvirt-python-0.8.2-25.el5

Found that the XML of the VM is saved as a .save file.
[root@lnx132-75 save]# pwd
/var/lib/libvirt/qemu/save
[root@lnx132-75 save]# ls
vm2_rhel6_x86_64.save

I removed the .save file and tried to resume/shut down again, but the same issue was observed.

