Bug 1394405

Summary: libvirtError: Timed out during operation: cannot acquire state change lock
Product: Red Hat OpenStack Reporter: Aaron Thomas <aathomas>
Component: openstack-nova Assignee: Eoghan Glynn <eglynn>
Status: CLOSED CURRENTRELEASE QA Contact: Prasanth Anbalagan <panbalag>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 5.0 (RHEL 7) CC: aathomas, awaugama, berrange, dasmith, eglynn, jthomas, kchamart, sbauza, sferdjao, sgordon, srevivo, vromanso, wlehman
Target Milestone: async Keywords: ZStream
Target Release: 5.0 (RHEL 7)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-25 17:31:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Aaron Thomas 2016-11-11 21:26:35 UTC
Description of problem:
-----------------------------------------
This appears to be similar to the upstream nova bug attached to this BZ: queries against a domain that is being worked on asynchronously can result in hung clients and the inability to do anything further with that domain.

Version-Release number of selected component (if applicable):
-----------------------------------------
libvirt-1.2.8-16.el7_1.3.x86_64
libvirt-client-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-config-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-interface-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-lxc-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-network-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nodedev-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-nwfilter-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-qemu-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-secret-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-driver-storage-1.2.8-16.el7_1.3.x86_64
libvirt-daemon-kvm-1.2.8-16.el7_1.3.x86_64
libvirt-python-1.2.8-7.el7_1.1.x86_64
openstack-nova-common-2014.1.5-31.el7ost.noarch
openstack-nova-compute-2014.1.5-31.el7ost.noarch

How reproducible:
-----------------------------------------
Appears to occur every few days under heavy query load. However, the logs appear to show two separate threads making libvirt API calls against the same instance. Based on https://bugs.launchpad.net/nova/+bug/1254872, one of the calls has either hung completely or is taking a very long time to respond, causing the second API call to report the error "libvirtError: Timed out during operation: cannot acquire state change lock".
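
To make the suspected interaction concrete, here is a toy model of two threads contending for a single per-domain lock. This is only a sketch under stated assumptions: it does not use libvirt at all, and the names ToyDomain, StateChangeLockTimeout, and demo, as well as the timeout values, are hypothetical and chosen for illustration. It only shows how one slow call holding a lock can make a second call on the same domain fail with a timeout.

```python
import threading
import time


class StateChangeLockTimeout(Exception):
    """Hypothetical stand-in for the libvirtError reported in this bug."""


class ToyDomain:
    """Toy model of a domain guarded by one per-domain job lock."""

    def __init__(self, name, lock_timeout):
        self.name = name
        self.lock_timeout = lock_timeout  # seconds to wait for the lock
        self._job_lock = threading.Lock()

    def run_job(self, duration, acquired=None):
        # Wait up to lock_timeout seconds for the lock, then give up.
        if not self._job_lock.acquire(timeout=self.lock_timeout):
            raise StateChangeLockTimeout(
                "Timed out during operation: cannot acquire state change lock"
            )
        try:
            if acquired is not None:
                acquired.set()  # tell the caller the lock is now held
            time.sleep(duration)  # simulate time spent inside the call
        finally:
            self._job_lock.release()


def demo():
    domain = ToyDomain("instance-0001", lock_timeout=0.2)
    results = {}
    slow_holds_lock = threading.Event()

    def slow_call():
        # Holds the lock for much longer than lock_timeout.
        domain.run_job(duration=1.0, acquired=slow_holds_lock)
        results["slow"] = "ok"

    def second_call():
        try:
            domain.run_job(duration=0.0)
            results["second"] = "ok"
        except StateChangeLockTimeout as exc:
            results["second"] = str(exc)

    first = threading.Thread(target=slow_call)
    first.start()
    slow_holds_lock.wait()  # start the second call only once the lock is held
    second = threading.Thread(target=second_call)
    second.start()
    first.join()
    second.join()
    return results
```

In this model the slow call completes normally while the second call gives up after its timeout, which matches the pattern described above. Whether the first call in the customer environment is slow or completely hung cannot be told from this sketch.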

Actual results:
-----------------------------------------
Queries against a domain which is being worked on asynchronously
can result in hung clients and the inability to do anything
further with that domain.

Expected results:
-----------------------------------------
Queries against a domain which is being worked on asynchronously
do not result in hung clients or the inability to do anything
further with that domain.

Additional info:
-----------------------------------------
We've requested that the customer enable verbose libvirt logging to help identify the specific API calls that were last reported as successful, since this error can apparently be caused by many factors.

Comment 1 Kashyap Chamarthy 2016-11-15 14:22:39 UTC
Assuming the customer is about to provide debug logs, they should be captured with the log filters below:

  1. In /etc/libvirt/libvirtd.conf, set these two config attributes:

     . . .
     log_filters="1:libvirt 1:qemu 1:conf 1:security 3:event 3:json 3:file 3:object 1:util"
     log_outputs="1:file:/var/log/libvirt/libvirtd.log"
     . . .

  2. Restart libvirtd:

     $ systemctl restart libvirtd

  3. Repeat the test.

  4. Capture the libvirt logs and attach them as plain text to the
     bug.
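
Before repeating the test, it may be worth confirming that both settings from step 1 are actually active (uncommented) in the config file. The helper below is a rough sketch, not an official libvirt tool: the function names are hypothetical, and it assumes the simple key = "value" line form shown in step 1, with "#" starting a comment line. It does not handle any other syntax the config file may allow.

```python
import re

# The two settings requested in step 1.
REQUIRED_KEYS = ("log_filters", "log_outputs")

# Matches a line of the form: key = "value"
_SETTING = re.compile(r'^\s*(\w+)\s*=\s*"(.*)"\s*$')


def read_log_settings(conf_text):
    """Return the uncommented values of the required keys found in conf_text."""
    found = {}
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out settings are not active
        match = _SETTING.match(line)
        if match and match.group(1) in REQUIRED_KEYS:
            found[match.group(1)] = match.group(2)
    return found


def missing_log_settings(conf_text):
    """Return the required keys that are not set in conf_text."""
    found = read_log_settings(conf_text)
    return [key for key in REQUIRED_KEYS if key not in found]
```

For example, calling missing_log_settings(open("/etc/libvirt/libvirtd.conf").read()) on the compute node should return an empty list once both attributes are in place.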