807996 – libvirtd may hang during tunneled migration

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	6.3
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Jiri Denemark
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	807907
Blocks:	840699 847946

Reported:	2012-03-29 10:09 UTC by hongming
Modified:	2013-02-21 07:09 UTC
CC List:	13 users
Fixed In Version:	libvirt-0.10.0-0rc0.el6
Doc Type:	Bug Fix
Doc Text:	Previously, repeatedly migrating a guest between two machines while using the tunnelled migration could cause the libvirt daemon to lock up unexpectedly. The bug in the code for locking remote drivers has been fixed and repeated tunnelled migrations of domains now work as expected.
Clone Of:
Environment:
Last Closed:	2013-02-21 07:09:26 UTC
Target Upstream Version:
Embargoed:

Attachments
virsh debug log (4.13 MB, text/plain) 2012-03-29 10:11 UTC, hongming
script (574 bytes, text/plain) 2012-03-29 10:41 UTC, hongming
mig-script (156 bytes, text/plain) 2012-03-29 10:44 UTC, hongming
gdb log (1.95 KB, text/plain) 2012-03-30 06:33 UTC, hongming
gdb log (2.07 KB, text/plain) 2012-03-30 06:35 UTC, hongming
libvirtd hang log on one side (6.25 MB, text/plain) 2012-05-08 07:29 UTC, weizhang
libvirtd hang log on the other side (6.32 MB, text/plain) 2012-05-08 07:30 UTC, weizhang
backtrace of libvirtd hang on one side (13.11 KB, text/plain) 2012-05-18 05:34 UTC, weizhang
backtrace of libvirtd hang on other side (14.86 KB, text/plain) 2012-05-18 05:35 UTC, weizhang

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2013:0276	0	normal	SHIPPED_LIVE	Moderate: libvirt security, bug fix, and enhancement update	2013-02-20 21:18:26 UTC

Description hongming 2012-03-29 10:09:29 UTC

Description of problem:
When multiple guests are migrated in both directions simultaneously for 500 loops, using a TLS connection and the "--tunnelled --p2p --live" flags, the following error occurs:

error: Timed out during operation: cannot acquire state change lock

When the same test is performed using a "qemu+ssh" connection and the "--p2p" flag, the error never occurs.



Version-Release number of selected component (if applicable):


How reproducible:
100% 

Steps to Reproduce:
1. Create a TLS connection between host A and host B.
2. Start 8 guests on host A and 8 guests on host B.
3. Start migrating multiple guests between host A and host B in both directions at the same time.

Actual results:
the following error occurs
error: Timed out during operation: cannot acquire state change lock

Expected results:
no error

Additional info:

Comment 1 hongming 2012-03-29 10:11:13 UTC

Created attachment 573615
virsh debug log

Comment 3 hongming 2012-03-29 10:26:57 UTC

Version-Release number of selected component (if applicable):
libvirt-0.9.10-9.el6.x86_64
qemu-kvm-0.12.1.2-2.267.el6.x86_64
kernel-2.6.32-250.el6.x86_64

Comment 4 hongming 2012-03-29 10:41:43 UTC

Created attachment 573621
script

Comment 5 hongming 2012-03-29 10:44:11 UTC

Created attachment 573622
mig-script

Comment 6 Jiri Denemark 2012-03-29 11:56:32 UTC

Could you, please, attach debug libvirtd logs from both hosts involved in the migration?

Comment 8 hongming 2012-03-30 06:33:04 UTC

Created attachment 573890
gdb log

Comment 9 hongming 2012-03-30 06:35:15 UTC

Created attachment 573891
gdb log

Both libvirtd daemons hang after the migrations have run for several loops. Please see the attached gdb logs.

Comment 10 Jiri Denemark 2012-03-30 12:39:54 UTC

Thanks. BTW, log files generally compress very well; these logs could be compressed to less than 13.5 MB, which is a huge improvement over their original size of about 0.5 GB, and downloading them would take much less than one hour.
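
As an illustration of the compression suggested above, here is a minimal sketch that assumes gzip is installed. The helper name compress_log is hypothetical, and the path in the usage comment is only an example.

```shell
# Hypothetical helper: write a gzip-compressed copy of a log file next to
# the original, leaving the original file untouched.
compress_log() {
    log=$1
    # -9 selects the strongest compression level; -c writes to stdout
    gzip -9 -c "$log" > "$log.gz"
}

# Example usage (path is an assumption):
#   compress_log /var/log/libvirt/libvirtd.log
```

Writing to a separate .gz file keeps the original log in place, so libvirtd can continue appending to it while the copy is attached.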

Comment 11 Jiri Denemark 2012-03-30 14:50:27 UTC

Hmm, gdb logs would really be much more useful if they contained backtraces of all threads (thread apply all backtrace).

Comment 12 Jiri Denemark 2012-04-03 10:28:04 UTC

Unfortunately, even though the logs are big, they don't contain the important parts. Could you try to reproduce again with the following settings in /etc/libvirt/libvirtd.conf (it should result in smaller debug logs):

log_level = 3
log_filters="1:conf 1:libvirt 1:qemu"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

and also get the full backtrace of the hung libvirtd.

Please do not cut anything from the logs or gdb backtrace, and make sure the backtrace corresponds to the libvirtd processes that generated the logs.
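
One way to capture such a full backtrace non-interactively is sketched below, assuming gdb (and ideally the libvirt debuginfo packages) is installed on the host. The helper name dump_all_backtraces and the output path are hypothetical.

```shell
# Hypothetical helper: attach gdb to a running process, print backtraces of
# all threads, and detach. Output (including gdb messages) goes to a file.
dump_all_backtraces() {
    pid=$1
    out=$2
    # -p attaches to the PID, -batch exits after running the commands,
    # -ex runs the given gdb command.
    gdb -p "$pid" -batch -ex 'thread apply all backtrace' > "$out" 2>&1
}

# Example usage (assumes exactly one libvirtd process is running):
#   dump_all_backtraces "$(pidof libvirtd)" /tmp/libvirtd-backtrace.txt
```

Running this on both hosts while the daemons are hung, and noting the PID alongside the log file, helps keep the backtrace and the logs matched to the same process.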

Comment 14 Jiri Denemark 2012-04-06 18:44:18 UTC

The logs show that the keepalive protocol closed the connection to the destination host, and it seems tunneled migration does not handle such a situation properly. The reason for "Timed out during operation: cannot acquire state change lock" is that the source libvirtd is waiting for a reply to the query-migrate monitor command, but I believe this is just a consequence.

Comment 15 Jiri Denemark 2012-04-25 14:19:52 UTC

This is similar to bug 807907 except for the forgotten job issue resulting in "cannot acquire state change lock", which did not happen in 807907 and I don't understand what happened yet.

Comment 16 Jiri Denemark 2012-04-27 15:01:05 UTC

Could you retest this once the patches for bug 807907 are built into the next libvirt package?

Comment 21 weizhang 2012-05-08 07:27:32 UTC

I tested bidirectional migration on snapshot 2 with the following command on both sides:
# for j in {1..500};do for i in `virsh list|awk 'NR>2{print $2}'`; do virsh migrate-setspeed $i 10000000;  virsh migrate --live --p2p --tunnelled $i qemu+tls://10.66.83.191/system --unsafe; done; done

which causes libvirtd to hang on both sides.

The guests I am using are 16 different kinds of guests, including Linux and Windows.

Is that a different bug?

Comment 22 weizhang 2012-05-08 07:29:13 UTC

Created attachment 582896
libvirtd hang log on one side

Comment 23 weizhang 2012-05-08 07:30:24 UTC

Created attachment 582897
libvirtd hang log on the other side

Comment 24 Jiri Denemark 2012-05-16 13:44:02 UTC

@hongming:
It looks like you hit bug 728603. That is, the source closed the connection
because of keep alive timeout but the destination did not properly cancel the
incoming migration. When your script then tries to migrate the domain back
from destination, the incoming migration is still believed to be ongoing and
thus you see "cannot acquire state change lock".

Now, the interesting part is why you still see connections being closed
because of keep alive timeouts. Could you change log_level and log_filters
options on both sides to

log_level = 1
log_filters = "3:util/json"

and try to reproduce again? The produced logs are going to be much larger,
though.

Comment 25 Jiri Denemark 2012-05-16 13:56:33 UTC

@weizhang:

Probably not. Anyway, could you provide backtrace of both libvirt daemons that hung?

Comment 26 hongming 2012-05-17 08:51:10 UTC

Now I can't get the "cannot acquire state change lock" error, even after running the case many times. libvirtd on both sides always hangs during migration. The error never occurs before libvirtd hangs.

Versions
libvirt-0.9.10-16.el6.x86_64 and libvirt-0.9.10-18.el6.x86_64 
qemu-kvm-0.12.1.2-2.292.el6.x86_64
kernel-2.6.32-269.el6.x86_64

Comment 28 weizhang 2012-05-18 05:34:13 UTC

Created attachment 585337
backtrace of libvirtd hang on one side

Comment 29 weizhang 2012-05-18 05:35:01 UTC

Created attachment 585338
backtrace of libvirtd hang on other side

Comment 31 Jiri Denemark 2012-07-18 09:48:24 UTC

The hang issues reported in earlier comments were partially caused by bug
728603 and incorrect patch for bug 807907. However, the backtraces in comments
28 and 29 also show a real bug in locking in remote driver. Stream APIs in
remote driver did not properly unlock remote driver before entering client
event loop, which may result in a hang when destination libvirtd is stuck. The
hang in those comments was largely influenced by the incorrect fix for bug
807907, which could make it hard to reproduce with current libvirt using the
same steps. However, the hang can be easily reproduced with the following
steps:

- make sure keepalive is turned off in qemu driver on source host and restart
  libvirtd on source after turning that off (see /etc/libvirt/qemu.conf)
- start tunneled migration
- run "virsh domjobinfo" in a loop in a separate terminal
- SIGSTOP destination daemon once migration data starts flowing
- once send buffers get full on source host, virsh domjobinfo loop should just
  hang; with this bug fixed, the loop will keep running but it won't show any
  change in the amount of transferred data

Comment 32 Jiri Denemark 2012-07-18 09:50:23 UTC

This bug is now fixed upstream by v0.9.13-73-g17f3be0:

commit 17f3be079c3c421eff203fcd311b0357ec42d801
Author: Jiri Denemark <jdenemar>
Date:   Tue Jul 17 16:36:23 2012 +0200

    remote: Fix locking in stream APIs
    
    Remote driver needs to make sure the driver lock is released before
    entering client IO loop as that may block indefinitely in poll(). As a
    direct consequence of not following this in stream APIs, tunneled
    migration to a destination host which becomes non-responding may block
    qemu driver. Luckily, if keepalive is turned on for p2p migrations, both
    remote and qemu drivers will get automagically unblocked after keepalive
    timeout.

Comment 34 weizhang 2012-08-06 12:01:43 UTC

Verification passed on
kernel-2.6.32-289.el6.x86_64
qemu-kvm-0.12.1.2-2.302.el6.x86_64
libvirt-0.10.0-0rc0.el6.x86_64

Steps
1. On the source, add "keepalive_interval = -1" to /etc/libvirt/qemu.conf and restart libvirtd.
2. Open another console and run
# while true; do virsh domjobinfo guest; done
3. Start the migration with
#  virsh migrate --live --p2p --tunnelled guest qemu+tcp://10.66.84.16/system --unsafe --verbose
When the migration starts to progress, run on the target host
# kill -SIGSTOP `pidof libvirtd`

Check the libvirtd status on the source: libvirtd is still running.

Can be reproduced on libvirt-0.9.10-21.el6_3.2.x86_64.rpm
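
The domjobinfo loop in step 2 relies on someone watching the console to notice a hang. A minimal sketch of detecting it automatically is shown below, assuming GNU coreutils timeout is available; the helper name check_hang and the 10-second limit are hypothetical choices, not part of the original verification.

```shell
# Hypothetical helper: run a command with a time limit and report whether it
# returned in time. GNU timeout exits with status 124 when the limit is hit.
# Note: "responded" only means the command returned, not that it succeeded.
check_hang() {
    limit=$1
    shift
    timeout "$limit" "$@" > /dev/null 2>&1
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hung"
    else
        echo "responded"
    fi
}

# Example usage (domain name "guest" as in the steps above):
#   while [ "$(check_hang 10 virsh domjobinfo guest)" = "responded" ]; do
#       sleep 1
#   done
#   echo "virsh domjobinfo stopped responding"
```

On an affected build the loop would be expected to end once the send buffers fill up on the source; on a fixed build it should keep running.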

Comment 36 errata-xmlrpc 2013-02-21 07:09:26 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html