Bug 807996 - libvirtd may hang during tunneled migration
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.3
Hardware: x86_64 Linux
Priority: urgent, Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Jiri Denemark
QA Contact: Virtualization Bugs
Keywords: ZStream
Depends On: 807907
Blocks: 840699 847946
Reported: 2012-03-29 06:09 EDT by hongming
Modified: 2013-02-21 02:09 EST (History)
Fixed In Version: libvirt-0.10.0-0rc0.el6
Doc Type: Bug Fix
Doc Text:
Previously, repeatedly migrating a guest between two machines while using the tunnelled migration could cause the libvirt daemon to lock up unexpectedly. The bug in the code for locking remote drivers has been fixed and repeated tunnelled migrations of domains now work as expected.
Last Closed: 2013-02-21 02:09:26 EST


Attachments
virsh debug log (4.13 MB, text/plain)
2012-03-29 06:11 EDT, hongming
script (574 bytes, text/plain)
2012-03-29 06:41 EDT, hongming
mig-script (156 bytes, text/plain)
2012-03-29 06:44 EDT, hongming
gdb log (1.95 KB, text/plain)
2012-03-30 02:33 EDT, hongming
gdb log (2.07 KB, text/plain)
2012-03-30 02:35 EDT, hongming
libvirtd hang log on one side (6.25 MB, text/plain)
2012-05-08 03:29 EDT, weizhang
libvirtd hang log on the other side (6.32 MB, text/plain)
2012-05-08 03:30 EDT, weizhang
backtrace of libvirtd hang on one side (13.11 KB, text/plain)
2012-05-18 01:34 EDT, weizhang
backtrace of libvirtd hang on other side (14.86 KB, text/plain)
2012-05-18 01:35 EDT, weizhang

Description hongming 2012-03-29 06:09:29 EDT
Description of problem:
When multiple guests are migrated bidirectionally and simultaneously for 500 loop iterations over a TLS connection with the "--tunnelled --p2p --live" flags, the following error occurs:

error: Timed out during operation: cannot acquire state change lock

When the same test is performed over a "qemu+ssh" connection with the "--p2p" flag, the error never occurs.



Version-Release number of selected component (if applicable):


How reproducible:
100% 

Steps to Reproduce:
1. Create a TLS connection between host A and host B.
2. Start 8 guests on host A and 8 guests on host B.
3. Start migrating multiple guests between host A and host B in both directions at the same time.
  
Actual results:
The following error occurs:
error: Timed out during operation: cannot acquire state change lock

Expected results:
no error

Additional info:
Comment 1 hongming 2012-03-29 06:11:13 EDT
Created attachment 573615 [details]
virsh debug log
Comment 3 hongming 2012-03-29 06:26:57 EDT
Version-Release number of selected component (if applicable):
libvirt-0.9.10-9.el6.x86_64
qemu-kvm-0.12.1.2-2.267.el6.x86_64
kernel-2.6.32-250.el6.x86_64
Comment 4 hongming 2012-03-29 06:41:43 EDT
Created attachment 573621 [details]
script
Comment 5 hongming 2012-03-29 06:44:11 EDT
Created attachment 573622 [details]
mig-script
Comment 6 Jiri Denemark 2012-03-29 07:56:32 EDT
Could you, please, attach debug libvirtd logs from both hosts involved in the migration?
Comment 8 hongming 2012-03-30 02:33:04 EDT
Created attachment 573890 [details]
gdb log
Comment 9 hongming 2012-03-30 02:35:15 EDT
Created attachment 573891 [details]
gdb log

Both libvirtd daemons hang after the migrations have run for some loops. Please see the attached gdb logs.
Comment 10 Jiri Denemark 2012-03-30 08:39:54 EDT
Thanks. BTW, log files generally compress very well; these logs could be compressed to less than 13.5 MB, a huge improvement over their original size of about 0.5 GB, and downloading them would then take much less than one hour.
Comment 11 Jiri Denemark 2012-03-30 10:50:27 EDT
Hmm, the gdb logs would really be much more useful if they contained backtraces of all threads (thread apply all backtrace).
Comment 12 Jiri Denemark 2012-04-03 06:28:04 EDT
Unfortunately, even though the logs are big, they don't contain the important parts. Could you try to reproduce the issue again with the following settings in /etc/libvirt/libvirtd.conf (they should result in smaller debug logs):

log_level = 3
log_filters="1:conf 1:libvirt 1:qemu"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

and also get the full backtrace of the hung libvirtd.

Please do not cut anything from the logs or the gdb backtrace, and make sure the backtrace corresponds to the libvirtd processes that generated the logs.
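For anyone collecting the requested backtrace non-interactively, here is a minimal sketch in Python. It only wraps the gdb command "thread apply all backtrace" mentioned in comment 11; the function names and the output path are hypothetical and not part of libvirt or of this report. It assumes gdb is installed and that the caller is allowed to attach to the libvirtd process.

```python
import subprocess
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a gdb invocation that dumps backtraces of all threads and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all backtrace"]


def collect_backtrace(pid: int, output_path: str) -> int:
    """Run gdb against the given PID and save everything it prints.

    output_path is a hypothetical location chosen by the caller.
    Returns the gdb exit status.
    """
    with open(output_path, "w") as out:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,  # keep gdb warnings in the same file
            check=False,
        )
    return result.returncode
```

Running collect_backtrace() once per host, with the PID of the libvirtd that produced the logs, keeps each backtrace paired with the right daemon.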
Comment 14 Jiri Denemark 2012-04-06 14:44:18 EDT
The logs show that the keepalive protocol closed the connection to the destination host, and it seems tunneled migration does not handle such a situation properly. The "Timed out during operation: cannot acquire state change lock" error appears because the source libvirtd is waiting for a reply to the query-migrate monitor command, but I believe this is just a consequence.
Comment 15 Jiri Denemark 2012-04-25 10:19:52 EDT
This is similar to bug 807907 except for the forgotten job issue resulting in "cannot acquire state change lock", which did not happen in 807907 and which I don't understand yet.
Comment 16 Jiri Denemark 2012-04-27 11:01:05 EDT
Could you retest this once the patches for bug 807907 are built into the next libvirt package?
Comment 21 weizhang 2012-05-08 03:27:32 EDT
I tested bidirectional migration on snapshot 2 with the following command on both sides:
# for j in {1..500};do for i in `virsh list|awk 'NR>2{print $2}'`; do virsh migrate-setspeed $i 10000000;  virsh migrate --live --p2p --tunnelled $i qemu+tls://10.66.83.191/system --unsafe; done; done

which causes libvirtd to hang on both sides.

The guests I am using are 16 different kinds of guests, including Linux and Windows.

Is that a different bug?
Comment 22 weizhang 2012-05-08 03:29:13 EDT
Created attachment 582896 [details]
libvirtd hang log on one side
Comment 23 weizhang 2012-05-08 03:30:24 EDT
Created attachment 582897 [details]
libvirtd hang log on the other side
Comment 24 Jiri Denemark 2012-05-16 09:44:02 EDT
@hongming:
It looks like you hit bug 728603. That is, the source closed the connection
because of keep alive timeout but the destination did not properly cancel the
incoming migration. When your script then tries to migrate the domain back
from destination, the incoming migration is still believed to be ongoing and
thus you see "cannot acquire state change lock".

Now, the interesting part is why you still see connections being closed
because of keep alive timeouts. Could you change log_level and log_filters
options on both sides to

log_level = 1
log_filters = "3:util/json"

and try to reproduce again? The produced logs are going to be much larger,
though.
Comment 25 Jiri Denemark 2012-05-16 09:56:33 EDT
@weizhang:

Probably not. Anyway, could you provide backtrace of both libvirt daemons that hung?
Comment 26 hongming 2012-05-17 04:51:10 EDT
Now I can't get the "cannot acquire state change lock" error even after running the case many times. The libvirtd daemons on both sides always hang during migration. The error never occurs before libvirtd hangs.

Versions
libvirt-0.9.10-16.el6.x86_64 and libvirt-0.9.10-18.el6.x86_64 
qemu-kvm-0.12.1.2-2.292.el6.x86_64
kernel-2.6.32-269.el6.x86_64
Comment 28 weizhang 2012-05-18 01:34:13 EDT
Created attachment 585337 [details]
backtrace of libvirtd hang on one side
Comment 29 weizhang 2012-05-18 01:35:01 EDT
Created attachment 585338 [details]
backtrace of libvirtd hang on other side
Comment 31 Jiri Denemark 2012-07-18 05:48:24 EDT
The hang issues reported in earlier comments were partially caused by bug
728603 and incorrect patch for bug 807907. However, the backtraces in comments
28 and 29 also show a real bug in locking in remote driver. Stream APIs in
remote driver did not properly unlock remote driver before entering client
event loop, which may result in a hang when destination libvirtd is stuck. The
hang in those comments was largely influenced by the incorrect fix for bug
807907, which could make it hard to reproduce with current libvirt using the
same steps. However, the hang can be easily reproduced with the following
steps:

- make sure keepalive is turned off in qemu driver on source host and restart
  libvirtd on source after turning that off (see /etc/libvirt/qemu.conf)
- start tunneled migration
- run "virsh domjobinfo" in a loop in a separate terminal
- SIGSTOP destination daemon once migration data starts flowing
- once send buffers get full on source host, virsh domjobinfo loop should just
  hang; with this bug fixed, the loop will keep running but it won't show any
  change in the amount of transferred data
Comment 32 Jiri Denemark 2012-07-18 05:50:23 EDT
This bug is now fixed upstream by v0.9.13-73-g17f3be0:

commit 17f3be079c3c421eff203fcd311b0357ec42d801
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Tue Jul 17 16:36:23 2012 +0200

    remote: Fix locking in stream APIs
    
    Remote driver needs to make sure the driver lock is released before
    entering client IO loop as that may block indefinitely in poll(). As a
    direct consequence of not following this in stream APIs, tunneled
    migration to a destination host which becomes non-responding may block
    qemu driver. Luckily, if keepalive is turned on for p2p migrations, both
    remote and qemu drivers will get automagically unblocked after keepalive
    timeout.
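
The locking rule stated in the commit message can be illustrated with a small sketch. This is not libvirt code: it is a Python model under stated assumptions, where driver_lock stands in for the remote driver lock, the wait on peer_replied stands in for the client I/O loop blocking in poll(), and the function name is hypothetical.

```python
import threading


def second_caller_gets_lock(release_before_wait: bool, stall_seconds: float = 1.0) -> bool:
    """Report whether another API call can take the driver lock while a
    stream call is blocked waiting for an unresponsive peer."""
    driver_lock = threading.Lock()     # stand-in for the remote driver lock
    peer_replied = threading.Event()   # never set: models a stopped destination daemon
    blocked = threading.Event()        # set once the stream call is about to block

    def stream_call() -> None:
        driver_lock.acquire()
        if release_before_wait:
            # Fixed behaviour: drop the driver lock before blocking.
            driver_lock.release()
            blocked.set()
            peer_replied.wait(stall_seconds)
            driver_lock.acquire()
        else:
            # Buggy behaviour: block with the driver lock still held.
            blocked.set()
            peer_replied.wait(stall_seconds)
        driver_lock.release()

    worker = threading.Thread(target=stream_call)
    worker.start()
    blocked.wait()
    # Models a concurrent call such as the "virsh domjobinfo" loop.
    got_lock = driver_lock.acquire(timeout=0.2)
    if got_lock:
        driver_lock.release()
    worker.join()
    return got_lock
```

With release_before_wait=False the second caller times out, which corresponds to the hung domjobinfo loop in the reproducer; with release_before_wait=True it gets the lock even though the peer never answers.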
Comment 34 weizhang 2012-08-06 08:01:43 EDT
Verification passed on
kernel-2.6.32-289.el6.x86_64
qemu-kvm-0.12.1.2-2.302.el6.x86_64
libvirt-0.10.0-0rc0.el6.x86_64

Steps
1. On the source, add "keepalive_interval = -1" to /etc/libvirt/qemu.conf and restart libvirtd.
2. Open another console and run
# while true; do virsh domjobinfo guest; done
3. Start the migration with
#  virsh migrate --live --p2p --tunnelled guest qemu+tcp://10.66.84.16/system --unsafe --verbose
When the migration starts to progress, on the target host, run
# kill -SIGSTOP `pidof libvirtd`

Check the libvirtd status: libvirtd is still running.

The hang can be reproduced on libvirt-0.9.10-21.el6_3.2.x86_64.rpm
Comment 36 errata-xmlrpc 2013-02-21 02:09:26 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
