Summary: libvirtd may hang during tunneled migration

Product: Red Hat Enterprise Linux 6
Component: libvirt
Version: 6.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: hongming <honzhang>
Assignee: Jiri Denemark <jdenemar>
QA Contact: Virtualization Bugs <virt-bugs>
CC: acathrow, ajia, dallan, dyasny, dyuan, gsun, honzhang, jdenemar, jpallich, mzhan, rwu, veillard, weizhan
Target Milestone: rc
Keywords: ZStream
Fixed In Version: libvirt-0.10.0-0rc0.el6
Doc Type: Bug Fix
Doc Text: Previously, repeatedly migrating a guest between two machines using tunnelled migration could cause the libvirt daemon to lock up unexpectedly. The bug in the code for locking remote drivers has been fixed, and repeated tunnelled migrations of domains now work as expected.
Last Closed: 2013-02-21 07:09:26 UTC
Bug Depends On: 807907
Bug Blocks: 840699, 847946

Description by hongming, 2012-03-29 10:09:29 UTC
Created attachment 573615 [details]
virsh debug log
Version-Release number of selected component (if applicable):
libvirt-0.9.10-9.el6.x86_64
qemu-kvm-0.12.1.2-2.267.el6.x86_64
kernel-2.6.32-250.el6.x86_64

Created attachment 573621 [details]
script
Created attachment 573622 [details]
mig-script
Could you please attach debug libvirtd logs from both hosts involved in the migration?

Created attachment 573890 [details]
gdb log
Created attachment 573891 [details]
gdb log
Both libvirtd daemons hang after the migrations run for some loops. Please see the attached gdb logs. Thanks.

BTW, log files can generally be compressed very well; these logs could be compressed to less than 13.5 MB from their original size of about 0.5 GB, and downloading them would take much less than an hour.

Hmm, the gdb logs would be much more useful if they contained backtraces of all threads (thread apply all backtrace). Unfortunately, even though the logs are big, they don't contain the important parts. Could you try to reproduce again with the following settings in /etc/libvirt/libvirtd.conf (this should result in smaller debug logs):

log_level = 3
log_filters="1:conf 1:libvirt 1:qemu"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

and also get the full backtrace of the hung libvirtd. Please do not cut anything from the logs or the gdb backtrace, and make sure the backtrace corresponds to the libvirtd processes that generated the logs.

The logs show that the keepalive protocol closed the connection to the destination host, and it seems tunneled migration does not handle such a situation properly. The reason for "Timed out during operation: cannot acquire state change lock" is that the source libvirtd is waiting for a reply to the query-migrate monitor command, but I believe this is just a consequence. This is similar to bug 807907 except for the forgotten job issue resulting in "cannot acquire state change lock", which did not happen in 807907, and I don't understand what happened there yet. Could you retest this once the patches for bug 807907 are built into a next libvirt package?

I tested bidirectional migration on snapshot 2 with the following command on both sides:

# for j in {1..500}; do for i in `virsh list | awk 'NR>2{print $2}'`; do virsh migrate-setspeed $i 10000000; virsh migrate --live --p2p --tunnelled $i qemu+tls://10.66.83.191/system --unsafe; done; done

which causes libvirtd to hang on both sides. The guests I am using are 16 different kinds of guests, including Linux and Windows. Is that a different bug?

Created attachment 582896 [details]
libvirtd hang log on one side
Created attachment 582897 [details]
libvirtd hang log on the other side
@hongming: It looks like you hit bug 728603. That is, the source closed the connection because of a keepalive timeout, but the destination did not properly cancel the incoming migration. When your script then tries to migrate the domain back from the destination, the incoming migration is still believed to be ongoing and thus you see "cannot acquire state change lock". Now, the interesting part is why you still see connections being closed because of keepalive timeouts. Could you change the log_level and log_filters options on both sides to

log_level = 1
log_filters = "3:util/json"

and try to reproduce again? The produced logs are going to be much larger, though.

@weizhang: Probably not. Anyway, could you provide backtraces of both libvirt daemons that hung?

Now I can't get the "cannot acquire state change lock" error after running the case many times. The libvirtd on both sides always hangs during migration. The error never occurs before libvirtd hangs.

Versions:
libvirt-0.9.10-16.el6.x86_64 and libvirt-0.9.10-18.el6.x86_64
qemu-kvm-0.12.1.2-2.292.el6.x86_64
kernel-2.6.32-269.el6.x86_64

Created attachment 585337 [details]
backtrace of libvirtd hang on one side
Created attachment 585338 [details]
backtrace of libvirtd hang on other side
The hang issues reported in earlier comments were partially caused by bug 728603 and an incorrect patch for bug 807907. However, the backtraces in comments 28 and 29 also show a real bug in locking in the remote driver. Stream APIs in the remote driver did not properly unlock the remote driver before entering the client event loop, which may result in a hang when the destination libvirtd is stuck (a simplified sketch of this locking pattern is appended at the end of this report). The hang in those comments was largely influenced by the incorrect fix for bug 807907, which could make it hard to reproduce with current libvirt using the same steps. However, the hang can be easily reproduced with the following steps:

- make sure keepalive is turned off in the qemu driver on the source host and restart libvirtd on the source after turning it off (see /etc/libvirt/qemu.conf)
- start a tunneled migration
- run "virsh domjobinfo" in a loop in a separate terminal
- SIGSTOP the destination daemon once migration data starts flowing
- once send buffers get full on the source host, the virsh domjobinfo loop should just hang; with this bug fixed, the loop will keep running, but it won't show any change in the amount of transferred data

This bug is now fixed upstream by v0.9.13-73-g17f3be0:

commit 17f3be079c3c421eff203fcd311b0357ec42d801
Author: Jiri Denemark <jdenemar@redhat.com>
Date:   Tue Jul 17 16:36:23 2012 +0200

    remote: Fix locking in stream APIs

    Remote driver needs to make sure the driver lock is released before
    entering client IO loop as that may block indefinitely in poll().
    As a direct consequence of not following this in stream APIs, tunneled
    migration to a destination host which becomes non-responding may block
    qemu driver. Luckily, if keepalive is turned on for p2p migrations,
    both remote and qemu drivers will get automagically unblocked after
    keepalive timeout.

Verification passed on:
kernel-2.6.32-289.el6.x86_64
qemu-kvm-0.12.1.2-2.302.el6.x86_64
libvirt-0.10.0-0rc0.el6.x86_64

Steps
1. On the source, add "keepalive_interval = -1" in /etc/libvirt/qemu.conf and restart libvirtd.
2. Open another console and do
   # while true; do virsh domjobinfo guest; done
3. Start the migration with
   # virsh migrate --live --p2p --tunnelled guest qemu+tcp://10.66.84.16/system --unsafe --verbose
   When the migration starts to progress, on the target host, do
   # kill -SIGSTOP `pidof libvirtd`
   then check the libvirtd status on the source: libvirtd is still running.

Can be reproduced on libvirt-0.9.10-21.el6_3.2.x86_64.rpm.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0276.html
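
For readers unfamiliar with the locking pattern the commit message describes, below is a minimal, hypothetical C sketch of the broken and the fixed behavior. All names here (remoteDriver, streamSendBroken, streamSendFixed) are invented for illustration and are not libvirt's actual code; the real fix lives in the stream APIs of libvirt's remote driver.

/* Hypothetical sketch of the locking bug described above.
 * Compile-only example: cc -c sketch.c */
#include <pthread.h>
#include <poll.h>

typedef struct {
    pthread_mutex_t lock;   /* one lock serializing all driver APIs */
    int fd;                 /* socket to the remote libvirtd */
} remoteDriver;

/* Broken pattern: the driver lock is held across a blocking poll().
 * If the peer stops reading (e.g. it was SIGSTOPped) and the send
 * buffers fill up, poll() never returns, and every other API that
 * needs drv->lock, such as the one behind "virsh domjobinfo", hangs. */
int streamSendBroken(remoteDriver *drv)
{
    struct pollfd pfd = { .fd = drv->fd, .events = POLLOUT };

    pthread_mutex_lock(&drv->lock);
    poll(&pfd, 1, -1);              /* may block forever, lock held */
    /* ... write() the stream data here ... */
    pthread_mutex_unlock(&drv->lock);
    return 0;
}

/* Fixed pattern (what the commit above enforces for stream APIs):
 * release the driver lock before entering the client IO loop and
 * re-acquire it afterwards, so concurrent APIs keep working even
 * while this thread is blocked in poll(). */
int streamSendFixed(remoteDriver *drv)
{
    struct pollfd pfd = { .fd = drv->fd, .events = POLLOUT };

    pthread_mutex_lock(&drv->lock);
    /* ... set up the call while driver state is protected ... */
    pthread_mutex_unlock(&drv->lock);

    poll(&pfd, 1, -1);              /* may block, but no lock is held */
    /* ... write() the stream data here ... */

    pthread_mutex_lock(&drv->lock);
    /* ... record the result in driver state ... */
    pthread_mutex_unlock(&drv->lock);
    return 0;
}

The essential design point is that a per-driver mutex must never be held across a call that can block on the network; otherwise a stuck peer turns into a stuck daemon.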