Red Hat Bugzilla – Bug 1172387
Failed to start domain continually due to virDBusCall() during concurrent jobs
Last modified: 2016-05-03 21:30:36 EDT
Description of problem:
Failed to start domain continually due to virDBusCall() during concurrent jobs

Version-Release number of selected component (if applicable):
dbus-1.6.12-8.el7.x86_64
kernel-3.10.0-201.el7.x86_64
libvirt-1.2.8-9.el7.x86_64
qemu-kvm-rhev-2.1.2-13.el7.x86_64
selinux-policy-3.13.1-9.el7.noarch

How reproducible:
100%

Steps to Reproduce:

<1> Scenario 1: concurrently migrate 3 guests for 512 rounds over a TLS connection with --p2p

[root@dhcp-66-83-39 433881]# cat /etc/libvirt/libvirtd.conf | grep "^[^#]"
listen_tls = 1
auth_tls = "none"
max_anonymous_clients = 20000
max_workers = 20000
max_requests = 20000
max_client_requests = 5000
log_level = 1
log_filters="1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

[root@dhcp-66-83-39 433881]# cat /etc/libvirt/qemu.conf | grep "^[^#]"
max_processes = 1000
max_files = 10000

[root@dhcp-66-83-39 433881]# sh mig.sh test 10.66.83.39 10.66.83.41
Tue 9 Dec 11:05:19 CST 2014
Loop 1: Migrating test123 from 10.66.83.39 to 10.66.83.41
COMMAND: virsh -c qemu+tls://root@10.66.83.39/system migrate --live --p2p test123 qemu+tls://root@10.66.83.41/system
Migration: [100 %]
Migration: [ 76 %]
real 0m6.096s
user 0m0.033s
sys 0m0.016s
Migration: [100 %]
real 0m12.832s
user 0m0.036s
sys 0m0.023s
Migration: [100 %]
real 0m15.936s
user 0m0.044s
sys 0m0.019s
Loop 1: Migrating test back from 10.66.83.41 to 10.66.83.39
COMMAND: virsh -c qemu+tls://root@10.66.83.41/system migrate --live --p2p test123 qemu+tls://root@10.66.83.39/system
Migration: [100 %]error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
real 0m30.471s
user 0m0.058s
sys 0m0.036s
Migration: [100 %]
Migration: [100 %]
...

<2> Scenario 2: exercise libvirt's event handling by running lifecycle operations for several guests concurrently

[root@ibm-x3650m3-07 433879]# sh multi_life.sh &
...
Domain r49 started
error: Failed to start domain r64
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Domain win7 started
Domain win2012 started
error: Failed to destroy domain r64
error: Requested operation is not valid: domain is not running
error: Failed to start domain win2008r2
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Actual results:
As shown in the steps above. The failure is still reproducible even with the SELinux policy set to Permissive.

Logs:
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollCalculateTimeout:370 : Timeout at 1418051665733 due in 429 ms
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollRunOnce:642 : EVENT_POLL_RUN: nhandles=15 timeout=429
2014-12-08 15:14:25.448+0000: 37358: error : virDBusCall:1429 : error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Expected results:
libvirt should handle this situation more gracefully.
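One optional debugging step (not part of the original report, just a suggestion): while mig.sh or multi_life.sh is running, the CreateMachine traffic on the system bus can be watched directly; the interface and member names below are the standard org.freedesktop.machine1 ones.

# Watch method calls to systemd-machined's CreateMachine while the reproducer runs
dbus-monitor --system "type='method_call',interface='org.freedesktop.machine1.Manager',member='CreateMachine'"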
Additional info:

Scripts for the above scenarios:

<1>
[root@dhcp-66-83-39 433881]# cat mig.sh
#!/bin/bash
# Migrate a guest back and forth between two hosts, printing progress as it goes
GUEST=$1
HOST1=$2
HOST2=$3
OPTIONS="--live --p2p"
TRANSPORT="tls"
date
for i in `seq 1 20`; do
    echo "Loop ${i}: Migrating ${GUEST}123 from ${HOST1} to ${HOST2}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST2}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST2}/system --verbose &
    pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST2}/system --verbose &
    pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST2}/system --verbose &
    pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3
    echo "Loop ${i}: Migrating ${GUEST} back from ${HOST2} to ${HOST1}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST1}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST1}/system --verbose &
    pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST1}/system --verbose &
    pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST1}/system --verbose &
    pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3
done
date

<2>
[root@ibm-x3650m3-07 433879]# cat template.sh
#!/bin/sh
Save_dir=/test/save-dir
Dump_dir=/test/dump-dir
XML_dir=/test/guest-xml
mkdir -p /test/save-dir
mkdir -p /test/dump-dir
mkdir -p /test/guest-xml
echo "this is the lifecycle operations for guest" $1
virsh dumpxml $1 > $XML_dir/$1.xml
for i in {1..10}
do
    virsh start $1
    virsh destroy $1
done
virsh start $1
for i in {1..50}
do
    virsh suspend $1
    virsh resume $1
done
for i in {1..50}
do
    virsh save $1 $Save_dir/$1.save
    virsh restore $Save_dir/$1.save
done
# dump [--live] [--crash] <domain> <file>
for i in {1..50}
do
    virsh dump --crash $1 $Dump_dir/$1.dump
    virsh start $1
done
sleep 30
virsh destroy $1
for i in {1..50}
do
    virsh undefine $1
    virsh define $XML_dir/$1.xml
done

[root@ibm-x3650m3-07 433879]# cat multi_life.sh
#!/bin/sh -
Guests=("r49" "r63" "r64" "r71" "r7" "win2008r2" "win2012" "win7" "win8")
for guest in "${Guests[@]}"
do
    sh ./template.sh $guest &
done
Libvirt currently uses a timeout of 30 seconds for all DBus calls. If systemd is unable to reply within that time, I don't think there's much we can do. Yes, we could raise the timeout, but sooner or later you'd hit it again; just migrate more domains at once.
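To put that timeout in perspective, a manual call to machined with the same 30-second reply timeout can show whether the bus itself is responsive under load. This is only an illustrative probe, not something from the original report; ListMachines is a standard org.freedesktop.machine1.Manager method.

# Probe systemd-machined with the same 30 s timeout libvirt applies to its D-Bus calls
time dbus-send --system --print-reply --reply-timeout=30000 \
    --dest=org.freedesktop.machine1 \
    /org/freedesktop/machine1 \
    org.freedesktop.machine1.Manager.ListMachines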
I've discussed this bug on the internal call and nobody seems to have any bright idea. The timeout is already big, so I don't think raising it would help anything. Therefore I'm closing this one. If the host is under such heavy load that a couple of mkdir()s take systemd more than 30 seconds, you certainly don't want to run virtual machines there.
Hi, I encountered a similar issue, and investigation showed that the cause is the behavior of systemd-machined. See the detailed description in the bug filed on the CentOS bugzilla: https://bugs.centos.org/view.php?id=8564 To summarize, it seems that machined can miss handling D-Bus messages in some cases, which causes libvirt to fail to spawn the VM.
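For anyone reproducing this, the state of machined itself can be checked while the failure happens; these are standard systemd commands, suggested here rather than taken from the original investigation.

# Inspect systemd-machined while the reproducer is running
machinectl list                             # machines machined currently knows about
systemctl status systemd-machined.service   # is the service alive or restarting?
journalctl -u systemd-machined -b           # any errors while handling CreateMachine?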
(In reply to Solganik Alexander from comment #4)
> Hi, I encountered a similar issue, and investigation showed that the cause
> is the behavior of systemd-machined. See the detailed description in the
> bug filed on the CentOS bugzilla: https://bugs.centos.org/view.php?id=8564
> To summarize, it seems that machined can miss handling D-Bus messages in
> some cases, which causes libvirt to fail to spawn the VM.

This confirms my initial suspicion that libvirt's innocent. Switching over to systemd.
Would you mind sending that patch to the systemd-devel mailing list, unless you have already done so?
sent, http://lists.freedesktop.org/archives/systemd-devel/2015-April/031343.html
Can you please test https://copr.fedoraproject.org/coprs/lnykryn/systemd/ ? This should be in 7.2.
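A rough sketch of how the copr build could be pulled in for testing; the .repo path below assumes the standard copr repository layout for the lnykryn/systemd project and should be double-checked against the copr page above.

# Enable the copr repo (path assumed, verify on the copr page) and update systemd
yum-config-manager --add-repo \
    https://copr.fedoraproject.org/coprs/lnykryn/systemd/repo/epel-7/lnykryn-systemd-epel-7.repo
yum clean metadata
yum update systemd
rpm -q systemd    # should now report a 219-based build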
Do you mean the src rpm below from your link? I cannot find the related patch (mentioned in comment 7) in that rpm. Could you give me a scratch build? https://people.redhat.com/lnykryn/systemd/systemd-219-3.el7.1.src.rpm
Sorry, please ignore my comment 9. After using your repo, I installed the systemd version below, right?

[root@localhost rpmbuild]# rpm -qa | grep systemd
systemd-devel-219-3.el7.centos.1.x86_64
systemd-sysv-219-3.el7.centos.1.x86_64
systemd-python-219-3.el7.centos.1.x86_64
systemd-debuginfo-208-18.el7.x86_64
systemd-libs-219-3.el7.centos.1.x86_64
systemd-219-3.el7.centos.1.x86_64

If the version is right, I'll do some testing next week, thanks.
Yep, that patch is not there; the whole loop was rewritten.
Were you able to test the rebased version?
If you give me a candidate version or link, I can help to test it. Thanks.
Have you tried the version from my copr repo? https://copr.fedoraproject.org/coprs/lnykryn/systemd/
With the version you mentioned, the issue no longer appears, so I think the patch works.

[root@localhost reproduce]# rpm -q systemd libvirt
systemd-219-3.el7.centos.1.x86_64
libvirt-1.2.15-2.el7.x86_64

I can still reproduce it on the old version below:

[root@localhost reproduce]# rpm -q systemd libvirt
systemd-208-20.el7.x86_64
libvirt-1.2.15-2.el7.x86_64
...
Domain test saved to /test/save-dir/test.save
error: Failed to restore domain from /test/save-dir/usb.save
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
...

I think you can move on to the next step.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2092.html