Previously, migrating several guests at once in some cases failed with a "did not receive reply" error. This update improves the responsiveness of systemd when handling multiple guests at the same time, which prevents the described problem from occurring.
Description of problem:
Domains repeatedly fail to start in virDBusCall() during concurrent jobs
Version-Release number of selected component (if applicable):
dbus-1.6.12-8.el7.x86_64
kernel-3.10.0-201.el7.x86_64
libvirt-1.2.8-9.el7.x86_64
qemu-kvm-rhev-2.1.2-13.el7.x86_64
selinux-policy-3.13.1-9.el7.noarch
How reproducible:
100%
Steps to Reproduce:
<1> Scenario 1: Concurrently migrate 3 guests for 512 rounds over a TLS connection with --p2p
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/libvirtd.conf | grep "^[^#]"
listen_tls = 1
auth_tls = "none"
max_anonymous_clients = 20000
max_workers = 20000
max_requests = 20000
max_client_requests = 5000
log_level = 1
log_filters="1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/qemu.conf | grep "^[^#]"
max_processes = 1000
max_files = 10000
[root@dhcp-66-83-39 433881]# sh mig.sh test 10.66.83.39 10.66.83.41
Tue 9 Dec 11:05:19 CST 2014
Loop 1: Migrating test123 from 10.66.83.39 to 10.66.83.41
COMMAND: virsh -c qemu+tls://root@10.66.83.39/system migrate --live --p2p test123 qemu+tls://root@10.66.83.41/system
Migration: [100 %]
Migration: [ 76 %]
real 0m6.096s
user 0m0.033s
sys 0m0.016s
Migration: [100 %]
real 0m12.832s
user 0m0.036s
sys 0m0.023s
Migration: [100 %]
real 0m15.936s
user 0m0.044s
sys 0m0.019s
Loop 1: Migrating test back from 10.66.83.41 to 10.66.83.39
COMMAND: virsh -c qemu+tls://root@10.66.83.41/system migrate --live --p2p test123 qemu+tls://root@10.66.83.39/system
Migration: [100 %]error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
real 0m30.471s
user 0m0.058s
sys 0m0.036s
Migration: [100 %]
Migration: [100 %]
...
<2> Scenario 2: Libvirt event handler investigation; run lifecycle operations for several guests concurrently
[root@ibm-x3650m3-07 433879]# sh multi_life.sh &
...
Domain r49 started
error: Failed to start domain r64
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Domain win7 started
Domain win2012 started
error: Failed to destroy domain r64
error: Requested operation is not valid: domain is not running
error: Failed to start domain win2008r2
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Actual results:
As shown in the steps above, the issue can still be reproduced even with the SELinux policy set to Permissive.
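For completeness, a minimal sketch of switching SELinux to Permissive before rerunning the reproducers (the exact commands are not shown in the original report):

# Switch SELinux to Permissive for the current boot and confirm it,
# then rerun mig.sh / multi_life.sh; the CreateMachine timeout still occurs.
setenforce 0
getenforce        # should print "Permissive"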
Logs:
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollCalculateTimeout:370 : Timeout at 1418051665733 due in 429 ms
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollRunOnce:642 : EVENT_POLL_RUN: nhandles=15 timeout=429
2014-12-08 15:14:25.448+0000: 37358: error : virDBusCall:1429 : error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Expected results:
libvirt should handle concurrent CreateMachine requests gracefully instead of failing to start or restore domains.
Additional info:
Scripts for above scenarios:
<1>
[root@dhcp-66-83-39 433881]# cat mig.sh
#!/bin/bash
# Migrate a guest back and forth between two hosts, printing progress as it goes
GUEST=$1
HOST1=$2
HOST2=$3
OPTIONS="--live --p2p"
TRANSPORT="tls"
date
for i in `seq 1 20`;
do
echo "Loop ${i}: Migrating ${GUEST}123 from ${HOST1} to ${HOST2}"
echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST2}/system"
time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid1=$!
time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid2=$!
time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid3=$!
wait $pid1
wait $pid2
wait $pid3
echo "Loop ${i}: Migrating ${GUEST} back from ${HOST2} to ${HOST1}"
echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST1}/system"
time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid1=$!
time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid2=$!
time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid3=$!
wait $pid1
wait $pid2
wait $pid3
done
date
<2>
[root@ibm-x3650m3-07 433879]# cat template.sh
#!/bin/sh
Save_dir=/test/save-dir
Dump_dir=/test/dump-dir
XML_dir=/test/guest-xml
mkdir -p /test/save-dir
mkdir -p /test/dump-dir
mkdir -p /test/guest-xml
echo "this is the lifecycle oeprations for guest" $1
virsh dumpxml $1 > $XML_dir/$1.xml
for i in {1..10}
do
virsh start $1
virsh destroy $1
done
virsh start $1
for i in {1..50}
do
virsh suspend $1
virsh resume $1
done
for i in {1..50}
do
virsh save $1 $Save_dir/$1.save
virsh restore $Save_dir/$1.save
done
# dump [--live] [--crash] <domain> <file>
for i in {1..50}
do
virsh dump --crash $1 $Dump_dir/$1.dump
virsh start $1
done
sleep 30
virsh destroy $1
for i in {1..50}
do
virsh undefine $1
virsh define $XML_dir/$1.xml
done
[root@ibm-x3650m3-07 433879]# cat multi_life.sh
#! /bin/sh -
Guests=("r49" "r63" "r64" "r71" "r7" "win2008r2" "win2012" "win7" "win8")
for guest in "${Guests[@]}"
do
sh ./template.sh $guest &
done
Libvirt currently uses a timeout of 30 seconds for all DBus calls. If systemd is unable to reply to us in that time, I don't think there's much we can do. I mean, yeah, we can raise the timeout, but sooner or later you'll hit it again; just migrate more domains at once.
I've discussed this bug on the internal call and nobody seems to have any bright ideas. The timeout is already big, so I don't think raising it will help anything; therefore I'm closing this one. If the host is under such heavy load that a couple of mkdir() calls take systemd more than 30 seconds, you certainly don't want to run virtual machines there.
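For reference, a minimal sketch (not part of the original report) of how machined responsiveness could be probed while the reproduction scripts run. It assumes machinectl is available on the host; the idea is only that response times creeping towards libvirt's 30-second D-Bus timeout would line up with the CreateMachine failures reported above.

#!/bin/bash
# probe_machined.sh (hypothetical helper): repeatedly time a cheap
# machined query over D-Bus while mig.sh or multi_life.sh is running.
# Response times approaching 30 seconds would explain the
# "Did not receive a reply" errors libvirt reports from CreateMachine.
while true
do
    time machinectl list > /dev/null
    sleep 1
done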
Comment 4, Solganik Alexander, 2015-04-28 08:47:13 UTC
Hi, I encountered a similar issue, and investigation showed that the cause is the behavior of systemd-machined.
See the detailed description in the bug filed on the CentOS bugzilla:
https://bugs.centos.org/view.php?id=8564
To summarize, it seems that machined might miss handling D-Bus messages in some cases, which causes libvirt to fail to spawn the VM.
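A sketch (not in the original comment) of how the missed messages could be observed from the shell: dbus-monitor can watch CreateMachine traffic on the system bus while the failure is reproduced, assuming it is run as root (older dbus versions may additionally require eavesdrop='true' in the match rule). A method_call with no matching reply within about 30 seconds corresponds to the "Did not receive a reply" error above.

# Watch CreateMachine requests reaching systemd-machined on the system bus.
dbus-monitor --system \
    "type='method_call',interface='org.freedesktop.machine1.Manager',member='CreateMachine'"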
(In reply to Solganik Alexander from comment #4)
> Hi, I encountered a similar issue, and investigation showed that the cause
> is the behavior of systemd-machined.
> See the detailed description in the bug filed on the CentOS bugzilla:
> https://bugs.centos.org/view.php?id=8564
> To summarize, it seems that machined might miss handling D-Bus messages in
> some cases, which causes libvirt to fail to spawn the VM.
This confirms my initial suspicion that libvirt is innocent. Switching the component over to systemd.
Sorry, please ignore my comment 9.
After using your repo, I installed the systemd versions below; is that right?
[root@localhost rpmbuild]# rpm -qa| grep systemd
systemd-devel-219-3.el7.centos.1.x86_64
systemd-sysv-219-3.el7.centos.1.x86_64
systemd-python-219-3.el7.centos.1.x86_64
systemd-debuginfo-208-18.el7.x86_64
systemd-libs-219-3.el7.centos.1.x86_64
systemd-219-3.el7.centos.1.x86_64
If the version is right, I'll do some testing next week, thanks.
When I use the version you mentioned, the issue disappears, so I think the patch works.
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-219-3.el7.centos.1.x86_64
libvirt-1.2.15-2.el7.x86_64
I can still reproduce it with the older versions below:
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-208-20.el7.x86_64
libvirt-1.2.15-2.el7.x86_64
...
Domain test saved to /test/save-dir/test.save
error: Failed to restore domain from /test/save-dir/usb.save
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
...
I think you can move on to the next step.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://rhn.redhat.com/errata/RHBA-2015-2092.html