Bug 1172387 - Failed to start domain continually due to virDBusCall() during concurrent jobs
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: systemd-maint
QA Contact: Frantisek Sumsal
Keywords: Reopened, ZStream
Depends On:
Blocks: 1243401
Reported: 2014-12-09 20:02 EST by Hu Jianwei
Modified: 2016-05-03 21:30 EDT
CC List: 14 users

See Also:
Fixed In Version: systemd-219-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, migrating several guests at once in some cases failed with a "did not receive reply" error. This update improves the responsiveness of systemd when handling multiple guests at the same time, which prevents the described problem from occurring.
Story Points: ---
Clone Of:
Clones: 1243401
Environment:
Last Closed: 2015-11-19 10:02:17 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID: Red Hat Product Errata RHBA-2015:2092
Priority: normal
Status: SHIPPED_LIVE
Summary: systemd bug fix and enhancement update
Last Updated: 2015-11-19 07:13:57 EST

Description Hu Jianwei 2014-12-09 20:02:58 EST
Description of problem:
Starting domains repeatedly fails with an error from virDBusCall() when running concurrent jobs

Version-Release number of selected component (if applicable):
dbus-1.6.12-8.el7.x86_64
kernel-3.10.0-201.el7.x86_64
libvirt-1.2.8-9.el7.x86_64
qemu-kvm-rhev-2.1.2-13.el7.x86_64
selinux-policy-3.13.1-9.el7.noarch

How reproducible:
100%

Steps to Reproduce:
<1> Scenario 1: concurrently migrate 3 guests for 512 rounds over a TLS connection with --p2p
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/libvirtd.conf | grep "^[^#]"
listen_tls = 1
auth_tls = "none"
max_anonymous_clients = 20000
max_workers = 20000
max_requests = 20000
max_client_requests = 5000
log_level = 1
log_filters="1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/qemu.conf | grep "^[^#]"
max_processes = 1000
max_files = 10000

[root@dhcp-66-83-39 433881]# sh mig.sh test 10.66.83.39 10.66.83.41
Tue  9 Dec 11:05:19 CST 2014
Loop 1: Migrating test123 from 10.66.83.39 to 10.66.83.41
COMMAND: virsh -c qemu+tls://root@10.66.83.39/system migrate --live --p2p test123 qemu+tls://root@10.66.83.41/system
Migration: [100 %]
Migration: [ 76 %]
real        0m6.096s
user        0m0.033s
sys        0m0.016s
Migration: [100 %]

real        0m12.832s
user        0m0.036s
sys        0m0.023s
Migration: [100 %]

real        0m15.936s
user        0m0.044s
sys        0m0.019s
Loop 1: Migrating test back from 10.66.83.41 to 10.66.83.39
COMMAND: virsh -c qemu+tls://root@10.66.83.41/system migrate --live --p2p test123 qemu+tls://root@10.66.83.39/system
Migration: [100 %]error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


real        0m30.471s
user        0m0.058s
sys        0m0.036s
Migration: [100 %]
Migration: [100 %]

...

<2> Scenario 2: libvirt event handler investigation; run lifecycle operations for several guests concurrently

[root@ibm-x3650m3-07 433879]# sh multi_life.sh &
...

Domain r49 started



error: Failed to start domain r64
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Domain win7 started

Domain win2012 started

error: Failed to destroy domain r64
error: Requested operation is not valid: domain is not running
error: Failed to start domain win2008r2
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


Actual results:
As shown in the steps above, the failure is still reproducible even with the SELinux policy set to Permissive.

Logs:
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollCalculateTimeout:370 : Timeout at 1418051665733 due in 429 ms
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollRunOnce:642 : EVENT_POLL_RUN: nhandles=15 timeout=429
2014-12-08 15:14:25.448+0000: 37358: error : virDBusCall:1429 : error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Expected results:
libvirt should handle this more gracefully; concurrent jobs should not fail with the CreateMachine timeout error.

Additional info:
Scripts for above scenarios:
<1>
[root@dhcp-66-83-39 433881]# cat mig.sh
#!/bin/bash

# Migrate a guest back and forth between two hosts, printing progress as it goes
GUEST=$1
HOST1=$2
HOST2=$3
OPTIONS="--live --p2p"
TRANSPORT="tls"


date
for i in `seq 1 20`;
do
    echo "Loop ${i}: Migrating ${GUEST}123 from ${HOST1} to ${HOST2}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST2}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3

    echo "Loop ${i}: Migrating ${GUEST} back from ${HOST2} to ${HOST1}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST1}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3


done
date

<2>
[root@ibm-x3650m3-07 433879]# cat template.sh
#!/bin/sh

Save_dir=/test/save-dir
Dump_dir=/test/dump-dir
XML_dir=/test/guest-xml

mkdir -p /test/save-dir
mkdir -p /test/dump-dir
mkdir -p /test/guest-xml

echo "this is the lifecycle oeprations for guest" $1

virsh dumpxml $1 > $XML_dir/$1.xml
for i in {1..10}
do
virsh start $1
virsh destroy $1
done

virsh start $1
for i in {1..50}
do
virsh suspend $1
virsh resume $1
done

for i in {1..50}
do
virsh save $1 $Save_dir/$1.save
virsh restore $Save_dir/$1.save
done

# dump [--live] [--crash] <domain> <file>
for i in {1..50}
do
virsh dump --crash $1 $Dump_dir/$1.dump
virsh start $1
done

sleep 30

virsh destroy $1

for i in {1..50}
do
virsh undefine $1
virsh define $XML_dir/$1.xml
done

[root@ibm-x3650m3-07 433879]# cat multi_life.sh
#! /bin/sh -
Guests=("r49" "r63" "r64" "r71" "r7" "win2008r2" "win2012" "win7" "win8")

for guest in "${Guests[@]}"
do
sh ./template.sh $guest &
done
Comment 2 Michal Privoznik 2014-12-17 09:52:48 EST
Libvirt currently uses a timeout of 30 seconds for all D-Bus calls. If systemd is unable to reply to us in that time, I don't think there's much we can do. I mean, yeah, we can raise the timeout, but sooner or later you'll hit it again, just migrate more domains at once.
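
For reference, machined's responsiveness can be checked by hand with the tools from the dbus package listed above. This is only an illustrative sketch, not part of the original report: it uses the read-only ListMachines method instead of CreateMachine and applies the same 30-second limit (expressed in milliseconds) that libvirt uses:

# Ask systemd-machined for its machine list and fail if no reply arrives
# within 30 seconds, mirroring libvirt's D-Bus call timeout.
dbus-send --system --print-reply --reply-timeout=30000 \
    --dest=org.freedesktop.machine1 \
    /org/freedesktop/machine1 \
    org.freedesktop.machine1.Manager.ListMachines

If this call also stalls while the concurrent jobs are running, the delay is on the machined side rather than in libvirt.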
Comment 3 Michal Privoznik 2014-12-18 03:09:52 EST
I've discussed this bug on the internal call and nobody seems to have any bright idea. I mean, the timeout is big already, so I don't think raising it will help anything. Therefore I'm closing this one. If the host is under such heavy load that a couple of mkdir()s take systemd more than 30 seconds, you certainly don't want to run virtual machines there.
Comment 4 Solganik Alexander 2015-04-28 04:47:13 EDT
Hi, I encountered a similar issue, and the investigation showed that the cause is the behavior of systemd-machined.
See the detailed description in the bug filed on the CentOS bugzilla:
https://bugs.centos.org/view.php?id=8564
To summarize, it seems that machined can miss handling D-Bus messages in some cases, which causes libvirt to fail to spawn the VM.
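
One way to observe this while reproducing, added here as an illustrative sketch rather than part of the original investigation, is to watch the system-bus traffic to and from machined and check whether each CreateMachine method call gets a matching reply:

# Print every message sent to or received from systemd-machined; a
# CreateMachine method_call with no corresponding method_return points
# at the dropped/unanswered request described above.
dbus-monitor --system "destination='org.freedesktop.machine1'" "sender='org.freedesktop.machine1'"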
Comment 5 Michal Privoznik 2015-04-28 05:21:51 EDT
(In reply to Solganik Alexander from comment #4)
> Hi, I encountered a similar issue, and the investigation showed that the
> cause is the behavior of systemd-machined.
> See the detailed description in the bug filed on the CentOS bugzilla:
> https://bugs.centos.org/view.php?id=8564
> To summarize, it seems that machined can miss handling D-Bus messages in
> some cases, which causes libvirt to fail to spawn the VM.

This confirms my initial suspicion that libvirt's innocent. Switching over to systemd.
Comment 6 Lukáš Nykrýn 2015-04-28 06:02:17 EDT
Would you mind sending that patch to the systemd-devel mailing list, unless you have already done so?
Comment 8 Lukáš Nykrýn 2015-05-29 06:56:47 EDT
Can you please test https://copr.fedoraproject.org/coprs/lnykryn/systemd/ ? This should be in 7.2.
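
For anyone else reproducing this, the copr build is typically pulled in on RHEL/CentOS 7 roughly as follows. This is a sketch; the exact .repo URL is an assumption based on copr's usual epel-7 layout, so copy the real link from the project page above:

# Drop the copr-provided .repo file into yum's configuration and update systemd.
curl -o /etc/yum.repos.d/lnykryn-systemd.repo \
    https://copr.fedoraproject.org/coprs/lnykryn/systemd/repo/epel-7/lnykryn-systemd-epel-7.repo
yum clean metadata && yum update systemd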
Comment 9 Hu Jianwei 2015-05-29 10:32:07 EDT
You mean the src rpm below from your link? I cannot find the related patch (mentioned in comment 7) in that rpm. Could you give me a scratch build?

https://people.redhat.com/lnykryn/systemd/systemd-219-3.el7.1.src.rpm
Comment 10 Hu Jianwei 2015-05-29 10:45:17 EDT
Sorry, please ignore my comment 9.

After enabling your repo, I installed the systemd version below; is that right?

[root@localhost rpmbuild]# rpm -qa| grep systemd
systemd-devel-219-3.el7.centos.1.x86_64
systemd-sysv-219-3.el7.centos.1.x86_64
systemd-python-219-3.el7.centos.1.x86_64
systemd-debuginfo-208-18.el7.x86_64
systemd-libs-219-3.el7.centos.1.x86_64
systemd-219-3.el7.centos.1.x86_64

If the version is right, I'll do some testing next week, thanks.
Comment 11 Lukáš Nykrýn 2015-05-29 10:56:50 EDT
Yep, that patch is not there; the whole loop was rewritten.
Comment 12 Lukáš Nykrýn 2015-06-08 08:03:56 EDT
Were you able to test the rebased version?
Comment 13 Hu Jianwei 2015-06-08 21:51:26 EDT
If you give me a candidate version or link, I can help to test it.

Thanks.
Comment 14 Lukáš Nykrýn 2015-06-09 05:59:29 EDT
Have you tried the version from my copr repo?

https://copr.fedoraproject.org/coprs/lnykryn/systemd/
Comment 15 Hu Jianwei 2015-06-11 21:15:39 EDT
With the version you mentioned, the issue no longer occurs; I think the patch works.
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-219-3.el7.centos.1.x86_64
libvirt-1.2.15-2.el7.x86_64

I can still reproduce it with the old version below:
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-208-20.el7.x86_64
libvirt-1.2.15-2.el7.x86_64

...
Domain test saved to /test/save-dir/test.save

error: Failed to restore domain from /test/save-dir/usb.save
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

...

I think you can move on to the next step.
Comment 22 errata-xmlrpc 2015-11-19 10:02:17 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2092.html
