Bug 1172387

Summary: Failed to start domain continually due to virDBusCall() during concurrent jobs
Product: Red Hat Enterprise Linux 7
Reporter: Hu Jianwei <jiahu>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Frantisek Sumsal <fsumsal>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.1
CC: dyuan, ebarrera, fsumsal, honzhang, jherrman, lnykryn, lyarwood, mzhan, pdhange, psklenar, rbalakri, shyu, solganik, systemd-maint-list
Target Milestone: rc
Keywords: Reopened, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: systemd-219-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, migrating several guests at once in some cases failed with a "did not receive reply" error. This update improves the responsiveness of systemd when handling multiple guests at the same time, which prevents the described problem from occurring.
Story Points: ---
Clone Of:
Clones: 1243401 (view as bug list)
Environment:
Last Closed: 2015-11-19 15:02:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1243401

Description Hu Jianwei 2014-12-10 01:02:58 UTC
Description of problem:
Failed to start domain continually due to virDBusCall() during concurrent jobs

Version-Release number of selected component (if applicable):
dbus-1.6.12-8.el7.x86_64
kernel-3.10.0-201.el7.x86_64
libvirt-1.2.8-9.el7.x86_64
qemu-kvm-rhev-2.1.2-13.el7.x86_64
selinux-policy-3.13.1-9.el7.noarch

How reproducible:
100%

Steps to Reproduce:
<1> Scenario 1: Concurrently migrate 3 guests for 512 rounds over a TLS connection with --p2p
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/libvirtd.conf | grep "^[^#]"
listen_tls = 1
auth_tls = "none"
max_anonymous_clients = 20000
max_workers = 20000
max_requests = 20000
max_client_requests = 5000
log_level = 1
log_filters="1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/qemu.conf | grep "^[^#]"
max_processes = 1000
max_files = 10000

[root@dhcp-66-83-39 433881]# sh mig.sh test 10.66.83.39 10.66.83.41
Tue  9 Dec 11:05:19 CST 2014
Loop 1: Migrating test123 from 10.66.83.39 to 10.66.83.41
COMMAND: virsh -c qemu+tls://root@10.66.83.39/system migrate --live --p2p test123 qemu+tls://root@10.66.83.41/system
Migration: [100 %]
Migration: [ 76 %]
real        0m6.096s
user        0m0.033s
sys        0m0.016s
Migration: [100 %]

real        0m12.832s
user        0m0.036s
sys        0m0.023s
Migration: [100 %]

real        0m15.936s
user        0m0.044s
sys        0m0.019s
Loop 1: Migrating test back from 10.66.83.41 to 10.66.83.39
COMMAND: virsh -c qemu+tls://root@10.66.83.41/system migrate --live --p2p test123 qemu+tls://root@10.66.83.39/system
Migration: [100 %]error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


real        0m30.471s
user        0m0.058s
sys        0m0.036s
Migration: [100 %]
Migration: [100 %]

...

<2> Scenario 2: Run lifecycle operations on multiple guests concurrently

[root@ibm-x3650m3-07 433879]# sh multi_life.sh &
...

Domain r49 started



error: Failed to start domain r64
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Domain win7 started

Domain win2012 started

error: Failed to destroy domain r64
error: Requested operation is not valid: domain is not running
error: Failed to start domain win2008r2
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


Actual results:
As shown in the steps above, the failure still reproduces even with the SELinux policy set to Permissive.

Logs:
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollCalculateTimeout:370 : Timeout at 1418051665733 due in 429 ms
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollRunOnce:642 : EVENT_POLL_RUN: nhandles=15 timeout=429
2014-12-08 15:14:25.448+0000: 37358: error : virDBusCall:1429 : error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Expected results:
libvirt should handle this more gracefully; concurrent jobs should not fail with a D-Bus timeout.

Additional info:
Scripts for above scenarios:
<1>
[root@dhcp-66-83-39 433881]# cat mig.sh
#!/bin/bash

# Migrate a guest back and forth between two hosts, printing progress as it goes
GUEST=$1
HOST1=$2
HOST2=$3
OPTIONS="--live --p2p"
TRANSPORT="tls"


date
for i in `seq 1 20`;
do
    echo "Loop ${i}: Migrating ${GUEST}123 from ${HOST1} to ${HOST2}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST2}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST2}/system --verbose & pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST2}/system --verbose & pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST2}/system --verbose & pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3

    echo "Loop ${i}: Migrating ${GUEST} back from ${HOST2} to ${HOST1}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST1}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST1}/system --verbose & pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST1}/system --verbose & pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST1}/system --verbose & pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3


done
date

<2>
[root@ibm-x3650m3-07 433879]# cat template.sh
#!/bin/sh

Save_dir=/test/save-dir
Dump_dir=/test/dump-dir
XML_dir=/test/guest-xml

mkdir -p /test/save-dir
mkdir -p /test/dump-dir
mkdir -p /test/guest-xml

echo "lifecycle operations for guest $1"

virsh dumpxml $1 > $XML_dir/$1.xml
for i in {1..10}
do
virsh start $1
virsh destroy $1
done

virsh start $1
for i in {1..50}
do
virsh suspend $1
virsh resume $1
done

for i in {1..50}
do
virsh save $1 $Save_dir/$1.save
virsh restore $Save_dir/$1.save
done

# dump [--live] [--crash] <domain> <file>
for i in {1..50}
do
virsh dump --crash $1 $Dump_dir/$1.dump
virsh start $1
done

sleep 30

virsh destroy $1

for i in {1..50}
do
virsh undefine $1
virsh define $XML_dir/$1.xml
done

[root@ibm-x3650m3-07 433879]# cat multi_life.sh
#!/bin/bash
# Note: the array below requires bash; a strictly POSIX sh would fail here.
Guests=("r49" "r63" "r64" "r71" "r7" "win2008r2" "win2012" "win7" "win8")

for guest in "${Guests[@]}"
do
sh ./template.sh $guest &
done

Comment 2 Michal Privoznik 2014-12-17 14:52:48 UTC
Libvirt currently applies a 30-second timeout to all D-Bus calls. If systemd is unable to reply within that time, I don't think there's much we can do. I mean, yeah, we could raise the timeout, but sooner or later you'd hit it again; just migrate more domains at once.
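The failure mode can be illustrated with a small, self-contained sketch (illustrative only, not libvirt's actual code; the 0.2 s timeout stands in for libvirt's real 30 s value): a caller waits a bounded time for a reply and gives up with a "did not receive a reply" style error when the service stalls.

```python
# Illustrative sketch of a bounded-wait call like virDBusCall()'s timeout.
# The "service" is simulated by a worker thread that may reply too slowly.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout
import time

CALL_TIMEOUT = 0.2  # libvirt's real D-Bus call timeout is 30 seconds


def slow_service(delay):
    """Simulated remote service that takes `delay` seconds to answer."""
    time.sleep(delay)
    return "reply"


def dbus_style_call(delay):
    """Wait at most CALL_TIMEOUT for the reply, then give up."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_service, delay)
        try:
            return future.result(timeout=CALL_TIMEOUT)
        except CallTimeout:
            return "error: did not receive a reply"


print(dbus_style_call(0.01))  # fast service -> "reply"
print(dbus_style_call(0.5))   # stalled service -> "error: did not receive a reply"
```

Under load, systemd-machined can take longer than the timeout to answer CreateMachine, which is exactly the second case above.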

Comment 3 Michal Privoznik 2014-12-18 08:09:52 UTC
I've discussed this bug on the internal call and nobody seems to have any bright idea. The timeout is big already, so I don't think raising it will help. Therefore I'm closing this one. If the host is under such heavy load that a couple of mkdir() calls take systemd more than 30 seconds, you certainly don't want to run virtual machines there.

Comment 4 Solganik Alexander 2015-04-28 08:47:13 UTC
Hi, I encountered a similar issue, and investigation showed that the cause is the behavior of systemd-machined.
See the detailed description in the bug filed on the CentOS bugzilla:
https://bugs.centos.org/view.php?id=8564
To summarize, it seems that machined can miss handling D-Bus messages in some cases, which causes libvirt to fail to spawn VMs.

Comment 5 Michal Privoznik 2015-04-28 09:21:51 UTC
(In reply to Solganik Alexander from comment #4)
> Hi, I had encountered a similar issue and investigation showed that the
> reason for the issue is the behavior of systemd-machined.
> See the detailed description and the bug itself filed on Centos bugzilla
> https://bugs.centos.org/view.php?id=8564
> To summarize, seems that machined might miss handling dbus messages in some
> cases which causes libvirt to fail spawning vm.

This confirms my initial suspicion that libvirt is innocent. Switching over to systemd.

Comment 6 Lukáš Nykrýn 2015-04-28 10:02:17 UTC
Would you mind sending that patch to systemd-devel mailinglist, unless you have already done so?

Comment 8 Lukáš Nykrýn 2015-05-29 10:56:47 UTC
Can you please test https://copr.fedoraproject.org/coprs/lnykryn/systemd/ ? This should be in 7.2.

Comment 9 Hu Jianwei 2015-05-29 14:32:07 UTC
You mean the src rpm below from your link? I cannot find the related patch (mentioned in comment 7) in that rpm. Could you give me a scratch build?

https://people.redhat.com/lnykryn/systemd/systemd-219-3.el7.1.src.rpm

Comment 10 Hu Jianwei 2015-05-29 14:45:17 UTC
Sorry, please ignore my comment 9.

After enabling your repo, I installed the systemd version below; is that right?

[root@localhost rpmbuild]# rpm -qa| grep systemd
systemd-devel-219-3.el7.centos.1.x86_64
systemd-sysv-219-3.el7.centos.1.x86_64
systemd-python-219-3.el7.centos.1.x86_64
systemd-debuginfo-208-18.el7.x86_64
systemd-libs-219-3.el7.centos.1.x86_64
systemd-219-3.el7.centos.1.x86_64

If the version is right, I'll do some testing next week, thanks.

Comment 11 Lukáš Nykrýn 2015-05-29 14:56:50 UTC
Yep, that patch is not there; the whole loop was rewritten.
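For context, the rewrite pattern can be sketched generically (an assumed illustration, not systemd's actual code): an event loop that handles only one queued message per wakeup falls behind under a burst, while one that drains the queue on each wakeup clears the backlog before callers time out.

```python
# Illustrative comparison of two dispatch strategies for a message queue.
from collections import deque


def one_per_wakeup(queue, wakeups):
    """Handle at most one queued message per loop iteration."""
    handled = 0
    for _ in range(wakeups):
        if queue:
            queue.popleft()
            handled += 1
    return handled


def drain_per_wakeup(queue, wakeups):
    """Drain the whole queue on each loop iteration."""
    handled = 0
    for _ in range(wakeups):
        while queue:
            queue.popleft()
            handled += 1
    return handled


burst = list(range(100))  # 100 messages arrive at once
print(one_per_wakeup(deque(burst), 10))    # 10: 90 messages still waiting
print(drain_per_wakeup(deque(burst), 10))  # 100: backlog cleared
```

With the first strategy, a burst of concurrent CreateMachine requests leaves most callers waiting long enough to hit their 30 s timeout; the second keeps the reply latency bounded by the work itself.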

Comment 12 Lukáš Nykrýn 2015-06-08 12:03:56 UTC
Were you able to test the rebased version?

Comment 13 Hu Jianwei 2015-06-09 01:51:26 UTC
If you give me a candidate version or link, I can help to test it.

Thanks.

Comment 14 Lukáš Nykrýn 2015-06-09 09:59:29 UTC
Have you tried the version from my copr repo?

https://copr.fedoraproject.org/coprs/lnykryn/systemd/

Comment 15 Hu Jianwei 2015-06-12 01:15:39 UTC
With the version you mentioned, the issue disappeared; I think the patch works.
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-219-3.el7.centos.1.x86_64
libvirt-1.2.15-2.el7.x86_64

I can still reproduce it on the old version below:
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-208-20.el7.x86_64
libvirt-1.2.15-2.el7.x86_64

...
Domain test saved to /test/save-dir/test.save

error: Failed to restore domain from /test/save-dir/usb.save
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

...

I think you can move on to the next step.

Comment 22 errata-xmlrpc 2015-11-19 15:02:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2092.html