Bug 1172387 - Failed to start domain continually due to virDBusCall() during concurrent jobs
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: systemd-maint
QA Contact: Frantisek Sumsal
Keywords: Reopened, ZStream
Depends On:
Blocks: 1243401
Reported: 2014-12-09 20:02 EST by Hu Jianwei
Modified: 2016-05-03 21:30 EDT
CC List: 14 users

See Also:
Fixed In Version: systemd-219-1.el7
Doc Type: Bug Fix
Doc Text:
Previously, migrating several guests at once in some cases failed with a "did not receive reply" error. This update improves the responsiveness of systemd when handling multiple guests at the same time, which prevents the described problem from occurring.
Story Points: ---
Clone Of:
Clones: 1243401
Environment:
Last Closed: 2015-11-19 10:02:17 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID: Red Hat Product Errata RHBA-2015:2092
Priority: normal
Status: SHIPPED_LIVE
Summary: systemd bug fix and enhancement update
Last Updated: 2015-11-19 07:13:57 EST

Description Hu Jianwei 2014-12-09 20:02:58 EST
Description of problem:
Starting domains repeatedly fails with an error from virDBusCall() when running concurrent jobs

Version-Release number of selected component (if applicable):
dbus-1.6.12-8.el7.x86_64
kernel-3.10.0-201.el7.x86_64
libvirt-1.2.8-9.el7.x86_64
qemu-kvm-rhev-2.1.2-13.el7.x86_64
selinux-policy-3.13.1-9.el7.noarch

How reproducible:
100%

Steps to Reproduce:
<1> Scenario 1: concurrently migrate 3 guests for 512 rounds over a TLS connection with --p2p
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/libvirtd.conf | grep "^[^#]"
listen_tls = 1
auth_tls = "none"
max_anonymous_clients = 20000
max_workers = 20000
max_requests = 20000
max_client_requests = 5000
log_level = 1
log_filters="1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
[root@dhcp-66-83-39 433881]# cat /etc/libvirt/qemu.conf | grep "^[^#]"
max_processes = 1000
max_files = 10000

[root@dhcp-66-83-39 433881]# sh mig.sh test 10.66.83.39 10.66.83.41
Tue  9 Dec 11:05:19 CST 2014
Loop 1: Migrating test123 from 10.66.83.39 to 10.66.83.41
COMMAND: virsh -c qemu+tls://root@10.66.83.39/system migrate --live --p2p test123 qemu+tls://root@10.66.83.41/system
Migration: [100 %]
Migration: [ 76 %]
real        0m6.096s
user        0m0.033s
sys        0m0.016s
Migration: [100 %]

real        0m12.832s
user        0m0.036s
sys        0m0.023s
Migration: [100 %]

real        0m15.936s
user        0m0.044s
sys        0m0.019s
Loop 1: Migrating test back from 10.66.83.41 to 10.66.83.39
COMMAND: virsh -c qemu+tls://root@10.66.83.41/system migrate --live --p2p test123 qemu+tls://root@10.66.83.39/system
Migration: [100 %]error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


real        0m30.471s
user        0m0.058s
sys        0m0.036s
Migration: [100 %]
Migration: [100 %]

...

<2> Scenario 2: libvirt event handler investigation; run lifecycle operations for several guests concurrently

[root@ibm-x3650m3-07 433879]# sh multi_life.sh &
...

Domain r49 started



error: Failed to start domain r64
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Domain win7 started

Domain win2012 started

error: Failed to destroy domain r64
error: Requested operation is not valid: domain is not running
error: Failed to start domain win2008r2
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


Actual results:
As shown in the steps above, the failure is still reproducible even with the SELinux policy set to Permissive.

Logs:
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollCalculateTimeout:370 : Timeout at 1418051665733 due in 429 ms
2014-12-08 15:14:25.304+0000: 37356: debug : virEventPollRunOnce:642 : EVENT_POLL_RUN: nhandles=15 timeout=429
2014-12-08 15:14:25.448+0000: 37358: error : virDBusCall:1429 : error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Expected results:
libvirt should handle this more gracefully; concurrent jobs should not fail with the CreateMachine timeout error.

Additional info:
Scripts for above scenarios:
<1>
[root@dhcp-66-83-39 433881]# cat mig.sh
#!/bin/bash

# Migrate a guest back and forth between two hosts, printing progress as it goes
GUEST=$1
HOST1=$2
HOST2=$3
OPTIONS="--live --p2p"
TRANSPORT="tls"


date
for i in `seq 1 20`;
do
    echo "Loop ${i}: Migrating ${GUEST}123 from ${HOST1} to ${HOST2}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST2}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST1}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST2}/system --verbose& pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3

    echo "Loop ${i}: Migrating ${GUEST} back from ${HOST2} to ${HOST1}"
    echo "COMMAND: virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}123 qemu+${TRANSPORT}://root@${HOST1}/system"
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}1 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid1=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}2 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid2=$!
    time virsh -c qemu+${TRANSPORT}://root@${HOST2}/system migrate ${OPTIONS} ${GUEST}3 qemu+${TRANSPORT}://root@${HOST1}/system --verbose& pid3=$!
    wait $pid1
    wait $pid2
    wait $pid3


done
date

<2>
[root@ibm-x3650m3-07 433879]# cat template.sh
#!/bin/sh

Save_dir=/test/save-dir
Dump_dir=/test/dump-dir
XML_dir=/test/guest-xml

mkdir -p /test/save-dir
mkdir -p /test/dump-dir
mkdir -p /test/guest-xml

echo "this is the lifecycle oeprations for guest" $1

virsh dumpxml $1 > $XML_dir/$1.xml
for i in {1..10}
do
virsh start $1
virsh destroy $1
done

virsh start $1
for i in {1..50}
do
virsh suspend $1
virsh resume $1
done

for i in {1..50}
do
virsh save $1 $Save_dir/$1.save
virsh restore $Save_dir/$1.save
done

# dump [--live] [--crash] <domain> <file>
for i in {1..50}
do
virsh dump --crash $1 $Dump_dir/$1.dump
virsh start $1
done

sleep 30

virsh destroy $1

for i in {1..50}
do
virsh undefine $1
virsh define $XML_dir/$1.xml
done

[root@ibm-x3650m3-07 433879]# cat multi_life.sh
#! /bin/sh -
Guests=("r49" "r63" "r64" "r71" "r7" "win2008r2" "win2012" "win7" "win8")

for guest in "${Guests[@]}"
do
sh ./template.sh $guest &
done
Comment 2 Michal Privoznik 2014-12-17 09:52:48 EST
Libvirt currently uses a timeout of 30 seconds for all D-Bus calls. If systemd is unable to reply to us in that time, I don't think there's much we can do. I mean, yeah, we can raise the timeout, but sooner or later you'll hit it again, just migrate more domains at once.
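
For reference, machined's responsiveness can be checked by hand with the tools from the dbus package listed above. This is only an illustrative sketch, not part of the original report: it uses the read-only ListMachines method instead of CreateMachine and applies the same 30-second limit (expressed in milliseconds) that libvirt uses:

# Ask systemd-machined for its machine list and fail if no reply arrives
# within 30 seconds, mirroring libvirt's D-Bus call timeout.
dbus-send --system --print-reply --reply-timeout=30000 \
    --dest=org.freedesktop.machine1 \
    /org/freedesktop/machine1 \
    org.freedesktop.machine1.Manager.ListMachines

If this call also stalls while the concurrent jobs are running, the delay is on the machined side rather than in libvirt.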
Comment 3 Michal Privoznik 2014-12-18 03:09:52 EST
I've discussed this bug on the internal call and nobody seems to have any bright idea. I mean, the timeout is big already, so I don't think raising it will help anything. Therefore I'm closing this one. If the host is under such heavy load that a couple of mkdir()s take systemd more than 30 seconds, you certainly don't want to run virtual machines there.
Comment 4 Solganik Alexander 2015-04-28 04:47:13 EDT
Hi, I encountered a similar issue, and the investigation showed that the cause is the behavior of systemd-machined.
See the detailed description in the bug filed on the CentOS bugzilla:
https://bugs.centos.org/view.php?id=8564
To summarize, it seems that machined can miss handling D-Bus messages in some cases, which causes libvirt to fail to spawn the VM.
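
One way to observe this while reproducing, added here as an illustrative sketch rather than part of the original investigation, is to watch the system-bus traffic to and from machined and check whether each CreateMachine method call gets a matching reply:

# Print every message sent to or received from systemd-machined; a
# CreateMachine method_call with no corresponding method_return points
# at the dropped/unanswered request described above.
dbus-monitor --system "destination='org.freedesktop.machine1'" "sender='org.freedesktop.machine1'"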
Comment 5 Michal Privoznik 2015-04-28 05:21:51 EDT
(In reply to Solganik Alexander from comment #4)
> Hi, I encountered a similar issue, and the investigation showed that the
> cause is the behavior of systemd-machined.
> See the detailed description in the bug filed on the CentOS bugzilla:
> https://bugs.centos.org/view.php?id=8564
> To summarize, it seems that machined can miss handling D-Bus messages in
> some cases, which causes libvirt to fail to spawn the VM.

This confirms my initial suspicion that libvirt's innocent. Switching over to systemd.
Comment 6 Lukáš Nykrýn 2015-04-28 06:02:17 EDT
Would you mind sending that patch to the systemd-devel mailing list, unless you have already done so?
Comment 8 Lukáš Nykrýn 2015-05-29 06:56:47 EDT
Can you please test https://copr.fedoraproject.org/coprs/lnykryn/systemd/ ? This should be in 7.2.
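
For anyone else reproducing this, the copr build is typically pulled in on RHEL/CentOS 7 roughly as follows. This is a sketch; the exact .repo URL is an assumption based on copr's usual epel-7 layout, so copy the real link from the project page above:

# Drop the copr-provided .repo file into yum's configuration and update systemd.
curl -o /etc/yum.repos.d/lnykryn-systemd.repo \
    https://copr.fedoraproject.org/coprs/lnykryn/systemd/repo/epel-7/lnykryn-systemd-epel-7.repo
yum clean metadata && yum update systemd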
Comment 9 Hu Jianwei 2015-05-29 10:32:07 EDT
You mean the src rpm below from your link? I cannot find the related patch (mentioned in comment 7) in that rpm. Could you give me a scratch build?

https://people.redhat.com/lnykryn/systemd/systemd-219-3.el7.1.src.rpm
Comment 10 Hu Jianwei 2015-05-29 10:45:17 EDT
Sorry, please ignore my comment 9.

After enabling your repo, I installed the systemd version below; is that right?

[root@localhost rpmbuild]# rpm -qa| grep systemd
systemd-devel-219-3.el7.centos.1.x86_64
systemd-sysv-219-3.el7.centos.1.x86_64
systemd-python-219-3.el7.centos.1.x86_64
systemd-debuginfo-208-18.el7.x86_64
systemd-libs-219-3.el7.centos.1.x86_64
systemd-219-3.el7.centos.1.x86_64

If the version is right, I'll do some testing next week, thanks.
Comment 11 Lukáš Nykrýn 2015-05-29 10:56:50 EDT
Yep, that patch is not there; the whole loop was rewritten.
Comment 12 Lukáš Nykrýn 2015-06-08 08:03:56 EDT
Were you able to test the rebased version?
Comment 13 Hu Jianwei 2015-06-08 21:51:26 EDT
If you give me a candidate version or link, I can help to test it.

Thanks.
Comment 14 Lukáš Nykrýn 2015-06-09 05:59:29 EDT
Have you tried the version from my copr repo?

https://copr.fedoraproject.org/coprs/lnykryn/systemd/
Comment 15 Hu Jianwei 2015-06-11 21:15:39 EDT
With the version you mentioned, the issue no longer occurs; I think the patch works.
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-219-3.el7.centos.1.x86_64
libvirt-1.2.15-2.el7.x86_64

I can still reproduce it with the old version below:
[root@localhost reproduce]# rpm -q systemd libvirt
systemd-208-20.el7.x86_64
libvirt-1.2.15-2.el7.x86_64

...
Domain test saved to /test/save-dir/test.save

error: Failed to restore domain from /test/save-dir/usb.save
error: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

...

I think you can move on to the next step.
Comment 22 errata-xmlrpc 2015-11-19 10:02:17 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2092.html
