Bug 1043776 - [LXC] guest failed to start when starting many containers
Summary: [LXC] guest failed to start when starting many containers
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Daniel Berrangé
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 922108
 
Reported: 2013-12-17 06:43 UTC by Monson Shao
Modified: 2016-04-10 14:11 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-10 14:11:21 UTC
Embargoed:


Attachments
container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf" (7.06 KB, text/x-log)
2013-12-17 06:43 UTC, Monson Shao

Description Monson Shao 2013-12-17 06:43:08 UTC
Created attachment 837575
container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf"

Description of problem:

When starting many containers (~5000) sequentially, some starts may fail with:

[04600]...
Domain bash-04600 started

[04601]...
error: Failed to start domain bash-04601
error: internal error: guest failed to start: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


[04602]...
Domain bash-04602 started


Note that it does not fail 100% of the time, but the probability increases with the number of running containers. It seems about 80% of starts fail when 10000 containers are running.


Version-Release number of selected component (if applicable):
kernel-3.10.0-54.0.1.el7.x86_64
libvirt-1.1.1-12.el7.x86_64
libvirt-sandbox-0.5.0-5.el7.x86_64
libvirt-glib-0.1.7-1.el7.x86_64


How reproducible:
Failures begin at about 5000 containers, and become frequent at around 10000.

Steps to Reproduce:
1. Create many containers (a minimal sketch for this step follows the loop below).
2. Start the containers sequentially, like below:
num=30000
for i in $(seq -w 1 $num); do
    echo "[$i]..."
    virsh -c lxc:/// start bash-$i
done
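
For step 1, a minimal sketch of bulk-defining the containers. The XML follows libvirt's minimal LXC domain example; the memory size, the /bin/sh init, and the /tmp path are arbitrary choices for illustration, not from the original report:

num=30000
for i in $(seq -w 1 $num); do
    # write a throwaway domain definition for each container
    cat > /tmp/bash-$i.xml <<EOF
<domain type='lxc'>
  <name>bash-$i</name>
  <memory>102400</memory>
  <os>
    <type>exe</type>
    <init>/bin/sh</init>
  </os>
  <devices>
    <console type='pty'/>
  </devices>
</domain>
EOF
    virsh -c lxc:/// define /tmp/bash-$i.xml
done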

Actual results:
Some containers fail.

Expected results:
No containers fail while resources are available.

Additional info:

Comment 2 Daniel Berrangé 2013-12-18 10:56:06 UTC
(In reply to Monson Shao from comment #0)
> error: internal error: guest failed to start: Did not receive a reply.
> Possible causes include: the remote application did not send a reply, the
> message bus security policy blocked the reply, the reply timeout expired, or
> the network connection was broken.

This looks like the message we get when trying to create the cgroups via systemd.

Can you provide the /var/log/libvirt/lxc/bash-04601.log file?

Also see if there are any unusual messages in /var/log/messages, dmesg, or journalctl at this time.

Also, what is the CPU load on the host like at this time? With so many containers active, it might be that things are simply too slow and thus we're hitting the D-Bus timeout.
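
(For reference, a quick sketch of the diagnostics being asked for here; these are all standard commands, and the log paths are the ones named above:)

cat /var/log/libvirt/lxc/bash-04601.log    # per-container libvirt log
journalctl -b | tail -n 200                # recent journal messages
dmesg | tail -n 50                         # recent kernel messages
top -b -n 1 | head -n 20                   # snapshot of host CPU load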

Comment 3 Monson Shao 2013-12-19 10:31:50 UTC
Created attachment 838918
file /var/log/libvirt/lxc/bash-04601.log

Nothing strange was found in /var/log/messages or journalctl, and dmesg is empty.

When this bug reproduces, systemd initially runs at 100% CPU while libvirtd is at about 75%; then both drop to 0% for a few seconds, and the timeout failure occurs. It seems some signal that is being waited for has been dropped.
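
(A minimal sketch of how that CPU behaviour can be watched, assuming the sysstat pidstat tool is available; any per-process monitor would do:)

# sample CPU usage of systemd (PID 1) and libvirtd once per second
pidstat -u -p 1,$(pidof libvirtd) 1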

Comment 4 Daniel Berrangé 2013-12-19 10:57:58 UTC
So this definitely just sounds like a D-Bus timeout due to high load then. Can you re-test but add in a sleep? e.g.

num=30000
for i in $(seq -w 1 $num); do
    echo "[$i]..."
    virsh -c lxc:/// start bash-$i
    # pause after every 100 starts to let the host settle
    n=$(expr $i % 100)
    if [ "$n" -eq 0 ]; then
      sleep 30
    fi
done


IOW, after every 100 containers started, sleep for 30 seconds to give the host a chance to "settle down".

Comment 5 Monson Shao 2014-01-13 08:08:39 UTC
I used the following strategy to test (a sketch of the loop is after the list):
1. If a container fails to start, sleep 10 seconds and retry.
2. After 10 retries, give up and move on to the next container.
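
A minimal sketch of that retry loop (not the exact script used; the names mirror the reproduction loop from comment #0):

num=30000
for i in $(seq -w 1 $num); do
    echo "[$i]..."
    try=0
    # retry a failed start every 10 seconds, at most 10 times
    until virsh -c lxc:/// start bash-$i; do
        try=$(expr $try + 1)
        if [ "$try" -ge 10 ]; then
            echo "[$i] giving up after 10 retries"
            break
        fi
        sleep 10
    done
done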

Then I collected the data:
Containers started    Seconds to start
0                          0
1000                    1279
2000                    1743
3000                    1977
4000                    2230
5000                    3156
6000                    7047
7000                    7794
8000                   10587
9000                   15608
10000                  22678
11000                  28104
12000                  42630
13000                  56909
13990                 155916

It seems we can hardly reach 14000 containers, while the system consumes only 64G of memory (out of 128G total).
So the D-Bus timeout seems to be the bottleneck of this LXC scalability test.
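
(For anyone re-measuring this, a sketch of timing each start; libdbus's default reply timeout is 25 seconds, so per-start times approaching that mark would point at the D-Bus timeout:)

num=30000
for i in $(seq -w 1 $num); do
    start=$(date +%s)
    virsh -c lxc:/// start bash-$i
    echo "[$i] took $(( $(date +%s) - start ))s"
done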

Comment 12 Cole Robinson 2016-04-10 14:11:21 UTC
Given that this is a pathological case, the reporter is gone, and the issue may not even exist any more, I'm just closing this.

