Bug 1043776

Summary: [LXC] guest failed to start when starting many containers
Product: [Community] Virtualization Tools
Component: libvirt
Reporter: Monson Shao <jshao>
Assignee: Daniel Berrangé <berrange>
Status: CLOSED DEFERRED
Severity: high
Priority: high
Version: unspecified
CC: ajia, arozansk, berrange, ccui, crobinso, dwalsh, dyuan, jsuchane, kzhang, lsu, mzhan, rbalakri
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-04-10 14:11:21 UTC
Bug Blocks: 922108
Attachments:
container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf"

Description Monson Shao 2013-12-17 06:43:08 UTC
Created attachment 837575 [details]
container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf"

Description of problem:

When starting many containers (~5000) sequentially, some may fail due to:

[04600]...
Domain bash-04600 started

[04601]...
error: Failed to start domain bash-04601
error: internal error: guest failed to start: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


[04602]...
Domain bash-04602 started


Note that failure is not 100%, but the probability increases with the number of running containers. Roughly 80% of starts seem to fail when about 10000 containers are running.


Version-Release number of selected component (if applicable):
kernel-3.10.0-54.0.1.el7.x86_64
libvirt-1.1.1-12.el7.x86_64
libvirt-sandbox-0.5.0-5.el7.x86_64
libvirt-glib-0.1.7-1.el7.x86_64


How reproducible:
Failures begin at around 5000 containers and become frequent at around 10000.

Steps to Reproduce:
1. Create many containers.
2. Start the containers as below:
num=30000
for i in $(seq -w 1 $num); do
    echo "[$i]..."
    virsh -c lxc:/// start bash-$i
done

Actual results:
Some containers fail.

Expected results:
No containers fail while resources are available.

Additional info:

Comment 2 Daniel Berrangé 2013-12-18 10:56:06 UTC
(In reply to Monson Shao from comment #0)
> Created attachment 837575 [details]
> container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf"
> 
> Description of problem:
> 
> When starting many containers (~5000) sequentially, some may fail due to:
> 
> [04600]...
> Domain bash-04600 started
> 
> [04601]...
> error: Failed to start domain bash-04601
> error: internal error: guest failed to start: Did not receive a reply.
> Possible causes include: the remote application did not send a reply, the
> message bus security policy blocked the reply, the reply timeout expired, or
> the network connection was broken.

This looks like an error message from the dbus call made when creating the cgroups via systemd.

Can you provide the /var/log/libvirt/lxc/bash-04601.log file?

Also see if there are any unusual messages in /var/log/messages, dmesg, or journalctl at this time.

Also, what is the CPU load on the host like at this time? With so many containers active, it might be that things are simply too slow and thus we're hitting the dbus timeout.
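
One way to capture that information around a failure is a small helper like the sketch below (the function name and log path are illustrative, not from this report; it assumes a Linux host where `uptime` is available):

```shell
# log_host_state FILE: append a timestamped snapshot of host load, and
# recent journal messages when journalctl is available, to FILE.
log_host_state() {
    {
        date
        uptime
        if command -v journalctl >/dev/null 2>&1; then
            journalctl -n 50 --no-pager
        fi
    } >> "$1"
}

# e.g. in the start loop:
# virsh -c lxc:/// start "bash-$i" || log_host_state start-failures.log
```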

Comment 3 Monson Shao 2013-12-19 10:31:50 UTC
Created attachment 838918 [details]
file /var/log/libvirt/lxc/bash-04601.log

Nothing strange found in /var/log/messages or journalctl; dmesg is empty.

When this bug is reproduced, at first systemd runs at 100% CPU while libvirtd is at about 75%; then both drop to 0% for a few seconds, after which the timeout failure occurs. It seems some signal that is being waited for has been dropped.

Comment 4 Daniel Berrangé 2013-12-19 10:57:58 UTC
So this definitely sounds like a dbus timeout due to high load. Can you re-test but add in a sleep? e.g.

for i in $(seq -w 1 $num); do
    echo "[$i]..."
    virsh -c lxc:/// start bash-$i
    n=$(expr "$i" % 100)
    if [ "$n" -eq 0 ]; then
        sleep 30
    fi
done


IOW, after every 100 VMs started, sleep for 30 seconds to give the host a chance to "settle down".

Comment 5 Monson Shao 2014-01-13 08:08:39 UTC
I used the following strategy to test:
1. If a container fails to start, sleep 10 seconds and retry.
2. After 10 retries, give up and move on to the next container.
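
That strategy can be sketched as a small shell helper (the `retry` function name is illustrative; the retry count and sleep interval mirror the description above):

```shell
# retry CMD ARGS...: run CMD, retrying up to 10 times with a 10-second
# pause between attempts; returns non-zero if all attempts fail.
retry() {
    local attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge 10 ]; then
            echo "giving up on: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep 10
    done
}

# Usage in the start loop from the description:
# retry virsh -c lxc:/// start "bash-$i"
```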

Then I collect the data:
NR of containers   Seconds for starting
0                  0
1000               1279
2000               1743
3000               1977
4000               2230
5000               3156
6000               7047
7000               7794
8000               10587
9000               15608
10000              22678
11000              28104
12000              42630
13000              56909
13990              155916

It can hardly reach 14000 containers, even though the system consumes only 64G of memory (128G in total).
So the dbus timeout seems to be the bottleneck in this LXC scalability test.

Comment 12 Cole Robinson 2016-04-10 14:11:21 UTC
Given that this is a pathological case, the reporter is gone, and the issue may not even exist any more, I'm just closing this.