Created attachment 837575 [details]
container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf"

Description of problem:

When starting many containers (~5000) sequentially, some may fail with:

[04600]...
Domain bash-04600 started

[04601]...
error: Failed to start domain bash-04601
error: internal error: guest failed to start: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

[04602]...
Domain bash-04602 started

Note that failure is not guaranteed, but the probability increases with the number of running containers. It seems about 80% fail when running 10000 containers.

Version-Release number of selected component (if applicable):
kernel-3.10.0-54.0.1.el7.x86_64
libvirt-1.1.1-12.el7.x86_64
libvirt-sandbox-0.5.0-5.el7.x86_64
libvirt-glib-0.1.7-1.el7.x86_64

How reproducible:
Failures begin at about 5000 containers and become frequent at 10000.

Steps to Reproduce:
1. Create many containers.
2. Start the containers as below:

num=30000
for i in $(seq -w 1 $num); do
    echo "[$i]..."
    virsh -c lxc:/// start bash-$i
done

Actual results:
Some containers fail to start.

Expected results:
No containers fail while resources are still available.

Additional info:
(In reply to Monson Shao from comment #0)
> Created attachment 837575 [details]
> container log with LIBVIRT_LOG_FILTERS="1:libvirt 1:lxc 1:conf"
>
> Description of problem:
>
> When starting many containers (~5000) sequentially, some may fail due to:
>
> [04600]...
> Domain bash-04600 started
>
> [04601]...
> error: Failed to start domain bash-04601
> error: internal error: guest failed to start: Did not receive a reply.
> Possible causes include: the remote application did not send a reply, the
> message bus security policy blocked the reply, the reply timeout expired, or
> the network connection was broken.

This looks like a message from trying to create the cgroups via systemd. Can you provide the /var/log/libvirt/lxc/bash-04601.log file?

Also see if there are any unusual messages in /var/log/messages, dmesg or journalctl at this time.

Also, what is the CPU load on the host like at this time? With so many containers active it might be that things are simply too slow and thus we're hitting the dbus timeout.
Created attachment 838918 [details]
file /var/log/libvirt/lxc/bash-04601.log

Nothing strange found in /var/log/messages or journalctl; dmesg is empty.

When this bug is reproduced, at first systemd runs at 100% CPU while libvirtd is at about 75%, then both drop to 0% for a few seconds, and then the timeout failure occurs. It seems some signal that is being waited for has been dropped.
So this definitely just sounds like a dbus timeout due to high load then. Can you re-test but add in a sleep, e.g.

for i in $(seq -w 1 $num); do
    echo "[$i]..."
    virsh -c lxc:/// start bash-$i
    n=`expr $i % 100`
    if test $n == 0 ; then
        sleep 30
    fi
done

IOW, after every 100 VMs started, sleep for 30 seconds to give the host a chance to "settle down".
I used the following strategy to test:
1. If a container fails to start, sleep 10 seconds and retry.
2. After 10 retries, give up and move on to the next container.

Then I collected the data:

NR of containers    seconds to start
     0                       0
  1000                    1279
  2000                    1743
  3000                    1977
  4000                    2230
  5000                    3156
  6000                    7047
  7000                    7794
  8000                   10587
  9000                   15608
 10000                   22678
 11000                   28104
 12000                   42630
 13000                   56909
 13990                  155916

It can hardly reach 14000 containers, while the system only consumes 64G of memory (128G in total). So the dbus timeout seems to be the bottleneck in this LXC scalability test.
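For reference, a minimal sketch of the retry loop described above (illustrative only; the exact script was not attached, so the loop structure and variable names are assumptions, while the 10-second sleep and 10-retry limit come from the description):

num=30000
for i in $(seq -w 1 $num); do
    echo "[$i]..."
    started=0
    for try in $(seq 1 10); do                 # retry up to 10 times
        if virsh -c lxc:/// start bash-$i; then
            started=1
            break
        fi
        sleep 10                               # wait 10 seconds before retrying
    done
    if test $started -eq 0; then
        echo "[$i] failed after 10 retries, moving on"
    fi
done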
Given that this is a pathological case, the reporter is gone, and the issue may not even exist any more, just closing.