Description of problem:
After upgrading F19->F20 I am unable to start LXC containers:

# virsh --connect lxc:// start krelln
error: Failed to start domain krelln
error: Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused

SELinux is in permissive mode.

Version-Release number of selected component (if applicable):
libvirt-1.1.3.3-5.fc20.x86_64
libvirt-daemon-lxc-1.1.3.3-5.fc20.x86_64
libvirt-daemon-driver-lxc-1.1.3.3-5.fc20.x86_64
systemd-208-14.fc20.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Try to start LXC container.

Additional info:

# LIBVIRT_LOG_FILTERS=1:lxc LIBVIRT_LOG_OUTPUTS=1:stderr /usr/sbin/libvirtd
2014-02-19 10:36:11.856+0000: 7627: info : libvirt version: 1.1.3.3, package: 5.fc20 (Fedora Project, 2014-02-01-20:08:12, buildvm-17.phx2.fedoraproject.org)
2014-02-19 10:36:11.856+0000: 7627: debug : lxcContainerAvailable:2076 : container support is enabled
2014-02-19 10:36:11.864+0000: 7775: warning : qemuMonitorVMStatusToPausedReason:3365 : Qemu reported unknown VM status: 'suspended'
2014-02-19 10:36:11.923+0000: 7627: debug : lxcContainerAvailable:2076 : container support is enabled
2014-02-19 10:36:11.923+0000: 7627: info : lxcSecurityInit:1451 : lxcSecurityInit (null)
2014-02-19 10:36:11.925+0000: 7627: debug : virLXCDriverCapsInit:143 : Initialized caps for security driver "selinux" with DOI "0"
2014-02-19 10:36:11.926+0000: 7627: error : virDomainObjParseNode:12677 : XML error: unexpected root element <domain>, expecting <domstatus>
2014-02-19 10:36:11.929+0000: 7627: debug : virLXCProcessStart:1043 : Setting current domain def as transient
2014-02-19 10:36:11.930+0000: 7627: debug : virLXCProcessStart:1068 : Preparing host devices
2014-02-19 10:36:11.930+0000: 7627: debug : virLXCProcessStart:1084 : Generating domain security label (if required)
2014-02-19 10:36:11.930+0000: 7627: debug : virLXCProcessStart:1095 : Setting domain security labels
2014-02-19 10:36:11.931+0000: 7627: debug : virLXCProcessSetupInterfaceBridged:252 : calling vethCreate()
2014-02-19 10:36:11.936+0000: 7627: debug : virLXCProcessSetupInterfaceBridged:256 : parentVeth: veth0, containerVeth: veth1
2014-02-19 10:36:12.075+0000: 7627: error : virNetSocketNewConnectUNIX:567 : Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused
2014-02-19 10:36:12.076+0000: 7627: debug : virLXCMonitorDispose:199 : mon=0x7f7b38215640
2014-02-19 10:36:12.083+0000: 7627: error : virCommandWait:2376 : internal error: Child process (ip link del veth1) unexpected exit status 1: Cannot find device "veth1"
2014-02-19 10:36:12.083+0000: 7627: error : virLXCProcessAutostartDomain:1365 : Failed to autostart VM 'krelln': Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused
2014-02-19 10:36:43.262+0000: 7621: debug : virLXCProcessStart:1043 : Setting current domain def as transient
2014-02-19 10:36:43.263+0000: 7621: debug : virLXCProcessStart:1068 : Preparing host devices
2014-02-19 10:36:43.263+0000: 7621: debug : virLXCProcessStart:1084 : Generating domain security label (if required)
2014-02-19 10:36:43.264+0000: 7621: debug : virLXCProcessStart:1095 : Setting domain security labels
2014-02-19 10:36:43.264+0000: 7621: debug : virLXCProcessSetupInterfaceBridged:252 : calling vethCreate()
2014-02-19 10:36:43.270+0000: 7621: debug : virLXCProcessSetupInterfaceBridged:256 : parentVeth: veth0, containerVeth: veth1
2014-02-19 10:36:43.435+0000: 7621: error : virNetSocketNewConnectUNIX:567 : Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused
2014-02-19 10:36:43.435+0000: 7621: debug : virLXCMonitorDispose:199 : mon=0x7f7b2c003780
2014-02-19 10:36:43.441+0000: 7621: error : virCommandWait:2376 : internal error: Child process (ip link del veth1) unexpected exit status 1: Cannot find device "veth1"

#####################

# cat /var/log/libvirt/lxc/krelln.log
2014-02-19 10:36:11.961+0000: starting up
PATH=/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_FILTERS=1:lxc LIBVIRT_LOG_OUTPUTS=1:stderr /usr/libexec/libvirt_lxc --name krelln --console 27 --security=selinux --handshake 30 --background --veth veth1
2014-02-19 10:36:12.055+0000: 7792: info : libvirt version: 1.1.3.3, package: 5.fc20 (Fedora Project, 2014-02-01-20:08:12, buildvm-17.phx2.fedoraproject.org)
2014-02-19 10:36:12.055+0000: 7792: info : virLXCDriverCapsInit:145 : No driver, not initializing security driver
2014-02-19 10:36:12.057+0000: 7792: debug : main:2373 : Security model selinux type none label (null) imagelabel (null)
2014-02-19 10:36:12.064+0000: 7792: error : virDBusCallMethod:1173 : Input/output error
2014-02-19 10:36:43.296+0000: starting up
PATH=/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_FILTERS=1:lxc LIBVIRT_LOG_OUTPUTS=1:stderr /usr/libexec/libvirt_lxc --name krelln --console 28 --security=selinux --handshake 31 --background --veth veth1
2014-02-19 10:36:43.340+0000: 7931: info : libvirt version: 1.1.3.3, package: 5.fc20 (Fedora Project, 2014-02-01-20:08:12, buildvm-17.phx2.fedoraproject.org)
2014-02-19 10:36:43.340+0000: 7931: info : virLXCDriverCapsInit:145 : No driver, not initializing security driver
2014-02-19 10:36:43.418+0000: 7931: debug : main:2373 : Security model selinux type none label (null) imagelabel (null)
2014-02-19 10:36:43.424+0000: 7931: error : virDBusCallMethod:1173 : Input/output error

Similar bugs: #996851 #782038 #1012648
> 2014-02-19 10:36:43.424+0000: 7931: error : virDBusCallMethod:1173 : Input/output error

Sigh. This is an unhelpful error message returned by systemd/machined, I'm afraid. Can you give me some more info from the following commands:

$ machinectl
$ systemctl --all | grep machine
# machinectl
MACHINE                        CONTAINER SERVICE
qemu-rolling                   vm        libvirt-qemu

# systemctl --all | grep machine
machine-lxc\x2dkrelln.scope    loaded failed failed  Container lxc-krelln
machine-qemu\x2drolling.scope  loaded active running Virtual Machine qemu-rolling
machine.slice                  loaded active active  Virtual Machine and Container Slice

# systemctl status machine-lxc\x2dkrelln.scope
machine-lxcx2dkrelln.scope
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

# systemctl status systemd-machined
systemd-machined.service - Virtual Machine and Container Registration Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-machined.service; static)
   Active: inactive (dead)
     Docs: man:systemd-machined.service(8)
           http://www.freedesktop.org/wiki/Software/systemd/machined

Feb 19 11:36:43 mother.pipebreaker.pl systemd-machined[5805]: Failed to start machine scope: Unit machine-lxc\x2dkrelln.scope already exists.
Feb 19 11:40:56 mother.pipebreaker.pl systemd-machined[5805]: Failed to start machine scope: Unit machine-lxc\x2dkrelln.scope already exists.
Feb 19 11:41:04 mother.pipebreaker.pl systemd-machined[5805]: Failed to start machine scope: Unit machine-lxc\x2dkrelln.scope already exists.
Ok, that's what I was afraid of. In some cases when systemd fails to create the machine for libvirt, it'll fail to clean up properly. This means that later attempts to start the VM will show the 'already exists' error. Unfortunately I've not found any way to get out of this problem state once systemd gets confused, other than to change the name of the VM, or reboot, both of which suck :-(

What is not clear, though, is what the root cause was that made systemd fail the very first time around. Do you recall anything special about the very first time you tried to do 'virsh start krelln'?

If you reboot and this problem still occurs, then the best thing is to run 'strace' on the machined process, e.g.

  reboot
  strace -s 10000 -f -p PID-OF-SYSTEMD-MACHINED

and then 'virsh start krelln'. Somewhere in the strace data will be systemd's hidden real error message about the root cause.
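To make the attach step less error-prone, the recipe above can be wrapped in a small helper. This is only a sketch: the output file /tmp/machined.strace and the use of pidof are my own choices, not from the original comment, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: show the strace command to attach to systemd-machined before
# reproducing the failing 'virsh start'. Prints only; run nothing itself.
pid=$(pidof systemd-machined 2>/dev/null) || pid=""
if [ -n "$pid" ]; then
    echo "run as root: strace -s 10000 -f -o /tmp/machined.strace -p $pid"
else
    echo "systemd-machined not running"
fi
```

Then start the guest in another terminal and search the strace output for the real error behind the 'already exists' message.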
In /sys/fs/cgroup/systemd/system.slice I have:

drwxr-xr-x. 2 root root 0 Jan  6 12:34 machine-krelln.scope
drwxr-xr-x. 2 root root 0 Jan  6 12:34 machine-lxc\x2dkrelln.scope

But I cannot remove those dirs using "rm"; the operation is not permitted even as root.

Unfortunately I do not remember anything special. I'm not sure what was special about January 6th; the machine was booted on Sun Dec 29 14:38:53 2013.

Strace at the moment is not very enlightening:

19744 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=../src/machine/machine.c\nCODE_LINE=246\nCODE_FUNCTION=machine_start_scope\nSYSLOG_IDENTIFIER=systemd-machined\n", 147}, {"MESSAGE=", 8}, {"Failed to start machine scope: Unit machine-lxc\\x2dkrelln.scope already exists.", 79}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 235
Yep, I'm afraid strace will only be useful if you're able to reboot your host, or alternatively if you can reproduce the problem with a 2nd LXC guest. Basically we need the strace from the first time systemd fails for a particular guest name.
I was able to do the cleanup _without_ reboot and start the guest. I will later shut it down and try to start again. For the record, the required cleanup steps were:

systemctl stop systemd-machined

/sys/fs/cgroup ]# find . -name *krelln*
./perf_event/machine.slice/machine-lxc\x2dkrelln.scope
./net_cls/machine.slice/machine-lxc\x2dkrelln.scope
./freezer/machine.slice/machine-lxc\x2dkrelln.scope
./cpuset/machine.slice/machine-lxc\x2dkrelln.scope
# the above directories could not be rm'd, but they can be moved (mv) aside to some temporary name

/run/systemd ]# find . -name *krelln*
./system/machine-lxc\x2dkrelln.scope
./system/machine-lxc\x2dkrelln.scope.d
# rm -rf those

systemctl stop "machine-lxc\x2dkrelln.scope"
systemctl kill "machine-lxc\x2dkrelln.scope"
# lingering unit disappeared

systemctl daemon-reload
systemctl start systemd-machined

At that point, virsh --connect lxc:// start krelln succeeds!
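For convenience, the steps in this comment can be collected into a dry-run helper that just prints the commands for a given container name. This is only a sketch of the workaround, not a tested tool: the \x2d escaping of '-' in the unit name follows the systemd convention visible in the logs, and everything printed should be reviewed before running it as root.

```shell
#!/bin/sh
# Dry-run sketch of the manual cleanup; prints commands, executes nothing.
name=${1:-krelln}
unit="machine-lxc\\x2d${name}.scope"   # systemd-escaped scope unit name

cat <<EOF
systemctl stop systemd-machined
# leftover cgroup dirs cannot be rm'd, but can be mv'd aside:
find /sys/fs/cgroup -depth -name '${unit}' -exec mv '{}' '{}.stale' \;
rm -rf "/run/systemd/system/${unit}" "/run/systemd/system/${unit}.d"
systemctl stop '${unit}'
systemctl kill '${unit}'
systemctl daemon-reload
systemctl start systemd-machined
EOF
```

Running it without arguments prints the cleanup for the 'krelln' guest from this bug.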
A few restarts later and I cannot reproduce the issue, meh. Nevertheless, the cleanup steps from #c6 can be used to work around the issue when it happens.
Thanks for figuring out the cleanup steps and attempting to reproduce it. I'll close this bug as INSUFFICIENT DATA since we can't currently figure out how to reproduce it. If you see it again, feel free to re-open this, of course.
I am working with Fedora Core 20-based installations of the PlanetLab software (see http://benlomond.nntb.no/lxc/2015.02.26--nightly--lxc-f20-linux64/), which is based on LXC containers, and experience the same (or a quite similar) issue: destroying a container does not work properly.

"virsh destroy ..." leaves files like:

/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dbs_test24.scope
/sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dbs_test24.scope
/sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dbs_test24.scope
/sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dbs_test24.scope

The following "virsh undefine" also does not remove them. Then, re-creating a container with the same name fails.

I can reliably reproduce this issue in my automatic build-and-test setup at http://benlomond.nntb.no/lxc/ . The people at OneLab experience the same issue in their build-and-test setup at http://build.onelab.eu/lxc/ as well. I furthermore made a manual installation for debugging. If there is need for looking at certain files, I can provide help for further debugging of the problem.
An important thing to mention: when reproducing the bug in my build-and-test system, the virsh destroy/virsh undefine are running inside a systemd-based service. In this case, the issue seems to happen always. When running the destroy/undefines manually, some destroy -> recreate rounds may work. But after some rounds, the system always runs into the same issue.

Observation 1: When the problem does *not* occur, there will be no remaining scope files after "virsh destroy". Otherwise, if there are such files left after "virsh destroy", the problem is there and the scope files will not be removed any more.

Observation 2: I also added a call to shutdown, before destroy, with a 5s delay between shutdown and destroy => no change.

An interesting thing I noticed in the log file of the LXC container, i.e. /var/log/libvirt/lxc/<sliver>.log:

Round 1:
2015-02-25 20:03:53.909+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=9415a960-f32f-4f8d-ad32-9c90fd764356 LIBVIRT_LXC_UUID=9415a960-f32f-4f8d-ad32-9c90fd764356 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 2:
2015-02-25 20:06:00.121+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=71410263-418d-431a-8635-8a35e70077a3 LIBVIRT_LXC_UUID=71410263-418d-431a-8635-8a35e70077a3 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 3:
2015-02-25 20:08:07.562+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=5d0040a4-4d64-4336-b09b-cf2ce329c840 LIBVIRT_LXC_UUID=5d0040a4-4d64-4336-b09b-cf2ce329c840 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 4:
2015-02-25 20:10:15.479+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=ec1253fa-445b-4b27-8718-305ac216ce60 LIBVIRT_LXC_UUID=ec1253fa-445b-4b27-8718-305ac216ce60 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 5:
2015-02-25 20:12:23.974+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=2aeeab9b-ef50-4542-9c8b-96500301782a LIBVIRT_LXC_UUID=2aeeab9b-ef50-4542-9c8b-96500301782a LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

So far, everything went fine for 5 create -> destroy rounds.
Round 6:
2015-02-25 20:14:30.783+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=304c7c78-3952-4e37-8217-095988a19181 LIBVIRT_LXC_UUID=304c7c78-3952-4e37-8217-095988a19181 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init
2015-02-25 20:15:34.843+0000: 3976: info : libvirt version: 1.2.11, package: 2 (PlanetLab/OneLab, 2015-02-24-14:30:38, 2015-02-24--test--lxc-f20-linux64)
2015-02-25 20:15:34.843+0000: 3976: error : virNetSocketReadWire:1575 : End of file while reading data: Input/output error

The two lines above are exactly at the point when the destroy call is made. virNetSocketReadWire() is a function in libvirt.

(The last destruction leaves the files; from the nodemanager log, with debug output I added:

Wed Feb 25 20:15:35 2015: -TMP-FIND cs_qqq003_0
Wed Feb 25 20:15:35 2015: log_call: running command /usr/bin/find /sys/fs/ -name *cs_qqq003_0* -exec /usr/bin/echo {} ;
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/systemd/machine.slice/machine-lxc\x2dcs_qqq003_0.scope)

The following round then fails creating the sliver:

Round 7:
2015-02-25 20:16:38.476+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
2015-02-25 20:16:38.486+0000: 16585: info : libvirt version: 1.2.11, package: 2 (PlanetLab/OneLab, 2015-02-24-14:30:38, 2015-02-24--test--lxc-f20-linux64)
2015-02-25 20:16:38.486+0000: 16585: error : virDBusCall:1542 : error from service: CreateMachine: Input/output error

error from service: CreateMachine: Input/output error
(In reply to Thomas Dreibholz from comment #9)
> I am working with Fedora Core 20-based installations of the PlanetLab
> software (see
> http://benlomond.nntb.no/lxc/2015.02.26--nightly--lxc-f20-linux64/), which
> is based on LXC containers, and experience the same (or a quite similar)
> issue: destroying a container does not work properly.
>
> "virsh destroy ..." leaves files like:
> /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dbs_test24.scope
> /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dbs_test24.scope
> /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dbs_test24.scope
> /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dbs_test24.scope

When these dirs get left behind, can you check the 'tasks' file in those dirs and see if any PIDs are listed? virsh destroy should kill all the cgroup directories.

Also, what version of libvirt do you have currently?
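To check the 'tasks' files across all controllers at once, a loop like the following can help. This is a sketch, not part of the original comment: it assumes cgroup v1 mounted under /sys/fs/cgroup and the machine-lxc\x2d*.scope naming shown in the listings above.

```shell
#!/bin/sh
# Sketch: list any PIDs still tracked in leftover machine-lxc-*.scope
# cgroups, across all v1 controllers. Read-only; safe to run as root.
found=0
for f in /sys/fs/cgroup/*/machine.slice/machine-lxc*.scope/tasks; do
    [ -f "$f" ] || continue      # glob did not match anything
    found=$((found + 1))
    echo "== $f"
    cat "$f"                     # empty output means no tasks remain
done
echo "checked $found leftover scope(s)"
```

An empty 'tasks' everywhere but dirs that still cannot be removed would point at the EBUSY behaviour discussed later in this bug.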
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# rpm -q libvirt
libvirt-1.2.11-2.x86_64

But the issue is already much older. Since about May 2014, the automatic build-and-test scripts fail for F20. The libvirt version by then was about 1.2.3 or 1.2.4.

The tasks file is empty:

[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# pwd
/sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# ls -al
total 0
drwxr-xr-x  2 root root 0 Feb 25 21:13 .
drwxr-xr-x 67 root root 0 Feb 25 20:14 ..
-rw-r--r--  1 root root 0 Feb 25 21:13 cgroup.clone_children
-rw-r--r--  1 root root 0 Feb 25 21:13 cgroup.procs
-r--r--r--  1 root root 0 Feb 25 21:13 freezer.parent_freezing
-r--r--r--  1 root root 0 Feb 25 21:13 freezer.self_freezing
-rw-r--r--  1 root root 0 Feb 25 21:13 freezer.state
-rw-r--r--  1 root root 0 Feb 25 21:13 notify_on_release
-rw-r--r--  1 root root 0 Feb 25 20:15 tasks
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# cat tasks
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# cat *
0
0
0
THAWED
0
About "virsh destroy should kill all the cgroup directories.": this is also what I found in the libvirt code. The interesting thing (from journalctl): Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16) Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16) Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16) Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16) (16) is the value of errno. 16 = EBUSY. The error message seems to be from virCgroupRemoveRecursively() in libvirt. As the name says, it tries to recursively remove the directory structure. However, there are still files inside: # ls -al /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/* -rw-r--r-- 1 root root 0 Feb 25 21:04 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/cgroup.clone_children -rw-r--r-- 1 root root 0 Feb 25 21:04 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/cgroup.procs -rw-r--r-- 1 root root 0 Feb 25 21:04 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/notify_on_release -rw-r--r-- 1 root root 0 Feb 25 20:14 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/tasks So, EBUSY is plausible. virCgroupRemoveRecursively() tries to remove directories (rmdir) but not files.
I also tried "rm" on the files (in the shell, as root): rm /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/* rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.clone_children'? y rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.clone_children': Operation not permitted rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.procs'? y rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.procs': Operation not permitted rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/notify_on_release'? y rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/notify_on_release': Operation not permitted rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/tasks'? y rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/tasks': Operation not permitted So, may be the issue is that libvirt cannot remove the files (permission denied), and then it is unable to remove the directories?
The cgroups filesystem is slightly magic in that you don't ever delete individual files. rmdir on a directory will succeed if the 'tasks' file has no PIDs in it.
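In other words, the pre-condition for a successful rmdir is: no PIDs in 'tasks' and no child cgroup directories. A sketch of that check follows; cgroup_removable is a made-up helper name, and the demo uses a throwaway directory rather than a real cgroup, since on cgroupfs the control files disappear together with the rmdir itself.

```shell
#!/bin/sh
# Sketch: would rmdir of this cgroup directory succeed? In cgroupfs you
# never unlink the control files; rmdir works once 'tasks' is empty and
# there are no child cgroups, otherwise the kernel returns EBUSY.
cgroup_removable() {
    dir=$1
    [ -f "$dir/tasks" ] || return 1
    [ -s "$dir/tasks" ] && return 1   # PIDs remain -> rmdir gives EBUSY
    for child in "$dir"/*/; do
        [ -d "$child" ] && return 1   # child cgroup still present
    done
    return 0
}

# Demo on a throwaway dir shaped like the leftover scopes in this bug
scope=$(mktemp -d)
: > "$scope/tasks"                    # empty tasks file
if cgroup_removable "$scope"; then echo "removable"; else echo "busy"; fi
```

This mirrors why the leftover scope dirs above return EBUSY even though their 'tasks' files look empty: some controller still considers the group busy, e.g. via a child group.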
If you can reproduce it reasonably reliably, it would be interesting to get extra debug logs.

In /etc/libvirt/lxc.conf set:

  log_with_libvirtd=1

In /etc/libvirt/libvirtd.conf set:

  log_filters="1:lxc 1:cgroup 1:systemd"
  log_outputs="1:stderr 1:file:/var/log/libvirt/libvirtd.log"

NB, I intentionally have 2 outputs in this second setting: even though we won't care about stderr from libvirtd itself, this ensures stderr from libvirt_lxc gets saved into the per-VM log.

Delete all existing VM logs (/var/log/libvirt/lxc/*.log), then:

  systemctl restart libvirtd.service

Now try to reproduce the error with a guest. Once you succeed, attach /var/log/libvirt/libvirtd.log and /var/log/libvirt/lxc/$GUEST.log to this bug.
Created attachment 995552 [details] libvirtd log file
Created attachment 995553 [details] Container log file
Here are the log files. The container as_nornet1_0 went through 3 create -> destroy cycles. The 3rd container destruction failed with the files left, finally causing the 4th container creation to fail.
Excellent, that is useful information. I can indeed see an error on the 3rd shutdown attempt:

2015-02-26 12:30:32.227+0000: 31959: debug : virLXCProcessStop:660 : Stopping VM name=as_nornet1_0 pid=6254 reason=2
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillPainfully:3592 : cgroup=0x7ffc1001e1b0 path=
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillRecursive:3571 : group=0x7ffc1001e1b0 path= signum=15
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillRecursiveInternal:3505 : group=0x7ffc1001e1b0 path= signum=15 pids=0x7ffc14177c50
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillInternal:3374 : group=0x7ffc1001e1b0 path= signum=15 pids=0x7ffc14177c50
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillInternal:3386 : No file /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/tasks, assuming done
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillRecursiveInternal:3515 : Iterate over children of /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (killedAny=0)
2015-02-26 12:30:32.229+0000: 31959: debug : virCgroupKillRecursiveInternal:3518 : Path /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ does not exist, assuming done
2015-02-26 12:30:32.229+0000: 31959: debug : virCgroupKillPainfully:3603 : Iteration 0 rc=0
2015-02-26 12:30:32.229+0000: 31959: debug : virCgroupKillPainfully:3610 : Complete 0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCProcessCleanup:158 : Stopping VM name=as_nornet1_0 pid=6254 reason=2
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorClose:213 : mon=0x7ffc10020fb0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorClose:219 : Clear EOF callback mon=0x7ffc10020fb0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorEOFNotify:123 : EOF notify mon=0x7ffc10020fb0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorEOFNotify:133 : No EOF callback
2015-02-26 12:30:32.229+0000: 31951: debug : virLXCMonitorDispose:201 : mon=0x7ffc10020fb0
2015-02-26 12:30:32.230+0000: 6254: error : virNetSocketReadWire:1575 : End of file while reading data: Input/output error
2015-02-26 12:30:32.230+0000: 6254: debug : virLXCControllerClientCloseHook:719 : Client 0x7f899b12fc70 has closed
2015-02-26 12:30:32.230+0000: 6254: debug : virLXCControllerClientPrivateFree:731 : Got private data free 0x7f899b129fa0
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3331 : Removing cgroup
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/memory/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/devices/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/blkio/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3356 : Done removing cgroup

Notice it is getting errno==16==EBUSY when trying to delete the cgroup. Also, it is earlier getting an ENOENT error killing tasks in /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/.

I wonder if there are somehow tasks present in the other cgroup controllers that were not present in the main cpu,cpuacct controller. That could certainly explain the scenario we're seeing.
Out of interest, what do you actually have running inside the containers? Are they single applications, full Fedora installs, or something else?
Created attachment 995588 [details] "rpm -qa" output inside a container
The system inside the container is probably a quite standard, lightweight installation of Fedora Core 18. I have attached the output of "rpm -qa". Note that using Fedora Core 20 inside the container also runs into the same problem, so the version inside the container makes no difference for this bug.
An important thing to mention is: when using Fedora Core 18 for hosting the system, the bug does not occur. In this case, the container destruction always works as expected. The Fedora Core 18 system also uses the same version of libvirt (1.2.11), but it had worked with older versions as well. So, for my production installation of NorNet, I am therefore still on Fedora Core 18. The bug actually prevents me from upgrading my systems from 18 to 20.
IIRC, Fedora 18 did not have the systemd support for machined, so libvirt would be taking a different codepath on that host, thus avoiding the error scenario.
The last successful F20 build-and-test run at OneLab seems to have been on May 26, 2014 (http://web.archive.org/web/20140621155333/http://build.onelab.eu/lxc/): 2014.05.26--f20.warn. Unfortunately, Wayback has not stored the build-and-test log. Then, probably some change in the Fedora Core 20 packages introduced the issue. The build takes most of the packages directly from an F20 mirror, and just adds a newer libvirt version (upstream sources without any changes) and some applications that should not affect containers.
I would like to chime in at this point. I do not think it is reliable to assume that this feature was working for us before that date (May 26 last year). I think I remember the tests at that time were not as complete as they are now, and that the issue was there but just went undetected.

As for the question about what is going on in the container: in the context of the tests, each container is a rather small fedora20 instance (Thomas uses fedora18, but as he says the results are the same). We do add a very simple user-defined init script that gets triggered through systemd. Also we have a home-brewed mechanism for punching holes in the isolation between host and guest, named vsys, that is implemented as processes running in the host and listening to pipes that can be reached from the guest.

It might very well be that either of these 2 mechanisms is in the way here when we try to tear down a container properly, but if that is the case it is hard for us to fix the problem because we do not understand what is going on under the hood.

Hope this helps.
I also tried with a container installation created from the Fedora template /usr/share/lxc/templates/lxc-fedora, i.e. a plain, minimal F20 installation inside the container. There is no difference in the result; the bug still occurs as before.
This message is a reminder that Fedora 20 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 20. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '20'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 20 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Given that f20 is end of life soon, I suggest any of the reporters check to see if this bug is still relevant on f21 or f22.
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release.

If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.