Bug 1066926 - cannot start libvirt-lxc containers: Failed to connect socket
Summary: cannot start libvirt-lxc containers: Failed to connect socket
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: libvirt
Version: 20
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Libvirt Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-02-19 10:44 UTC by Tomasz Torcz
Modified: 2015-06-29 15:18 UTC
CC: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-06-29 15:18:42 UTC
Type: Bug
Embargoed:


Attachments
libvirtd log file (2.99 MB, text/plain)
2015-02-26 12:35 UTC, Thomas Dreibholz
Container log file (1.34 MB, text/plain)
2015-02-26 12:36 UTC, Thomas Dreibholz
"rpm -qa" output inside a container (9.53 KB, text/plain)
2015-02-26 13:19 UTC, Thomas Dreibholz

Description Tomasz Torcz 2014-02-19 10:44:36 UTC
Description of problem:
After upgrading F19->F20 I am unable to start LXC containers:

#  virsh --connect lxc:// start krelln
error: Failed to start domain krelln
error: Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused

SELinux is in permissive mode.

Version-Release number of selected component (if applicable):
libvirt-1.1.3.3-5.fc20.x86_64
libvirt-daemon-lxc-1.1.3.3-5.fc20.x86_64
libvirt-daemon-driver-lxc-1.1.3.3-5.fc20.x86_64
systemd-208-14.fc20.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Try to start LXC container.


Additional info:
# LIBVIRT_LOG_FILTERS=1:lxc LIBVIRT_LOG_OUTPUTS=1:stderr /usr/sbin/libvirtd
2014-02-19 10:36:11.856+0000: 7627: info : libvirt version: 1.1.3.3, package: 5.fc20 (Fedora Project, 2014-02-01-20:08:12, buildvm-17.phx2.fedoraproject.org)
2014-02-19 10:36:11.856+0000: 7627: debug : lxcContainerAvailable:2076 : container support is enabled
2014-02-19 10:36:11.864+0000: 7775: warning : qemuMonitorVMStatusToPausedReason:3365 : Qemu reported unknown VM status: 'suspended'
2014-02-19 10:36:11.923+0000: 7627: debug : lxcContainerAvailable:2076 : container support is enabled
2014-02-19 10:36:11.923+0000: 7627: info : lxcSecurityInit:1451 : lxcSecurityInit (null)
2014-02-19 10:36:11.925+0000: 7627: debug : virLXCDriverCapsInit:143 : Initialized caps for security driver "selinux" with DOI "0"
2014-02-19 10:36:11.926+0000: 7627: error : virDomainObjParseNode:12677 : XML error: unexpected root element <domain>, expecting <domstatus>
2014-02-19 10:36:11.929+0000: 7627: debug : virLXCProcessStart:1043 : Setting current domain def as transient
2014-02-19 10:36:11.930+0000: 7627: debug : virLXCProcessStart:1068 : Preparing host devices
2014-02-19 10:36:11.930+0000: 7627: debug : virLXCProcessStart:1084 : Generating domain security label (if required)
2014-02-19 10:36:11.930+0000: 7627: debug : virLXCProcessStart:1095 : Setting domain security labels
2014-02-19 10:36:11.931+0000: 7627: debug : virLXCProcessSetupInterfaceBridged:252 : calling vethCreate()
2014-02-19 10:36:11.936+0000: 7627: debug : virLXCProcessSetupInterfaceBridged:256 : parentVeth: veth0, containerVeth: veth1
2014-02-19 10:36:12.075+0000: 7627: error : virNetSocketNewConnectUNIX:567 : Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused
2014-02-19 10:36:12.076+0000: 7627: debug : virLXCMonitorDispose:199 : mon=0x7f7b38215640
2014-02-19 10:36:12.083+0000: 7627: error : virCommandWait:2376 : internal error: Child process (ip link del veth1) unexpected exit status 1: Cannot find device "veth1"

2014-02-19 10:36:12.083+0000: 7627: error : virLXCProcessAutostartDomain:1365 : Failed to autostart VM 'krelln': Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused

2014-02-19 10:36:43.262+0000: 7621: debug : virLXCProcessStart:1043 : Setting current domain def as transient
2014-02-19 10:36:43.263+0000: 7621: debug : virLXCProcessStart:1068 : Preparing host devices
2014-02-19 10:36:43.263+0000: 7621: debug : virLXCProcessStart:1084 : Generating domain security label (if required)
2014-02-19 10:36:43.264+0000: 7621: debug : virLXCProcessStart:1095 : Setting domain security labels
2014-02-19 10:36:43.264+0000: 7621: debug : virLXCProcessSetupInterfaceBridged:252 : calling vethCreate()
2014-02-19 10:36:43.270+0000: 7621: debug : virLXCProcessSetupInterfaceBridged:256 : parentVeth: veth0, containerVeth: veth1
2014-02-19 10:36:43.435+0000: 7621: error : virNetSocketNewConnectUNIX:567 : Failed to connect socket to '/var/run/libvirt/lxc/krelln.sock': Connection refused
2014-02-19 10:36:43.435+0000: 7621: debug : virLXCMonitorDispose:199 : mon=0x7f7b2c003780
2014-02-19 10:36:43.441+0000: 7621: error : virCommandWait:2376 : internal error: Child process (ip link del veth1) unexpected exit status 1: Cannot find device "veth1"


#####################
# cat /var/log/libvirt/lxc/krelln.log

2014-02-19 10:36:11.961+0000: starting up
PATH=/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_FILTERS=1:lxc  LIBVIRT_LOG_OUTPUTS=1:stderr /usr/libexec/libvirt_lxc --name krelln --console 27 --security=selinux --handshake 30 --background --veth veth1
2014-02-19 10:36:12.055+0000: 7792: info : libvirt version: 1.1.3.3, package: 5.fc20 (Fedora Project, 2014-02-01-20:08:12, buildvm-17.phx2.fedoraproject.org)
2014-02-19 10:36:12.055+0000: 7792: info : virLXCDriverCapsInit:145 : No driver, not initializing security driver
2014-02-19 10:36:12.057+0000: 7792: debug : main:2373 : Security model selinux type none label (null) imagelabel (null)
2014-02-19 10:36:12.064+0000: 7792: error : virDBusCallMethod:1173 : Input/output error
2014-02-19 10:36:43.296+0000: starting up
PATH=/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_FILTERS=1:lxc  LIBVIRT_LOG_OUTPUTS=1:stderr /usr/libexec/libvirt_lxc --name krelln --console 28 --security=selinux --handshake 31 --background --veth veth1
2014-02-19 10:36:43.340+0000: 7931: info : libvirt version: 1.1.3.3, package: 5.fc20 (Fedora Project, 2014-02-01-20:08:12, buildvm-17.phx2.fedoraproject.org)
2014-02-19 10:36:43.340+0000: 7931: info : virLXCDriverCapsInit:145 : No driver, not initializing security driver
2014-02-19 10:36:43.418+0000: 7931: debug : main:2373 : Security model selinux type none label (null) imagelabel (null)
2014-02-19 10:36:43.424+0000: 7931: error : virDBusCallMethod:1173 : Input/output error



Similar bugs: #996851 #782038 #1012648

Comment 1 Daniel Berrangé 2014-02-19 11:14:37 UTC
> 2014-02-19 10:36:43.424+0000: 7931: error : virDBusCallMethod:1173 : Input/output error

Sigh. This is an unhelpful error message returned by systemd/machined I'm afraid.

Can you give me some more info from the following commands:

 $ machinectl 
 $ systemctl --all | grep machine

Comment 2 Tomasz Torcz 2014-02-19 11:27:46 UTC
# machinectl
MACHINE                          CONTAINER SERVICE
qemu-rolling                     vm        libvirt-qemu

#

# systemctl --all | grep machine
machine-lxc\x2dkrelln.scope       loaded failed failed  Container lxc-krelln
machine-qemu\x2drolling.scope     loaded active running Virtual Machine qemu-rolling
machine.slice                     loaded active active  Virtual Machine and Container Slice

# systemctl status machine-lxc\x2dkrelln.scope
machine-lxcx2dkrelln.scope
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)


# systemctl status systemd-machined
systemd-machined.service - Virtual Machine and Container Registration Service
   Loaded: loaded (/usr/lib/systemd/system/systemd-machined.service; static)
   Active: inactive (dead)
     Docs: man:systemd-machined.service(8)
           http://www.freedesktop.org/wiki/Software/systemd/machined

Feb 19 11:36:43 mother.pipebreaker.pl systemd-machined[5805]: Failed to start machine scope: Unit machine-lxc\x2dkrelln.scope already exists.
Feb 19 11:40:56 mother.pipebreaker.pl systemd-machined[5805]: Failed to start machine scope: Unit machine-lxc\x2dkrelln.scope already exists.
Feb 19 11:41:04 mother.pipebreaker.pl systemd-machined[5805]: Failed to start machine scope: Unit machine-lxc\x2dkrelln.scope already exists.

Comment 3 Daniel Berrangé 2014-02-19 11:37:59 UTC
Ok, that's what I was afraid of.

In some cases when systemd fails to create the machine for libvirt, it'll fail to clean up properly. This means that later attempts to start the VM will show the 'already exists' error. Unfortunately I've not found any way to get out of this problem state once systemd gets confused, other than changing the name of the VM or rebooting, both of which suck :-(

What is not clear, though, is what the root cause was that made systemd fail the very first time around. Do you recall anything special about the very first time you tried to do 'virsh start krelln'?

If you reboot and this problem still occurs, then the best thing is to run 'strace' on the machined process, e.g.

 reboot
 strace -s 10000 -f -p PID-OF-SYSTEMD-MACHINED

and then run 'virsh start krelln'. Somewhere in the strace data will be systemd's hidden real error message about the root cause.
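
For instance, one way to fill in the PID (a sketch, assuming systemd-machined is already running and pidof is available):

 strace -s 10000 -f -o /tmp/machined.strace -p "$(pidof systemd-machined)"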

Comment 4 Tomasz Torcz 2014-02-19 11:46:58 UTC
In /sys/fs/cgroup/systemd/system.slice I have:
drwxr-xr-x. 2 root root 0 Jan  6 12:34 machine-krelln.scope
drwxr-xr-x. 2 root root 0 Jan  6 12:34 machine-lxc\x2dkrelln.scope

But I cannot remove those dirs using "rm"; the operation is not permitted even as root.

Unfortunately I do not remember anything special. I'm not sure what was special about January 6th; the machine was booted on Sun Dec 29 14:38:53 2013.

Strace at the moment is not very enlightening:

19744 sendmsg(3, {msg_name(0)=NULL, msg_iov(4)=[{"PRIORITY=3\nSYSLOG_FACILITY=4\nCODE_FILE=../src/machine/machine.c\nCODE_LINE=246\nCODE_FUNCTION=machine_start_scope\nSYSLOG_IDENTIFIER=systemd-machined\n", 147}, {"MESSAGE=", 8}, {"Failed to start machine scope: Unit machine-lxc\\x2dkrelln.scope already exists.", 79}, {"\n", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 235

Comment 5 Daniel Berrangé 2014-02-19 11:57:16 UTC
Yep, I'm afraid strace will only be useful if you're able to reboot your host, or alternatively if you can reproduce the problem with a 2nd LXC guest. Basically we need the strace from the first time systemd fails for a particular guest name.

Comment 6 Tomasz Torcz 2014-02-20 12:44:47 UTC
I was able to do the cleanup _without_ a reboot and start the guest. I will later shut it down and try to start it again. For the record, the required cleanup steps were:

systemctl stop systemd-machined

/sys/fs/cgroup ]# find . -name *krelln*
./perf_event/machine.slice/machine-lxc\x2dkrelln.scope
./net_cls/machine.slice/machine-lxc\x2dkrelln.scope
./freezer/machine.slice/machine-lxc\x2dkrelln.scope
./cpuset/machine.slice/machine-lxc\x2dkrelln.scope
# the above directories could not be rm'd, but they can be moved (mv) aside to some temporary name


/run/systemd ]# find . -name *krelln*
./system/machine-lxc\x2dkrelln.scope
./system/machine-lxc\x2dkrelln.scope.d
# rm -rf those


systemctl stop "machine-lxc\x2dkrelln.scope"
systemctl kill "machine-lxc\x2dkrelln.scope"
# lingering unit disappeared

systemctl daemon-reload

systemctl start systemd-machined


At that point, virsh --connect lxc:// start krelln succeeds!
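
For convenience, the same workaround as a single sketch (untested as one script; it assumes the cgroup v1 layout above and the container name "krelln", and moves the stuck cgroup directories aside rather than deleting them):

 systemctl stop systemd-machined

 # move aside the leftover cgroup scope directories (they cannot be rm'd)
 for d in /sys/fs/cgroup/*/machine.slice/machine-lxc\\x2dkrelln.scope; do
     mv "$d" "$d.stale"
 done

 # remove the leftover runtime unit files
 rm -rf /run/systemd/system/machine-lxc\\x2dkrelln.scope \
        /run/systemd/system/machine-lxc\\x2dkrelln.scope.d

 systemctl stop "machine-lxc\x2dkrelln.scope"
 systemctl kill "machine-lxc\x2dkrelln.scope"
 systemctl daemon-reload
 systemctl start systemd-machined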

Comment 7 Tomasz Torcz 2014-02-22 09:41:00 UTC
A few restarts later I cannot reproduce the issue, meh.

Nevertheless, the cleanup steps from comment 6 can be used to work around the issue when it happens.

Comment 8 Daniel Berrangé 2014-02-24 10:04:53 UTC
Thanks for figuring out the cleanup steps and attempting to reproduce it. I'll close this bug as INSUFFICIENT DATA since we currently can't figure out how to reproduce it. If you see it again, feel free to re-open it, of course.

Comment 9 Thomas Dreibholz 2015-02-26 10:12:06 UTC
I am working with Fedora Core 20-based installations of the PlanetLab software (see http://benlomond.nntb.no/lxc/2015.02.26--nightly--lxc-f20-linux64/), which is based on LXC containers, and I experience the same (or a quite similar) issue: destroying a container does not work properly.

"virsh destroy ..." leaves files like:
/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dbs_test24.scope
/sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dbs_test24.scope
/sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dbs_test24.scope
/sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dbs_test24.scope

The following "virsh undefine" also does not remove them. Then, re-creating a container with the same name fails. I can reliably reproduce this issue in my automatic build-and-test setup at http://benlomond.nntb.no/lxc/ . The people at OneLab experience the same issue in their build-and-test setup at http://build.onelab.eu/lxc/ as well.

Furthermore, I made a manual installation for debugging. If certain files need to be looked at, I can help with further debugging of the problem.

Comment 10 Thomas Dreibholz 2015-02-26 10:18:30 UTC
An important thing to mention: when reproducing the bug in my build-and-test system, the virsh destroy/undefine calls run inside a systemd-based service. In this case, the issue seems to happen every time. When running the destroy/undefine manually, some destroy -> recreate rounds may work, but after a few rounds the system always runs into the same issue.

Observation 1: When the problem does *not* occur, there are no remaining scope files after "virsh destroy". Otherwise, if such files are left after "virsh destroy", the problem has hit and the scope files will not be removed any more.

Observation 2: I also added a call to shutdown before destroy, with a 5 s delay between shutdown and destroy => no change. An interesting thing I noticed in the log file of the LXC container, i.e. /var/log/libvirt/lxc/<sliver>.log:

Round 1:
2015-02-25 20:03:53.909+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=9415a960-f32f-4f8d-ad32-9c90fd764356 LIBVIRT_LXC_UUID=9415a960-f32f-4f8d-ad32-9c90fd764356 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 2:
2015-02-25 20:06:00.121+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=71410263-418d-431a-8635-8a35e70077a3 LIBVIRT_LXC_UUID=71410263-418d-431a-8635-8a35e70077a3 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 3:
2015-02-25 20:08:07.562+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=5d0040a4-4d64-4336-b09b-cf2ce329c840 LIBVIRT_LXC_UUID=5d0040a4-4d64-4336-b09b-cf2ce329c840 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 4:
2015-02-25 20:10:15.479+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=ec1253fa-445b-4b27-8718-305ac216ce60 LIBVIRT_LXC_UUID=ec1253fa-445b-4b27-8718-305ac216ce60 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init

Round 5:
2015-02-25 20:12:23.974+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=2aeeab9b-ef50-4542-9c8b-96500301782a LIBVIRT_LXC_UUID=2aeeab9b-ef50-4542-9c8b-96500301782a LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init


So far, everything went fine for 5 create -> destroy rounds.


Round 6:
2015-02-25 20:14:30.783+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
PATH=/bin:/sbin TERM=linux container=lxc-libvirt HOME=/ container_uuid=304c7c78-3952-4e37-8217-095988a19181 LIBVIRT_LXC_UUID=304c7c78-3952-4e37-8217-095988a19181 LIBVIRT_LXC_NAME=cs_qqq003_0 /sbin/init
2015-02-25 20:15:34.843+0000: 3976: info : libvirt version: 1.2.11, package: 2 (PlanetLab/OneLab, 2015-02-24-14:30:38, 2015-02-24--test--lxc-f20-linux64)
2015-02-25 20:15:34.843+0000: 3976: error : virNetSocketReadWire:1575 : End of file while reading data: Input/output error

The two lines above are exactly at the point when the destroy call is made. virNetSocketReadWire() is a function in libvirt.


(The last destruction leaves the files behind; from the nodemanager log, with debug output I added:
Wed Feb 25 20:15:35 2015: -TMP-FIND cs_qqq003_0
Wed Feb 25 20:15:35 2015: log_call: running command /usr/bin/find /sys/fs/ -name *cs_qqq003_0* -exec /usr/bin/echo {} ;
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
Wed Feb 25 20:15:35 2015: log_call: /sys/fs/cgroup/systemd/machine.slice/machine-lxc\x2dcs_qqq003_0.scope)


The following round then fails to create the sliver:

Round 7:
2015-02-25 20:16:38.476+0000: starting up
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin LIBVIRT_DEBUG=3 LIBVIRT_LOG_OUTPUTS=3:stderr /usr/libexec/libvirt_lxc --name cs_qqq003_0 --console 20 --security=none --handshake 32 --background --veth vnet0
2015-02-25 20:16:38.486+0000: 16585: info : libvirt version: 1.2.11, package: 2 (PlanetLab/OneLab, 2015-02-24-14:30:38, 2015-02-24--test--lxc-f20-linux64)
2015-02-25 20:16:38.486+0000: 16585: error : virDBusCall:1542 : error from service: CreateMachine: Input/output error
error from service: CreateMachine: Input/output error

Comment 11 Daniel Berrangé 2015-02-26 10:27:57 UTC
(In reply to Thomas Dreibholz from comment #9)
> I am working with Fedora Core 20-based installations of the PlanetLab
> software (see
> http://benlomond.nntb.no/lxc/2015.02.26--nightly--lxc-f20-linux64/), which
> is based on LXC containers, and experience the same (or a quite similar)
> issue: destroying a container does not work properly.
> 
> "virsh destroy ..." leaves files like:
> /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dbs_test24.scope
> /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dbs_test24.scope
> /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dbs_test24.scope
> /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dbs_test24.scope

When these dirs get left behind, can you check the 'tasks' file in those dirs and see if any PIDs are listed?

virsh destroy should kill all the cgroup directories.

Also, what version of libvirt do you have currently?
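
For example, a quick way to list the tasks files across all controllers at once (a sketch, assuming the cgroup v1 layout from comment 9, with "bs_test24" as the example name):

 for d in /sys/fs/cgroup/*/machine.slice/machine-lxc\\x2dbs_test24.scope; do
     echo "== $d"; cat "$d/tasks"
 done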

Comment 12 Thomas Dreibholz 2015-02-26 10:38:08 UTC
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# rpm -q libvirt
libvirt-1.2.11-2.x86_64

But the issue is much older: since about May 2014, the automatic build-and-test scripts have been failing for F20. The libvirt version back then was about 1.2.3 or 1.2.4.


The tasks file is empty:

[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# pwd
/sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dcs_qqq003_0.scope
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# ls -al
total 0
drwxr-xr-x  2 root root 0 Feb 25 21:13 .
drwxr-xr-x 67 root root 0 Feb 25 20:14 ..
-rw-r--r--  1 root root 0 Feb 25 21:13 cgroup.clone_children
-rw-r--r--  1 root root 0 Feb 25 21:13 cgroup.procs
-r--r--r--  1 root root 0 Feb 25 21:13 freezer.parent_freezing
-r--r--r--  1 root root 0 Feb 25 21:13 freezer.self_freezing
-rw-r--r--  1 root root 0 Feb 25 21:13 freezer.state
-rw-r--r--  1 root root 0 Feb 25 21:13 notify_on_release
-rw-r--r--  1 root root 0 Feb 25 20:15 tasks
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# cat tasks
[root@pinnekjoett machine-lxc\x2dcs_qqq003_0.scope]# cat *
0
0
0
THAWED
0

Comment 13 Thomas Dreibholz 2015-02-26 10:42:27 UTC
About "virsh destroy should kill all the cgroup directories.": this is also what I found in the libvirt code.

The interesting thing (from journalctl):
Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16)
Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16)
Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16)
Feb 25 20:15:35 pinnekjoett.alpha.test libvirtd[9757]: Unable to remove /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/ (16)

(16) is the value of errno. 16 = EBUSY.

The error message seems to come from virCgroupRemoveRecursively() in libvirt. As the name suggests, it tries to recursively remove the directory structure. However, there are still files inside:
# ls -al /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/*
-rw-r--r-- 1 root root 0 Feb 25 21:04 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/cgroup.clone_children
-rw-r--r-- 1 root root 0 Feb 25 21:04 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/cgroup.procs
-rw-r--r-- 1 root root 0 Feb 25 21:04 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/notify_on_release
-rw-r--r-- 1 root root 0 Feb 25 20:14 /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2dcs_qqq003_0.scope/tasks

So, EBUSY is plausible. virCgroupRemoveRecursively() tries to remove directories (rmdir) but not files.

Comment 14 Thomas Dreibholz 2015-02-26 10:43:40 UTC
I also tried "rm" on the files (in the shell, as root):
rm /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/*
rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.clone_children'? y
rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.clone_children': Operation not permitted
rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.procs'? y
rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/cgroup.procs': Operation not permitted
rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/notify_on_release'? y
rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/notify_on_release': Operation not permitted
rm: remove regular empty file '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/tasks'? y
rm: cannot remove '/sys/fs/cgroup/perf_event/machine.slice/machine-lxc\\x2dcs_qqq003_0.scope/tasks': Operation not permitted

So, maybe the issue is that libvirt cannot remove the files (operation not permitted), and is then unable to remove the directories?

Comment 15 Daniel Berrangé 2015-02-26 11:07:14 UTC
The cgroups filesystem is slightly magic in that you don't ever delete individual files. rmdir on a directory will succeed if the 'tasks' file has no PIDs in it.
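
To illustrate (a sketch only; "bs_test24" is an example name, and this is exactly the step that fails with EBUSY in this report):

 cat '/sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dbs_test24.scope/tasks'   # empty output => no PIDs left
 rmdir '/sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2dbs_test24.scope'       # rmdir, not rm, is how an empty cgroup is removed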

Comment 16 Daniel Berrangé 2015-02-26 11:16:38 UTC
If you can reproduce it reasonably reliably, it would be interesting to get extra debug logs.

In /etc/libvirt/lxc.conf set

  log_with_libvirtd=1

In /etc/libvirt/libvirtd.conf set

  log_filters="1:lxc 1:cgroup 1:systemd"
  log_outputs="1:stderr 1:file:/var/log/libvirt/libvirtd.log"

NB: I intentionally have two outputs in this second setting. Even though we don't care about stderr from libvirtd itself, keeping the stderr output ensures that stderr from libvirt_lxc gets saved into the per-VM log.

Delete all existing VM logs (/var/log/libvirt/lxc/*.log).

Then

  systemctl restart libvirtd.service


Now try to reproduce the error with a guest. Once you succeed, then attach the /var/log/libvirt/libvirtd.log and /var/log/libvirt/lxc/$GUEST.log to this bug.
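
Putting that together, a minimal reproduce-and-collect sequence might look like this (a sketch; "$GUEST" is whatever container name you use):

 GUEST=cs_qqq003_0                      # example name, adjust as needed
 rm -f /var/log/libvirt/lxc/*.log
 systemctl restart libvirtd.service
 # repeat start/destroy until destroy leaves the scope directories behind
 virsh --connect lxc:// start "$GUEST"
 virsh --connect lxc:// destroy "$GUEST"
 # then attach /var/log/libvirt/libvirtd.log and /var/log/libvirt/lxc/$GUEST.log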

Comment 17 Thomas Dreibholz 2015-02-26 12:35:58 UTC
Created attachment 995552 [details]
libvirtd log file

Comment 18 Thomas Dreibholz 2015-02-26 12:36:28 UTC
Created attachment 995553 [details]
Container log file

Comment 19 Thomas Dreibholz 2015-02-26 12:38:22 UTC
Here are the log files. The container as_nornet1_0 went through 3 create -> destroy cycles. The 3rd container destruction failed with the files left behind, finally causing the 4th container creation to fail.

Comment 20 Daniel Berrangé 2015-02-26 12:49:53 UTC
Excellent, that is useful information. I can indeed see an error on the 3rd shutdown attempt:

2015-02-26 12:30:32.227+0000: 31959: debug : virLXCProcessStop:660 : Stopping VM name=as_nornet1_0 pid=6254 reason=2
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillPainfully:3592 : cgroup=0x7ffc1001e1b0 path=
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillRecursive:3571 : group=0x7ffc1001e1b0 path= signum=15
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillRecursiveInternal:3505 : group=0x7ffc1001e1b0 path= signum=15 pids=0x7ffc14177c50
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillInternal:3374 : group=0x7ffc1001e1b0 path= signum=15 pids=0x7ffc14177c50
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillInternal:3386 : No file /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/tasks, assuming done
2015-02-26 12:30:32.228+0000: 31959: debug : virCgroupKillRecursiveInternal:3515 : Iterate over children of /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (killedAny=0)
2015-02-26 12:30:32.229+0000: 31959: debug : virCgroupKillRecursiveInternal:3518 : Path /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ does not exist, assuming done
2015-02-26 12:30:32.229+0000: 31959: debug : virCgroupKillPainfully:3603 : Iteration 0 rc=0
2015-02-26 12:30:32.229+0000: 31959: debug : virCgroupKillPainfully:3610 : Complete 0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCProcessCleanup:158 : Stopping VM name=as_nornet1_0 pid=6254 reason=2
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorClose:213 : mon=0x7ffc10020fb0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorClose:219 : Clear EOF callback mon=0x7ffc10020fb0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorEOFNotify:123 : EOF notify mon=0x7ffc10020fb0
2015-02-26 12:30:32.229+0000: 31959: debug : virLXCMonitorEOFNotify:133 : No EOF callback
2015-02-26 12:30:32.229+0000: 31951: debug : virLXCMonitorDispose:201 : mon=0x7ffc10020fb0
2015-02-26 12:30:32.230+0000: 6254: error : virNetSocketReadWire:1575 : End of file while reading data: Input/output error
2015-02-26 12:30:32.230+0000: 6254: debug : virLXCControllerClientCloseHook:719 : Client 0x7f899b12fc70 has closed
2015-02-26 12:30:32.230+0000: 6254: debug : virLXCControllerClientPrivateFree:731 : Got private data free 0x7f899b129fa0
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3331 : Removing cgroup 
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/cpuset/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/memory/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/devices/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/freezer/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/blkio/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/net_cls,net_prio/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3352 : Removing cgroup /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2das_nornet1_0.scope/ and all child cgroups
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemoveRecursively:3302 : Removing cgroup /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2das_nornet1_0.scope/
2015-02-26 12:30:32.336+0000: 31959: error : virCgroupRemoveRecursively:3305 : Unable to remove /sys/fs/cgroup/perf_event/machine.slice/machine-lxc\x2das_nornet1_0.scope/ (16)
2015-02-26 12:30:32.336+0000: 31959: debug : virCgroupRemove:3356 : Done removing cgroup 


Notice it is getting errno==16==EBUSY when trying to delete the cgroup.

Also, it is earlier getting an ENOENT error killing tasks in /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-lxc\x2das_nornet1_0.scope/ 

I wonder if there are somehow tasks present in the other cgroup controllers that were not present in the main cpu,cpuacct controller. That could certainly explain the scenario we're seeing.
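
The next time it happens, one quick way to test that hypothesis (a sketch; the scope name is taken from the logs above) is to look for any process still attached to the scope in any controller:

 grep -l 'machine-lxc\\x2das_nornet1_0' /proc/[0-9]*/cgroup 2>/dev/null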

Out of interest, what do you actually have running inside the containers? Are they single applications, full Fedora installs, or something else?

Comment 21 Thomas Dreibholz 2015-02-26 13:19:51 UTC
Created attachment 995588 [details]
"rpm -qa" output inside a container

Comment 22 Thomas Dreibholz 2015-02-26 13:25:06 UTC
The system inside the container is probably a quite standard, lightweight installation of Fedora Core 18. I have attached the output of "rpm -qa".

Note that using Fedora Core 20 inside the container runs into the same problem, so the version inside the container makes no difference for this bug.

Comment 23 Thomas Dreibholz 2015-02-26 13:30:00 UTC
An important thing to mention: when Fedora Core 18 is used to host the system, the bug does not occur; container destruction always works as expected. The Fedora Core 18 system also uses the same version of libvirt (1.2.11), but it had worked with older versions as well. For my production NorNet installation I am therefore still on Fedora Core 18. The bug actually prevents me from upgrading my systems from 18 to 20.

Comment 24 Daniel Berrangé 2015-02-26 14:21:04 UTC
IIRC, Fedora 18 did not have the systemd support for machined, so libvirt would be taking a different codepath on that host, thus avoiding the error scenario.

Comment 25 Thomas Dreibholz 2015-02-26 14:47:13 UTC
The last successful F20 build-and-test run at OneLab seems to have been on May 26, 2014 (http://web.archive.org/web/20140621155333/http://build.onelab.eu/lxc/): 2014.05.26--f20.warn. Unfortunately, the Wayback Machine has not stored the build-and-test log. So, probably some change in the Fedora Core 20 packages introduced the issue. The build takes most of the packages directly from an F20 mirror and just adds a newer libvirt version (upstream sources without any changes) and some applications that should not affect containers.

Comment 26 Thierry Parmentelat 2015-02-26 15:26:33 UTC
I would like to chime in at this point. 

I do not think it is safe to assume that this feature was working for us before that date (May 26 last year). As I recall, the tests at that time were not as complete as they are now, and the issue was there but just went undetected.

As for the question about what is going on in the container: in the context of the tests, each container is a rather small Fedora 20 instance (Thomas uses Fedora 18, but as he says the results are the same). We add a very simple user-defined init script that gets triggered through systemd. We also have a home-brewed mechanism for punching holes in the isolation between host and guest, named vsys, which is implemented as processes running on the host and listening on pipes that can be reached from the guest.

It might very well be that either of these two mechanisms gets in the way when we try to tear down a container properly, but if that is the case it is hard for us to fix the problem, because we do not understand what is going on under the hood.

Hope this helps.

Comment 27 Thomas Dreibholz 2015-02-27 12:27:03 UTC
I also tried a container installation created from the Fedora template
/usr/share/lxc/templates/lxc-fedora, i.e. a plain, minimal F20 installation inside the container. There is no difference in the result; the bug still occurs as before.

Comment 28 Fedora End Of Life 2015-05-29 10:58:43 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 20 reached end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 29 Cole Robinson 2015-05-31 18:51:03 UTC
Given that F20 reaches end of life soon, I suggest that any of the reporters check whether this bug is still relevant on F21 or F22.

Comment 30 Fedora End Of Life 2015-06-29 15:18:42 UTC
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

