Bug 1758648

Summary: unable to kill / sync a container
Product: Fedora
Component: podman
Version: 31
Reporter: Lokesh Mandvekar <lsm5>
Assignee: Matthew Heon <mheon>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: bbaude, debarshir, dwalsh, gscrivan, jnovy, jonathan, lsm5, mheon, paul, p_s_oberoi, santiago
Status: CLOSED WORKSFORME
Severity: unspecified
Priority: unspecified
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2020-05-02 19:30:36 UTC

Description Lokesh Mandvekar 2019-10-04 17:31:36 UTC
Description of problem:

So, I have this single (zombie?) container that just refuses to die.

lsm5 @ nagato : ~(master) $ podman ps
CONTAINER ID  IMAGE                     COMMAND    CREATED       STATUS           PORTS  NAMES
7fe220fd9fba  nanadai.me/bionic:latest  /bin/bash  25 hours ago  Up 25 hours ago         frosty_mayer

[Skipped "Error: can only kill running containers.." from this one]
lsm5 @ nagato : ~(master) $ podman kill --all
2019-10-04T17:25:40.000761384Z: error opening file '/run/user/0/crun/7fe220fd9fba37eb2d1e4fc9565930168b1c8b1cd37b646ca12d7060716c6339/status': No such file or directory
Error: error sending signal to container 7fe220fd9fba37eb2d1e4fc9565930168b1c8b1cd37b646ca12d7060716c6339: `/usr/bin/crun kill 7fe220fd9fba37eb2d1e4fc9565930168b1c8b1cd37b646ca12d7060716c6339 9` failed: exit status 1


lsm5 @ nagato : ~(master) $ podman inspect 7fe220fd9fba | grep Pid
            "Pid": 16443,
            "ConmonPid": 16436,
        "ConmonPidFile": "/var/run/containers/storage/overlay-containers/7fe220fd9fba37eb2d1e4fc9565930168b1c8b1cd37b646ca12d7060716c6339/userdata/conmon.pid",
            "PidMode": "",
            "PidsLimit": 0,

lsm5 @ nagato : ~(master) $ sudo kill -9 16443
kill: sending signal to 16443 failed: No such process

lsm5 @ nagato : ~(master) $ podman ps -a --sync
ERRO[0000] unable to get container info: "error decoding container status for container 7fe220fd9fba37eb2d1e4fc9565930168b1c8b1cd37b646ca12d7060716c6339: EOF" 


Version-Release number of selected component (if applicable):
podman-1.6.1-2.fc31.x86_64

How reproducible:
observed consistently at the moment
(though I'm not sure exactly how to reproduce it from a clean install)

I wonder whether this will behave the same across a reboot.

Comment 1 Lokesh Mandvekar 2019-10-04 17:33:01 UTC
This doesn't happen with the other containers I have btw.

Comment 2 Lokesh Mandvekar 2019-10-04 17:45:53 UTC
So, I was able to reproduce this just once, on the very next reboot, with the same container image, but not in subsequent reboots.

Also, I consistently see a lot of the following messages when running the very first container:

ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 02edd56bf00b46851117a988ae23a370890029fb9e9fe9501a9e7f067fb89c07: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 0b1b7a8c2c9d3076427227facac71278815ad7fbeea1427808b7f3e70cbb4fbd: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 0eadba4583bff0ec18ffd0fcee080a129f3320c01183fe4bfc585127ecb89327: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 127177d12fc7ce80a18ea47b610352400829458f6e75c3947f5418e89be64bac: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 141f0fc567dfb7e62b845961b64d4089841e66b90eda6bc62c1df8e7f3a9f1fb: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 144c0836e377fd8b2bd59b835982e90dfab8db602090e55f2bb5cd5c1550c4fa: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 15fbd06ce5b1e6ac71dcdafba32fa8ef30c8773e499772d86d495e4e6dcc3789: missing prevResult from earlier plugin 
ERRO[0002] Error deleting network: missing prevResult from earlier plugin 
ERRO[0002] Error while removing pod from CNI network "podman": missing prevResult from earlier plugin 
ERRO[0002] Error refreshing container 1cee9987f12dd0d1b025a629010ac0eacee9dea05c9976a65fd98fdb914cc8f2: missing prevResult from earlier plugin

Comment 3 Lokesh Mandvekar 2019-10-04 17:47:13 UTC
(In reply to Lokesh Mandvekar from comment #2)
> So, I was able to reproduce this just once in the very next reboot for the
> same container image. But, not in subsequent reboots.
> 
> Also, I consistently see a lot of the following messages when running the
> very first container:
> [...]

The container does run after this, though; it's not a fatal error.

Comment 4 Matthew Heon 2019-10-04 18:21:50 UTC
I think we may have two issues here: one that caused the container to die without Conmon recording it, and another where trying to do a `crun status` on the container gets us an EOF. The second one is pretty concerning - it prevents `ps --sync` from accurately determining the status of a container when `crun` is in play.
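
[Editorial illustration of why that failure surfaces as "EOF": the following is a hypothetical Go sketch, not libpod's actual code. The `state` subcommand and the JSON fields follow the OCI runtime CLI convention; the point is that if the runtime exits without printing anything because its status file is gone, the JSON decoder reports EOF, much like the `ps --sync` error in the description.]

// Hypothetical sketch (not libpod's actual code): query an OCI runtime's
// `state` command and decode its JSON output. An empty stdout (the runtime
// errored before printing state, e.g. its status file under
// /run/user/0/crun was removed) makes Decode fail with io.EOF.
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// ociState mirrors the fields an OCI runtime's `state` command prints as JSON.
type ociState struct {
	ID     string `json:"id"`
	Status string `json:"status"`
	Pid    int    `json:"pid"`
}

func runtimeState(runtimePath, ctrID string) (*ociState, error) {
	cmd := exec.Command(runtimePath, "state", ctrID)
	out, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	var st ociState
	decodeErr := json.NewDecoder(out).Decode(&st) // io.EOF if nothing was printed
	waitErr := cmd.Wait()
	if decodeErr != nil {
		return nil, fmt.Errorf("error decoding container status for container %s: %w", ctrID, decodeErr)
	}
	if waitErr != nil {
		return nil, waitErr
	}
	return &st, nil
}

func main() {
	st, err := runtimeState("/usr/bin/crun", "7fe220fd9fba")
	if err != nil {
		fmt.Println("unable to get container info:", err)
		return
	}
	fmt.Println(st.Status)
}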

Comment 5 Lokesh Mandvekar 2019-10-07 00:12:02 UTC
Update: I let my system sit for a while with two running containers, came back, tried to exec into both, and now I see the error for both of them.

Comment 6 Jonathan Dieter 2019-12-03 20:28:10 UTC
I'm also seeing this on F31 with a number of my containers.  The containers are running fine, but I'm unable to exec into them, and I get an error when I try to stop them:

# podman exec -it mariadb /bin/bash
Error: error opening file '/run/user/0/crun/d2f2fbb1e6db942def1f1430342e8c8c6919f9696604c65780626e20569126c7/status': No such file or directory: OCI runtime command not found error
# podman exec -it nextcloud /bin/bash
Error: error opening file '/run/user/0/crun/97a9d1f70771ff8006c45c452ce9abd0af57cf7586c37ca038dfb89dea477d9b/status': No such file or directory: OCI runtime command not found error
# ls -l /run/user/0/crun/
total 0

Comment 7 Jonathan Dieter 2019-12-03 20:31:13 UTC
FWIW, I checked the modification time of the directory, and it looks like it was last changed at 4:00 AM. That makes it look like some kind of cron job might be overly aggressive while cleaning up:

# ls -ld /run/user/0/crun
drwx------. 2 root root 40 Dec  3 04:00 /run/user/0/crun

Comment 8 Daniel Walsh 2019-12-04 01:35:06 UTC
It looks like the runtime's state data for the running container was removed.

Comment 9 Daniel Walsh 2019-12-04 01:35:35 UTC
Can you stop and restart the container?

Comment 10 Giuseppe Scrivano 2019-12-04 09:01:18 UTC
Is the container under `/run/crun`?

Comment 11 Jonathan Dieter 2019-12-04 09:09:03 UTC
(In reply to Daniel Walsh from comment #9)
> Can you stop and restart the container?

The only way to stop the container is to manually kill conmon.  Once I've done that, I can restart the container.

Comment 12 Jonathan Dieter 2019-12-04 09:11:18 UTC
(In reply to Giuseppe Scrivano from comment #10)
> Is the container under `/run/crun`?

I'm not sure what you mean.  The container filesystem is in `/var/lib/containers/storage` (which I think is the default).  There is no `/run/crun` directory on my system.

Comment 13 Giuseppe Scrivano 2019-12-04 14:25:26 UTC
Debarshi, have you seen this issue before?

Comment 14 Jonathan Dieter 2019-12-04 21:54:10 UTC
Ok, I think I have some idea of what's happening, but not necessarily why.

I've set up libpod.conf to use crun as my runtime.  I have a systemd service that starts the container.  My service uses --uidmap and --gidmap to map the users in the container to unprivileged users on the host.  According to the `crun` man page, it uses $XDG_RUNTIME_DIR/crun when run as a non-privileged user, which here seems to be `/run/user/0/crun` (even though crun itself is running as root).  Whenever I log into my server as root and then log out again, /run/user/0 is wiped as part of the session cleanup.

In this case, it seems that crun should be using `/run/crun`, and I'm not sure why it's not (or if I've missed something).

I'm including the systemd service I'm using to start the container.  I realize there are some non-best-practices there, but I don't think anything there should be causing this particular bug.

$ rpm -q crun podman
crun-0.10.6-1.fc31.x86_64
podman-1.6.2-2.fc31.x86_64

$ cat /etc/systemd/system/nextcloud.service 
[Unit]
Description=Nextcloud
Wants=mariadb.service network-online.target
After=mariadb.service network-online.target

[Service]
Restart=on-failure
ExecStartPre=-/usr/bin/podman pull docker.io/library/nextcloud:stable
ExecStartPre=-/usr/bin/podman stop nextcloud
ExecStartPre=-/usr/bin/podman rm nextcloud
ExecStart=/usr/bin/podman run \
    --name nextcloud \
    --uidmap 0:110000:4999 \
    --gidmap 0:110000:4999 \
    --uidmap 65534:114999:1 \
    --gidmap 65534:114999:1 \
    --ip 10.88.1.3 \
    --add-host mariadb:10.88.1.2 \
    --add-host nextcloud:10.88.1.3 \
    --hostname nextcloud \
    --conmon-pidfile=/run/nextcloud.pid \
    --tty \
    -p 127.0.0.1:8888:80 \
    -v /fast/nextcloud/data:/var/www/html:Z \
    -v /home/nextcloud/users:/var/www/html/users:Z \
    docker.io/library/nextcloud:stable
ExecStop=/usr/bin/podman stop nextcloud
PIDFile=/run/nextcloud.pid

[Install]
WantedBy=multi-user.target

Comment 15 Jonathan Dieter 2019-12-04 22:10:56 UTC
A bit more (hopefully welcome) analysis:

In https://github.com/containers/libpod/blob/309452dcfc4bea9cbeff03c8b7c9a0811521e524/libpod/runtime.go#L77, there's a function to set XDG_RUNTIME_DIR if podman is run rootless.  In https://github.com/containers/crun/blob/91db0f4a76db33d6221e18ae0560cd715b2e3273/src/libcrun/status.c#L33, crun will use $XDG_RUNTIME_DIR/crun if the variable is set, whether the current UID is root or not.

I think what's happening is that the --uidmap flag is turning rootless mode on, even though the container is started using the root user.  Because rootless mode is turned on, podman sets XDG_RUNTIME_DIR, which then causes crun to use $XDG_RUNTIME_DIR/crun as its working path, which then causes problems if the root user ever logs out of the system.

I think the best solution would be to have crun check if it's running as root before using $XDG_RUNTIME_DIR/crun.
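
[Editorial illustration: a purely illustrative sketch in Go of the state-directory selection described in this comment; crun's real implementation is the C code linked above, this just mirrors the behavior as stated.]

// If XDG_RUNTIME_DIR is set, the runtime keeps per-container status under
// $XDG_RUNTIME_DIR/crun regardless of UID; otherwise it falls back to /run/crun.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func crunStateDir() string {
	if dir := os.Getenv("XDG_RUNTIME_DIR"); dir != "" {
		// Set by podman when it treats the invocation as rootless; for root
		// this becomes /run/user/0/crun, which session cleanup removes on logout.
		return filepath.Join(dir, "crun")
	}
	return "/run/crun" // default path, stable across root sessions
}

func main() {
	fmt.Println(crunStateDir())
}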

Comment 16 Jonathan Dieter 2019-12-07 14:45:50 UTC
I've spent more time on this today and figured out what's going wrong and how to fix it.  My previous analysis was incorrect in that IsRootless() does return false when run as root, even if --uidmap is passed.  It also turns out that, in rootless mode, crun still thinks it's root, so my original plan for a fix also doesn't work.

The real problem is that GetRuntimeDir() in pkg/util/utils_supported.go will *always* return /run/user/<uid>, whether the container is started by root or not.  This also applies to containers started by root without the --uidmap flag.  I think the best fix is to update GetRuntimeDir() to return an empty path if IsRootless() is false.  The runtime should then use its default path for storing state, which should be accessible by root.  For crun, the default path is /run/crun.

I've put up a PR to implement this fix: https://github.com/containers/libpod/pull/4657
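
[Editorial illustration: a minimal sketch of the shape of that fix, not the actual PR diff. GetRuntimeDir and IsRootless are the names from the comments above; the stand-in implementations here are assumptions.]

// When podman runs as real root, return an empty runtime dir so the OCI
// runtime falls back to its default state path (/run/crun for crun) rather
// than /run/user/<uid>, which session cleanup can remove.
package main

import (
	"fmt"
	"os"
)

// isRootless stands in for libpod's IsRootless(); checking the effective UID
// here is an assumption for illustration only.
func isRootless() bool {
	return os.Geteuid() != 0
}

func getRuntimeDir() (string, error) {
	if !isRootless() {
		// Running as root: no per-user runtime dir; let the runtime use its default.
		return "", nil
	}
	if dir := os.Getenv("XDG_RUNTIME_DIR"); dir != "" {
		return dir, nil
	}
	// Rootless but XDG_RUNTIME_DIR unset: fall back to the conventional location.
	return fmt.Sprintf("/run/user/%d", os.Getuid()), nil
}

func main() {
	dir, _ := getRuntimeDir()
	fmt.Printf("runtime dir: %q\n", dir)
}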

Comment 17 Matthew Heon 2020-01-20 23:48:33 UTC
This ought to be fixed, as Podman 1.7.0 was released with the given patch included.

Comment 18 Debarshi Ray 2020-03-09 16:10:58 UTC
(In reply to Giuseppe Scrivano from comment #13)
> Debarshi, have you seen this issue before?

No, I don't recall seeing this ever.

Comment 19 Jonathan Dieter 2020-05-02 19:30:36 UTC
This has been fixed since Podman 1.7.0 was released.