Description of problem:

I needed to patch some containers on my OSP15 undercloud with the following script:

[stack@undercloud-0 ~]$ cat x.sh
"""
MISTRAL_CONTAINERS="mistral_api mistral_executor mistral_event_engine mistral_engine"
for i in $MISTRAL_CONTAINERS; do
    sudo podman exec -it -u root $i sh -c 'dnf install -y http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 15-trunk; dnf install -y patch patchutils'
    sudo podman exec -it -u root $i sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/tripleo_common/*" | patch -d /usr/lib/python3.6/site-packages/tripleo_common/ -p2 -b -z .service; rm /tmp/foo.b64'
    sudo podman exec -it -u root $i sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/workbooks/*" | patch -d /usr/share/tripleo-common/workbooks/ -p2 -b -z .service; rm /tmp/foo.b64'
    sudo systemctl restart "tripleo_$i.service"
done
"""

The script failed on one container because 'podman exec' thought it was not running:

"""
+ sudo podman exec -it -u root mistral_executor sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/tripleo_common/*" | patch -d /usr/lib/python3.6/site-packages/tripleo_common/ -p2 -b -z .service; rm /tmp/foo.b64'
curl (https://review.openstack.org/changes/638323/revisions/current/patch?download): response: 200, time: 0.470660, size: 8132
+ sudo podman exec -it -u root mistral_executor sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/workbooks/*" | patch -d /usr/share/tripleo-common/workbooks/ -p2 -b -z .service; rm /tmp/foo.b64'
curl (https://review.openstack.org/changes/638323/revisions/current/patch?download): response: 200, time: 0.480547, size: 8132
+ sudo systemctl restart tripleo_mistral_executor.service
+ for i in $MISTRAL_CONTAINERS
+ sudo podman exec -it -u root mistral_event_engine sh -c 'dnf install -y http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 15-trunk; dnf install -y patch patchutils'
cannot exec into container that is not running
"""

Indeed, 'podman ps' thinks the container is gone:

[stack@undercloud-0 ~]$ sudo podman ps | grep mistral_event_engine

But that is not true: it really is still running, yet 'podman ps -a' shows it as merely Created:

[stack@undercloud-0 ~]$ sudo podman ps -a | grep mistral_event_engine
0a4ae7dec2ca  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mistral-event-engine:latest  kolla_start  8 hours ago  Created  mistral_event_engine

To fix/work around this I need to run 'podman ps --sync':

[stack@undercloud-0 ~]$ sudo podman ps --sync | grep mistral_event_engine
0a4ae7dec2ca  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mistral-event-engine:latest  kolla_start  8 hours ago  Up 8 hours ago  mistral_event_engine

Now the container shows up again, and the script runs just fine.
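Since the failure is intermittent and clears once the state is re-synced, a generic retry helper can paper over it in scripts like the one above until the bug is fixed. This is my own sketch, not part of podman or the original script, and the 'retry' name is hypothetical:

```shell
# Hypothetical helper: run a command up to N times, stopping at the
# first success. Useful while 'podman exec' intermittently reports
# "cannot exec into container that is not running".
retry() {
    n="$1"; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" && return 0    # command succeeded, stop retrying
        i=$((i + 1))
    done
    return 1                # every attempt failed
}

# Intended use in the patch script (assumes 'podman ps --sync' refreshes
# the stale state before the exec is retried):
#   retry 3 sh -c 'sudo podman ps --sync > /dev/null &&
#                  sudo podman exec -u root mistral_event_engine true'
```

This only masks the symptom; the exec still fails on the first attempt whenever the state is stale.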
Version-Release number of selected component (if applicable):

[stack@undercloud-0 ~]$ sudo podman info
host:
  BuildahVersion: 1.6-dev
  Conmon:
    package: podman-1.0.0-2.git921f98f.module+el8+2785+ff8a053f.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.14.0-dev, commit: be8255a19cda8a598d76dfa49e16e337769d4528-dirty'
  Distribution:
    distribution: '"rhel"'
    version: "8.0"
  MemFree: 480124928
  MemTotal: 20874752000
  OCIRuntime:
    package: runc-1.0.0-54.rc5.dev.git2abd837.module+el8+2769+577ad176.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 8
  hostname: undercloud-0.redhat.local
  kernel: 4.18.0-80.el8.x86_64
  os: linux
  rootless: false
  uptime: 8h 57m 46s (Approximately 0.33 days)
insecure registries:
  registries:
  - 192.168.24.1:8787
  - 192.168.24.3:8787
  - brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
registries:
  registries:
  - registry.redhat.io
  - quay.io
  - docker.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 89
  GraphDriverName: overlay
  GraphOptions: null
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 36
  RunRoot: /var/run/containers/storage

[stack@undercloud-0 ~]$ sudo podman version
Version:       1.0.2-dev
Go Version:    go1.11.5
OS/Arch:       linux/amd64

podman-1.0.0-2.git921f98f.module+el8+2785+ff8a053f.x86_64

How reproducible:
Fairly often, but not 100%.

Actual results:
exec fails.

Expected results:
exec succeeds without me having to run 'podman ps --sync' first.

Additional info:
At http://file.rdu.redhat.com/~mbaldess/osp15/podman_ps_bz/ I added the following info:
1. sosreport of the undercloud
2. tar.gz of /var/lib/containers (excluding the 'overlays' subfolder, as that is too large)
That also leads to another bz: after undercloud installation, the following mistral containers fail their healthchecks:

tripleo_mistral_api_healthcheck
tripleo_mistral_event_engine_healthcheck

Container created but not running:

561e39cb9819  192.168.24.1:8787/rhosp15/openstack-mistral-api:20190325.1  kolla_start  14 hours ago  Created  mistral_api

● tripleo_mistral_api_healthcheck.service - mistral_api healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_mistral_api_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-03-27 09:57:57 UTC; 1s ago
  Process: 968328 ExecStart=/usr/bin/podman exec mistral_api /openstack/healthcheck (code=exited, status=125)
 Main PID: 968328 (code=exited, status=125)

Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: Starting mistral_api healthcheck...
Mar 27 09:57:57 undercloud-0.redhat.local podman[968328]: cannot exec into container that is not running
Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: tripleo_mistral_api_healthcheck.service: Main process exited, code=exited, status=125/n/a
Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: tripleo_mistral_api_healthcheck.service: Failed with result 'exit-code'.
Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: Failed to start mistral_api healthcheck.

mistral_event_engine:

● tripleo_mistral_event_engine_healthcheck.service - mistral_event_engine healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_mistral_event_engine_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-03-27 09:59:27 UTC; 19s ago
  Process: 969964 ExecStart=/usr/bin/podman exec mistral_event_engine /openstack/healthcheck (code=exited, status=125)
 Main PID: 969964 (code=exited, status=125)

Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: Starting mistral_event_engine healthcheck...
Mar 27 09:59:27 undercloud-0.redhat.local podman[969964]: cannot exec into container that is not running
Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: tripleo_mistral_event_engine_healthcheck.service: Main process exited, code=exited, status=125/n/a
Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: tripleo_mistral_event_engine_healthcheck.service: Failed with result 'exit-code'.
Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: Failed to start mistral_event_engine healthcheck.

The same for:

● tripleo_glance_api_healthcheck.service             loaded failed failed glance_api healthcheck
● tripleo_mistral_event_engine_healthcheck.service   loaded failed failed mistral_event_engine healthcheck
● tripleo_mistral_executor_healthcheck.service       loaded failed failed mistral_executor healthcheck
● tripleo_nova_api_healthcheck.service               loaded failed failed nova_api healthcheck
● tripleo_nova_conductor_healthcheck.service         loaded failed failed nova_conductor healthcheck
● tripleo_nova_metadata_healthcheck.service          loaded failed failed nova_metadata healthcheck
● tripleo_nova_placement_healthcheck.service         loaded failed failed nova_placement healthcheck
● tripleo_nova_scheduler_healthcheck.service         loaded failed failed nova_scheduler healthcheck
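Until the underlying issue is resolved, one possible stopgap for the failing healthcheck units (an untested sketch on my part; the drop-in filename is arbitrary, and I'm assuming that forcing a sync before the exec is acceptable here) would be a systemd drop-in that runs 'podman ps --sync' before each healthcheck:

```
# Hypothetical drop-in, e.g.
# /etc/systemd/system/tripleo_mistral_api_healthcheck.service.d/sync.conf
[Service]
ExecStartPre=/usr/bin/podman ps --sync
```

followed by 'systemctl daemon-reload'. This only masks the symptom; it does not address why podman's view of the container state goes stale.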
I think something is happening to Podman's temporary directory here (default /var/run/libpod) that causes the 'alive' file not to be present when Podman checks for it. Without that file, Podman assumes the system has restarted and performs a full refresh of container state: it resets container states to reflect a 'fresh start' after a system restart that cleared all running processes, mounts, etc.

I don't think the removal of the alive file itself is a Podman bug. I suspect that somewhere, you may be running Podman inside a container without the /var/run/libpod directory mounted in, which causes this particular issue. (We can detect a different directory path and handle that, but we can't easily detect a path that is the same yet refers to a different directory because of mount namespaces, bind mounts, etc.)

Still, Podman could probably handle this case more gracefully. We ought to be able to poke containers/storage and runc for the actual state of each container on a refresh, instead of resetting our state to a completely clean sheet. That could be very slow, though - I'll look into it.
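To illustrate the refresh trigger described above (the real logic lives in libpod's Go code; 'check_alive' and its directory argument are illustrative names of mine, with /var/run/libpod being the default tmp dir mentioned):

```shell
# Illustrative sketch only: how a missing 'alive' file flips podman into
# a full state refresh. check_alive is a hypothetical helper name.
check_alive() {
    tmpdir="$1"    # podman's tmp dir, /var/run/libpod by default
    if [ ! -e "$tmpdir/alive" ]; then
        # No alive file: assume the host rebooted, recreate the file,
        # and reset all container states - which is why 'podman ps'
        # stops listing containers that are in fact still running.
        mkdir -p "$tmpdir" && touch "$tmpdir/alive"
        echo "refresh"
    else
        echo "ok"
    fi
}
```

If a podman invocation sees a different /var/run (for example, podman run inside a container without /var/run/libpod mounted in), the alive file is missing from its point of view even though nothing rebooted, which matches the symptoms reported here.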
Matt, any movement on this?
We can't fix this on the Podman side. If we're run with the tmp directory remounted out from under us, things are going to break. The best I can do is add warnings to make it more obvious when this is happening.

To be clear, I'm fairly convinced this is a problem with how Podman is being called, not with Podman itself. Somewhere, a Podman command is being run with a different /var/run (probably Podman being run in a container), which is something we can't handle in a sane fashion (it breaks both Podman itself and c/storage).
I think this one is old and it works fine now. Please re-open if not.