Description of problem:

I needed to patch some containers on my OSP15 undercloud with the following script:

[stack@undercloud-0 ~]$ cat x.sh
"""
MISTRAL_CONTAINERS="mistral_api mistral_executor mistral_event_engine mistral_engine"
for i in $MISTRAL_CONTAINERS; do
    sudo podman exec -it -u root $i sh -c 'dnf install -y http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 15-trunk; dnf install -y patch patchutils'
    sudo podman exec -it -u root $i sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/tripleo_common/*" | patch -d /usr/lib/python3.6/site-packages/tripleo_common/ -p2 -b -z .service; rm /tmp/foo.b64'
    sudo podman exec -it -u root $i sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/workbooks/*" | patch -d /usr/share/tripleo-common/workbooks/ -p2 -b -z .service; rm /tmp/foo.b64'
    sudo systemctl restart "tripleo_$i.service"
done
"""

The script failed on one container because 'podman exec' thought it was not running:

"""
+ sudo podman exec -it -u root mistral_executor sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/tripleo_common/*" | patch -d /usr/lib/python3.6/site-packages/tripleo_common/ -p2 -b -z .service; rm /tmp/foo.b64'
curl (https://review.openstack.org/changes/638323/revisions/current/patch?download): response: 200, time: 0.470660, size: 8132
+ sudo podman exec -it -u root mistral_executor sh -c 'curl -o /tmp/foo.b64 https://review.openstack.org/changes/638323/revisions/current/patch?download; base64 -d /tmp/foo.b64 | filterdiff -p1 -i "*/workbooks/*" | patch -d /usr/share/tripleo-common/workbooks/ -p2 -b -z .service; rm /tmp/foo.b64'
curl (https://review.openstack.org/changes/638323/revisions/current/patch?download): response: 200, time: 0.480547, size: 8132
+ sudo systemctl restart tripleo_mistral_executor.service
+ for i in $MISTRAL_CONTAINERS
+ sudo podman exec -it -u root mistral_event_engine sh -c 'dnf install -y http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm; rhos-release 15-trunk; dnf install -y patch patchutils'
cannot exec into container that is not running
"""

Indeed, 'podman ps' thinks the container is gone:

[stack@undercloud-0 ~]$ sudo podman ps | grep mistral_event_engine

But that is not true: it really is still running, yet 'podman ps -a' shows it as merely Created:

[stack@undercloud-0 ~]$ sudo podman ps -a | grep mistral_event_engine
0a4ae7dec2ca  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mistral-event-engine:latest  kolla_start  8 hours ago  Created  mistral_event_engine

To fix/work around this I need to run 'podman ps --sync':

[stack@undercloud-0 ~]$ sudo podman ps --sync | grep mistral_event_engine
0a4ae7dec2ca  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15/openstack-mistral-event-engine:latest  kolla_start  8 hours ago  Up 8 hours ago  mistral_event_engine

Now the container shows up again, and the script runs just fine.
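Since the failure is intermittent and clears once the state is re-synced, a generic retry helper can paper over it in scripts like the one above until the bug is fixed. This is my own sketch, not part of podman or the original script, and the 'retry' name is hypothetical:

```shell
# Hypothetical helper: run a command up to N times, stopping at the
# first success. Useful while 'podman exec' intermittently reports
# "cannot exec into container that is not running".
retry() {
    n="$1"; shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" && return 0    # command succeeded, stop retrying
        i=$((i + 1))
    done
    return 1                # every attempt failed
}

# Intended use in the patch script (assumes 'podman ps --sync' refreshes
# the stale state before the exec is retried):
#   retry 3 sh -c 'sudo podman ps --sync > /dev/null &&
#                  sudo podman exec -u root mistral_event_engine true'
```

This only masks the symptom; the exec still fails on the first attempt whenever the state is stale.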
Version-Release number of selected component (if applicable):

[stack@undercloud-0 ~]$ sudo podman info
host:
  BuildahVersion: 1.6-dev
  Conmon:
    package: podman-1.0.0-2.git921f98f.module+el8+2785+ff8a053f.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.14.0-dev, commit: be8255a19cda8a598d76dfa49e16e337769d4528-dirty'
  Distribution:
    distribution: '"rhel"'
    version: "8.0"
  MemFree: 480124928
  MemTotal: 20874752000
  OCIRuntime:
    package: runc-1.0.0-54.rc5.dev.git2abd837.module+el8+2769+577ad176.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 8
  hostname: undercloud-0.redhat.local
  kernel: 4.18.0-80.el8.x86_64
  os: linux
  rootless: false
  uptime: 8h 57m 46s (Approximately 0.33 days)
insecure registries:
  registries:
  - 192.168.24.1:8787
  - 192.168.24.3:8787
  - brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888
registries:
  registries:
  - registry.redhat.io
  - quay.io
  - docker.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 89
  GraphDriverName: overlay
  GraphOptions: null
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 36
  RunRoot: /var/run/containers/storage

[stack@undercloud-0 ~]$ sudo podman version
Version:       1.0.2-dev
Go Version:    go1.11.5
OS/Arch:       linux/amd64

podman-1.0.0-2.git921f98f.module+el8+2785+ff8a053f.x86_64

How reproducible:
Fairly often, but not 100%.

Actual results:
exec fails.

Expected results:
exec succeeds without me having to run 'podman ps --sync' first.

Additional info:
At http://file.rdu.redhat.com/~mbaldess/osp15/podman_ps_bz/ I added the following info:
1. sosreport of the undercloud
2. tar.gz of /var/lib/containers (excluding the 'overlays' subfolder, as that is too large)
That also leads to another bz: after undercloud installation, the following mistral containers fail their healthchecks:

tripleo_mistral_api_healthcheck
tripleo_mistral_event_engine_healthcheck

Container created but not running:

561e39cb9819  192.168.24.1:8787/rhosp15/openstack-mistral-api:20190325.1  kolla_start  14 hours ago  Created  mistral_api

● tripleo_mistral_api_healthcheck.service - mistral_api healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_mistral_api_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-03-27 09:57:57 UTC; 1s ago
  Process: 968328 ExecStart=/usr/bin/podman exec mistral_api /openstack/healthcheck (code=exited, status=125)
 Main PID: 968328 (code=exited, status=125)

Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: Starting mistral_api healthcheck...
Mar 27 09:57:57 undercloud-0.redhat.local podman[968328]: cannot exec into container that is not running
Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: tripleo_mistral_api_healthcheck.service: Main process exited, code=exited, status=125/n/a
Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: tripleo_mistral_api_healthcheck.service: Failed with result 'exit-code'.
Mar 27 09:57:57 undercloud-0.redhat.local systemd[1]: Failed to start mistral_api healthcheck.

mistral_event_engine:

● tripleo_mistral_event_engine_healthcheck.service - mistral_event_engine healthcheck
   Loaded: loaded (/etc/systemd/system/tripleo_mistral_event_engine_healthcheck.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2019-03-27 09:59:27 UTC; 19s ago
  Process: 969964 ExecStart=/usr/bin/podman exec mistral_event_engine /openstack/healthcheck (code=exited, status=125)
 Main PID: 969964 (code=exited, status=125)

Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: Starting mistral_event_engine healthcheck...
Mar 27 09:59:27 undercloud-0.redhat.local podman[969964]: cannot exec into container that is not running
Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: tripleo_mistral_event_engine_healthcheck.service: Main process exited, code=exited, status=125/n/a
Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: tripleo_mistral_event_engine_healthcheck.service: Failed with result 'exit-code'.
Mar 27 09:59:27 undercloud-0.redhat.local systemd[1]: Failed to start mistral_event_engine healthcheck.

The same for:

● tripleo_glance_api_healthcheck.service             loaded failed failed glance_api healthcheck
● tripleo_mistral_event_engine_healthcheck.service   loaded failed failed mistral_event_engine healthcheck
● tripleo_mistral_executor_healthcheck.service       loaded failed failed mistral_executor healthcheck
● tripleo_nova_api_healthcheck.service               loaded failed failed nova_api healthcheck
● tripleo_nova_conductor_healthcheck.service         loaded failed failed nova_conductor healthcheck
● tripleo_nova_metadata_healthcheck.service          loaded failed failed nova_metadata healthcheck
● tripleo_nova_placement_healthcheck.service         loaded failed failed nova_placement healthcheck
● tripleo_nova_scheduler_healthcheck.service         loaded failed failed nova_scheduler healthcheck
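Until the underlying issue is resolved, one possible stopgap for the failing healthcheck units (an untested sketch on my part; the drop-in filename is arbitrary, and I'm assuming that forcing a sync before the exec is acceptable here) would be a systemd drop-in that runs 'podman ps --sync' before each healthcheck:

```
# Hypothetical drop-in, e.g.
# /etc/systemd/system/tripleo_mistral_api_healthcheck.service.d/sync.conf
[Service]
ExecStartPre=/usr/bin/podman ps --sync
```

followed by 'systemctl daemon-reload'. This only masks the symptom; it does not address why podman's view of the container state goes stale.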
I think something is happening to Podman's temporary directory here (default /var/run/libpod) that causes the 'alive' file not to be present when Podman checks for it. Without that file, Podman assumes the system has restarted and performs a full refresh of container state: it resets container states to reflect a 'fresh start' after a system restart that cleared all running processes, mounts, etc.

I don't think the removal of the alive file itself is a Podman bug. I suspect that somewhere, you may be running Podman inside a container without the /var/run/libpod directory mounted in, which causes this particular issue. (We can detect a different directory path and handle that, but we can't easily detect a path that is the same yet refers to a different directory because of mount namespaces, bind mounts, etc.)

Still, Podman could probably handle this case more gracefully. We ought to be able to poke containers/storage and runc for the actual state of each container on a refresh, instead of resetting our state to a completely clean sheet. That could be very slow, though - I'll look into it.
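To illustrate the refresh trigger described above (the real logic lives in libpod's Go code; 'check_alive' and its directory argument are illustrative names of mine, with /var/run/libpod being the default tmp dir mentioned):

```shell
# Illustrative sketch only: how a missing 'alive' file flips podman into
# a full state refresh. check_alive is a hypothetical helper name.
check_alive() {
    tmpdir="$1"    # podman's tmp dir, /var/run/libpod by default
    if [ ! -e "$tmpdir/alive" ]; then
        # No alive file: assume the host rebooted, recreate the file,
        # and reset all container states - which is why 'podman ps'
        # stops listing containers that are in fact still running.
        mkdir -p "$tmpdir" && touch "$tmpdir/alive"
        echo "refresh"
    else
        echo "ok"
    fi
}
```

If a podman invocation sees a different /var/run (for example, podman run inside a container without /var/run/libpod mounted in), the alive file is missing from its point of view even though nothing rebooted, which matches the symptoms reported here.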
Matt, any movement on this?
We can't fix this on the Podman side. If we're run with the tmp directory remounted out from under us, things are going to break. The best I can do is add warnings to make it more obvious when this is happening.

To be clear, I'm fairly convinced this is a problem with how Podman is being called, not with Podman itself. Somewhere, a Podman command is being run with a different /var/run (probably Podman being run in a container), which is something we can't handle in a sane fashion (it breaks both Podman itself and c/storage).
I think this one is old and it works fine now. Please re-open if not.