Bug 2211395 - [rhel4edge] podman hangs creating a container in a pod [NEEDINFO]
Summary: [rhel4edge] podman hangs creating a container in a pod
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: podman
Version: 9.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Matthew Heon
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard: CockpitTest
Depends On:
Blocks:
 
Reported: 2023-05-31 12:09 UTC by Jelle van der Waa
Modified: 2023-07-11 04:11 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
jnovy: needinfo? (jvanderwaa)


Attachments
ps axHo stat,pid,tid,pgid,ppid,comm,wchan | grep podman (2.73 KB, text/plain)
2023-05-31 12:09 UTC, Jelle van der Waa


Links
Red Hat Issue Tracker: RHELPLAN-158694 (last updated 2023-05-31 12:11:22 UTC)

Description Jelle van der Waa 2023-05-31 12:09:07 UTC
Created attachment 1968069 [details]
ps axHo stat,pid,tid,pgid,ppid,comm,wchan | grep podman

Description of problem:

In our Cockpit podman plugin testsuite we observe that podman sometimes hangs and our tests fail. 

Version-Release number of selected component (if applicable):

Client:       Podman Engine
Version:      4.2.0
API Version:  4.2.0
Go Version:   go1.18.9
Built:        Tue Feb  7 05:56:41 2023
OS/Arch:      linux/amd64

How reproducible:

On my work laptop (t14s) this is only reproducible when I amplify the failing test in CI by running it ~ 20 times in a row.

Steps to Reproduce:

This is a bit tricky, but in essence we have a `podman.socket` open for the UI, and in our test we create podman pods with containers. The test hangs when trying to create a container in a pod (after having created the pod before). So we repeatedly run:

podman pod create --name pod
podman run --name inpod --pod pod -tid ${IMG_BUSYBOX} sh -c 'exit 1'
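
A rough amplification loop along those lines might look like this; the image name is the one given in comment 25, and the per-iteration cleanup is only there so the fixed names can be reused (the real tests instead reset /var/lib/containers between runs and never call `podman pod rm`, see comments 16 and 25):

IMG_BUSYBOX=quay.io/prometheus/busybox
for i in $(seq 20); do
    podman pod create --name pod
    # timeout turns a hang into a visible per-iteration failure
    timeout 60 podman run --name inpod --pod pod -tid "$IMG_BUSYBOX" sh -c 'exit 1' || echo "iteration $i failed or hung"
    # cleanup only so the same names can be reused next iteration
    podman rm -f inpod
    podman pod rm -f pod
done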

Actual results:

The `podman run` command hangs

Expected results:

Container is created.

Additional info:

https://cockpit-logs.us-east-1.linodeobjects.com/pull-1235-20230526-162637-03f5f1da-rhel4edge/log.html#26

Comment 2 Tom Sweeney 2023-06-01 00:32:04 UTC
Assigning to @mheon

Comment 3 Matthew Heon 2023-06-01 13:07:31 UTC
Can you clarify whether the Podman commands you're running are using the remote API (podman.socket) or local Podman commands?

Comment 4 Matthew Heon 2023-06-01 13:09:35 UTC
Ah, answering myself, you're just providing equivalent commands...

Any chance you can get the actual API calls that reproduce this, then? It would help greatly if I could try to reproduce with `curl`.
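
For reference, hitting the libpod API over the socket with curl would presumably look something like this; the rootful socket path and the v1.12 version prefix (the one that shows up in the journal output in comment 18) are assumptions:

# "d" is a dummy host name; curl talks to the unix socket directly
curl -s --unix-socket /run/podman/podman.sock http://d/v1.12/libpod/info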

Comment 6 Jelle van der Waa 2023-06-05 09:10:07 UTC
This issue occurs in our tests; I'll see if I can provide a script as a reproducer, but that might be a bit tricky. Our application subscribes to /libpod/events so it knows when to update containers. While this happens in the background, the tests run the podman CLI commands from the description.

After podman hangs, I can run `podman system renumber` but this does not resolve the hang.

Comment 10 Matthew Heon 2023-06-05 17:18:17 UTC
I'm working on a patch to Podman that will include additional debug information for locking issues. Would it be possible to use a patched Podman in your CI environment to get additional debug info?

Comment 11 Jelle van der Waa 2023-06-06 07:30:03 UTC
I tried to create a shell script to reproduce the issue but so far haven't been successful. Locally I can run our tests and reproduce the issue so I'm happy to run a patched podman if that helps.

Comment 12 Matthew Heon 2023-06-06 13:18:00 UTC
If you can build from here: https://github.com/containers/podman/pull/18796 and then run `podman system locks` after reproducing, that ought to help greatly.
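
A rough sketch of one way to build that PR and run the new subcommand, assuming the usual podman build dependencies are installed; the local branch name is arbitrary, and `make podman` is assumed to be the target that produces bin/podman:

git clone https://github.com/containers/podman
cd podman
# fetch the PR head into a local branch and build just the podman binary
git fetch origin pull/18796/head:locks-debug
git checkout locks-debug
make podman
# after reproducing the hang with this binary:
./bin/podman system locks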

Comment 13 Jelle van der Waa 2023-06-06 16:10:50 UTC
It took me a while to figure out how to build podman on RHEL 8, as a mock build fails due to the lack of go-md2man. Anyway, the output when it fails is:

[root@rhel4edge-127-0-0-2-2201 ~]# ./podman system locks

No lock conflicts have been detected, system safe from deadlocks.

Lock 1 is presently being held
Lock 2 is presently being held
Lock 3 is presently being held

Comment 14 Matthew Heon 2023-06-06 17:50:55 UTC
You note that pods/containers are being created repeatedly - are they ever removed? How many are there when things start failing?

Is anything being done from the remote API (Podman socket) besides streaming events?

Comment 16 Jelle van der Waa 2023-06-07 08:46:22 UTC
Our tests try to reset the state of /var/lib/containers on every test run so we start with a fresh state. The newly attached strace -f failed on the first run, so that seems to exclude our cleanup mechanism.
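
As an aside, a hedged sketch of one way to get such a fresh state between runs (the tests' actual cleanup mechanism is not shown in this bug, so this is only an approximation):

# stop the API service/socket, then wipe all podman state
systemctl stop podman.service podman.socket
podman system reset --force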

When a test starts we open our application and for the system/user session call:

- libpod/info
- libpod/images/json
- libpod/containers/json
- libpod/pods/json
- libpod/events

While the page is loading, we create a podman pod and try to create a container in it. So I suppose that, to reproduce this, I could try to replicate these calls in a script.
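
A rough sketch of such a script, combining those API calls (over the system socket, with the v1.12 prefix from the journal output in comment 18) with the CLI commands from the description; the socket path and image name are assumptions:

SOCK=/run/podman/podman.sock
API=http://d/v1.12/libpod

# roughly what the page does while loading
curl -s --unix-socket "$SOCK" "$API/info" > /dev/null
curl -s --unix-socket "$SOCK" "$API/images/json" > /dev/null
curl -s --unix-socket "$SOCK" "$API/containers/json" > /dev/null
curl -s --unix-socket "$SOCK" "$API/pods/json" > /dev/null
# keep the events stream open in the background, like the UI does
curl -sN --unix-socket "$SOCK" "$API/events?stream=true" > /dev/null &
EVENTS_PID=$!

# ...while the pod and the container in it are created from the CLI
podman pod create --name pod
podman run --name inpod --pod pod -tid quay.io/prometheus/busybox sh -c 'exit 1'

kill "$EVENTS_PID"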

Comment 17 Matthew Heon 2023-06-07 13:28:36 UTC
Actually, I'm overthinking this a bit. Can you hit the hanging process (the `podman run` ideally, but a `podman ps` that hung afterwards would also help) with a SIGABRT? This forces Go to dump stack traces from every thread it's running and should tell us exactly where it has managed to hang.
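
A small sketch of how that could be done on the affected VM; the pgrep pattern is just an assumption about how the hung client shows up in the process list:

# identify the hanging client process
pgrep -af 'podman run'
# SIGABRT makes the Go runtime print a stack trace for every goroutine
kill -ABRT <pid-of-hung-podman-run>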

Comment 18 Jelle van der Waa 2023-06-07 18:08:53 UTC
Potential interesting journal output

Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd[1]: Started Process Core Dump (PID 2202/UID 0).
Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:22 -0400] "GET /v1.12/libpod/pods/json?filters=%7B%22id%22%3A%5B%22c8479afffce329c86801d2d8cda591421a8fee440662>
Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:23 -0400] "GET /v1.12/libpod/pods/json? HTTP/1.1" 200 554 "" ""
Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd-coredump[2203]: Resource limits disable core dumping for process 1792 (podman).
Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd-coredump[2203]: [🡕] Process 1792 (podman) of user 0 dumped core.
Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd[1]: systemd-coredump: Deactivated successfully.
Jun 07 13:36:48 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:21 -0400] "GET /v1.12/libpod/containers/stats?stream=true HTTP/1.1" 200 1664 "" ""

Comment 20 Matthew Heon 2023-06-07 19:14:55 UTC
We're getting stuck trying to lock the infra container's lock to get its bind mounts (https://github.com/containers/podman/blob/main/libpod/container.go#L936). I don't know what would be holding that lock, though. The `podman pod create` command is not hanging, is it?

Any chance you have tests running in parallel, each trying to create a separate clean environment?

Comment 21 Martin Pitt 2023-07-03 06:42:38 UTC
No; we do test parallelization with multiple VMs, but inside each OS VM there is only ever a single test running, against the one and only system and/or user instance.

Comment 22 Martin Pitt 2023-07-05 07:41:30 UTC
Note that we have only ever seen this on two OSes:

 - rhel4edge, which is RHEL 9.1 (as reported in the "Version:" field), with podman-4.2.0-11.el9_1.x86_64
 - Ubuntu 22.04 LTS with podman 3.4.4.

We don't see this on Fedora, RHEL 9.3, CentOS 8/9, Debian 12/unstable, later Ubuntu releases, etc., either because it was fixed or because it happens so rarely that we don't notice it any more. So it's highly likely that this won't reproduce with current podman versions.

Comment 23 Paul Holzinger 2023-07-06 14:49:51 UTC
This looks like an ABBA deadlock to me: the container (already locked) is started, then gets its dependency container (the pod infra container), which it tries to lock as well and blocks.

I assume that the podman API service process, on the other hand, locked the infra container and then tries to take the container lock somehow. Would it be possible for you to also get the stack trace of the service when it hangs?
I am currently searching the code to understand where it hangs on the server side, but so far I did not see anything obvious.
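
A possible way to grab that service-side stack trace, assuming the API service is the system podman.service unit (adjust for a user unit); the goroutine dump should end up on the service's stderr, i.e. in the journal:

# PID of the API service behind podman.socket
systemctl show -p MainPID --value podman.service
# SIGABRT makes the Go runtime dump all goroutine stacks
kill -ABRT "$(systemctl show -p MainPID --value podman.service)"
journalctl -u podman.service --since "5 min ago"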

I know we had a problem with pod deadlocks at some point, maybe https://github.com/containers/podman/pull/14976/commits/af118f7c6a22d0e639ef5d60d66bb2a628523f6b
That one was fixed in 4.3 but is only triggered on pod removal; do you also run `podman pod rm` in your reproducer?

Comment 24 Paul Holzinger 2023-07-06 14:51:35 UTC
There is also https://github.com/containers/podman/issues/16326 which was fixed in 4.4.

Comment 25 Martin Pitt 2023-07-11 04:11:06 UTC
> do you also run podman pod rm in your reproducer?

Filling in this bit for Jelle, as he's on PTO this week. No, we don't. The test does not get very far; it already hangs here [1] in the setup phase. At this point, the only things that have happened are:

 1. Log into Cockpit, in other words, start and connect to podman.service
 2. podman pod create --name pod
 3. podman run --name inpod --pod pod -tid quay.io/prometheus/busybox sh -c 'exit 1'

I.e. there is no `--rm` option, but the container stops immediately after running (though in theory that shouldn't block the `podman run` command, so it hangs earlier than that).

[1] https://github.com/cockpit-project/cockpit-podman/blob/7822dfd77a21a2c0f58f4569c0e22fc9e4a39bef/test/check-application#L2207

