Bug 2211395
| Summary: | [rhel4edge] podman hangs creating a container in a pod | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Jelle van der Waa <jvanderwaa> |
| Component: | podman | Assignee: | Matthew Heon <mheon> |
| Status: | ASSIGNED | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.1 | CC: | bbaude, dwalsh, jnovy, lsm5, mboddu, mheon, mpitt, pholzing, pthomas, tsweeney, umohnani |
| Target Milestone: | rc | Flags: | jnovy: needinfo? (jvanderwaa) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | CockpitTest | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description: Jelle van der Waa, 2023-05-31 12:09:07 UTC
Assigning to @mheon

Can you clarify whether the Podman commands you're running are using the remote API (podman.socket) or local Podman commands?

Ah, answering myself, you're just providing equivalent commands... Any chance you can get the actual API calls that reproduce this, then? It would help greatly if I could try to reproduce with `curl`.

This issue occurs in our tests; I'll see if I can provide a script as a reproducer, but that might be a bit tricky. Our application subscribes to /libpod/events so it knows when to update containers. While this happens in the background, the tests run the podman CLI commands in the description. After podman hangs, I can run `podman system renumber`, but this does not resolve the hang.

I'm working on a patch to Podman that will include additional debug information for locking issues. Would it be possible to use a patched Podman in your CI environment to get additional debug info?

I tried to create a shell script to reproduce the issue but so far haven't been successful. Locally I can run our tests and reproduce the issue, so I'm happy to run a patched podman if that helps.

If you can build from here: https://github.com/containers/podman/pull/18796 and then run `podman system locks` after reproducing, that ought to help greatly.

It took me a while to figure out how to build podman on RHEL 8, as a mock build fails due to the lack of go-md2man. Anyway, the output when it fails is:

    [root@rhel4edge-127-0-0-2-2201 ~]# ./podman system locks
    No lock conflicts have been detected, system safe from deadlocks.
    Lock 1 is presently being held
    Lock 2 is presently being held
    Lock 3 is presently being held

You note that pods/containers are being created repeatedly - are they ever removed? How many are there when things start failing? Is anything being done from the remote API (Podman socket) besides streaming events?

Our tests try to reset the state of /var/lib/containers on every test run, so we start with a fresh state. The newly attached strace -f failed on the first run, so that seems to exclude our cleanup mechanism. When a test starts, we open our application and, for the system/user session, call:

- libpod/info
- libpod/images/json
- libpod/containers/json
- libpod/pods/json
- libpod/events

While the page is loading, we create a podman pod and try to create a container in it. So I suppose to reproduce this I could try to replicate this in a script.

Actually, I'm overthinking this a bit. Can you hit the hanging process (the `podman run` ideally, but a `podman ps` that hung afterwards would also help) with a SIGABRT? This forces Go to dump stacktraces from every thread it's running and should tell exactly where it's managed to hang.

Potentially interesting journal output:

    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd[1]: Started Process Core Dump (PID 2202/UID 0).
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:22 -0400] "GET /v1.12/libpod/pods/json?filters=%7B%22id%22%3A%5B%22c8479afffce329c86801d2d8cda591421a8fee440662>
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:23 -0400] "GET /v1.12/libpod/pods/json? HTTP/1.1" 200 554 "" ""
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd-coredump[2203]: Resource limits disable core dumping for process 1792 (podman).
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd-coredump[2203]: [🡕] Process 1792 (podman) of user 0 dumped core.
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd[1]: systemd-coredump: Deactivated successfully.
    Jun 07 13:36:48 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:21 -0400] "GET /v1.12/libpod/containers/stats?stream=true HTTP/1.1" 200 1664 "" ""
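For reference, a minimal sketch of how such a stack dump could be captured, as suggested above. The `pgrep` pattern, the `coredumpctl` step, and the assumption that the hung `podman run` is the oldest matching process are mine and not part of this report:

```bash
#!/bin/bash
# Sketch: force a hung podman process to dump Go stack traces via SIGABRT.
# Assumption: the hung command is `podman run`; adjust the pgrep pattern for
# `podman ps` or for the API service process if that is the one to inspect.
set -eu

pid=$(pgrep -of 'podman run')
echo "Sending SIGABRT to podman pid ${pid}"

# The Go runtime reacts to an unhandled SIGABRT by printing stack traces to the
# process's stderr before exiting, which should show the blocked lock acquisition.
kill -ABRT "${pid}"

# The traces appear wherever the process's stderr goes (terminal, test harness
# log, or the journal for the podman.service API process). If resource limits
# permit core dumping, systemd-coredump also records the crash:
coredumpctl list podman || true
```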
We're getting stuck trying to lock the infra container's lock to get its bind mounts (https://github.com/containers/podman/blob/main/libpod/container.go#L936). I don't know what would be holding that lock, though. The `podman pod create` command is not hanging, is it?

Any chance you have tests running in parallel, each trying to create a separate clean environment?

No; we do test parallelization with multiple VMs, but inside each OS VM there is only ever a single test running, against the one and only system and/or user instance. Note that we have only ever seen this on two OSes:

- rhel4edge, which is RHEL 9.1 (as reported in the "Version:" field), with podman-4.2.0-11.el9_1.x86_64
- Ubuntu 22.04 LTS with podman 3.4.4

We don't see this on Fedora, RHEL 9.3, CentOS 8/9, Debian 12/unstable, later Ubuntus, etc., either because it was fixed or because it happens so rarely that we don't notice any more. So it's highly likely that this won't reproduce with current podman versions.

This looks like an ABBA deadlock to me: the container (locked) is started, then gets the dependency container (the pod infra container), which it tries to lock as well and blocks on. I assume the podman service API process, on the other hand, locked the infra container and then tries to take the container lock somehow. Would it be possible for you to also get the stack trace of the service when it hangs? I am currently searching the code to understand where it hangs on the server side, but so far I have not seen anything obvious.

I know we had a problem at some point with deadlocking with pods, maybe https://github.com/containers/podman/pull/14976/commits/af118f7c6a22d0e639ef5d60d66bb2a628523f6b. This one was fixed in 4.3 but is only triggered on pod removal; do you also run `podman pod rm` in your reproducer? There is also https://github.com/containers/podman/issues/16326, which was fixed in 4.4.

> do you also run podman pod rm in your reproducer?

Filling in this bit for Jelle, as he's on PTO this week. No, we don't. The test does not get very far; it already hangs here [1] in the setup phase. At this point the only things that have happened are:

1. Log into Cockpit, in other words, start and connect to podman.service
2. `podman pod create --name pod`
3. `podman run --name inpod --pod pod -tid quay.io/prometheus/busybox sh -c 'exit 1'`

I.e. there is no `--rm` option, but the container immediately stops after running (though in theory that shouldn't block the `podman run` command, so it hangs earlier than that); see the sketch below for these steps as a standalone script.

[1] https://github.com/cockpit-project/cockpit-podman/blob/7822dfd77a21a2c0f58f4569c0e22fc9e4a39bef/test/check-application#L2207
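A minimal shell sketch that strings together the setup steps above with the API calls mentioned earlier in the thread. The socket path, the `timeout`, and the assumption that these steps alone reproduce the deadlock (the real tests drive the same API through Cockpit) are mine, not from the report:

```bash
#!/bin/bash
# Sketch of a standalone reproducer assembled from this thread.
set -eux

SOCK=/run/podman/podman.sock   # system instance; adjust for a user session

# Step 1, "log into Cockpit", boils down to starting and talking to the API socket.
systemctl start podman.socket

# Queries the web UI issues while the page loads (versioned path as seen in the
# journal output above).
for ep in info images/json containers/json pods/json; do
    curl -s --unix-socket "$SOCK" "http://d/v1.12/libpod/$ep" > /dev/null
done

# Keep the events stream open in the background, as the application does.
curl -sN --unix-socket "$SOCK" "http://d/v1.12/libpod/events" > /tmp/libpod-events.log &
EVENTS_PID=$!
trap 'kill $EVENTS_PID' EXIT

# Steps 2 and 3: the two CLI commands from the test's setup phase; the second
# one is where the hang was observed.
podman pod create --name pod
timeout 120 podman run --name inpod --pod pod -tid quay.io/prometheus/busybox sh -c 'exit 1'
```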