Bug 2211395
| Summary: | [rhel4edge] podman hangs creating a container in a pod | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Jelle van der Waa <jvanderwaa> |
| Component: | podman | Assignee: | Matthew Heon <mheon> |
| Status: | ASSIGNED | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 9.1 | CC: | bbaude, dwalsh, jnovy, lsm5, mboddu, mheon, mpitt, pholzing, pthomas, tsweeney, umohnani |
| Target Milestone: | rc | Flags: | jnovy: needinfo? (jvanderwaa) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | CockpitTest | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description: Jelle van der Waa, 2023-05-31 12:09:07 UTC
Assigning to @mheon

Can you clarify whether the Podman commands you're running are using the remote API (podman.socket) or local Podman commands?

Ah, answering myself, you're just providing equivalent commands... Any chance you can get the actual API calls that reproduce this, then? It would help greatly if I could try to reproduce with `curl`.

This issue occurs in our tests; I'll see if I can provide a script as a reproducer, but that might be a bit tricky. Our application subscribes to /libpod/events so it knows when to update containers. While this happens in the background, the tests run the podman CLI commands in the description. After podman hangs, I can run `podman system renumber`, but this does not resolve the hang.

I'm working on a patch to Podman that will include additional debug information for locking issues. Would it be possible to use a patched Podman in your CI environment to get additional debug info?

I tried to create a shell script to reproduce the issue but so far haven't been successful. Locally I can run our tests and reproduce the issue, so I'm happy to run a patched podman if that helps.

If you can build from here: https://github.com/containers/podman/pull/18796 and then run `podman system locks` after reproducing, that ought to help greatly.

It took me a while to figure out how to build podman on RHEL 8, as a mock build fails due to the lack of go-md2man. Anyway, the output when it fails is:

    [root@rhel4edge-127-0-0-2-2201 ~]# ./podman system locks
    No lock conflicts have been detected, system safe from deadlocks.
    Lock 1 is presently being held
    Lock 2 is presently being held
    Lock 3 is presently being held

You note that pods/containers are being created repeatedly - are they ever removed? How many are there when things start failing? Is anything being done from the remote API (Podman socket) besides streaming events?

Our tests try to reset the state of /var/lib/containers on every test run, so we start with a fresh state. The newly attached strace -f failed on the first run, so that seems to exclude our cleanup mechanism. When a test starts, we open our application and, for the system/user session, call:

- libpod/info
- libpod/images/json
- libpod/containers/json
- libpod/pods/json
- libpod/events

While the page is loading, we create a podman pod and try to create a container in it. So I suppose to reproduce this I could try to replicate this in a script.

Actually, I'm overthinking this a bit. Can you hit the hanging process (the `podman run` ideally, but a `podman ps` that hung afterwards would also help) with a SIGABRT? This forces Go to dump stacktraces from every thread it's running and should tell exactly where it's managed to hang.

Potentially interesting journal output:

    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd[1]: Started Process Core Dump (PID 2202/UID 0).
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:22 -0400] "GET /v1.12/libpod/pods/json?filters=%7B%22id%22%3A%5B%22c8479afffce329c86801d2d8cda591421a8fee440662>
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:23 -0400] "GET /v1.12/libpod/pods/json? HTTP/1.1" 200 554 "" ""
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd-coredump[2203]: Resource limits disable core dumping for process 1792 (podman).
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd-coredump[2203]: [🡕] Process 1792 (podman) of user 0 dumped core.
    Jun 07 13:36:43 rhel4edge-127-0-0-2-2201 systemd[1]: systemd-coredump: Deactivated successfully.
    Jun 07 13:36:48 rhel4edge-127-0-0-2-2201 podman[1721]: @ - - [07/Jun/2023:04:06:21 -0400] "GET /v1.12/libpod/containers/stats?stream=true HTTP/1.1" 200 1664 "" ""
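For reference, a minimal sketch of how such a stack dump could be captured, as suggested above. The `pgrep` pattern, the `coredumpctl` step, and the assumption that the hung `podman run` is the oldest matching process are mine and not part of this report:

```bash
#!/bin/bash
# Sketch: force a hung podman process to dump Go stack traces via SIGABRT.
# Assumption: the hung command is `podman run`; adjust the pgrep pattern for
# `podman ps` or for the API service process if that is the one to inspect.
set -eu

pid=$(pgrep -of 'podman run')
echo "Sending SIGABRT to podman pid ${pid}"

# The Go runtime reacts to an unhandled SIGABRT by printing stack traces to the
# process's stderr before exiting, which should show the blocked lock acquisition.
kill -ABRT "${pid}"

# The traces appear wherever the process's stderr goes (terminal, test harness
# log, or the journal for the podman.service API process). If resource limits
# permit core dumping, systemd-coredump also records the crash:
coredumpctl list podman || true
```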
We're getting stuck trying to lock the infra container's lock to get its bind mounts (https://github.com/containers/podman/blob/main/libpod/container.go#L936). I don't know what would be holding that lock, though. The `podman pod create` command is not hanging, is it?

Any chance you have tests running in parallel, each trying to create a separate clean environment?

No; we do test parallelization with multiple VMs, but inside each OS VM there is only ever a single test running, against the one and only system and/or user instance. Note that we have only ever seen this on two OSes:

- rhel4edge, which is RHEL 9.1 (as reported in the "Version:" field), with podman-4.2.0-11.el9_1.x86_64
- Ubuntu 22.04 LTS with podman 3.4.4

We don't see this on Fedora, RHEL 9.3, CentOS 8/9, Debian 12/unstable, later Ubuntus, etc., either because it was fixed or because it happens so rarely that we don't notice any more. So it's highly likely that this won't reproduce with current podman versions.

This looks like an ABBA deadlock to me: the container (locked) is started, then gets the dependency container (the pod infra container), which it tries to lock as well and blocks on. I assume the podman service API process, on the other hand, locked the infra container and then tries to take the container lock somehow. Would it be possible for you to also get the stack trace of the service when it hangs? I am currently searching the code to understand where it hangs on the server side, but so far I have not seen anything obvious.

I know we had a problem at some point with deadlocking with pods, maybe https://github.com/containers/podman/pull/14976/commits/af118f7c6a22d0e639ef5d60d66bb2a628523f6b. This one was fixed in 4.3 but is only triggered on pod removal; do you also run `podman pod rm` in your reproducer? There is also https://github.com/containers/podman/issues/16326, which was fixed in 4.4.

> do you also run podman pod rm in your reproducer?

Filling in this bit for Jelle, as he's on PTO this week. No, we don't. The test does not get very far; it already hangs here [1] in the setup phase. At this point the only things that have happened are:

1. Log into Cockpit, in other words, start and connect to podman.service
2. `podman pod create --name pod`
3. `podman run --name inpod --pod pod -tid quay.io/prometheus/busybox sh -c 'exit 1'`

I.e. there is no `--rm` option, but the container immediately stops after running (though in theory that shouldn't block the `podman run` command, so it hangs earlier than that); see the sketch below for these steps as a standalone script.

[1] https://github.com/cockpit-project/cockpit-podman/blob/7822dfd77a21a2c0f58f4569c0e22fc9e4a39bef/test/check-application#L2207
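A minimal shell sketch that strings together the setup steps above with the API calls mentioned earlier in the thread. The socket path, the `timeout`, and the assumption that these steps alone reproduce the deadlock (the real tests drive the same API through Cockpit) are mine, not from the report:

```bash
#!/bin/bash
# Sketch of a standalone reproducer assembled from this thread.
set -eux

SOCK=/run/podman/podman.sock   # system instance; adjust for a user session

# Step 1, "log into Cockpit", boils down to starting and talking to the API socket.
systemctl start podman.socket

# Queries the web UI issues while the page loads (versioned path as seen in the
# journal output above).
for ep in info images/json containers/json pods/json; do
    curl -s --unix-socket "$SOCK" "http://d/v1.12/libpod/$ep" > /dev/null
done

# Keep the events stream open in the background, as the application does.
curl -sN --unix-socket "$SOCK" "http://d/v1.12/libpod/events" > /tmp/libpod-events.log &
EVENTS_PID=$!
trap 'kill $EVENTS_PID' EXIT

# Steps 2 and 3: the two CLI commands from the test's setup phase; the second
# one is where the hang was observed.
podman pod create --name pod
timeout 120 podman run --name inpod --pod pod -tid quay.io/prometheus/busybox sh -c 'exit 1'
```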