Bug 1958732 - podman REST API hang
Summary: podman REST API hang
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: podman
Version: 35
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Aditya R
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1956608 1962019
Reported: 2021-05-10 03:50 UTC by Nathan Scott
Modified: 2023-09-18 00:26 UTC
CC List: 14 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2022-12-13 15:22:31 UTC
Type: Bug
Embargoed:



Description Nathan Scott 2021-05-10 03:50:17 UTC
I'm observing a situation where curl requests hang on subsequent REST API requests.  I'm running a couple of trivial containers (bash) and a single empty pod, all created via the podman CLI for the purposes of testing new PCP code.

Version-Release number of selected component (if applicable):
Fedora 34, podman-3.2.0-0.1.rc1.fc34.x86_64

How reproducible:
Every time.

Steps to Reproduce:
$ sudo curl -s --unix-socket /run/podman/podman.sock http://d/v3.0.0/libpod/containers/json
[{"AutoRemove":false,"Command":["bash"],"Created":"2021-05-07T15:18:58.11280636+10:00","CreatedAt":"","Exited":false,"ExitedAt":-62135596800,"ExitCode":0,"Id":"d747ee1448150d051b34a8d675c9c8de49a9db78406da1b1fcad4c88523a1459","Image":"registry.fedoraproject.org/fedora:latest","ImageID":"eb7134a03cd6bd8a3de99c16cf174d66ad2d93724bac3307795efcd8aaf914c5","IsInfra":false,"Labels":{"license":"MIT","name":"fedora","vendor":"Fedora Project","version":"32"},"Mounts":[],"Names":["eloquent_euler"],"Namespaces":{},"Networks":["podman"],"Pid":3201686,"Pod":"","PodName":"","Ports":null,"Size":null,"StartedAt":1620364738,"State":"running","Status":""}]

$ sudo curl -s --unix-socket /run/podman/podman.sock http://d/v3.0.0/libpod/pods/json
[ hangs ]
^C

$ sudo curl -s --unix-socket /run/podman/podman.sock http://d/v3.0.0/libpod/containers/json
[ also hangs now ]
^C

However, accesses to a regular user's podman socket (i.e. non-root) continue just fine for both cases (pods/json and containers/json).
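As a side note for anyone writing a client against this endpoint: the containers/json body above is a JSON array of container objects. A minimal sketch (with the sample record abbreviated from the curl output above) of pulling out the fields a monitoring tool would typically want:

```python
# Sketch of consuming a /libpod/containers/json response.  The sample
# below is abbreviated from the curl output above; a real client would
# read the body from the podman socket instead.
import json

sample = '''[{"Command":["bash"],"Exited":false,"ExitCode":0,
"Id":"d747ee1448150d051b34a8d675c9c8de49a9db78406da1b1fcad4c88523a1459",
"Image":"registry.fedoraproject.org/fedora:latest",
"Names":["eloquent_euler"],"Pid":3201686,"StartedAt":1620364738,
"State":"running"}]'''

def summarize(body):
    """Return (name, state, pid) for each container in a containers/json body."""
    return [(c["Names"][0], c["State"], c["Pid"]) for c in json.loads(body)]

print(summarize(sample))
```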

Comment 1 Matthew Heon 2021-05-10 16:21:34 UTC
Can you verify that it only hangs after the Pods call happens, or that regardless of order, the second call always hangs?

Comment 2 Nathan Scott 2021-05-10 22:33:55 UTC
(In reply to Matthew Heon from comment #1)
> Can you verify that it only hangs after the Pods call happens, or that
> regardless of order, the second call always hangs?

Thanks Matt - yes, using the pods API definitely seems to be the trigger
for the hang.  With a freshly started REST service I can make any number
of successful containers/json queries, but as soon as I introduce a pods
query, it hangs.

One other thing I neglected to mention, which may help reproduce - I am
often running curl in a way that replicates what the PCP client code
will be doing, i.e. reusing a single socket connection.

So any number of containers requests, whether chained from one client or
not, succeeds:

$ sudo curl --unix-socket /run/podman/podman.sock http://d/v3.0.0/libpod/containers/json http://d/v3.0.0/libpod/containers/json http://d/v3.0.0/libpod/containers/json
[3 responses, curl completes]
$ 

but introduce pods/json and the wheels fall off...

$ sudo curl --unix-socket /run/podman/podman.sock http://d/v3.0.0/libpod/containers/json http://d/v3.0.0/libpod/pods/json  
[hangs immediately, either no response at all - or just the containers
response if the service was restarted just prior to running curl]
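For reference, the connection-reuse pattern described here - several HTTP requests issued over one Unix-socket connection, the way curl behaves when given multiple URLs - can be sketched in Python roughly as below. A tiny in-process HTTP server stands in for the podman service; the socket path and echo handler are made up for illustration, not podman's API.

```python
# Several HTTP/1.1 keep-alive requests over ONE Unix-socket connection.
# EchoHandler and the socket path are illustrative stand-ins for podman.
import http.client
import json
import os
import socket
import socketserver
import tempfile
import threading
from http.server import BaseHTTPRequestHandler

class EchoHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"          # keep-alive, like the podman service

    def do_GET(self):
        body = json.dumps({"path": self.path}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):          # keep the demo quiet
        pass

class UnixHTTPServer(socketserver.UnixStreamServer):
    def get_request(self):
        request, _ = super().get_request()
        return request, ("unix", 0)        # dummy client_address for the handler

def fetch_many(sock_path, paths):
    """Issue every request over the same Unix-socket connection."""
    conn = http.client.HTTPConnection("d")  # dummy hostname, as with curl
    conn.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    conn.sock.connect(sock_path)
    replies = []
    for path in paths:
        conn.request("GET", path)
        resp = conn.getresponse()               # each response must be fully
        replies.append(json.loads(resp.read())) # read before the next request
    conn.close()
    return replies

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        sock_path = os.path.join(d, "demo.sock")
        server = UnixHTTPServer(sock_path, EchoHandler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        for reply in fetch_many(sock_path, ["/libpod/containers/json",
                                            "/libpod/pods/json"]):
            print(reply)
        server.shutdown()
        server.server_close()
```

The key point is that `fetch_many` opens the socket once and pipelines request/response pairs over it, which is exactly what a multi-URL curl invocation does internally.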

Comment 3 Jhon Honce 2021-05-11 22:40:12 UTC
@mheon FYI, using a build from master (8dcd5b893feb8b3b34386e6688d20b7ef098228b). I was not able to reproduce:

$ bin/podman system reset
curl --unix-socket /run/user/1000/podman/podman.sock http://d/v3.0.0/libpod/pods/json
[]
$ bin/podman pod create --name pod001
$ curl --unix-socket /run/user/1000/podman/podman.sock http://d/v3.0.0/libpod/pods/json
[{"Cgroup":"user.slice","Containers":[{"Id":"527c6d322dd68e261630a3f46b3a7db3b1b79ceb0c09d8aeb36012fa8efa28b2","Names":"b9ab3ec058fc-infra","Status":"configured"}],"Created":"2021-05-11T15:26:25.325078296-07:00","Id":"b9ab3ec058fc50417bd68e79a40b0ff75d67c074ca37cba1c73a012c7bb40953","InfraId":"527c6d322dd68e261630a3f46b3a7db3b1b79ceb0c09d8aeb36012fa8efa28b2","Name":"pod001","Namespace":"","Networks":null,"Status":"Created","Labels":{}}]

$ bin/podman pull fedora
...
$ bin/podman run -it fedora date
...
$ curl --unix-socket /run/user/1000/podman/podman.sock http://d/v3.0.0/libpod/pods/json
[{"Cgroup":"user.slice","Containers":[{"Id":"527c6d322dd68e261630a3f46b3a7db3b1b79ceb0c09d8aeb36012fa8efa28b2","Names":"b9ab3ec058fc-infra","Status":"configured"}],"Created":"2021-05-11T15:26:25.325078296-07:00","Id":"b9ab3ec058fc50417bd68e79a40b0ff75d67c074ca37cba1c73a012c7bb40953","InfraId":"527c6d322dd68e261630a3f46b3a7db3b1b79ceb0c09d8aeb36012fa8efa28b2","Name":"pod001","Namespace":"","Networks":null,"Status":"Created","Labels":{}}]

Comment 4 Nathan Scott 2021-05-11 22:46:50 UTC
Jhon, could you try the curl invocations from my second post?  Having
multiple URLs on the curl command line seemed to make things more likely
to fail for me (internally this results in a single socket connection
over which multiple HTTP requests and responses are processed) - mixing
pod and container requests that way may be a critical part of the puzzle
too.

Comment 5 Matthew Heon 2021-05-17 13:15:53 UTC
Concur with Jhon, this doesn't reproduce on current upstream. Possibly it was a bug in the 3.2.0 RC that has since been fixed?

Can you try building the latest code from our main branch and verify if you can still see this, Nathan?

Comment 6 Nathan Scott 2021-05-20 03:41:33 UTC
I've upgraded to the latest podman in Fedora 34 - the REST API still hangs very quickly with the podman version there:

$ rpm -qi podman
Name        : podman
Epoch       : 3
Version     : 3.1.2
Release     : 3.fc34
Architecture: x86_64

Which seems to be the latest stable release.  BTW there's a broken link here: https://podman.io/new/2021/04/02/new.html

I'll attempt to build and run unreleased upstream podman next (or is there a safe copr/CI/... build I could use?)

Comment 7 Nathan Scott 2021-05-20 04:22:59 UTC
OK, I've managed to build with no problems from the podman github repo.

The problem still exists in the main branch and is still trivially reproducible for me.  I managed to attach gdb to the hung process and this is where it is stuck - looks like a futex deadlock, perhaps?

$ podman --version
podman version 3.2.0-dev
$ sudo gdb -p 35574
GNU gdb (GDB) Fedora 10.1-14.fc34
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 35574
[New LWP 35575]
[New LWP 35576]
[New LWP 35577]
[New LWP 35578]
[New LWP 35579]
[New LWP 35580]
[New LWP 35581]
[New LWP 35582]
[New LWP 35583]
[New LWP 35584]
[New LWP 35585]
[New LWP 35587]
[New LWP 35595]

warning: Expected absolute pathname for libpthread in the inferior, but got .gnu_debugdata for /lib64/libpthread.so.0.

warning: File "/usr/lib64/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/lib/golang/src/runtime/runtime-gdb.py".
Missing separate debuginfos, use: dnf debuginfo-install podman-3.1.2-3.fc34.x86_64
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
Loading Go Runtime support.
(gdb) runtime.futex () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:580
580		MOVL	AX, ret+40(FP)

(gdb) bt
#0  runtime.futex () at /usr/lib/golang/src/runtime/sys_linux_amd64.s:580
#1  0x000000000043e9c6 in runtime.futexsleep (addr=0x2896a30 <runtime.m0+336>, 
    val=0, ns=-1) at /usr/lib/golang/src/runtime/os_linux.go:44
#2  0x0000000000417b3f in runtime.notesleep (n=0x2896a30 <runtime.m0+336>)
    at /usr/lib/golang/src/runtime/lock_futex.go:159
#3  0x0000000000447eb9 in runtime.mPark ()
    at /usr/lib/golang/src/runtime/proc.go:1340
#4  0x0000000000449cce in runtime.stoplockedm ()
    at /usr/lib/golang/src/runtime/proc.go:2451
#5  0x000000000044beec in runtime.schedule ()
    at /usr/lib/golang/src/runtime/proc.go:3059
#6  0x000000000044c2bd in runtime.park_m (gp=0xc000602480)
    at /usr/lib/golang/src/runtime/proc.go:3274
#7  0x0000000000478d7b in runtime.mcall ()
    at /usr/lib/golang/src/runtime/asm_amd64.s:327
#8  0x0000000000478c74 in runtime.rt0_go ()
    at /usr/lib/golang/src/runtime/asm_amd64.s:226
#9  0x0000000000000000 in ?? ()
(gdb) 


This is a completely up-to-date Fedora 34 system as of this morning, with just podman built from main being the only upstream non-release bits (well, other than PCP - but that's not in play here).



$ sudo curl --unix-socket /run/podman/podman.sock http://d/v3.0.0/libpod/pods/json http://d/v3.0.0/libpod/pods/json 'http://d/v3.0.0/libpod/containers/stats?stream=false'  'http://d/v3.0.0/libpod/containers/json' 
(hangs)
^C

$ podman pod list
POD ID        NAME            STATUS   CREATED      INFRA ID      # OF CONTAINERS
a67e64258be1  jovial_lehmann  Created  2 weeks ago  8ecc96186124  1

$ sudo podman pod list
(hangs)
^C

$ podman container list
CONTAINER ID  IMAGE                                     COMMAND  CREATED        STATUS            PORTS   NAMES
936eecade00f  registry.fedoraproject.org/fedora:latest  bash     9 minutes ago  Up 9 minutes ago          nostalgic_germain
fb89274a964a  registry.fedoraproject.org/fedora:latest  bash     7 minutes ago  Up 7 minutes ago          silly_kalam

$ sudo podman container list
(hangs)


Looking at the gdb stack trace for the two hanging podman list commands, it looks the same as in the REST API case (or maybe I'm misreading the stack trace - I don't have experience debugging Go code).
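One caveat about the trace above (a general Go-runtime observation, not podman-specific): a single OS thread parked in `runtime.futexsleep` via the scheduler is normal for any idle Go thread, so a per-goroutine stack dump is usually more telling about where the program is actually stuck. Two possible ways to get one, assuming the service's stderr is visible (e.g. via journalctl for a systemd unit) - the PID here is from the session above:

```shell
# Ask the Go runtime to dump every goroutine's stack to stderr, then exit.
# (Default Go behaviour on SIGQUIT; this terminates the process.)
sudo kill -QUIT 35574

# Or, non-destructively, sample all thread stacks with gdb:
sudo gdb -p 35574 -batch -ex 'thread apply all bt'
```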

Comment 8 Daniel Walsh 2021-06-11 14:56:29 UTC
Are you still able to reproduce this? No one else has been able to reproduce the issue.

Comment 9 Nathan Scott 2021-06-15 03:13:14 UTC
Yes, still fails reliably for me - same as always (numerous f34 updates & reboots since last post).  Let me know if there's any other debugging information I can provide.

strace still suggests it's hanging in a futex call, as before...

$ sudo strace -p 1447472
strace: Process 1447472 attached
futex(0x56332c856ef0, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 1447472 detached
 <detached ...>


I also see that podman hangs on the command line (i.e. not only via the REST API) - same sort of strace...

read(9, "# The containers configuration f"..., 16427) = 16426
read(9, "", 1)                          = 0
close(9)                                = 0
geteuid()                               = 0
newfstatat(AT_FDCWD, "/etc/containers/storage.conf", {st_mode=S_IFREG|0644, st_size=7834, ...}, 0) = 0
newfstatat(AT_FDCWD, "/etc/containers/storage.conf", {st_mode=S_IFREG|0644, st_size=7834, ...}, 0) = 0
futex(0xc000580150, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1 RT_2], NULL, 8) = 0
futex(0xc000580150, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1 RT_2], NULL, 8) = 0
futex(0xc000628150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55d95fe76ef0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xc000068d50, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigprocmask(SIG_SETMASK, ~[HUP INT QUIT ILL TRAP ABRT BUS FPE SEGV TERM STKFLT CHLD PROF SYS RTMIN RT_1 RT_2], NULL, 8) = 0
futex(0xc000180150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55d95fe76ef0, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55d95fe76ef0, FUTEX_WAIT_PRIVATE, 0, NULL[hangs here]

cheers.

Comment 10 Nathan Scott 2021-06-17 22:52:04 UTC
Are there any podman debugging diagnostics I can enable to dig deeper?  If not, could someone in the podman team provide a debug version (or even patch) I can run to assist further?  Thanks!

Comment 11 Nathan Scott 2021-11-08 22:54:33 UTC
Quick update - nothing has improved and the situation seems to have got significantly worse with F35.  If it wasn't made clear earlier, these bugs prevent monitoring of podman containers with tools like PCP and atop.  To make it even clearer - everything used to work just fine for us when podman used libvarlink, but nothing has worked properly since.  :(


With F35, API requests that previously did not hang now hang every time, e.g.

# curl --output /tmp/containers.json --unix-socket /run/podman/podman.sock -v 'http://d/v3.0.0/libpod/containers/json' 
*   Trying /run/podman/podman.sock:0...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to d (/run/podman/podman.sock) port 80 (#0)
> GET /v3.0.0/libpod/containers/json HTTP/1.1
> Host: d
> User-Agent: curl/7.79.1
> Accept: */*
> 
  0     0    0     0    0     0      0      0 --:--:--  0:00:38 --:--:--     0


(that final counter - at 38 seconds so far there - just keeps going up, and the request never completes)


The example request (/libpod/images) used on the first page of the podman REST API docs works fine, albeit very slowly (~2 seconds to respond to a single request - quite poor latency).

# curl --output /tmp/images.json --unix-socket /run/podman/podman.sock -v 'http://d/v3.0.0/libpod/images/json' 
*   Trying /run/podman/podman.sock:0...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to d (/run/podman/podman.sock) port 80 (#0)
> GET /v3.0.0/libpod/images/json HTTP/1.1
> Host: d
> User-Agent: curl/7.79.1
> Accept: */*
> 
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Api-Version: 1.40
< Content-Type: application/json
< Libpod-Api-Version: 3.4.1
< Server: Libpod/3.4.1 (linux)
< X-Reference-Id: 0xc002db8000
< Date: Mon, 08 Nov 2021 22:46:59 GMT
< Transfer-Encoding: chunked
< 
{ [7970 bytes data]
100 74332    0 74332    0     0  32735      0 --:--:--  0:00:02 --:--:-- 32745
* Connection #0 to host d left intact

(i.e. it completes successfully, but takes ~2 seconds - it's visibly slow to complete on the command line)

Comment 12 Mark Wielaard 2021-11-15 14:13:41 UTC
Note that strace has a -k option (print a stack trace after each
syscall).  Maybe it can show what the futex wait is waiting on.

Comment 14 Ben Cotton 2022-11-29 16:56:44 UTC
This message is a reminder that Fedora Linux 35 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 35 on 2022-12-13.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '35'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 35 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 15 Ben Cotton 2022-12-13 15:22:31 UTC
Fedora Linux 35 entered end-of-life (EOL) status on 2022-12-13.

Fedora Linux 35 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 16 Red Hat Bugzilla 2023-09-18 00:26:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

