Bug 1732525 - "podman ps" hangs and never returns [NEEDINFO]
Summary: "podman ps" hangs and never returns
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: podman
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: rc
Target Release: 8.1
Assignee: Matthew Heon
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1186913 1727325 1767359
 
Reported: 2019-07-23 15:11 UTC by Daniel Alvarez Sanchez
Modified: 2019-12-11 12:35 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
pmoravec: needinfo? (bbaude)
pmoravec: needinfo? (mheon)
vrothber: needinfo? (dalvarez)


Attachments
strace output for 'podman ps' command (152.12 KB, text/plain)
2019-07-23 15:11 UTC, Daniel Alvarez Sanchez

Description Daniel Alvarez Sanchez 2019-07-23 15:11:16 UTC
Created attachment 1592903 [details]
strace output for 'podman ps' command

I don't have a reproducer for this, but after running sosreport a few times I saw that it was timing out. Digging a bit further, I found that the problem was 'podman ps' never returning.

When I run it manually, it never returns. I'm attaching the strace output for the 'podman ps' command.

Apparently there are more 'podman' commands running in parallel due to the sosreport plugin that I'm testing [0], and this may be the root issue?

root      497383  487560  0 14:09 pts/2    00:00:00 timeout 300s podman exec -t ovn-dbs-bundle-podman-0 ovn-nbctl show
root      497387  497383  0 14:09 pts/2    00:00:00 podman exec -t ovn-dbs-bundle-podman-0 ovn-nbctl show
root      500136  487560  0 14:10 pts/2    00:00:00 timeout 300s podman ps
root      500138  500136  1 14:10 pts/2    00:00:00 podman ps


It looks like if I kill the first 3 processes above, 'podman ps' starts working again.
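
In case it helps triage: processes stopped by a job-control signal such as SIGTTIN show state 'T' in ps, so a stuck lock holder can be spotted with something like this (a sketch; the awk filter and column choice are just one way to do it):

# List any podman processes currently in the stopped ('T') state
ps -eo pid,stat,cmd | awk '$2 ~ /^T/ && /podman/'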


--- system info ---

# uname -a
Linux controller-0 4.18.0-80.4.2.el8_0.x86_64 #1 SMP Fri Jun 14 13:20:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.0 (Ootpa)

# podman -v
podman version 1.0.2


[0] https://github.com/danalsan/sos/commit/ba4bfa118c5b21ec738eeb9b3a4cf9b38144d825

Comment 1 Daniel Alvarez Sanchez 2019-07-24 07:48:19 UTC
[root@controller-0 plugins]# ps -aef|grep -e  "podman ps" -e "podman exec"
root      182773  174374  0 07:44 pts/2    00:00:00 timeout 300s podman exec ovn-dbs-bundle-podman-0 ovn-nbctl show
root      182777  182773  0 07:44 pts/2    00:00:00 podman exec ovn-dbs-bundle-podman-0 ovn-nbctl show
root      184552  174374  0 07:44 pts/2    00:00:00 timeout 300s podman ps
root      184555  184552  0 07:44 pts/2    00:00:00 podman ps
root      197528  142445  0 07:46 pts/3    00:00:00 grep --color=auto -e podman ps -e podman exec

[root@controller-0 plugins]# strace -f -p 182777
strace: Process 182777 attached with 11 threads
[pid 182815] --- stopped by SIGTTIN ---
[pid 182814] --- stopped by SIGTTIN ---
[pid 182811] --- stopped by SIGTTIN ---
[pid 182795] --- stopped by SIGTTIN ---
[pid 182794] --- stopped by SIGTTIN ---
[pid 182785] --- stopped by SIGTTIN ---
[pid 182784] --- stopped by SIGTTIN ---
[pid 182783] --- stopped by SIGTTIN ---
[pid 182781] --- stopped by SIGTTIN ---
[pid 182780] --- stopped by SIGTTIN ---
[pid 182777] --- stopped by SIGTTIN ---

[root@controller-0 plugins]# strace -f -p 184555
strace: Process 184555 attached with 11 threads
[pid 184619] futex(0xc0004a39c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 184576] futex(0xc000614148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 184575] epoll_pwait(4,  <unfinished ...>
[pid 184563] futex(0xc0004a2148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 184561] fcntl(89, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0} <unfinished ...>
[pid 184559] futex(0x561bf778bd98, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 184555] futex(0x561bf776e628, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 184564] futex(0x561bf776da10, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 184558] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 184557] futex(0x561bf778bec0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 184564] <... futex resumed> )      = 0
[pid 184564] epoll_wait(12,  <unfinished ...>
[pid 184556] restart_syscall(<... resuming interrupted futex ...>) = -1 EAGAIN (Resource temporarily unavailable)
[pid 184556] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
[pid 184556] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
[pid 184556] nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0
[... the same nanosleep call repeated ~35 more times ...]
[pid 184556] futex(0x561bf776da10, FUTEX_WAIT_PRIVATE, 0, {tv_sec=60, tv_nsec=0}

Comment 2 Matthew Heon 2019-07-24 13:24:41 UTC
Something appears to have paused the 'podman exec' process while it is holding the container lock. 'podman ps' cannot proceed as it cannot acquire a lock necessary for operation.

I don't really know if I consider this a Podman bug - I'll check our expected behavior, but I'm fairly sure we don't expect 'exec' to continue executing after it gets a SIGTTIN. Then again, I'm not certain how we expect 'exec' to work in the background if that's true. I'll look into it.
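
For reference, the job-control mechanism at play here can be demonstrated without podman at all: the kernel stops any background job that tries to read from its controlling terminal by sending it SIGTTIN (PID below is illustrative):

$ cat &
[1] 12345
$ jobs -l
[1]+ 12345 Stopped (tty input)     cat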

Comment 7 Daniel Walsh 2019-08-14 13:27:26 UTC
Any update on this bugzilla?

Comment 8 Pavel Moravec 2019-11-08 11:35:58 UTC
(In reply to Daniel Walsh from comment #7)
> Any update on this bugzilla?

I would love to have this bug fixed, as it currently causes one of two problems:

- either sosreport (the basic debugging tool used on the majority of support cases) will not collect 'podman ps' output, which I expect substantially helps troubleshooting,
- or sosreport running that command will hang (the even worse option).

Both negatively impact customer support, as they prolong the time needed to obtain the required troubleshooting data from a customer.

Comment 9 Daniel Walsh 2019-11-08 11:58:30 UTC
Matt, Valentin, is it possible to put a timeout on lock waits?
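
As a point of comparison, the fcntl F_SETLKW call visible in the strace above is a blocking lock wait with no timeout. A bounded wait is the pattern being asked about; util-linux flock(1) exposes it, shown here purely as an illustration (podman's container locks are a different mechanism):

$ flock -w 5 /tmp/demo.lock true || echo "timed out waiting for lock"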

Comment 10 Daniel Alvarez Sanchez 2019-11-15 10:21:16 UTC
Perhaps this is useful to determine whether it's really a bug or not...


[root@controller-0 ~]# podman exec ovn-dbs-bundle-podman-0 ls  &
[1] 909385
[root@controller-0 ~]#
[1]+  Stopped                 podman exec ovn-dbs-bundle-podman-0 ls
[root@controller-0 ~]#
[root@controller-0 ~]#
[root@controller-0 ~]# podman ps
..... hanging until you kill the 'podman exec' command

When I Ctrl-C 'podman ps' and 'fg' the 'podman exec' command, this is what I get:

^C
[root@controller-0 ~]# fg
podman exec ovn-dbs-bundle-podman-0 ls
Error: timed out waiting for runtime to create pidfile for exec session in container 3b763d3a55f54df6475a565e068c7536965f2607195c30615513822daf4fce6a: timed out waiting for file /var/run/containers/storage/overlay-containers/3b763d3a55f54df6475a565e068c7536965f2607195c30615513822daf4fce6a/userdata/exec_pid_082bdc5b83648a3b705f7222893d8c4b0fa3c01f2eb25f817bcea2bcd9f7fc7a: internal libpod error

Comment 11 Pavel Moravec 2019-12-11 10:58:04 UTC
Hello,
could you pretty please pay attention to this bug? It seems it can impact sosreport more than we expected; see the recent

https://github.com/sosreport/sos/pull/1873#issuecomment-564456122

We introduced a workaround in sosreport that is not ideal (it could also lead to some commands hanging). But the root cause has remained untouched for months.
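
A detail that may matter for any workaround (a sketch of the mechanism, not necessarily what the sos PR does): when coreutils timeout is started from a script rather than an interactive shell, it places the command in a new background process group, which is exactly the situation where a TTY read triggers SIGTTIN; 'timeout --foreground' skips the new process group, at the cost of not timing out the command's children:

# Started from a script, the child lands in a background process group;
# if it reads the TTY (as 'podman exec -t' does) it is stopped by SIGTTIN.
timeout 300s podman exec -t ovn-dbs-bundle-podman-0 ovn-nbctl show

# --foreground keeps the command in the foreground process group so it can
# read the terminal; children of the command are then not killed on timeout.
timeout --foreground 300s podman exec -t ovn-dbs-bundle-podman-0 ovn-nbctl show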

I am quite concerned about the possible impact on sosreport, which is our primary diagnostic and troubleshooting tool. If sosreport won't be able to collect podman-related data due to hung podman commands, our ability to support podman will be limited.


Could you please at least assess this bug?

Comment 12 Valentin Rothberg 2019-12-11 11:50:12 UTC
Hi Pavel,

thanks a lot for the reminder. We had plenty of fires to fight; apologies for the long silence on this bug. We will have a look right now.

Comment 13 Valentin Rothberg 2019-12-11 12:35:24 UTC
@Daniel,

can you provide a reproducer with explicit steps? I have been chatting with Pavel and have a better understanding of the environment and potential sources of the bug. A reproducer would be hugely beneficial.

