Bug 1671023
| Summary: | timeout not working with podman pull on rhel8 beta | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | John Fulton <johfulto> |
| Component: | podman | Assignee: | Valentin Rothberg <vrothber> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | ddarrah, dwalsh, gabrioux, igreen, jligon, kdudka, lsm5, mheon, vrothber, ypu |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | podman 1.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-05 21:01:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
John Fulton
2019-01-30 15:19:57 UTC
WORKAROUND: use "timeout -s 9" as per [1]. Note that the issue does not reproduce with other commands like `time timeout 1 find /` as the find is killed after 1 sec as it should be. It reproduces with podman though. Note the states that podman goes into when run as reported: [root@rhel8 ~]# ps axu | head -1 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND [root@rhel8 ~]# ps axu | grep 'podman pull' root 116280 0.0 0.0 221284 952 pts/2 S 22:38 0:00 timeout 300s podman pull docker.io/ceph/daemon:latest-master root 116281 0.0 0.3 2205484 59096 pts/2 Tl 22:38 0:00 podman pull docker.io/ceph/daemon:latest-master root 117093 0.0 0.0 221860 968 pts/1 S+ 22:42 0:00 grep --color=auto podman pull [root@rhel8 ~]# to review: S interruptible sleep (waiting for an event to complete) T stopped, either by a job control signal or because it is being traced. l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do) [1] [root@rhel8 ~]# time timeout -s 9 4 podman pull docker.io/ceph/daemon:latest-master Trying to pull docker.io/ceph/daemon:latest-master...Getting image source signatures Skipping blob a02a4930cb5d (already present): 71.68 MiB / 71.68 MiB [=======] 0s Skipping blob a9df2003811e (already present): 149.99 MiB / 149.99 MiB [=====] 0s Skipping blob 9f8cc4bfe3e6 (already present): 36.83 MiB / 36.83 MiB [=======] 0s Skipping blob d9b683c59cc1 (already present): 763 B / 763 B [===============] 0s Skipping blob 945ffa6dc462 (already present): 423 B / 423 B [===============] 0s Skipping blob 5079a1e5ea24 (already present): 292 B / 292 B [===============] 0s Skipping blob 40e4854b4ccc (already present): 30.96 KiB / 30.96 KiB [=======] 0s Skipping blob 9ffb0b7f6cdf (already present): 412 B / 412 B [===============] 0s Skipping blob 91fb7adff82f (already present): 483.29 KiB / 483.29 KiB [=====] 0s Skipping blob 7668d903fd4d (already present): 1.45 KiB / 1.45 KiB [=========] 0s Copying config f23e24277d1a: 15.38 KiB / 15.38 KiB [========================] 0s Writing manifest to image destination Storing signatures f23e24277d1a3d35999b36de8878638bab088293f227b42e7e9b33064387f350 real 0m3.415s user 0m0.072s sys 0m0.051s [root@rhel8 ~]# The `timeout` utility only sends signal to the process it started if it is still running after the specified time interval. It cannot guarantee immediate process termination in general. You might want to use `timeout -s KILL ...` to make the process terminate faster. Still it is up to the operating system how it handles the signal. an other strange behavior I hit with timeout + podman :
```
ok: [osd0 -> mon0] => changed=false
  cmd:
  - timeout
  - '5'
  - podman
  - exec
  - ceph-mon-mon0
  - ceph
  - --cluster
  - ceph
  - -s
  - -f
  - json
  delta: '0:00:05.008254'
  end: '2019-01-31 04:01:14.814374'
  failed_when_result: false
  msg: non-zero return code
  rc: 124
  start: '2019-01-31 04:01:09.806120'
  stderr: ''
  stderr_lines: []
  stdout: |2-
    {"fsid":"ae9a625a-dcbb-47c1-955d-937cd53f5d3a","health":{"checks":{},"status":"HEALTH_OK"},"election_epoch":8,"quorum":[0,1,2],"quorum_names":["mon2","mon1","mon0"],"quorum_age":110,"monmap":{"epoch":1,"fsid":"ae9a625a-dcbb-47c1-955d-937cd53f5d3a","modified":"2019-01-31
    ... omitted output ...
    ["balancer","crash","devicehealth","progress","status","volumes"]}},"servicemap":{"epoch":1,"modified":"0.000000","services":{}}}
```
Although the command executed correctly, it returned 124.
I also realized that the command seems to have taken the full 5 seconds to complete.
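For reference, `timeout` exits with 124 whenever the time limit is reached, regardless of whether the wrapped command had already produced its output. A minimal sketch (plain shell, no podman involved) that reproduces the same pattern, where the result is printed immediately but the process never exits on its own:

```
# Prints its output right away but then never exits; timeout reaches the
# 5-second limit, terminates the child, and reports exit status 124.
time timeout 5 sh -c 'echo "{\"status\":\"HEALTH_OK\"}"; sleep infinity'
echo "rc=$?"   # rc=124, even though the output appeared within a second
```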
Then I decided to rerun the same command with `timeout 120`:
```
ok: [osd0 -> mon0] => changed=false
  cmd:
  - timeout
  - '120'
  - podman
  - exec
  - ceph-mon-mon0
  - ceph
  - --cluster
  - ceph
  - -s
  - -f
  - json
  delta: '0:02:00.011980'
  end: '2019-01-31 05:01:31.402003'
  failed_when_result: false
  msg: non-zero return code
  rc: 124
  start: '2019-01-31 04:59:31.390023'
  stderr: ''
  stderr_lines: []
  stdout: |2-
    {"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","health":{"checks":{},"status":"HEALTH_OK"},"election_epoch":3,"quorum":[0],"quorum_names":["mon0"],"quorum_age":62,"monmap":{"epoch":1,"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","modified":"2019-01-31
    ... omitted output ...
    ["balancer","crash","devicehealth","progress","status","volumes"]}},"servicemap":{"epoch":1,"modified":"0.000000","services":{}}}
```
Same behavior: the command seems to run fine (we do get the output), but it took the full 2 minutes. It looks like either something fails to detect that the command has completed, or podman itself hangs after it displays the output.
I tried to run the command manually (without the playbook) on the node:
```
[root@mon0 ~]# time timeout 120 podman exec ceph-mon-mon0 ceph --cluster ceph -s -f json
{"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","health":{"checks":{},"status":"HEALTH_OK"},"election_epoch":3,"quorum":[0],"quorum_names":["mon0"],"quorum_age":728,"monmap":{"epoch":1,"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","modified":"2019-01-31 04:58:14.922298","created":"2019-
... omitted output ...
["balancer","crash","devicehealth","progress","status","volumes"]}},"servicemap":{"epoch":1,"modified":"0.000000","services":{}}}

real    0m0,526s
user    0m0,392s
sys     0m0,069s
[root@mon0 ~]#
```
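When the command does hang like that under the playbook, it is worth checking what state the podman process is in, as in the ps output from the original description. A possible way to do that (a hypothetical diagnostic step, not taken from the report):

```
# While the command is stuck under `timeout`, list the relevant processes.
# STAT "S" means interruptible sleep, "T" means stopped by a job-control
# signal, "l" means multi-threaded (matching the "Tl" state reported above).
ps -o pid,ppid,stat,wchan,cmd -C timeout,podman
```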
Another test (because I suspected something related to Ansible):
```
guits@elisheba[guits-podman]$ ANSIBLE_SSH_ARGS='-F /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/vagrant_ssh_config' ~/tox/bin/ansible -i /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/hosts mon0 -m command -a 'time timeout 2 sleep 4'
mon0 | FAILED | rc=124 >>
Command exited with non-zero status 124
0.00user 0.00system 0:02.00elapsed 0%CPU (0avgtext+0avgdata 2268maxresident)k
0inputs+0outputs (0major+173minor)pagefaults 0swaps
non-zero return code

guits@elisheba[guits-podman]$ ANSIBLE_SSH_ARGS='-F /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/vagrant_ssh_config' ~/tox/bin/ansible -i /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/hosts mon0 -m command -a 'time timeout 4 sleep 2'
mon0 | CHANGED | rc=0 >>
0.00user 0.00system 0:02.00elapsed 0%CPU (0avgtext+0avgdata 2112maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
```
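The sleep tests above behave as expected because sleep exits as soon as it receives SIGTERM. A process that catches or ignores SIGTERM reproduces the "runs for the full limit and returns 124" pattern seen with podman. A minimal sketch (plain shell, no podman involved):

```
# The child ignores SIGTERM, and the ignored disposition is inherited by
# sleep; timeout sends SIGTERM after 2 seconds, nothing terminates, and
# timeout keeps waiting until the child exits on its own (~10s, status 124).
time timeout 2 bash -c 'trap "" TERM; sleep 10'

# SIGKILL cannot be caught or ignored, so the same child dies at the
# 2-second mark; this is what the "timeout -s 9" workaround relies on.
time timeout -s KILL 2 bash -c 'trap "" TERM; sleep 10'
```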
As already explained, the `timeout` utility only sends signals. The problems you are describing are more about how `podman` handles the signals. I am switching the component to `podman`...

So this means that podman is ignoring the SIGTERM? Does that mean there is no bug?

If SIGKILL is the only way to terminate a long-running process that operates on some persistent data, it does sound like a bug to me. I think podman should handle proper termination upon receiving SIGTERM and ensure that persistent data ends up in some consistent state while terminating.

If I run podman in one terminal and then send SIGTERM to it from another terminal I see:

```
$ podman run -ti fedora sh
# exit
```

If I run a pull and then send a kill signal I see it stop with a SIGTTOU:

```
# podman pull docker.io/ceph/daemon:latest-master
Trying to pull docker.io/ceph/daemon:latest-master...Getting image source signatures
Copying blob a02a4930cb5d: 8.38 MiB / 71.68 MiB [===>--------------------------]
Copying blob da885c29d20c: 1.28 MiB / 161.91 MiB [>----------------------------]
Copying blob b1f264fd2ddd: 16.28 MiB / 36.85 MiB [============>----------------]
Copying blob 2b782190a07b: 765 B / 765 B [==================================] 2s
Copying blob a02a4930cb5d: 35.14 MiB / 71.68 MiB [==============>--------------]
Copying blob da885c29d20c: 7.54 MiB / 161.91 MiB [=>---------------------------]
Copying blob b1f264fd2ddd: 36.85 MiB / 36.85 MiB [==========================] 6s
Copying blob 2b782190a07b: 765 B / 765 B [==================================] 6s
Copying blob 43df52c729f6: 424 B / 424 B [==================================] 6s
Copying blob 44ea3313c897: 293 B / 293 B [==================================] 6s
Copying blob 5d01166cd6fe: 30.96 KiB / 30.96 KiB [==========================] 6s
Copying blob e880c2071faa: 412 B / 412 B [==================================] 6s
Copying blob f70553bab26e: 483.29 KiB / 483.29 KiB [========================] 6s
Copying blob 74104ee0b9c6: 1.46 KiB / 1.46 KiB [============================] 6s
podman pkill -TERM %1
sh-4.4# [1]+  Stopped(SIGTTOU)        podman pull docker.io/ceph/daemon:latest-master
```

Valentin, do you think this is the library doing the terminal output catching the signal?

```
# podman pull -q docker.io/ceph/daemon:latest-master &
[1] 30689
sh-4.4# kill -TERM %1
sh-4.4# jobs
[1]+  Terminated              podman pull -q docker.io/ceph/daemon:latest-master
```

This seems to work if I do it by hand, but it does not seem to work with timeout.

```
# time timeout -s TERM .01 podman run fedora ls -lR / > /tmp/out

real    0m0.012s
user    0m0.006s
sys     0m0.006s
```

This works, but here the code sig-proxies the signal to the container processes.

(In reply to Daniel Walsh from comment #12)
> Valentin, do you think this is the library doing the terminal output
> catching the signal?

I need to verify, but it would make sense. The bars are rendered in the background and the goroutines may catch those signals.
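As an aside on the Stopped(SIGTTOU) state shown above: a background job that writes to the controlling terminal (or changes its settings, as progress-bar code tends to do) can be stopped by the kernel's job control with SIGTTOU. A minimal shell illustration, to be run in an interactive shell, and unrelated to podman's internals:

```
# With "tostop" set, a background job is stopped by SIGTTOU as soon as it
# writes to the terminal, similar to the stopped pull in the transcript above.
stty tostop
( sleep 1; echo "drawing a progress bar..." ) &
sleep 2
jobs -l            # the job is reported as stopped (tty output)
kill -KILL %1      # clean up; SIGKILL works even on a stopped job
stty -tostop
```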
Seems to work in podman 1.1:

```
time timeout 4 ./bin/podman pull docker.io/ceph/daemon:latest-master
Trying to pull docker://docker.io/ceph/daemon:latest-master...Getting image source signatures
Copying blob a02a4930cb5d [=====>----------------------------] 12.7MiB / 71.7MiB
Copying blob 4983d19d9a03 [>---------------------------------] 6.1MiB / 162.4MiB
Copying blob 0df723c41791 [============>---------------------] 14.1MiB / 37.2MiB
Copying blob 88ee529f6f81 [======================================] 766b / 766b
Copying blob 88c8161d8087 [======================================] 425b / 425b
Copying blob 31ac340e9ec9 [======================================] 294b / 294b
Copying blob 1064234d281f [==================================] 31.0KiB / 31.0KiB
Copying blob cec2144378b1 [======================================] 411b / 411b
Copying blob 2686e5f3efb3 [================================] 483.3KiB / 483.3KiB
Copying blob 3083c59b7809 [====================================] 1.5KiB / 1.5KiB

real    0m4.028s
user    0m3.839s
sys     0m0.445s
```

Tested with podman-1.4.2-5.module+el8.1.0+4240+893c1ab8.x86_64. Since the problem is not triggered 100% of the time, it was run 100 times with the command:

```
for i in `seq 100`; do time timeout 8 podman pull docker.io/ceph/daemon:latest-master; done
```

It always finished in around 8 seconds, so setting the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3403