Bug 1671023
| Summary: | timeout not working with podman pull on rhel8 beta | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | John Fulton <johfulto> |
| Component: | podman | Assignee: | Valentin Rothberg <vrothber> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 | CC: | ddarrah, dwalsh, gabrioux, igreen, jligon, kdudka, lsm5, mheon, vrothber, ypu |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | podman 1.1 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-05 21:01:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
John Fulton
2019-01-30 15:19:57 UTC
WORKAROUND: use "timeout -s 9" as per [1]. Note that the issue does not reproduce with other commands like `time timeout 1 find /` as the find is killed after 1 sec as it should be. It reproduces with podman though. Note the states that podman goes into when run as reported: [root@rhel8 ~]# ps axu | head -1 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND [root@rhel8 ~]# ps axu | grep 'podman pull' root 116280 0.0 0.0 221284 952 pts/2 S 22:38 0:00 timeout 300s podman pull docker.io/ceph/daemon:latest-master root 116281 0.0 0.3 2205484 59096 pts/2 Tl 22:38 0:00 podman pull docker.io/ceph/daemon:latest-master root 117093 0.0 0.0 221860 968 pts/1 S+ 22:42 0:00 grep --color=auto podman pull [root@rhel8 ~]# to review: S interruptible sleep (waiting for an event to complete) T stopped, either by a job control signal or because it is being traced. l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do) [1] [root@rhel8 ~]# time timeout -s 9 4 podman pull docker.io/ceph/daemon:latest-master Trying to pull docker.io/ceph/daemon:latest-master...Getting image source signatures Skipping blob a02a4930cb5d (already present): 71.68 MiB / 71.68 MiB [=======] 0s Skipping blob a9df2003811e (already present): 149.99 MiB / 149.99 MiB [=====] 0s Skipping blob 9f8cc4bfe3e6 (already present): 36.83 MiB / 36.83 MiB [=======] 0s Skipping blob d9b683c59cc1 (already present): 763 B / 763 B [===============] 0s Skipping blob 945ffa6dc462 (already present): 423 B / 423 B [===============] 0s Skipping blob 5079a1e5ea24 (already present): 292 B / 292 B [===============] 0s Skipping blob 40e4854b4ccc (already present): 30.96 KiB / 30.96 KiB [=======] 0s Skipping blob 9ffb0b7f6cdf (already present): 412 B / 412 B [===============] 0s Skipping blob 91fb7adff82f (already present): 483.29 KiB / 483.29 KiB [=====] 0s Skipping blob 7668d903fd4d (already present): 1.45 KiB / 1.45 KiB [=========] 0s Copying config f23e24277d1a: 15.38 KiB / 15.38 KiB [========================] 0s Writing manifest to image destination Storing signatures f23e24277d1a3d35999b36de8878638bab088293f227b42e7e9b33064387f350 real 0m3.415s user 0m0.072s sys 0m0.051s [root@rhel8 ~]# The `timeout` utility only sends signal to the process it started if it is still running after the specified time interval. It cannot guarantee immediate process termination in general. You might want to use `timeout -s KILL ...` to make the process terminate faster. Still it is up to the operating system how it handles the signal. an other strange behavior I hit with timeout + podman :
```
ok: [osd0 -> mon0] => changed=false
  cmd:
  - timeout
  - '5'
  - podman
  - exec
  - ceph-mon-mon0
  - ceph
  - --cluster
  - ceph
  - -s
  - -f
  - json
  delta: '0:00:05.008254'
  end: '2019-01-31 04:01:14.814374'
  failed_when_result: false
  msg: non-zero return code
  rc: 124
  start: '2019-01-31 04:01:09.806120'
  stderr: ''
  stderr_lines: []
  stdout: |2-
    {"fsid":"ae9a625a-dcbb-47c1-955d-937cd53f5d3a","health":{"checks":{},"status":"HEALTH_OK"},"election_epoch":8,"quorum":[0,1,2],"quorum_names":["mon2","mon1","mon0"],"quorum_age":110,"monmap":{"epoch":1,"fsid":"ae9a625a-dcbb-47c1-955d-937cd53f5d3a","modified":"2019-01-31
    ... omitted output ...
    ["balancer","crash","devicehealth","progress","status","volumes"]}},"servicemap":{"epoch":1,"modified":"0.000000","services":{}}}
```
Although the command executed correctly, it returned 124.
I also realized that the command seems to have taken the full 5 seconds to complete.
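For reference, `timeout` exits with 124 whenever the time limit is reached, regardless of whether the wrapped command had already produced its output. A minimal sketch (plain shell, no podman involved) that reproduces the same pattern, where the result is printed immediately but the process never exits on its own:

```
# Prints its output right away but then never exits; timeout reaches the
# 5-second limit, terminates the child, and reports exit status 124.
time timeout 5 sh -c 'echo "{\"status\":\"HEALTH_OK\"}"; sleep infinity'
echo "rc=$?"   # rc=124, even though the output appeared within a second
```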
Then I decided to rerun the same command with `timeout 120`:
```
ok: [osd0 -> mon0] => changed=false
  cmd:
  - timeout
  - '120'
  - podman
  - exec
  - ceph-mon-mon0
  - ceph
  - --cluster
  - ceph
  - -s
  - -f
  - json
  delta: '0:02:00.011980'
  end: '2019-01-31 05:01:31.402003'
  failed_when_result: false
  msg: non-zero return code
  rc: 124
  start: '2019-01-31 04:59:31.390023'
  stderr: ''
  stderr_lines: []
  stdout: |2-
    {"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","health":{"checks":{},"status":"HEALTH_OK"},"election_epoch":3,"quorum":[0],"quorum_names":["mon0"],"quorum_age":62,"monmap":{"epoch":1,"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","modified":"2019-01-31
    ... omitted output ...
    ["balancer","crash","devicehealth","progress","status","volumes"]}},"servicemap":{"epoch":1,"modified":"0.000000","services":{}}}
```
Same behavior: the command seems to run fine (we do get the output), but it took the full 2 minutes. It looks like either something fails to detect that the command has completed, or podman itself hangs after it displays the output.
I tried to run the command manually (without the playbook) on the node:
```
[root@mon0 ~]# time timeout 120 podman exec ceph-mon-mon0 ceph --cluster ceph -s -f json
{"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","health":{"checks":{},"status":"HEALTH_OK"},"election_epoch":3,"quorum":[0],"quorum_names":["mon0"],"quorum_age":728,"monmap":{"epoch":1,"fsid":"3185b423-14d2-48fa-b75f-c9860866c112","modified":"2019-01-31 04:58:14.922298","created":"2019-
... omitted output ...
["balancer","crash","devicehealth","progress","status","volumes"]}},"servicemap":{"epoch":1,"modified":"0.000000","services":{}}}

real    0m0,526s
user    0m0,392s
sys     0m0,069s
[root@mon0 ~]#
```
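When the command does hang like that under the playbook, it is worth checking what state the podman process is in, as in the ps output from the original description. A possible way to do that (a hypothetical diagnostic step, not taken from the report):

```
# While the command is stuck under `timeout`, list the relevant processes.
# STAT "S" means interruptible sleep, "T" means stopped by a job-control
# signal, "l" means multi-threaded (matching the "Tl" state reported above).
ps -o pid,ppid,stat,wchan,cmd -C timeout,podman
```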
Another test (because I suspected something related to Ansible):
```
guits@elisheba[guits-podman]$ ANSIBLE_SSH_ARGS='-F /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/vagrant_ssh_config' ~/tox/bin/ansible -i /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/hosts mon0 -m command -a 'time timeout 2 sleep 4'
mon0 | FAILED | rc=124 >>
Command exited with non-zero status 124
0.00user 0.00system 0:02.00elapsed 0%CPU (0avgtext+0avgdata 2268maxresident)k
0inputs+0outputs (0major+173minor)pagefaults 0swaps
non-zero return code

guits@elisheba[guits-podman]$ ANSIBLE_SSH_ARGS='-F /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/vagrant_ssh_config' ~/tox/bin/ansible -i /home/guits/GIT/ceph-ansible/tests/functional/ooo_rhel8/hosts mon0 -m command -a 'time timeout 4 sleep 2'
mon0 | CHANGED | rc=0 >>
0.00user 0.00system 0:02.00elapsed 0%CPU (0avgtext+0avgdata 2112maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
```
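The sleep tests above behave as expected because sleep exits as soon as it receives SIGTERM. A process that catches or ignores SIGTERM reproduces the "runs for the full limit and returns 124" pattern seen with podman. A minimal sketch (plain shell, no podman involved):

```
# The child ignores SIGTERM, and the ignored disposition is inherited by
# sleep; timeout sends SIGTERM after 2 seconds, nothing terminates, and
# timeout keeps waiting until the child exits on its own (~10s, status 124).
time timeout 2 bash -c 'trap "" TERM; sleep 10'

# SIGKILL cannot be caught or ignored, so the same child dies at the
# 2-second mark; this is what the "timeout -s 9" workaround relies on.
time timeout -s KILL 2 bash -c 'trap "" TERM; sleep 10'
```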
As already explained, the `timeout` utility only sends signals. The problems you are describing are more about how `podman` handles the signals. I am switching the component to `podman`...

So this means that podman is ignoring the SIGTERM? Does that mean there is no bug?

If SIGKILL is the only way to terminate a long-running process that operates on some persistent data, it does sound like a bug to me. I think podman should handle proper termination upon receiving SIGTERM and ensure that persistent data ends up in some consistent state while terminating.

If I run podman in one terminal and then send SIGTERM to it from another terminal I see:

```
$ podman run -ti fedora sh
# exit
```

If I run a pull and then send a kill signal I see it stop with a SIGTTOU:

```
# podman pull docker.io/ceph/daemon:latest-master
Trying to pull docker.io/ceph/daemon:latest-master...Getting image source signatures
Copying blob a02a4930cb5d: 8.38 MiB / 71.68 MiB [===>--------------------------]
Copying blob da885c29d20c: 1.28 MiB / 161.91 MiB [>----------------------------]
Copying blob b1f264fd2ddd: 16.28 MiB / 36.85 MiB [============>----------------]
Copying blob 2b782190a07b: 765 B / 765 B [==================================] 2s
Copying blob a02a4930cb5d: 35.14 MiB / 71.68 MiB [==============>--------------]
Copying blob da885c29d20c: 7.54 MiB / 161.91 MiB [=>---------------------------]
Copying blob b1f264fd2ddd: 36.85 MiB / 36.85 MiB [==========================] 6s
Copying blob 2b782190a07b: 765 B / 765 B [==================================] 6s
Copying blob 43df52c729f6: 424 B / 424 B [==================================] 6s
Copying blob 44ea3313c897: 293 B / 293 B [==================================] 6s
Copying blob 5d01166cd6fe: 30.96 KiB / 30.96 KiB [==========================] 6s
Copying blob e880c2071faa: 412 B / 412 B [==================================] 6s
Copying blob f70553bab26e: 483.29 KiB / 483.29 KiB [========================] 6s
Copying blob 74104ee0b9c6: 1.46 KiB / 1.46 KiB [============================] 6s
podman pkill -TERM %1
sh-4.4# [1]+  Stopped(SIGTTOU)        podman pull docker.io/ceph/daemon:latest-master
```

Valentin, do you think this is the library doing the terminal output catching the signal?

```
# podman pull -q docker.io/ceph/daemon:latest-master &
[1] 30689
sh-4.4# kill -TERM %1
sh-4.4# jobs
[1]+  Terminated              podman pull -q docker.io/ceph/daemon:latest-master
```

This seems to work if I do it by hand, but it does not seem to work with timeout.

```
# time timeout -s TERM .01 podman run fedora ls -lR / > /tmp/out

real    0m0.012s
user    0m0.006s
sys     0m0.006s
```

This works, but here the code sig-proxies the signal to the container processes.

(In reply to Daniel Walsh from comment #12)
> Valentin, do you think this is the library doing the terminal output
> catching the signal?

I need to verify, but it would make sense. The bars are rendered in the background and the goroutines may catch those signals.
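As an aside on the Stopped(SIGTTOU) state shown above: a background job that writes to the controlling terminal (or changes its settings, as progress-bar code tends to do) can be stopped by the kernel's job control with SIGTTOU. A minimal shell illustration, to be run in an interactive shell, and unrelated to podman's internals:

```
# With "tostop" set, a background job is stopped by SIGTTOU as soon as it
# writes to the terminal, similar to the stopped pull in the transcript above.
stty tostop
( sleep 1; echo "drawing a progress bar..." ) &
sleep 2
jobs -l            # the job is reported as stopped (tty output)
kill -KILL %1      # clean up; SIGKILL works even on a stopped job
stty -tostop
```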
Seems to work in podman 1.1:

```
time timeout 4 ./bin/podman pull docker.io/ceph/daemon:latest-master
Trying to pull docker://docker.io/ceph/daemon:latest-master...Getting image source signatures
Copying blob a02a4930cb5d [=====>----------------------------] 12.7MiB / 71.7MiB
Copying blob 4983d19d9a03 [>---------------------------------] 6.1MiB / 162.4MiB
Copying blob 0df723c41791 [============>---------------------] 14.1MiB / 37.2MiB
Copying blob 88ee529f6f81 [======================================] 766b / 766b
Copying blob 88c8161d8087 [======================================] 425b / 425b
Copying blob 31ac340e9ec9 [======================================] 294b / 294b
Copying blob 1064234d281f [==================================] 31.0KiB / 31.0KiB
Copying blob cec2144378b1 [======================================] 411b / 411b
Copying blob 2686e5f3efb3 [================================] 483.3KiB / 483.3KiB
Copying blob 3083c59b7809 [====================================] 1.5KiB / 1.5KiB

real    0m4.028s
user    0m3.839s
sys     0m0.445s
```

Tested with podman-1.4.2-5.module+el8.1.0+4240+893c1ab8.x86_64. Since the problem is not triggered 100% of the time, it was run 100 times with the command:

```
for i in `seq 100`; do time timeout 8 podman pull docker.io/ceph/daemon:latest-master; done
```

It always finished in around 8 seconds, so setting the bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:3403