Bug 2090609
Summary: ERRO[0009] Error forwarding signal 18 to container using rootless user with timeout+sleep in the podman run command

Field | Value
---|---
Product | Red Hat Enterprise Linux 8
Reporter | Sameer <snangare>
Component | podman
Assignee | Jindrich Novy <jnovy>
Status | CLOSED ERRATA
QA Contact | Alex Jia <ajia>
Severity | urgent
Priority | unspecified
Version | 8.6
CC | ahuchcha, ajia, bbaude, dornelas, dwalsh, falim, jligon, jnovy, lsm5, mheon, pghadge, pthomas, tsweeney, umohnani, ypu
Target Milestone | rc
Keywords | ZStream
Target Release | ---
Hardware | Unspecified
OS | Linux
Fixed In Version | podman-4.1.1-6.el8
Doc Type | If docs needed, set a value
Story Points | ---
Clones | 2097049 (view as bug list)
Last Closed | 2022-11-08 09:15:47 UTC
Type | Bug
Regression | ---
Mount Type | ---
Bug Blocks | 2097049
Deadline | 2022-08-23
Description (Sameer, 2022-05-26 06:30:54 UTC)
As a note, I would strongly recommend using `podman run --timeout` instead of the `timeout` command. `timeout` does not work as advertised with Podman when the container ignores the stop signal. In simple terms, `timeout` tries to manage Podman, but Podman is not the container. The container is not a direct child of Podman: it double-forks to daemonize so it can survive the early death of the Podman process (users can deliberately detach from the container via the detach keys, for example, which causes Podman to exit while the container continues running).

`timeout` is therefore sending signals to Podman, not to the container. By default Podman forwards most signals into the container (the notable exceptions being SIGSTOP and SIGKILL, which cannot be caught and forwarded), but PID 1 in the container is special: the kernel gives PID 1 in a namespace special treatment, and it ignores every signal for which it has not explicitly registered a handler. Thus the usual stop signal that `timeout` sends (SIGTERM, I believe) has no effect on any program that does not install its own SIGTERM handler, and many common tools do not register one. As a result, `timeout` simply spins, sending signals to Podman, which Podman dutifully forwards on to the container, which just ignores them.

The situation can be worse than this if the container is run with `--sig-proxy=false` or `timeout` is run with `-k`: `timeout` will successfully kill Podman, but the container will continue running in the background because of its double-fork. In short, while this approach may work for Python (and honestly, from what I'm seeing with the reproducer, it doesn't: Podman exits after the number of seconds in Python's sleep, not the 4-second timeout, seemingly indicating that the timeout is just being ignored), it should not be considered general-purpose, and some containers will simply refuse to exit when managed by `timeout`.
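The PID 1 behavior described above is the crux: a process running as PID 1 in a container must install its own SIGTERM handler, or forwarded stop signals are silently dropped. A minimal sketch (illustrative, not taken from the bug report; the handler and names are assumptions):

```python
import os
import signal

# Illustrative sketch: PID 1 in a PID namespace ignores any signal for
# which it has not registered a handler, so a containerized entrypoint
# must install one for SIGTERM if `podman stop` (or a signal forwarded
# from `timeout` via podman's sig-proxy) is supposed to stop it.

received = []

def handle_sigterm(signum, frame):
    # A real entrypoint would clean up and exit here (conventionally with
    # status 128 + 15 = 143); this demo just records that the signal landed.
    received.append(signum)

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate Podman forwarding SIGTERM into the container.
os.kill(os.getpid(), signal.SIGTERM)

print("handled:", received == [signal.SIGTERM])
```

Without the `signal.signal(...)` registration, the same SIGTERM would terminate an ordinary process but be ignored entirely by a PID 1, which is exactly why `timeout` appears to do nothing in the reproducer.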
You can potentially avoid this by setting the container's stop signal to SIGKILL, but this prevents any cleanup when a container is told to exit. Furthermore, we've seen occasional bad behavior from `timeout` in the past, where it spams Podman with signals repeatedly, causing occasional logs like this when the container transitions to exited.

`podman run --timeout 4` is the way to handle this and get the container to work correctly. Having Podman killed is an unexpected occurrence.

I am not sure what you are asking. If your customer wants to stop a container that runs longer than TIMEX, then `podman run --timeout TIMEX` is the way to go. If you are asking whether we should fix a bug where Podman deadlocks when killed, then the answer is yes. But I am not sure why the customer should care, or what kind of priority this gets, since the customer has a better solution.

Matt, do you know if `podman run|start` catches SIGTERM and triggers a `podman stop`? Matt, it might make sense to catch SIGTERM and then send the STOP_SIGNAL to the container (i.e., do a `podman stop`). That way, if SIGTERM is ignored, Podman and the container will still exit within 10 seconds.

Tested with podman-4.1.1-6.module+el8.7.0+15923+b0ec4f51.x86_64, and the test result looks good.

```
[root@sweetpig-20 ~]# yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init registry.access.redhat.com/ubi8 sleep 5
ERRO[0000] container not running
ERRO[0000] container not running
ERRO[0000] container not running
[root@sweetpig-20 ~]# timeout 4 podman run --log-level=info -i --rm registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8 python3 -c 'import time; time.sleep(5)'
INFO[0000] podman filtering at log level info
INFO[0000] Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled
INFO[0000] Setting parallel job count to 7
Trying to pull registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob efe94c0f6ff6 [>-------------------------------------] 3.1MiB / 189.4MiB
Copying blob 1e09a5ee0038 [===>----------------------------------] 3.4MiB / 34.7MiB
Copying blob 47c1fb849539 [==>-----------------------------------] 1.8MiB / 19.9MiB
Copying blob 971ebcb22551 [--------------------------------------] 586.0KiB / 65.6MiB
Copying blob 0d725b91398e done
INFO[0004] Received shutdown signal "terminated", terminating! PID=30003
INFO[0004] Invoking shutdown handler "libpod" PID=30003
[root@sweetpig-20 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.7 Beta (Ootpa)
[root@sweetpig-20 ~]# rpm -q podman runc systemd kernel
podman-4.1.1-6.module+el8.7.0+15923+b0ec4f51.x86_64
runc-1.1.3-2.module+el8.7.0+15923+b0ec4f51.x86_64
systemd-239-60.el8.x86_64
kernel-4.18.0-408.el8.x86_64
```

And also verified on podman-4.1.1-6.module+el8.7.0+15895+a6753917.x86_64.

```
[root@sweetpig-20 ~]# yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init registry.access.redhat.com/ubi8 sleep 5
ERRO[0000] container not running
[root@sweetpig-20 ~]# timeout 4 podman run --log-level=info -i --rm registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8 python3 -c 'import time; time.sleep(5)'
INFO[0000] podman filtering at log level info
INFO[0000] Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled
INFO[0000] Setting parallel job count to 7
Trying to pull registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob efe94c0f6ff6 [>-------------------------------------] 2.6MiB / 189.4MiB
Copying blob 0d725b91398e done
Copying blob 1e09a5ee0038 [==>-----------------------------------] 2.9MiB / 34.7MiB
Copying blob 47c1fb849539 [===>----------------------------------] 1.9MiB / 19.9MiB
Copying blob 971ebcb22551 [=>------------------------------------] 2.7MiB / 65.6MiB
INFO[0004] Received shutdown signal "terminated", terminating! PID=35277
INFO[0004] Invoking shutdown handler "libpod" PID=35277
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7457
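The recommendation repeated throughout the comments above, letting Podman enforce the time limit via `podman run --timeout` rather than wrapping Podman in coreutils `timeout`, can be sketched as follows. This is an illustration, not part of the report; the image is the one from the reproducer, and the guard keeps the script safe on hosts without podman installed:

```shell
#!/bin/sh
set -eu

# Illustrative sketch: have Podman enforce the time limit itself instead of
# wrapping it in coreutils `timeout`, which signals the podman process
# rather than the container.

if ! command -v podman >/dev/null 2>&1; then
    # Keep the sketch runnable on hosts without podman.
    echo "podman not available; the recommended invocation is:"
    echo "  podman run --rm --timeout 4 registry.access.redhat.com/ubi8 sleep 10"
    exit 0
fi

# --timeout makes conmon kill the container after N seconds, so the limit
# holds even when PID 1 in the container ignores its stop signal.
# `|| true` because a timed-out container exits non-zero by design.
podman run --rm --timeout 4 registry.access.redhat.com/ubi8 sleep 10 || true

echo "done"
```

Unlike the `timeout 4 podman run ...` pattern, this works regardless of `--sig-proxy` settings or the container's signal handling, because the kill is applied to the container itself rather than to the Podman client.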