Bug 2090609 - ERRO[0009] Error forwarding signal 18 to container using rootless user with timeout+sleep in the podman run command
Summary: ERRO[0009] Error forwarding signal 18 to container using rootless user with t...
Keywords:
Status: CLOSED ERRATA
Alias: None
Deadline: 2022-08-23
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: podman
Version: 8.6
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Jindrich Novy
QA Contact: Alex Jia
URL:
Whiteboard:
Depends On:
Blocks: 2097049
 
Reported: 2022-05-26 06:30 UTC by Sameer
Modified: 2022-11-08 09:33 UTC (History)
15 users

Fixed In Version: podman-4.1.1-6.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 2097049 (view as bug list)
Environment:
Last Closed: 2022-11-08 09:15:47 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments


Links
System                    ID                            Private  Priority  Status  Summary                                               Last Updated
GitHub                    containers/podman pull 14533  0        None      Merged  Do not error on signalling a just-stopped container  2022-06-15 09:05:46 UTC
Red Hat Issue Tracker     RHELPLAN-123432               0        None      None    None                                                  2022-05-26 06:48:08 UTC
Red Hat Product Errata    RHSA-2022:7457                0        None      None    None                                                  2022-11-08 09:16:27 UTC

Description Sameer 2022-05-26 06:30:54 UTC
Description of problem:

Running a container with `timeout` wrapping the `podman run` command (and a `sleep` running inside the container) throws the error below.

ERRO[0000] container not running                        
ERRO[0003] forwarding signal 18 to container d86c615290145c08884711ed1fdd9a00eb9dab7ab392f01498c3ab37e9f782ef: error sending signal to container d86c615290145c08884711ed1fdd9a00eb9dab7ab392f01498c3ab37e9f782ef: `/usr/bin/runc kill d86c615290145c08884711ed1fdd9a00eb9dab7ab392f01498c3ab37e9f782ef 18` failed: exit status 1

Version-Release number of selected component (if applicable):

  - podman version 4.0.2, but the issue can be reproduced with earlier podman versions as well

How reproducible:

  - Intermittently 

Steps to Reproduce:

1. Run the `podman run` command under `timeout`, with a sleep running inside the container:

  $  timeout 4 podman run  -i --rm registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8 python3 -c 'import time; time.sleep(5)'


Actual results:

ERRO[0000] container not running                        
ERRO[0003] forwarding signal 18 to container d86c615290145c08884711ed1fdd9a00eb9dab7ab392f01498c3ab37e9f782ef: error sending signal to container d86c615290145c08884711ed1fdd9a00eb9dab7ab392f01498c3ab37e9f782ef: `/usr/bin/runc kill d86c615290145c08884711ed1fdd9a00eb9dab7ab392f01498c3ab37e9f782ef 18` failed: exit status 1

Expected results:

- The container should exit gracefully, without deadlocking

Additional info:

- Raising this BZ to Urgent priority because the customer reported this intermittent job failure while running playbooks from the Ansible platform; workflow templates are affected since they are unable to trigger in sequence successfully.

- Pasting the customer's stated Business Impact below
  
  # Business Impact:
         After one year of development and POC, we have built a good foundation of on-boarded applications and users. The customer is now at the critical stage of starting rollout, and the AE has aligned on the next 200-node premium support agreement as well; however, this defect fails almost every workflow and is the (only) showstopper to the customer's rollout plan.
- The issue is seen intermittently.

Comment 3 Matthew Heon 2022-05-26 13:48:30 UTC
As a note, I would strongly recommend using `podman run --timeout` instead of the `timeout` command. `timeout` does not work as advertised with Podman when the container ignores the stop signal.

In simple terms, `timeout` tries to manage Podman, but Podman is not the container. The container is not a direct child of Podman (it double-forks to daemonize, to allow it to survive the early death of the Podman process - users can deliberately request to detach from the container via the detach keys, for example, which causes Podman to exit, but the container must continue running). `timeout` is sending signals to Podman, not to the container. Podman by default will forward most signals into the container (with exceptions, the most notable being SIGSTOP and SIGKILL, which cannot be forwarded), but PID 1 in the container is special (the kernel automatically accords PID 1 in a namespace special treatment) and ignores all signals for which it has not explicitly registered a handler. Thus, the usual stop signal that `timeout` sends (SIGTERM, I believe) will have no effect on any program that does not explicitly register a handler for it, and many common tools do not register a handler for SIGTERM. As a result, `timeout` simply spins, sending signals to Podman, which Podman dutifully forwards on to the container, which just ignores them.

The situation can be worse than this if a container is run with `--sig-proxy=false` or `timeout` is run with `-k`: `timeout` will successfully kill Podman, but the container will continue running in the background because of its double-fork.

In short, while this approach may work for Python (and honestly, from what I'm seeing with the reproducer, it doesn't - Podman exits after the number of seconds in Python's sleep, not the 4-second timeout, seemingly indicating that `timeout` is just being ignored), it should not be considered general-purpose, and some containers will simply refuse to exit when managed by `timeout`. You can potentially avoid this by setting the container's stop signal to SIGKILL, but this prevents any cleanup when a container is told to exit. Furthermore, we've seen occasional bad behavior from `timeout` in the past where it will spam Podman with signals repeatedly, causing occasional logs like this when the container transitions to exited.
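
For illustration, a minimal sketch of the recommended invocation (assuming the same image and workload as the reproducer above), letting Podman enforce the limit itself rather than wrapping it in coreutils `timeout`:

  # --timeout takes seconds; Podman sends the kill signal to the container once it elapses
  $ podman run -i --rm --timeout 4 registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8 python3 -c 'import time; time.sleep(5)'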

Comment 13 Daniel Walsh 2022-06-02 14:58:17 UTC
`podman run --timeout 4` is the way to handle this and get the container to work correctly. Having Podman killed is an unexpected occurrence.

Comment 15 Daniel Walsh 2022-06-03 12:49:54 UTC
I am not sure what you are asking.  If your customer wants to stop a container that runs longer than TIMEX, then `podman run --timeout TIMEX` is the way to go.

If you are asking whether we should fix a bug where Podman deadlocks when killed, then the answer is yes. But I am not sure why the customer should care, nor what kind of priority this gets, since the customer has a better solution.

Comment 16 Daniel Walsh 2022-06-03 12:52:08 UTC
Matt, do you know whether `podman run|start` catches SIGTERM and whether that triggers a `podman stop`?

Comment 22 Daniel Walsh 2022-06-08 12:37:57 UTC
Matt, it might make sense to catch SIGTERM and then send the STOP_SIGNAL to the container (i.e., do a `podman stop`). This way, if SIGTERM is ignored, then Podman and the container will exit in 10 seconds.
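
A minimal sketch of that default stop behavior, assuming a stock UBI 8 image: `sleep` runs as PID 1 and ignores the default SIGTERM stop signal, so `podman stop` falls back to SIGKILL after its default 10-second grace period.

  $ podman run -d --name demo registry.access.redhat.com/ubi8 sleep 600
  $ time podman stop demo    # takes ~10s: SIGTERM is ignored by PID 1, then SIGKILL is sent
  $ podman rm demo           # clean up; without --rm the stopped container is not auto-removed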

Comment 43 Alex Jia 2022-07-12 12:51:09 UTC
Tested with podman-4.1.1-6.module+el8.7.0+15923+b0ec4f51.x86_64, and the test results look good.

[root@sweetpig-20 ~]# yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init registry.access.redhat.com/ubi8 sleep 5
ERRO[0000] container not running                        
ERRO[0000] container not running                        
ERRO[0000] container not running  

[root@sweetpig-20 ~]# timeout 4 podman run --log-level=info -i --rm registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8 python3 -c 'import time; time.sleep(5)'
INFO[0000] podman filtering at log level info           
INFO[0000] Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled 
INFO[0000] Setting parallel job count to 7              
Trying to pull registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob efe94c0f6ff6 [>-------------------------------------] 3.1MiB / 189.4MiB
Copying blob 1e09a5ee0038 [===>----------------------------------] 3.4MiB / 34.7MiB
Copying blob 47c1fb849539 [==>-----------------------------------] 1.8MiB / 19.9MiB
Copying blob 971ebcb22551 [--------------------------------------] 586.0KiB / 65.6MiB
Copying blob 0d725b91398e done  
INFO[0004] Received shutdown signal "terminated", terminating!  PID=30003
INFO[0004] Invoking shutdown handler "libpod"            PID=30003

[root@sweetpig-20 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.7 Beta (Ootpa)

[root@sweetpig-20 ~]# rpm -q podman runc systemd kernel
podman-4.1.1-6.module+el8.7.0+15923+b0ec4f51.x86_64
runc-1.1.3-2.module+el8.7.0+15923+b0ec4f51.x86_64
systemd-239-60.el8.x86_64
kernel-4.18.0-408.el8.x86_64

Comment 44 Alex Jia 2022-07-12 13:04:41 UTC
And also verified on podman-4.1.1-6.module+el8.7.0+15895+a6753917.x86_64.

[root@sweetpig-20 ~]# yes | head -10 | xargs --replace -P5 timeout 2 podman run --rm --init registry.access.redhat.com/ubi8 sleep 5
ERRO[0000] container not running                        
[root@sweetpig-20 ~]# timeout 4 podman run --log-level=info -i --rm registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8 python3 -c 'import time; time.sleep(5)'
INFO[0000] podman filtering at log level info           
INFO[0000] Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled 
INFO[0000] Setting parallel job count to 7              
Trying to pull registry.redhat.io/ansible-automation-platform-21/ee-supported-rhel8:latest...
Getting image source signatures
Checking if image destination supports signatures
Copying blob efe94c0f6ff6 [>-------------------------------------] 2.6MiB / 189.4MiB
Copying blob 0d725b91398e done  
Copying blob 1e09a5ee0038 [==>-----------------------------------] 2.9MiB / 34.7MiB
Copying blob 47c1fb849539 [===>----------------------------------] 1.9MiB / 19.9MiB
Copying blob 971ebcb22551 [=>------------------------------------] 2.7MiB / 65.6MiB
INFO[0004] Received shutdown signal "terminated", terminating!  PID=35277
INFO[0004] Invoking shutdown handler "libpod"            PID=35277

Comment 47 errata-xmlrpc 2022-11-08 09:15:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: container-tools:rhel8 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7457

