Bug 2000877

Summary:	OCP ignores STOPSIGNAL in Dockerfile and sends SIGTERM
Product:	OpenShift Container Platform	Reporter:	Szymon.Sawis
Component:	Node	Assignee:	Peter Hunt <pehunt>
Node sub component:	CRI-O	QA Contact:	pmali
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	medium	CC:	akretzsc, aos-bugs, dwalsh, ebrizuel, pehunt, rphillips, skclark, tsweeney
Version:	4.6
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: CRI-O used an outdated signal parsing library. Consequence: A stop signal set to anything greater than SIGRTMIN would be ignored. Fix: Update the signal parsing library. Result: Stop signals greater than SIGRTMIN are sent as the stop signal.	Story Points:	---
Clone Of:
Clones:	2084259 (view as bug list)		Environment:
Last Closed:	2022-03-10 16:07:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	2084259

Description Szymon.Sawis 2021-09-03 09:25:37 UTC

Description of problem:
We'd like to terminate services inside the container more gracefully. Although the Dockerfile sets STOPSIGNAL SIGRTMIN+3, OCP ignores it and always sends SIGTERM. systemd then tries to re-execute services, and when terminationGracePeriodSeconds expires, SIGKILL is sent.

Version-Release number of selected component (if applicable):
4.6

How reproducible:
always

Steps to Reproduce:
1. Run journalctl -f inside container
2. Scale down sts to 0 replicas

Actual results:
Jul 21 09:36:49 ipshost-0 systemd: Received SIGTERM.
Jul 21 09:36:49 ipshost-0 systemd: Reexecuting.
Jul 21 09:36:49 ipshost-0 systemd: systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
Jul 21 09:36:49 ipshost-0 systemd: Detected virtualization other.
Jul 21 09:36:49 ipshost-0 systemd: Detected architecture x86-64.
Jul 21 09:36:49 ipshost-0 systemd: /usr/lib/systemd/system-generators/nfs-server-generator failed with error code 1.
Jul 21 09:36:49 ipshost-0 systemd: [/usr/lib/systemd/system/nfs-ganesha.service:28] Unknown lvalue 'LogsDirectory' in section 'Service'
Jul 21 09:36:49 ipshost-0 systemd: [/usr/lib/systemd/system/nfs-ganesha.service:29] Unknown lvalue 'StateDirectory' in section 'Service'
...
Jul 21 09:44:04 ipshost-0 su: (to nz) root on none
Jul 21 09:44:04 ipshost-0 dbus[107]: [system] Activating via systemd: service name='org.freedesktop.login1' unit='dbus-org.freedesktop.login1.service'
Jul 21 09:44:04 ipshost-0 dbus[107]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.login1.service': Unit is masked.
command terminated with exit code 137

Expected results:
OCP should send the configured stop signal to the container instead of SIGTERM.

Additional info:

Comment 1 Ryan Phillips 2021-09-08 14:36:14 UTC

Can you provide us a reproducer with a Dockerfile?

Comment 2 Ezequiel Hector Brizuela 2021-09-10 02:35:43 UTC

You can use this example:

-----

$ cat Dockerfile
FROM fedora:latest
COPY ./echo-the-trapper.sh /
RUN chmod +x /echo-the-trapper.sh
ENTRYPOINT ["/echo-the-trapper.sh"]
STOPSIGNAL SIGRTMIN+3

-----

$ cat echo-the-trapper.sh 
#!/bin/bash

echo "Starting $0"

# Signals to trap; add more here as needed, without the "SIG"
#  prefix. To list all signals, use 'kill -l':
signal2trap=(INT TERM EXIT HUP RTMIN+3)

# Default fd for trap output is stderr (2) for local tests
trap_stdout=2

func_trap() {
    echo "Trapped: $1" >&${trap_stdout}
}


# If stdin is not a terminal we're probably inside the container, or
#  someone is piping data into the script o.0
if [[ ! -t 0 ]];then
    echo "In a container, adjusting!"
    trap_stdout=9
    mkfifo /tmp/bucket
    # Not sure why I cannot use $trap_stdout here
    # So I just hard-coded it until further investigation.
    exec 9<> /tmp/bucket
    signal_timeout=600 # Seconds
fi

echo "Setting the traps for signals $(IFS='|'; echo "${signal2trap[*]}")"

for s in "${signal2trap[@]}";do
  trap "func_trap ${s}" "${s}"
done

echo "Traps set for:"
trap -p
echo "Send any of them to PID ${$} (Ignore this PID in a container), we should echo the signal."

# Checking if we're on a container
if [[ $trap_stdout -ne 2 ]];then
  read -t $signal_timeout -u $trap_stdout msg
  echo "$msg"
fi

# On a container the 'read' will be ignored as stdin is closed
echo "Hit a key when you want to exit"
read

-----

You can run the script from the command line to test any changes you make, using 'bash echo-the-trapper.sh'.
You can try this using podman with:

1- Build with 'podman build -t echo-the-trapper .'
2- Run it:
# sudo podman run -d --name echo-the-trapper-0  echo-the-trapper
3- Check the logs:
# sudo podman logs -f echo-the-trapper-0
4- Kill the process
# sudo podman stop echo-the-trapper-0 --log-level debug

In the container logs you get the 'Trapped: RTMIN+3' line, and the debug output of podman stop confirms that the correct signal is sent.

Comment 3 Ezequiel Hector Brizuela 2021-09-10 13:10:56 UTC

We asked the customer to run another test with the signal number instead of the name, and the issue persists.

Comment 4 Ryan Phillips 2021-09-13 15:47:13 UTC

This is the expected behavior for OpenShift. The Kubelet is in control of the signals passed to the runtime. The Kubelet sends SIGTERM to allow for a graceful shutdown; if the process has not shut down within terminationGracePeriodSeconds (default 30 seconds), the kubelet sends SIGKILL. The STOPSIGNAL in the Dockerfile will not change this behavior.

Comment 5 Szymon.Sawis 2021-09-13 17:59:57 UTC

(In reply to Ryan Phillips from comment #4)
> This is the expected behavior for OpenShift. The Kubelet is in control of
> the signals passed to the runtime. The Kubelet sends SIGTERM to
> allow for a graceful shutdown; if the process has not shut down within
> terminationGracePeriodSeconds (default 30 seconds), the kubelet sends
> SIGKILL. The STOPSIGNAL in the Dockerfile will not change this behavior.

Is there any way to force a different signal?

Comment 6 Ezequiel Hector Brizuela 2021-09-13 20:05:57 UTC

(In reply to Ryan Phillips from comment #4)
> This is the expected behavior for OpenShift. The Kubelet is in control of
> the signals passed to the runtime. The Kubelet sends SIGTERM to
> allow for a graceful shutdown; if the process has not shut down within
> terminationGracePeriodSeconds (default 30 seconds), the kubelet sends
> SIGKILL. The STOPSIGNAL in the Dockerfile will not change this behavior.

The problem is that, when the example runs on OpenShift, the first signal sent is not the one specified by the STOPSIGNAL value, whereas the same example is handled correctly when run in podman. The rest (SIGKILL after the grace period) is the standard behaviour that avoids keeping a pod that is stuck and not responding to signals.


To be a bit more clear:

- What is the current behavior:
The kubelet sends SIGTERM, waits for terminationGracePeriodSeconds, and then sends SIGKILL.

- What is the expected behavior:
The kubelet sends the signal set as STOPSIGNAL (e.g. SIGRTMIN+3 or any other one), waits for terminationGracePeriodSeconds, and then sends SIGKILL.
With my example, the pod logs will show "Trapped: <STOPSIGNAL-VALUE>", and after the grace period the pod will receive SIGKILL and die.
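
The expected sequence above can be sketched in bash. This is only an illustration under assumptions: `stop_with_grace` is a hypothetical helper written for this comment, not kubelet or CRI-O code, and it polls once per second rather than doing whatever the runtime actually does.

```shell
#!/bin/bash
# Hypothetical helper (not kubelet/CRI-O code): send the configured stop
# signal, wait up to a grace period, then fall back to SIGKILL.
# Returns 0 if the process exited within the grace period, 1 if it was killed.
stop_with_grace() {
    local pid=$1 stop_signal=$2 grace_seconds=$3
    local waited=0

    # If the signal cannot be delivered, the process is already gone.
    kill -s "$stop_signal" "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the grace period ends.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace_seconds" ]; then
            kill -s KILL "$pid" 2>/dev/null || true
            echo "grace period expired, sent SIGKILL to $pid"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "process $pid exited after $stop_signal"
    return 0
}
```

For the reproducer, a call such as `stop_with_grace "$pid" RTMIN+3 30` would mirror the expected behavior, while passing `TERM` mirrors what was observed on OCP.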


I dug a bit into the upstream k8s project and found this:

https://github.com/kubernetes/kubernetes/issues/30051

It mentions that this is supposed to be supported when using docker. Sadly, I haven't had the chance to set up a test environment with OpenShift for this, because the lab is currently a bit unstable.

Comment 7 Ryan Phillips 2021-09-28 20:24:26 UTC

OpenShift uses CRI-O as the container runtime. The STOPSIGNAL behavior is a side effect of Docker and is not really supported by Kubernetes.

Going to close this for now, since it is working as designed.

Comment 9 Ezequiel Hector Brizuela 2021-09-29 17:27:33 UTC

Checking with the customer, the image contains the correct StopSignal definition:

# sudo podman inspect f8c7cfe0750e | grep StopSignal
            "StopSignal": "SIGRTMIN+3"

STOPSIGNAL is part of the OCI image-spec:

https://github.com/opencontainers/image-spec/blob/main/config.md

That specification is supported by CRI-O. The Kubernetes documentation, under Pod Lifecycle - Termination of Pods, says:

"Typically, the container runtime sends a TERM signal to the main process in each container. Many container runtimes respect the STOPSIGNAL value defined in the container image and send this instead of TERM. Once the grace period has expired, the KILL signal is sent to any remaining processes, and the Pod is then deleted from the API Server." 
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/

Apparently all the scaffolding to support this behaviour is in place in CRI-O, and the Kubernetes documentation says that runtimes honour the customized signal. I brought this to the Slack chat for clarification, and it looks like the issue could be caused by the specific signal used in this case, as it is outside the set of signals currently supported by the Go package used to parse and send the signal.

I will ask the customer to test with a signal that is defined in that package and will update the bz with the results.

Comment 10 Peter Hunt 2021-09-29 17:32:16 UTC

fixed in attached PR

Comment 11 Ezequiel Hector Brizuela 2021-10-01 22:00:08 UTC

JFTR: I confirmed with the customer that the issue is isolated to signals above 31.
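
As a quick way to check whether a given STOPSIGNAL value falls in the affected range, something like the following bash sketch could be used. The helper names `signal_number` and `is_above_31` are made up for this comment, and the name lookup relies on bash's `kill -l` builtin, so the number reported for a real-time signal depends on the platform.

```shell
#!/bin/bash
# Hypothetical helpers for illustration only.

# Print the signal number for a name (with or without the SIG prefix)
# or pass a numeric value through unchanged.
signal_number() {
    local spec=${1#SIG}
    case $spec in
        ''|*[!0-9]*) kill -l "$spec" 2>/dev/null ;;  # name -> number via bash
        *)           echo "$spec" ;;                  # already a number
    esac
}

# Succeed if the signal number is above 31, the range this bug is isolated to.
# Returns 2 if the signal name is not known to the shell.
is_above_31() {
    local num
    num=$(signal_number "$1") || return 2
    [ "$num" -gt 31 ]
}
```

For example, `is_above_31 TERM` fails because SIGTERM is 15, while `is_above_31 RTMIN+3` is expected to succeed on a typical Linux system.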

Comment 12 Peter Hunt 2021-10-14 15:32:30 UTC

PR merged

Comment 21 errata-xmlrpc 2022-03-10 16:07:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056