Description of problem:

Failed to stop any container running the ubi7/ubi-init:latest image.

  Warning  FailedKillPod  2m5s (x5 over 2m34s)  kubelet, hiqa-win20.hulk.sos42.ns  error killing pod: [failed to "KillContainer" for "hpe-csi-driver" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>", failed to "KillPodSandbox" for "509e55f7-4911-11ea-8cff-100c29227811" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container k8s_hpe-csi-driver_hpe-csi-controller-7f9dfb8f8c-ps2bk_hpe-csi_509e55f7-4911-11ea-8cff-100c29227811_0 in pod sandbox 6621fdd684b8c125ec4495d61d41676bbabca3af8465044b860f107360055a0f: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>"]

crio log:

Feb 06 18:51:03 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:03 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:04 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:04 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:05 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:05 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"

Version-Release number of selected component (if applicable):

[root@installer hpe-csi-operator]# oc version
Client Version: openshift-clients-4.2.0-201910041700
Server Version: 4.2.0
Kubernetes Version: v1.14.6+2e5ed54

[root@hiqa-win20 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20191010.0"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20191010.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20191010.0

[root@hiqa-win20 ~]# rpm -qa | grep runc
runc-1.0.0-61.rc8.rhaos4.2.git3cbe540.el8.x86_64

How reproducible:

Consistent

Steps to Reproduce:
1. Create a pod with a container running the ubi7/ubi-init image.
2. Stop the pod.
3. The pod gets stuck in the Terminating state with the FailedKillPod event shown above, even though the container is still present (crictl ps shows it).

Actual results:

The pod is stuck in the Terminating state with the following error:

  Warning  FailedKillPod  2m5s (x5 over 2m34s)  kubelet, hiqa-win20.hulk.sos42.ns  error killing pod: [failed to "KillContainer" for "hpe-csi-driver" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>", failed to "KillPodSandbox" for "509e55f7-4911-11ea-8cff-100c29227811" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container k8s_hpe-csi-driver_hpe-csi-controller-7f9dfb8f8c-ps2bk_hpe-csi_509e55f7-4911-11ea-8cff-100c29227811_0 in pod sandbox 6621fdd684b8c125ec4495d61d41676bbabca3af8465044b860f107360055a0f: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>"]

Expected results:

Pod termination goes through and the container is killed.

Additional info:

Dockerfile:
Urvashi, could you take a peek at this, please?
RTMIN+3 is the default signal that systemd (which ubi-init runs) needs to be sent in order to shut down properly.
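For context, "RTMIN+3" is not a fixed entry in the kernel's signal table; it has to be resolved as an offset from the platform's first real-time signal, which is why naive name lookups can reject it. A quick sanity check (Python used purely for illustration):

```python
import signal

# systemd's default halt signal is SIGRTMIN+3. SIGRTMIN itself is not a
# fixed number: the kernel starts real-time signals at 32, but glibc
# reserves the first few for its threading runtime, so on Linux SIGRTMIN
# is typically 34 and SIGRTMIN+3 therefore resolves to 37.
print("SIGRTMIN   =", int(signal.SIGRTMIN))
print("SIGRTMIN+3 =", int(signal.SIGRTMIN) + 3)
```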
Any idea why CRI-O is complaining that it's an unknown signal, then?
Nope. I did some quick grepping and the signal is defined in the code base, but I'm not sure where the error is coming from. I'll let Urvashi look closer; I just wanted to note here where the signal comes from.
Fix is in https://github.com/cri-o/cri-o/pull/3249. Will port to the various cri-o versions once this is in.
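For anyone curious about the failure mode rather than the patch itself: stopping a container means translating the image's STOPSIGNAL string into a number before sending it, and names like "RTMIN+3" need a special case because they are offsets, not entries in the fixed signal-name table. A rough, hypothetical sketch of such a parser (in Python for brevity; the actual fix in cri-o/cri-o#3249 is Go, and this is not that code):

```python
import signal

def parse_stop_signal(name: str) -> int:
    """Resolve names like "TERM", "SIGKILL", or "RTMIN+3" to a number.

    Hypothetical helper for illustration only.
    """
    name = name.upper()
    if name.startswith("SIG"):
        name = name[3:]
    # Real-time signals are written as offsets from SIGRTMIN/SIGRTMAX,
    # so they cannot be looked up in a fixed name table.
    for base in ("RTMIN", "RTMAX"):
        if name.startswith(base):
            offset = int(name[len(base):] or "0")  # "RTMIN+3" -> 3
            return int(getattr(signal, "SIG" + base)) + offset
    return int(getattr(signal, "SIG" + name))

print(parse_stop_signal("TERM"), parse_stop_signal("RTMIN+3"))
```

A lookup that skips the RTMIN/RTMAX branch is exactly the kind of code that would report "unknown signal" for an image whose STOPSIGNAL is RTMIN+3.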
Thanks for the update. Good to know that a fix has been identified. In the meantime we are trying ubi-minimal with the systemd package installed as a workaround for this issue.
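For reference, a minimal sketch of that kind of workaround image (untested; package availability and behavior under the default stop signal are assumptions on my part) might look like:

```dockerfile
FROM registry.access.redhat.com/ubi7/ubi-minimal:latest

# ubi-minimal ships microdnf rather than yum.
RUN microdnf install systemd && microdnf clean all

# Deliberately no STOPSIGNAL here: SIGTERM won't shut systemd down
# gracefully, but it avoids tripping the RTMIN+3 parsing bug, and the
# kubelet's SIGKILL after the grace period still terminates the pod.

CMD ["/sbin/init"]
```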
The fix was merged and back-ported to all the release branches. cri-o builds with the fix should be available at https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=63415
Setting to Post and assigning to Jindrich for final kit needs.
Checked with 44.81.202003010930-0; the fixed cri-o for RHEL 8 is not included in RHCOS yet.

[core@wjio163021-mhndz-master-2 ~]$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76c613847ff18b5d5f172591a6f539ceedc1a301030b08d29f961012d4124db6
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202003010930-0 (2020-03-01T09:35:50Z)

[core@wjio163021-mhndz-master-2 ~]$ rpm -qa|grep -i cri-o
cri-o-1.17.0-4.dev.rhaos4.4.gitc3436cc.el8.x86_64
Checked with cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7 and cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8. The pods are deleted without error messages now.

==========> cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7

$ oc get nodes -o wide
NAME                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
yanyan-2q265-m-0.c.openshift-qe.internal   Ready    master   36m   v1.17.1   10.0.0.5                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-1.c.openshift-qe.internal   Ready    master   36m   v1.17.1   10.0.0.4                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-2.c.openshift-qe.internal   Ready    master   35m   v1.17.1   10.0.0.6                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-w-a-l-rhel-0                  Ready    worker   86s   v1.17.1   10.0.32.5                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7
yanyan-2q265-w-a-l-rhel-1                  Ready    worker   70s   v1.17.1   10.0.32.4                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7

$ oc run poc --image=registry.access.redhat.com/ubi7/ubi-init:latest
kubectl run --generator=deploymentconfig/v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deploymentconfig.apps.openshift.io/poc created

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE   IP           NODE                        NOMINATED NODE   READINESS GATES
poc-1-deploy   0/1     Completed   0          19s   10.130.2.7   yanyan-2q265-w-a-l-rhel-1   <none>           <none>
poc-1-dkf2p    1/1     Running     0          15s   10.130.2.8   yanyan-2q265-w-a-l-rhel-1   <none>           <none>

$ oc delete pods poc-1-dkf2p --wait=false
pod "poc-1-dkf2p" deleted

$ oc get pods -o wide
NAME           READY   STATUS        RESTARTS   AGE   IP           NODE                        NOMINATED NODE   READINESS GATES
poc-1-deploy   0/1     Completed     0          96s   10.130.2.7   yanyan-2q265-w-a-l-rhel-1   <none>           <none>
poc-1-dkf2p    1/1     Terminating   0          92s   10.130.2.8   yanyan-2q265-w-a-l-rhel-1   <none>           <none>
poc-1-zsjhk    1/1     Running       0          30s   10.130.2.9   yanyan-2q265-w-a-l-rhel-1   <none>           <none>

$ oc get pods
NAME           READY   STATUS      RESTARTS   AGE
poc-1-deploy   0/1     Completed   0          2m16s
poc-1-zsjhk    1/1     Running     0          70s

=============> cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8

$ oc get nodes -o wide
NAME                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
yanyan-2q265-m-0.c.openshift-qe.internal   Ready    master   57m   v1.17.1   10.0.0.5                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-1.c.openshift-qe.internal   Ready    master   57m   v1.17.1   10.0.0.4                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-2.c.openshift-qe.internal   Ready    master   55m   v1.17.1   10.0.0.6                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
poc-2-deploy   0/1     Completed   0          10m     10.129.0.34   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>
poc-2-vhctl    1/1     Running     0          9m17s   10.129.0.35   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>

$ oc delete pods poc-2-vhctl --wait=false
pod "poc-2-vhctl" deleted

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
poc-2-6tpxr    1/1     Running     0          34s   10.128.0.52   yanyan-2q265-m-0.c.openshift-qe.internal   <none>           <none>
poc-2-deploy   0/1     Completed   0          12m   10.129.0.34   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581