Bug 1800107 - Failed to stop container running with ubi-init: crio error unknown signal "RTMIN+3"
Summary: Failed to stop container running with ubi-init: crio error unknown signal "RTMIN+3"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 4.2.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Jindrich Novy
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-06 20:01 UTC by shiva merla
Modified: 2020-05-13 21:56 UTC
CC: 9 users

Fixed In Version: rhaos-4.4-rhel-7/cri-o-1.17.0-6.dev.rhaos4.4.gitdd5a702.el7 rhaos-4.4-rhel-8/cri-o-1.17.0-6.dev.rhaos4.4.gitdd5a702.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 21:56:52 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-13 21:56:54 UTC)

Description shiva merla 2020-02-06 20:01:06 UTC
Description of problem:
Containers running the ubi7/ubi-init:latest image fail to stop.

  Warning  FailedKillPod  2m5s (x5 over 2m34s)  kubelet, hiqa-win20.hulk.sos42.ns  error killing pod: [failed to "KillContainer" for "hpe-csi-driver" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>"
, failed to "KillPodSandbox" for "509e55f7-4911-11ea-8cff-100c29227811" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container k8s_hpe-csi-driver_hpe-csi-controller-7f9dfb8f8c-ps2bk_hpe-csi_509e55f7-4911-11ea-8cff-100c29227811_0 in pod sandbox 6621fdd684b8c125ec4495d61d41676bbabca3af8465044b860f107360055a0f: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>"
]

crio log:

Feb 06 18:51:03 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:03 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:04 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:04 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:05 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:05 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"


Version-Release number of selected component (if applicable):
[root@installer hpe-csi-operator]# oc version
Client Version: openshift-clients-4.2.0-201910041700
Server Version: 4.2.0
Kubernetes Version: v1.14.6+2e5ed54

[root@hiqa-win20 ~]# cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20191010.0"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20191010.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20191010.0

[root@hiqa-win20 ~]# rpm -qa | grep runc
runc-1.0.0-61.rc8.rhaos4.2.git3cbe540.el8.x86_64

How reproducible:
Consistent

Steps to Reproduce:
1. Create a pod with a container running the ubi7/ubi-init image
2. Stop the pod
3. The pod is stuck in the Terminating state with the same FailedKillPod error shown above, even though the container is still present (crictl ps shows it)


Actual results:

The pod is stuck in the Terminating state with the FailedKillPod error shown in the Description.


Expected results:

Pod termination should complete and the container should be killed.


Additional info:

Dockerfile:

Comment 1 Tom Sweeney 2020-02-07 16:25:55 UTC
Urvashi, could you take a peek at this please?

Comment 2 Daniel Walsh 2020-02-07 17:51:17 UTC
RTMIN+3 is the default stop signal that systemd (the init process in ubi-init images) needs to be sent in order to shut down properly.
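
For reference, a minimal Go sketch of what that name resolves to numerically on Linux. This is illustrative only, not cri-o code; the value 34 for SIGRTMIN follows the glibc convention:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// glibc reserves the kernel's first two real-time signals (32 and 33)
	// for its threading implementation, so user-visible SIGRTMIN is 34.
	const sigrtmin = 34
	stop := syscall.Signal(sigrtmin + 3) // "RTMIN+3"
	// systemd treats SIGRTMIN+3 as a request to halt, which is why
	// init-style images such as ubi-init use it as their stop signal.
	fmt.Printf("RTMIN+3 = signal %d\n", int(stop)) // prints 37 on Linux
}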

Comment 3 shiva merla 2020-02-07 18:03:10 UTC
Any idea why CRI-O is complaining that it's an unknown signal, then?

Comment 4 Daniel Walsh 2020-02-07 18:08:06 UTC
Nope. I did some quick grepping and it is defined in the code base, but I am not sure where the error is coming from. Will let Urvashi look closer.
Just wanted to note here where the signal is coming from.

Comment 5 Urvashi Mohnani 2020-02-11 16:25:39 UTC
The fix is in https://github.com/cri-o/cri-o/pull/3249. Will port it to the various cri-o versions once it is in.
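
For context, the failure mode here is a signal-name parser that knows the classic names but not the real-time RTMIN+<n> form. Below is a minimal Go sketch of the kind of parsing involved; it is illustrative only, not the actual code from the PR, and parseSignal plus the abbreviated name table are assumptions:

package main

import (
	"fmt"
	"strconv"
	"strings"
	"syscall"
)

// sigrtmin is the first user-visible Linux real-time signal under glibc.
const sigrtmin = 34

// parseSignal resolves names like "TERM", "SIGKILL", or "RTMIN+3".
func parseSignal(raw string) (syscall.Signal, error) {
	name := strings.TrimPrefix(strings.ToUpper(raw), "SIG")
	// Real-time signals are spelled RTMIN or RTMIN+<n>.
	if strings.HasPrefix(name, "RTMIN") {
		off := 0
		if rest := strings.TrimPrefix(name, "RTMIN"); rest != "" {
			n, err := strconv.Atoi(strings.TrimPrefix(rest, "+"))
			if err != nil {
				return 0, fmt.Errorf("unknown signal %q", raw)
			}
			off = n
		}
		return syscall.Signal(sigrtmin + off), nil
	}
	// A parser that stops at a fixed table like this one is exactly what
	// produces `unknown signal "RTMIN+3"` (table abbreviated for the sketch).
	table := map[string]syscall.Signal{
		"INT":  syscall.SIGINT,
		"TERM": syscall.SIGTERM,
		"KILL": syscall.SIGKILL,
	}
	if sig, ok := table[name]; ok {
		return sig, nil
	}
	return 0, fmt.Errorf("unknown signal %q", raw)
}

func main() {
	sig, err := parseSignal("RTMIN+3")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("RTMIN+3 -> %d\n", int(sig)) // 37 on Linux
}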

Comment 6 shiva merla 2020-02-11 23:19:32 UTC
Thanks for the update. Good to know that the fix has been identified. In the meantime we are trying ubi-minimal with the systemd package installed to work around this issue.

Comment 7 Urvashi Mohnani 2020-02-13 18:22:23 UTC
The fix was merged and back-ported to all the release branches. cri-o builds that include the fix should be available at https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=63415

Comment 8 Tom Sweeney 2020-02-13 18:54:54 UTC
Setting to Post and assigning to Jindrich for final kit needs.

Comment 13 weiwei jiang 2020-03-02 07:59:09 UTC
Checked with 44.81.202003010930-0; the cri-o build with the fix for RHEL 8 is not yet included in RHCOS.

[core@wjio163021-mhndz-master-2 ~]$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76c613847ff18b5d5f172591a6f539ceedc1a301030b08d29f961012d4124db6
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202003010930-0 (2020-03-01T09:35:50Z)

[core@wjio163021-mhndz-master-2 ~]$ rpm -qa|grep -i cri-o
cri-o-1.17.0-4.dev.rhaos4.4.gitc3436cc.el8.x86_64

Comment 19 weiwei jiang 2020-03-11 07:55:10 UTC
Checked with cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7 and cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8.
The pods are now deleted without error messages.


==========> cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7
$ oc get nodes -o wide                                                                                                                                                                                                                                                        
NAME                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME                                                         
yanyan-2q265-m-0.c.openshift-qe.internal   Ready    master   36m   v1.17.1   10.0.0.5                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-1.c.openshift-qe.internal   Ready    master   36m   v1.17.1   10.0.0.4                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8                              
yanyan-2q265-m-2.c.openshift-qe.internal   Ready    master   35m   v1.17.1   10.0.0.6                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8                              
yanyan-2q265-w-a-l-rhel-0                  Ready    worker   86s   v1.17.1   10.0.32.5                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7                              
yanyan-2q265-w-a-l-rhel-1                  Ready    worker   70s   v1.17.1   10.0.32.4                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7               

$ oc run poc --image=registry.access.redhat.com/ubi7/ubi-init:latest                                                                                                                                                                                                    
kubectl run --generator=deploymentconfig/v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.                                                                                                            
deploymentconfig.apps.openshift.io/poc created 

$ oc get pods -o wide                                             
NAME           READY   STATUS      RESTARTS   AGE   IP           NODE                        NOMINATED NODE   READINESS GATES                                                                                                                                                   
poc-1-deploy   0/1     Completed   0          19s   10.130.2.7   yanyan-2q265-w-a-l-rhel-1   <none>           <none>                                                                                                                                                            
poc-1-dkf2p    1/1     Running     0          15s   10.130.2.8   yanyan-2q265-w-a-l-rhel-1   <none>           <none>    

$ oc delete pods poc-1-dkf2p --wait=false                                                                                                                                                                                                     
pod "poc-1-dkf2p" deleted
$ oc get pods -o wide                                                                                                                                                                                                                                                         
NAME           READY   STATUS        RESTARTS   AGE   IP           NODE                        NOMINATED NODE   READINESS GATES                                                                                                                                                 
poc-1-deploy   0/1     Completed     0          96s   10.130.2.7   yanyan-2q265-w-a-l-rhel-1   <none>           <none>                                                                                                                                                          
poc-1-dkf2p    1/1     Terminating   0          92s   10.130.2.8   yanyan-2q265-w-a-l-rhel-1   <none>           <none>                                                                                                                                                          
poc-1-zsjhk    1/1     Running       0          30s   10.130.2.9   yanyan-2q265-w-a-l-rhel-1   <none>           <none>     
$ oc get pods                                                                                                                                                                                                                                                                 
NAME           READY   STATUS      RESTARTS   AGE                                                                                                                                                                                                                               
poc-1-deploy   0/1     Completed   0          2m16s                                                                                                                                                                                                                             
poc-1-zsjhk    1/1     Running     0          70s    


=============> cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
$ oc get nodes -o wide                                                                                                                                                                                                                                                        
NAME                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME                                                         
yanyan-2q265-m-0.c.openshift-qe.internal   Ready    master   57m   v1.17.1   10.0.0.5                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8                              
yanyan-2q265-m-1.c.openshift-qe.internal   Ready    master   57m   v1.17.1   10.0.0.4                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8                              
yanyan-2q265-m-2.c.openshift-qe.internal   Ready    master   55m   v1.17.1   10.0.0.6                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8                   
$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
poc-2-deploy   0/1     Completed   0          10m     10.129.0.34   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>
poc-2-vhctl    1/1     Running     0          9m17s   10.129.0.35   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>
$ oc delete pods poc-2-vhctl --wait=false
pod "poc-2-vhctl" deleted                                                                                                                                                                                                                               
$ oc get pods -o wide 
NAME           READY   STATUS      RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
poc-2-6tpxr    1/1     Running     0          34s   10.128.0.52   yanyan-2q265-m-0.c.openshift-qe.internal   <none>           <none>
poc-2-deploy   0/1     Completed   0          12m   10.129.0.34   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>

Comment 21 errata-xmlrpc 2020-05-13 21:56:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

