Description of problem:

Failed to stop any container running the ubi7/ubi-init:latest image.

  Warning  FailedKillPod  2m5s (x5 over 2m34s)  kubelet, hiqa-win20.hulk.sos42.ns  error killing pod: [failed to "KillContainer" for "hpe-csi-driver" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>", failed to "KillPodSandbox" for "509e55f7-4911-11ea-8cff-100c29227811" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container k8s_hpe-csi-driver_hpe-csi-controller-7f9dfb8f8c-ps2bk_hpe-csi_509e55f7-4911-11ea-8cff-100c29227811_0 in pod sandbox 6621fdd684b8c125ec4495d61d41676bbabca3af8465044b860f107360055a0f: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>"]

crio log:

Feb 06 18:51:03 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:03 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:04 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:04 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:05 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"
Feb 06 18:51:05 hiqa-win20.hulk.sos42.ns crio[2158]: unknown signal "RTMIN+3"

Version-Release number of selected component (if applicable):

[root@installer hpe-csi-operator]# oc version
Client Version: openshift-clients-4.2.0-201910041700
Server Version: 4.2.0
Kubernetes Version: v1.14.6+2e5ed54

[root@hiqa-win20 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
VERSION="42.80.20191010.0"
VERSION_ID="4.2"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 42.80.20191010.0 (Ootpa)"
ID="rhcos"
ID_LIKE="rhel fedora"
ANSI_COLOR="0;31"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.2"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.2"
OSTREE_VERSION=42.80.20191010.0

[root@hiqa-win20 ~]# rpm -qa | grep runc
runc-1.0.0-61.rc8.rhaos4.2.git3cbe540.el8.x86_64

How reproducible:

Consistent

Steps to Reproduce:
1. Create a pod with a container running the ubi7/ubi-init image.
2. Stop the pod.
3. The pod gets stuck in the Terminating state with the FailedKillPod event shown above, even though the container is still present (crictl ps shows it).

Actual results:

The pod is stuck in the Terminating state with the following error:

  Warning  FailedKillPod  2m5s (x5 over 2m34s)  kubelet, hiqa-win20.hulk.sos42.ns  error killing pod: [failed to "KillContainer" for "hpe-csi-driver" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>", failed to "KillPodSandbox" for "509e55f7-4911-11ea-8cff-100c29227811" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container k8s_hpe-csi-driver_hpe-csi-controller-7f9dfb8f8c-ps2bk_hpe-csi_509e55f7-4911-11ea-8cff-100c29227811_0 in pod sandbox 6621fdd684b8c125ec4495d61d41676bbabca3af8465044b860f107360055a0f: failed to stop container \"e58f290b5993e54c9904005bd06ac6ed00d466f1b8aa1564364cfafc4ab9564b\": failed to find process: <nil>"]

Expected results:

Pod termination goes through and the container is killed.

Additional info:

Dockerfile:
Urvashi, could you take a peek at this, please?
RTMIN+3 is the default signal that systemd (which ubi-init runs) needs to be sent in order to shut down properly.
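For context, "RTMIN+3" is not a fixed entry in the kernel's signal table; it has to be resolved as an offset from the platform's first real-time signal, which is why naive name lookups can reject it. A quick sanity check (Python used purely for illustration):

```python
import signal

# systemd's default halt signal is SIGRTMIN+3. SIGRTMIN itself is not a
# fixed number: the kernel starts real-time signals at 32, but glibc
# reserves the first few for its threading runtime, so on Linux SIGRTMIN
# is typically 34 and SIGRTMIN+3 therefore resolves to 37.
print("SIGRTMIN   =", int(signal.SIGRTMIN))
print("SIGRTMIN+3 =", int(signal.SIGRTMIN) + 3)
```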
Any idea why CRI-O is complaining that it's an unknown signal, then?
Nope. I did some quick grepping and the signal is defined in the code base, but I'm not sure where the error is coming from. I'll let Urvashi look closer; I just wanted to note here where the signal comes from.
Fix is in https://github.com/cri-o/cri-o/pull/3249. Will port to the various cri-o versions once this is in.
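For anyone curious about the failure mode rather than the patch itself: stopping a container means translating the image's STOPSIGNAL string into a number before sending it, and names like "RTMIN+3" need a special case because they are offsets, not entries in the fixed signal-name table. A rough, hypothetical sketch of such a parser (in Python for brevity; the actual fix in cri-o/cri-o#3249 is Go, and this is not that code):

```python
import signal

def parse_stop_signal(name: str) -> int:
    """Resolve names like "TERM", "SIGKILL", or "RTMIN+3" to a number.

    Hypothetical helper for illustration only.
    """
    name = name.upper()
    if name.startswith("SIG"):
        name = name[3:]
    # Real-time signals are written as offsets from SIGRTMIN/SIGRTMAX,
    # so they cannot be looked up in a fixed name table.
    for base in ("RTMIN", "RTMAX"):
        if name.startswith(base):
            offset = int(name[len(base):] or "0")  # "RTMIN+3" -> 3
            return int(getattr(signal, "SIG" + base)) + offset
    return int(getattr(signal, "SIG" + name))

print(parse_stop_signal("TERM"), parse_stop_signal("RTMIN+3"))
```

A lookup that skips the RTMIN/RTMAX branch is exactly the kind of code that would report "unknown signal" for an image whose STOPSIGNAL is RTMIN+3.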
Thanks for the update. Good to know that a fix has been identified. In the meantime we are trying ubi-minimal with the systemd package installed as a workaround for this issue.
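For reference, a minimal sketch of that kind of workaround image (untested; package availability and behavior under the default stop signal are assumptions on my part) might look like:

```dockerfile
FROM registry.access.redhat.com/ubi7/ubi-minimal:latest

# ubi-minimal ships microdnf rather than yum.
RUN microdnf install systemd && microdnf clean all

# Deliberately no STOPSIGNAL here: SIGTERM won't shut systemd down
# gracefully, but it avoids tripping the RTMIN+3 parsing bug, and the
# kubelet's SIGKILL after the grace period still terminates the pod.

CMD ["/sbin/init"]
```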
The fix was merged and back-ported to all the release branches. cri-o builds with the fix should be available at https://brewweb.engineering.redhat.com/brew/packageinfo?packageID=63415
Setting to Post and assigning to Jindrich for final kit needs.
Checked with 44.81.202003010930-0; the fixed cri-o for RHEL 8 is not included in RHCOS yet.

[core@wjio163021-mhndz-master-2 ~]$ rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76c613847ff18b5d5f172591a6f539ceedc1a301030b08d29f961012d4124db6
              CustomOrigin: Managed by machine-config-operator
                   Version: 44.81.202003010930-0 (2020-03-01T09:35:50Z)

[core@wjio163021-mhndz-master-2 ~]$ rpm -qa|grep -i cri-o
cri-o-1.17.0-4.dev.rhaos4.4.gitc3436cc.el8.x86_64
Checked with cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7 and cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8. The pods are deleted without error messages now.

==========> cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7

$ oc get nodes -o wide
NAME                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
yanyan-2q265-m-0.c.openshift-qe.internal   Ready    master   36m   v1.17.1   10.0.0.5                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-1.c.openshift-qe.internal   Ready    master   36m   v1.17.1   10.0.0.4                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-2.c.openshift-qe.internal   Ready    master   35m   v1.17.1   10.0.0.6                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-w-a-l-rhel-0                  Ready    worker   86s   v1.17.1   10.0.32.5                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7
yanyan-2q265-w-a-l-rhel-1                  Ready    worker   70s   v1.17.1   10.0.32.4                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1062.12.1.el7.x86_64   cri-o://1.17.0-9.dev.rhaos4.4.gitdfc8414.el7

$ oc run poc --image=registry.access.redhat.com/ubi7/ubi-init:latest
kubectl run --generator=deploymentconfig/v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deploymentconfig.apps.openshift.io/poc created

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE   IP           NODE                        NOMINATED NODE   READINESS GATES
poc-1-deploy   0/1     Completed   0          19s   10.130.2.7   yanyan-2q265-w-a-l-rhel-1   <none>           <none>
poc-1-dkf2p    1/1     Running     0          15s   10.130.2.8   yanyan-2q265-w-a-l-rhel-1   <none>           <none>

$ oc delete pods poc-1-dkf2p --wait=false
pod "poc-1-dkf2p" deleted

$ oc get pods -o wide
NAME           READY   STATUS        RESTARTS   AGE   IP           NODE                        NOMINATED NODE   READINESS GATES
poc-1-deploy   0/1     Completed     0          96s   10.130.2.7   yanyan-2q265-w-a-l-rhel-1   <none>           <none>
poc-1-dkf2p    1/1     Terminating   0          92s   10.130.2.8   yanyan-2q265-w-a-l-rhel-1   <none>           <none>
poc-1-zsjhk    1/1     Running       0          30s   10.130.2.9   yanyan-2q265-w-a-l-rhel-1   <none>           <none>

$ oc get pods
NAME           READY   STATUS      RESTARTS   AGE
poc-1-deploy   0/1     Completed   0          2m16s
poc-1-zsjhk    1/1     Running     0          70s

=============> cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8

$ oc get nodes -o wide
NAME                                       STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
yanyan-2q265-m-0.c.openshift-qe.internal   Ready    master   57m   v1.17.1   10.0.0.5                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-1.c.openshift-qe.internal   Ready    master   57m   v1.17.1   10.0.0.4                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8
yanyan-2q265-m-2.c.openshift-qe.internal   Ready    master   55m   v1.17.1   10.0.0.6                    Red Hat Enterprise Linux CoreOS 44.81.202003101735-0 (Ootpa)   4.18.0-147.5.1.el8_1.x86_64   cri-o://1.17.0-8.dev.rhaos4.4.git36920a5.el8

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
poc-2-deploy   0/1     Completed   0          10m     10.129.0.34   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>
poc-2-vhctl    1/1     Running     0          9m17s   10.129.0.35   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>

$ oc delete pods poc-2-vhctl --wait=false
pod "poc-2-vhctl" deleted

$ oc get pods -o wide
NAME           READY   STATUS      RESTARTS   AGE   IP            NODE                                       NOMINATED NODE   READINESS GATES
poc-2-6tpxr    1/1     Running     0          34s   10.128.0.52   yanyan-2q265-m-0.c.openshift-qe.internal   <none>           <none>
poc-2-deploy   0/1     Completed   0          12m   10.129.0.34   yanyan-2q265-m-1.c.openshift-qe.internal   <none>           <none>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581