Description of problem:

The master-api pod goes into CreateContainerError state after a hard reboot of the node. The customer is using CRI-O as the container runtime. The node log shows the following error:

~~~~~~
Feb 14 08:50:03 sdccsi-sesosm01 atomic-openshift-node[46337]: E0214 08:50:03.810919 46337 remote_runtime.go:187] CreateContainer in sandbox "11dfc686ca0151f6ede51f0d4a8f9a6373f7039a9a8a972d0d7b45bfd17478ce" from runtime service failed: rpc error: code = Unknown desc = that name is already in use "
~~~~~~

Actual results:

The master-api pod does not start after a hard reboot of the node.

Expected results:

Pods come back up and running after a hard reboot of a node using the CRI-O runtime.

Additional info:

The following are the CRI-O package versions:

cri-o-1.11.5-2.rhaos3.11.git1c8a4b1.el7.x86_64
cri-tools-1.11.1-1.rhaos3.11.gitedabfb5.el7_5.x86_64
criu-3.5-4.el7.x86_64

This matches a known upstream bug [1].

[1] https://github.com/kubernetes-sigs/cri-o/issues/1742
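The "that name is already in use" error points at the container name reservations CRI-O keeps in its on-disk storage: a name is reserved when the container record is created and released only when the record is deleted, so a record left half-written by a hard reboot keeps its name reserved even though the container itself can no longer be loaded (see the fix description in the next comment). Below is a minimal, self-contained Go sketch of that reservation pattern; every type, function, and the second ID are hypothetical, not taken from the CRI-O code:

~~~~~~
package main

import (
	"errors"
	"fmt"
)

// errNameReserved mirrors the error the kubelet surfaces as
// "that name is already in use".
var errNameReserved = errors.New("that name is already in use")

// nameRegistry is a toy model of on-disk name reservations:
// a name maps to the container ID that holds it.
type nameRegistry struct {
	reserved map[string]string
}

func (r *nameRegistry) reserve(name, id string) error {
	if _, taken := r.reserved[name]; taken {
		return errNameReserved
	}
	r.reserved[name] = id
	return nil
}

func main() {
	reg := &nameRegistry{reserved: map[string]string{}}

	// Before the hard reboot: the container record and its name
	// reservation are written to disk.
	_ = reg.reserve("k8s_api_master-api", "11dfc686")

	// After the reboot the container cannot be loaded, but the
	// reservation survived on disk, so the kubelet's retry fails:
	if err := reg.reserve("k8s_api_master-api", "5b9eafc1"); err != nil {
		fmt.Println("CreateContainer failed:", err)
	}
}
~~~~~~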
Opened a PR against master here: https://github.com/kubernetes-sigs/cri-o/pull/2083

There may still be other ways this can fail, but the PR fixes the cases I was able to reproduce locally. If the config.json file is missing or incomplete (i.e. its write was interrupted by the hard reboot), we would previously skip loading the container and not report it to the kubelet, so the next request to create it would fail because the name was still reserved. With the patch I've proposed, if we fail to load a container or pod, we also remove it from storage.
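For illustration only, here is a minimal, runnable Go sketch of the behavior the patch describes; every type and function name is hypothetical, not the actual CRI-O code. A container whose config.json cannot be read or parsed during restore is deleted from storage, freeing its name, instead of being silently skipped:

~~~~~~
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// storedContainer is a hypothetical stand-in for a container record
// found in the storage directory on startup.
type storedContainer struct {
	ID   string
	Name string
	Dir  string // directory expected to hold config.json
}

// loadConfig reads and parses the container's config.json; after a
// hard reboot the file may be missing or truncated.
func loadConfig(c storedContainer) (map[string]interface{}, error) {
	data, err := os.ReadFile(filepath.Join(c.Dir, "config.json"))
	if err != nil {
		return nil, fmt.Errorf("reading config.json: %w", err)
	}
	var cfg map[string]interface{}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config.json: %w", err)
	}
	return cfg, nil
}

// restoreContainers shows the fixed behavior: on a load failure the
// container is removed from storage (via the injected deleteFromStorage)
// so its name reservation does not outlive it.
func restoreContainers(found []storedContainer, deleteFromStorage func(id string) error) {
	for _, c := range found {
		if _, err := loadConfig(c); err != nil {
			fmt.Printf("cannot load %s (%s): %v; removing from storage\n", c.Name, c.ID, err)
			if derr := deleteFromStorage(c.ID); derr != nil {
				fmt.Printf("failed to remove %s: %v\n", c.ID, derr)
			}
			continue
		}
		// On success the container would be registered in the runtime
		// state and reported to the kubelet as before.
	}
}

func main() {
	// One container whose config.json never made it to disk.
	dir, _ := os.MkdirTemp("", "ctr")
	defer os.RemoveAll(dir)
	restoreContainers(
		[]storedContainer{{ID: "11dfc686", Name: "master-api", Dir: dir}},
		func(id string) error { return nil }, // stubbed storage deletion
	)
}
~~~~~~

The design point is that a container the kubelet can never be told about must not leave any state behind; otherwise every subsequent CreateContainer for the same name fails.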
Verified on:

[root@qe-wjiang311-crio-mrre-1 ~]# rpm -qa | grep -i cri-
cri-o-1.11.11-1.rhaos3.11.git474f73d.el7.x86_64
cri-tools-1.11.1-1.rhaos3.11.gitedabfb5.el7_5.x86_64

# journalctl -u atomic-openshift-node --since "20 minutes ago" | grep -i remote_runtime | grep -i createcontainer
<-----------EMPTY HERE-------->

And the cluster works well after a hard reboot.

[root@qe-wjiang311-crio-mrre-1 ~]# oc get nodes -o wide
NAME                       STATUS    ROLES            AGE       VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                      KERNEL-VERSION               CONTAINER-RUNTIME
qe-wjiang311-crio-mrre-1   Ready     compute,master   1h        v1.11.0+d4cacc0   10.0.76.212   <none>        Red Hat Enterprise Linux Server 7.6 (Maipo)   3.10.0-957.10.1.el7.x86_64   cri-o://1.11.11-1.rhaos3.11.git474f73d.el7

[root@qe-wjiang311-crio-mrre-1 ~]# oc get pods --all-namespaces
NAMESPACE                           NAME                                          READY     STATUS      RESTARTS   AGE
default                             docker-registry-1-dk4fv                       1/1       Running     1          59m
default                             dockergc-jcspw                                1/1       Running     1          59m
default                             registry-console-1-smbbt                      1/1       Running     1          59m
default                             router-1-r5p4z                                1/1       Running     1          59m
install-test                        mongodb-1-692mp                               1/1       Running     1          53m
install-test                        nodejs-mongodb-example-1-build                0/1       Completed   0          53m
install-test                        nodejs-mongodb-example-1-p8whd                1/1       Running     1          52m
kube-service-catalog                apiserver-rtznh                               1/1       Running     2          57m
kube-service-catalog                controller-manager-sxgs4                      1/1       Running     6          57m
kube-system                         master-api-qe-wjiang311-crio-mrre-1           1/1       Running     1          1h
kube-system                         master-controllers-qe-wjiang311-crio-mrre-1   1/1       Running     0          1h
kube-system                         master-etcd-qe-wjiang311-crio-mrre-1          1/1       Running     1          1h
openshift-ansible-service-broker    asb-1-bgdpk                                   1/1       Running     3          55m
openshift-console                   console-5c75c46588-2mwhr                      1/1       Running     1          58m
openshift-node                      sync-trkqj                                    1/1       Running     1          1h
openshift-sdn                       ovs-dnlhm                                     1/1       Running     1          1h
openshift-sdn                       sdn-4r666                                     1/1       Running     1          1h
openshift-template-service-broker   apiserver-b8f2l                               1/1       Running     3          55m
openshift-web-console               webconsole-7b8f84dfd4-pn7w9                   1/1       Running     1          59m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636