Bug 1489358

Summary: pods failed to start while using cri-o-centos image
Product: OpenShift Container Platform
Component: Containers
Version: 3.7.0
Reporter: Gan Huang <ghuang>
Assignee: Giuseppe Scrivano <gscrivan>
QA Contact: DeShuai Ma <dma>
Status: CLOSED UPSTREAM
Severity: high
Priority: high
Keywords: TestBlocker
CC: aos-bugs, dcbw, dma, ghuang, gpei, gscrivan, jeder, jokerman, mmccomas, mpatel
Last Closed: 2017-09-20 05:48:11 UTC
Type: Bug

Description Gan Huang 2017-09-07 09:23:06 UTC
Description of problem:
Set "openshift_use_crio=True" to install OpenShift with cri-o. Installation failed at "Start the CRI-O service" during the installation.

Version-Release number of the following components:
openshift-ansible-3.7.0-0.125.0.git.0.91043b6.el7.noarch.rpm

# atomic images list
   REPOSITORY                         TAG      IMAGE ID       CREATED            VIRTUAL SIZE   TYPE      
>  docker.io/gscrivano/cri-o-centos   latest   217633c9f629   2017-09-07 01:26   374.33 MB      ostree    

RHEL-7.4

How reproducible:
always

Steps to Reproduce:
1. Set "openshift_use_crio=True" in the Ansible inventory (see the sketch below)
2. Trigger the installation
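For reference, a minimal inventory sketch (the inventory path and the [OSEv3:vars] group follow the usual openshift-ansible BYO layout and are assumptions here; only openshift_use_crio itself comes from this report):

# cat /etc/ansible/hosts
...
[OSEv3:vars]
openshift_use_crio=True
...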

Actual results:

TASK [docker : Start the CRI-O service] ****************************************
Thursday 07 September 2017  06:17:35 +0000 (0:00:00.446)       0:02:35.209 **** 
fatal: [host-8-241-61.host.centralci.eng.rdu2.redhat.com]: FAILED! => {
    "changed": false, 
    "failed": true
}

MSG:

Unable to start service cri-o: Job for cri-o.service failed because the control process exited with error code. See "systemctl status cri-o.service" and "journalctl -xe" for details.


Expected results:
No errors

Additional info:
# journalctl -u cri-o

Sep 07 02:34:25 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service: control process exited, code=exited status=1
Sep 07 02:34:25 qe-ghuang-master-etcd-nfs-1 systemd[1]: Failed to start crio daemon.
Sep 07 02:34:25 qe-ghuang-master-etcd-nfs-1 systemd[1]: Unit cri-o.service entered failed state.
Sep 07 02:34:25 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service failed.
Sep 07 02:34:26 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service holdoff time over, scheduling restart.
Sep 07 02:34:26 qe-ghuang-master-etcd-nfs-1 systemd[1]: Starting crio daemon...
Sep 07 02:34:26 qe-ghuang-master-etcd-nfs-1 runc[17681]: invalid --runtime value "stat /usr/bin/runc: no such file or directory"
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service: main process exited, code=exited, status=1/FAILURE
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 runc[17706]: container "cri-o" does not exist
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service: control process exited, code=exited status=1
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: Failed to start crio daemon.
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: Unit cri-o.service entered failed state.
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service failed.
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service holdoff time over, scheduling restart.
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: start request repeated too quickly for cri-o.service
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: Failed to start crio daemon.
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: Unit cri-o.service entered failed state.
Sep 07 02:34:27 qe-ghuang-master-etcd-nfs-1 systemd[1]: cri-o.service failed.
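The repeated "stat /usr/bin/runc: no such file or directory" line above suggests the service's --runtime value points at a path that does not exist on the host. A hedged way to confirm that on an affected node (diagnostic commands only; the actual fix is the PR in comment 1 below):

# ls -l /usr/bin/runc
# rpm -ql runc | grep bin
# systemctl cat cri-o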

Comment 1 Giuseppe Scrivano 2017-09-07 10:25:05 UTC
PR here:

https://github.com/projectatomic/atomic-system-containers/pull/110

Comment 2 Giuseppe Scrivano 2017-09-12 07:17:52 UTC
Gan, does it work now for you?  Can I close this BZ?

Comment 3 Gan Huang 2017-09-13 02:42:31 UTC
Waiting for https://github.com/openshift/openshift-ansible/pull/5354 to merge; until then we have no way to specify the upstream cri-o-centos image.

Comment 4 Giuseppe Scrivano 2017-09-13 07:22:01 UTC
@Gan, I've also tagged docker.io/gscrivano/cri-o to be the same as docker.io/gscrivano/cri-o-centos.  Does that help?
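A quick hedged way to verify the two tags point at the same image, assuming skopeo is available on the host:

# skopeo inspect docker://docker.io/gscrivano/cri-o | grep -i digest
# skopeo inspect docker://docker.io/gscrivano/cri-o-centos | grep -i digest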

Comment 5 Gan Huang 2017-09-13 09:55:43 UTC
Thanks Giuseppe, now I'm able to continue the testing with the centos image :)

Unfortunately, it seems that the cri-o service is not working well with OpenShift.

After the installation, all pods were in ContainerCreating status:

# oc get po
NAME                        READY     STATUS              RESTARTS   AGE
docker-registry-1-deploy    0/1       ContainerCreating   0          23m
registry-console-1-deploy   0/1       ContainerCreating   0          23m
router-1-deploy             0/1       ContainerCreating   0          24m

# oc describe po docker-registry-1-deploy
<--snip-->
Events:
  FirstSeen	LastSeen	Count	From									SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----									-------------	--------	------			-------
  26m		26m		1	default-scheduler									Normal		Scheduled		Successfully assigned docker-registry-1-deploy to qe-master-registry-router-nfs-etcd-1.0913-i4r.qe.rhcloud.com
  26m		26m		1	kubelet, qe-master-registry-router-nfs-etcd-1.0913-i4r.qe.rhcloud.com			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "deployer-token-489x3" 
  26m		1m		113	kubelet, qe-master-registry-router-nfs-etcd-1.0913-i4r.qe.rhcloud.com			Warning		FailedSync		Error syncing pod


The logs for the atomic-openshift-node and cri-o services will be attached.
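For anyone following along, a hedged sketch of how those logs can be collected (service names as used earlier in this report):

# journalctl -u atomic-openshift-node --no-pager > atomic-openshift-node.log
# journalctl -u cri-o --no-pager > cri-o.log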

Comment 8 Giuseppe Scrivano 2017-09-13 19:37:19 UTC
might be related to https://github.com/projectatomic/atomic-system-containers/pull/113

I have already created new builds of the images including that change.  After that, I was able to deploy a cluster that uses docker.io/gscrivano/cri-o-centos.

Comment 9 Gan Huang 2017-09-14 09:42:42 UTC
Thanks, the issue is gone when using the new build.

But it seems we're hitting another issue; confirming...

Will paste the test result.

Comment 10 Gan Huang 2017-09-14 15:28:46 UTC
Installation succeeded, but the S2I build failed:

Tested version:

# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

# uname -r
3.10.0-693.2.1.el7.x86_64

# atomic images list
   REPOSITORY                  TAG      IMAGE ID       CREATED            VIRTUAL SIZE   TYPE      
>  docker.io/gscrivano/cri-o   latest   62986b1ae28c   2017-09-14 06:01   447.2 MB       ostree    

# openshift version
openshift v3.7.0-0.126.1
kubernetes v1.7.0+80709908fd
etcd 3.2.1


1) Router and Registry are running after the installation
# oc get po
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-qqtln    1/1       Running   0          5h
registry-console-1-9h432   1/1       Running   0          5h
router-1-7k09x             1/1       Running   0          5h

2) S2I build failed:
# oc get po -n install-test
NAME                             READY     STATUS       RESTARTS   AGE
cakephp-mysql-example-1-build    0/1       Init:Error   0          5h
mongodb-1-deploy                 0/1       Error        0          5h
mysql-1-deploy                   0/1       Error        0          5h
nodejs-mongodb-example-1-build   0/1       Init:Error   0          5h

# oc describe po cakephp-mysql-example-1-build -n install-test
<--snip-->
Init Containers:
  git-clone:
    Image ID:           51fb50a7b319edfeda417db03c602c3ee9279e652e8e12b3c40b5438f5a6b042
    Port:               <none>
    Command:
      openshift-git-clone
    Args:
      --loglevel=0
    State:      Terminated
      Reason:   Error
      Message:  Cloning "https://github.com/openshift/cakephp-ex.git" ...
error: fatal: unable to access 'https://github.com/openshift/cakephp-ex.git/': Could not resolve host: github.com; Unknown error
<--snip-->

3) Unable to run "oc rsh ${pod}"
# oc rsh router-1-7k09x
Error from server: error dialing backend: dial tcp 192.168.2.9:10010: getsockopt: no route to host
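For context, 10010 is CRI-O's default streaming port (the stream port setting in /etc/crio/crio.conf), which serves exec/attach requests; "no route to host" usually points at a firewall on the node rather than at the service itself. A hedged check, with the IP taken from the error above:

# grep -i stream /etc/crio/crio.conf          (on the node)
# timeout 3 bash -c '</dev/tcp/192.168.2.9/10010' && echo open || echo blocked
# iptables -nL | grep 10010                   (on the node)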

Marking it TestBlocker temporarily as it's blocking QE from testing cri-o-centos on OpenShift.

Please let me know if any logs are needed, and feel free to reassign to the proper component.

Comment 11 Giuseppe Scrivano 2017-09-14 15:43:43 UTC
Can you quickly try the following?

# ping -c 1 github.com  (from the host)

# runc exec cri-o ping -c 1 github.com

What is the output of the two commands?

Comment 12 Gan Huang 2017-09-14 16:33:30 UTC
# ping -c 1 github.com 
PING github.com (192.30.253.113) 56(84) bytes of data.
64 bytes from lb-192-30-253-113-iad.github.com (192.30.253.113): icmp_seq=1 ttl=55 time=70.7 ms

--- github.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 70.708/70.708/70.708/0.000 ms


# runc exec cri-o ping -c 1 github.com
PING github.com (192.30.253.113) 56(84) bytes of data.
64 bytes from lb-192-30-253-113-iad.github.com (192.30.253.113): icmp_seq=1 ttl=55 time=70.4 ms

--- github.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 70.495/70.495/70.495/0.000 ms

Comment 13 Giuseppe Scrivano 2017-09-14 20:09:37 UTC
It looks like an SELinux issue, probably caused by NFS; in fact, if I run "setenforce 0" I get a bit further.  I could not get the application deployed on your cluster; it seems there are some networking issues that prevent pulling from github.com.

I've tried locally and it works fine for me (I then hit https://github.com/openshift/origin/issues/16349).  Could you verify that the same configuration you have used works fine without cri-o?
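If SELinux is the culprit, the denials should show up in the audit log; a hedged way to check while reproducing (standard audit tooling, nothing cri-o specific):

# getenforce
# ausearch -m avc -ts recent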

Comment 15 Giuseppe Scrivano 2017-09-15 13:53:29 UTC
I've just tried this on the all-in-one VM you have left running:

# oadm policy add-cluster-role-to-user cluster-admin system:serviceaccount:default:default
# setenforce 0
# oc new-app https://github.com/giuseppe/hello-openshift-plus.git

and it was deployed correctly.

When I try some of the OpenShift examples I get this error:

# oc logs bc/cakephp-ex
Cloning "https://github.com/openshift/cakephp-ex.git" ...
	Commit:	7969534afdf9490ca79e37e672f0b9c81887ec28 (Merge pull request #81 from bparees/readiness)
	Author:	Ben Parees <bparees.github.com>
	Date:	Mon Sep 11 01:15:51 2017 -0400
ERROR: Error writing header for "scripts": io: read/write on closed pipe
ERROR: Error writing tar: io: read/write on closed pipe
error: build error: Error response from daemon: {"message":"No such container: crio"}

Comment 17 Dan Williams 2017-09-15 17:42:34 UTC
On the networking side...

The CRI-O RPM installs CNI network configs in /etc/cni/net.d, but its CNI implementation only uses the first one found in the directory, just like Kubernetes does.  That's a limitation of Kubernetes at this point, and something we want to remove from Kube once the multi-network stuff lands.

The pattern that almost all complex CNI plugins for Kube use is to write out a config to /etc/cni/net.d when they are ready, which openshift-sdn does.  But CRI-O doesn't care, since it sees 100-crio-bridge.conf first and uses that.

So the end result is that you've asked OpenShift to use the openshift-sdn network plugin, but underneath, CRI-O isn't using the openshift-sdn network plugin but its default bridge config instead.

So yeah, clearly your networking isn't going to work.

One suggestion is that when openshift-sdn is selected in Ansible, run "rm -rf /etc/cni/net.d/100-crio-bridge.conf /etc/cni/net.d/200-loopback.conf" as part of the Ansible playbook for openshift-sdn, or something like that.
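A hedged sketch of that workaround on a node (the two file names are the cri-o defaults named above; CNI config ordering is lexical, so 100-crio-bridge.conf sorts ahead of openshift-sdn's config, whose exact file name is an assumption here):

# ls /etc/cni/net.d/
# rm -rf /etc/cni/net.d/100-crio-bridge.conf /etc/cni/net.d/200-loopback.conf
# systemctl restart cri-o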

Comment 19 Giuseppe Scrivano 2017-09-18 12:13:52 UTC
With the new image for the system container, I see this error:

# oc new-app https://github.com/openshift/cakephp-ex.git
# oc logs -f bc/cakephp-ex

ERROR: Error writing header for "scripts": io: read/write on closed pipe
ERROR: Error writing tar: io: read/write on closed pipe
error: build error: Error response from daemon: {"message":"No such container: crio"}


I've also tried with cri-o directly on the host, but I see the same error.
Mrunal, should we file this separately?

Comment 20 Mrunal Patel 2017-09-19 17:45:20 UTC
This looks build-related, and we are working with Ben Parees to fix it. We should track it separately and close this bug once we finish the build integration work.

Comment 21 Giuseppe Scrivano 2017-09-19 18:00:06 UTC
Thanks for the explanation.

@Gan, are you fine with closing this?

Comment 23 Red Hat Bugzilla 2023-09-14 04:07:34 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days