Bug 1485312 - Pods fail to start during containerized install
Summary: Pods fail to start during containerized install
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Release
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Justin Pierce
QA Contact: Wei Sun
URL:
Whiteboard: aos-scalability-37
Depends On:
Blocks:
 
Reported: 2017-08-25 11:25 UTC by Jiří Mencák
Modified: 2018-04-05 09:29 UTC
CC: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-05 09:28:25 UTC
Target Upstream Version:
Embargoed:


Attachments
Openshift ansible log file. (2.09 MB, text/plain), 2017-08-25 11:25 UTC, Jiří Mencák


Links
Red Hat Product Errata RHBA-2018:0636, last updated 2018-04-05 09:29:03 UTC

Description Jiří Mencák 2017-08-25 11:25:32 UTC
Created attachment 1318123: Openshift ansible log file.

Description of problem:
Pods fail to start during a containerized install on RHEL Atomic Host 7.4 with OCP 3.7.

Version-Release number of selected component (if applicable):
oc v3.7.0-0.109.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-31-58-50.us-west-2.compute.internal:8443
openshift v3.7.0-0.109.0
kubernetes v1.7.0+695f48a16f

How reproducible:
Always, during the advanced install.

Steps to Reproduce:
1. Use the advanced (openshift-ansible) method of installation to install OCP 3.7 on Atomic Host (a sketch of the invocation is shown below).
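
For reference, a minimal sketch of the advanced install invocation; the inventory path is an assumption, and playbooks/byo/config.yml was the entry point in the 3.7-era openshift-ansible tree:

# Run the advanced (openshift-ansible) install; inventory path is hypothetical:
ansible-playbook -i /path/to/inventory ~/openshift-ansible/playbooks/byo/config.yml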

Actual results:
$ oc get pods
NAME              READY     STATUS              RESTARTS   AGE
router-1-deploy   0/1       ContainerCreating   0          10m

$ journalctl -xe|grep router|tail -n1
Aug 25 11:01:02 ip-172-31-58-50.us-west-2.compute.internal dockerd-current[38691]: E0825 11:01:02.053510    2841 pod_workers.go:182] Error syncing pod 2e512fea-8983-11e7-8064-026e9c4855a2 ("router-1-deploy_default(2e512fea-8983-11e7-8064-026e9c4855a2)"), skipping: failed to "KillPodSandbox" for "2e512fea-8983-11e7-8064-026e9c4855a2" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"router-1-deploy_default\" network: failed to find plugin \"openshift-sdn\" in path [/opt/openshift-sdn/bin /opt/cni/bin]"
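
For anyone triaging a similar failure, a quick check derived from the error above; the two directories are exactly the plugin search path named in the log line:

# List the CNI plugin search path from the error; warn if either directory is missing:
ls /opt/openshift-sdn/bin /opt/cni/bin 2>/dev/null || echo "CNI plugin path incomplete"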

Expected results:
Pods starting correctly.

Additional info:
~/openshift-ansible $ git describe
openshift-ansible-3.7.0-0.109.0

$ rpm -q ansible
ansible-2.3.1.0-3.el7.noarch

$ ansible --version
ansible 2.3.1.0
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.5 (default, May  3 2017, 07:55:04) [GCC 4.8.5 20150623 (Red Hat 4.8.5-14)]

https://github.com/openshift/origin/issues/15953

I managed to install OCP 3.6 on the same RHEL Atomic Host image without any issues.

Comment 1 Giuseppe Scrivano 2017-08-25 13:54:23 UTC
Can you verify what the output is for:

"docker exec $NODE_CONTAINER ls /opt/cni"

If the files are missing there, then the change (https://github.com/openshift/origin/pull/15468) probably didn't make it into the image you are using.
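
(A hedged way to resolve $NODE_CONTAINER on the host; the container name below matches the transcripts in this bug:)

# Find the node container by name, then inspect /opt/cni inside it:
NODE_CONTAINER=$(docker ps -q --filter name=atomic-openshift-node)
docker exec "$NODE_CONTAINER" ls /opt/cni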

Comment 2 Jiří Mencák 2017-08-25 14:03:33 UTC
Looks like it.

root@ip-172-31-58-50: ~ # docker ps|grep node
976e126bbae3        openshift3/node:v3.7.0-0.109.0          "/usr/local/bin/origi"   2 minutes ago       Up 2 minutes                            atomic-openshift-node
root@ip-172-31-58-50: ~ # docker exec 976e126bbae3 ls /opt/cni
ls: cannot access /opt/cni: No such file or directory

Comment 3 Steve Milner 2017-08-25 16:02:59 UTC
Verified the image is bad:

(from inside the image)
# rpm -q origin-sdn-ovs
package origin-sdn-ovs is not installed
[root@2a68f132e017 opt]# ls /opt/
[root@2a68f132e017 opt]#

Comment 5 Steve Milner 2017-08-25 16:07:24 UTC
Moving to Release component.

Comment 6 Justin Pierce 2017-08-25 17:49:02 UTC
It does look like the sdn-ovs RPM was lost in the Dockerfile reconciliation process. I've fixed the OCP version; the commit can be found here: http://dist-git.host.prod.eng.bos.redhat.com/cgit/rpms/openshift-enterprise-node-docker/commit/Dockerfile?h=rhaos-3.7-rhel-7&id=51d624bbd507f304fddf1d09e0aad0b04187db23

This should be included in the next build of 3.7. 

* Note that for OCP, the rpm name would be atomic-openshift-sdn-ovs (not origin-sdn-ovs).
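
For illustration only, the restored package would be pulled into the node image by a Dockerfile line roughly like the following; the exact change is in the commit above, and everything here beyond the atomic-openshift-sdn-ovs name is an assumption:

# Hypothetical excerpt of the OCP node image Dockerfile; only the
# atomic-openshift-sdn-ovs package name is confirmed by this bug:
RUN INSTALL_PKGS="atomic-openshift-node atomic-openshift-sdn-ovs" && \
    yum install -y $INSTALL_PKGS && \
    yum clean all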

Comment 7 Justin Pierce 2017-08-28 21:18:10 UTC
This should be addressed as of: v3.7.0-0.117.0

Comment 8 Jiří Mencák 2017-08-29 06:50:52 UTC
Managed to get it working with the v3.7.0-0.117.0 puddle.

[root@rhel-7 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Atomic Host release 7.4
[root@rhel-7 ~]# oc version
oc v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://rhel-7.4.novalocal:8443
openshift v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
[root@rhel-7 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-spd74    1/1       Running   0          2m
registry-console-1-fw2xq   1/1       Running   0          2m
router-1-8rblt             1/1       Running   0          3m

Thank you!

Comment 9 Johnny Liu 2017-08-29 07:10:52 UTC
Verified this bug with the v3.7.0-0.117.0 images; the checks passed.

[root@qe-wmeng37-master-etcd-1 ~]# openshift version
openshift v3.7.0-0.117.0
kubernetes v1.7.0+695f48a16f
etcd 3.2.1
[root@qe-wmeng37-master-etcd-1 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-3-px6fs    1/1       Running   0          5h
registry-console-1-g5mz0   1/1       Running   0          5h
router-1-094tp             1/1       Running   0          5h
[root@qe-wmeng37-master-etcd-1 ~]# docker ps|grep node
64f27b3d8a0d        openshift3/node:v3.7.0                  "/usr/local/bin/origi"   5 hours ago         Up 5 hours                              atomic-openshift-node
[root@qe-wmeng37-master-etcd-1 ~]# docker exec 64f27b3d8a0d ls /opt/cni
bin
[root@qe-wmeng37-master-etcd-1 ~]# docker exec 64f27b3d8a0d rpm -q atomic-openshift-sdn-ovs
atomic-openshift-sdn-ovs-3.7.0-0.117.0.git.0.b5a2a69.el7.x86_64

Comment 13 errata-xmlrpc 2018-04-05 09:28:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0636

