Bug 1565482 - SDN initialization failed for system container installation on RHEL
Summary: SDN initialization failed for system container installation on RHEL
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Vadim Rutkovsky
QA Contact: Gan Huang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-10 06:18 UTC by Gan Huang
Modified: 2018-07-30 19:13 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:12:35 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:13:02 UTC

Description Gan Huang 2018-04-10 06:18:01 UTC
Description of problem:
Node failed to start while triggering system container installation on RHEL:

"Apr 09 23:47:15 qe-ghuang-master-etcd-1 atomic-openshift-node[22470]: F0409 23:47:15.528946   22470 start_node.go:162] SDN initialization failed: OVS is not installed"

Version-Release number of the following components:
openshift-ansible-3.10.0-0.16.0.git.0.8925606.el7.noarch.rpm

How reproducible:
always

Steps to Reproduce:
1. Trigger system container installation on RHEL:
# cat inventory
<--snip-->
openshift_use_system_containers=true
system_images_registry=registry.reg-aws.openshift.com:443
<--snip-->


Actual results:
"Apr 09 23:47:15 qe-ghuang-master-etcd-1 atomic-openshift-node[22470]: F0409 23:47:15.528946   22470 start_node.go:162] SDN initialization failed: OVS is not installed"

The full logs will be attached.

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 Gan Huang 2018-04-10 06:43:15 UTC
This is a system container installation, but the node service file looks incorrect.

# systemctl cat atomic-openshift-node.service 
# /usr/lib/systemd/system/atomic-openshift-node.service
[Unit]
Description=Atomic OpenShift Node
After=docker.service
After=openvswitch.service
Wants=docker.service
Documentation=https://github.com/openshift/origin

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/atomic-openshift-node.service.d/override.conf
[Unit]
After=cloud-init.service
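
Editorial sketch (not part of the original comment): the unit above starts the node with `/usr/bin/openshift start node`, whereas the corrected unit later in this report starts it through `/usr/bin/runc`. Assuming that the `ExecStart=` binary is a sufficient hint for telling the two flavours apart, a check over `systemctl cat` output could look roughly like this. The function name and the labels `rpm` and `system-container` are hypothetical, chosen only for illustration.

```python
def classify_node_unit(unit_text):
    """Guess the install flavour from the ExecStart= line of a unit file.

    Assumption (not confirmed by the report): a unit that execs `runc`
    belongs to a system container install, and one that execs `openshift`
    belongs to an RPM-style install.
    """
    for line in unit_text.splitlines():
        line = line.strip()
        # ExecStartPre=, ExecStop= and other directives are skipped here.
        if not line.startswith("ExecStart="):
            continue
        command = line.split("=", 1)[1].split()
        if not command:
            continue
        binary = command[0].rsplit("/", 1)[-1]
        if binary == "runc":
            return "system-container"
        if binary == "openshift":
            return "rpm"
    return "unknown"
```

Feeding it the text printed by `systemctl cat atomic-openshift-node.service` on each host would flag nodes that still carry the unexpected unit.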

Comment 4 Gan Huang 2018-04-13 08:18:02 UTC
The node service unit file has since been corrected somehow.

# systemctl cat atomic-openshift-node.service 
# /etc/systemd/system/atomic-openshift-node.service
[Unit]
After=container-engine.service
After=openvswitch.service
Wants=container-engine.service
After=atomic-openshift-node-dep.service
After=atomic-openshift-master-controllers.service
Requires=dnsmasq.service
After=dnsmasq.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node

ExecStartPre=/bin/bash -c 'export -p > /run/atomic-openshift-node-env'
ExecStart=/usr/bin/runc --systemd-cgroup run 'atomic-openshift-node'
ExecStop=/usr/bin/runc --systemd-cgroup kill 'atomic-openshift-node'
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
WorkingDirectory=/var/lib/containers/atomic/atomic-openshift-node.0
RuntimeDirectory=atomic-openshift-node

[Install]
WantedBy=container-engine.service

# /etc/systemd/system/atomic-openshift-node.service.d/override.conf
[Unit]
After=cloud-init.service

However, I am now hitting another issue.

The sdn pods are in CrashLoopBackOff:

# oc get pod -n openshift-sdn
NAME        READY     STATUS             RESTARTS   AGE
ovs-nwxv7   1/1       Running            0          1h
ovs-xf7h6   1/1       Running            0          1h
sdn-s65k9   0/1       CrashLoopBackOff   19         1h
sdn-t8prw   0/1       CrashLoopBackOff   21         1h

# oc logs sdn-s65k9 -n openshift-sdn
failed to open log file "/var/log/pods/ca1f0816-3ee7-11e8-8ee1-42010af0001f/sdn_19.log": open /var/log/pods/ca1f0816-3ee7-11e8-8ee1-42010af0001f/sdn_19.log: no such file or directory

# oc describe po sdn-s65k9 -n openshift-sdn
<--snip-->
Events:
  Type     Reason                 Age                From                                       Message
  ----     ------                 ----               ----                                       -------
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-opt-cni-bin"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-openshift-sdn"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-dbus"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-config"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-etc-cni-netd"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-lib-cni-networks-openshift-sdn"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-kubernetes"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-sysconfig-node"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-modules"
  Normal   SuccessfulMountVolume  1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  (combined from similar events): MountVolume.SetUp succeeded for volume "sdn-token-n9vqb"
  Normal   Pulling                1h                 kubelet, qe-ghuang-node-registry-router-1  pulling image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10"
  Normal   Pulled                 1h                 kubelet, qe-ghuang-node-registry-router-1  Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10"
  Normal   Pulled                 1h                 kubelet, qe-ghuang-node-registry-router-1  Container image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10" already present on machine
  Normal   Created                1h (x2 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Created container
  Warning  Failed                 1h (x2 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Error: failed to start container "sdn": Error response from daemon: error while creating mount source path '/opt/cni/bin': mkdir /opt/cni: read-only file system
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-kubernetes"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-sysconfig-node"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-dbus"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-lib-cni-networks-openshift-sdn"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-ovs"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-etc-cni-netd"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-modules"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-config"
  Normal   SuccessfulMountVolume  1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  (combined from similar events): MountVolume.SetUp succeeded for volume "sdn-token-n9vqb"
  Normal   Pulled                 1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Container image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10" already present on machine
  Normal   Created                1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Created container
  Warning  Failed                 1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Error: failed to start container "sdn": Error response from daemon: error while creating mount source path '/opt/cni/bin': mkdir /opt/cni: read-only file system
  Warning  BackOff                2m (x308 over 1h)  kubelet, qe-ghuang-node-registry-router-1  Back-off restarting failed container
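
Editorial sketch (not part of the original comment): the relevant signal in the event list above is the pair of `Warning  Failed` lines, which name the mount source path and the underlying cause. A minimal way to pull those out of saved `oc describe` output, assuming the message keeps the wording shown above, might be the following. The helper name `find_mount_failures` is hypothetical.

```python
import re

# Matches the daemon message seen in the events above; the exact wording
# is taken from this report and may differ in other versions.
MOUNT_ERROR = re.compile(
    r"error while creating mount source path '(?P<path>[^']+)': (?P<cause>.+)$"
)


def find_mount_failures(describe_output):
    """Return sorted, de-duplicated (path, cause) pairs from Warning events."""
    failures = set()
    for line in describe_output.splitlines():
        if not line.lstrip().startswith("Warning"):
            continue
        match = MOUNT_ERROR.search(line)
        if match:
            failures.add((match.group("path"), match.group("cause").strip()))
    return sorted(failures)
```

Run against the events above, this collapses the repeated warnings into a single entry for `/opt/cni/bin`.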


Note: this is a system container installation on RHEL.

TASK [openshift_node : Install or Update node system container] ****************
Friday 13 April 2018  02:45:49 -0400 (0:01:43.746)       0:02:52.121 ********** 

changed: [qe-ghuang-master-etcd-1.0413-cot.qe.rhcloud.com] => {"changed": true, "failed": false, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nCreated file /opt/cni/bin/host-local\nCreated file /opt/cni/bin/openshift-sdn\nCreated file /opt/cni/bin/loopback\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}
changed: [qe-ghuang-node-registry-router-1.0413-cot.qe.rhcloud.com] => {"changed": true, "failed": false, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nCreated file /opt/cni/bin/host-local\nCreated file /opt/cni/bin/openshift-sdn\nCreated file /opt/cni/bin/loopback\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}

Comment 5 Vadim Rutkovsky 2018-04-19 11:37:29 UTC
>  Warning  Failed                 1h (x2 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Error: failed to start container "sdn": Error response from daemon: error while creating mount source path '/opt/cni/bin': mkdir /opt/cni: read-only file system

/opt/cni/bin needs to be mounted. I created https://github.com/openshift/origin/pull/19427 to fix this in the system container image.
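
Editorial sketch (not part of the original comment): the contents of the pull request are not quoted in this report. Assuming the fix shows up as an entry in the `mounts` array of the container bundle's OCI `config.json` (the file `runc run` reads from the unit's working directory), a quick check for the mount could look like this. The function names are hypothetical, and the default destination simply mirrors the path from the error message.

```python
import json


def mounted_destinations(config_json_text):
    """Return the set of mount destinations declared in an OCI config.json."""
    config = json.loads(config_json_text)
    mounts = config.get("mounts") or []
    return {m.get("destination") for m in mounts if isinstance(m, dict)}


def has_mount(config_json_text, destination="/opt/cni/bin"):
    """True if the bundle declares a mount at the given destination."""
    return destination in mounted_destinations(config_json_text)
```

This only inspects the declared configuration; it does not prove that the path is writable at runtime.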

Comment 6 Scott Dodson 2018-04-20 14:34:03 UTC
Merged: https://github.com/openshift/origin/pull/19427 and the follow-up fix https://github.com/openshift/origin/pull/19445

Comment 7 Vadim Rutkovsky 2018-04-24 13:20:55 UTC
The fix for this is available in atomic-openshift-3.10.0-0.28.0.git.0.66790cb.el7 and openshift-ansible-3.10.0-0.28.0.git.0.439cb5c.el7

Comment 8 Gan Huang 2018-04-25 03:23:52 UTC
Verified in openshift-ansible-3.10.0-0.28.0.git.0.439cb5c.el7.noarch.rpm

Comment 10 errata-xmlrpc 2018-07-30 19:12:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

