Description of problem:
Node failed to start while triggering system container installation on RHEL:

"Apr 09 23:47:15 qe-ghuang-master-etcd-1 atomic-openshift-node[22470]: F0409 23:47:15.528946 22470 start_node.go:162] SDN initialization failed: OVS is not installed"

Version-Release number of the following components:
openshift-ansible-3.10.0-0.16.0.git.0.8925606.el7.noarch.rpm

How reproducible:
always

Steps to Reproduce:
1. Trigger system container installation on RHEL:
# cat inventory
<--snip-->
openshift_use_system_containers=true
system_images_registry=registry.reg-aws.openshift.com:443
<--snip-->

Actual results:
"Apr 09 23:47:15 qe-ghuang-master-etcd-1 atomic-openshift-node[22470]: F0409 23:47:15.528946 22470 start_node.go:162] SDN initialization failed: OVS is not installed"
The full logs will be attached.

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
This is a system container installation, but the node service file looks incorrect.

# systemctl cat atomic-openshift-node.service
# /usr/lib/systemd/system/atomic-openshift-node.service
[Unit]
Description=Atomic OpenShift Node
After=docker.service
After=openvswitch.service
Wants=docker.service
Documentation=https://github.com/openshift/origin

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
Environment=GOTRACEBACK=crash
ExecStart=/usr/bin/openshift start node --config=${CONFIG_FILE} $OPTIONS
LimitNOFILE=65536
LimitCORE=infinity
WorkingDirectory=/var/lib/origin/
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/atomic-openshift-node.service.d/override.conf
[Unit]
After=cloud-init.service
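For anyone triaging a similar host, one rough way to tell which flavour of node unit is installed is to look at its ExecStart line. The sketch below is only a heuristic and rests on an assumption of mine: that a system-container unit launches the node through /usr/bin/runc, while an RPM-style unit calls /usr/bin/openshift directly. The function name `node_unit_kind` is hypothetical and not part of openshift-ansible.

```python
def node_unit_kind(unit_text: str) -> str:
    """Classify `systemctl cat` output by the binary named in ExecStart.

    Assumption (heuristic only): system-container units start the node
    via /usr/bin/runc, RPM-style units via /usr/bin/openshift.
    """
    for line in unit_text.splitlines():
        line = line.strip()
        # Skip everything except the main ExecStart= directive;
        # ExecStartPre= and ExecStop= do not match this prefix.
        if not line.startswith("ExecStart="):
            continue
        command = line.split("=", 1)[1]
        if command.startswith("/usr/bin/runc"):
            return "system-container"
        if command.startswith("/usr/bin/openshift"):
            return "rpm"
    return "unknown"
```

It can be fed the text captured from `systemctl cat atomic-openshift-node.service`; a result of "rpm" on a host installed with openshift_use_system_containers=true would match the mismatch described in this comment.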
The node service unit file has since been corrected somehow.

# systemctl cat atomic-openshift-node.service
# /etc/systemd/system/atomic-openshift-node.service
[Unit]
After=container-engine.service
After=openvswitch.service
Wants=container-engine.service
After=atomic-openshift-node-dep.service
After=atomic-openshift-master-controllers.service
Requires=dnsmasq.service
After=dnsmasq.service

[Service]
Type=notify
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
ExecStartPre=/bin/bash -c 'export -p > /run/atomic-openshift-node-env'
ExecStart=/usr/bin/runc --systemd-cgroup run 'atomic-openshift-node'
ExecStop=/usr/bin/runc --systemd-cgroup kill 'atomic-openshift-node'
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
WorkingDirectory=/var/lib/containers/atomic/atomic-openshift-node.0
RuntimeDirectory=atomic-openshift-node

[Install]
WantedBy=container-engine.service

# /etc/systemd/system/atomic-openshift-node.service.d/override.conf
[Unit]
After=cloud-init.service

But now hitting another issue.
The sdn pods are in CrashLoopBackOff:

# oc get pod -n openshift-sdn
NAME        READY     STATUS             RESTARTS   AGE
ovs-nwxv7   1/1       Running            0          1h
ovs-xf7h6   1/1       Running            0          1h
sdn-s65k9   0/1       CrashLoopBackOff   19         1h
sdn-t8prw   0/1       CrashLoopBackOff   21         1h

# oc logs sdn-s65k9 -n openshift-sdn
failed to open log file "/var/log/pods/ca1f0816-3ee7-11e8-8ee1-42010af0001f/sdn_19.log": open /var/log/pods/ca1f0816-3ee7-11e8-8ee1-42010af0001f/sdn_19.log: no such file or directory

# oc describe po sdn-s65k9 -n openshift-sdn
<--snip-->
Events:
  Type     Reason                 Age                From                                       Message
  ----     ------                 ----               ----                                       -------
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-opt-cni-bin"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-openshift-sdn"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-dbus"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-config"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-etc-cni-netd"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-lib-cni-networks-openshift-sdn"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-kubernetes"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-sysconfig-node"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-modules"
  Normal   SuccessfulMountVolume  1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  (combined from similar events): MountVolume.SetUp succeeded for volume "sdn-token-n9vqb"
  Normal   Pulling                1h                 kubelet, qe-ghuang-node-registry-router-1  pulling image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10"
  Normal   Pulled                 1h                 kubelet, qe-ghuang-node-registry-router-1  Successfully pulled image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10"
  Normal   Pulled                 1h                 kubelet, qe-ghuang-node-registry-router-1  Container image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10" already present on machine
  Normal   Created                1h (x2 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Created container
  Warning  Failed                 1h (x2 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Error: failed to start container "sdn": Error response from daemon: error while creating mount source path '/opt/cni/bin': mkdir /opt/cni: read-only file system
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-kubernetes"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-sysconfig-node"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-dbus"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-lib-cni-networks-openshift-sdn"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-var-run-ovs"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-etc-cni-netd"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-modules"
  Normal   SuccessfulMountVolume  1h                 kubelet, qe-ghuang-node-registry-router-1  MountVolume.SetUp succeeded for volume "host-config"
  Normal   SuccessfulMountVolume  1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  (combined from similar events): MountVolume.SetUp succeeded for volume "sdn-token-n9vqb"
  Normal   Pulled                 1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Container image "registry.reg-aws.openshift.com:443/openshift3/node:v3.10" already present on machine
  Normal   Created                1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Created container
  Warning  Failed                 1h (x3 over 1h)    kubelet, qe-ghuang-node-registry-router-1  Error: failed to start container "sdn": Error response from daemon: error while creating mount source path '/opt/cni/bin': mkdir /opt/cni: read-only file system
  Warning  BackOff                2m (x308 over 1h)  kubelet, qe-ghuang-node-registry-router-1  Back-off restarting failed container

Note: this is a system container installation on RHEL.

TASK [openshift_node : Install or Update node system container] ****************
Friday 13 April 2018 02:45:49 -0400 (0:01:43.746) 0:02:52.121 **********
changed: [qe-ghuang-master-etcd-1.0413-cot.qe.rhcloud.com] => {"changed": true, "failed": false, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nCreated file /opt/cni/bin/host-local\nCreated file /opt/cni/bin/openshift-sdn\nCreated file /opt/cni/bin/loopback\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}
changed: [qe-ghuang-node-registry-router-1.0413-cot.qe.rhcloud.com] => {"changed": true, "failed": false, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nCreated file /opt/cni/bin/host-local\nCreated file /opt/cni/bin/openshift-sdn\nCreated file /opt/cni/bin/loopback\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}
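When re-checking a cluster after a reinstall, it can help to pull the crash-looping pods out of the `oc get pod` listing automatically. The following is a minimal sketch, assuming the default column layout with NAME and STATUS headers as in the listing in this comment; the helper name `crashlooping_pods` is hypothetical.

```python
from typing import List


def crashlooping_pods(oc_get_pod_output: str) -> List[str]:
    """Return names of pods whose STATUS column is CrashLoopBackOff.

    Expects the plain-text table printed by `oc get pod`, header included.
    """
    lines = oc_get_pod_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    # Locate columns by header name rather than fixed position.
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    failing = []
    for row in lines[1:]:
        cols = row.split()
        if len(cols) <= max(name_col, status_col):
            continue  # skip blank or truncated rows
        if cols[status_col] == "CrashLoopBackOff":
            failing.append(cols[name_col])
    return failing
```

Applied to the listing in this comment it would pick out the two sdn pods and skip the two running ovs pods.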
> Warning Failed 1h (x2 over 1h) kubelet, qe-ghuang-node-registry-router-1 Error: failed to start container "sdn": Error response from daemon: error while creating mount source path '/opt/cni/bin': mkdir /opt/cni: read-only file system

/opt/cni/bin needs to be mounted. Created https://github.com/openshift/origin/pull/19427 to fix it in the system container image.
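To confirm this kind of failure on a host, one option is to check which mount covers /opt/cni/bin and whether it is read-only. The sketch below parses text in /proc/mounts format (device, mount point, type, options); the helper name `mount_for_path` is hypothetical, and which mount table to inspect (host versus the container engine's mount namespace) is left as an assumption for the reader to decide.

```python
from typing import Optional, Tuple


def mount_for_path(path: str, mounts_text: str) -> Optional[Tuple[str, bool]]:
    """Return (mount_point, is_read_only) for the deepest mount covering path.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a valid mount entry
        mount_point = fields[1]
        options = fields[3].split(",")
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point.rstrip("/") + "/")
        )
        # Prefer the longest mount point; on ties the later entry wins,
        # since later mounts shadow earlier ones at the same path.
        if covers and (best is None or len(mount_point) >= len(best[0])):
            best = (mount_point, "ro" in options)
    return best
```

On a live system the input could be read with `open("/proc/mounts").read()`. A result whose second element is True for /opt/cni/bin would be consistent with the "read-only file system" error quoted above.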
Merged https://github.com/openshift/origin/pull/19427 and the follow-up fix https://github.com/openshift/origin/pull/19445
The fix for this is available in atomic-openshift-3.10.0-0.28.0.git.0.66790cb.el7 and openshift-ansible-3.10.0-0.28.0.git.0.439cb5c.el7
Verified in openshift-ansible-3.10.0-0.28.0.git.0.439cb5c.el7.noarch.rpm
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816