Bug 1565525 - Install failed due to "Node start failed"
Summary: Install failed due to "Node start failed"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Scott Dodson
QA Contact: Weihua Meng
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-04-10 08:27 UTC by Weihua Meng
Modified: 2018-11-09 20:11 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-18 17:45:15 UTC
Target Upstream Version:


Attachments
install_log (10.86 KB, text/plain)
2018-04-10 08:27 UTC, Weihua Meng

Description Weihua Meng 2018-04-10 08:27:04 UTC
Created attachment 1419756
install_log

Description of problem:
Install failed due to "Node start failed"

Version-Release number of the following components:
openshift-ansible-3.10.0-0.16.0.git.0.8925606.el7.noarch.rpm

How reproducible:
Always

Steps to Reproduce:
1. Install OCP 3.10
$ ansible-playbook playbooks/deploy_cluster.yml


Actual results:
TASK [openshift_node : debug] **************************************************
Tuesday 10 April 2018  02:58:17 -0400 (0:00:00.461)       0:15:38.320 ********* 
skipping: [shared-wmeng3107nsc-master-etcd-1.0410-2li.qe.rhcloud.com] => {"skip_reason": "Conditional result was False"}
ok: [shared-wmeng3107nsc-nrr-1.0410-2li.qe.rhcloud.com] => {
    "msg": [
        "-- Logs begin at Tue 2018-04-10 02:37:58 EDT, end at Tue 2018-04-10 02:59:01 EDT. --", 
        "Apr 10 02:54:02 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18333]: I0410 02:54:02.204507   18333 bootstrap.go:58] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file", 
        "Apr 10 02:59:01 shared-wmeng3107nsc-nrr-2 systemd[1]: atomic-openshift-node.service start operation timed out. Terminating.", 
        "Apr 10 02:59:01 shared-wmeng3107nsc-nrr-2 systemd[1]: Failed to start OpenShift Node.", 
        "Apr 10 02:59:01 shared-wmeng3107nsc-nrr-2 systemd[1]: Unit atomic-openshift-node.service entered failed state.", 
        "Apr 10 02:59:01 shared-wmeng3107nsc-nrr-2 systemd[1]: atomic-openshift-node.service failed."
    ]
}

TASK [openshift_node : fail] ***************************************************
Tuesday 10 April 2018  02:58:17 -0400 (0:00:00.078)       0:15:38.398 ********* 
skipping: [shared-wmeng3107nsc-master-etcd-1.0410-2li.qe.rhcloud.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
fatal: [shared-wmeng3107nsc-nrr-1.0410-2li.qe.rhcloud.com]: FAILED! => {"changed": false, "failed": true, "msg": "Node start failed."}
fatal: [shared-wmeng3107nsc-nrr-2.0410-2li.qe.rhcloud.com]: FAILED! => {"changed": false, "failed": true, "msg": "Node start failed."}

More logs will be attached.

Expected results:
Install succeeds

Additional info:
The log shows a failed start.
However, when I check on the host, the service is running.
[root@shared-wmeng3107nsc-nrr-2 ~]# systemctl status atomic-openshift-node.service 
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/atomic-openshift-node.service.d
           └─override.conf
   Active: active (running) since Tue 2018-04-10 02:59:07 EDT; 26min ago
     Docs: https://github.com/openshift/origin
 Main PID: 18377 (hyperkube)
   Memory: 50.8M
   CGroup: /system.slice/atomic-openshift-node.service
           └─18377 /usr/bin/hyperkube kubelet --v=5 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=5m --authorizat...

Apr 10 03:25:23 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:23.350262   18377 config.go:99] Looking for [api file], have seen map[file:{} api:{}]
Apr 10 03:25:23 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:23.350326   18377 kubelet.go:1924] SyncLoop (housekeeping)
Apr 10 03:25:24 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:24.284880   18377 generic.go:183] GenericPLEG: Relisting
Apr 10 03:25:25 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:25.291076   18377 generic.go:183] GenericPLEG: Relisting
Apr 10 03:25:25 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:25.350239   18377 config.go:99] Looking for [api file], have seen map[file:{} api:{}]
Apr 10 03:25:25 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:25.350296   18377 kubelet.go:1924] SyncLoop (housekeeping)
Apr 10 03:25:25 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: W0410 03:25:25.394184   18377 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Apr 10 03:25:25 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:25.394313   18377 kubelet.go:2103] Container runtime status: Runtime Conditions: RuntimeReady=true reason: messa...initialized
Apr 10 03:25:25 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: E0410 03:25:25.394334   18377 kubelet.go:2106] Container runtime network not ready: NetworkReady=false reason:NetworkPluginN...initialized
Apr 10 03:25:26 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18377]: I0410 03:25:26.297177   18377 generic.go:183] GenericPLEG: Relisting
Hint: Some lines were ellipsized, use -l to show in full.

Comment 1 weiwei jiang 2018-04-11 01:39:29 UTC
This happens because systemd kills the node process once startup exceeds TimeoutStartSec=300, which is defined in /etc/systemd/system/atomic-openshift-node.service.
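
As a rough cross-check of the timeout explanation, the gap between the first node log line and the systemd "timed out" line in the description can be computed from the journal timestamps. This is only a sketch: the helper names (stamp, seconds_until_timeout) are hypothetical, the first message is shortened, the year is assumed to be 2018 because the short journal format omits it, and the first node log line is not necessarily the exact moment systemd began starting the unit. Parsing the month abbreviation assumes an English (C) locale.

```python
from datetime import datetime

# Two journal lines taken from the description (first message shortened).
JOURNAL = [
    "Apr 10 02:54:02 shared-wmeng3107nsc-nrr-2 atomic-openshift-node[18333]: "
    "I0410 02:54:02.204507   18333 bootstrap.go:58] Using bootstrap kubeconfig",
    "Apr 10 02:59:01 shared-wmeng3107nsc-nrr-2 systemd[1]: "
    "atomic-openshift-node.service start operation timed out. Terminating.",
]


def stamp(line, year=2018):
    """Parse the leading 'Mon DD HH:MM:SS' prefix; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def seconds_until_timeout(lines):
    """Seconds from the first line to the first 'timed out' line, else None."""
    first = stamp(lines[0])
    for line in lines:
        if "start operation timed out" in line:
            return (stamp(line) - first).total_seconds()
    return None


elapsed = seconds_until_timeout(JOURNAL)
print(elapsed)
```

For the two lines above the gap is 299 seconds, which is consistent with a 300-second start timeout given that the unit start began slightly before the first logged message.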

Comment 2 Scott Dodson 2018-04-17 13:10:44 UTC
This should now be resolved on master; let's re-test.

Comment 3 Weihua Meng 2018-04-17 14:38:03 UTC
Fixed.
openshift-ansible-3.10.0-0.22.0.git.0.b6ec617.el7.noarch.rpm

Comment 4 Nick Curry 2018-11-09 20:11:18 UTC
I am seeing this in 3.11.

Can I get more information around what is causing this?

