Bug 1572870
| Summary: | Fail to upgrade against bootstrap OCP because node cannot start | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
| Component: | Cluster Version Operator | Assignee: | Michael Gugino <mgugino> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | liujia <jiajliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.10.0 | CC: | aos-bugs, dma, jiajliu, jokerman, mifiedle, mmccomas, vlaad, wmeng, xtian |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-08-13 15:11:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
liujia
2018-04-28 10:00:01 UTC
Can this be reproduced in 3.10.0-0.37.0 or newer? The cert signing request shouldn't need to be fulfilled before the API server pod comes online. If this is still happening, please capture the full `journalctl --no-pager -u atomic-openshift-node` output and attach it to the bug.

Did not hit the issue on openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch.

OK, moving this to ON_QA; if we don't see this again, let's go ahead and close as CURRENTRELEASE.

Verified on openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch.

Hit the issue again on openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch:

```
TASK [openshift_node : Start services] **************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:33
ok: [x] => (item=atomic-openshift-node) => {"changed": false, "failed_when_result": false, "item": "atomic-openshift-node", "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

TASK [openshift_node : Wait for master API to come back online] *************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:39
fatal: [x]: FAILED! => {"changed": false, "elapsed": 604, "msg": "Timeout when waiting for qe-jliu-rp10-master-etcd-1:8443"}
```

```
# docker ps|grep master
83bc3942cad4   registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67   "/bin/sh -c '#!/bi..."   3 hours ago   Up 3 hours   k8s_etcd_master-etcd-qe-jliu-rp10-master-etcd-1_kube-system_a9f73a820430c627cbe1a4585ccdafdd_0
```

```
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 atomic-openshift-node[18055]: F0515 22:14:58.725923   18055 server.go:233] failed to run Kubelet: cannot create certificate signing request: Post https://qe-jliu-rp10-master-etcd-1:8443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp [fe80::4001:aff:fef0:17%eth0]:8443: getsockopt: connection refused
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: Failed to start OpenShift Node.
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service failed.
```

Node service log in attachment.

I believe we've hit a similar permutation of this during a free-int 3.10.0-x.y.z to 3.10.0-x.y+1.z upgrade: due to master misconfiguration, all master services were stopped, taking the API down, and the node then failed to start with a similar error. Only after restoring the API on the other two hosts was I able to bring the host in question back up. If this happens again, let's gather:

```
journalctl --no-pager -u atomic-openshift-node
ls -la /etc/origin/node
ls -la /etc/origin/node/certificates
```

Possible dupe of 1579267.

It should not be. BZ1579267 is based on a successfully upgraded env where master and node services run well. This issue happened during the upgrade process, where master and node wait for each other.
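For context on the "wait for each other" failure mode: the upgrade restarts the node service and then polls the master API on port 8443 (restart.yml:33 and :39 in the output above). A rough reconstruction of that sequence is sketched below; the module choices and parameter values are assumptions inferred from the log output, not the verbatim role source.

```yaml
# Illustrative sketch of roles/openshift_node/tasks/upgrade/restart.yml
# (assumed modules and values; not copied from the role source).
- name: Start services
  service:
    name: atomic-openshift-node
    state: started
  failed_when: false   # matches "failed_when_result": false in the log above

# Deadlock: the master API here is a static pod that this node's kubelet must
# launch, but the kubelet exits because its CSR POST to that same API is
# refused, so this wait can only time out (observed "elapsed": 604 above).
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"  # qe-jliu-rp10-master-etcd-1 in the logs
    port: 8443
    timeout: 600
```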
Have a try on the latest build, openshift-ansible-3.10.0-0.54.0.git.0.537c485.el7.noarch. The original failing task now seems to work well.

Hit another little issue when upgrading v3.10 to the latest v3.10:

```
TASK [openshift_node : Wait for node to be ready] ******************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade.yml:46
Thursday 31 May 2018  09:27:25 +0000 (0:00:17.645)       0:14:47.947 **********
FAILED - RETRYING: Wait for node to be ready (36 retries left).
fatal: [x]: FAILED! => {"failed": true, "msg": "The conditional check 'node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True' failed. The error was: error while evaluating conditional (node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True): 'dict object' has no attribute 'results'"}
```
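The error message shows that the registered `node_output` has no `results` key at all when the lookup fails outright, so the `until` expression aborts with an evaluation error instead of retrying. A guarded version of the check might look like the sketch below; the `oc_obj` module name and the delay value are assumptions based on the task output, and this is not necessarily what the eventual fix implements.

```yaml
# Sketch of a defensively guarded "Wait for node to be ready" loop
# (assumed module and values; not the verbatim role source or the actual fix).
- name: Wait for node to be ready
  oc_obj:
    state: list
    kind: node
    name: "{{ openshift.node.nodename | lower }}"
  register: node_output
  retries: 36          # matches "36 retries left" in the output above
  delay: 10            # assumed
  until: >-
    node_output.results is defined
    and node_output.results.returncode == 0
    and node_output.results.results[0].status.conditions
        | selectattr('type', 'match', '^Ready$')
        | map(attribute='status') | join | bool
```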
liujia, do you have AAAA records for the hostname qe-jliu-rp10-master-etcd-1? It looks like the node is attempting to use IPv6 to connect to the API, and while the API may be running, we don't bind to the IPv6 addresses. Can you see if the API is started at all? Look for docker containers using `docker ps -a`. Also, can you please archive the entire structure of /etc/origin/node and provide that so we can review all of the configuration files?

There may still be a deadlock problem here where the node doesn't bring up static pods while waiting to bootstrap. Need to discuss that with the Pod team.

(In reply to liujia from comment #11)
> Hit another little issue for upgrade v3.10 to latest v3.10.
> [...]

Checked the node status when the upgrade failed and exited:

```
# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-253.ec2.internal   Ready     master    28m       v1.10.0+b81c8f8
ip-172-18-14-135.ec2.internal   Ready     compute   26m       v1.10.0+b81c8f8

# oc get pod -n kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-13-253.ec2.internal           1/1       Running   1          30m
master-controllers-ip-172-18-13-253.ec2.internal   1/1       Running   1          30m
master-etcd-ip-172-18-13-253.ec2.internal          1/1       Running   1          30m

# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/atomic-openshift-node.service.d
           └─override.conf
   Active: active (running) since Sun 2018-06-03 23:16:10 EDT; 4min 8s ago
```

After 1 hour (no operation), checked again:

```
# oc get node
NAME                            STATUS     ROLES     AGE       VERSION
ip-172-18-13-253.ec2.internal   NotReady   master    2h        v1.10.0+b81c8f8
ip-172-18-14-135.ec2.internal   Ready      compute   2h        v1.10.0+b81c8f8

# oc get pod -n kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-13-253.ec2.internal           1/1       Unknown   1          2h
master-controllers-ip-172-18-13-253.ec2.internal   1/1       Unknown   1          2h
master-etcd-ip-172-18-13-253.ec2.internal          1/1       Unknown   1          2h
```

```
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Stopping OpenShift Node...
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal atomic-openshift-node[12225]: I0604 01:10:05.195569   12225 docker_server.go:79] Stop docker server
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Stopped OpenShift Node.
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: atomic-openshift-node.service failed.
```

Comments 14, 15, and 16 seem to be a different problem than the one originally reported, where the node cannot communicate with the API server to complete the bootstrapping certificate creation. Moving to 3.10.z since we've not found a reproducer yet.

The logs of the last failure show that there was no 'results' item in the dictionary and that no retries were made.

https://github.com/openshift/openshift-ansible/pull/8674

Verified on openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch