Description of problem:
Upgrade OCP v3.10 to the latest version. The upgrade failed at task [openshift_node : Wait for master API to come back online], caused by the node failing to start. The node service needs to post a certificate signing request to the master, but the master service was stopped before the node package was upgraded.

TASK [openshift_node : Start services] **************************************************************************************************************************************
ok: [x.x.x.x] => (item=atomic-openshift-node) => {"changed": false, "failed_when_result": false, "item": "atomic-openshift-node", "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

TASK [openshift_node : Wait for master API to come back online] *************************************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "elapsed": 600, "msg": "Timeout when waiting for qe-jliu-rp10-master-etcd-1:8443"}
	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.retry

Apr 28 05:41:46 qe-jliu-rp10-master-etcd-1 atomic-openshift-node[7429]: F0428 05:41:46.837340 7429 server.go:233] failed to run Kubelet: cannot create certificate signing request: Post https://qe-jliu-rp10-master-etcd-1:8443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp 10.240.0.36:8443: getsockopt: connection refused
Apr 28 05:41:46 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Apr 28 05:41:46 qe-jliu-rp10-master-etcd-1 systemd[1]: Failed to start OpenShift Node.
Apr 28 05:41:46 qe-jliu-rp10-master-etcd-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
Apr 28 05:41:46 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service failed.

# docker ps | grep master
8f962a5b90f5   registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67   "/bin/sh -c '#!/bi..."   30 minutes ago   Up 30 minutes   k8s_etcd_master-etcd-qe-jliu-rpm10-master-etcd-1_kube-system_a9f73a820430c627cbe1a4585ccdafdd_0

Version-Release number of the following components:
openshift-ansible-3.10.0-0.30.0.git.0.4f02952.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP v3.10.0-0.29.0
2. Enable the latest repos and upgrade the above OCP to the latest version
3.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
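The failing task is essentially a TCP poll against the master API port. A minimal sketch of that kind of check (illustrative only, not the actual openshift-ansible implementation; the host, port, and 600-second timeout values are taken from the failure above):

```shell
#!/bin/bash
# Illustrative sketch: poll a TCP port until it accepts connections or a
# timeout expires, similar to what the "Wait for master API to come back
# online" task does. Not the real openshift-ansible code.
wait_for_port() {
    local host=$1 port=$2 timeout=$3 elapsed=0
    # /dev/tcp is a bash pseudo-device; the subshell fails while the port
    # is closed, so the loop keeps polling.
    until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
        sleep 1
        elapsed=$((elapsed + 1))
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "Timeout when waiting for ${host}:${port}"
            return 1
        fi
    done
    echo "${host}:${port} is accepting connections"
}

# Values from the failing task in this report:
# wait_for_port qe-jliu-rp10-master-etcd-1 8443 600
```

The deadlock reported here is visible in this shape: the loop can never succeed if the API (a static pod) only comes up after the node service it is waiting on.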
Can this be reproduced in 3.10.0-0.37.0 or newer? The cert signing request shouldn't need to be fulfilled before the api server pod comes online. If this is still happening please capture the full `journalctl --no-pager -u atomic-openshift-node` and attach to the bug.
Did not hit the issue on openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch.
OK, moving this to ON_QA. If we don't see this again, let's go ahead and close it CLOSED CURRENTRELEASE.
Verified on openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch.
Hit the issue again on openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch.

TASK [openshift_node : Start services] **************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:33
ok: [x] => (item=atomic-openshift-node) => {"changed": false, "failed_when_result": false, "item": "atomic-openshift-node", "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

TASK [openshift_node : Wait for master API to come back online] *************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:39
fatal: [x]: FAILED! => {"changed": false, "elapsed": 604, "msg": "Timeout when waiting for qe-jliu-rp10-master-etcd-1:8443"}

# docker ps | grep master
83bc3942cad4   registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67   "/bin/sh -c '#!/bi..."   3 hours ago   Up 3 hours   k8s_etcd_master-etcd-qe-jliu-rp10-master-etcd-1_kube-system_a9f73a820430c627cbe1a4585ccdafdd_0

May 15 22:14:58 qe-jliu-rp10-master-etcd-1 atomic-openshift-node[18055]: F0515 22:14:58.725923 18055 server.go:233] failed to run Kubelet: cannot create certificate signing request: Post https://qe-jliu-rp10-master-etcd-1:8443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp [fe80::4001:aff:fef0:17%eth0]:8443: getsockopt: connection refused
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: Failed to start OpenShift Node.
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service failed.

Node service log in attachment.
I believe we've hit a similar permutation of this during a free-int 3.10.0-x.y.z to 3.10.0-x.y+1.z upgrade where, due to master misconfiguration, all master services were stopped, taking the API down. The node then failed to start with a similar error. Only after restoring the API on the other two hosts was I able to bring the host in question back up.
If this happens again, let's gather:

journalctl --no-pager -u atomic-openshift-node
ls -la /etc/origin/node
ls -la /etc/origin/node/certificates
Possible dupe of 1579267
It should not be. BZ1579267 is based on a successfully upgraded env with master and node services running well; this issue happened during the upgrade process, where master and node wait for each other.

Tried the latest build openshift-ansible-3.10.0-0.54.0.git.0.537c485.el7.noarch. The originally failing task seems to work well now.

Hit another small issue when upgrading v3.10 to the latest v3.10:

TASK [openshift_node : Wait for node to be ready] ******************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade.yml:46
Thursday 31 May 2018  09:27:25 +0000 (0:00:17.645)       0:14:47.947 **********
FAILED - RETRYING: Wait for node to be ready (36 retries left).
fatal: [x]: FAILED! => {"failed": true, "msg": "The conditional check 'node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True' failed. The error was: error while evaluating conditional (node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True): 'dict object' has no attribute 'results'"}
liujia, do you have AAAA records for the hostname qe-jliu-rp10-master-etcd-1? It looks like the node is attempting to use IPv6 to connect to the API, and while the API may be running, we don't bind to the IPv6 addresses.

Can you check whether the API is started at all? Look for docker containers using `docker ps -a`.

Also, can you please archive the entire structure of /etc/origin/node and provide that so we can review all of the configuration files?

There may still be a deadlock problem here where the node doesn't bring up static pods while waiting to bootstrap. Need to discuss that with the Pod team.
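A quick way to see whether IPv6 addresses are in play for a hostname is to ask the resolver directly. This is a sketch: `getent ahosts` shows everything the resolver would return (DNS A/AAAA records and /etc/hosts entries), and `localhost` below is a stand-in for the real master hostname (qe-jliu-rp10-master-etcd-1 in this report):

```shell
#!/bin/bash
# Check whether a hostname resolves to any IPv6 addresses. If it does,
# the kubelet may try IPv6 first even when the API server binds only
# IPv4. "localhost" is a placeholder for the actual master hostname.
host=localhost

# All addresses the resolver returns for this name (A, AAAA, /etc/hosts).
getent ahosts "$host"

# Count distinct IPv6 addresses (they contain colons).
v6=$(getent ahosts "$host" | awk '$1 ~ /:/ {print $1}' | sort -u | wc -l)
echo "IPv6 addresses for ${host}: ${v6}"
```

A non-zero count means connections to `https://$host:8443` may be attempted over IPv6 first, matching the `dial tcp [fe80::...]:8443` error in comment 7.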
(In reply to liujia from comment #11)
> Hit another little issue for upgrade v3.10 to latest v3.10.
>
> TASK [openshift_node : Wait for node to be ready]
> [...]

Checked the node status when the upgrade failed and exited.
# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-253.ec2.internal   Ready     master    28m       v1.10.0+b81c8f8
ip-172-18-14-135.ec2.internal   Ready     compute   26m       v1.10.0+b81c8f8

# oc get pod -n kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-13-253.ec2.internal           1/1       Running   1          30m
master-controllers-ip-172-18-13-253.ec2.internal   1/1       Running   1          30m
master-etcd-ip-172-18-13-253.ec2.internal          1/1       Running   1          30m

# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/atomic-openshift-node.service.d
           └─override.conf
   Active: active (running) since Sun 2018-06-03 23:16:10 EDT; 4min 8s ago

After 1 hour (no operation), checked again:

# oc get node
NAME                            STATUS     ROLES     AGE       VERSION
ip-172-18-13-253.ec2.internal   NotReady   master    2h        v1.10.0+b81c8f8
ip-172-18-14-135.ec2.internal   Ready      compute   2h        v1.10.0+b81c8f8

# oc get pod -n kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-13-253.ec2.internal           1/1       Unknown   1          2h
master-controllers-ip-172-18-13-253.ec2.internal   1/1       Unknown   1          2h
master-etcd-ip-172-18-13-253.ec2.internal          1/1       Unknown   1          2h

Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Stopping OpenShift Node...
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal atomic-openshift-node[12225]: I0604 01:10:05.195569 12225 docker_server.go:79] Stop docker server
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Stopped OpenShift Node.
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: atomic-openshift-node.service failed.
Comments 14, 15, and 16 seem to describe a different problem than the one originally reported, in which the node cannot communicate with the API server to complete the bootstrapping certificate creation.
Moving to 3.10.z since we've not found a reproducer yet.
The logs of the last failure show that there was no 'results' item in the dictionary and that no retries were made. https://github.com/openshift/openshift-ansible/pull/8674
Verified on openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch