Bug 1572870
| Summary: | Fail to upgrade against bootstrap OCP because node cannot start | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu> |
| Component: | Cluster Version Operator | Assignee: | Michael Gugino <mgugino> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | liujia <jiajliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.10.0 | CC: | aos-bugs, dma, jiajliu, jokerman, mifiedle, mmccomas, vlaad, wmeng, xtian |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-08-13 15:11:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
liujia
2018-04-28 10:00:01 UTC
Can this be reproduced in 3.10.0-0.37.0 or newer? The cert signing request shouldn't need to be fulfilled before the API server pod comes online. If this is still happening, please capture the full `journalctl --no-pager -u atomic-openshift-node` output and attach it to the bug.

Did not hit the issue on openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch.

OK, moving this to ON_QA; if we don't see this again, let's go ahead and close as CURRENTRELEASE.

Verified on openshift-ansible-3.10.0-0.41.0.git.0.88119e4.el7.noarch.

Hit the issue again on openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch:

```
TASK [openshift_node : Start services] **************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:33
ok: [x] => (item=atomic-openshift-node) => {"changed": false, "failed_when_result": false, "item": "atomic-openshift-node", "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

TASK [openshift_node : Wait for master API to come back online] *************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:39
fatal: [x]: FAILED! => {"changed": false, "elapsed": 604, "msg": "Timeout when waiting for qe-jliu-rp10-master-etcd-1:8443"}
```

```
# docker ps|grep master
83bc3942cad4   registry.access.redhat.com/rhel7/etcd@sha256:128649895440ef04261b5198f3b558fd368be62123114fa45e5cacd589d18c67   "/bin/sh -c '#!/bi..."   3 hours ago   Up 3 hours   k8s_etcd_master-etcd-qe-jliu-rp10-master-etcd-1_kube-system_a9f73a820430c627cbe1a4585ccdafdd_0
```

```
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 atomic-openshift-node[18055]: F0515 22:14:58.725923   18055 server.go:233] failed to run Kubelet: cannot create certificate signing request: Post https://qe-jliu-rp10-master-etcd-1:8443/apis/certificates.k8s.io/v1beta1/certificatesigningrequests: dial tcp [fe80::4001:aff:fef0:17%eth0]:8443: getsockopt: connection refused
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: Failed to start OpenShift Node.
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: Unit atomic-openshift-node.service entered failed state.
May 15 22:14:58 qe-jliu-rp10-master-etcd-1 systemd[1]: atomic-openshift-node.service failed.
```

Node service log in attachment.

I believe we've hit a similar permutation of this during a free-int 3.10.0-x.y.z to 3.10.0-x.y+1.z upgrade: due to master misconfiguration, all master services were stopped, taking the API down, and the node then failed to start with a similar error. Only after restoring the API on the other two hosts was I able to bring the host in question back up. If this happens again, let's gather:

```
journalctl --no-pager -u atomic-openshift-node
ls -la /etc/origin/node
ls -la /etc/origin/node/certificates
```

Possible dupe of 1579267.

It should not be. BZ1579267 is based on a successfully upgraded env where master and node services run well. This issue happened during the upgrade process, where master and node wait for each other.
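For context on the "wait for each other" failure mode: the upgrade restarts the node service and then polls the master API on port 8443 (restart.yml:33 and :39 in the output above). A rough reconstruction of that sequence is sketched below; the module choices and parameter values are assumptions inferred from the log output, not the verbatim role source.

```yaml
# Illustrative sketch of roles/openshift_node/tasks/upgrade/restart.yml
# (assumed modules and values; not copied from the role source).
- name: Start services
  service:
    name: atomic-openshift-node
    state: started
  failed_when: false   # matches "failed_when_result": false in the log above

# Deadlock: the master API here is a static pod that this node's kubelet must
# launch, but the kubelet exits because its CSR POST to that same API is
# refused, so this wait can only time out (observed "elapsed": 604 above).
- name: Wait for master API to come back online
  wait_for:
    host: "{{ openshift.common.hostname }}"  # qe-jliu-rp10-master-etcd-1 in the logs
    port: 8443
    timeout: 600
```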
Have a try on the latest build, openshift-ansible-3.10.0-0.54.0.git.0.537c485.el7.noarch. The original failing task now seems to work well.

Hit another little issue when upgrading v3.10 to the latest v3.10:

```
TASK [openshift_node : Wait for node to be ready] ******************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade.yml:46
Thursday 31 May 2018  09:27:25 +0000 (0:00:17.645)       0:14:47.947 **********
FAILED - RETRYING: Wait for node to be ready (36 retries left).
fatal: [x]: FAILED! => {"failed": true, "msg": "The conditional check 'node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True' failed. The error was: error while evaluating conditional (node_output.results.returncode == 0 and node_output.results.results[0].status.conditions | selectattr('type', 'match', '^Ready$') | map(attribute='status') | join | bool == True): 'dict object' has no attribute 'results'"}
```
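The error message shows that the registered `node_output` has no `results` key at all when the lookup fails outright, so the `until` expression aborts with an evaluation error instead of retrying. A guarded version of the check might look like the sketch below; the `oc_obj` module name and the delay value are assumptions based on the task output, and this is not necessarily what the eventual fix implements.

```yaml
# Sketch of a defensively guarded "Wait for node to be ready" loop
# (assumed module and values; not the verbatim role source or the actual fix).
- name: Wait for node to be ready
  oc_obj:
    state: list
    kind: node
    name: "{{ openshift.node.nodename | lower }}"
  register: node_output
  retries: 36          # matches "36 retries left" in the output above
  delay: 10            # assumed
  until: >-
    node_output.results is defined
    and node_output.results.returncode == 0
    and node_output.results.results[0].status.conditions
        | selectattr('type', 'match', '^Ready$')
        | map(attribute='status') | join | bool
```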
liujia, do you have AAAA records for the hostname qe-jliu-rp10-master-etcd-1? It looks like the node is attempting to use IPv6 to connect to the API, and while the API may be running, we don't bind to the IPv6 addresses. Can you see if the API is started at all? Look for docker containers using `docker ps -a`. Also, can you please archive the entire structure of /etc/origin/node and provide that so we can review all of the configuration files?

There may still be a deadlock problem here where the node doesn't bring up static pods while waiting to bootstrap. Need to discuss that with the Pod team.

(In reply to liujia from comment #11)
> Hit another little issue for upgrade v3.10 to latest v3.10.
> [...]

Checked the node status when the upgrade failed and exited:

```
# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-253.ec2.internal   Ready     master    28m       v1.10.0+b81c8f8
ip-172-18-14-135.ec2.internal   Ready     compute   26m       v1.10.0+b81c8f8

# oc get pod -n kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-13-253.ec2.internal           1/1       Running   1          30m
master-controllers-ip-172-18-13-253.ec2.internal   1/1       Running   1          30m
master-etcd-ip-172-18-13-253.ec2.internal          1/1       Running   1          30m

# systemctl status atomic-openshift-node.service
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/atomic-openshift-node.service.d
           └─override.conf
   Active: active (running) since Sun 2018-06-03 23:16:10 EDT; 4min 8s ago
```

After 1 hour (no operation), checked again:

```
# oc get node
NAME                            STATUS     ROLES     AGE       VERSION
ip-172-18-13-253.ec2.internal   NotReady   master    2h        v1.10.0+b81c8f8
ip-172-18-14-135.ec2.internal   Ready      compute   2h        v1.10.0+b81c8f8

# oc get pod -n kube-system
NAME                                               READY     STATUS    RESTARTS   AGE
master-api-ip-172-18-13-253.ec2.internal           1/1       Unknown   1          2h
master-controllers-ip-172-18-13-253.ec2.internal   1/1       Unknown   1          2h
master-etcd-ip-172-18-13-253.ec2.internal          1/1       Unknown   1          2h
```

```
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Stopping OpenShift Node...
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal atomic-openshift-node[12225]: I0604 01:10:05.195569   12225 docker_server.go:79] Stop docker server
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=1/FAILURE
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Stopped OpenShift Node.
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: Unit atomic-openshift-node.service entered failed state.
Jun 04 01:10:05 ip-172-18-13-253.ec2.internal systemd[1]: atomic-openshift-node.service failed.
```

Comments 14, 15, and 16 seem to be a different problem than the one originally reported, where the node cannot communicate with the API server to complete the bootstrapping certificate creation. Moving to 3.10.z since we've not found a reproducer yet.

The logs of the last failure show that there was no 'results' item in the dictionary and that no retries were made.

https://github.com/openshift/openshift-ansible/pull/8674

Verified on openshift-ansible-3.10.0-0.67.0.git.107.1bd1f01.el7.noarch