Bug 1578790 - Upgrade failed at task [openshift_node : Wait for node to be ready] for ha deployment
Summary: Upgrade failed at task [openshift_node : Wait for node to be ready] for ha de...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.10.0
Assignee: Vadim Rutkovsky
QA Contact: liujia
Depends On:
Reported: 2018-05-16 11:55 UTC by liujia
Modified: 2018-07-30 19:15 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-07-30 19:15:30 UTC
Target Upstream Version:

Attachments (Terms of Use)

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:15:54 UTC

Description liujia 2018-05-16 11:55:31 UTC
Description of problem:
Upgrade against an rpm-installed cluster failed at task [openshift_node : Wait for node to be ready].

TASK [openshift_node : Wait for node to be ready] ******************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade.yml:46
Wednesday 16 May 2018  10:15:47 +0000 (0:01:49.363)       0:30:20.280 ********* 
FAILED - RETRYING: Wait for node to be ready (36 retries left).
FAILED - RETRYING: Wait for node to be ready (1 retries left).
fatal: [qe-jliu-hr39a-master-1 -> qe-jliu-hr39a-master-1]: FAILED! => {"attempts": 36, "changed": false, "failed": true, "results": {"cmd": "/usr/bin/oc get node qe-jliu-hr39a-master-1 -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2018-05-16T08:59:36Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/instance-type": "n1-standard-4", "beta.kubernetes.io/os": "linux", "failure-domain.beta.kubernetes.io/region": "us-central1", "failure-domain.beta.kubernetes.io/zone": "us-central1-a", "kubernetes.io/hostname": "qe-jliu-hr39a-master-1", "node-role.kubernetes.io/master": "true", "role": "node"}, "name": "qe-jliu-hr39a-master-1", "resourceVersion": "54046", "selfLink": "/api/v1/nodes/qe-jliu-hr39a-master-1", "uid": "752ab024-58e7-11e8-bdbd-42010af0005b"}, "spec": {"externalID": "6328787278136496779", "providerID": "gce://openshift-gce-devel/us-central1-a/qe-jliu-hr39a-master-1"}, "status": {"addresses": [{"address": "", "type": "InternalIP"}, {"address": "", "type": "ExternalIP"}, {"address": "qe-jliu-hr39a-master-1", "type": "Hostname"}], "allocatable": {"cpu": "4", "memory": "15132476Ki", "pods": "250"}, "capacity": {"cpu": "4", "memory": "15234876Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": null, "lastTransitionTime": "2018-05-16T08:59:36Z", "message": "openshift-sdn cleared kubelet-set NoRouteCreated", "reason": "RouteCreated", "status": "False", "type": "NetworkUnavailable"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": 
"MemoryPressure"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": [{"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-web-console@sha256:5494d94c35216cd8e92f245169343a81874f1254ae5bc60d2f296c0325132fd1", "registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.29"], "sizeBytes": 465717173}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker@sha256:1ec471293b26d7d222bed0c5981d4ab53efc047f8ce624332643ccf9fbbdaa10", "registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.29"], "sizeBytes": 299487698}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:0b5ad372b0e0e63d94d087311aaabb9af21a090b6487d9e820a4de9df9c85b83", "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.29"], "sizeBytes": 287603698}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/registry-console@sha256:85299d20fb6a70f7f594ab26fd95a4a58e44e73b7d61f5f8c6de6d5a77c1814f", "registry.reg-aws.openshift.com:443/openshift3/registry-console:v3.9"], "sizeBytes": 233035520}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-pod@sha256:d1219cf14c328ce50f931ac0dbd8afd6c4943672f3d3c942110744f57c8d3886", "registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.9.29"], "sizeBytes": 214083992}], "nodeInfo": {"architecture": "amd64", "bootID": "32e70ad7-9da4-4268-a0de-d8fb99f49bb3", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-862.el7.x86_64", "kubeProxyVersion": "v1.9.1+a0ce1bc657", 
"kubeletVersion": "v1.9.1+a0ce1bc657", "machineID": "de1ef526da6c49fa872b90de9d8a66d1", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux Server 7.5 (Maipo)", "systemUUID": "981D8703-4EEA-A3D9-C9D8-6DFDE0B4240C"}}}], "returncode": 0}, "state": "list"}

But the node was actually running fine:
# oc get node
NAME                                   STATUS    ROLES     AGE       VERSION
qe-jliu-hr39a-master-1                 Ready     master    1h        v1.10.0+b81c8f8
qe-jliu-hr39a-master-2                 Ready     master    1h        v1.9.1+a0ce1bc657
qe-jliu-hr39a-master-3                 Ready     master    1h        v1.9.1+a0ce1bc657
qe-jliu-hr39a-router-registry-node-1   Ready     compute   1h        v1.9.1+a0ce1bc657
qe-jliu-hr39a-router-registry-node-2   Ready     compute   1h        v1.9.1+a0ce1bc657

# systemctl status atomic-openshift-node.service 
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-05-16 06:28:49 EDT; 12min ago
     Docs: https://github.com/openshift/origin

It seems the time waited for the node to become ready is still not enough: when I ran the same command manually a little later, I got a different result for the node status.

[root@qe-jliu-hr39a-master-1 ~]# /usr/bin/oc get node qe-jliu-hr39a-master-1 -o json -n default

"conditions": [
    {
        "lastHeartbeatTime": null,
        "lastTransitionTime": "2018-05-16T08:59:36Z",
        "message": "openshift-sdn cleared kubelet-set NoRouteCreated",
        "reason": "RouteCreated",
        "status": "False",
        "type": "NetworkUnavailable"
    },
    {
        "lastHeartbeatTime": "2018-05-16T10:36:00Z",
        "lastTransitionTime": "2018-05-16T10:28:16Z",
        "message": "kubelet has sufficient disk space available",
        "reason": "KubeletHasSufficientDisk",
        "status": "False",
        "type": "OutOfDisk"
    },
    {
        "lastHeartbeatTime": "2018-05-16T10:36:00Z",
        "lastTransitionTime": "2018-05-16T10:28:16Z",
        "message": "kubelet has sufficient memory available",
        "reason": "KubeletHasSufficientMemory",
        "status": "False",
        "type": "MemoryPressure"
    },
    {
        "lastHeartbeatTime": "2018-05-16T10:36:00Z",
        "lastTransitionTime": "2018-05-16T10:28:16Z",
        "message": "kubelet has no disk pressure",
        "reason": "KubeletHasNoDiskPressure",
        "status": "False",
        "type": "DiskPressure"
    },
    {
        "lastHeartbeatTime": "2018-05-16T10:36:00Z",
        "lastTransitionTime": "2018-05-16T10:29:19Z",
        "message": "kubelet is posting ready status",
        "reason": "KubeletReady",
        "status": "True",
        "type": "Ready"
    },
    {
        "lastHeartbeatTime": "2018-05-16T10:36:00Z",
        "lastTransitionTime": "2018-05-16T10:28:16Z",
        "message": "kubelet has sufficient PID available",
        "reason": "KubeletHasSufficientPID",
        "status": "False",
        "type": "PIDPressure"
    }
]
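
For reference, the failing task in roles/openshift_node/tasks/upgrade.yml is a retry loop around `oc get node` that polls until the Ready condition reports True. The sketch below is an approximation reconstructed from the log (the "attempts": 36 count and the `oc get node ... -o json` command); the module name, variable names, and the exact `until` expression are assumptions, not copied from the role:

```yaml
# Approximate shape of the "Wait for node to be ready" task.
# The until expression follows the node JSON shown above; module and
# variable names are assumed, not taken from openshift-ansible.
- name: Wait for node to be ready
  oc_obj:
    state: list
    kind: node
    name: "{{ openshift.node.nodename | lower }}"
    namespace: default
  register: node_output
  delegate_to: "{{ groups.oo_first_master.0 }}"
  until:
    - node_output.results.returncode == 0
    - node_output.results.results[0].status.conditions
      | selectattr('type', 'match', '^Ready$')
      | map(attribute='status') | join | bool
  retries: 36   # 36 retries at a ~10s delay gives up after roughly 6 minutes
  delay: 10
```

With that retry budget, a node that takes longer than about 6 minutes to post a Ready heartbeat (for example because images are still being pulled) fails the task even though it becomes Ready shortly afterwards, which matches the behavior reported here.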

Version-Release number of the following components:

How reproducible:

Steps to Reproduce:
1. Install OCP via rpm with an HA deployment
2. Run the upgrade against the above cluster

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
This only happened for the HA deployment during testing.

Comment 2 liujia 2018-05-17 05:42:19 UTC
Hit it again with openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch.

Adding TestBlocker since this blocks the HA rpm upgrade.

Comment 4 Scott Dodson 2018-05-17 13:15:04 UTC
This is more than likely just a timing issue. We can increase the timeout from 6 minutes to 10 minutes.

Comment 5 Scott Dodson 2018-05-17 13:39:03 UTC
We're going to pre-pull the control-plane image on master hosts as early as possible.

On nodes we're going to pre-pull the node image and pod image as early as possible.

All of these tasks are fire-and-forget; we never wait for them to return successfully.
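
The fire-and-forget pre-pull described above can be expressed in Ansible with `async` plus `poll: 0`, so the play moves on immediately while the pull runs in the background. This is an illustrative sketch only; the image variables and task names are assumptions, not copied from the actual PR:

```yaml
# Illustrative fire-and-forget pre-pull: async starts the pull in the
# background and poll: 0 means the play never waits for (or checks) the
# result. Variable names here are assumed for the example.
- name: Pre-pull node and pod images
  command: "docker pull {{ item }}"
  with_items:
    - "{{ osn_image }}"       # e.g. the node image (variable name assumed)
    - "{{ osn_pod_image }}"   # e.g. the pod image (variable name assumed)
  async: 1800
  poll: 0
  changed_when: false
```

Starting the pulls early means the images are likely already present by the time the node service restarts, shrinking the window in which "Wait for node to be ready" can time out.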

Comment 6 Vadim Rutkovsky 2018-05-22 17:17:01 UTC
Created PR to prepull images: https://github.com/openshift/openshift-ansible/pull/8172

Comment 7 Vadim Rutkovsky 2018-05-25 09:04:23 UTC
The prepull change should have resolved this; please verify it no longer happens with openshift-ansible-3.10.0-0.51.0.

Comment 8 liujia 2018-05-28 10:57:14 UTC
Verified on openshift-ansible-3.10.0-0.53.0.git.0.53fe016.el7.noarch.

Comment 10 errata-xmlrpc 2018-07-30 19:15:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

