Bug 1578790

Summary: Upgrade failed at task [openshift_node : Wait for node to be ready] for ha deployment
Product: OpenShift Container Platform Reporter: liujia <jiajliu>
Component: Cluster Version Operator Assignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: liujia <jiajliu>
Severity: high Docs Contact:
Priority: high    
Version: 3.10.0 CC: aos-bugs, jokerman, mifiedle, mmccomas, wmeng
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-30 19:15:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description liujia 2018-05-16 11:55:31 UTC
Description of problem:
Upgrade against an RPM-installed cluster failed at task [openshift_node : Wait for node to be ready].

TASK [openshift_node : Wait for node to be ready] ******************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade.yml:46
Wednesday 16 May 2018  10:15:47 +0000 (0:01:49.363)       0:30:20.280 ********* 
FAILED - RETRYING: Wait for node to be ready (36 retries left).
....
FAILED - RETRYING: Wait for node to be ready (1 retries left).
fatal: [qe-jliu-hr39a-master-1 -> qe-jliu-hr39a-master-1]: FAILED! => {"attempts": 36, "changed": false, "failed": true, "results": {"cmd": "/usr/bin/oc get node qe-jliu-hr39a-master-1 -o json -n default", "results": [{"apiVersion": "v1", "kind": "Node", "metadata": {"annotations": {"volumes.kubernetes.io/controller-managed-attach-detach": "true"}, "creationTimestamp": "2018-05-16T08:59:36Z", "labels": {"beta.kubernetes.io/arch": "amd64", "beta.kubernetes.io/instance-type": "n1-standard-4", "beta.kubernetes.io/os": "linux", "failure-domain.beta.kubernetes.io/region": "us-central1", "failure-domain.beta.kubernetes.io/zone": "us-central1-a", "kubernetes.io/hostname": "qe-jliu-hr39a-master-1", "node-role.kubernetes.io/master": "true", "role": "node"}, "name": "qe-jliu-hr39a-master-1", "resourceVersion": "54046", "selfLink": "/api/v1/nodes/qe-jliu-hr39a-master-1", "uid": "752ab024-58e7-11e8-bdbd-42010af0005b"}, "spec": {"externalID": "6328787278136496779", "providerID": "gce://openshift-gce-devel/us-central1-a/qe-jliu-hr39a-master-1"}, "status": {"addresses": [{"address": "10.240.0.64", "type": "InternalIP"}, {"address": "35.232.138.47", "type": "ExternalIP"}, {"address": "qe-jliu-hr39a-master-1", "type": "Hostname"}], "allocatable": {"cpu": "4", "memory": "15132476Ki", "pods": "250"}, "capacity": {"cpu": "4", "memory": "15234876Ki", "pods": "250"}, "conditions": [{"lastHeartbeatTime": null, "lastTransitionTime": "2018-05-16T08:59:36Z", "message": "openshift-sdn cleared kubelet-set NoRouteCreated", "reason": "RouteCreated", "status": "False", "type": "NetworkUnavailable"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "OutOfDisk"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": 
"Unknown", "type": "MemoryPressure"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "DiskPressure"}, {"lastHeartbeatTime": "2018-05-16T10:02:32Z", "lastTransitionTime": "2018-05-16T10:03:15Z", "message": "Kubelet stopped posting node status.", "reason": "NodeStatusUnknown", "status": "Unknown", "type": "Ready"}], "daemonEndpoints": {"kubeletEndpoint": {"Port": 10250}}, "images": [{"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-web-console@sha256:5494d94c35216cd8e92f245169343a81874f1254ae5bc60d2f296c0325132fd1", "registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.29"], "sizeBytes": 465717173}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker@sha256:1ec471293b26d7d222bed0c5981d4ab53efc047f8ce624332643ccf9fbbdaa10", "registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker:v3.9.29"], "sizeBytes": 299487698}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:0b5ad372b0e0e63d94d087311aaabb9af21a090b6487d9e820a4de9df9c85b83", "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.29"], "sizeBytes": 287603698}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/registry-console@sha256:85299d20fb6a70f7f594ab26fd95a4a58e44e73b7d61f5f8c6de6d5a77c1814f", "registry.reg-aws.openshift.com:443/openshift3/registry-console:v3.9"], "sizeBytes": 233035520}, {"names": ["registry.reg-aws.openshift.com:443/openshift3/ose-pod@sha256:d1219cf14c328ce50f931ac0dbd8afd6c4943672f3d3c942110744f57c8d3886", "registry.reg-aws.openshift.com:443/openshift3/ose-pod:v3.9.29"], "sizeBytes": 214083992}], "nodeInfo": {"architecture": "amd64", "bootID": "32e70ad7-9da4-4268-a0de-d8fb99f49bb3", "containerRuntimeVersion": "docker://1.13.1", "kernelVersion": "3.10.0-862.el7.x86_64", "kubeProxyVersion": 
"v1.9.1+a0ce1bc657", "kubeletVersion": "v1.9.1+a0ce1bc657", "machineID": "de1ef526da6c49fa872b90de9d8a66d1", "operatingSystem": "linux", "osImage": "Red Hat Enterprise Linux Server 7.5 (Maipo)", "systemUUID": "981D8703-4EEA-A3D9-C9D8-6DFDE0B4240C"}}}], "returncode": 0}, "state": "list"}

But the node was actually running fine.
# oc get node
NAME                                   STATUS    ROLES     AGE       VERSION
qe-jliu-hr39a-master-1                 Ready     master    1h        v1.10.0+b81c8f8
qe-jliu-hr39a-master-2                 Ready     master    1h        v1.9.1+a0ce1bc657
qe-jliu-hr39a-master-3                 Ready     master    1h        v1.9.1+a0ce1bc657
qe-jliu-hr39a-router-registry-node-1   Ready     compute   1h        v1.9.1+a0ce1bc657
qe-jliu-hr39a-router-registry-node-2   Ready     compute   1h        v1.9.1+a0ce1bc657

# systemctl status atomic-openshift-node.service 
● atomic-openshift-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-05-16 06:28:49 EDT; 12min ago
     Docs: https://github.com/openshift/origin


It seems the time spent waiting for the node to become ready is still not long enough: when I ran the same command manually afterwards, it reported a different node status.

[root@qe-jliu-hr39a-master-1 ~]# /usr/bin/oc get node qe-jliu-hr39a-master-1 -o json -n default

<--snip-->
"conditions": [
            {
                "lastHeartbeatTime": null,
                "lastTransitionTime": "2018-05-16T08:59:36Z",
                "message": "openshift-sdn cleared kubelet-set NoRouteCreated",
                "reason": "RouteCreated",
                "status": "False",
                "type": "NetworkUnavailable"
            },
            {
                "lastHeartbeatTime": "2018-05-16T10:36:00Z",
                "lastTransitionTime": "2018-05-16T10:28:16Z",
                "message": "kubelet has sufficient disk space available",
                "reason": "KubeletHasSufficientDisk",
                "status": "False",
                "type": "OutOfDisk"
            },
            {
                "lastHeartbeatTime": "2018-05-16T10:36:00Z",
                "lastTransitionTime": "2018-05-16T10:28:16Z",
                "message": "kubelet has sufficient memory available",
                "reason": "KubeletHasSufficientMemory",
                "status": "False",
                "type": "MemoryPressure"
            },
            {
                "lastHeartbeatTime": "2018-05-16T10:36:00Z",
                "lastTransitionTime": "2018-05-16T10:28:16Z",
                "message": "kubelet has no disk pressure",
                "reason": "KubeletHasNoDiskPressure",
                "status": "False",
                "type": "DiskPressure"
            },
            {
                "lastHeartbeatTime": "2018-05-16T10:36:00Z",
                "lastTransitionTime": "2018-05-16T10:29:19Z",
                "message": "kubelet is posting ready status",
                "reason": "KubeletReady",
                "status": "True",
                "type": "Ready"
            },
            {
                "lastHeartbeatTime": "2018-05-16T10:36:00Z",
                "lastTransitionTime": "2018-05-16T10:28:16Z",
                "message": "kubelet has sufficient PID available",
                "reason": "KubeletHasSufficientPID",
                "status": "False",
                "type": "PIDPressure"
            }
        ], 
<--snip-->
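
For anyone comparing the two outputs: the difference that matters is the "Ready" entry in status.conditions ("Unknown" / NodeStatusUnknown in the failed task, "True" / KubeletReady in the manual run). Below is a minimal Python sketch of that check, assuming the input is the JSON of a single Node object as printed by `oc get node <name> -o json`. The function name node_is_ready is hypothetical, and this is not the code the openshift_node role uses.

```python
import json


def node_is_ready(node_json: str) -> bool:
    """Return True only if the node's "Ready" condition has status "True".

    Hypothetical helper for illustration. It expects a single Node object,
    not the Ansible module output, which wraps the node in a "results" list.
    """
    node = json.loads(node_json)
    conditions = node.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Ready":
            # The status is the string "True", "False", or "Unknown".
            return cond.get("status") == "True"
    # No Ready condition reported at all: treat the node as not ready.
    return False


# Trimmed-down samples modelled on the two outputs in this report.
stale = json.dumps({"status": {"conditions": [
    {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"}]}})
fresh = json.dumps({"status": {"conditions": [
    {"type": "Ready", "status": "True", "reason": "KubeletReady"}]}})

print(node_is_ready(stale))  # False
print(node_is_ready(fresh))  # True
```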


Version-Release number of the following components:
openshift-ansible-3.10.0-0.46.0.git.0.85c3afd.el7.noarch

How reproducible:
always

Steps to Reproduce:
1. Install OCP from RPMs with an HA deployment.
2. Run the upgrade against that cluster.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
This only happened with HA deployments during testing.

Comment 2 liujia 2018-05-17 05:42:19 UTC
Hit it again with openshift-ansible-3.10.0-0.47.0.git.0.c018c8f.el7.noarch.

Adding TestBlocker because this blocks HA RPM upgrade testing.

Comment 4 Scott Dodson 2018-05-17 13:15:04 UTC
This is more than likely just a timing issue. We can increase the timeout from its 6 minutes to 10 minutes.
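
To make the timing concrete: the failed task reports 36 attempts, and 6 minutes spread over 36 attempts works out to about 10 seconds between attempts. The 10-second delay is inferred from those two numbers and is not stated anywhere in this report. At the same delay, a 10-minute budget would need about 60 attempts. A rough Python sketch of such a retry loop follows; wait_until and its parameters are hypothetical names, and this is not the Ansible implementation.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], retries: int, delay: float,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() up to `retries` times, pausing `delay` seconds between calls.

    Returns True as soon as check() succeeds, False once all attempts are used.
    """
    for attempt in range(retries):
        if check():
            return True
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < retries - 1:
            sleep(delay)
    return False


# Approximate time budgets in minutes, assuming a 10-second delay (inferred).
print(36 * 10 / 60)  # 6.0
print(60 * 10 / 60)  # 10.0
```

The sleep function is injectable so the loop can be exercised without actually waiting.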

Comment 5 Scott Dodson 2018-05-17 13:39:03 UTC
We're going to pre-pull the control-plane image on master hosts as early as possible.

On nodes, we're going to pre-pull the node image and pod image as early as possible.

All of these tasks are fire and forget; we never wait for them to return successfully.
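
As an illustration of the fire-and-forget idea, here is a minimal Python sketch that starts each pull command and returns immediately without checking the result. start_prepull is a hypothetical name, and the example uses a harmless stand-in command instead of a real image pull. On these hosts the real command would presumably be something like `docker pull <image>`, but that is an assumption, and this sketch is not what the playbooks do.

```python
import subprocess
import sys
from typing import List, Sequence


def start_prepull(commands: Sequence[Sequence[str]]) -> List[subprocess.Popen]:
    """Start every command in the background and return without waiting.

    The caller gets the process handles back but is free to ignore them,
    which is the "fire and forget" behaviour described above.
    """
    procs = []
    for cmd in commands:
        procs.append(subprocess.Popen(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        ))
    return procs


# Stand-in commands; a real prepull would launch image pull commands instead.
handles = start_prepull([
    [sys.executable, "-c", "pass"],
    [sys.executable, "-c", "pass"],
])
print(len(handles))  # 2
```

The point of the design is that a slow or failed pull never blocks the upgrade; the later task that actually needs the image simply pulls it again if it is still missing.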

Comment 6 Vadim Rutkovsky 2018-05-22 17:17:01 UTC
Created a PR to prepull images: https://github.com/openshift/openshift-ansible/pull/8172

Comment 7 Vadim Rutkovsky 2018-05-25 09:04:23 UTC
The prepull change should have resolved this; please verify that it no longer happens with openshift-ansible-3.10.0-0.51.0.

Comment 8 liujia 2018-05-28 10:57:14 UTC
Verified on openshift-ansible-3.10.0-0.53.0.git.0.53fe016.el7.noarch.

Comment 10 errata-xmlrpc 2018-07-30 19:15:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816