Description of problem: The RHEL worker scale-up playbook failed at "Wait for nodes to report ready" TASK [openshift_node : Wait for nodes to report ready] ************************* Thursday 15 August 2019 13:26:21 +0800 (0:00:30.580) 0:06:20.154 ******* FAILED - RETRYING: Wait for nodes to report ready (36 retries left). FAILED - RETRYING: Wait for nodes to report ready (35 retries left). ... FAILED - RETRYING: Wait for nodes to report ready (2 retries left). FAILED - RETRYING: Wait for nodes to report ready (1 retries left). failed: [rhel-1.qe-gpei-420815.qe.devcluster.openshift.com -> localhost] (item=rhel-1.qe-gpei-420815.qe.devcluster.openshift.com) => {"ansible_loop_var": "item", "attempts": 36, "changed": true, "cmd": ["oc", "get", "node", "rhel-1.qe-gpei-420815.qe.devcluster.openshift.com", "--config=/home/slave1/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.135821", "end": "2019-08-15 13:29:30.917780", "item": "rhel-1.qe-gpei-420815.qe.devcluster.openshift.com", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 13:29:30.781959", "stderr": "Error from server (NotFound): nodes \"rhel-1.qe-gpei-420815.qe.devcluster.openshift.com\" not found", "stderr_lines": ["Error from server (NotFound): nodes \"rhel-1.qe-gpei-420815.qe.devcluster.openshift.com\" not found"], "stdout": "", "stdout_lines": []} After approve the pending nodes csr manually, the RHEL nodes are available. Version-Release number of the following components: openshift-ansible-4.2.0-201908142219.git.188.7254b39.el7 How reproducible: Always Steps to Reproduce: 1. 2. 3. Actual results: Please include the entire output from the last TASK line through the end of output if an error is generated Expected results: Additional info: Please attach logs from ansible-playbook with the -vvv flag
Nodes are approved automatically in my initial testing. I expect this is because they were rolled out using machine sets. Need to test how this looks when nodes are created outside of the cluster.
Confirmed, node certificates are not automatically approved if the node is not created by the cluster. This causes the playbook to fail at "Wait for nodes to report ready"... introduced in #11808. https://github.com/openshift/openshift-ansible/pull/11808
Verify this bug with openshift-ansible-4.2.0-201908251340.git.198.9c1655b.el7 TASK [openshift_node : Approve node-bootstrapper CSR] ************************** Monday 26 August 2019 11:42:36 +0800 (0:01:09.436) 0:04:20.220 ********* changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-51-169.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 1, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers | grep \" system:serviceaccount:openshift-machine-config-operator:node-bootstrapper \" | cut -d \" \" -f1`;\ndo\n oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig describe csr/$csr | grep \" system:node:ip-10-0-51-169.us-east-2.compute.internal$\";\n if [ $? -eq 0 ];\n then\n oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n if [ $? -eq 0 ];\n then\n count=$((count+1));\n fi;\n fi;\ndone; exit $((!count));\n", "delta": "0:00:02.862661", "end": "2019-08-26 11:42:39.206067", "item": "ip-10-0-51-169.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:36.343406", "stderr": "", "stderr_lines": [], "stdout": " Common Name: system:node:ip-10-0-51-169.us-east-2.compute.internal\ncertificatesigningrequest.certificates.k8s.io/csr-6hmxb approved", "stdout_lines": [" Common Name: system:node:ip-10-0-51-169.us-east-2.compute.internal", "certificatesigningrequest.certificates.k8s.io/csr-6hmxb approved"]} changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-50-20.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 1, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers | grep \" system:serviceaccount:openshift-machine-config-operator:node-bootstrapper \" | cut -d \" \" -f1`;\ndo\n oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig describe csr/$csr | grep \" system:node:ip-10-0-50-20.us-east-2.compute.internal$\";\n if [ $? -eq 0 ];\n then\n oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n if [ $? -eq 0 ];\n then\n count=$((count+1));\n fi;\n fi;\ndone; exit $((!count));\n", "delta": "0:00:02.756084", "end": "2019-08-26 11:42:42.115446", "item": "ip-10-0-50-20.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:39.359362", "stderr": "", "stderr_lines": [], "stdout": " Common Name: system:node:ip-10-0-50-20.us-east-2.compute.internal\ncertificatesigningrequest.certificates.k8s.io/csr-rx4fv approved", "stdout_lines": [" Common Name: system:node:ip-10-0-50-20.us-east-2.compute.internal", "certificatesigningrequest.certificates.k8s.io/csr-rx4fv approved"]} TASK [openshift_node : Approve node CSR] *************************************** Monday 26 August 2019 11:42:42 +0800 (0:00:05.966) 0:04:26.187 ********* FAILED - RETRYING: Approve node CSR (6 retries left). changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-51-169.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 2, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers | grep \" system:node:ip-10-0-51-169.us-east-2.compute.internal \" | cut -d \" \" -f1`;\ndo\n oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n if [ $? -eq 0 ];\n then\n count=$((count+1));\n fi;\ndone; exit $((!count));\n", "delta": "0:00:00.493370", "end": "2019-08-26 11:42:48.156555", "item": "ip-10-0-51-169.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:47.663185", "stderr": "", "stderr_lines": [], "stdout": "certificatesigningrequest.certificates.k8s.io/csr-gd6rz approved", "stdout_lines": ["certificatesigningrequest.certificates.k8s.io/csr-gd6rz approved"]} FAILED - RETRYING: Approve node CSR (6 retries left). changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-50-20.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 2, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers | grep \" system:node:ip-10-0-50-20.us-east-2.compute.internal \" | cut -d \" \" -f1`;\ndo\n oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n if [ $? -eq 0 ];\n then\n count=$((count+1));\n fi;\ndone; exit $((!count));\n", "delta": "0:00:00.504103", "end": "2019-08-26 11:42:54.152217", "item": "ip-10-0-50-20.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:53.648114", "stderr": "", "stderr_lines": [], "stdout": "certificatesigningrequest.certificates.k8s.io/csr-b9kwz approved", "stdout_lines": ["certificatesigningrequest.certificates.k8s.io/csr-b9kwz approved"]} TASK [openshift_node : Wait for nodes to report ready] ************************* Monday 26 August 2019 11:42:54 +0800 (0:00:12.039) 0:04:38.226 ********* FAILED - RETRYING: Wait for nodes to report ready (36 retries left). FAILED - RETRYING: Wait for nodes to report ready (35 retries left). changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-51-169.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 3, "changed": true, "cmd": ["oc", "get", "node", "ip-10-0-51-169.us-east-2.compute.internal", "--config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:01.233045", "end": "2019-08-26 11:43:06.273282", "item": "ip-10-0-51-169.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:43:05.040237", "stderr": "", "stderr_lines": [], "stdout": "True", "stdout_lines": ["True"]} FAILED - RETRYING: Wait for nodes to report ready (36 retries left). changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-50-20.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 2, "changed": true, "cmd": ["oc", "get", "node", "ip-10-0-50-20.us-east-2.compute.internal", "--config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.240301", "end": "2019-08-26 11:43:12.017293", "item": "ip-10-0-50-20.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:43:11.776992", "stderr": "", "stderr_lines": [], "stdout": "True", "stdout_lines": ["True"]}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922