Bug 1741432

Summary: Should approve node csr before checking nodes to be ready
Product: OpenShift Container Platform Reporter: Gaoyun Pei <gpei>
Component: InstallerAssignee: Jeremiah Stuever <jstuever>
Installer sub component: openshift-ansible QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: jstuever, wmeng
Version: 4.2.0   
Target Milestone: ---   
Target Release: 4.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-10-16 06:35:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Gaoyun Pei 2019-08-15 05:57:46 UTC
Description of problem:
The RHEL worker scale-up playbook failed at "Wait for nodes to report ready"

TASK [openshift_node : Wait for nodes to report ready] *************************
Thursday 15 August 2019  13:26:21 +0800 (0:00:30.580)       0:06:20.154 ******* 
FAILED - RETRYING: Wait for nodes to report ready (36 retries left).
FAILED - RETRYING: Wait for nodes to report ready (35 retries left).
...
FAILED - RETRYING: Wait for nodes to report ready (2 retries left).
FAILED - RETRYING: Wait for nodes to report ready (1 retries left).
failed: [rhel-1.qe-gpei-420815.qe.devcluster.openshift.com -> localhost] (item=rhel-1.qe-gpei-420815.qe.devcluster.openshift.com) => {"ansible_loop_var": "item", "attempts": 36, "changed": true, "cmd": ["oc", "get", "node", "rhel-1.qe-gpei-420815.qe.devcluster.openshift.com", "--config=/home/slave1/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.135821", "end": "2019-08-15 13:29:30.917780", "item": "rhel-1.qe-gpei-420815.qe.devcluster.openshift.com", "msg": "non-zero return code", "rc": 1, "start": "2019-08-15 13:29:30.781959", "stderr": "Error from server (NotFound): nodes \"rhel-1.qe-gpei-420815.qe.devcluster.openshift.com\" not found", "stderr_lines": ["Error from server (NotFound): nodes \"rhel-1.qe-gpei-420815.qe.devcluster.openshift.com\" not found"], "stdout": "", "stdout_lines": []}


After approve the pending nodes csr manually, the RHEL nodes are available.


Version-Release number of the following components:
openshift-ansible-4.2.0-201908142219.git.188.7254b39.el7

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Jeremiah Stuever 2019-08-20 21:50:58 UTC
Nodes are approved automatically in my initial testing. I expect this is because they were rolled out using machine sets. Need to test how this looks when nodes are created outside of the cluster.

Comment 2 Jeremiah Stuever 2019-08-20 23:43:15 UTC
Confirmed, node certificates are not automatically approved if the node is not created by the cluster. This causes the playbook to fail at "Wait for nodes to report ready"... introduced in #11808.

https://github.com/openshift/openshift-ansible/pull/11808

Comment 4 Gaoyun Pei 2019-08-26 05:12:23 UTC
Verify this bug with openshift-ansible-4.2.0-201908251340.git.198.9c1655b.el7


TASK [openshift_node : Approve node-bootstrapper CSR] **************************
Monday 26 August 2019  11:42:36 +0800 (0:01:09.436)       0:04:20.220 ********* 
changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-51-169.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 1, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers  | grep \" system:serviceaccount:openshift-machine-config-operator:node-bootstrapper \"  | cut -d \" \" -f1`;\ndo\n  oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig describe csr/$csr    | grep \" system:node:ip-10-0-51-169.us-east-2.compute.internal$\";\n  if [ $? -eq 0 ];\n  then\n    oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n    if [ $? -eq 0 ];\n    then\n      count=$((count+1));\n    fi;\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:02.862661", "end": "2019-08-26 11:42:39.206067", "item": "ip-10-0-51-169.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:36.343406", "stderr": "", "stderr_lines": [], "stdout": "         Common Name:    system:node:ip-10-0-51-169.us-east-2.compute.internal\ncertificatesigningrequest.certificates.k8s.io/csr-6hmxb approved", "stdout_lines": ["         Common Name:    system:node:ip-10-0-51-169.us-east-2.compute.internal", "certificatesigningrequest.certificates.k8s.io/csr-6hmxb approved"]}
changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-50-20.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 1, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers  | grep \" system:serviceaccount:openshift-machine-config-operator:node-bootstrapper \"  | cut -d \" \" -f1`;\ndo\n  oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig describe csr/$csr    | grep \" system:node:ip-10-0-50-20.us-east-2.compute.internal$\";\n  if [ $? -eq 0 ];\n  then\n    oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n    if [ $? -eq 0 ];\n    then\n      count=$((count+1));\n    fi;\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:02.756084", "end": "2019-08-26 11:42:42.115446", "item": "ip-10-0-50-20.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:39.359362", "stderr": "", "stderr_lines": [], "stdout": "         Common Name:    system:node:ip-10-0-50-20.us-east-2.compute.internal\ncertificatesigningrequest.certificates.k8s.io/csr-rx4fv approved", "stdout_lines": ["         Common Name:    system:node:ip-10-0-50-20.us-east-2.compute.internal", "certificatesigningrequest.certificates.k8s.io/csr-rx4fv approved"]}

TASK [openshift_node : Approve node CSR] ***************************************
Monday 26 August 2019  11:42:42 +0800 (0:00:05.966)       0:04:26.187 ********* 
FAILED - RETRYING: Approve node CSR (6 retries left).
changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-51-169.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 2, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers  | grep \" system:node:ip-10-0-51-169.us-east-2.compute.internal \"  | cut -d \" \" -f1`;\ndo\n  oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n  if [ $? -eq 0 ];\n  then\n    count=$((count+1));\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:00.493370", "end": "2019-08-26 11:42:48.156555", "item": "ip-10-0-51-169.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:47.663185", "stderr": "", "stderr_lines": [], "stdout": "certificatesigningrequest.certificates.k8s.io/csr-gd6rz approved", "stdout_lines": ["certificatesigningrequest.certificates.k8s.io/csr-gd6rz approved"]}
FAILED - RETRYING: Approve node CSR (6 retries left).
changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-50-20.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 2, "changed": true, "cmd": "count=0; for csr in `oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig get csr --no-headers  | grep \" system:node:ip-10-0-50-20.us-east-2.compute.internal \"  | cut -d \" \" -f1`;\ndo\n  oc --config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig adm certificate approve ${csr};\n  if [ $? -eq 0 ];\n  then\n    count=$((count+1));\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:00.504103", "end": "2019-08-26 11:42:54.152217", "item": "ip-10-0-50-20.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:42:53.648114", "stderr": "", "stderr_lines": [], "stdout": "certificatesigningrequest.certificates.k8s.io/csr-b9kwz approved", "stdout_lines": ["certificatesigningrequest.certificates.k8s.io/csr-b9kwz approved"]}

TASK [openshift_node : Wait for nodes to report ready] *************************
Monday 26 August 2019  11:42:54 +0800 (0:00:12.039)       0:04:38.226 ********* 
FAILED - RETRYING: Wait for nodes to report ready (36 retries left).
FAILED - RETRYING: Wait for nodes to report ready (35 retries left).
changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-51-169.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 3, "changed": true, "cmd": ["oc", "get", "node", "ip-10-0-51-169.us-east-2.compute.internal", "--config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:01.233045", "end": "2019-08-26 11:43:06.273282", "item": "ip-10-0-51-169.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:43:05.040237", "stderr": "", "stderr_lines": [], "stdout": "True", "stdout_lines": ["True"]}
FAILED - RETRYING: Wait for nodes to report ready (36 retries left).
changed: [ip-10-0-51-169.us-east-2.compute.internal -> localhost] => (item=ip-10-0-50-20.us-east-2.compute.internal) => {"ansible_loop_var": "item", "attempts": 2, "changed": true, "cmd": ["oc", "get", "node", "ip-10-0-50-20.us-east-2.compute.internal", "--config=/home/slave3/workspace/ocp4-rhel-scaleup-runner/jenkins_kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.240301", "end": "2019-08-26 11:43:12.017293", "item": "ip-10-0-50-20.us-east-2.compute.internal", "rc": 0, "start": "2019-08-26 11:43:11.776992", "stderr": "", "stderr_lines": [], "stdout": "True", "stdout_lines": ["True"]}

Comment 5 errata-xmlrpc 2019-10-16 06:35:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922