Created attachment 1669573 [details] Scaleup debug logs Description of problem: RHEL scaleup fails with error "Failed to approve node CSR", please get debug logs from attahment Version-Release number of the following components: rpm -q openshift-ansible rpm -q ansible ansible --version openshift-ansible-4.4.0-202003060720.git.0.085aeb0.el7 How reproducible: Always Steps to Reproduce: 1. Install a UPI cluster on baremetal with rhcos workers removed 2. Scaleup with RHEL nodes 3. Actual results: Scaleup fails Expected results: Scaleup succeed Additional info: TASK [openshift_node : Approve node CSR] *************************************** Thursday 12 March 2020 15:11:25 +0800 (0:06:53.930) 0:12:25.891 ******** FAILED - RETRYING: Approve node CSR (6 retries left). FAILED - RETRYING: Approve node CSR (5 retries left). FAILED - RETRYING: Approve node CSR (4 retries left). FAILED - RETRYING: Approve node CSR (3 retries left). FAILED - RETRYING: Approve node CSR (2 retries left). FAILED - RETRYING: Approve node CSR (1 retries left). failed: [wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com -> localhost] (item=wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com) => {"ansible_loop_var": "item", "attempts": 6, "changed": true, "cmd": "count=0; for csr in `oc --kubeconfig=/tmp/installer-MmSdAO/auth/kubeconfig get csr --no-headers | grep \" system:node:wsun443121-fcgdb-rhel-0 \" | cut -d \" \" -f1`;\ndo\n oc --kubeconfig=/tmp/installer-MmSdAO/auth/kubeconfig adm certificate approve ${csr};\n if [ $? -eq 0 ];\n then\n count=$((count+1));\n fi;\ndone; exit $((!count));\n", "delta": "0:00:00.196222", "end": "2020-03-12 15:11:58.259209", "item": "wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com", "msg": "non-zero return code", "rc": 1, "start": "2020-03-12 15:11:58.062987", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Logs show that wsun443121-fcgdb-rhel-1.wsun443121.qe.devcluster.openshift.com did have it's csr approved however rhel-0 and rhel-2 did not. The resulting failure message obscured the fact rhel-1 was approved. Is this reliably reproducible?
Adding testblocker since it always happens for UPI on Bare Metal with proxy.
Still could reproduce it on Bare Metal today.
It's reproduced on GCP
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days