Bug 1812787

Summary: RHEL scaleup fails with error "Failed to approve node CSR"
Product: OpenShift Container Platform
Reporter: Yang Yang <yanyang>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED DUPLICATE
QA Contact: Sunil Choudhary <schoudha>
Severity: high
Docs Contact:
Priority: high
Version: 4.4
CC: aos-bugs, gpei, jokerman, rphillips, scuppett, wjiang, wsun
Target Milestone: ---
Keywords: Regression, TestBlocker
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1815010 (view as bug list) Environment:
Last Closed: 2020-03-31 18:14:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1815010, 1817382    
Bug Blocks:    
Attachments:
Scaleup debug logs (flags: none)

Description Yang Yang 2020-03-12 07:47:34 UTC
Created attachment 1669573 [details]
Scaleup debug logs

Description of problem:
RHEL scaleup fails with the error "Failed to approve node CSR". Debug logs are in the attachment.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

openshift-ansible-4.4.0-202003060720.git.0.085aeb0.el7

How reproducible:
Always

Steps to Reproduce:
1. Install a UPI cluster on baremetal with rhcos workers removed
2. Scaleup with RHEL nodes

Actual results:
Scaleup fails

Expected results:
Scaleup succeeds

Additional info:

TASK [openshift_node : Approve node CSR] ***************************************
Thursday 12 March 2020  15:11:25 +0800 (0:06:53.930)       0:12:25.891 ******** 
FAILED - RETRYING: Approve node CSR (6 retries left).
FAILED - RETRYING: Approve node CSR (5 retries left).
FAILED - RETRYING: Approve node CSR (4 retries left).
FAILED - RETRYING: Approve node CSR (3 retries left).
FAILED - RETRYING: Approve node CSR (2 retries left).
FAILED - RETRYING: Approve node CSR (1 retries left).
failed: [wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com -> localhost] (item=wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com) => {"ansible_loop_var": "item", "attempts": 6, "changed": true, "cmd": "count=0; for csr in `oc --kubeconfig=/tmp/installer-MmSdAO/auth/kubeconfig get csr --no-headers  | grep \" system:node:wsun443121-fcgdb-rhel-0 \"  | cut -d \" \" -f1`;\ndo\n  oc --kubeconfig=/tmp/installer-MmSdAO/auth/kubeconfig adm certificate approve ${csr};\n  if [ $? -eq 0 ];\n  then\n    count=$((count+1));\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:00.196222", "end": "2020-03-12 15:11:58.259209", "item": "wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com", "msg": "non-zero return code", "rc": 1, "start": "2020-03-12 15:11:58.062987", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
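
Reading the command in the failure output above: it lists CSRs, keeps the lines whose requestor matches " system:node:<short hostname> ", approves each one, and exits non-zero if nothing was approved. With rc 1 and both stdout and stderr empty, it looks as though no CSR matched the filter on the final attempt, though the attached logs would need to confirm that. A minimal sketch of the same filter and exit logic follows; the function names and any sample input are hypothetical and are not taken from openshift-ansible or the attached logs.

```shell
#!/bin/sh
# Sketch only: the task's one-line command split into two steps.
# Function names are hypothetical, not part of openshift-ansible.

# Print the names of CSRs requested by system:node:<node>.
# Reads `oc get csr --no-headers` style output on stdin.
# The pattern requires a space right after the node name, so a
# requestor carrying a longer name (for example an FQDN) is not matched.
matching_csrs() {
  grep " system:node:$1 " | cut -d " " -f1
}

# Print the exit code the task would use for a given number of
# successful approvals: 1 when the count is 0, otherwise 0.
exit_code_for() {
  count=$1
  echo $((!count))
}
```

To check a live cluster by hand, the listing could be piped in, for example `oc get csr --no-headers | matching_csrs wsun443121-fcgdb-rhel-0`. An empty result would correspond to the rc 1 seen in the task output.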

Comment 1 Russell Teague 2020-03-12 18:03:45 UTC
Logs show that wsun443121-fcgdb-rhel-1.wsun443121.qe.devcluster.openshift.com did have its CSR approved; however, rhel-0 and rhel-2 did not. The resulting failure message obscured the fact that rhel-1 was approved.

Is this reliably reproducible?

Comment 3 Wei Sun 2020-03-13 02:17:32 UTC
Adding TestBlocker since this always happens for UPI on bare metal with a proxy.

Comment 4 Wei Sun 2020-03-17 08:24:17 UTC
Still reproducible on bare metal today.

Comment 7 Yang Yang 2020-03-19 07:25:33 UTC
It also reproduces on GCP.

Comment 13 Red Hat Bugzilla 2023-09-14 05:54:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days