Bug 1812787 - RHEL scaleup fails with error "Failed to approve node CSR"
Summary: RHEL scaleup fails with error "Failed to approve node CSR"
Keywords:
Status: CLOSED DUPLICATE of bug 1817382
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1815010 1817382
Blocks:
Reported: 2020-03-12 07:47 UTC by Yang Yang
Modified: 2023-09-14 05:54 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1815010
Environment:
Last Closed: 2020-03-31 18:14:37 UTC
Target Upstream Version:
Embargoed:


Attachments
Scaleup debug logs (33.30 KB, application/vnd.oasis.opendocument.text)
2020-03-12 07:47 UTC, Yang Yang

Description Yang Yang 2020-03-12 07:47:34 UTC
Created attachment 1669573
Scaleup debug logs

Description of problem:
RHEL scaleup fails with the error "Failed to approve node CSR". Debug logs are in the attachment.

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

openshift-ansible-4.4.0-202003060720.git.0.085aeb0.el7

How reproducible:
Always

Steps to Reproduce:
1. Install a UPI cluster on baremetal with rhcos workers removed
2. Scaleup with RHEL nodes

Actual results:
Scaleup fails

Expected results:
Scaleup succeeds

Additional info:

TASK [openshift_node : Approve node CSR] ***************************************
Thursday 12 March 2020  15:11:25 +0800 (0:06:53.930)       0:12:25.891 ******** 
FAILED - RETRYING: Approve node CSR (6 retries left).
FAILED - RETRYING: Approve node CSR (5 retries left).
FAILED - RETRYING: Approve node CSR (4 retries left).
FAILED - RETRYING: Approve node CSR (3 retries left).
FAILED - RETRYING: Approve node CSR (2 retries left).
FAILED - RETRYING: Approve node CSR (1 retries left).
failed: [wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com -> localhost] (item=wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com) => {"ansible_loop_var": "item", "attempts": 6, "changed": true, "cmd": "count=0; for csr in `oc --kubeconfig=/tmp/installer-MmSdAO/auth/kubeconfig get csr --no-headers  | grep \" system:node:wsun443121-fcgdb-rhel-0 \"  | cut -d \" \" -f1`;\ndo\n  oc --kubeconfig=/tmp/installer-MmSdAO/auth/kubeconfig adm certificate approve ${csr};\n  if [ $? -eq 0 ];\n  then\n    count=$((count+1));\n  fi;\ndone; exit $((!count));\n", "delta": "0:00:00.196222", "end": "2020-03-12 15:11:58.259209", "item": "wsun443121-fcgdb-rhel-0.wsun443121.qe.devcluster.openshift.com", "msg": "non-zero return code", "rc": 1, "start": "2020-03-12 15:11:58.062987", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
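For readability, the command embedded in the "cmd" field above can be unescaped into the sketch below. The function name approve_node_csrs and its two parameters are hypothetical wrappers added for illustration; the loop, the grep pattern, and the exit logic follow the logged command. With this logic, an empty stdout and stderr together with rc 1 is consistent with no CSR line matching the grep pattern, so the loop body never ran, though the log alone does not confirm why nothing matched.

```shell
# Sketch only: same logic as the logged "cmd", wrapped in a hypothetical function.
# $1 = path to the kubeconfig, $2 = short node name used in the CSR requestor.
approve_node_csrs() {
  kubeconfig=$1
  node=$2
  count=0
  # Pick CSR names whose line contains " system:node:<node> " (spaces included).
  for csr in $(oc --kubeconfig="$kubeconfig" get csr --no-headers \
                 | grep " system:node:${node} " | cut -d " " -f1); do
    # Count only the approvals that succeed.
    if oc --kubeconfig="$kubeconfig" adm certificate approve "$csr"; then
      count=$((count + 1))
    fi
  done
  # !count is 1 when count is 0, so "nothing approved" returns non-zero.
  return $(( ! count ))
}
```

Note that the pattern in the log uses the short name (wsun443121-fcgdb-rhel-0), not the fully qualified item name, so a CSR is only selected when its requestor is exactly system:node: followed by that short name.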

Comment 1 Russell Teague 2020-03-12 18:03:45 UTC
Logs show that wsun443121-fcgdb-rhel-1.wsun443121.qe.devcluster.openshift.com did have its CSR approved, but rhel-0 and rhel-2 did not. The resulting failure message obscured the fact that rhel-1 was approved.

Is this reliably reproducible?

Comment 3 Wei Sun 2020-03-13 02:17:32 UTC
Adding testblocker since it always happens for UPI on Bare Metal with proxy.

Comment 4 Wei Sun 2020-03-17 08:24:17 UTC
Still reproducible on Bare Metal today.

Comment 7 Yang Yang 2020-03-19 07:25:33 UTC
It also reproduces on GCP.

Comment 13 Red Hat Bugzilla 2023-09-14 05:54:15 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

