Bug 2090151

Summary: [RHEL scale up] increase the wait time so that the node has enough time to get ready
Product: OpenShift Container Platform
Reporter: Yunfei Jiang <yunjiang>
Component: Installer
Assignee: Brent Barbachem <bbarbach>
Installer sub component: openshift-ansible
QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: bbarbach
Version: 4.9
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When using an OVN network rather than the default OpenShift SDN network, the scale-up task takes longer than usual.
Consequence: The extra time required can cause the scale-up task to fail, because the time needed to scale up can exceed the configured maximum.
Fix: Double the number of retries performed during the scale-up phase.
Result: Scale-up usually needs only a couple of extra retries (the equivalent of 2-4 minutes), while the added retries allow up to an extra 20 minutes; with the extra retries and time, scale-up completes successfully.
Story Points: ---
Clone Of:
Clones: 2103537, 2103538 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:14:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2103537, 2103538    

Description Yunfei Jiang 2022-05-25 09:06:16 UTC
Version:
4.9.0-0.nightly-2022-05-24-200205

The scale-up job sometimes hits the following error, but eventually all nodes are Ready and the cluster is healthy.

TASK [openshift_node : Wait for node to report ready] **************************
Wednesday 25 May 2022  14:25:10 +0800 (0:00:19.202)       0:13:32.778 ********* 
FAILED - RETRYING: Wait for node to report ready (30 retries left).
<--SNIP-->
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [ip-10-0-60-71.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-60-71.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.249540", "end": "2022-05-25 14:35:24.212666", "rc": 0, "start": "2022-05-25 14:35:23.963126", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
fatal: [ip-10-0-61-254.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-61-254.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.266898", "end": "2022-05-25 14:35:24.213355", "rc": 0, "start": "2022-05-25 14:35:23.946457", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
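
For reference, the failing task is a standard Ansible retry loop in the openshift_node role. A minimal sketch of what it likely looks like follows; the variable names (kubeconfig_path, ansible_nodename) are illustrative assumptions, and the ~20-second delay is inferred from the log above (30 attempts spread over roughly 10 minutes):

- name: Wait for node to report ready
  # Poll the node's Ready condition with the same jsonpath query seen in the failure output.
  ansible.builtin.command: >
    oc get node {{ ansible_nodename }}
    --kubeconfig={{ kubeconfig_path }}
    --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: localhost
  register: node_ready
  until: node_ready.stdout == "True"
  retries: 30   # matches "attempts": 30 in the failure above
  delay: 20     # seconds between attempts (assumed); 30 x ~20s ≈ the 10-minute window
  changed_when: false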


The timeline is:

1. [6:24-6:34 UTC] Node CSRs approved, then the job waited for 10 minutes:
TASK [openshift_node : Approve node CSRs] **************************************
Wednesday 25 May 2022  14:24:51 +0800 (0:04:04.743)       0:13:13.576 ********* 

2. [6:34 UTC] Scale-up job reported an error (timed out).

3. [6:37:09 UTC] Node reported Ready, only a couple of minutes after the wait timed out:
May 25 06:37:09 ip-10-0-60-71.us-east-2.compute.internal hyperkube[2526]: I0525 06:37:09.201219    2526 kubelet_node_status.go:581] "Recording event message for node" node="ip-10-0-60-71.us-east-2.compute.internal" event="NodeReady"
  - lastHeartbeatTime: "2022-05-25T07:16:01Z"
    lastTransitionTime: "2022-05-25T06:37:09Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready

How reproducible:
> 30%

Steps to Reproduce:
1. Create a cluster with the OVN network type.
2. Run the RHEL scale-up playbook against the above cluster.

Expected results:
Scale-up job finished successfully

Suggestion:
Increase the wait time to 16-18 minutes.
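
The fix that shipped (per the Doc Text above) doubles the retries rather than setting a fixed 16-18 minute value. Below is a hedged sketch of the same task with that change, using the same illustrative names as the sketch above; 60 retries at ~20 seconds each allows roughly 20 minutes, which covers the suggested window:

- name: Wait for node to report ready
  ansible.builtin.command: >
    oc get node {{ ansible_nodename }}
    --kubeconfig={{ kubeconfig_path }}
    --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: localhost
  register: node_ready
  until: node_ready.stdout == "True"
  retries: 60   # doubled from 30, per the Doc Text
  delay: 20     # assumed; 60 x ~20s ≈ 20 minutes maximum wait
  changed_when: false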

Additional info:
This issue applies to 4.9, 4.10, and 4.11.

Comment 3 Yunfei Jiang 2022-06-24 10:54:35 UTC
Verified. PASS.

openshift-ansible-4.11.0-202206240216.p0.g9de1722.assembly.stream.el8.noarch.rpm

Comment 4 errata-xmlrpc 2022-08-10 11:14:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069