Bug 2090151

Summary: [RHEL scale up] increase the wait time so that the node has enough time to get ready
Product: OpenShift Container Platform
Reporter: Yunfei Jiang <yunjiang>
Component: Installer
Assignee: Brent Barbachem <bbarbach>
Installer sub component: openshift-ansible
QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: bbarbach
Version: 4.9
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When using an OVN network rather than the default OpenShift SDN network, the scale-up task takes longer than usual.
Consequence: The extra time required can cause the scale-up task to fail, because the time needed to scale up can exceed the configured maximum.
Fix: Double the number of retries performed during the scale-up phase.
Result: Scale-up usually needs only a couple of extra retries (the equivalent of 2-4 minutes), while the added retries allow up to an extra 20 minutes; with the extra retries and time, scale-up completes successfully.
Story Points: ---
Clone Of:
Clones: 2103537, 2103538 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:14:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2103537, 2103538    

Description Yunfei Jiang 2022-05-25 09:06:16 UTC
Version:
4.9.0-0.nightly-2022-05-24-200205

The scale-up job sometimes hits the following error, but eventually all nodes are Ready and the cluster is healthy.

TASK [openshift_node : Wait for node to report ready] **************************
Wednesday 25 May 2022  14:25:10 +0800 (0:00:19.202)       0:13:32.778 ********* 
FAILED - RETRYING: Wait for node to report ready (30 retries left).
<--SNIP-->
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [ip-10-0-60-71.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-60-71.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.249540", "end": "2022-05-25 14:35:24.212666", "rc": 0, "start": "2022-05-25 14:35:23.963126", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
fatal: [ip-10-0-61-254.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-61-254.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.266898", "end": "2022-05-25 14:35:24.213355", "rc": 0, "start": "2022-05-25 14:35:23.946457", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
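
For reference, the failing task is a standard Ansible retry loop in the openshift_node role. A minimal sketch of what it likely looks like follows; the variable names (kubeconfig_path, ansible_nodename) are illustrative assumptions, and the ~20-second delay is inferred from the log above (30 attempts spread over roughly 10 minutes):

- name: Wait for node to report ready
  # Poll the node's Ready condition with the same jsonpath query seen in the failure output.
  ansible.builtin.command: >
    oc get node {{ ansible_nodename }}
    --kubeconfig={{ kubeconfig_path }}
    --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: localhost
  register: node_ready
  until: node_ready.stdout == "True"
  retries: 30   # matches "attempts": 30 in the failure above
  delay: 20     # seconds between attempts (assumed); 30 x ~20s ≈ the 10-minute window
  changed_when: false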


The timeline is:

1. [6:24-6:34 UTC] Node CSRs approved, then the job waited for 10 minutes:
TASK [openshift_node : Approve node CSRs] **************************************
Wednesday 25 May 2022  14:24:51 +0800 (0:04:04.743)       0:13:13.576 ********* 

2. [6:34 UTC] Scale-up job reported an error (timed out).

3. [6:37:09 UTC] Node reported Ready, only a couple of minutes after the wait timed out:
May 25 06:37:09 ip-10-0-60-71.us-east-2.compute.internal hyperkube[2526]: I0525 06:37:09.201219    2526 kubelet_node_status.go:581] "Recording event message for node" node="ip-10-0-60-71.us-east-2.compute.internal" event="NodeReady"
  - lastHeartbeatTime: "2022-05-25T07:16:01Z"
    lastTransitionTime: "2022-05-25T06:37:09Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready

How reproducible:
> 30%

Steps to Reproduce:
1. Create a cluster with the OVN network type.
2. Run the RHEL scale-up playbook against the above cluster.

Expected results:
Scale-up job finished successfully

Suggestion:
Increase the wait time to 16-18 minutes.
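
The fix that shipped (per the Doc Text above) doubles the retries rather than setting a fixed 16-18 minute value. Below is a hedged sketch of the same task with that change, using the same illustrative names as the sketch above; 60 retries at ~20 seconds each allows roughly 20 minutes, which covers the suggested window:

- name: Wait for node to report ready
  ansible.builtin.command: >
    oc get node {{ ansible_nodename }}
    --kubeconfig={{ kubeconfig_path }}
    --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: localhost
  register: node_ready
  until: node_ready.stdout == "True"
  retries: 60   # doubled from 30, per the Doc Text
  delay: 20     # assumed; 60 x ~20s ≈ 20 minutes maximum wait
  changed_when: false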

Additional info:
This issue applies to 4.9, 4.10, and 4.11.

Comment 3 Yunfei Jiang 2022-06-24 10:54:35 UTC
Verified. PASS.

openshift-ansible-4.11.0-202206240216.p0.g9de1722.assembly.stream.el8.noarch.rpm

Comment 4 errata-xmlrpc 2022-08-10 11:14:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069