2090151 – [RHEL scale up] increase the wait time so that the node has enough time to get ready

Bug 2090151 - [RHEL scale up] increase the wait time so that the node has enough time to get ready

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Brent Barbachem
QA Contact:	Yunfei Jiang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2103537 2103538

Reported:	2022-05-25 09:06 UTC by Yunfei Jiang
Modified:	2022-08-10 11:14 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: When using an OVN network rather than the default OpenShift SDN network, the scale-up task takes longer than usual. Consequence: The scale-up task can fail because the time a node needs to become ready can exceed the maximum wait allowed by the configured retries. Fix: Double the number of retries performed during the scale-up phase. Result: Usually only a couple of extra retries (the equivalent of 2-4 minutes) are required. The added retries allow up to an extra 20 minutes, which gives the scale-up enough time to complete successfully.
Clone Of:
Clones:	2103537 2103538
Environment:
Last Closed:	2022-08-10 11:14:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-ansible pull 12394	0	None	open	BUG 2090151: Scaleup: Increase scaleup attempts in the workbook	2022-06-16 16:38:52 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:14:28 UTC

Description Yunfei Jiang 2022-05-25 09:06:16 UTC

Version :
4.9.0-0.nightly-2022-05-24-200205

Sometimes the scale-up job hits the following error, but eventually all nodes become Ready and the cluster is healthy.

TASK [openshift_node : Wait for node to report ready] **************************
Wednesday 25 May 2022  14:25:10 +0800 (0:00:19.202)       0:13:32.778 ********* 
FAILED - RETRYING: Wait for node to report ready (30 retries left).
<--SNIP-->
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [ip-10-0-60-71.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-60-71.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.249540", "end": "2022-05-25 14:35:24.212666", "rc": 0, "start": "2022-05-25 14:35:23.963126", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
fatal: [ip-10-0-61-254.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-61-254.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.266898", "end": "2022-05-25 14:35:24.213355", "rc": 0, "start": "2022-05-25 14:35:23.946457", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
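
For reference, the retry behaviour seen in the output above can be sketched in Python. This is only an illustration, not the openshift-ansible task itself: the `oc get node` arguments are copied from the failure output, while the function names, the injectable `run`/`sleep` hooks, and the 20-second delay are assumptions (the delay is inferred from 30 attempts spanning roughly 10 minutes).

```python
import subprocess
import time
from typing import Callable, List, Sequence


def node_ready_cmd(node: str, kubeconfig: str) -> List[str]:
    """Build the same `oc get node` query that appears in the failure output."""
    return [
        "oc", "get", "node", node,
        f"--kubeconfig={kubeconfig}",
        '--output=jsonpath={.status.conditions[?(@.type=="Ready")].status}',
    ]


def run_oc(cmd: Sequence[str]) -> str:
    """Run the command and return its trimmed stdout ("True", "False", ...)."""
    result = subprocess.run(list(cmd), capture_output=True, text=True, check=False)
    return result.stdout.strip()


def wait_for_ready(
    cmd: Sequence[str],
    retries: int = 30,          # value reported as "attempts" in the log
    delay: float = 20.0,        # ASSUMPTION: inferred from ~10 min / 30 attempts
    run: Callable[[Sequence[str]], str] = run_oc,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the Ready condition is "True" or the retries are exhausted."""
    for attempt in range(1, retries + 1):
        if run(cmd) == "True":
            return True
        if attempt < retries:
            sleep(delay)        # no sleep after the final failed attempt
    return False
```

With these assumed values the loop gives up after about 10 minutes, which matches the window between the task start (14:25:10) and the last failed attempt (14:35:24); a larger `retries` value extends that window proportionally.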


The timeline (times in UTC; the Ansible log timestamps are +0800) is:

1. [6:24-6:34] Approve CSRs and wait for 10 min
TASK [openshift_node : Approve node CSRs] **************************************
Wednesday 25 May 2022  14:24:51 +0800 (0:04:04.743)       0:13:13.576 ********* 

2. [6:34] The scale-up job reported an error (timed out)

3. [6:37:09] Node reported Ready
May 25 06:37:09 ip-10-0-60-71.us-east-2.compute.internal hyperkube[2526]: I0525 06:37:09.201219    2526 kubelet_node_status.go:581] "Recording event message for node" node="ip-10-0-60-71.us-east-2.compute.internal" event="NodeReady"
  - lastHeartbeatTime: "2022-05-25T07:16:01Z"
    lastTransitionTime: "2022-05-25T06:37:09Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
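
The gap between the timeout and the node becoming Ready can be worked out from the timestamps quoted in this report. The sketch below uses only those timestamps (task start 14:25:10 +0800, last attempt end 14:35:24 +0800, NodeReady 06:37:09 UTC); the variable names are illustrative, fractional seconds are dropped, and the per-attempt interval is an estimate derived from the 30 attempts, not a value read from the playbook.

```python
import math
from datetime import datetime, timedelta, timezone

LOG_TZ = timezone(timedelta(hours=8))  # the Ansible log prints +0800 timestamps

# Timestamps taken from the report (fractional seconds dropped).
wait_start = datetime(2022, 5, 25, 14, 25, 10, tzinfo=LOG_TZ)        # task start
last_attempt_end = datetime(2022, 5, 25, 14, 35, 24, tzinfo=LOG_TZ)  # final failure
node_ready = datetime(2022, 5, 25, 6, 37, 9, tzinfo=timezone.utc)    # NodeReady event

attempts = 30  # "attempts": 30 in the failure output

window = (last_attempt_end - wait_start).total_seconds()    # time the task waited
needed = (node_ready - wait_start).total_seconds()          # time the node needed
shortfall = (node_ready - last_attempt_end).total_seconds() # how late the node was

per_attempt = window / attempts                  # estimated seconds per attempt
attempts_needed = math.ceil(needed / per_attempt)

print(f"waited {window:.0f}s, node needed {needed:.0f}s, short by {shortfall:.0f}s")
print(f"about {per_attempt:.1f}s per attempt, so roughly {attempts_needed} attempts were needed")
```

Under these assumptions the node became Ready 105 seconds after the task gave up, so about 36 attempts instead of 30 would have been enough for this particular run.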

How to reproduce it (as minimally and precisely as possible)?
> 30%

Steps to Reproduce:
1. Create a cluster with an OVN network
2. Scale up the above cluster

Expected results:
Scale-up job finished successfully

Suggestion:
Increase wait time to 16-18 mins.
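
As a rough check of what the suggested wait means in terms of retries, the conversion can be sketched as below. The 20-second delay between attempts is an assumption inferred from the log in this report (30 attempts in about 10 minutes), and the helper names are hypothetical; the actual delay should be confirmed in the playbook.

```python
import math

ASSUMED_DELAY_SECONDS = 20  # ASSUMPTION: inferred from 30 attempts in ~10 minutes


def retries_for(wait_minutes: float, delay_seconds: float = ASSUMED_DELAY_SECONDS) -> int:
    """Smallest number of retries whose combined delay covers wait_minutes."""
    return math.ceil(wait_minutes * 60 / delay_seconds)


def max_wait_minutes(retries: int, delay_seconds: float = ASSUMED_DELAY_SECONDS) -> float:
    """Approximate total wait for a given retry count (ignores command run time)."""
    return retries * delay_seconds / 60


for minutes in (16, 18):
    print(f"{minutes} min -> {retries_for(minutes)} retries")
for retries in (30, 60):
    print(f"{retries} retries -> about {max_wait_minutes(retries):.0f} min")
```

Under this assumption, 16-18 minutes corresponds to 48-54 retries, and doubling the current 30 retries to 60 gives about 20 minutes, which covers the suggested range.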

Additional info:
This issue applies to 4.9, 4.10, and 4.11.

Comment 3 Yunfei Jiang 2022-06-24 10:54:35 UTC

verified. PASS.

openshift-ansible-4.11.0-202206240216.p0.g9de1722.assembly.stream.el8.noarch.rpm

Comment 4 errata-xmlrpc 2022-08-10 11:14:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
