Bug 2090151 - [RHEL scale up] increase the wait time so that the node has enough time to get ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: Brent Barbachem
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks: 2103537 2103538
Reported: 2022-05-25 09:06 UTC by Yunfei Jiang
Modified: 2022-08-10 11:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a cluster uses an OVN network rather than the default OpenShift SDN network, the scale-up task takes longer than usual.
Consequence: The scale-up task can fail because the time a node needs to become ready can exceed the maximum wait time.
Fix: Double the number of retries performed during the scale-up phase.
Result: The scale-up usually needed only a couple of extra retries (the equivalent of 2-4 minutes). The added retries allow up to an extra 20 minutes, which is enough for the scale-up to complete successfully.
Clone Of:
Clones: 2103537 2103538
Environment:
Last Closed: 2022-08-10 11:14:09 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible pull 12394 | 0 | None | open | BUG 2090151: Scaleup: Increase scaleup attempts in the workbook | 2022-06-16 16:38:52 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 11:14:28 UTC

Description Yunfei Jiang 2022-05-25 09:06:16 UTC
Version :
4.9.0-0.nightly-2022-05-24-200205

The scale-up job sometimes hits the following error, although all nodes eventually become Ready and the cluster is healthy.

TASK [openshift_node : Wait for node to report ready] **************************
Wednesday 25 May 2022  14:25:10 +0800 (0:00:19.202)       0:13:32.778 ********* 
FAILED - RETRYING: Wait for node to report ready (30 retries left).
<--SNIP-->
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [ip-10-0-60-71.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-60-71.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.249540", "end": "2022-05-25 14:35:24.212666", "rc": 0, "start": "2022-05-25 14:35:23.963126", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
fatal: [ip-10-0-61-254.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-61-254.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.266898", "end": "2022-05-25 14:35:24.213355", "rc": 0, "start": "2022-05-25 14:35:23.946457", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
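The failing task is a bounded polling loop: it runs the `oc get node` query shown in the log and gives up after 30 attempts. A minimal Python sketch of that pattern follows. The function names are hypothetical, and the 20-second delay is an assumption (the log does not state it; the 30 attempts span roughly 10 minutes). It is not the openshift-ansible implementation.

```python
import subprocess
import time
from typing import Callable


def node_ready_status(node: str, kubeconfig: str) -> str:
    """Run the same query as the log and return the Ready condition status."""
    cmd = [
        "oc", "get", "node", node,
        f"--kubeconfig={kubeconfig}",
        '--output=jsonpath={.status.conditions[?(@.type=="Ready")].status}',
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.strip()


def wait_for_ready(
    check: Callable[[], str],
    retries: int = 30,       # matches "attempts": 30 in the log
    delay: float = 20.0,     # assumed, not taken from the bug report
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until check() returns "True"; return the attempt that succeeded.

    Raises TimeoutError once every attempt has been used.
    """
    for attempt in range(1, retries + 1):
        if check() == "True":
            return attempt
        if attempt < retries:
            sleep(delay)  # wait before the next attempt
    raise TimeoutError(f"node not Ready after {retries} attempts")
```

A caller would pass something like `lambda: node_ready_status(name, kubeconfig_path)` as `check`. A node that only turns Ready a few attempts after the limit fails with 30 retries and passes with a larger limit, which is the behaviour this bug describes.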


The timeline is:

1. [06:24-06:34 UTC] Approve CSRs and wait for 10 min (the Ansible log below shows the same moment as 14:24 +0800)
TASK [openshift_node : Approve node CSRs] **************************************
Wednesday 25 May 2022  14:24:51 +0800 (0:04:04.743)       0:13:13.576 ********* 

2. [06:34 UTC] The scale-up job reported an error (timed out)

3. [06:37:09 UTC] The node reported Ready
May 25 06:37:09 ip-10-0-60-71.us-east-2.compute.internal hyperkube[2526]: I0525 06:37:09.201219    2526 kubelet_node_status.go:581] "Recording event message for node" node="ip-10-0-60-71.us-east-2.compute.internal" event="NodeReady"
  - lastHeartbeatTime: "2022-05-25T07:16:01Z"
    lastTransitionTime: "2022-05-25T06:37:09Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
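The timestamps above mix two time zones (the Ansible log prints +0800, the kubelet log and node status are UTC), so the gap is easy to misread. The sketch below does the arithmetic using only values quoted in this report; the variable names are illustrative and the failure time is truncated to whole seconds.

```python
from datetime import datetime, timedelta, timezone

LOCAL = timezone(timedelta(hours=8))  # the Ansible log prints +0800

# Start of "Wait for node to report ready" (Ansible task header).
wait_started = datetime(2022, 5, 25, 14, 25, 10, tzinfo=LOCAL)
# "end" field of the fatal result, truncated to whole seconds.
wait_failed = datetime(2022, 5, 25, 14, 35, 24, tzinfo=LOCAL)
# lastTransitionTime of the Ready condition (UTC).
node_ready = datetime(2022, 5, 25, 6, 37, 9, tzinfo=timezone.utc)

waited = (wait_failed - wait_started).total_seconds()    # time spent polling
shortfall = (node_ready - wait_failed).total_seconds()   # Ready arrived this late
needed = (node_ready - wait_started).total_seconds()     # wait that would have passed

print(f"waited {waited:.0f}s, short by {shortfall:.0f}s, needed {needed:.0f}s")
```

Under these readings the task polled for 614 seconds, and the node became Ready 105 seconds after the task gave up, so a wait of about 12 minutes would have been enough in this run.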

How to reproduce it (as minimally and precisely as possible)?
Reproduces in more than 30% of scale-up runs.

Steps to Reproduce:
1. Create a cluster that uses the OVN network.
2. Run a scale-up against that cluster.

Expected results:
The scale-up job finishes successfully.

Suggestion:
Increase the wait time to 16-18 minutes.
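To relate a wait time in minutes to a retry count, the following sketch converts between the two. The 20-second delay between attempts is an assumption (the log shows 30 attempts over roughly 10 minutes, but the configured delay is not quoted in this report), and both helper names are hypothetical.

```python
import math


def retries_for(wait_minutes: float, delay_seconds: float) -> int:
    """Smallest retry count whose total delay covers wait_minutes."""
    return math.ceil(wait_minutes * 60 / delay_seconds)


def wait_budget_minutes(retries: int, delay_seconds: float) -> float:
    """Total wait time, in minutes, for a given retry count and delay."""
    return retries * delay_seconds / 60


ASSUMED_DELAY = 20  # seconds between attempts; assumed, not from the report

print(retries_for(16, ASSUMED_DELAY), retries_for(18, ASSUMED_DELAY))
print(wait_budget_minutes(30, ASSUMED_DELAY), wait_budget_minutes(60, ASSUMED_DELAY))
```

With that assumed delay, a 16-18 minute wait corresponds to 48-54 retries, and doubling 30 retries to 60 covers the suggested range.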

Additional info:
This issue applies to 4.9, 4.10, and 4.11.

Comment 3 Yunfei Jiang 2022-06-24 10:54:35 UTC
Verified. PASS.

openshift-ansible-4.11.0-202206240216.p0.g9de1722.assembly.stream.el8.noarch.rpm

Comment 4 errata-xmlrpc 2022-08-10 11:14:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

