Cause:
When using an OVN network rather than the default OpenShift SDN network, the scale-up task takes longer than usual.
Consequence:
The extra time required can cause the scale-up task to fail, because the time needed for the new nodes to become ready can exceed the maximum wait time configured in the playbook.
Fix:
Double the number of retries that occur during the scale-up phase.
Result:
In testing, only a couple of extra retries (the equivalent of 2-4 minutes) were usually needed, but the added retries allow for up to an extra 20 minutes. The additional retries and wait time allow the scale-up to complete successfully.
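As a rough illustration of the fix, here is a minimal sketch of the "Wait for node to report ready" task with the retry count doubled from 30 to 60. The oc command, jsonpath expression, and delegation to localhost are taken from the failure output quoted below; the delay value, kubeconfig variable, and node-name lookup are assumptions, and the actual change in openshift-ansible may differ.

- name: Wait for node to report ready
  # Poll the node's Ready condition via oc, as in the failing task output below.
  command: >
    oc get node {{ hostvars[item].ansible_nodename }}
    --kubeconfig={{ kubeconfig_path }}
    --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: localhost
  register: node_ready
  until: node_ready.stdout == "True"
  # Previously 30 attempts; doubled to 60 so slower OVN scale-ups can finish.
  retries: 60
  delay: 20

With roughly 20 seconds between attempts (what the timestamps in the failure output below imply for 30 attempts over about 10 minutes), 60 retries corresponds to about 20 minutes of total wait.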
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069
Version: 4.9.0-0.nightly-2022-05-24-200205

Sometimes the scale-up job hits the following error, but eventually all nodes are Ready and the cluster is healthy.

TASK [openshift_node : Wait for node to report ready] **************************
Wednesday 25 May 2022 14:25:10 +0800 (0:00:19.202) 0:13:32.778 *********
FAILED - RETRYING: Wait for node to report ready (30 retries left).
<--SNIP-->
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [ip-10-0-60-71.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-60-71.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.249540", "end": "2022-05-25 14:35:24.212666", "rc": 0, "start": "2022-05-25 14:35:23.963126", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
fatal: [ip-10-0-61-254.us-east-2.compute.internal -> localhost]: FAILED! => {"attempts": 30, "changed": false, "cmd": ["oc", "get", "node", "ip-10-0-61-254.us-east-2.compute.internal", "--kubeconfig=/tmp/installer-aVed14/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.266898", "end": "2022-05-25 14:35:24.213355", "rc": 0, "start": "2022-05-25 14:35:23.946457", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}

The timeline is:
1. [6:24-6:34] Approve CSR and wait for 10 min
TASK [openshift_node : Approve node CSRs] **************************************
Wednesday 25 May 2022 14:24:51 +0800 (0:04:04.743) 0:13:13.576 *********
2. [6:34] Scale-up job reported an error (time out)
3. [6:37:09] Node reported Ready
May 25 06:37:09 ip-10-0-60-71.us-east-2.compute.internal hyperkube[2526]: I0525 06:37:09.201219 2526 kubelet_node_status.go:581] "Recording event message for node" node="ip-10-0-60-71.us-east-2.compute.internal" event="NodeReady"
- lastHeartbeatTime: "2022-05-25T07:16:01Z"
  lastTransitionTime: "2022-05-25T06:37:09Z"
  message: kubelet is posting ready status
  reason: KubeletReady
  status: "True"
  type: Ready

How to reproduce it (as minimally and precisely as possible)?
> 30%

Steps to Reproduce:
1. Create a cluster with the OVN network (see the install-config sketch at the end of this report)
2. Do a scale-up against the above cluster

Expected results:
Scale-up job finishes successfully

Suggestion: Increase the wait time to 16-18 mins.

Additional info: this issue is applicable to 4.9, 4.10, and 4.11
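For step 1 of the reproduction, the OVN network is selected at install time. A minimal install-config.yaml fragment, assuming an AWS cluster like the one in the logs above (the cluster name, base domain, and network CIDRs shown here are placeholders or platform defaults, not values taken from this bug):

apiVersion: v1
metadata:
  name: example-cluster        # placeholder cluster name
baseDomain: example.com        # placeholder base domain
platform:
  aws:
    region: us-east-2          # region matching the node names in the logs
networking:
  # Select OVN-Kubernetes instead of the default OpenShift SDN plug-in;
  # scaling up a cluster installed this way reproduces the timeout above.
  networkType: OVNKubernetes
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16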