Bug 1915760 - Need to increase timeout when waiting for RHEL worker to become ready
Summary: Need to increase timeout when waiting for RHEL worker to become ready
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Russell Teague
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-13 11:58 UTC by Johnny Liu
Modified: 2021-02-24 15:53 UTC
CC: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:52:44 UTC
Target Upstream Version:
Embargoed:


Attachments
scalup log (134.88 KB, text/plain), 2021-01-13 11:59 UTC, Johnny Liu


Links
GitHub openshift/openshift-ansible pull 12291 (closed): Bug 1915760: Increase node ready wait to 10 minutes (last updated 2021-02-08 04:47:06 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:53:00 UTC)

Description Johnny Liu 2021-01-13 11:58:45 UTC
Recently QE has often hit RHEL worker scaleup failures.
TASK [openshift_node : Approve node CSRs] **************************************
Tuesday 12 January 2021  12:53:43 +0800 (0:00:17.805)       0:08:32.742 ******* 
changed: [10.0.99.172] => {"changed": true, "client_approve_results": ["Attempt: 1, Node qeci-13399-rwght-rhel-1 not present or CSR not yet available", "qeci-13399-rwght-rhel-1: certificatesigningrequest.certificates.k8s.io/csr-mgdcl approved\n", "Attempt: 2, Node qeci-13399-rwght-rhel-1 not present or CSR not yet available", "Attempt: 3, Node qeci-13399-rwght-rhel-1 not present or CSR not yet available", "Node qeci-13399-rwght-rhel-1 is present in node list"], "rc": 0, "server_approve_results": ["qeci-13399-rwght-rhel-1: certificatesigningrequest.certificates.k8s.io/csr-vcj2t approved\n", "Node qeci-13399-rwght-rhel-1 API is ready"]}
changed: [10.0.97.118] => {"changed": true, "client_approve_results": ["Attempt: 1, Node qeci-13399-rwght-rhel-0 not present or CSR not yet available", "qeci-13399-rwght-rhel-0: certificatesigningrequest.certificates.k8s.io/csr-gq6zl approved\n", "Attempt: 2, Node qeci-13399-rwght-rhel-0 not present or CSR not yet available", "Attempt: 3, Node qeci-13399-rwght-rhel-0 not present or CSR not yet available", "Node qeci-13399-rwght-rhel-0 is present in node list"], "rc": 0, "server_approve_results": ["qeci-13399-rwght-rhel-0: certificatesigningrequest.certificates.k8s.io/csr-p8t7b approved\n", "Node qeci-13399-rwght-rhel-0 API is ready"]}

TASK [openshift_node : Wait for node to report ready] **************************
Tuesday 12 January 2021  12:54:01 +0800 (0:00:18.488)       0:08:51.230 ******* 
FAILED - RETRYING: Wait for node to report ready (36 retries left).
FAILED - RETRYING: Wait for node to report ready (36 retries left).
[... the same message repeats for both hosts as the counter runs down ...]
FAILED - RETRYING: Wait for node to report ready (1 retries left).
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [10.0.97.118]: FAILED! => {"attempts": 36, "changed": false, "cmd": ["oc", "get", "node", "qeci-13399-rwght-rhel-0", "--kubeconfig=/tmp/installer-FrEA7T/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.158489", "end": "2021-01-12 12:57:13.937125", "rc": 0, "start": "2021-01-12 12:57:13.778636", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
fatal: [10.0.99.172]: FAILED! => {"attempts": 36, "changed": false, "cmd": ["oc", "get", "node", "qeci-13399-rwght-rhel-1", "--kubeconfig=/tmp/installer-FrEA7T/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.155983", "end": "2021-01-12 12:57:14.074916", "rc": 0, "start": "2021-01-12 12:57:13.918933", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}

From the kubelet log on the worker, we see the following:
Jan 11 23:57:10 qeci-13399-rwght-rhel-1 hyperkube[3191]: E0111 23:57:10.352787 3191 pod_workers.go:191] Error syncing pod 3c39f16e-471c-41e0-aa86-2b360f2892fd ("network-metrics-daemon-q5v4p_openshift-multus(3c39f16e-471c-41e0-aa86-2b360f2892fd)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?

But once this failure happened, we checked both the SDN pod's logs and the multus pod's logs; all logs and other state looked good (see the outputs below, and the quick checks sketched after them). So it seems the retries simply time out while waiting for the RHEL worker to become ready.

[weliang@weliang ~]$ oc get node
NAME                         STATUS   ROLES    AGE     VERSION
weliang-122-dxqsm-master-0   Ready    master   4h39m   v1.20.0+394a5a3
weliang-122-dxqsm-master-1   Ready    master   4h35m   v1.20.0+394a5a3
weliang-122-dxqsm-master-2   Ready    master   4h34m   v1.20.0+394a5a3
weliang-122-dxqsm-rhel-0     Ready    worker   173m    v1.20.0+2e906f0
weliang-122-dxqsm-rhel-1     Ready    worker   173m    v1.20.0+2e906f0
weliang-122-dxqsm-worker-0   Ready    worker   4h24m   v1.20.0+394a5a3
weliang-122-dxqsm-worker-1   Ready    worker   4h24m   v1.20.0+394a5a3
weliang-122-dxqsm-worker-2   Ready    worker   4h24m   v1.20.0+394a5a3
[weliang@weliang ~]$ oc get co
NAME                                       VERSION      AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-fc.2   True        False         False      4h8m
baremetal                                  4.7.0-fc.2   True        False         False      4h34m
cloud-credential                           4.7.0-fc.2   True        False         False      4h41m
cluster-autoscaler                         4.7.0-fc.2   True        False         False      4h32m
config-operator                            4.7.0-fc.2   True        False         False      4h34m
console                                    4.7.0-fc.2   True        False         False      3h58m
csi-snapshot-controller                    4.7.0-fc.2   True        False         False      3h59m
dns                                        4.7.0-fc.2   True        False         False      4h31m
etcd                                       4.7.0-fc.2   True        False         False      4h30m
image-registry                             4.7.0-fc.2   True        False         False      4h6m
ingress                                    4.7.0-fc.2   True        False         False      4h20m
insights                                   4.7.0-fc.2   True        False         False      4h21m
kube-apiserver                             4.7.0-fc.2   True        False         False      4h30m
kube-controller-manager                    4.7.0-fc.2   True        False         False      4h27m
kube-scheduler                             4.7.0-fc.2   True        False         False      4h28m
kube-storage-version-migrator              4.7.0-fc.2   True        False         False      3h53m
machine-api                                4.7.0-fc.2   True        False         False      4h28m
machine-approver                           4.7.0-fc.2   True        False         False      4h33m
machine-config                             4.7.0-fc.2   True        False         False      4h31m
marketplace                                4.7.0-fc.2   True        False         False      3h56m
monitoring                                 4.7.0-fc.2   True        False         False      4h18m
network                                    4.7.0-fc.2   True        False         False      4h30m
node-tuning                                4.7.0-fc.2   True        False         False      4h32m
openshift-apiserver                        4.7.0-fc.2   True        False         False      4h25m
openshift-controller-manager               4.7.0-fc.2   True        False         False      4h24m
openshift-samples                          4.7.0-fc.2   True        False         False      4h
operator-lifecycle-manager                 4.7.0-fc.2   True        False         False      4h32m
operator-lifecycle-manager-catalog         4.7.0-fc.2   True        False         False      4h32m
operator-lifecycle-manager-packageserver   4.7.0-fc.2   True        False         False      3h57m
service-ca                                 4.7.0-fc.2   True        False         False      4h34m
storage                                    4.7.0-fc.2   True        False         False      3h56m
[weliang@weliang ~]$ oc get pod -n openshift-multus
NAME                                READY   STATUS    RESTARTS   AGE
multus-2mv75                        1/1     Running   0          4h36m
multus-2v88l                        1/1     Running   0          173m
multus-admission-controller-4lx4k   2/2     Running   0          4h31m
multus-admission-controller-crh2t   2/2     Running   0          4h32m
multus-admission-controller-s9f5x   2/2     Running   0          4h35m
multus-j22qz                        1/1     Running   0          4h39m
multus-j5nxv                        1/1     Running   0          4h24m
multus-lfv4w                        1/1     Running   0          4h35m
multus-mw2nn                        1/1     Running   0          173m
multus-qzcw8                        1/1     Running   0          4h24m
multus-zv68h                        1/1     Running   0          4h24m
network-metrics-daemon-cxk84        2/2     Running   0          4h39m
network-metrics-daemon-d6m2m        2/2     Running   0          4h24m
network-metrics-daemon-f9r9s        2/2     Running   0          4h24m
network-metrics-daemon-ffw9x        2/2     Running   0          4h24m
network-metrics-daemon-fnkdm        2/2     Running   0          173m
network-metrics-daemon-hv9vg        2/2     Running   0          4h36m
network-metrics-daemon-s2xhz        2/2     Running   0          173m
network-metrics-daemon-t5vmt        2/2     Running   0          4h35m
[weliang@weliang ~]$ oc get pod -n openshift-sdn
NAME                   READY   STATUS    RESTARTS   AGE
ovs-7nslw              1/1     Running   0          173m
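
For reference, the checks above boil down to a couple of commands (a sketch; <node-name> is a placeholder, and the last command assumes SSH access to the worker):

# The exact query the playbook retries until it prints "True":
oc get node <node-name> \
  --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'

# The kubelet error complains about a missing CNI config; on the worker itself:
ls /etc/kubernetes/cni/net.d/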


So I think we can increase the retries to avoid such spurious failures. The current task is:

- block:
  - name: Wait for node to report ready
    command: >
      oc get node {{ ansible_nodename | lower }}
      --kubeconfig={{ openshift_node_kubeconfig_path }}
      --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
    delegate_to: localhost
    register: oc_get
    until:
    - oc_get.stdout == "True"
    retries: 36
    delay: 5
    changed_when: false
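
For comparison, the linked PR is titled "Increase node ready wait to 10 minutes". A sketch of what the adjusted task could look like (the exact retries/delay split chosen in the merged PR is an assumption here):

- block:
  - name: Wait for node to report ready
    command: >
      oc get node {{ ansible_nodename | lower }}
      --kubeconfig={{ openshift_node_kubeconfig_path }}
      --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
    delegate_to: localhost
    register: oc_get
    until:
    - oc_get.stdout == "True"
    # 60 retries x 10 s delay = 10 minutes, up from 36 x 5 s = 3 minutes
    retries: 60
    delay: 10
    changed_when: false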




Version:
openshift-ansible-4.7.0-202012172309.p0.git.0.8489ce2.el7.noarch.rpm


Platform:
PSI OpenStack



Comment 1 Johnny Liu 2021-01-13 11:59:36 UTC
Created attachment 1747037 [details]
scalup log

Comment 2 Russell Teague 2021-01-13 16:32:55 UTC
Looking at CI, the task in question takes around 15-20 retries before the node comes up. We can look at increasing the timeout, but I don't want to choose an arbitrary value. I would like more detail on how long it actually takes these hosts to report Ready after reboot. For instance, are we just missing the Ready state, or does it come minutes later?
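
One way to gather that data would be to timestamp the transition directly from a shell (a sketch; <node-name> is a placeholder):

# Poll until the node flips to Ready, then print when that happened
while [ "$(oc get node <node-name> \
  --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}')" != "True" ]; do
  sleep 5
done
date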

Comment 3 Russell Teague 2021-01-13 18:27:52 UTC
When attaching logs, please attach the complete log and not just a snippet. Thanks.

Comment 5 Johnny Liu 2021-01-14 03:07:24 UTC
The complete log is attached.

@Weibin, can you tell Russell how many minutes you waited before checking, after the scaleup job failed?

Comment 6 Weibin Liang 2021-01-15 19:52:44 UTC
Here are the timestamps from the logs when scaling up two new RHEL 7.9 nodes:

Fri Jan 15 14:24:00 EST 2021: Two new nodes were added and NotReady
Fri Jan 15 14:27:36 EST 2021: CI job failed
Fri Jan 15 14:28:10 EST 2021: Two new nodes are Ready
Fri Jan 15 14:28:25 EST 2021: All pods in openshift-multus and openshift-sdn restarted and came back to Running
Fri Jan 15 14:29:06 EST 2021: Network operator back to the available state

It seems to take at least 5 minutes from adding the two nodes (14:24:00) until the network operator becomes available again (14:29:06); the nodes themselves reported Ready about 4 minutes 10 seconds after being added (14:28:10), while the job's roughly 3-minute wait window (36 retries x 5 s) had already expired at 14:27:36.

Comment 7 Johnny Liu 2021-01-28 02:08:34 UTC
Hi Russell, 

We are still seeing this issue in QE's CI testing, and Weibin has already provided the info.

Can we fix this in 4.7?

Comment 8 Johnny Liu 2021-01-28 03:12:04 UTC
From the timestamps in comment 6, it seems the timeout needs to increase by about 3 more minutes. This is blocking QE's testing, so I am raising its severity.

Comment 9 Johnny Liu 2021-01-28 06:23:48 UTC
Comparing with the same testing on 4.6, which works well, the issue appears only on 4.7. Talking with the network QE team, 4.7 seems to introduce some new resources, e.g. the network-diagnostics pod; maybe that explains why 4.7 waits more minutes than 4.6.
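
(For anyone reproducing the comparison: the new diagnostics pods should be visible with the command below; the namespace is my assumption for 4.7.)

oc get pod -n openshift-network-diagnostics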

Comment 11 Gaoyun Pei 2021-02-08 07:34:42 UTC
Tried to add RHEL workers 5 times today, but didn't see the situation where the
"Wait for node to report ready" step has to wait a long time; all 5 attempts finished this step within one minute.

Since the PR was already merged into openshift-ansible-4.7.0-202102032256.p0.git.0.bf7d9a7.el7.noarch.rpm, moving this bug to VERIFIED.
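
For reference, each attempt amounts to re-running the scaleup playbook against an inventory listing the new RHEL hosts (a sketch; the inventory path is a placeholder):

cd openshift-ansible
ansible-playbook -i <inventory>/hosts playbooks/scaleup.yml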

Comment 14 errata-xmlrpc 2021-02-24 15:52:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

