Recently QE has often been hitting RHEL worker scaleup failures.

TASK [openshift_node : Approve node CSRs] **************************************
Tuesday 12 January 2021  12:53:43 +0800 (0:00:17.805)       0:08:32.742 *******
changed: [10.0.99.172] => {"changed": true, "client_approve_results": ["Attempt: 1, Node qeci-13399-rwght-rhel-1 not present or CSR not yet available", "qeci-13399-rwght-rhel-1: certificatesigningrequest.certificates.k8s.io/csr-mgdcl approved\n", "Attempt: 2, Node qeci-13399-rwght-rhel-1 not present or CSR not yet available", "Attempt: 3, Node qeci-13399-rwght-rhel-1 not present or CSR not yet available", "Node qeci-13399-rwght-rhel-1 is present in node list"], "rc": 0, "server_approve_results": ["qeci-13399-rwght-rhel-1: certificatesigningrequest.certificates.k8s.io/csr-vcj2t approved\n", "Node qeci-13399-rwght-rhel-1 API is ready"]}
changed: [10.0.97.118] => {"changed": true, "client_approve_results": ["Attempt: 1, Node qeci-13399-rwght-rhel-0 not present or CSR not yet available", "qeci-13399-rwght-rhel-0: certificatesigningrequest.certificates.k8s.io/csr-gq6zl approved\n", "Attempt: 2, Node qeci-13399-rwght-rhel-0 not present or CSR not yet available", "Attempt: 3, Node qeci-13399-rwght-rhel-0 not present or CSR not yet available", "Node qeci-13399-rwght-rhel-0 is present in node list"], "rc": 0, "server_approve_results": ["qeci-13399-rwght-rhel-0: certificatesigningrequest.certificates.k8s.io/csr-p8t7b approved\n", "Node qeci-13399-rwght-rhel-0 API is ready"]}

TASK [openshift_node : Wait for node to report ready] **************************
Tuesday 12 January 2021  12:54:01 +0800 (0:00:18.488)       0:08:51.230 *******
FAILED - RETRYING: Wait for node to report ready (36 retries left).
FAILED - RETRYING: Wait for node to report ready (36 retries left).
FAILED - RETRYING: Wait for node to report ready (35 retries left).
FAILED - RETRYING: Wait for node to report ready (35 retries left).
[... the same retry message repeats for both hosts, counting down from 34 to 2 ...]
FAILED - RETRYING: Wait for node to report ready (1 retries left).
FAILED - RETRYING: Wait for node to report ready (1 retries left).
fatal: [10.0.97.118]: FAILED! => {"attempts": 36, "changed": false, "cmd": ["oc", "get", "node", "qeci-13399-rwght-rhel-0", "--kubeconfig=/tmp/installer-FrEA7T/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.158489", "end": "2021-01-12 12:57:13.937125", "rc": 0, "start": "2021-01-12 12:57:13.778636", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]}
=> {"attempts": 36, "changed": false, "cmd": ["oc", "get", "node", "qeci-13399-rwght-rhel-1", "--kubeconfig=/tmp/installer-FrEA7T/auth/kubeconfig", "--output=jsonpath={.status.conditions[?(@.type==\"Ready\")].status}"], "delta": "0:00:00.155983", "end": "2021-01-12 12:57:14.074916", "rc": 0, "start": "2021-01-12 12:57:13.918933", "stderr": "", "stderr_lines": [], "stdout": "False", "stdout_lines": ["False"]} From the log, we got the following: Jan 11 23:57:10 qeci-13399-rwght-rhel-1 hyperkube[3191]: E0111 23:57:10.352787 3191 pod_workers.go:191] Error syncing pod 3c39f16e-471c-41e0-aa86-2b360f2892fd (\"network-metrics-daemon-q5v4p_openshift-multus(3c39f16e-471c-41e0-aa86-2b360f2892fd)\"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?\n But once this failure happened, check both SDN pod's log and multus pod's log, all logs and other state looks good. So seem like the retries get timeout when waiting for rhel worker get ready. [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION weliang-122-dxqsm-master-0 Ready master 4h39m v1.20.0+394a5a3 weliang-122-dxqsm-master-1 Ready master 4h35m v1.20.0+394a5a3 weliang-122-dxqsm-master-2 Ready master 4h34m v1.20.0+394a5a3 weliang-122-dxqsm-rhel-0 Ready worker 173m v1.20.0+2e906f0 weliang-122-dxqsm-rhel-1 Ready worker 173m v1.20.0+2e906f0 weliang-122-dxqsm-worker-0 Ready worker 4h24m v1.20.0+394a5a3 weliang-122-dxqsm-worker-1 Ready worker 4h24m v1.20.0+394a5a3 weliang-122-dxqsm-worker-2 Ready worker 4h24m v1.20.0+394a5a3 [weliang@weliang ~]$ oc get co NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE authentication 4.7.0-fc.2 True False False 4h8m baremetal 4.7.0-fc.2 True False False 4h34m cloud-credential 4.7.0-fc.2 True False False 4h41m cluster-autoscaler 4.7.0-fc.2 True False False 4h32m config-operator 4.7.0-fc.2 True False False 4h34m console 4.7.0-fc.2 True False False 3h58m csi-snapshot-controller 4.7.0-fc.2 True False False 3h59m dns 4.7.0-fc.2 True False False 4h31m etcd 4.7.0-fc.2 True False False 4h30m image-registry 4.7.0-fc.2 True False False 4h6m ingress 4.7.0-fc.2 True False False 4h20m insights 4.7.0-fc.2 True False False 4h21m kube-apiserver 4.7.0-fc.2 True False False 4h30m kube-controller-manager 4.7.0-fc.2 True False False 4h27m kube-scheduler 4.7.0-fc.2 True False False 4h28m kube-storage-version-migrator 4.7.0-fc.2 True False False 3h53m machine-api 4.7.0-fc.2 True False False 4h28m machine-approver 4.7.0-fc.2 True False False 4h33m machine-config 4.7.0-fc.2 True False False 4h31m marketplace 4.7.0-fc.2 True False False 3h56m monitoring 4.7.0-fc.2 True False False 4h18m network 4.7.0-fc.2 True False False 4h30m node-tuning 4.7.0-fc.2 True False False 4h32m openshift-apiserver 4.7.0-fc.2 True False False 4h25m openshift-controller-manager 4.7.0-fc.2 True False False 4h24m openshift-samples 4.7.0-fc.2 True False False 4h operator-lifecycle-manager 4.7.0-fc.2 True False False 4h32m operator-lifecycle-manager-catalog 4.7.0-fc.2 True False False 4h32m operator-lifecycle-manager-packageserver 4.7.0-fc.2 True False False 3h57m service-ca 4.7.0-fc.2 True False False 4h34m storage 4.7.0-fc.2 True False False 3h56m [weliang@weliang ~]$ oc get pod -n openshift-multus NAME READY STATUS RESTARTS AGE multus-2mv75 1/1 Running 0 4h36m multus-2v88l 1/1 Running 0 173m multus-admission-controller-4lx4k 2/2 Running 0 4h31m 
multus-admission-controller-crh2t   2/2     Running   0          4h32m
multus-admission-controller-s9f5x   2/2     Running   0          4h35m
multus-j22qz                        1/1     Running   0          4h39m
multus-j5nxv                        1/1     Running   0          4h24m
multus-lfv4w                        1/1     Running   0          4h35m
multus-mw2nn                        1/1     Running   0          173m
multus-qzcw8                        1/1     Running   0          4h24m
multus-zv68h                        1/1     Running   0          4h24m
network-metrics-daemon-cxk84        2/2     Running   0          4h39m
network-metrics-daemon-d6m2m        2/2     Running   0          4h24m
network-metrics-daemon-f9r9s        2/2     Running   0          4h24m
network-metrics-daemon-ffw9x        2/2     Running   0          4h24m
network-metrics-daemon-fnkdm        2/2     Running   0          173m
network-metrics-daemon-hv9vg        2/2     Running   0          4h36m
network-metrics-daemon-s2xhz        2/2     Running   0          173m
network-metrics-daemon-t5vmt        2/2     Running   0          4h35m

[weliang@weliang ~]$ oc get pod -n openshift-sdn
NAME        READY   STATUS    RESTARTS   AGE
ovs-7nslw   1/1     Running   0          173m

So I think we can increase the retries to avoid this kind of false failure. The current task is:

- block:
  - name: Wait for node to report ready
    command: >
      oc get node {{ ansible_nodename | lower }}
      --kubeconfig={{ openshift_node_kubeconfig_path }}
      --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
    delegate_to: localhost
    register: oc_get
    until:
    - oc_get.stdout == "True"
    retries: 36
    delay: 5
    changed_when: false

Version: openshift-ansible-4.7.0-202012172309.p0.git.0.8489ce2.el7.noarch.rpm
Platform: PSI openstack
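For illustration, a minimal sketch of the kind of change being proposed; the retries/delay values below are assumptions for discussion, not the values in any merged fix:

- name: Wait for node to report ready
  command: >
    oc get node {{ ansible_nodename | lower }}
    --kubeconfig={{ openshift_node_kubeconfig_path }}
    --output=jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  delegate_to: localhost
  register: oc_get
  until:
  - oc_get.stdout == "True"
  # assumed values: 72 retries x 5 s delay is roughly 6 minutes of waiting instead of the current ~3
  retries: 72
  delay: 5
  changed_when: false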
Created attachment 1747037 [details] scaleup log
Looking at CI, the task in question takes around 15-20 retries before the node comes up. We can look at increasing the timeout, but I don't want to choose an arbitrary value. I would like to get more detail on how long it actually takes these hosts to report Ready after reboot. For instance, are we just missing the Ready state, or is it minutes later?
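One way to get that detail (a suggestion, not taken from the attached logs) is to compare the node's creation time with the last transition time of its Ready condition once the scaleup has settled; <node-name> below is a placeholder:

$ oc get node <node-name> --output=jsonpath='{.metadata.creationTimestamp}'
$ oc get node <node-name> --output=jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'

The difference between the two roughly shows how long after joining the cluster the node actually reported Ready.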
When attaching logs, please attach the complete log and not just a snippet. Thanks
The complete log is attached. @Weibin, can you tell Russell how many minutes you waited before checking, after the scaleup job failed?
Here are the timestamps taken from the logs when scaling up two new RHEL 7.9 nodes:

Fri Jan 15 14:24:00 EST 2021: Two new nodes were added and NotReady
Fri Jan 15 14:27:36 EST 2021: CI job failed
Fri Jan 15 14:28:10 EST 2021: Two new nodes are Ready
Fri Jan 15 14:28:25 EST 2021: All pods in openshift-multus and openshift-sdn restarted and are back to Running
Fri Jan 15 14:29:06 EST 2021: Network operator back to the Available state

So it seems to take at least 5 minutes from the two nodes being added until the network operator becomes available again.
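For reference against those timestamps: the current task waits 36 retries x 5 s delay = 180 s, about 3 minutes, which matches the 12:54:01 -> 12:57:13 window in the attached log. The window above, from nodes added (14:24:00) to network operator available (14:29:06), is roughly 5 minutes, so covering it would need on the order of 60 retries at a 5 s delay.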
Hi Russell, we are still seeing this issue in QE's CI testing, and Weibin has already provided the info. Can we fix this in 4.7?
From the timestamps in comment 6, it seems we need to increase the timeout by about 3 more minutes. This is somewhat blocking QE's testing, so I am raising its severity.
After comparing with the same testing on 4.6: the scaleup works well on 4.6 but not on 4.7. Talking with the network QE team, it seems 4.7 introduces some new resources, e.g. the network-diagnostics pods; maybe that explains why 4.7 waits for more minutes than 4.6.
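If it helps to confirm that theory, the connectivity-check pods that 4.7 adds can be listed; this assumes they run in the default openshift-network-diagnostics namespace:

$ oc get pod -n openshift-network-diagnostics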
I tried to add RHEL workers 5 times today but did not see the situation where it has to wait a long time at the "Wait for node to report ready" step; all 5 attempts finished this step within one minute. Since the PR was already merged into openshift-ansible-4.7.0-202102032256.p0.git.0.bf7d9a7.el7.noarch.rpm, I am moving this bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633