Description of problem:
A 120-node bare-metal upgrade from 4.9.29 to 4.10.13, with moderate pod scale, crashloops on machine-approver for ~2 hrs.

NAME                                READY   STATUS    RESTARTS         AGE
machine-approver-6c94df4c6c-fpdkn   2/2     Running   23 (5m27s ago)   124m

Mon May 23 00:52:12 UTC 2022
NAME                                READY   STATUS    RESTARTS   AGE
machine-approver-6c94df4c6c-7xbjj   2/2     Running   0          48s

F0522 23:50:17.693517 1 main.go:102] Can't create clients: failed to create client: Get "https://172.30.0.1:443/api?timeout=32s": dial tcp 172.30.0.1:443: i/o timeout

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy OCP 4.9.29
2. Scale the BM env to 118 workers
3. Instantiate pod scale with network policies (https://github.com/cloud-bulldozer/kube-burner/tree/master/examples/workloads/kubelet-density-cni-networkpolicy)
4. Upgrade to 4.10.13

Actual results:
The machine-approver pod crashloops for ~2 hrs. It eventually reconciles, but upgrade completion is delayed considerably.

Expected results:
Successful upgrade in a reasonable time

Additional info:
must-gather available at: http://perf1.perf.lab.eng.bos.redhat.com/pub/dwilson/must-gather-machine-aprrover.tar.gz
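The restart figures in the pod listing above (23 restarts over an age of 124m) can be turned into an average restart interval. The following is only a rough sketch: the helper names parse_age and mean_restart_interval are hypothetical, and it assumes the RESTARTS and AGE columns use the usual kubectl-style duration strings such as "124m" or "5m27s".

```python
import re

# Seconds per unit for kubectl-style ages (assumed format: e.g. "2d3h", "124m", "5m27s").
_UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_age(text: str) -> int:
    """Convert an age string such as '124m' or '5m27s' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", text)
    # Reject input that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognized age: {text!r}")
    return sum(int(n) * _UNITS[u] for n, u in parts)


def mean_restart_interval(restarts: int, age: str) -> float:
    """Average number of seconds between restarts over the pod's lifetime."""
    if restarts <= 0:
        raise ValueError("restarts must be positive")
    return parse_age(age) / restarts


if __name__ == "__main__":
    # Values taken from the machine-approver-6c94df4c6c-fpdkn row.
    interval = mean_restart_interval(23, "124m")
    print(f"{interval / 60:.1f} min between restarts on average")
```

With the values from the report this works out to roughly 5.4 minutes per restart, which would be consistent with the pod sitting at the kubelet's 5-minute crash-loop back-off cap for most of the two hours, with each attempt failing shortly after start.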
Just to add more context, as Dave mentioned, we loaded the cluster with the kubelet-density-cni-networkpolicy workload. Specifically 2500 iterations deploying:
- Creates namespace kubelet-density-cni-networkpolicy with the following objects
  - 1 deny-all network policy in the
  - 2500 webserver applications (nginx)
  - 2500 services, each of them backing one of the previous applications
  - 2500 network policies
  - 2500 client applications (cURL-ing the webserver service)
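To give a sense of the load this puts on the cluster, the object counts listed above can be tallied. This is a minimal sketch, not taken from kube-burner itself: the dictionary keys and the workload_totals helper are hypothetical names, and it assumes each of the 2500 iterations creates exactly one object of each per-iteration kind, as the counts in this comment suggest.

```python
from collections import Counter

ITERATIONS = 2500

# Objects assumed to be created once per iteration (one of each kind).
PER_ITERATION = ("webserver application", "service", "network policy", "client application")

# Objects created once for the whole workload.
ONE_TIME = {"deny-all network policy": 1}


def workload_totals(iterations: int) -> Counter:
    """Return the number of objects of each kind after all iterations."""
    totals = Counter(ONE_TIME)
    for kind in PER_ITERATION:
        totals[kind] += iterations
    return totals


if __name__ == "__main__":
    totals = workload_totals(ITERATIONS)
    for kind, count in sorted(totals.items()):
        print(f"{kind}: {count}")
    policies = totals["network policy"] + totals["deny-all network policy"]
    print(f"network policies in total: {policies}")
    print(f"objects in total: {sum(totals.values())}")
```

Under those assumptions the namespace ends up with 2501 network policies and 10001 objects overall.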
The PR that contains the tentative fix (https://github.com/openshift/ovn-kubernetes/pull/1126) rectifies the issue: the upgrade completes in 1.5 hrs, which is in line with expectations.
Marking verified pre-merge per comment 4.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069