Description of problem:
I am working on using the kubeflow mpi operator to run large MPI jobs within OCP. I have noticed that when trying to run a very large workload (512 pods with 1 CPU each) I get a large number of ContainerCreateErrors.

Version-Release number of selected component (if applicable):
4.9.17

How reproducible:
Fairly consistently

Steps to Reproduce:
1. Create 40+ node cluster
2. Install kubeflow mpi operator 0.3
3. Create an MPI job with 512 pods

Actual results:
Many ContainerCreateErrors that occasionally don't resolve.

Expected results:
All pods should launch quickly and without issue.

Additional info:
Some background on this work: https://github.com/OpenShiftDemos/kubeflow-mpi-openfoam
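
To quantify how many pods are stuck while reproducing, a small helper along these lines may be useful. This is only a sketch, not part of the original report: the function name `count_waiting_reasons` is hypothetical, and it assumes the input is the JSON pod list printed by `oc get pods -o json`. It does not hard-code a reason string; the kubelet usually reports this failure as the waiting reason `CreateContainerError`, but any reason present is tallied.

```python
import json
from collections import Counter


def count_waiting_reasons(pod_list_json: str) -> Counter:
    """Tally container 'waiting' reasons across all pods in a pod list.

    Assumption: pod_list_json is the text produced by `oc get pods -o json`
    (a List object with an "items" array). The function name is hypothetical.
    """
    pods = json.loads(pod_list_json).get("items", [])
    reasons = Counter()
    for pod in pods:
        # containerStatuses may be absent for pods that were just scheduled
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for container_status in statuses:
            waiting = container_status.get("state", {}).get("waiting")
            if waiting:
                reasons[waiting.get("reason", "Unknown")] += 1
    return reasons
```

Feeding it the saved output of `oc get pods -n <namespace> -o json` for the MPI job's namespace gives a per-reason count that can be compared between runs.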
Fixed in the attached PR.
Marking as verified based on comments 6, 7, and 8.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069