Description of problem:
Worker nodes use a P4 disk and the Standard_D2s_v3 vm_size. During our mastervert workload, we witnessed nodes flip to NotReady.

root@ip-172-31-21-245: ~ # grep Not ready-master.out
scale-q46lg-worker-centralus1-7p8hh NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus1-7p8hh NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2

This workload only created 10 projects and ~100 pods with multiple configmaps/secrets/etc.

Version-Release number of selected component (if applicable):
4.3

How reproducible:
100%

Steps to Reproduce:
1. Launch cloud in Azure, CentralUS, 3x D8v3 masters w/ P30, 3 infra D8v3 w/ P30, 25x D2v3 w/ P4 workers.
2. Run mastervert 100

Actual results:
scale-q46lg-worker-centralus1-7p8hh NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus1-7p8hh NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2

Expected results:
Nodes not flipping to NotReady. We should be able to run this workload beyond 10 projects in this environment. I was able to run this workload on Azure with 100 projects; however, I was using D8v3 and P15.

Additional info:
This could very well be due to limited resources; however, this is a very trivial workload. We should only allow users to choose instance / disk types that are suitable to sustain some meaningful load.
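For triage, it may help to tally how often each node shows up as NotReady in the captured output rather than reading the raw grep lines. The following is a minimal sketch, assuming ready-master.out holds repeated `oc get nodes`-style rows (NAME STATUS ROLES AGE VERSION); the function name `count_not_ready` is hypothetical and not part of any tool mentioned in this report. The sample rows are the ones quoted in the description.

```python
from collections import Counter


def count_not_ready(lines):
    """Count NotReady sightings per node.

    Assumes each row looks like: NAME STATUS ROLES AGE VERSION
    (an assumption about the layout of ready-master.out).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # startswith also matches compound states such as
        # "NotReady,SchedulingDisabled".
        if len(fields) >= 2 and fields[1].startswith("NotReady"):
            counts[fields[0]] += 1
    return counts


# Rows copied from the grep output in this report.
sample = """\
scale-q46lg-worker-centralus1-7p8hh NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus1-7p8hh NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-wjc2h NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
scale-q46lg-worker-centralus3-7vdzz NotReady worker 21h v1.16.2
"""

if __name__ == "__main__":
    for node, hits in count_not_ready(sample.splitlines()).most_common():
        print(f"{hits:3d}  {node}")
```

On the rows above this reports three distinct workers out of 25 (6, 4, and 2 sightings), which suggests the flapping is concentrated on a few nodes rather than spread across the pool.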
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1781345
*** This bug has been marked as a duplicate of bug 1801824 ***
Logs including must-gather output are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/azure/bz-1811159/.
Reopening - I do not believe that this is memory-related. It is related to IO load on Azure and needs to be investigated. Compare https://aka.ms/aks/io-throttle-issue .
The bug I marked this as a duplicate of also covers the missing CPU reservation for the system components. It is not only memory-related.