Bug 1811159

Summary: Azure : Node goes into NodeNotReady
Product: OpenShift Container Platform
Component: Node
Version: 4.3.0
Target Release: 4.5.0
Reporter: Joe Talerico <jtaleric>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, jminter, jokerman, nelluri, zyu
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Keywords: Reopened
Whiteboard: aos-scalability-43
Type: Bug
Last Closed: 2020-03-10 14:50:19 UTC

Description Joe Talerico 2020-03-06 18:12:26 UTC
Description of problem:
Worker nodes use a P4 disk and the Standard_D2s_v3 vm_size. During our mastervert workload, we saw nodes flip to NotReady.

root@ip-172-31-21-245: ~ # grep Not ready-master.out
scale-q46lg-worker-centralus1-7p8hh   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus1-7p8hh   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2

This workload only created 10 projects and ~100 pods with multiple configmaps, secrets, etc.
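
(For reference, the NotReady list above can be collected with a simple polling loop. This is a sketch, assuming oc is logged in to the cluster; the 30s interval is an assumption, and only the ready-master.out file name is taken from the grep above.)

# Poll node status periodically and append it to a log; grep the log for NotReady afterwards
while true; do
  oc get nodes --no-headers >> ready-master.out
  sleep 30
done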

Version-Release number of selected component (if applicable): 4.3


How reproducible: 100%


Steps to Reproduce:
1. Launch a cluster in Azure (CentralUS): 3x D8v3 masters w/ P30, 3x D8v3 infra w/ P30, 25x D2v3 workers w/ P4.
2. Run mastervert 100


Actual results:
Nodes repeatedly flip to NotReady (same node list as captured in the description above).

Expected results:
Nodes not flipping to NotReady

We should be able to run this workload beyond 10 projects in this environment. I was able to run this workload on Azure with 100 projects; however, that was with D8v3 instances and P15 disks.

Additional info:
This could very well be due to limited resources; however, this is a very trivial workload. We should only allow users to choose instance/disk types that can sustain some meaningful load.
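
(A hedged diagnostic sketch: the node name below is one of the affected workers from the list above, and the grep patterns are assumptions about what to look for.)

# Check the conditions the kubelet reports for an affected worker
oc describe node scale-q46lg-worker-centralus3-7vdzz | grep -A 10 'Conditions:'

# Pull the kubelet journal from the node and look for readiness/PLEG messages
oc adm node-logs scale-q46lg-worker-centralus3-7vdzz -u kubelet | grep -iE 'pleg|notready'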

Comment 1 Joe Talerico 2020-03-06 18:13:25 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1781345

Comment 2 Ryan Phillips 2020-03-06 18:24:04 UTC

*** This bug has been marked as a duplicate of bug 1801824 ***

Comment 3 Naga Ravi Chaitanya Elluri 2020-03-06 18:41:51 UTC
Logs including must-gather output are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/azure/bz-1811159/.

Comment 4 Jim Minter 2020-03-09 14:11:40 UTC
Reopening: I do not believe this is memory-related; it is related to IO load on Azure and needs to be investigated. Compare https://aka.ms/aks/io-throttle-issue .
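
One way to get a rough view of disk pressure on an affected node (a sketch; the node name is from the list above, and only journald and /proc are used since extra tooling may not be present on RHCOS):

oc debug node/scale-q46lg-worker-centralus3-7vdzz
# inside the debug shell:
chroot /host
journalctl -u kubelet --no-pager | grep -iE 'pleg|evict|not ready'
cat /proc/diskstats    # raw per-device IO counters; sample twice to estimate IOPS against the P4 tier's limits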

Comment 5 Ryan Phillips 2020-03-09 17:12:58 UTC
The bug I marked this as a duplicate of also covers the missing CPU reservation for system components; it is not only memory-related.
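
(For reference, on 4.x the per-pool reservations can be raised manually with a KubeletConfig. This is only a sketch, not necessarily what bug 1801824 changes; the pool label and the cpu/memory values are assumptions and would need tuning for a Standard_D2s_v3.)

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: system-reserved   # this label must also be added to the worker MachineConfigPool
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1Gi
EOF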