Bug 1811159

Summary: Azure : Node goes into NodeNotReady
Product: OpenShift Container Platform
Component: Node
Version: 4.3.0
Target Release: 4.5.0
Reporter: Joe Talerico <jtaleric>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, jminter, jokerman, nelluri, zyu
Status: CLOSED DUPLICATE
Severity: medium
Priority: unspecified
Keywords: Reopened
Whiteboard: aos-scalability-43
Type: Bug
Last Closed: 2020-03-10 14:50:19 UTC

Description Joe Talerico 2020-03-06 18:12:26 UTC
Description of problem:
Worker nodes use a P4 disk and the Standard_D2s_v3 vm_size. During our mastervert workload, we saw nodes flip to NotReady.

root@ip-172-31-21-245: ~ # grep Not ready-master.out
scale-q46lg-worker-centralus1-7p8hh   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus1-7p8hh   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2

This workload only created 10 projects and ~100 pods with multiple configmaps, secrets, etc.
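
(For reference, the NotReady list above can be collected with a simple polling loop. This is a sketch, assuming oc is logged in to the cluster; the 30s interval is an assumption, and only the ready-master.out file name is taken from the grep above.)

# Poll node status periodically and append it to a log; grep the log for NotReady afterwards
while true; do
  oc get nodes --no-headers >> ready-master.out
  sleep 30
done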

Version-Release number of selected component (if applicable): 4.3


How reproducible: 100%


Steps to Reproduce:
1. Launch a cluster in Azure (CentralUS): 3x D8v3 masters w/ P30, 3x D8v3 infra w/ P30, 25x D2v3 workers w/ P4.
2. Run mastervert 100


Actual results:
Nodes repeatedly flip to NotReady (same node list as captured in the description above).

Expected results:
Nodes not flipping to NotReady

We should be able to run this workload beyond 10 projects in this environment. I was able to run this workload on Azure with 100 projects; however, that was with D8v3 instances and P15 disks.

Additional info:
This could very well be due to limited resources; however, this is a very trivial workload. We should only allow users to choose instance/disk types that can sustain some meaningful load.
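
(A hedged diagnostic sketch: the node name below is one of the affected workers from the list above, and the grep patterns are assumptions about what to look for.)

# Check the conditions the kubelet reports for an affected worker
oc describe node scale-q46lg-worker-centralus3-7vdzz | grep -A 10 'Conditions:'

# Pull the kubelet journal from the node and look for readiness/PLEG messages
oc adm node-logs scale-q46lg-worker-centralus3-7vdzz -u kubelet | grep -iE 'pleg|notready'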

Comment 1 Joe Talerico 2020-03-06 18:13:25 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1781345

Comment 2 Ryan Phillips 2020-03-06 18:24:04 UTC

*** This bug has been marked as a duplicate of bug 1801824 ***

Comment 3 Naga Ravi Chaitanya Elluri 2020-03-06 18:41:51 UTC
Logs including must-gather output are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/azure/bz-1811159/.

Comment 4 Jim Minter 2020-03-09 14:11:40 UTC
Reopening: I do not believe this is memory-related; it is related to IO load on Azure and needs to be investigated. Compare https://aka.ms/aks/io-throttle-issue .
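
One way to get a rough view of disk pressure on an affected node (a sketch; the node name is from the list above, and only journald and /proc are used since extra tooling may not be present on RHCOS):

oc debug node/scale-q46lg-worker-centralus3-7vdzz
# inside the debug shell:
chroot /host
journalctl -u kubelet --no-pager | grep -iE 'pleg|evict|not ready'
cat /proc/diskstats    # raw per-device IO counters; sample twice to estimate IOPS against the P4 tier's limits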

Comment 5 Ryan Phillips 2020-03-09 17:12:58 UTC
The bug I marked this as a duplicate of also covers the missing CPU reservation for system components; it is not only memory-related.
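
(For reference, on 4.x the per-pool reservations can be raised manually with a KubeletConfig. This is only a sketch, not necessarily what bug 1801824 changes; the pool label and the cpu/memory values are assumptions and would need tuning for a Standard_D2s_v3.)

cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: system-reserved   # this label must also be added to the worker MachineConfigPool
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1Gi
EOF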