Bug 1811159 - Azure : Node goes into NodeNotReady
Summary: Azure : Node goes into NodeNotReady
Keywords:
Status: CLOSED DUPLICATE of bug 1801824
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard: aos-scalability-43
Depends On:
Blocks:
 
Reported: 2020-03-06 18:12 UTC by Joe Talerico
Modified: 2020-03-18 14:24 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-10 14:50:19 UTC
Target Upstream Version:
Embargoed:



Description Joe Talerico 2020-03-06 18:12:26 UTC
Description of problem:
Worker nodes use a P4 disk and the Standard_D2s_v3 VM size. During our mastervert workload, we witnessed nodes flip to NotReady.

root@ip-172-31-21-245: ~ # grep Not ready-master.out
scale-q46lg-worker-centralus1-7p8hh   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus1-7p8hh   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-wjc2h   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2
scale-q46lg-worker-centralus3-7vdzz   NotReady   worker   21h   v1.16.2

This workload created only 10 projects and ~100 pods with multiple ConfigMaps, Secrets, etc.
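
For reference, ready-master.out above was captured by polling node status in a loop along these lines (the polling interval is an assumption):

while true; do
  oc get nodes --no-headers >> ready-master.out
  sleep 30   # interval is a guess; the grep above then pulls out the NotReady rows
done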

Version-Release number of selected component (if applicable): 4.3


How reproducible: 100%


Steps to Reproduce:
1. Launch a cluster in Azure (CentralUS): 3x D8v3 masters with P30 disks, 3x D8v3 infra nodes with P30 disks, and 25x D2v3 workers with P4 disks.
2. Run mastervert 100 while watching node status (see the sketch below).
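
A minimal way to watch for the flips while the workload runs (assumes a logged-in oc client; node names will differ per cluster):

# stream node status changes and keep only NotReady transitions
oc get nodes -w | grep NotReady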


Actual results:
Nodes flip to NotReady (same grep output as shown in the description above).

Expected results:
Nodes should not flip to NotReady.

We should be able to run this workload beyond 10 projects in this environment. I was able to run this workload on Azure with 100 projects; however, I was using D8v3 instances and P15 disks.

Additional info:
This could very well be due to limited resources (a P4 disk is capped at 120 IOPS and 25 MB/s, and a Standard_D2s_v3 has only 2 vCPUs and 8 GiB of RAM); however, this is a very trivial workload. We should only allow users to choose instance and disk types that can sustain a meaningful load.
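
For context, the kubelet's view of how little headroom these workers have can be dumped directly; the node name below is taken from the output above:

# print total capacity and what remains allocatable after system overhead
oc get node scale-q46lg-worker-centralus1-7p8hh -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'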

Comment 1 Joe Talerico 2020-03-06 18:13:25 UTC
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1781345

Comment 2 Ryan Phillips 2020-03-06 18:24:04 UTC

*** This bug has been marked as a duplicate of bug 1801824 ***

Comment 3 Naga Ravi Chaitanya Elluri 2020-03-06 18:41:51 UTC
Logs including must-gather output are at http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/4.3/logs/azure/bz-1811159/.

Comment 4 Jim Minter 2020-03-09 14:11:40 UTC
Reopening - I do not believe this is memory-related; it is related to IO load on Azure and needs to be investigated. Compare https://aka.ms/aks/io-throttle-issue.
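
One way to look for IO-related kubelet distress on an affected node; the grep patterns are guesses at what throttling typically surfaces (e.g. PLEG timeouts):

# pull the kubelet journal from the node and filter for readiness/timeout noise
oc adm node-logs scale-q46lg-worker-centralus3-7vdzz -u kubelet | grep -iE 'pleg|not ready|timeout'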

Comment 5 Ryan Phillips 2020-03-09 17:12:58 UTC
The bug this was marked a duplicate of also covers the missing CPU reservation for system components; it is not only memory-related.
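
For reference, CPU/memory reservations for system components can be carved out with a KubeletConfig CR; the values below are purely illustrative, not tuned recommendations:

oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-system-reserved
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 1Gi
EOF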

