Bug 1487339

Summary: Single pod is able to use enough resources to make a node go NotReady
Product: OpenShift Container Platform
Reporter: Sten Turpin <sten>
Component: Node
Assignee: Seth Jennings <sjenning>
Status: CLOSED CURRENTRELEASE
QA Contact: DeShuai Ma <dma>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 3.6.1
CC: aos-bugs, decarr, jokerman, mmccomas, mwhittin, sten
Target Milestone: ---
Keywords: OpsBlocker
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-12 19:29:06 UTC
Type: Bug

Description Sten Turpin 2017-08-31 16:24:26 UTC
Description of problem: A single pod can generate enough load that the node running it becomes unresponsive.


Version-Release number of selected component (if applicable): atomic-openshift-3.6.173.0.5-1.git.0.f30b99e.el7.x86_64


How reproducible: Sometimes


Steps to Reproduce:
1. Run a pod with a particular workload. 

Actual results:
Node stops responding via SSH, reports NotReady in k8s


Expected results:
cgroups (via node resource reservations and eviction thresholds) should bound the pod's resource use so that other pods keep running and administrative access to the node is preserved.
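
For reference, keeping the node reachable usually depends on the kubelet reserving resources for system daemons and evicting pods before the node is starved. A minimal sketch of the relevant settings in /etc/origin/node/node-config.yaml (values are illustrative, not taken from this cluster):

kubeletArguments:
  # Reserve CPU/memory for system daemons (sshd, journald, etc.)
  system-reserved:
  - "cpu=500m,memory=1Gi"
  # Reserve CPU/memory for the kubelet and container runtime
  kube-reserved:
  - "cpu=250m,memory=512Mi"
  # Evict pods before memory/disk are fully exhausted
  eviction-hard:
  - "memory.available<500Mi,nodefs.available<10%"

Note that these reservations mainly reduce the node's advertised allocatable resources; enforcing them as actual cgroup limits additionally requires enforce-node-allocatable, and none of this bounds EBS IO-token consumption.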

Additional info:
We're observing this on AWS nodes (m4.xlarge with 250 GB gp2 EBS). This is relevant because we have seen nodes become both CPU-bound and, separately, IO-token-bound (gp2 burst credits exhausted).
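
CPU and memory use of a single pod can be capped with resource limits; a hypothetical pod spec (not the actual workload from this report) would look like:

apiVersion: v1
kind: Pod
metadata:
  name: example-workload                         # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:latest    # placeholder image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "1"          # CPU above this is throttled by the cpu cgroup
        memory: "1Gi"     # exceeding this OOM-kills the container

There is no equivalent per-pod limit for EBS IO tokens, which may be why the io-token-bound case can still take the node down even with limits in place.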

Comment 2 Sten Turpin 2017-08-31 16:27:51 UTC
This often shows https://bugzilla.redhat.com/show_bug.cgi?id=1459589 ("iptables soft lockup") as a symptom.