Description of problem:
1 of 3 recurring issues observed on Starter clusters this week. A node goes NotReady, failed state can be associated with a particular pod, but that pod doesn't appear to be using an inordinate amount of resources.

Version-Release number of selected component (if applicable):
atomic-openshift-3.6.173.0.5-1.git.0.f30b99e.el7.x86_64

How reproducible:
Rarely, tied to particular pods

Steps to Reproduce:
1. User runs a pod
2. Node goes NotReady
3. Ops checks system load, docker stats, nothing appears to be out of reasonable bounds
4. Ops disables or moves the pod
5. Node recovers

Actual results:
Node goes NotReady, despite available resources.

Expected results:
Node should stay in Ready state, or report what failure prevents it from being Ready.

Additional info:
These look similar enough, and both are on Starter clusters.

*** This bug has been marked as a duplicate of bug 1486914 ***
Sten, since you mention that the issue occurs only with a specific pod, can you please share the pod's YAML file so that the issue can be reproduced on a local system? That would help us understand what the pod does to make the node go NotReady.
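
When gathering the data requested above, it may also help to record which conditions the node itself reports at the time it goes NotReady. The following is a minimal sketch, not a definitive tool: it assumes the node object has been exported as JSON (for example with `oc get node <node-name> -o json`) and parsed into a dictionary. The function name `unhealthy_conditions` and the sample data are hypothetical and for illustration only; the sample is not taken from the affected cluster.

```python
import json

# Healthy status per condition type: "Ready" is healthy when "True",
# the pressure/disk conditions are healthy when "False".
HEALTHY_STATUS = {
    "Ready": "True",
    "OutOfDisk": "False",
    "MemoryPressure": "False",
    "DiskPressure": "False",
}


def unhealthy_conditions(node):
    """Return (type, status, reason, message) for each known condition
    that is not in its healthy state. Unknown condition types are skipped."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        expected = HEALTHY_STATUS.get(cond.get("type"))
        if expected is not None and cond.get("status") != expected:
            problems.append((
                cond.get("type"),
                cond.get("status"),
                cond.get("reason", ""),
                cond.get("message", ""),
            ))
    return problems


# Hypothetical sample shaped like a node object's status; illustrative only.
SAMPLE_NODE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "OutOfDisk", "status": "False"},
      {"type": "MemoryPressure", "status": "False"},
      {"type": "DiskPressure", "status": "False"},
      {"type": "Ready", "status": "Unknown",
       "reason": "NodeStatusUnknown",
       "message": "Kubelet stopped posting node status."}
    ]
  }
}
""")

if __name__ == "__main__":
    for ctype, status, reason, message in unhealthy_conditions(SAMPLE_NODE):
        print(f"{ctype}={status} reason={reason} message={message}")
```

Attaching this output next to the pod YAML would show whether the node reports a specific pressure condition or only that the kubelet stopped posting status.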