Description of problem:
Starting a few weeks ago, we began to see atomic-openshift-node occasionally "freeze" while still successfully posting its health check to the master (the node still appears as "Ready" in "oc get node"). The exact symptoms are:
- New pods scheduled to the node get stuck in "Pending" state indefinitely and never start.
- Pods deleted from that node get stuck in "Terminating" state indefinitely (well beyond their terminationGracePeriod) and are never removed.

The only way to recover from this failure is to restart the atomic-openshift-node daemon, which takes ~7 minutes; the node then works properly again. In previous versions of OpenShift we observed similar occasional behaviour, with one difference: restarting the daemon did not take the ~7 minutes it now takes. Starting from 3.2.0 we stopped observing this behaviour, until a few weeks ago. The only change we are aware of is that the firmware of the hypervisors of the OpenStack deployment where OpenShift runs was upgraded, which caused some VMs to be shut down and restarted. This may be worth noting, but it is probably not significant.

This is happening on OpenShift on top of Red Hat OpenStack.

We initially suspected https://github.com/kubernetes/kubernetes/issues/31272, but at the moment of the freeze the NFS PVs were published but not in use; only Ceph PVs were being used by pods.

Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.2

How reproducible:
On the customer's environment

Steps to Reproduce:
1. See description
2.
3.

Actual results:
The atomic-openshift-node daemon freezes

Expected results:
The atomic-openshift-node daemon should not freeze

Additional info:
Customer closed case in 2017
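For triaging similar reports, a minimal sketch of isolating the pods stuck on a suspect node. The node name, namespaces, and the embedded sample listing are hypothetical stand-ins for real `oc get pods --all-namespaces -o wide` output, so the filter can be demonstrated without a cluster:

```shell
#!/bin/sh
# Hypothetical sample of `oc get pods --all-namespaces -o wide` output;
# on a live cluster you would pipe the real command instead.
sample='NAMESPACE   NAME          READY   STATUS        RESTARTS   AGE   NODE
app1        web-1-abcde   0/1     Pending       0          30m   node1.example.com
app1        web-0-zyxwv   1/1     Terminating   0          2d    node1.example.com
app2        db-1-fghij    1/1     Running       0          5h    node2.example.com'

# Keep only pods on the suspect node ($7) whose STATUS ($4) is Pending
# or Terminating -- the two stuck states described in this report.
stuck=$(echo "$sample" | awk '$7 == "node1.example.com" && ($4 == "Pending" || $4 == "Terminating") {print $2, $4}')
echo "$stuck"
```

If the list is non-empty while the node still shows "Ready" in `oc get node`, the node matches the frozen-kubelet symptom above, and restarting the atomic-openshift-node daemon on that host is the only known recovery.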