Bug 1374407 - Atomic-openshift-node daemon freezes
Summary: Atomic-openshift-node daemon freezes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Assignee: Seth Jennings
QA Contact: Xiaoli Tian
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-09-08 15:32 UTC by Miheer Salunke
Modified: 2019-08-08 02:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-03 15:17:33 UTC
Target Upstream Version:
Embargoed:



Description Miheer Salunke 2016-09-08 15:32:01 UTC
Description of problem:

A few weeks ago, we started to see atomic-openshift-node occasionally "freeze", even though it still posts its health check to the master successfully (the node appears as "Ready" in "oc get node").

The exact symptoms are:

- New pods are scheduled to the node but get stuck in "Pending" state indefinitely and never start.
- Pods deleted from that node get stuck in "Terminating" state indefinitely (well beyond the terminationGracePeriod) and are never deleted.
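
As a rough way to check a node for these two symptoms, the sketch below scans pod JSON in the shape returned by "oc get pods --all-namespaces -o json". It is only an illustration, not part of the original report: the function names and the 10-minute slack are assumptions. It relies on the Kubernetes convention that metadata.deletionTimestamp already includes the grace period, so a pod still present well after that time is treated as stuck.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(value):
    # Kubernetes timestamps are RFC 3339 in UTC, e.g. "2016-09-08T15:32:01Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def find_stuck_pods(pod_list, node_name, now, slack=timedelta(minutes=10)):
    """Return (pod name, symptom) pairs for pods on node_name that look stuck.

    The slack value is an arbitrary assumption; tune it for your cluster.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # Only consider pods that were scheduled to the suspect node.
        if pod.get("spec", {}).get("nodeName") != node_name:
            continue
        deletion = meta.get("deletionTimestamp")
        if deletion is not None:
            # deletionTimestamp is the time by which the pod should be gone.
            if now - parse_ts(deletion) > slack:
                stuck.append((meta.get("name"), "Terminating"))
        elif pod.get("status", {}).get("phase") == "Pending":
            # Scheduled but never started.
            if now - parse_ts(meta["creationTimestamp"]) > slack:
                stuck.append((meta.get("name"), "Pending"))
    return stuck
```

To use it, load the command output with json.load and pass datetime.now(timezone.utc) as "now". A node that reports "Ready" but returns a non-empty list here matches the behaviour described above.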

The only way to recover from this failure is to restart the atomic-openshift-node daemon, which takes ~7 minutes. After that, the node works properly again.

In previous versions of OpenShift, we observed similar occasional behaviour, with one difference: restarting the daemon did not take ~7 minutes as it does now. Starting with 3.2.0, we stopped observing this behaviour, until a few weeks ago.

The only change we noticed is that the firmware of the hypervisors of the OpenStack deployment hosting OpenShift was upgraded, which caused some VMs to be shut down and restarted. This may be worth pointing out, but it is probably not significant.


This is happening on "OpenShift on top of Red Hat OpenStack."

Earlier we thought we were running into https://github.com/kubernetes/kubernetes/issues/31272, but at the moment of the crash there were NFS PVs published but not in use; only Ceph PVs were being used by pods.

Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.2

How reproducible:
Observed in the customer's environment

Steps to Reproduce:
1. Mentioned in the description

Actual results:
Atomic-openshift-node daemon freezes

Expected results:
Atomic-openshift-node daemon should not freeze

Additional info:

Comment 13 Greg Blomquist 2019-07-03 15:17:33 UTC
Customer closed case in 2017

