Description of problem:
Starting a few weeks ago, we began to see atomic-openshift-node occasionally "freeze" while still successfully posting its health check to the master (the node still appears as "Ready" in "oc get node"). The exact symptoms are:
- New pods scheduled to the node get stuck in "Pending" state indefinitely and never start.
- Pods deleted from that node get stuck in "Terminating" state indefinitely (well beyond their terminationGracePeriod) and are never removed.

The only way to recover from this failure is to restart the atomic-openshift-node daemon, which takes ~7 minutes; the node then works properly again. In previous versions of OpenShift we observed similar occasional behaviour, with one difference: restarting the daemon did not take the ~7 minutes it now takes. Starting from 3.2.0 we stopped observing this behaviour, until a few weeks ago. The only change we are aware of is that the firmware of the hypervisors of the OpenStack deployment where OpenShift runs was upgraded, which caused some VMs to be shut down and restarted. This may be worth noting, but it is probably not significant.

This is happening on OpenShift on top of Red Hat OpenStack.

We initially suspected https://github.com/kubernetes/kubernetes/issues/31272, but at the moment of the freeze the NFS PVs were published but not in use; only Ceph PVs were being used by pods.

Version-Release number of selected component (if applicable):
OpenShift Enterprise 3.2

How reproducible:
On the customer's environment

Steps to Reproduce:
1. See description
2.
3.

Actual results:
The atomic-openshift-node daemon freezes

Expected results:
The atomic-openshift-node daemon should not freeze

Additional info:
Customer closed case in 2017
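For triaging similar reports, a minimal sketch of isolating the pods stuck on a suspect node. The node name, namespaces, and the embedded sample listing are hypothetical stand-ins for real `oc get pods --all-namespaces -o wide` output, so the filter can be demonstrated without a cluster:

```shell
#!/bin/sh
# Hypothetical sample of `oc get pods --all-namespaces -o wide` output;
# on a live cluster you would pipe the real command instead.
sample='NAMESPACE   NAME          READY   STATUS        RESTARTS   AGE   NODE
app1        web-1-abcde   0/1     Pending       0          30m   node1.example.com
app1        web-0-zyxwv   1/1     Terminating   0          2d    node1.example.com
app2        db-1-fghij    1/1     Running       0          5h    node2.example.com'

# Keep only pods on the suspect node ($7) whose STATUS ($4) is Pending
# or Terminating -- the two stuck states described in this report.
stuck=$(echo "$sample" | awk '$7 == "node1.example.com" && ($4 == "Pending" || $4 == "Terminating") {print $2, $4}')
echo "$stuck"
```

If the list is non-empty while the node still shows "Ready" in `oc get node`, the node matches the frozen-kubelet symptom above, and restarting the atomic-openshift-node daemon on that host is the only known recovery.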