Bug 1753160

Summary: [3.11] Pods in orphaned state
Product: OpenShift Container Platform
Component: Storage
Version: 3.11.0
Target Release: 3.11.z
Hardware: All
OS: All
Status: CLOSED WONTFIX
Severity: low
Priority: high
Reporter: Camino Noguera <mnoguera>
Assignee: Hemant Kumar <hekumar>
QA Contact: Qin Ping <piqin>
CC: aanjarle, aos-bugs, aos-storage-staff, bbennett, ckoep, jokerman, jsafrane, mtleilia, nchoudhu, rphillips, sburke, scuppett, ssadhale, swachira, vjaypurk
Flags: mnoguera: needinfo-, hekumar: needinfo? (mtleilia)
Type: Bug
Last Closed: 2020-06-23 16:28:20 UTC

Description Camino Noguera 2019-09-18 09:48:27 UTC
Description of problem:
journalctl shows an orphaned pod:


Aug 28 10:33:51 sxxxxx atomic-openshift-node[14947]: E0828 10:33:51.246241   14947 kubelet_volumes.go:140] Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found, but volume paths are still present on disk: There were a total of 15 errors similar to this. Turn up verbosity to see them.
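
For reference, a minimal sketch for listing all such messages from the node journal (assuming the atomic-openshift-node unit shown in the line above):

# List every orphaned-pod message recorded by the node service
# (sketch; unit name taken from the log line above).
journalctl -u atomic-openshift-node | grep "Orphaned pod"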

It is not clear why the issue started: there was no recent node outage, but they are having an issue with the docker daemon on the node, which usually hangs.

No persistent storage is used.

This was supposedly fixed in 3.10:
https://access.redhat.com/errata/RHBA-2018:1816


Version-Release number of selected component (if applicable):
3.11.104
docker version 1.13.1-96

There is a workaround:

https://access.redhat.com/solutions/3237211

rm -rf /var/lib/origin/openshift.local.volumes/pods/<pod-id>/
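
For reference, a minimal sketch of applying that workaround to the pod UID from the journal message, assuming the pod has already been confirmed as deleted from the cluster:

# Inspect what is left under the orphaned pod directory, then remove it
# (sketch; pod UID taken from the journal message above).
POD_UID=0f59ef88-b4c1-11e9-afd0-00505680343f
ls /var/lib/origin/openshift.local.volumes/pods/${POD_UID}/
rm -rf /var/lib/origin/openshift.local.volumes/pods/${POD_UID}/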

Comment 2 Jan Safranek 2019-09-23 12:03:56 UTC
The attached logs contain "Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found"; however, they do not contain any details about how the pod was deleted or why its volumes are orphaned.

To get rid of the message in 3.11, it should be enough to restart the atomic-openshift-node service to clean up the orphans. If that does not work, deleting the orphans manually (https://access.redhat.com/solutions/3237211) will definitely help.
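
A one-line sketch of the restart step on the affected node (assuming the atomic-openshift-node systemd unit used in 3.11):

# Restart the node service so the kubelet re-runs its orphaned-pod cleanup.
systemctl restart atomic-openshift-node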

To find the root cause, we need to know:

1. What happened to the pod when it was deleted (was the node rebooted? was the openshift-node service restarted?). Node logs from when 'Orphaned pod XYZ' first appeared, plus the 10-20 minutes before, would be best.

2. *What* volume was orphaned. A simple "find /var/lib/origin/openshift.local.volumes/pods/0f59ef88-b4c1-11e9-afd0-00505680343f" is a good start. Not all volumes that Kubernetes uses are persistent; for example, each pod gets a Secret volume with an API token:

/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes/kubernetes.io~secret
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2/ready

With that listing, we can see that a secret volume was orphaned and we know where to look.
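
For the pod UID reported in this bug, a minimal sketch of the same check (assuming the default openshift.local.volumes path shown above):

# List which volume plugin types remain under the orphaned pod directory
# (sketch; UID taken from the journal message in the description).
ls /var/lib/origin/openshift.local.volumes/pods/0f59ef88-b4c1-11e9-afd0-00505680343f/volumes/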

Comment 38 Ryan Phillips 2020-06-05 18:34:39 UTC
It looks like this is an issue with Ceph CSI. I am not sure if a patch will be backported to 3.11. I'm going to reassign this issue to Storage to track it.

Issue: https://github.com/kubernetes/kubernetes/issues/60987#issuecomment-638750828
Potential Upstream PR: https://github.com/ceph/ceph-csi/pull/1134

Comment 40 Hemant Kumar 2020-06-23 16:28:20 UTC
I am closing this bug since I haven't heard any update from the reporter. The storage team is aware that this issue is not 100% fixed for certain drivers, but since it does not affect the stability of a cluster, we aren't actively working on it.

The reason I say the fix is specific to the volume driver is that, after a node/kubelet restart, volume reconstruction from disk depends on the plugin type, and that is the logic that needs tightening so the volumes can be cleaned up. If this re-occurs, we would appreciate a new bug naming the specific volume driver that is causing the problem.