Description of problem:

journalctl shows an orphaned pod:

Aug 28 10:33:51 sxxxxx atomic-openshift-node[14947]: E0828 10:33:51.246241 14947 kubelet_volumes.go:140] Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found, but volume paths are still present on disk: There were a total of 15 errors similar to this. Turn up verbosity to see them.

It is not clear why the issue started: there was no recent node outage, although the docker daemon on the node does usually hang. No persistent storage is used. This was supposed to be fixed in 3.10: https://access.redhat.com/errata/RHBA-2018:1816

Version-Release number of selected component (if applicable):
3.11.104
docker 1.13.1-96

There is a workaround (https://access.redhat.com/solutions/3237211):

rm -rf /var/lib/origin/openshift.local.volumes/pods/<pod-id>/
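The workaround above can be sketched as a small script. This is only an illustration, not part of the linked solution: the pod UID is the one from the log message, the volume root is the OpenShift 3.11 default and may differ on your node, and the actual rm is left commented out until you have verified the pod no longer exists in the cluster.

```shell
#!/bin/sh
# Sketch of the workaround (https://access.redhat.com/solutions/3237211).
# POD_UID is the UID from the "Orphaned pod" log line; VOLUME_ROOT is the
# 3.11 default and is an assumption -- adjust for a non-default --root-dir.
POD_UID="${POD_UID:-0f59ef88-b4c1-11e9-afd0-00505680343f}"
VOLUME_ROOT="${VOLUME_ROOT:-/var/lib/origin/openshift.local.volumes/pods}"

POD_DIR="$VOLUME_ROOT/$POD_UID"
if [ -d "$POD_DIR" ]; then
  echo "Orphaned directory found: $POD_DIR"
  # Uncomment only after confirming the pod is gone from the cluster:
  # rm -rf "$POD_DIR"
else
  echo "No directory for pod $POD_UID under $VOLUME_ROOT"
fi
```

Restarting the atomic-openshift-node service afterwards lets the kubelet re-scan the directory and stop logging the error.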
The attached logs contain "Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found", however, they do not contain any details about how the pod was deleted or why its volumes are orphaned.

To get rid of the message in 3.11, it should be enough to restart the atomic-openshift-node service to clean up the orphans. If that does not work, deleting the orphans manually (https://access.redhat.com/solutions/3237211) will definitely help.

To find the root cause, we need to know:

1. What happened to the pod when it was deleted (was the node rebooted? was the openshift-node service restarted?). Node logs from when 'Orphaned pod XYZ' appeared for the first time, plus 10-20 minutes before that, would be best.

2. *What* volume was orphaned. A simple "find /var/lib/origin/openshift.local.volumes/pods/0f59ef88-b4c1-11e9-afd0-00505680343f" is a good start. Not all volumes that Kubernetes uses are persistent; for example, each pod gets a Secret volume with an API token:

/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes/kubernetes.io~secret
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2/ready

With that listing, we can see that a secret volume was orphaned and we know where to look.
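To run that inspection over every pod directory still on disk at once, a minimal sketch (assuming the default 3.11 volume root; adjust for a non-default --root-dir) that prints, per pod UID, which volume plugin types are present:

```shell
#!/bin/sh
# For each pod directory left on disk, list the volume plugin types
# (kubernetes.io~secret, kubernetes.io~nfs, ...) found under volumes/.
# PODS_DIR is the OpenShift 3.11 default path and is an assumption.
PODS_DIR="${PODS_DIR:-/var/lib/origin/openshift.local.volumes/pods}"

for pod in "$PODS_DIR"/*; do
  [ -d "$pod" ] || continue
  # Plugin directories sit one level under volumes/; strip the path prefix.
  plugins=$(find "$pod/volumes" -mindepth 1 -maxdepth 1 -type d 2>/dev/null |
            sed 's|.*/||' | tr '\n' ' ')
  echo "$(basename "$pod"): ${plugins:-no volume plugin dirs}"
done
```

Pod UIDs that show only kubernetes.io~secret or kubernetes.io~empty-dir entries are the non-persistent leftovers described above; anything else points at the volume plugin to investigate.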
It looks like this is an issue with Ceph CSI. I am not sure whether a patch will be backported to 3.11. I'm reassigning this issue to Storage to track. Issue: https://github.com/kubernetes/kubernetes/issues/60987#issuecomment-638750828 Potential upstream PR: https://github.com/ceph/ceph-csi/pull/1134
I am closing this bug since I haven't heard any update from the reporter. The storage team is aware that this issue is not 100% fixed for certain drivers, but since it does not affect cluster stability, we aren't actively working on it. The reason I say the fix is specific to the volume driver is that after a node/kubelet restart, volume reconstruction from disk depends on the plugin type, and that is the logic that needs tightening so that volumes can be cleaned up. If this re-occurs, we would appreciate a new bug naming the specific volume driver that is causing the problem.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.