Bug 1753160 - [3.11] Pods in orphaned state
Summary: [3.11] Pods in orphaned state
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.11.0
Hardware: All
OS: All
high
low
Target Milestone: ---
Target Release: 3.11.z
Assignee: Hemant Kumar
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-18 09:48 UTC by Camino Noguera
Modified: 2024-02-04 04:25 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-23 16:28:20 UTC
Target Upstream Version:
Embargoed:
mnoguera: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1558600 0 unspecified CLOSED excessive amount of 'orphaned path' messages on openshift online nodes. 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1583707 0 unspecified CLOSED Orphaned pod found, but volumes not yet removed on OCP 3.7 2023-03-24 14:06:10 UTC
Red Hat Knowledge Base (Solution) 3237211 0 None None None 2019-09-18 10:00:41 UTC

Description Camino Noguera 2019-09-18 09:48:27 UTC
Description of problem:
journalctl shows an orphaned pod:


Aug 28 10:33:51 sxxxxx atomic-openshift-node[14947]: E0828 10:33:51.246241   14947 kubelet_volumes.go:140] Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found, but volume paths are still present on disk: There were a total of 15 errors similar to this. Turn up verbosity to see them.
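
A rough way to enumerate every pod UID the kubelet currently flags this way (assuming the same message format as above) is something like:

journalctl -u atomic-openshift-node -b | grep -o 'Orphaned pod "[0-9a-f-]*"' | sort -u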

It is not clear why the issue started, because there was no recent node outage, but they are having an issue with the docker daemon on the node: it usually hangs.

 
No persistent storage used.
 
This was supposedly solved in 3.10:
https://access.redhat.com/errata/RHBA-2018:1816


Version-Release number of selected component (if applicable):
3.11.104
docker version 1.13.1-96

There is a workaround:

https://access.redhat.com/solutions/3237211

rm -rf /var/lib/origin/openshift.local.volumes/pods/<pod-id>/
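
If several pods are affected, the same cleanup can be applied per UID; this is only a sketch (the <pod-id> values are the orphaned UIDs from the journal, and the pods should be confirmed gone from the cluster before deleting anything):

for uid in <pod-id> <pod-id>; do
    # remove the leftover volume directory for each confirmed-orphaned pod UID
    rm -rf "/var/lib/origin/openshift.local.volumes/pods/${uid}/"
done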

Comment 2 Jan Safranek 2019-09-23 12:03:56 UTC
The attached logs contain "Orphaned pod "0f59ef88-b4c1-11e9-afd0-00505680343f" found"; however, they do not contain any details about how the pod was deleted or why its volumes are orphaned.

To get rid of the message in 3.11, it should be enough to restart the atomic-openshift-node service to clean up the orphans. If that does not work, deleting the orphans manually (https://access.redhat.com/solutions/3237211) will definitely help.
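
For completeness, assuming the node service is managed by systemd as in the log excerpt above, that restart is typically just:

systemctl restart atomic-openshift-node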

To find the root cause, we need to know:

1. What happened to the pod when it was deleted (was the node rebooted? was the openshift-node service restarted?). Node logs from when 'Orphaned pod XYZ' appeared for the first time, plus the 10-20 minutes before, would be best.

2. *What* volume was orphaned. A simple "find /var/lib/origin/openshift.local.volumes/pods/0f59ef88-b4c1-11e9-afd0-00505680343f" is a good start. Not all volumes that Kubernetes uses are persistent; for example, each pod gets a Secret volume with an API token:

/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/volumes/kubernetes.io~secret
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2
/var/lib/origin/openshift.local.volumes/pods/ca679063-2bcf-4c60-9bae-6ccbc44c2567/plugins/kubernetes.io~empty-dir/wrapped_default-token-tpsl2/ready

With that listing, we can see that a secret volume was orphaned and we know where to look.

Comment 38 Ryan Phillips 2020-06-05 18:34:39 UTC
It looks like this is an issue with Ceph CSI. I am not sure if a patch will be backported into 3.11. I'm going to reassign this issue to storage to track.

Issue: https://github.com/kubernetes/kubernetes/issues/60987#issuecomment-638750828
Potential Upstream PR: https://github.com/ceph/ceph-csi/pull/1134

Comment 40 Hemant Kumar 2020-06-23 16:28:20 UTC
I am closing this bug since I haven't heard any update from the reporter. The storage team is aware that this issue is not 100% fixed for certain drivers, but since it does not affect the stability of a cluster, we aren't actively working on it.

The reason I say the fix is specific to the volume driver is that, after a node/kubelet restart, volume reconstruction from disk depends on the plugin type, and that is the logic that needs tightening so that volumes can be cleaned up. If this re-occurs, we would appreciate a new bug naming the specific volume driver that is causing the problem.

Comment 41 Red Hat Bugzilla 2024-02-04 04:25:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

